Bulk CSV Upload via AWS S3

For high-volume users, we offer a bulk CSV upload option utilising AWS S3.

Note: bulk upload is only available on the Delayed queue and is a whitelisted feature (you will need to request access first).

Prerequisites

First, create a new IAM user under your own AWS account. Once the user is created, attach the AmazonS3FullAccess managed policy to it so that it can interact with S3 buckets.

We strongly recommend you create a unique user with no other permissions for use solely with this feature.

Make a note of the User ARN; it will look something like the following:

arn:aws:iam::[your-AWS-id]:user/user_name
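
If you prefer to script this step, the sketch below uses boto3 (an assumption; the AWS console works just as well) run under credentials that can manage IAM. The user name bulk-csv-uploader is illustrative:

import boto3

iam = boto3.client("iam")

# Create a dedicated user used solely for bulk uploads.
user = iam.create_user(UserName="bulk-csv-uploader")

# Attach the AmazonS3FullAccess managed policy.
iam.attach_user_policy(
    UserName="bulk-csv-uploader",
    PolicyArn="arn:aws:iam::aws:policy/AmazonS3FullAccess",
)

# The User ARN to quote in your support ticket.
print(user["User"]["Arn"])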

Open a support ticket with your User ARN and you will be notified when access has been granted. The upload bucket to use in your requests is delayed-bulk.

Note: the upload bucket is a RequesterPays bucket, so you must include the x-amz-request-payer header with your requests. You must also upload objects with the bucket-owner-full-control ACL.
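
As a minimal sketch of an upload with these settings, assuming boto3 configured with the credentials of the IAM user above (the object key shown is a placeholder; naming rules follow in the next section):

import boto3

s3 = boto3.client("s3")  # credentials of the dedicated IAM user

object_key = "in/your-api-key_20240815_example-id.csv"  # placeholder; see File Naming Structure

# RequestPayer sends the x-amz-request-payer header; the ACL grants the
# bucket owner full control of the uploaded object.
with open("upload.csv", "rb") as f:
    s3.put_object(
        Bucket="delayed-bulk",
        Key=object_key,
        Body=f,
        RequestPayer="requester",
        ACL="bucket-owner-full-control",
    )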

File Naming Structure

Uploaded files must adhere to a strict format in order to be processed (note the use of the "in/" folder):

in/[your-API-key]_YYYYMMDD_[your-unique-identifier].csv

We recommend using something unpredictable as the unique identifier for each upload; it may contain alphanumerics and hyphens only. YYYYMMDD must correspond to the current date, or the file will go unprocessed.
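
For example, a compliant key could be generated as follows (a sketch; it assumes the current date is evaluated in UTC, and uses a random UUID, which contains only alphanumerics and hyphens, as the unpredictable identifier):

from datetime import datetime, timezone
from uuid import uuid4

API_KEY = "your-api-key"  # placeholder

date = datetime.now(timezone.utc).strftime("%Y%m%d")
object_key = f"in/{API_KEY}_{date}_{uuid4()}.csv"
print(object_key)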

CSV Input Format

Each row is a comma-separated list of the following fields: keyword, engine_code, location, device, customer_id.

  • keyword and engine_code are the only required fields.
  • customer_id is your internal request ID, and will be returned in the output CSV.

For example:

"first query",google_en-us,,
"pizza delivery",google_en-us,90210,mobile

Note: keywords should be UTF-8 encoded and must contain no line breaks. If they contain non-CSV-safe characters, they should also be URL-encoded. Each CSV upload can contain a maximum of 100,000 rows.
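
A sketch of building a compliant input file with Python's csv module; the rows and the customer_id value are illustrative, and URL-encoding is applied only to keywords containing non-CSV-safe characters:

import csv
from urllib.parse import quote

def safe_keyword(keyword):
    # URL-encode only keywords containing non-CSV-safe characters.
    if any(c in keyword for c in ',"\r\n'):
        return quote(keyword)
    return keyword

rows = [
    ("first query", "google_en-us", "", "", ""),
    ("pizza delivery", "google_en-us", "90210", "mobile", "order-42"),
]

with open("upload.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    for keyword, engine_code, location, device, customer_id in rows:
        writer.writerow([safe_keyword(keyword), engine_code, location, device, customer_id])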

CSV Output Format

Once processed, a corresponding CSV file will be created in the "out/" folder of the same bucket:

out/[your-API-key]_YYYYMMDD_[your-unique-identifier].csv
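
One way to wait for and fetch the result is to poll for the output key (a sketch: the polling interval is arbitrary, the key mirrors the one you uploaded, and RequestPayer is still required because the bucket is RequesterPays):

import time
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
out_key = "out/your-api-key_20240815_example-id.csv"  # mirrors your in/ key

# Poll until the processed file appears, then download it.
while True:
    try:
        s3.head_object(Bucket="delayed-bulk", Key=out_key, RequestPayer="requester")
        break
    except ClientError as err:
        if err.response["Error"]["Code"] != "404":
            raise
        time.sleep(300)  # arbitrary polling interval

s3.download_file(
    "delayed-bulk", out_key, "results.csv",
    ExtraArgs={"RequestPayer": "requester"},
)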

Each row will have the following format:

keyword, engine_code, location, device, status, delayed_id, modified_keyword, check_id, customer_id

status will contain ok if the row was accepted, or error if it was not.

delayed_id will contain a unique delayed queue ID on success, or an error message on failure.

modified_keyword will be empty unless the keyword string has been modified before the SERP was fetched.

check_id will contain a unique check ID where the SERP crawl was successful, or failed where it was not.

Example output

"cheap   holidays",google_en-us,,,ok,50185500ebe44205ca000004,"cheap holidays",50185500ebe44205cb000102
"flights to nyc",,,,error,"Required field 'engine_code' missing",,

In this example, the superfluous whitespace in row 1 was removed before the query was processed. Row 2 failed because no engine_code was present.
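
A sketch of parsing the output with Python's csv module; as the examples above omit trailing empty fields, rows are padded to nine columns before unpacking (the file name assumes the download step shown earlier):

import csv

with open("results.csv", newline="", encoding="utf-8") as f:
    for row in csv.reader(f):
        # Trailing empty fields may be omitted; pad to nine columns.
        row += [""] * (9 - len(row))
        (keyword, engine_code, location, device, status,
         delayed_id, modified_keyword, check_id, customer_id) = row
        if status == "ok":
            print(f"{keyword}: queued as {delayed_id}")
        else:
            print(f"{keyword}: rejected ({delayed_id})")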

Important:
  • Input CSVs should not contain header rows; output CSVs will not contain them.
  • Due to the way we split and parallelize processing of uploads, the output will not necessarily be in the same row order as the upload.
  • As we retry failed requests on the delayed queue up to 3 times, a keyword that fails repeatedly can significantly delay the availability of the final processed CSV. We requeue a failed keyword no sooner than 30 minutes after the previous attempt.
  • Aim to fill each CSV with the full 100,000 rows whenever possible. Observe a soft minimum of 5,000 rows, as small uploads will cause significant delays in fulfilment on your account.
  • Should we detect that a large proportion of your uploads fall below the 5,000-row minimum, we may begin to programmatically enforce it on your account.
