FORMAT: 1A
HOST: https://api.captaindata.co/v1

# Captain Data

Welcome to the Captain Data API, which lets you view spiders and jobs and schedule spiders.

If you're having any trouble using this API, send us an email at [support@captaindata.co](mailto:support@captaindata.co).
The [developers portal](https://developers.captaindata.co) also contains additional information.

# Authentication

Authorization is straightforward: add the `key` query parameter to every request URL.

> Never share your credentials!

Every example below takes this `key` parameter into account.

To access your project's UID and your API key, head over to [Your Project's Settings](https://app.captaindata.co/settings), where you can copy and paste the appropriate values.

# Group Spiders

## Spiders Collection [/{project_uid}/spiders{?key}]

+ Parameters
    + key (string) - Your API key
    + project_uid (string) - Your project UID

### List All Spiders [GET]

This endpoint lists all the spiders you created.

+ Response 200 (application/json)
    + Attributes (array[Spider])

+ Response 400 (application/json)

        {
            "error": "The project uid you provided is not UUID."
        }

+ Response 404 (application/json)

        {
            "error": "Spider not found."
        }

## Spiders Scheduling [/{project_uid}/spiders/{spider_uid}/schedule{?key}]

+ Parameters
    + key (string) - Your API key
    + project_uid (string) - Your project UID
    + spider_uid (string) - The specific spider you want to schedule

### Schedule A Spider [POST]

You may schedule a specific spider using this endpoint.

+ Request (application/json)

        {
            "start_urls": ["url.com", "url2.com"],
            "input_parameters": {"key": "value"}
        }

+ Response 200 (application/json)
    + Attributes (ScheduleResponse)

+ Response 400 (application/json)

        {
            "error": "start_urls must be a list OR You must supply a list of start_urls (even empty)."
        }

+ Response 404 (application/json)

        {
            "error": "Spider not found."
        }

## Start URLs Upload [/{project_uid}/spiders/{spider_uid}/upload{?key}]

+ Parameters
    + key (string) - Your API key
    + project_uid (string) - Your project UID
    + spider_uid (string) - The specific spider you want to upload start URLs for

### Upload Start URLs [POST]

You may upload a CSV file containing start URLs.

+ Request

    + Headers

            Content-Type: multipart/form-data; boundary=BOUNDARY

    + Body

            --BOUNDARY
            Content-Disposition: form-data; name="csvFile"

+ Response 200 (application/json)

        {
            "message": "CSV uploaded."
        }

+ Response 400 (application/json)

        {
            "error": "No file part. (OR) No file selected."
        }

+ Response 404 (application/json)

        {
            "error": "Can't find this spider."
        }

# Group Jobs

## Jobs Collection [/{project_uid}/spiders/{spider_uid}/jobs{?key}]

+ Parameters
    + key (string) - Your API key
    + project_uid (string) - Your project UID
    + spider_uid (string) - The specific spider whose jobs you want to list

### List All Jobs [GET]

Lists all the jobs available for a specific spider.

+ Response 200 (application/json)
    + Attributes (array[Job])

+ Response 404 (application/json)

        {
            "error": "Spider not found."
        }

+ Response 400 (application/json)

        {
            "error": "Might be a specific error. Contact us if it persists."
        }

## Job Results [/{project_uid}/jobs/{job_uid}/results{?key}]

+ Parameters
    + key (string) - Your API key
    + project_uid (string) - Your project UID
    + job_uid (string) - The specific job you want to retrieve

### Get Job Results [GET]

This endpoint returns all results for a specific job.

+ Response 200 (application/json)
    + Attributes (Pagination)

+ Response 404 (application/json)

        {
            "error": "Job not found."
        }

+ Response 400 (application/json)

        {
            "error": "Might be a specific error. Contact us if it persists."
        }

## Spider Last Results [/{project_uid}/spiders/{spider_uid}/results/last{?key}]

+ Parameters
    + key (string) - Your API key
    + project_uid (string) - Your project UID
    + spider_uid (string) - The specific spider whose last job results you want to retrieve

### Get Last Spider Job Results [GET]

This endpoint returns the results of the last finished job for a specific spider.

+ Response 200 (application/json)
    + Attributes (Pagination)

+ Response 404 (application/json)

        {
            "error": "Last spider job not found. Make sure you run the spider once and that it finishes."
        }

+ Response 400 (application/json)

        {
            "error": "Might be a specific error. Contact us if it persists."
        }

## Last Results [/{project_uid}/spiders/results/last{?key}]

+ Parameters
    + key (string) - Your API key
    + project_uid (string) - Your project UID

### Get Last Job Results [GET]

This endpoint returns the last finished job results for all your spiders.

+ Response 200 (application/json)
    + Attributes (Job)

+ Response 404 (application/json)

        {
            "error": "Project not found."
        }

# Group Errors

Common errors.

* 200 `OK` - the request was successful (some API calls may return 201 instead).
* 400 `Bad Request` - the request could not be understood or was missing required parameters.
* 401 `Unauthorized` - authentication failed or the user doesn't have permissions for the requested operation.
* 403 `Forbidden` - access denied.
* 404 `Not Found` - the resource was not found.
* 405 `Method Not Allowed` - the requested method is not supported for the resource.

# Data Structures

## Spider (object)

+ uid: `c4d22d2a-b3a9-4346-8a31-a0b466e3fc43`
+ created_at: Tue, 11 Sep (string)
+ name: Great Bot (string)
+ permalink: great-bot (string)
+ description: Great bot that does a lot of automation (string)
+ jobs (array[Job]) - Jobs linked to this spider.
+ schemas (array[Schema]) - The output schemas linked to the spider.

## Schema (object)

+ name: Generic Schema (string)
+ schema (JSON Schema)

## Job (object)

+ uid: `c4d22d2a-b3a9-4346-8a31-a0b466e3fc43`
+ spider_permalink: great-bot (string)
+ status: finished (string)
+ finish_reason: finished (string)
+ downloader_request_count: 742 (number)
+ downloader_response_count: 742 (number)
+ finish_time (Date)
+ start_time (Date)
+ is_scheduled: false (boolean)
+ item_scraped_count: 147 (number)
+ log_count_ERROR: 0 (number)
+ log_count_debug: 45035 (number)
+ log_count_info: 13057 (number)

## JSON Schema (object)

+ key: value
+ other_key: value

## Date (object)

+ $date: 1540167197906

## Pagination (object)

+ items_count: 1 (number)
+ limit: 1000 (number)
+ pages: 1 (number)
+ paging (Paging)
+ results (array[Job])

## Paging (object)

+ next: https://api.captaindata.co/v1/uuid/jobs/uuid/results?page=2&key=uuid (string)
+ previous: null

## ScheduleResponse (object)

+ job_uid: `c4d22d2a-b3a9-4346-8a31-a0b466e3fc43`
+ message: Spider successfully scheduled. (string)
+ spider (Spider)

## StartURLs (object)

+ start_urls: url1, url2 (array, required)
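
To make the `key` query parameter, the scheduling request body, and the `Pagination`/`Paging` structures more concrete, here is a minimal Python sketch using only the standard library. It is an illustration under assumptions, not an official client: the helper names (`build_url`, `fetch_json`, `schedule_spider`, `iter_job_results`) are hypothetical, and the sketch assumes that `paging.next` is `null` on the last page and that the `next` URL already carries the `key` parameter, as the `Paging` example suggests.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request, urlopen

BASE_URL = "https://api.captaindata.co/v1"  # HOST from this document


def build_url(key, *segments):
    """Join path segments under BASE_URL and append the `key` query parameter."""
    path = "/".join(quote(str(segment), safe="") for segment in segments)
    return f"{BASE_URL}/{path}?{urlencode({'key': key})}"


def fetch_json(url, payload=None):
    """GET the URL, or POST `payload` as JSON when one is given.

    urllib raises urllib.error.HTTPError for 4xx/5xx responses.
    """
    data = None
    headers = {}
    if payload is not None:
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    request = Request(url, data=data, headers=headers)  # POST when data is set
    with urlopen(request) as response:
        return json.load(response)


def schedule_spider(project_uid, spider_uid, key, start_urls,
                    input_parameters=None, fetch=fetch_json):
    """Call the Schedule A Spider endpoint with the documented request body."""
    url = build_url(key, project_uid, "spiders", spider_uid, "schedule")
    payload = {
        "start_urls": list(start_urls),
        "input_parameters": input_parameters or {},
    }
    return fetch(url, payload)


def iter_job_results(project_uid, job_uid, key, fetch=fetch_json):
    """Yield every entry of `results`, following `paging.next` page by page.

    Assumption: `paging.next` is null (None) once the last page is reached.
    """
    url = build_url(key, project_uid, "jobs", job_uid, "results")
    while url:
        page = fetch(url)
        yield from page.get("results", [])
        url = (page.get("paging") or {}).get("next")
```

With real credentials, `schedule_spider(project_uid, spider_uid, key, ["url.com", "url2.com"])` would send the request body shown under Schedule A Spider, and `list(iter_job_results(project_uid, job_uid, key))` would collect the results across all pages. The `fetch` argument exists only so the network call can be swapped out, for example in tests.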