How to use Spotless Data?

Jobs

A job is the basic unit of work in Spotless. Every time you submit a file, a job is created: the file is uploaded to Spotless and processed. Processing follows the workflow shown to the right.

  1. A job is submitted for processing.
  2. Each field is checked against the rule specified in the plan.
  3. If the file is already clean, it is automatically sent back to the user and the job is completed.
  4. If the file is not clean, then it is passed for automatic cleansing.
  5. After the automatic cleansing, the file is checked again.
  6. If the file is now clean, then it is sent back to the user and the job is completed.
  7. If it is not clean, then the fallback options for each rule in the plan are applied and the file is sent back to the user. The job, however, is kept open.
  8. At this stage, the job is marked for escalation to the Spotless data science team, who will review the results, implement improvements in the Spotless algorithm and re-run automatic cleansing until the file has become spotless.
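The steps above can be sketched as a small function. This is only an illustration of the control flow, not Spotless's implementation: run_workflow and the three callables passed to it (is_clean, auto_cleanse, apply_fallbacks) are hypothetical names invented for this sketch.

```python
def run_workflow(data, is_clean, auto_cleanse, apply_fallbacks):
    """Return (data, processing_complete, escalated) for one submitted file.

    All three callables are hypothetical stand-ins for the plan's rules.
    """
    # Steps 2-3: check every field; a clean file goes straight back.
    if is_clean(data):
        return data, True, False
    # Steps 4-6: cleanse automatically, then check again.
    data = auto_cleanse(data)
    if is_clean(data):
        return data, True, False
    # Steps 7-8: apply fallbacks, return the file, keep the job open, escalate.
    return apply_fallbacks(data), False, True


# Toy rules: every value must be a positive integer written in digits only.
def toy_is_clean(rows):
    return all(r.isdigit() for r in rows)

def toy_cleanse(rows):
    return [r.strip() for r in rows]

def toy_fallback(rows):
    return [r if r.isdigit() else "" for r in rows]

fixed = run_workflow(["1", " 2"], toy_is_clean, toy_cleanse, toy_fallback)
escalated = run_workflow(["42", " 65 ", "12b4"], toy_is_clean, toy_cleanse, toy_fallback)
print(fixed)
print(escalated)
```

In the second call the value 12b4 cannot be cleansed automatically, so the fallback is applied, the job stays open and it is flagged for escalation.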

The user will receive an email every time a new clean file is available. API clients should additionally check whether the job is marked with processing_complete when they read its status. If processing_complete is not set to true, the client should call the API again at intervals to check the status of the job and pick up any updated file.
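A minimal polling sketch for API clients follows. The helper wait_for_completion and its parameters are assumptions made for illustration; fetch_job is a hypothetical callable that returns the job as a dict, for example by sending a GET request to the job URL. The ten-minute default interval is likewise an arbitrary choice, not a documented requirement.

```python
import time


def wait_for_completion(fetch_job, interval_seconds=600, max_checks=None,
                        sleep=time.sleep):
    """Call fetch_job() until the job reports processing_complete.

    fetch_job is a hypothetical callable returning the job as a dict.
    If max_checks is given, stop after that many status checks and
    return the last job seen, even if it is still incomplete.
    """
    checks = 1
    job = fetch_job()
    while not job.get("processing_complete"):
        if max_checks is not None and checks >= max_checks:
            break  # give up for now; the caller can resume later
        sleep(interval_seconds)
        job = fetch_job()
        checks += 1
    return job


# Demonstration with canned responses instead of real API calls.
canned = iter([
    {"processing_complete": False},
    {"processing_complete": False},
    {"processing_complete": True},
])
waits = []
final_job = wait_for_completion(lambda: next(canned), interval_seconds=5,
                                sleep=waits.append)
print(final_job, waits)
```

Passing the sleep function as a parameter keeps the helper testable without actually waiting.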

When the job is complete, the client should set the can_delete field to true. At that stage, all data associated with the job will be deleted from Spotless' servers.

 

Job Fields

 

Fields submitted with a job are:

  • plan - this is the id of the plan that the file should be cleansed with
  • original_file - this is the file for cleansing. The CSV file should be uploaded as part of the request that creates the job (the example under Creating Jobs sends it as multipart form data)

When the job is returned it includes the following additional fields:

  • processed_file - a link to download the processed file
  • processing_complete - set to true if the processing is complete and false if further processing is still in progress
  • can_delete - this is initially set to false. Once the job is completed and all its files can be deleted, the client should PUT an update setting it to true
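As a rough illustration of how a client might read these fields together, the sketch below maps a returned job to the next step. The function next_action and the wording of its return values are hypothetical; only the field names come from the list above.

```python
def next_action(job):
    """Suggest what a client should do next, based on the returned job fields."""
    if job["can_delete"]:
        return "nothing: job is flagged for deletion"
    if job["processing_complete"]:
        return "download processed_file, then PUT can_delete=true"
    # Processing is still in progress, so a newer file may appear later.
    return "download interim processed_file and check again later"


print(next_action({"processing_complete": False, "can_delete": False}))
print(next_action({"processing_complete": True, "can_delete": False}))
```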

 

Example Jobs

 

An example job is the one created on the getting started page.

This job returns JSON as follows:

{
    "profile": "https://spotlessdata.com/api/plans/47381c9a-4363-4964-93d0-0871d257f2d8/",
    "original_file": "https://spotlessdata.com/uploads/original/source/ip_list_Utrd5TD.csv",
    "can_delete": false,
    "url": "https://spotlessdata.com/api/jobs/bf386633-84ae-4e47-a2f5-547d95250b88/",
    "processed_file": "http://spotlessdata.com/uploads/complete/dirty/bf386633-84ae-4e47-a2f5-547d95250b88-cleaned.csv",
    "processing_complete": true
}
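The response above can be read with the standard json module. The sketch below is a hedged illustration: it assumes the job id is the last path segment of the url field, which holds for this example but is not stated as a guarantee anywhere in this page.

```python
import json
from urllib.parse import urlparse

example = """
{
    "profile": "https://spotlessdata.com/api/plans/47381c9a-4363-4964-93d0-0871d257f2d8/",
    "original_file": "https://spotlessdata.com/uploads/original/source/ip_list_Utrd5TD.csv",
    "can_delete": false,
    "url": "https://spotlessdata.com/api/jobs/bf386633-84ae-4e47-a2f5-547d95250b88/",
    "processed_file": "http://spotlessdata.com/uploads/complete/dirty/bf386633-84ae-4e47-a2f5-547d95250b88-cleaned.csv",
    "processing_complete": true
}
"""

job = json.loads(example)

# Assumption: the job id is the final path segment of the job URL.
job_id = urlparse(job["url"]).path.rstrip("/").split("/")[-1]

# Processing has finished, but the client has not yet released the data.
ready_to_release = job["processing_complete"] and not job["can_delete"]

print(job_id, ready_to_release)
```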

 

Creating Jobs

 

The following example extends the getting started code to fully process the job:

import time

import requests

# create a UTF-8 encoded file with the IP list
# note: "12b4" and "192.16.8.0.2" are invalid values for the rules below
with open("ip_list.csv", "wt") as f:
    f.write(u"192.168.0.1,12b4\n")
    f.write(u"192.16.8.0.2,65\n")
    f.write(u"192.168.0.3,42\n")

token = "<YOUR TOKEN GOES HERE>"
headers = {"Authorization": "Token " + token}

# create a plan to validate
response = requests.post(
    "https://spotlessdata.com/api/plans/",
    headers=headers,
    json={
        "name": "my first spotless plan",
        "rules": [
            {"rule": "ip-address-validator-regex-rule-c0e8fc58-e6c4-11e5-9730-9a79f06e9478",
             "source_field": 0},
            {"rule": "positive-integer-validator-regex-rule-c0e8e0ce-e6c4-11e5-9730-9a79f06e9478",
             "source_field": 1}
        ]
    }
)
plan = response.json()["id"]

# create a job, uploading the CSV file with the request
with open("ip_list.csv", "rb") as f:
    response = requests.post(
        "https://spotlessdata.com/api/jobs/",
        headers=headers,
        data={"plan": plan},
        files={"original_file": f}
    )

# show the job as first returned by the API
print(response.text)

job = response.json()
file = requests.get(job["processed_file"])
# do something with the file here...

# poll until processing is complete, fetching each updated file
while not job["processing_complete"]:
    time.sleep(600)  # wait ten minutes between status checks
    job = requests.get(job["url"], headers=headers).json()
    file = requests.get(job["processed_file"])
    # do something with the file here...

# tell Spotless that the job's data can now be deleted
requests.put(
    job["url"],
    headers=headers,
    data={"can_delete": True},
)