How to use Spotless Data?

Getting Started with the API

You can get started by signing up for an account here. You can log in with Facebook or GitHub. Once you have logged in, you can start to browse the RESTful API here.

At the top of the page you will see your API key. This key is unique to you and is needed to make any calls to the RESTful API.

If you are a Python developer you may want to use our Python client; otherwise, read on for the in-depth details.

To start with, let’s consider a simple task of validating a list of IP addresses that have a number of requests associated with them:

IP Address, Number Requests
192.168.0.1, 124
192.168.0.2, 65
192.168.0.3, 42

To validate that this file is well-formed, we need to do two checks:

  1. Check that the first column is a well-formed IP address
  2. Check that the second column is a positive integer
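
Before sending anything to the API, the two checks can be sketched locally. This is only an illustration using Python's standard library, not the regex rules that Spotless itself applies; the helper names is_valid_ip and is_positive_integer are hypothetical.

```python
import ipaddress
import re


def is_valid_ip(value):
    """Return True if value parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(value.strip())
        return True
    except ValueError:
        return False


def is_positive_integer(value):
    """Return True if value is a base-10 integer greater than zero."""
    text = value.strip()
    return re.fullmatch(r"[0-9]+", text) is not None and int(text) > 0


# Sample rows: the second has a malformed address, the third a malformed count.
rows = [("192.168.0.1", "124"), ("192.16.8.0.2", "65"), ("192.168.0.3", "12b4")]
for ip, count in rows:
    print(ip, is_valid_ip(ip), count, is_positive_integer(count))
```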

Luckily, rules to check these are included in the Spotless database. You can see the most popular rules here, and this list typically includes the two rules we want to use.

In order to use these rules, we need to create a plan that will validate our file. In this case we want to apply the IP address validation rule to column 0, and the positive integer validator to column 1. Note that for CSV files, the fields are specified by the column number, where the first column is column 0.

To create this plan, we need to POST the following JSON structure to the API:

{
    "name": "my first spotless plan",
    "rules": [
        {"rule": "ip-address-validator-regex-rule-c0e8fc58-e6c4-11e5-9730-9a79f06e9478",
         "source_field": 0},
        {"rule": "positive-integer-validator-regex-rule-c0e8e0ce-e6c4-11e5-9730-9a79f06e9478",
         "source_field": 1}]
}

Let’s consider each of the fields in this JSON object in turn:

  1. The name is simply used for humans to identify the plan; once the plan is created it will be given a unique URL to reference it
  2. The rules are a list of which rules should be applied, and which field in the source data they should be applied to
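
The same payload can also be assembled programmatically. The sketch below uses a hypothetical helper, build_plan, that maps zero-based column numbers to rule identifiers; the field names name, rules, rule, and source_field are taken from the JSON above, and everything else is an assumption made for illustration.

```python
import json

IP_RULE = "ip-address-validator-regex-rule-c0e8fc58-e6c4-11e5-9730-9a79f06e9478"
INT_RULE = "positive-integer-validator-regex-rule-c0e8e0ce-e6c4-11e5-9730-9a79f06e9478"


def build_plan(name, column_rules):
    """Build a plan payload from a {column_number: rule_id} mapping (hypothetical helper)."""
    return {
        "name": name,
        "rules": [
            {"rule": rule_id, "source_field": column}
            for column, rule_id in sorted(column_rules.items())
        ],
    }


# Column 0 gets the IP address rule, column 1 the positive integer rule.
payload = build_plan("my first spotless plan", {0: IP_RULE, 1: INT_RULE})
print(json.dumps(payload, indent=4))
```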

We create the plan by posting to the API, as shown below using Python code:

import requests

token = "<YOUR TOKEN GOES HERE>"

response = requests.post(
    "https://spotlessdata.com/api/plans/",
    headers={"Authorization": "Token " + token},
    json={
        "name": "my first spotless plan",
        "rules": [
            {"rule": "ip-address-validator-regex-rule-c0e8fc58-e6c4-11e5-9730-9a79f06e9478",
             "source_field": 0},
            {"rule": "positive-integer-validator-regex-rule-c0e8e0ce-e6c4-11e5-9730-9a79f06e9478",
             "source_field": 1}
        ]
    }
)

This will return a JSON response that includes your plan. It will look something like this:

{
  "name": "my first spotless plan",
  "id": "my-first-spotless-plan-de650d75-df3d-4df0-882f-7339ab2c59f6",
  "url": "https://spotlessdata.com/api/plans/de650d75-df3d-4df0-882f-7339ab2c59f6/",
  "rules": [
    {
      "rule": "ip-address-validator-regex-rule-c0e8fc58-e6c4-11e5-9730-9a79f06e9478",
      "name": "IP address validator",
      "source_field": 0
    },
    {
      "rule": "positive-integer-validator-regex-rule-c0e8e0ce-e6c4-11e5-9730-9a79f06e9478",
      "name": "Integer Validator",
      "source_field": 1
    }
  ]
}
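
The identifier needed for later calls is the id field of this response. As a minimal sketch, the snippet below parses an abbreviated copy of the response body with the standard json module; extract_plan_id is a hypothetical helper, and with the requests library the equivalent would be response.json()["id"].

```python
import json

# Abbreviated copy of the plan response; the rules list is omitted for brevity.
body = """{
  "name": "my first spotless plan",
  "id": "my-first-spotless-plan-de650d75-df3d-4df0-882f-7339ab2c59f6",
  "url": "https://spotlessdata.com/api/plans/de650d75-df3d-4df0-882f-7339ab2c59f6/",
  "rules": []
}"""


def extract_plan_id(response_body):
    """Return the plan id from a JSON response body (hypothetical helper)."""
    data = json.loads(response_body)
    if "id" not in data:
        raise KeyError("response does not contain an 'id' field")
    return data["id"]


plan = extract_plan_id(body)
print(plan)
```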

To use this plan, you need to extract the id parameter from the response:

"id": "my-first-spotless-plan-de650d75-df3d-4df0-882f-7339ab2c59f6"

We can now submit a file for cleaning:

# create a job, closing the uploaded file once the request has been sent
with open("ip_list.csv", "rb") as original_file:
    response = requests.post(
        "https://spotlessdata.com/api/jobs/",
        headers={"Authorization": "Token " + token},
        data={"plan": plan},
        files={"original_file": original_file}
    )

Here the value of plan is the id returned by your previous call, that is, response.json()["id"].

When the job first comes back it includes a URL you can use to retrieve the job. Poll that URL until processed_file is set:

import time

# wait for the job to be processed...
while response.json()["processed_file"] is None:
    time.sleep(1)
    response = requests.get(response.json()["url"], headers={"Authorization": "Token " + token})
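
The loop above polls indefinitely if the job never completes. One possible refinement, sketched here under the assumption that a timeout is acceptable for your use case, is a small helper that gives up after a deadline. The name wait_for_job and its parameters are hypothetical, and the demonstration uses a stand-in for the real GET request so that it runs without network access.

```python
import time


def wait_for_job(fetch_job, interval=1.0, timeout=60.0):
    """Call fetch_job() until "processed_file" is set or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        job = fetch_job()
        if job.get("processed_file") is not None:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError("job was not processed within %s seconds" % timeout)
        time.sleep(interval)


# Stand-in for the real GET request: reports completion on the third call.
responses = iter([
    {"processed_file": None},
    {"processed_file": None},
    {"processed_file": "https://example.com/cleaned.csv"},
])
job = wait_for_job(lambda: next(responses), interval=0.01, timeout=5)
print(job["processed_file"])
```

With the requests library, fetch_job could be a function that performs the GET on the job URL with the Authorization header and returns the parsed JSON.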

Once processing is complete, the job's JSON response includes a link to the cleaned file:

{
    "profile": "https://spotlessdata.com/api/plans/47381c9a-4363-4964-93d0-0871d257f2d8/",
    "original_file": "https://spotlessdata.com/uploads/original/source/ip_list_Utrd5TD.csv",
    "can_delete": false,
    "url": "https://spotlessdata.com/api/jobs/bf386633-84ae-4e47-a2f5-547d95250b88/",
    "processed_file": "http://spotlessdata.com/uploads/complete/dirty/bf386633-84ae-4e47-a2f5-547d95250b88-cleaned.csv",
    "processing_complete": true
}

We can then download and view the file referenced in processed_file. As our original file was valid, we get an almost identical file back:

IP Address, Number Requests
192.168.0.1,124
192.168.0.2,65
192.168.0.3,42

However, what happens if we have a file with a problem in it? Let’s consider the following file:

IP Address, Number Requests
192.168.0.1,124
192.16.8.0.2,65
192.168.0.3,42

If we look at the online detail for the IP validation rule at /api/rules/regex_rules/c0e8fc58-e6c4-11e5-9730-9a79f06e9478/, we can see that the option fallback_mode is set to remove_record, which means that any invalid rows are removed. When we run the cleansing on this file, we get this back:

IP Address, Number Requests
192.168.0.1,124
192.168.0.3,42
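
To make the effect of remove_record concrete, here is a local simulation of that behaviour. It is a sketch under the assumption that the rule simply drops any row whose first column is not a valid IP address and keeps the header row untouched; it uses Python's ipaddress module rather than the service's own regex, and remove_invalid_records is a hypothetical name.

```python
import ipaddress


def is_valid_ip(value):
    """Return True if value parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(value.strip())
        return True
    except ValueError:
        return False


def remove_invalid_records(text, column=0):
    """Keep the header line and drop rows whose given column is not a valid IP."""
    lines = text.splitlines()
    header, records = lines[0], lines[1:]
    kept = [line for line in records if is_valid_ip(line.split(",")[column])]
    return "\n".join([header] + kept) + "\n"


dirty = (
    "IP Address, Number Requests\n"
    "192.168.0.1,124\n"
    "192.16.8.0.2,65\n"
    "192.168.0.3,42\n"
)
print(remove_invalid_records(dirty))
```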

You can see the data is becoming spotless!

Continue reading for more detail on how to create your own rules, schedule manual review of data, and receive incremental updates to your data.

The full code in Python 3 for this example is:

import time

import requests

# create a UTF-8 encoded file with the IP list (the second address is malformed)
with open("ip_list.csv", "wt", encoding="utf-8") as f:
    f.write("192.168.0.1,124\n")
    f.write("192.16.8.0.2,65\n")
    f.write("192.168.0.3,42\n")


token = "<YOUR TOKEN GOES HERE>"

# create a plan to validate
response = requests.post(
    "https://spotlessdata.com/api/plans/",
    headers={"Authorization": "Token " + token},
    json={
        "name": "my first spotless plan",
        "rules": [
            {"rule": "ip-address-validator-regex-rule-c0e8fc58-e6c4-11e5-9730-9a79f06e9478",
             "source_field": 0},
            {"rule": "positive-integer-validator-regex-rule-c0e8e0ce-e6c4-11e5-9730-9a79f06e9478",
             "source_field": 1}
        ]
    }
)
plan = response.json()["id"]

# create a job, closing the uploaded file once the request has been sent
with open("ip_list.csv", "rb") as original_file:
    response = requests.post(
        "https://spotlessdata.com/api/jobs/",
        headers={"Authorization": "Token " + token},
        data={"plan": plan},
        files={"original_file": original_file}
    )

# wait for the job to be processed...
while response.json()["processed_file"] is None:
    time.sleep(1)
    response = requests.get(response.json()["url"], headers={"Authorization": "Token " + token})


response = requests.get(response.json()["processed_file"])

print(response.text)