How to use Spotless Data?

Reference Rules

Reference rules are one of the most powerful types of rules available in Spotless. They check that a field only contains values from a specific set provided in a separate file. They are typically used to:

  1. Validate database foreign keys between two different systems
  2. Check that manually entered address information matches actual address information
  3. Check that natural keys between two different systems are of the same format

 

Fields specific to reference rules

Reference rules have the following fields (as well as the usual fields common to every rule):

  • original_reference - this is a UTF-8 encoded CSV file that contains a set of reference values that the field will be matched against.
  • reference_field - this specifies the field in the original_reference file that contains the reference values. For CSV files, it is a column index, with the first column having a value of 0.
  • can_add_new_values - this specifies whether the automatic cleansing algorithm can match against values that are not present in the reference file or only against this specific list. If the reference file is based on a dataset that you know is complete, like database foreign keys, then you should set this to False. If it is based on a dataset that may be incomplete, like a list of cities in Canada, then you should set this to True.
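
To make the role of reference_field concrete, the sketch below shows the kind of check a reference rule describes: read one column of a CSV file (counting from 0) and flag any value that is not in it. This is only an illustration and not Spotless's own implementation; the function names, the sample data and the assumption that the file has no header row are all hypothetical.

```python
import csv
import io


def load_reference_values(csv_file, reference_field=0):
    """Collect the values found in one column of a reference CSV."""
    reader = csv.reader(csv_file)
    # Skip rows that are too short to contain the requested column.
    return {row[reference_field] for row in reader if len(row) > reference_field}


def find_invalid_values(values, reference_values):
    """Return the values that do not appear in the reference set."""
    return [value for value in values if value not in reference_values]


# Hypothetical reference data: fruit names in column 0, colours in column 1.
# Assumes the file has no header row.
reference_csv = io.StringIO("apple,red\nbanana,yellow\ncherry,red\n")
reference_values = load_reference_values(reference_csv, reference_field=0)

invalid = find_invalid_values(["apple", "bananna", "cherry"], reference_values)
print(invalid)  # ['bananna']
```

Here "bananna" is flagged because it does not appear in column 0 of the reference data. What happens to a flagged value is decided by the rule's other settings, which this sketch does not model.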

 

Example reference rules

There are many example reference rules available in the API. The fields for an example reference rule are shown below in JSON form. Note that original_reference is shown here as a file name, but the CSV file itself is uploaded with the request:

{
    "name": "my first spotless reference rule",
    "is_private": true,
    "fallback_mode": "use_closest",
    "default_value": "banana",
    "original_reference": "fruit_list.csv",
    "reference_field": 0,
    "can_add_new_values": true
}

 

Creating reference rules

Reference rules can be created using the browsable API or in code.

When creating a reference rule, the reference CSV file needs to be submitted as part of the request, as a file upload alongside the other fields. The syntax for this varies from language to language; the example below shows how to do it in Python:

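import requests

token = "YOUR_API_TOKEN"  # placeholder: replace with your own API token

# Open the reference CSV in binary mode; the with-block closes it after the upload.
with open("fruit_list.csv", "rb") as reference_file:
    response = requests.post(
        "https://spotlessdata.com/api/rules/rule_reference/",
        headers={"Authorization": "Token " + token},
        # Python booleans are written True/False, not true/false.
        data={"name": "my first spotless reference rule",
              "is_private": True,
              "fallback_mode": "use_closest",
              "default_value": "banana",
              "reference_field": 0,
              "can_add_new_values": True},
        files={"original_reference": reference_file},
    )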
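
The request above expects fruit_list.csv to already exist as a UTF-8 encoded CSV file. If you do not have one yet, a minimal sketch for producing it with Python's standard csv module could look like this. The helper name write_reference_csv and the sample values are hypothetical, and the file is assumed to hold one reference value per row in column 0, with no header row.

```python
import csv


def write_reference_csv(path, values):
    """Write one reference value per row as a UTF-8 encoded CSV file."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", encoding="utf-8", newline="") as handle:
        writer = csv.writer(handle)
        for value in values:
            writer.writerow([value])


# Hypothetical sample values; the last one checks that non-ASCII text is kept.
write_reference_csv("fruit_list.csv", ["apple", "banana", "cherry", "açaí"])
```

With the values in column 0, a reference_field of 0 points at them, matching the example request.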

You can create as many reference rules as you want, but please note that unused reference rules are deleted after 90 days.