How to use Spotless Data?

Duplication Rules

Duplication rules provide a simple means of removing duplicates in Spotless. They check that each value appears in only one row of a column. They are typically used to:

  1. Remove duplicate email addresses
  2. Validate that unique keys in a database are unique
  3. Remove multiple entries from different systems
 

Fields specific to duplication rules

Duplication rules have the following fields (as well as the usual fields common to every rule):

  • duplication_type - this can be set to one of the following three values:
    • check_this_column_only - duplicates are searched for in this column only; the other columns are ignored
    • check_all_columns - all columns are searched, and a row is treated as a duplicate only if every column matches
    • check_other_columns - only the other columns are searched for duplicates; this column is ignored
  • use_last_value - a boolean that specifies whether the last of the duplicate values should be retained. By default, the first of the duplicate values is retained

The duplication rule always removes the duplicate rows - there is no option to set a fallback mode to use a default value or the closest match.
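
The behaviour of these fields can be illustrated with a small local sketch. This is not Spotless's implementation: the function remove_duplicates, its parameters, and the sample rows are hypothetical, and the code only reflects one reading of the field descriptions above.

```python
def remove_duplicates(rows, column, duplication_type="check_this_column_only",
                      use_last_value=False):
    """Illustrative only: drop duplicate rows from a list of dicts.

    `column` is the column the (hypothetical) rule is attached to.
    """
    def key_for(row):
        # Build the value that decides whether two rows count as duplicates.
        if duplication_type == "check_this_column_only":
            return (row[column],)
        if duplication_type == "check_all_columns":
            return tuple(sorted(row.items()))
        if duplication_type == "check_other_columns":
            return tuple(sorted((k, v) for k, v in row.items() if k != column))
        raise ValueError("unknown duplication_type: " + str(duplication_type))

    # To keep the last duplicate, scan from the end and restore order afterwards.
    ordered = list(reversed(rows)) if use_last_value else list(rows)
    seen = set()
    kept = []
    for row in ordered:
        key = key_for(row)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    if use_last_value:
        kept.reverse()
    return kept


rows = [
    {"email": "a@example.com", "source": "crm"},
    {"email": "b@example.com", "source": "crm"},
    {"email": "a@example.com", "source": "billing"},
]
first_kept = remove_duplicates(rows, "email")
last_kept = remove_duplicates(rows, "email", use_last_value=True)
```

With the sample rows, first_kept retains the two "crm" rows, while last_kept retains the "crm" row for b@example.com and the "billing" row for a@example.com.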

Example duplication rules

There are many example duplication rules available in the API, shown here. An example JSON for a duplication rule is shown below:

{
    "id": "duplication-rule-1---check-this-column-only-duplicationrule-c03194d2-ebad-41be-9a80-65f363c3fcef",
    "url": "https://spotlessdata.com/api/rules/duplication_rules/c03194d2-ebad-41be-9a80-65f363c3fcef/",
    "name": "Duplication Rule 1 - Check this column only",
    "description": "",
    "is_private": false,
    "duplication_type": "check_this_column_only",
    "use_last_value": false
}

 

Creating duplication rules

Duplication rules can be created using the browsable API here or in code.

Here is example code to create a duplication rule in Python:

import requests

# Replace with your own Spotless API token.
token = "YOUR_API_TOKEN"

response = requests.post(
    "https://spotlessdata.com/api/rules/rule_reference/",
    headers={"Authorization": "Token " + token},
    data={
        "id": "duplication-rule-1---check-this-column-only-duplicationrule-c03194d2-ebad-41be-9a80-65f363c3fcef",
        "url": "https://spotlessdata.com/api/rules/duplication_rules/c03194d2-ebad-41be-9a80-65f363c3fcef/",
        "name": "Duplication Rule 1 - Check this column only",
        "description": "",
        "is_private": False,  # Python booleans are capitalised, unlike JSON
        "duplication_type": "check_this_column_only",
        "use_last_value": False,
    },
)
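
Because duplication_type accepts only three values, it can help to check the payload locally before sending it. The following is a minimal sketch under that assumption; the helper build_duplication_rule and the constant VALID_DUPLICATION_TYPES are hypothetical names and are not part of the Spotless API.

```python
# The three duplication types documented above.
VALID_DUPLICATION_TYPES = {
    "check_this_column_only",
    "check_all_columns",
    "check_other_columns",
}


def build_duplication_rule(name, duplication_type, use_last_value=False,
                           description="", is_private=False):
    """Hypothetical helper: return a payload dict for a duplication rule."""
    if duplication_type not in VALID_DUPLICATION_TYPES:
        raise ValueError("unknown duplication_type: " + str(duplication_type))
    return {
        "name": name,
        "description": description,
        "is_private": is_private,
        "duplication_type": duplication_type,
        "use_last_value": bool(use_last_value),
    }


payload = build_duplication_rule(
    "Duplication Rule 1 - Check this column only",
    "check_this_column_only",
)
```

The resulting dictionary could then be passed as the data argument of a requests.post call like the one above.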

You can create as many duplication rules as you want, but note that unused duplication rules are deleted after 90 days.