How to use Spotless Data?

Duplication Rules

Duplication rules provide a simple means of removing duplicate rows in Spotless. They check that each value (or combination of values) in the chosen columns appears in only one row. Typical uses are to:

  1. Remove duplicate email addresses
  2. Validate that unique keys in a database are unique
  3. Remove multiple entries from different systems
 

Fields specific to duplication rules

Duplication rules have the following fields (as well as the usual fields common to every rule):

  • unique_fields - this contains a comma-separated list of the columns to be checked for duplication.
  • use_last_value - this is a boolean value that specifies whether the last of the duplicate values should be retained. The default is for the first of the duplicate values to be retained.

The duplication rule always removes the duplicate rows - there is no option to set a fallback mode to use a default value or the closest match.
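
The behaviour of the two fields can be illustrated with a minimal Python sketch. This is not Spotless's implementation: the function name drop_duplicates and the sample rows are hypothetical, and the order of the returned rows is an assumption made for the illustration.

```python
def drop_duplicates(rows, unique_fields, use_last_value=False):
    """Keep one row per distinct combination of the unique_fields columns.

    rows is a list of dicts; unique_fields is a comma-separated string,
    as in the rule definition. This is an illustrative sketch only.
    """
    columns = [name.strip() for name in unique_fields.split(",")]
    kept = {}
    for row in rows:
        key = tuple(row[column] for column in columns)
        # Keep the first row seen for a key, or overwrite it with each
        # later row when use_last_value is set.
        if key not in kept or use_last_value:
            kept[key] = row
    return list(kept.values())


# Hypothetical sample data: the first two rows share channel and airdate.
rows = [
    {"channel": "One", "airdate": "2017-01-01", "title": "News"},
    {"channel": "One", "airdate": "2017-01-01", "title": "News (repeat)"},
    {"channel": "Two", "airdate": "2017-01-01", "title": "Film"},
]

first_kept = drop_duplicates(rows, "channel,airdate")
last_kept = drop_duplicates(rows, "channel,airdate", use_last_value=True)
```

With the default setting the "News" row survives; with use_last_value set to true the "News (repeat)" row survives instead. In both cases the duplicate row is removed rather than replaced.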

Example duplication rules

There are many example duplication rules available in the API, shown here. An example JSON for a duplication rule is shown below:

{
    "id": "duplication-rule-1---check-this-column-only-duplicationrule-c03194d2-ebad-41be-9a80-65f363c3fcef",
    "url": "https://spotlessdata.com/api/rules/duplication_rules/c03194d2-ebad-41be-9a80-65f363c3fcef/",
    "name": "Duplication Rule on Time and Channel",
    "description": "",
    "is_private": false,
    "unique_fields": "channel,airdate",
    "use_last_value": false
}

 

Creating duplication rules

Duplication rules can be created using the browseable API here or in code.

Here is example code to create a duplication rule in Python:

import requests

# Replace with your own Spotless API token.
token = "YOUR_API_TOKEN"

response = requests.post(
    "https://spotlessdata.com/api/rules/rule_reference/",
    headers={"Authorization": "Token " + token},
    data={
        "id": "duplication-rule-1---check-this-column-only-duplicationrule-c03194d2-ebad-41be-9a80-65f363c3fcef",
        "url": "https://spotlessdata.com/api/rules/duplication_rules/c03194d2-ebad-41be-9a80-65f363c3fcef/",
        "name": "Duplication Rule on channel and time",
        "description": "",
        # Python booleans are capitalised, unlike JSON's true/false.
        "is_private": False,
        "unique_fields": "channel,airdate",
        "use_last_value": False,
    },
)

# Raise an exception if the API returned an error status.
response.raise_for_status()

You can create as many duplication rules as you want but please note that unused duplication rules are deleted after 90 days.
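
If you create several rules in code, it can help to assemble the request body from a list of column names rather than writing the comma-separated string by hand. The helper below is a hypothetical sketch: build_duplication_rule_payload is not part of any Spotless library, and it only fills in the field names that appear in the example JSON above. Whether the API needs other fields in the request is not covered here.

```python
def build_duplication_rule_payload(name, columns, use_last_value=False,
                                   description="", is_private=False):
    """Build a dict of duplication rule fields (hypothetical helper).

    columns is a list of column names; it is joined into the
    comma-separated string expected by unique_fields.
    """
    if not columns:
        raise ValueError("at least one column is needed for unique_fields")
    return {
        "name": name,
        "description": description,
        "is_private": is_private,
        "unique_fields": ",".join(columns),
        "use_last_value": use_last_value,
    }


payload = build_duplication_rule_payload(
    "Duplication Rule on channel and time",
    ["channel", "airdate"],
)
```

The resulting dict could then be passed as the request body in a call like the one shown above.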