How to use Spotless Data?

Creating your own rules

Rules are the core of the Spotless API: they define how fields in files are cleansed. In addition to the standard rules available in Spotless and the rules that other users have created, you can create your own rules. Rules can be created either in the browseable API or by calling the API from your favoured programming language.

In the browseable API you can browse all rules of each type. For example, you can view and create regex rules at /api/rules/regex_rules/.

Every type of rule shares a few common fields:

  • name - the name is used to identify the rule. It doesn't need to be unique but should describe what the rule does.
  • description - the description should contain more information on what the rule does so it's clear for yourself, for other users, and most importantly, for our data science team, should they need to optimise the automatic cleansing algorithm for your rule.
  • is_private - this is a boolean that specifies whether other users can use your rule or whether it is kept private. Other users will never be able to edit your rules, so unless your rule contains sensitive information we recommend making all rules public.
  • fallback_mode - this specifies what should happen if data is found that does not meet the rule. More details on this below.
  • default_value - one option for handling invalid data is to replace it with a default value and in this case you should set the default value here.
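As an illustration of how the common fields fit together, the following Python sketch builds a rule payload and prepares (but does not send) a request to the regex rules endpoint. Only the path /api/rules/regex_rules/ and the five field names come from this page. The host name, the use of a JSON POST body, and the helper names build_rule_payload and build_request are assumptions made for the example, and authentication and any rule-type-specific fields are left out because they are not described here.

```python
import json
from urllib.request import Request

# Hypothetical host: this page only gives the path /api/rules/regex_rules/.
BASE_URL = "https://spotless.example.com"

# The four fallback modes listed under "Handling data that does not meet the rule".
FALLBACK_MODES = {"remove_record", "use_default", "use_closest", "do_not_replace"}


def build_rule_payload(name, description, is_private=False,
                       fallback_mode="do_not_replace", default_value=None):
    """Collect the common fields shared by every type of rule."""
    if fallback_mode not in FALLBACK_MODES:
        raise ValueError(f"unknown fallback_mode: {fallback_mode!r}")
    if fallback_mode == "use_default" and default_value is None:
        # use_default only makes sense when a default value is supplied.
        raise ValueError("use_default requires a default_value")
    return {
        "name": name,
        "description": description,
        "is_private": is_private,
        "fallback_mode": fallback_mode,
        "default_value": default_value,
    }


def build_request(payload, path="/api/rules/regex_rules/"):
    """Prepare a POST request (assumed to be JSON) without sending it."""
    return Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


payload = build_rule_payload(
    name="positive_integer",
    description="Accepts whole numbers greater than or equal to zero.",
    fallback_mode="use_default",
    default_value="0",
)
request = build_request(payload)
print(request.get_method(), request.full_url)
```

Validating the fallback mode before the request is built keeps an obvious mistake, such as choosing use_default without a default value, from reaching the API at all.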

 

Handling data that does not meet the rule

When Spotless finds a record that does not match the specified rule, it uses its proprietary, patent-pending data cleansing algorithm to choose a suitable alternative. If Spotless cannot immediately find a fix, the record is escalated to our data science team for investigation, so your job may take slightly longer to complete.

Even if a record is escalated, you always get a first pass of your data back immediately, and you can go back later to get the improved dataset. The more data the Spotless algorithm cleans, the more it improves for every rule, so we strongly recommend that users share their rules with other users to get the best possible performance. To guide our data science team, we recommend providing as detailed a description as possible for each of your rules.

If data cannot be corrected automatically, the rule's fallback_mode determines what happens. Four options are available:

  • remove_record - in this case the entire record is removed from the file. This is most suitable when there is no obvious way to resolve invalid data.
  • use_default - in this case the invalid value is replaced with the rule's default_value. This is most suitable when there is a clear default: for example, if a rule is checking for positive integers, any integer lower than zero could be set to zero by default.
  • use_closest - in this case the closest matching record is used. For reference rules this is the closest matching record in the reference dataset.
  • do_not_replace - in this case the data is not updated but it is returned unmatched.
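The four options can be pictured with a small local sketch in Python. This is not the Spotless algorithm and does not call the API: it only mimics the described outcome of each mode on a list of values. The function name apply_fallback, its parameters, and the use of difflib string similarity for use_closest are assumptions made for the example.

```python
import difflib


def apply_fallback(values, is_valid, mode, default_value=None, reference=()):
    """Mimic the described effect of each fallback mode on invalid values."""
    cleaned = []
    for value in values:
        if is_valid(value):
            cleaned.append(value)
        elif mode == "remove_record":
            # The invalid record is dropped entirely.
            continue
        elif mode == "use_default":
            cleaned.append(default_value)
        elif mode == "use_closest":
            # Assumption: "closest" is approximated by string similarity
            # against a reference list; the value is kept if the list is empty.
            matches = difflib.get_close_matches(value, list(reference), n=1, cutoff=0.0)
            cleaned.append(matches[0] if matches else value)
        elif mode == "do_not_replace":
            # The invalid value is passed through as it is.
            cleaned.append(value)
        else:
            raise ValueError(f"unknown fallback mode: {mode!r}")
    return cleaned


def is_non_negative_integer(text):
    """A stand-in rule: accept only strings made of digits."""
    return text.isdigit()


numbers = ["3", "-2", "7"]
print(apply_fallback(numbers, is_non_negative_integer, "remove_record"))
print(apply_fallback(numbers, is_non_negative_integer, "use_default", default_value="0"))

countries = ["France", "Germany", "Spain"]
print(apply_fallback(["France", "Germny"], countries.__contains__, "use_closest",
                     reference=countries))
```

In this sketch remove_record shortens the output, while the other three modes preserve the number of records, which is worth keeping in mind if downstream processing relies on row counts.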

Please note that rules that are not used for 90 days will be deleted from the platform.