How to use Spotless Data?

Introduction

Spotless provides a web service for cleansing data. It's designed to be easy to incorporate into your applications so that you always have clean data to work from. It offers many different rules for validating whether data is correct, and a number of options for how invalid data is handled.

You can use Spotless to clean your data as a one-off process by submitting your file here, or you can use the API to integrate it into your applications.

Key concepts

Spotless works using rules, plans, and jobs. Files are cleansed in jobs which apply a series of rules to a file according to a plan specified by the user.

  • Rules are set up to validate a field in the data. A rule could check that the data is of the correct type, apply a regular expression, check the bounds of a number, or ensure that the data is chosen from a specific list
  • Plans specify how rules should be applied to a specific file
  • Jobs are run when a file is submitted for processing using a specified plan

To run Spotless, you need to create some rules (or choose from the publicly available rules), make a plan (a cleansing profile that describes the file to be uploaded), and then submit one or more jobs to be processed.
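
The relationship between rules, plans, and jobs can be sketched locally in Python. This is only a conceptual model under stated assumptions: the names Rule, Plan, and run_job are hypothetical, chosen for illustration, and are not part of the Spotless API.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical local model for illustration only; not the Spotless API.

@dataclass
class Rule:
    name: str
    check: Callable[[str], bool]  # returns True when the value is valid

@dataclass
class Plan:
    # Maps a column name to the rules applied to that column.
    column_rules: Dict[str, List[Rule]]

def run_job(plan: Plan, rows: List[Dict[str, str]]) -> List[Tuple[int, str, str]]:
    """Apply the plan to every row; return (row_index, column, rule_name) failures."""
    failures = []
    for i, row in enumerate(rows):
        for column, rules in plan.column_rules.items():
            for rule in rules:
                if not rule.check(row.get(column, "")):
                    failures.append((i, column, rule.name))
    return failures

# Two example rules: a type check and a lookup against a fixed list.
is_integer = Rule("is_integer", lambda v: re.fullmatch(r"-?[0-9]+", v) is not None)
known_city = Rule("known_city", lambda v: v in {"London", "Paris"})

plan = Plan({"age": [is_integer], "city": [known_city]})
rows = [{"age": "42", "city": "London"}, {"age": "4x", "city": "Lodnon"}]
failures = run_job(plan, rows)
```

Here the second row fails both rules, so failures holds one entry per failed check. In Spotless itself the rules and plan live in the service, and the job is what runs when the file is submitted.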

Use cases

Spotless is typically used to cleanse multiple records in a file. It can process any kind of text file. The default format is CSV; to process a TSV file, set the csv_delimiter parameter on the plan to \t.
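
As a local illustration of what a tab delimiter means, the short sketch below parses tab-separated text with Python's standard csv module. It mirrors the idea behind csv_delimiter only; it does not call Spotless, and the sample data is made up.

```python
import csv
import io

# Made-up sample data: two columns separated by tab characters.
tsv_text = "name\tcity\nAda\tLondon\nLinus\tHelsinki\n"

# delimiter="\t" tells the parser to split fields on tabs instead of commas.
rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
```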

Typical use cases are:

  • Validating that a file is a valid, well-formed CSV file
  • Validating columns in the CSV file against a regular expression rule
  • Validating that foreign key values in a CSV file exist in another file
  • Validating that manually entered data, such as city names, exists in an external list of city names
  • Validating that date fields are well-formed
  • Validating that number fields are positive
  • Validating that number fields are integers
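
To make the checks above concrete, here is a minimal Python sketch of the kind of logic such rules express. The function names and the default date format are assumptions made for illustration; Spotless defines its own rules, and this code does not use its API.

```python
import re
from datetime import datetime

# Hypothetical helpers for illustration; not Spotless rule definitions.

def matches_pattern(value: str, pattern: str) -> bool:
    """True when the whole value matches the regular expression."""
    return re.fullmatch(pattern, value) is not None

def is_well_formed_date(value: str, fmt: str = "%Y-%m-%d") -> bool:
    """True when the value parses as a real calendar date in the assumed format."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False

def is_positive_integer(value: str) -> bool:
    """True when the value is written as an integer and is greater than zero."""
    return re.fullmatch(r"[0-9]+", value) is not None and int(value) > 0

def in_reference_list(value: str, allowed: set) -> bool:
    """True when the value appears in an external list, such as city names."""
    return value in allowed
```

For example, is_well_formed_date("2021-02-30") is rejected because February has no 30th day, and is_positive_integer("3.5") is rejected because the value is not an integer.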