Data cleansing against reference datasets

Data can often be validated by checking whether it exists in another database. For example, countries, cities, and streets are all well defined, as are car registration plates and popular products. However, when people enter information into online forms, or when data from two different databases is combined, differing practices frequently lead to different spellings, so values that should be identical appear to differ.

At Spotless, we correct these problems by comparing data to a reference dataset. Each value is checked against the reference dataset; if it isn’t present, it can be set to the closest match (to correct typos) or added to the reference dataset (if the value is valid but missing).
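To make the idea concrete, here is a minimal sketch of reference-based cleansing in Python. It is not Spotless's actual implementation: the function name `cleanse`, the `cutoff` threshold, and the `add_missing` flag are all hypothetical, and the similarity measure is simply the one provided by Python's standard `difflib` module.

```python
from difflib import get_close_matches


def cleanse(value, reference, cutoff=0.8, add_missing=False):
    """Check one value against a reference set (illustrative sketch only).

    Returns a (cleaned_value, action) pair, where action is one of
    "exact", "corrected", "added" or "unmatched".
    """
    # 1. The value is already present in the reference dataset.
    if value in reference:
        return value, "exact"

    # 2. Otherwise look for the closest reference entry (typo correction).
    #    sorted() keeps the result deterministic when scores are tied.
    matches = get_close_matches(value, sorted(reference), n=1, cutoff=cutoff)
    if matches:
        return matches[0], "corrected"

    # 3. No close match: optionally treat the value as a new valid entry.
    if add_missing:
        reference.add(value)
        return value, "added"

    return value, "unmatched"


# Hypothetical reference dataset of city names.
cities = {"London", "Leeds", "Bristol"}

print(cleanse("Leeds", cities))   # ('Leeds', 'exact')
print(cleanse("Londn", cities))   # ('London', 'corrected')
```

In this sketch the comparison is case-sensitive and the threshold of 0.8 is arbitrary; a real rule would probably normalise case and whitespace first and tune the threshold to the dataset.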

Some of our most popular standard rules use the reference datasets:

  • TV Shows Title Match Rule - the TV show validator compares against a database of over 17,000 TV shows to clean up the TV show titles in your data
  • UK Cities & Towns Name Check Rule - provides easy address database cleansing, particularly if you want to visualise cities and cluster according to geographies
  • India Cities Name Check Rule - India is a slightly more challenging market for address databases and this rule is maintained actively to provide a complete list of all cities


You can also create your own reference rule set by uploading your own reference dataset. Remember to share it with the Spotless community!

You can also check out our pricing and our Subscription Packages here.

If data quality is an issue for you, or you have known sources of dirty data but your files are too big and the problems too numerous to fix manually, please log in and try it now.