Data cleansing with regular expressions

One of the most common problems with free-form data entry is that the data is not submitted in a standard form. This makes it hard to identify duplicate records and even harder to integrate data from a number of different sources.

 

For example, email addresses should always be in the form xxxx@yyy.zz, and telephone numbers in the US should always have 10 digits. If you’re matching two datasets and one has +1 202-456-1111 and the other has (202) 456 1111, you wouldn’t know that they are the same phone number unless you pre-process them.
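To make the phone number case concrete, here is a minimal sketch in Python of the kind of pre-processing that would let the two values above match. The function name `normalize_us_phone` is hypothetical and this is not the Spotless implementation. It simply assumes that a US number is 10 digits with an optional leading country code of 1.

```python
import re
from typing import Optional


def normalize_us_phone(raw: str) -> Optional[str]:
    """Return the 10 national digits of a US phone number, or None if it does not fit."""
    # Drop everything that is not a digit: spaces, dashes, brackets, the plus sign.
    digits = re.sub(r"\D", "", raw)
    # Accept an optional leading country code 1 followed by exactly 10 digits.
    match = re.fullmatch(r"1?(\d{10})", digits)
    return match.group(1) if match else None


# Both spellings from the example reduce to the same canonical value.
print(normalize_us_phone("+1 202-456-1111"))
print(normalize_us_phone("(202) 456 1111"))
```

Once both datasets are reduced to the same canonical form, a plain equality check is enough to spot the duplicate.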

At Spotless, we use regular expressions to validate that each record is in the right format. If it isn’t, we automatically cleanse the data to put it into that format.
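The validate-then-cleanse idea can be sketched for email addresses as follows. This is an illustration only: the pattern `EMAIL_SHAPE` and the function `cleanse_email` are hypothetical names, the pattern checks only the rough xxxx@yyy.zz shape rather than the full email standard, and the two repairs shown (removing stray whitespace and trailing punctuation) are assumed examples, not the actual Spotless rules.

```python
import re
from typing import Optional

# Deliberately simple check for the xxxx@yyy.zz shape; not a complete email validator.
EMAIL_SHAPE = re.compile(r"[^@\s]+@[^@\s]+\.[A-Za-z]{2,}")


def cleanse_email(raw: str) -> Optional[str]:
    """Return the value if it is valid, a repaired value if it can be fixed, else None."""
    # Step 1: validate. A record already in the right format is left untouched.
    if EMAIL_SHAPE.fullmatch(raw):
        return raw
    # Step 2: cleanse. Remove whitespace anywhere in the value...
    candidate = re.sub(r"\s+", "", raw)
    # ...and strip punctuation left over at the end, such as a trailing comma.
    candidate = re.sub(r"[.,;:]+$", "", candidate)
    # Step 3: validate again, and give up if the repaired value still does not fit.
    return candidate if EMAIL_SHAPE.fullmatch(candidate) else None


print(cleanse_email(" jane @example.com, "))
print(cleanse_email("not an email"))
```

Returning `None` for values that cannot be repaired keeps those records visible for manual review instead of silently passing them through.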

Some of the most popular standard rules use regular expressions:

You can also create your own regular expression rules and share them with the Spotless community. For a great tutorial on regular expressions, try this link:

http://www.regular-expressions.info/tutorial.html

You can also check out our pricing and our Subscription Packages here.

If data quality is an issue for you, or you know that you have sources of dirty data but your files are too big and the problems too numerous to fix manually, please log in and try it now.