One of the most common problems with free-form data entry is that data is not submitted in a standard format. This makes it hard to identify duplicate records, and even harder to integrate data from several different sources while preserving data integrity.
For example, email addresses should always be in the form email@example.com, and US telephone numbers should always have 10 digits (not counting the +1 country code). If you’re matching two datasets and one has +1 202-456-1111 while the other has (202) 456 1111, you wouldn’t know that they are the same phone number unless you pre-process them.
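To make the phone number case concrete, here is a minimal sketch in Python of the kind of pre-processing that makes the two numbers above comparable. The function name normalize_us_phone and the choice of a bare 10-digit string as the canonical form are assumptions made for illustration; this is not a description of how Spotless does it.

```python
import re

def normalize_us_phone(raw):
    """Reduce a US phone number to its 10 digits, or return None.

    Hypothetical helper for illustration only.
    """
    # Drop everything that is not a digit: spaces, dashes, brackets, "+".
    digits = re.sub(r"\D", "", raw)
    # Remove a leading US country code if one is present.
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    # Anything that is not exactly 10 digits is treated as unusable here.
    return digits if len(digits) == 10 else None

a = normalize_us_phone("+1 202-456-1111")
b = normalize_us_phone("(202) 456 1111")
print(a == b)  # the two records now compare as the same number
```

Once both datasets have been passed through the same function, a plain equality check is enough to match records.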
At Spotless, we use regular expressions to validate data: we check that each record is in the right format and, if it isn’t, we automatically clean the data to put it into that format.
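As a rough illustration of the validate-then-clean idea, the Python sketch below checks an email address against a regular expression and, if the check fails, tries a simple clean-up before checking again. The pattern, the function name validate_or_clean_email, and the clean-up steps are all assumptions chosen for the example; the pattern is deliberately loose and does not cover every valid address, and none of this is taken from the Spotless product.

```python
import re

# Deliberately simple pattern (an assumption): something@something.something,
# with no whitespace, extra "@", or angle brackets anywhere.
EMAIL_PATTERN = re.compile(r"[^@\s<>]+@[^@\s<>]+\.[^@\s<>]+")

def validate_or_clean_email(raw):
    """Return the address in the expected form, or None if it cannot be fixed.

    Hypothetical helper for illustration only.
    """
    # Step 1: validate. A record already in the right format is kept as-is.
    if EMAIL_PATTERN.fullmatch(raw):
        return raw
    # Step 2: clean. Remove stray whitespace and wrapping angle brackets.
    cleaned = re.sub(r"\s+", "", raw).strip("<>")
    # Step 3: validate again, and give up if the record is still malformed.
    return cleaned if EMAIL_PATTERN.fullmatch(cleaned) else None

print(validate_or_clean_email(" <email@example.com> "))
print(validate_or_clean_email("not an email"))
```

Returning None for records that cannot be repaired keeps them easy to flag for manual review, rather than letting a bad guess slip into the cleaned data.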
For a great tutorial on regular expressions, try this link:
If data quality is an issue for you, or you have known sources of dirty data but your files are too big and the problems too numerous to fix manually, please do log in and try now.