Five dirty data issues in need of data cleaning

Transferring data from legacy systems requires data integration

Dirty or rogue data will always have a profound effect on any organisation.

Spotless Data has recognised five fundamentally different types of dirty data cleansing, each of which can be carried out using our unique web-based data quality API solution.

1. Regex

Regex are regular expressions, which define search patterns and identify particular strings found within data sets. For instance, if you know that within your user-generated content data there is a common string, then at its simplest a regex can identify one or more occurrences of this string. However, occasionally a hurried user will accidentally enter a typo such as gmail.cmo. Typos like this are dirty data, so if you are trying to contact the users who recorded faulty email addresses by accidentally typing an incorrect string, the emails you send them will bounce.

If there are hundreds of thousands of emails within your big data, you clearly do not have the time to check each entry manually, nor do you want hundreds of bouncing emails. If each client is on average worth $10 and 5% of them submit email addresses with typos (unfortunately not an unreasonable assumption, given that dirty data rates are roughly 5%), the loss adds up quickly: out of 100,000 email addresses, 5,000 would bounce, representing $50,000 in lost value to your company. Not to mention the potential loss of brand reputation if your users are expecting an email from you that never arrives.

To fix this perennial problem we use regex with both search and search-and-replace functions. So, in the above gmail case, our API solution would search for bad strings such as gmail.cmo and automatically replace them with the correct string. Regex has many more applications than solely cleaning email addresses and is particularly useful for errors that appear multiple times or on a regular basis, such as known sources of dirty data, which commonly occur when bringing disparate data sets into a data warehouse on a regular basis.
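As a rough illustration of the search and search-and-replace idea, here is a minimal Python sketch. It is not Spotless's implementation: the function names are hypothetical, and apart from gmail.cmo the list of typos, as well as the assumption that gmail.com is the intended domain, is made up for the example.

```python
import re

# Hypothetical typo table: only "gmail.cmo" comes from the article. The other
# entries and the corrected domain "gmail.com" are assumptions for illustration.
DOMAIN_FIXES = {
    "gmail.cmo": "gmail.com",
    "gmial.com": "gmail.com",
    "gmail.con": "gmail.com",
}

# Match any known bad domain, but only at the very end of the address.
BAD_DOMAIN = re.compile(
    r"@(" + "|".join(re.escape(d) for d in DOMAIN_FIXES) + r")$",
    re.IGNORECASE,
)


def find_bad_emails(emails):
    """Search: return the addresses whose domain is a known typo."""
    return [e for e in emails if BAD_DOMAIN.search(e.strip())]


def fix_email(email):
    """Search and replace: swap a known bad domain for its correction."""
    return BAD_DOMAIN.sub(
        lambda m: "@" + DOMAIN_FIXES[m.group(1).lower()],
        email.strip(),
    )


emails = ["joe@gmail.cmo", "jane@example.org"]
print(find_bad_emails(emails))
print([fix_email(e) for e in emails])
```

Anchoring the pattern to the end of the address keeps the replacement from touching a matching string that happens to appear elsewhere in the value.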

2. References

References use a database against which new data is compared to ensure it is correct. For instance, a database of addresses can be used to check the validity of your stored addresses, typically having been submitted by your customers, entered by your employees or obtained from a third party source.

We also have 17,000 TV show titles which you can use as a reference to ensure that the TV titles you have are consistent, for instance converting your inaccurate Gavin and Stacey into the correct Gavin & Stacey so that your customers can actually find the show. Consistent feedback from our customers has shown us that this is always a problem in the TV industry.

Customers often make typos without realising it, but reference data cleaning can also be used to spot potentially fraudulent customers who deliberately insert typos into their postal addresses. They might hope, say, to get a special offer twice by giving two apparently different postal addresses, trusting that their postmen, with their local knowledge, will recognise that both are actually the same address while your company, without any local knowledge, will fail to do so.

Regardless of how well you train your employees, if they are recording customer details all day long and entering them into your system it is almost inevitable that they will occasionally make a mistake, for instance in recording postal addresses or telephone numbers. Third party sources, by their very nature outside your control, may also contain errors. This is particularly so if you aren't paying for the data or are getting it very cheaply, meaning it becomes your responsibility, rather than that of the data's original owners, to ensure that the data are spotlessly clean. Typically, reference cleaning for addresses, say in the United States, works from a list of valid US addresses and zip codes. When the reference cleaning has to deal with two apparently different addresses it can check them against the database and fix any inaccuracies automatically. Reference cleaning is also extremely useful for differentiating between natural and foreign keys, or for checking that two natural keys from different data sets have the same format.
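The TV title case above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three-item reference list stands in for a real reference database, the function name match_title is hypothetical, and fuzzy matching with the standard library's difflib is one possible technique, not necessarily the one Spotless uses.

```python
from difflib import get_close_matches

# Tiny stand-in for a reference database (assumption for illustration);
# a real reference list would hold thousands of canonical titles.
REFERENCE_TITLES = ["Gavin & Stacey", "Doctor Who", "Sherlock"]

# Compare in lower case, but return the canonical spelling.
_LOOKUP = {title.lower(): title for title in REFERENCE_TITLES}


def match_title(raw, cutoff=0.8):
    """Return the reference title closest to `raw`, or None if none is close enough."""
    candidates = get_close_matches(
        raw.strip().lower(), list(_LOOKUP), n=1, cutoff=cutoff
    )
    return _LOOKUP[candidates[0]] if candidates else None


print(match_title("Gavin and Stacey"))
print(match_title("Completely Different Title"))
```

The cutoff is a trade-off: set it too low and unrelated entries get "corrected" to the wrong reference value, set it too high and genuine typos slip through. Values that return None are best flagged for review rather than changed.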

3. Duplications

Duplication cleaning focuses on removing unnecessary and unwanted duplications, which can so affect your data as to make them almost unreadable, whether by humans or by machine learning software programmes. Thanks to our data science team the removal of duplications is now an easy and straightforward task.

When you submit your dirty data to Spotless we ensure that any given value appears only once in a given column. Typical cases where our cleaning processes prove their worth time and again are when analysing a set of user log-ins, email addresses, postal addresses, telephone numbers or URLs to ensure that the same address does not appear twice.
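A minimal Python sketch of column-level de-duplication follows. The function name and the normalisation rule (trim whitespace, ignore case) are assumptions chosen for email-like values, not a description of how Spotless does it.

```python
def dedupe(values, normalise=lambda v: v.strip().lower()):
    """Keep the first occurrence of each value, comparing normalised forms."""
    seen = set()
    unique = []
    for value in values:
        key = normalise(value)
        if key not in seen:
            seen.add(key)
            unique.append(value)
    return unique


emails = ["joe@example.com", "Joe@Example.com ", "jane@example.com"]
print(dedupe(emails))
```

Normalising before comparing matters because two entries that differ only in case or stray spaces still reach the same inbox.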

After all, you don't want to send out the same mailshot outlining your latest offer to the same person twice, not merely because of the waste of time and resources this represents for your business but also because customers and potential customers get extremely irritated at being bombarded by what they consider to be spam, i.e. not the first but the duplicated communication. While one email or letter from your company would be welcome, two of them might make the recipient see red and decide not to have anything to do with your organisation any more, so unwanted duplications may represent a threat to your brand's reputation. This is even worse with a charity and a begging letter, with people likely to think "I'm not going to waste my money donating to that charity given they have just wasted their allegedly precious time and money sending me two identical letters."

Another classic case of unwanted duplications is when a unique key turns out not to be unique at all. We all know that names are mostly not unique, so it wouldn't be much use using Joe Bloggs's name to uniquely identify him among your customers, as you probably have two or even twenty other customers with the same name. Even email addresses can be shared, so Joe and Jane Bloggs cannot be distinguished by email if they use the same email address, nor by physical address when they live in the same house. As they are both valued customers you certainly don't want to restrict yourself to one customer per household. For this reason, most online retailers and other companies use a unique key in order to identify a particular customer. This key should appear only once in your data, never two or more times, or the new Xbox that Joe Bloggs ordered might end up in the hands of Zhang San, who lives on the other side of the country and hasn't ordered one but would probably like to have one anyway, and so won't send it back.
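Checking that a supposedly unique key really is unique can be sketched as follows. The field name customer_id and the sample records are hypothetical, invented purely for the example.

```python
from collections import Counter


def duplicated_keys(records, key="customer_id"):
    """Return the key values that appear in more than one record."""
    counts = Counter(record[key] for record in records)
    return sorted(k for k, n in counts.items() if n > 1)


# Hypothetical customer records: the key 1001 has been issued twice.
customers = [
    {"customer_id": 1001, "name": "Joe Bloggs"},
    {"customer_id": 1002, "name": "Jane Bloggs"},
    {"customer_id": 1001, "name": "Zhang San"},
]
print(duplicated_keys(customers))
```

Unlike duplicate emails, a duplicated key should not simply be dropped: the two records describe different customers, so each clash needs resolving by issuing a fresh key.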

However, these are merely examples. Spotless can remove any multiple-entry duplications from different systems, as we recognise that you cannot have quality data which contains unwanted duplications.

4. Sessions

Sessions are designed for the cleaning of any time-based session data, which can include when users log in and log out from a website or other platform, schedules of events such as TV listings, or user-generated content (UGC) entries with a time element in platforms such as crowdsourcing sites. A failure to get the data entered correctly in TV listings can result in all the shows for a particular time segment, such as a particular day on a particular channel, being displayed wrongly, making the TV listings worse than useless and actively misleading. Failure in login or UGC info might result in an inability to comply with legal requirements. For instance, if a fintech has mistakes in its time session information this may impede a fraud investigation.
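Two session errors that are easy to detect automatically are a session that ends before it starts and two sessions that overlap on the same channel or for the same user. The Python sketch below checks for both. The function name and the (label, start, end) layout are assumptions made for the example, and real session cleaning would likely cover more cases, such as gaps in a schedule.

```python
from datetime import datetime


def find_session_errors(sessions):
    """Check (label, start, end) tuples that all belong to one channel or user."""
    errors = []
    ordered = sorted(sessions, key=lambda s: s[1])  # sort by start time

    # A session must end after it starts.
    for label, start, end in ordered:
        if end <= start:
            errors.append(f"{label}: ends at or before it starts")

    # Each session must finish before the next one begins.
    for (label_a, _, end_a), (label_b, start_b, _) in zip(ordered, ordered[1:]):
        if start_b < end_a:
            errors.append(f"{label_a} overlaps {label_b}")

    return errors


listings = [
    ("News", datetime(2016, 8, 17, 18, 0), datetime(2016, 8, 17, 19, 0)),
    ("Drama", datetime(2016, 8, 17, 18, 30), datetime(2016, 8, 17, 19, 30)),
]
print(find_session_errors(listings))
```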

Spotless recognises how important accurate session data is, both because of compliance and also because a single failure here can mess up a whole set of data. We want to ensure that these commonly found session errors will never plague your business again.

5. Data validation

Data Validation is designed for data which already have a format. This is typically data which have been obtained from a 3rd party source, e.g. UGC data, scraped data or data obtained directly from another organisation. The format of the data is often incompatible with the data in one's own data warehouse or data lake. When Spotless data validation is used both the data and their format are checked and, if necessary, modified, to ensure that all your data are in the same format.

One problematic example which often requires data validation is date-and-time data, due to the many different date formats which are commonly used throughout the world. Most of us can recognise that 08:00 on 08-17-2016 is the same moment in time as 08:00 on 17-08-2016 or 17-08-16, though a software programme may not recognise these different formats and might even think the last example referred to August 1916. We'd also recognise that 8 am on 17th August 2016 is the same time-and-date as the previous examples, but most automated processes would fail to do so. Yet getting the time-and-date wrong, even once, can mess up large quantities of data, can get us in trouble with legal authorities for lack of compliance, or could delay an urgent delivery which had to arrive within 48 hours of the time it was sent out.
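The date examples above can be brought into a single format with a small Python sketch that tries a list of candidate formats in turn. The function name, the format list and its ordering are assumptions made for the example. Note that a value such as 08-07-2016 is genuinely ambiguous, and this sketch simply takes the first format that fits, so in practice the source of the data should decide which convention applies.

```python
import re
from datetime import datetime

# Candidate formats, tried in order (an assumption for this example).
CANDIDATE_FORMATS = [
    "%H:%M %m-%d-%Y",   # 08:00 08-17-2016 (US month-first)
    "%H:%M %d-%m-%Y",   # 08:00 17-08-2016 (day-first)
    "%H:%M %d-%m-%y",   # 08:00 17-08-16   (two-digit year)
    "%I %p %d %B %Y",   # 8 am 17 August 2016
]


def parse_timestamp(text):
    """Return a datetime for `text`, or None if no candidate format fits."""
    cleaned = text.strip().replace(" on ", " ")
    # Drop ordinal suffixes such as "17th" so that %d can read the day.
    cleaned = re.sub(r"(\d+)(st|nd|rd|th)\b", r"\1", cleaned)
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt)
        except ValueError:
            continue
    return None


for raw in ["08:00 on 08-17-2016", "08:00 on 17-08-2016",
            "08:00 on 17-08-16", "8 am on 17th August 2016"]:
    print(raw, "->", parse_timestamp(raw))
```

All four inputs resolve to the same moment, 08:00 on 17 August 2016. Python reads the two-digit year 16 as 2016 rather than 1916, because strptime maps the values 00 to 68 onto the years 2000 to 2068.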

Data quality API solution

To get started with the Spotless Data Quality API you can test a file in CSV or TSV format by going to the My Filters page. You will need to sign up using your Facebook, Google or GitHub account, or simply with your email address, in order to see this page. Be aware that if there are problems with your file and the filters you applied to it, an automatic flag will be sent to our data science team, who will use these contact details to get in touch with you so you can discuss what to do next. You can also view our videos on data cleaning an EPG file and data cleaning a genre column, which explain how to use our API.

To allow you to sample our cleansing using our API we are giving away 500MB worth of free data cleansing so you can see how well it works for you. We are confident you will be impressed!

We guarantee that your data will be secure and not visible or available to any 3rd parties during the period of time they are in our care, with the HTTPS protocol among our security precautions.

Here is a quick link to our FAQ. You can also check out our range of subscription packages and pricing. If you would like to contact us you can speak to one of our team by pressing on the white square icon with a smile within a blue circle, which you can find in the bottom right-hand corner of any of the web pages on our site.

If your data quality is an issue, or you know that you have sources of dirty data but your files are just too big and the problems too numerous to fix manually, please do log in and try now.