Saying good-bye to rogue data

A quality check managed from a keyboard to ensure data validation

Spotless is a web-based API that filters data coming into your systems so rogue data can never get into your data platforms.

What is rogue data?

Also known as dirty data, rogue data are any corrupted, mismatched or inaccurate data. If they get into your data platforms they will affect either your final product, which your customers use or view, or the data analysis and business intelligence that your executives and teams rely on to keep a firm grasp of what is happening in your company.

There are many causes and types of rogue data. Here are a few examples:

1. Using a variety of formats which are inconsistent

Let us imagine the goal of your data is to produce an electronic programme guide (EPG) so your customers can decide what they want to watch on TV. You construct the EPG by gathering together the TV listings data released by the different broadcasters which operate in your country. However, the broadcasters may use different names for the same TV show. This may be as simple as one company calling an episode of Breaking Bad "Breaking Bad S01 E02" while another uses just the name of the show. Or variations may be caused by the use of ampersands within TV titles, so while one broadcaster calls the new detective show "Rizzoli & Isles", another, perhaps because of problems dealing with ampersands in its database, calls it "Rizzoli and Isles".

These naming issues are by no means restricted to TV show databases. Acronyms are another common source of trouble, such as failing to realise that Amazon Web Services and AWS both refer to the same service offered by Amazon. The AWS acronym can itself have multiple meanings, also referring to automatic warning systems or the US Advanced Wireless Services, among others.
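As a rough illustration of the naming problem, the sketch below reduces each title to a comparison key so that the variants described above match one another. The function name, the episode-tag pattern and the rule of treating "&" as "and" are assumptions made for this example; they are not a description of how Spotless itself works.

```python
import re

# Hypothetical pattern for a trailing season/episode tag such as "S01 E02".
EPISODE_TAG = re.compile(r"\s+S\d{1,2}\s*E\d{1,2}$", re.IGNORECASE)


def normalise_title(raw: str) -> str:
    """Return a key that is the same for different spellings of one show title."""
    title = raw.strip()
    title = EPISODE_TAG.sub("", title)   # drop a trailing "S01 E02" style tag
    title = title.replace("&", " and ")  # treat an ampersand as the word "and"
    title = re.sub(r"\s+", " ", title)   # collapse any repeated whitespace
    return title.casefold()              # compare without regard to case


# Both pairs of variants from the text now share a key.
same_show = normalise_title("Rizzoli & Isles") == normalise_title("Rizzoli and Isles")
same_series = normalise_title("Breaking Bad S01 E02") == normalise_title("Breaking Bad")
```

A key like this is only for matching records. The original title should be kept alongside it, since the episode tag may be information you want to retain.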

2. Blanks and overlaps

Blanks can occur in any dataset and are a major source of rogue data problems. For instance, suppose there are two columns of data which need to match each other, such as an item and its price. If a blank in the price column is removed, all the prices below it move up one row, so every price from that point on is matched to the wrong item. However, leaving the space blank is not satisfactory either, especially if the missing price belongs to one of your most popular products, as this could result in a catastrophic drop in the sales of one of your leading lines.
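One way to avoid the shifting problem is to keep each item and its price together as a single row and to report blanks rather than delete them. The sketch below assumes rows held as (item, price) pairs with prices as text; the function name and the sample products are made up for illustration.

```python
from typing import List, Optional, Tuple


def find_blank_prices(rows: List[Tuple[str, Optional[str]]]) -> List[str]:
    """Return the items whose price is missing or contains only whitespace."""
    return [item for item, price in rows if price is None or not price.strip()]


# Hypothetical sample data: the toaster has no price.
rows = [("kettle", "19.99"), ("toaster", ""), ("blender", "34.50")]
blank_items = find_blank_prices(rows)  # ["toaster"]
```

Because the rows are never split into separate columns, the blender keeps its own price even though the toaster's price is missing.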

If we return to our EPG example, typically each TV show will have a start and a stop time. A classic rogue data example here would be if The News starts at 6 pm and stops at 6.30 pm but the next show does not start until 6.45 pm. Or a longer news show stops at 7 pm but the next comedy show apparently begins at 6.45 pm. If a customer starts seeing these anomalies in the EPG they will simply look for another EPG and may never use yours again. So these examples show how getting rid of blanks and overlapping rogue data before they can enter your data platforms is an absolutely critical issue for the health of your entire business.
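The gap and overlap checks described above can be sketched as a comparison of each show with the one that follows it. The `Listing` type and the function name are assumptions for this example, and the sketch only compares neighbouring shows on a single channel.

```python
from datetime import datetime
from typing import List, NamedTuple, Tuple


class Listing(NamedTuple):
    title: str
    start: datetime
    stop: datetime


def find_schedule_anomalies(listings: List[Listing]) -> List[Tuple[str, str, str]]:
    """Return (kind, earlier title, later title) for each gap or overlap found."""
    ordered = sorted(listings, key=lambda item: item.start)
    anomalies = []
    for earlier, later in zip(ordered, ordered[1:]):
        if later.start > earlier.stop:
            # Nothing is scheduled between the two shows.
            anomalies.append(("gap", earlier.title, later.title))
        elif later.start < earlier.stop:
            # The later show starts before the earlier one has finished.
            anomalies.append(("overlap", earlier.title, later.title))
    return anomalies


# The first example from the text: The News ends at 6.30 pm, the next show starts at 6.45 pm.
evening = [
    Listing("The News", datetime(2017, 9, 11, 18, 0), datetime(2017, 9, 11, 18, 30)),
    Listing("Comedy Show", datetime(2017, 9, 11, 18, 45), datetime(2017, 9, 11, 19, 15)),
]
problems = find_schedule_anomalies(evening)  # [("gap", "The News", "Comedy Show")]
```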

3. Duplications

Duplications are another rogue data problem that can negatively affect a business. A simple example comes from a data-driven marketing campaign. Suppose your business emails potential and existing customers a discount offer which can only be claimed with a unique code contained in the email. You only want each customer to be able to claim the discount once, so it is not a great idea to send the same customer three emails, each with a different discount code. If a customer claims the discount three times this will also skew the data on how many people have taken up the offer and, if it occurred on a large scale, might make the offer seem more successful than it actually is.

More generally, if a customer receives two copies of the same email, far from being twice as impressed they may simply mark the second email as spam, meaning that no further emails from you will reach their inbox. That is a deeply counter-productive outcome from your point of view, but nobody can blame them.

On the other hand, if you assume that two identical postal addresses are duplicates but they actually belong to a husband and wife (who should presumably be treated as separate customers) or to two students sharing a house, you will also lose customers.
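A minimal sketch of this distinction is to remove duplicates using the email address and to leave the postal address out of the comparison. The record layout and the function name are assumptions for illustration, and treating email addresses as case-insensitive is a simplification that suits most mailing lists but is not guaranteed for every mail provider.

```python
from typing import Dict, Iterable, List


def dedupe_by_email(customers: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep the first record for each email address, ignoring case and stray spaces."""
    seen = set()
    unique = []
    for customer in customers:
        key = customer["email"].strip().casefold()
        if key in seen:
            continue  # the same mailbox has already been kept once
        seen.add(key)
        unique.append(customer)
    return unique


# Hypothetical mailing list: Ann appears twice, and Bob shares her postal address.
mailing_list = [
    {"name": "Ann", "email": "ann@example.com", "address": "1 High Street"},
    {"name": "Ann", "email": "ANN@example.com ", "address": "1 High Street"},
    {"name": "Bob", "email": "bob@example.com", "address": "1 High Street"},
]
recipients = dedupe_by_email(mailing_list)
```

Here the repeated record for Ann is dropped, while Bob is kept as a separate customer even though he lives at the same address.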

4. Number values

Numbers are extremely important within data and when they go wrong this is a classic sign of rogue data. Returning to our EPG example, if we know that all the values within a column entitled broadcast date and time need to be between 00:00 on 11-09-2017 and 23:59 on 18-09-2017 then any values outside this range are rogue data. If the companies providing the data simply use a different date format such as 09-18-2017 (ie American date format), 18th September 2017 or even something as simple as 18-9-2017 or 18/09/2017 then, while these examples are actually correct, they might appear incorrect to automated software which believes that a correct date needs to contain eight digits in groups of two, two and four, each divided by a "-" symbol. Sometimes a genuine error occurs, perhaps because the numbers have, somewhere down the line, been entered manually and the employee entering the figures made a mistake such as entering 12-09-2016. This might be a particularly common issue in January, when people often take a few days to adjust to the reality of a new year.
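The difference between a date in an unexpected format and a date that is genuinely wrong can be sketched as follows. The window comes from the example above; the list of accepted formats, the function name and the three status labels are assumptions for this sketch.

```python
from datetime import datetime

# The broadcast window used in the example above.
WINDOW_START = datetime(2017, 9, 11, 0, 0)
WINDOW_END = datetime(2017, 9, 18, 23, 59)

# Day-first formats this sketch chooses to accept (an assumption).
KNOWN_FORMATS = ("%d-%m-%Y", "%d/%m/%Y")


def classify_date(raw: str) -> str:
    """Return "ok", "out_of_range" or "unrecognised_format" for a date string."""
    for fmt in KNOWN_FORMATS:
        try:
            parsed = datetime.strptime(raw.strip(), fmt)
        except ValueError:
            continue  # try the next accepted format
        if WINDOW_START <= parsed <= WINDOW_END:
            return "ok"
        return "out_of_range"
    return "unrecognised_format"


status_padded = classify_date("18-09-2017")    # "ok"
status_slashes = classify_date("18/09/2017")   # "ok"
status_american = classify_date("09-18-2017")  # "unrecognised_format"
status_typo = classify_date("12-09-2016")      # "out_of_range"
```

Reporting the two cases separately lets a person decide whether a supplier has changed format or someone has made a typing mistake. Note that an American-style date whose day is 12 or lower, such as 09-12-2017, would be read here as 9 December and reported as out of range, so the supplier's format still needs to be known in advance.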

The Spotless data quality API solution

The above examples by no means cover every kind of rogue data, but they show that keeping dirty and inaccurate data out of your data platform, so that you have data quality you can trust, is no easy business. One great advantage of using our API rather than a cruder software application which you download onto your PC or another device is that Spotless is a machine learning API which figures out automatically where rogue data are a problem and then immediately sends you a report without modifying your data in any way. You then check the report and set the values which will be used to modify and clean the data.

The first time you set these values you may make a mistake and not be satisfied with the final product. To let you experiment, and to let you see how good our API actually is, we offer 500 MB of free data cleaning to each new customer (see our pricing structure). You first need to sign up to our service using your email address or your Facebook or GitHub account. Then go to the my filters page, upload your file and follow the clear, easy-to-use instructions. You will need to set the delimiter yourself, as Spotless cannot automatically tell the difference between comma separated, tab separated and pipe separated files. You can also view our videos on data cleaning an EPG file and data cleaning a genre column, which explain how to use our API.
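Since the delimiter has to be set by hand, it can help to check which one a file uses before uploading it. The sketch below is a simple local check and is not part of the Spotless service. It assumes that no field contains the delimiter character inside quotes, and the function name is made up for this example.

```python
from typing import Iterable, Optional

# The three delimiters mentioned in the text.
CANDIDATES = {",": "comma", "\t": "tab", "|": "pipe"}


def guess_delimiter(sample_lines: Iterable[str]) -> Optional[str]:
    """Name the candidate that appears the same non-zero number of times on every line."""
    lines = [line for line in sample_lines if line.strip()]
    best_name, best_count = None, 0
    for char, name in CANDIDATES.items():
        counts = {line.count(char) for line in lines}
        if len(counts) != 1:
            continue  # no lines at all, or the count differs between lines
        count = counts.pop()
        if count > best_count:
            best_name, best_count = name, count
    return best_name


# Hypothetical first lines of an EPG file.
sample = ["title|start|stop", "The News|18:00|18:30"]
delimiter_name = guess_delimiter(sample)  # "pipe"
```

If the function returns None, the sample is ambiguous and the file is best opened and checked by eye.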

The great thing about our machine learning filters is that they learn as they go along, which is particularly useful for our regular customers as they will tend to come back time and again with the same data quality issues, eg submitting new EPG data week after week. The data vary but the rogue data issues tend to do so much less. Thus if you want to say good-bye to rogue data, and hello to seamless data integration, come and give our service a try! If you would like to contact us you can speak to one of our team by pressing on the white square icon with a smile within a blue circle, which you can find in the bottom right-hand corner of any of the web pages on our site.

If data quality is an issue for you, or you know that you have sources of dirty data but your files are too big and the problems too numerous to fix manually, please do log in and try now.