Spotless version 11 now live


Spotless Data version 11 introduces our new machine learning filters for guaranteeing data quality.

Spotless are delighted to announce the release of version 11 of our data quality solution, which cleans your dirty and mismatched data at the speed of business. This release introduces the concept of machine learning filters, applying the latest artificial intelligence technology to our unique approach to data cleaning.

Why we are changing our approach

At Spotless we have a highly experienced data science team with outstanding knowledge of data quality and of the problems which dirty data cause. Over the last couple of years the team has developed a complex system of records, jobs and plans, as well as five issues, to ensure that your organisation's data is clean and trustworthy. This is why we can proudly state that Spotless is the API that guarantees data quality in your workflow.

However, we recognise that not everybody who has to deal with the problems of cleaning dirty data shares our knowledge and experience. Indeed, until recently most companies which dealt with data were either IT companies or were so large that they had their own data science department, although not all large companies have such a department.

With the sheer explosion of data over the last few years, many companies, including SMEs, are realising that they have a lot of data and need to exploit these data to their maximum potential in order to remain competitive and to ensure growth and success. They simply do not have the expertise in place to do this, and in many cases they have neither the finances nor the time necessary to build their own data science team from scratch. The Spotless data quality API solution for dirty data is perfect for these businesses, as we do the data cleaning for them.

However, we have received feedback that some businesses with little or no experience of data science are finding it difficult to understand and correctly use the Spotless API solution. These businesses are unable to employ data scientists, partly because of cost and partly because demand for data scientists far exceeds the number available. We have therefore decided to simplify the process of using our service so that you do not need a CDO in order to use it. We are updating our service to incorporate machine learning filters so that anyone, whatever their background, can now use our solution.

Spotless filters

Our new version 11 release, which is still in beta, means that ensuring your organisation has quality data has never been simpler. If you go to the my filters page (you will need to be logged in first) you can easily upload a file in CSV format. Spotless will then automatically analyse your file for data quality problems without actually changing anything. Instead, you get an instant report in our API listing the variables which you then need to set, and which tell us how to modify the file in order to clean your data according to your specifications.

Let us imagine that the data you have submitted is for an EPG for television broadcasts in the coming days. The report on your data will then allow you to set the values for the items which define the columns in your file. Typically these items will appear at the top of the columns they relate to. EPGs are a good example because often the data you have comes from multiple sources, a classic reason for having corruptions in your data in a non-standardised world.

The top of our report will look like this for any file submitted:

Column name             Type               Blanks             Action

The column name will contain the item such as (in this imagined case) the name of the broadcaster, the date the show will appear on air, the title of the show, the title of the episode (e.g. What They Died For is an episode of Lost) and all the other variables in the data which you require in order to use the file to generate your EPG, show by show.

The next column, type, defines the type of filter which is to be applied. The standard filter is string, but there are two others. Date is used for any dates which appear in your file, typically the date and time when a show is broadcast in this particular example. Number can be used for any set of numbers such as years, which might be the year in which a TV show or a film was first made available to the general public.
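As a rough illustration of how a column might be assigned one of the three filter types, here is a minimal Python sketch. It is not how Spotless itself analyses a file; the function name `infer_type` and the default date format are assumptions made for this example only.

```python
from datetime import datetime


def infer_type(values, date_format="%Y-%m-%d %H:%M:%S"):
    """Guess 'number', 'date' or 'string' from a column's non-blank values.

    Hypothetical helper for illustration; not part of the Spotless API.
    """
    non_blank = [v.strip() for v in values if v.strip()]
    if not non_blank:
        return "string"  # nothing to go on, fall back to the standard filter

    def is_number(text):
        try:
            float(text)
            return True
        except ValueError:
            return False

    def is_date(text):
        try:
            datetime.strptime(text, date_format)
            return True
        except ValueError:
            return False

    if all(is_number(v) for v in non_blank):
        return "number"
    if all(is_date(v) for v in non_blank):
        return "date"
    return "string"


year_column = ["1994", "2004", ""]
air_date_column = ["2017-06-30 07:06:05", "2017-07-01 21:00:00"]
broadcaster_column = ["Channel 4", "BBC One"]

print(infer_type(year_column))
print(infer_type(air_date_column))
print(infer_type(broadcaster_column))
```

Blank cells are ignored when guessing the type, since blanks are handled separately in the report.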

If you open the plus button to the left in our report, you can define variables such as the length of the string for yourself. If you already know that no broadcaster name is shorter than five characters (including blank spaces) or longer than 17 characters, you can define this using a simple drop-down box so that Spotless can automatically spot any broadcaster names which are shorter or longer than the string length you have defined. You can also run a regular expression check against the string, which is particularly useful for known inaccurate strings, such as C4 used instead of Channel 4 for the UK broadcaster. You do not want your viewers to think that C4 and Channel 4 are two different channels when they view your EPG. It probably doesn't matter which of these two names you choose as long as you are consistent, but data from multiple sources tends to lack consistency, which creates dirty data problems.

Similarly, if there is a language column and you want to flag those TV shows which are not in English, you can specify that the string for this column must contain exactly seven characters, and use the regular expression check to ensure that every value begins with the letters En. Together these checks spot content in, say, Spanish or Russian, and also spot where the word English has been shortened to En (a string with two characters).
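The length and regular expression checks described above can be sketched in a few lines of Python. This is only an illustration using the standard `re` module; `check_string` and its parameters are hypothetical names, not the Spotless API.

```python
import re


def check_string(value, min_len=None, max_len=None, pattern=None):
    """Return a list of problems found in one string value (empty if none).

    Hypothetical helper for illustration; not part of the Spotless API.
    """
    problems = []
    if min_len is not None and len(value) < min_len:
        problems.append("too short")
    if max_len is not None and len(value) > max_len:
        problems.append("too long")
    # fullmatch requires the whole value to match the pattern
    if pattern is not None and re.fullmatch(pattern, value) is None:
        problems.append("pattern mismatch")
    return problems


# Broadcaster names must be between 5 and 17 characters long.
broadcasters = ["Channel 4", "C4", "BBC One"]
flagged = {name: check_string(name, min_len=5, max_len=17) for name in broadcasters}

# Language values must be exactly 7 characters and begin with "En".
languages = ["English", "En", "Spanish", "Russian"]
lang_flagged = [lang for lang in languages if check_string(lang, 7, 7, r"En.*")]

print(flagged)
print(lang_flagged)
```

Here C4 is caught by the length check alone, while Spanish and Russian pass the length check and are caught only by the regular expression.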

With date formats, you can define the format you want for dates. So if you want the format to be

2017-06-30 07:06:05

you can define this with

%Y-%m-%d %H:%M:%S

whereas if you prefer to put the day before the month you can define it as

%Y-%d-%m %H:%M:%S

so that a date would look like this

2017-30-06 07:06:05

If you prefer the time before the date you can use the format

%H:%M:%S %Y-%d-%m

and it will appear as

07:06:05 2017-30-06
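The format codes above are the same ones used by the C `strftime` family, so they can be tried out locally before being set in a filter. The sketch below uses Python's standard `datetime` module purely as an illustration; the variable names are chosen for this example.

```python
from datetime import datetime

# The example moment used above: 30 June 2017, 07:06:05
moment = datetime(2017, 6, 30, 7, 6, 5)

formats = [
    "%Y-%m-%d %H:%M:%S",  # year-month-day, then time
    "%Y-%d-%m %H:%M:%S",  # year-day-month, then time
    "%H:%M:%S %Y-%d-%m",  # time first, then year-day-month
]

rendered = [moment.strftime(fmt) for fmt in formats]
for fmt, text in zip(formats, rendered):
    print(fmt, "->", text)

# The same format string also parses the text back into a datetime.
parsed = datetime.strptime("2017-30-06 07:06:05", "%Y-%d-%m %H:%M:%S")
print(parsed == moment)
```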

With dates we can also check against a fixed or rolling range of dates. For instance, if you know that all the dates in your file need to be between July 20th and July 30th, 2017, as in an EPG displaying TV shows in the coming days of this month, you can define the start and end dates within which every entry in the date and time column must fall. An entry for July 2016, caused by a mistake at the source of this particular TV show information, would then have its year replaced with 2017.
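A date range check of this kind might look like the following Python sketch. The repair rule shown, retrying with the year of the window's start date, is an assumption made to mirror the July 2016 example; `check_date` is a hypothetical name and the actual Spotless behaviour may differ.

```python
from datetime import datetime


def check_date(value, start, end, date_format="%Y-%m-%d %H:%M:%S"):
    """Return (status, datetime) where status is 'ok', 'year_corrected'
    or 'out_of_range'. Hypothetical helper; not part of the Spotless API."""
    parsed = datetime.strptime(value, date_format)
    if start <= parsed <= end:
        return "ok", parsed
    # Assumed repair: keep month, day and time, but use the window's year.
    try:
        candidate = parsed.replace(year=start.year)
    except ValueError:  # e.g. 29 February moved into a non-leap year
        return "out_of_range", parsed
    if start <= candidate <= end:
        return "year_corrected", candidate
    return "out_of_range", parsed


window_start = datetime(2017, 7, 20)
window_end = datetime(2017, 7, 30, 23, 59, 59)

print(check_date("2017-07-25 20:00:00", window_start, window_end))
print(check_date("2016-07-25 20:00:00", window_start, window_end))
print(check_date("2017-08-02 20:00:00", window_start, window_end))
```

For a rolling range, the window could be computed from the current date instead, for example `datetime.now()` plus a `timedelta` of ten days.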

With numbers, we allow you to define the range within which the numbers in your file must fall. If you know that all the numbers in a column must be between 1980 and 2017, then we can quarantine any numbers that fall outside this range.
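As a minimal sketch of the range check, the following Python separates in-range values from those to be quarantined. The function name `split_by_range` and the sample years are invented for illustration.

```python
def split_by_range(values, low, high):
    """Split numbers into (kept, quarantined) using an inclusive range.

    Hypothetical helper for illustration; not part of the Spotless API.
    """
    kept, quarantined = [], []
    for value in values:
        if low <= value <= high:
            kept.append(value)
        else:
            quarantined.append(value)
    return kept, quarantined


years = [1994, 2004, 1899, 2017, 2071]
kept, quarantined = split_by_range(years, 1980, 2017)
print(kept)
print(quarantined)
```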

The third column in our report is blanks (see data number validation). We give you two options here: either to process the blanks or to allow the blanks. If you feel confident that the blank cells in your file are fine and do not need our attention, then you should set this value to allow the blanks. If blanks are in fact problematic and can skew the data in your file, then you should let Spotless process the blanks. For instance, if you have items for episode title and episode number but some of your TV shows, such as the news, have neither of these, then you can specify that these blanks be left alone. However, if you know that these same shows must have a broadcast time, then no blanks are acceptable in that column and any blanks there need processing.
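The allow or process choice can be thought of as a per-column policy. The Python sketch below finds the blank cells that would need attention under such a policy; the column names, the policy dictionary and `blanks_to_process` are all hypothetical.

```python
def blanks_to_process(rows, blank_policy):
    """Return (row_index, column) pairs for blank cells in 'process' columns.

    Hypothetical helper for illustration; not part of the Spotless API.
    """
    found = []
    for index, row in enumerate(rows):
        for column, policy in blank_policy.items():
            is_blank = not (row.get(column) or "").strip()
            if policy == "process" and is_blank:
                found.append((index, column))
    return found


rows = [
    {"title": "Lost", "episode_title": "What They Died For",
     "broadcast_time": "2017-07-21 21:00:00"},
    {"title": "News", "episode_title": "",
     "broadcast_time": "2017-07-21 22:00:00"},
    {"title": "News", "episode_title": "", "broadcast_time": ""},
]

# Blank episode titles are acceptable; blank broadcast times are not.
policy = {"episode_title": "allow", "broadcast_time": "process"}

problem_cells = blanks_to_process(rows, policy)
print(problem_cells)
```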

The fourth and final column in our report is entitled action. Here you have three options. The first is to set a default value, which you can define for yourself. For instance, if one of the columns in your submitted file is for genre and contains many different genres, but you actually want all the TV shows to have the same genre, entertainment, then the default value allows you to do this, setting whatever genre description you want for all the TV shows appearing in the file. The other two options are Notify Only, where we essentially flag the issue, and Quarantine, where we act on the problem.
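To make the three actions concrete, here is a hedged Python sketch. It assumes that quarantine means setting the offending row aside from the cleaned output, which is an interpretation for this example rather than a statement of how Spotless works; `apply_action` and its arguments are hypothetical names.

```python
def apply_action(rows, column, is_bad, action, default=None):
    """Apply one action to every row whose `column` value fails a check.

    Returns (cleaned_rows, quarantined_rows, notified_row_indexes).
    Hypothetical helper for illustration; not part of the Spotless API.
    """
    cleaned, quarantined, notices = [], [], []
    for index, row in enumerate(rows):
        if not is_bad(row[column]):
            cleaned.append(row)
        elif action == "default":
            cleaned.append({**row, column: default})  # overwrite the value
        elif action == "notify":
            cleaned.append(row)       # leave the row unchanged
            notices.append(index)     # but record where the issue was
        elif action == "quarantine":
            quarantined.append(row)   # assumed meaning: set the row aside
        else:
            raise ValueError(f"unknown action: {action}")
    return cleaned, quarantined, notices


shows = [
    {"title": "Lost", "genre": "drama"},
    {"title": "Pointless", "genre": "entertainment"},
]


def not_entertainment(genre):
    return genre != "entertainment"


defaulted = apply_action(shows, "genre", not_entertainment, "default", "entertainment")
notified = apply_action(shows, "genre", not_entertainment, "notify")
set_aside = apply_action(shows, "genre", not_entertainment, "quarantine")
print(defaulted[0])
print(notified[2])
print(set_aside[1])
```

The original rows are never modified in place, which matches the idea that the uploaded file stays unchanged.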

You can then apply these filters to the still unchanged file you have submitted by pressing run. We apply the filters and produce a new file, which is automatically downloaded onto your computer from our API under a different name from the file you uploaded, so you don't lose the original. You will also receive a report, both directly in the API and as an email (assuming you have signed up using your email address), telling you exactly how we have modified your file in order to clean it.

If your data quality is an issue, or you know that you have sources of dirty data but your files are just too big and the problems too numerous to fix manually, you can sign up to Spotless Data using your email address or your Facebook, Google or GitHub account. You can also view our videos on data cleaning an EPG file and data cleaning a genre column, which explain how to use our API. If you would like to contact us, you can speak to one of our team by pressing the white square icon with a smile within a blue circle, which you can find in the bottom right-hand corner of any page on our site.

Here is a quick link to our FAQ. You can also check out our range of subscription packages and pricing.

If your data quality is an issue, or your files are too big and the problems too numerous to fix manually, please do log in and try now.