With the new, improved version 19 of Spotless Data's machine learning filters, everything works about ten times as fast, thanks to a new, easy-to-use JSON config. This makes it simple to get the data quality you need in real time through Spotless data validation, data cleaning and the integration of different datasets.
Welcome to the new, improved version 19 of Spotless Data's data validation solution, which uses our machine learning filters to filter out, modify or quarantine rogue data so that only spotless and well-integrated data can enter your data platforms. The main improvement in this upgrade is a new JSON configuration. It makes our solution to all your data issues easier to use, both for software developers and for everyone else who uses our API, and about ten times faster at the challenging work of data validation, which is always useful given the importance of having validated data in real time.
We now configure Spotless using a JSON object which we call the config. There are three basic types of data validation rule: string, number and date. A string rule uses minimum length, maximum length and regex parameters. A number rule uses minimum value, maximum value and the number of decimal places, plus a fixed decimal places option under which all the validated records must have the same number of decimal places. A date rule validates a date field against a particular format, against a range of dates, or simply as a date rather than a set of numbers which are not a date.
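As a rough illustration of how the three rule types could be expressed and checked, here is a minimal Python sketch. The key names (min_length, decimal_places, earliest and so on), the example fields and the is_valid helper are assumptions made up for this example; they are not the actual Spotless config schema or API.

```python
import json
import re
from datetime import datetime

# Hypothetical config: key names and fields are illustrative only,
# not the real Spotless schema.
CONFIG_JSON = """
{
  "postcode": {"type": "string", "min_length": 5, "max_length": 8,
               "regex": "^[A-Z0-9 ]+$"},
  "price": {"type": "number", "min_value": 0, "max_value": 1000,
            "decimal_places": 2},
  "start": {"type": "date", "format": "%Y-%m-%d",
            "earliest": "2017-01-01", "latest": "2017-12-31"}
}
"""


def is_valid(value, rule):
    """Return True if the raw string `value` satisfies `rule`."""
    if rule["type"] == "string":
        if not rule["min_length"] <= len(value) <= rule["max_length"]:
            return False
        return re.fullmatch(rule["regex"], value) is not None
    if rule["type"] == "number":
        try:
            number = float(value)
        except ValueError:
            return False
        if not rule["min_value"] <= number <= rule["max_value"]:
            return False
        # Fixed decimal places: exactly this many digits after the point.
        _, _, decimals = value.partition(".")
        return len(decimals) == rule["decimal_places"]
    if rule["type"] == "date":
        try:
            parsed = datetime.strptime(value, rule["format"])
        except ValueError:
            return False  # not a date in the expected format
        earliest = datetime.strptime(rule["earliest"], rule["format"])
        latest = datetime.strptime(rule["latest"], rule["format"])
        return earliest <= parsed <= latest
    raise ValueError(f"unknown rule type: {rule['type']}")


config = json.loads(CONFIG_JSON)
```

With this sketch, is_valid("19.99", config["price"]) passes, while "19.9" fails the fixed decimal places check and "20170615" fails the date rule because it is a set of numbers rather than a date in the expected format.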
For Spotless, the work our API does in handling invalid data is two-fold. First, we have to identify the invalid data; then we have to do something with them. We already allow the data owner to configure the settings which actually clean the data, based on the report we give them. The new fall-back option defines what is done with the data which fail validation.
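The two steps, identifying invalid data and then acting on them, can be sketched in a few lines of Python. The function name apply_fallback and the action names "remove", "replace" and "quarantine" are hypothetical labels for filtering out, modifying or quarantining rogue data; the real fall-back option may be spelled differently.

```python
def apply_fallback(records, is_valid, fallback, replacement=None):
    """Return (clean, quarantined) after applying one fall-back action.

    `fallback` is a hypothetical setting: "remove", "replace" or "quarantine".
    """
    clean, quarantined = [], []
    for record in records:
        if is_valid(record):
            clean.append(record)          # step 1: record passed validation
        elif fallback == "remove":
            continue                      # filter the rogue record out
        elif fallback == "replace":
            clean.append(replacement)     # modify it to a known-safe value
        elif fallback == "quarantine":
            quarantined.append(record)    # set it aside for later review
        else:
            raise ValueError(f"unknown fall-back: {fallback}")
    return clean, quarantined


# Usage: digits are valid, anything else is rogue.
clean, held = apply_fallback(["12", "x", "7"], str.isdigit, "quarantine")
```

Keeping the quarantined records in a separate list means nothing is lost, so the data owner can still inspect them after the clean data have moved on.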
One way of validating data is against a separate reference dataset, where the data are compared with unique identifiers, categorical datasets and primary keys, the latter in an ETL database. Spotless can also attempt the closest match, based on the reference dataset which is being used, with a default distance threshold for closest matches applied when specified.
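Closest matching against a reference dataset can be sketched with Python's standard difflib module. Note the swap: difflib works with a similarity ratio between 0 and 1 rather than a distance, so the threshold here is a minimum similarity, not a maximum distance. The function name match_reference and the 0.8 default are assumptions for this example only.

```python
from difflib import get_close_matches


def match_reference(value, reference, threshold=0.8):
    """Return the reference entry matching `value`, or None.

    An exact match wins; otherwise the closest entry is used if its
    similarity ratio is at least `threshold` (an assumed default).
    """
    if value in reference:
        return value
    candidates = get_close_matches(value, reference, n=1, cutoff=threshold)
    return candidates[0] if candidates else None


countries = ["France", "Germany", "Spain"]
corrected = match_reference("Germny", countries)   # close enough to "Germany"
unmatched = match_reference("Xyz", countries)      # nothing close enough
```

A higher threshold makes the matching stricter, which reduces the risk of replacing a value with the wrong reference entry at the cost of leaving more values unmatched.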
Duplicates are a classic rogue data problem found in many datasets. Spotless can check fields containing strings, numbers or dates for duplicates to ensure that a column only contains unique values.
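A uniqueness check on a single column can be sketched as follows. The helper name find_duplicates is hypothetical, and the sketch only reports the repeated values; what is then done with them is a separate decision.

```python
from collections import Counter


def find_duplicates(values):
    """Return {value: count} for every value that appears more than once."""
    counts = Counter(values)
    return {value: count for value, count in counts.items() if count > 1}


ids = ["A1", "B2", "A1", "C3"]
repeated = find_duplicates(ids)   # "A1" appears twice
```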
Blanks are sometimes, but not always, an indication of rogue data. At times it is better to have a blank than an inaccurate entry, which can negatively affect the whole dataset or skew any reporting or analytics based on the data. You can specify how to handle blanks, that is, whether Spotless should ignore them or not. We also have a lookup match rule, which specifies whether or not any entries which fail to validate are replaced by the most similar-looking record of those within the file.
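The interplay between blank handling and the lookup match rule can be sketched like this. The parameter names ignore_blanks and lookup_match, the status labels, and the use of difflib's default similarity cutoff are all assumptions for illustration rather than the actual Spotless settings.

```python
from difflib import get_close_matches


def clean_column(values, is_valid, ignore_blanks=True, lookup_match=False):
    """Return a list of (value, status) pairs for one column of strings."""
    # Valid, non-blank entries in the same file are the lookup candidates.
    valid_values = [v for v in values if v.strip() and is_valid(v)]
    results = []
    for value in values:
        if not value.strip():
            # Blank: either let it through or flag it, as configured.
            results.append((value, "ok" if ignore_blanks else "invalid"))
        elif is_valid(value):
            results.append((value, "ok"))
        else:
            nearest = (get_close_matches(value, valid_values, n=1)
                       if lookup_match else [])
            if nearest:
                results.append((nearest[0], "replaced"))
            else:
                results.append((value, "invalid"))
    return results


cities = ["London", "", "Lndon", "Paris"]
known = {"London", "Paris"}
result = clean_column(cities, lambda v: v in known, lookup_match=True)
```

Here the blank entry is left alone, while the misspelt "Lndon" is replaced by "London", the most similar valid record in the same column.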
With the rise of the Internet of Things, session data are becoming much more common. Typically, working with session data involves fixing gaps (where A finishes at 12:14:39 but B does not start until 12:15:02) and overlaps (where C does not finish until 14:05:09 but D starts at 14:02:37). Session rules are also used in EPGs and anywhere else where the start and finish times of consecutive sessions need to match. Our session data rule fills in the gaps and removes overlapping sessions.
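One simple policy for repairing session data is sketched below: sort the sessions by start time, then make each session end exactly when the next one starts. This extends the earlier session across a gap and trims it where it overlaps the next one. It is only one possible policy, chosen for illustration; the function names and the tuple layout are assumptions, not the Spotless session rule itself.

```python
from datetime import datetime


def parse(clock):
    """Parse an HH:MM:SS clock time (the date part is a dummy default)."""
    return datetime.strptime(clock, "%H:%M:%S")


def fix_sessions(sessions):
    """sessions: list of (name, start, end) tuples with datetime values.

    Returns the sessions in start order with each end snapped to the
    following session's start, closing gaps and removing overlaps.
    """
    ordered = sorted(sessions, key=lambda session: session[1])
    fixed = []
    for index, (name, start, end) in enumerate(ordered):
        if index + 1 < len(ordered):
            next_start = ordered[index + 1][1]
            if end != next_start:
                # end < next_start is a gap, end > next_start an overlap.
                end = next_start
        fixed.append((name, start, end))
    return fixed


gap = fix_sessions([
    ("B", parse("12:15:02"), parse("12:30:00")),
    ("A", parse("12:00:00"), parse("12:14:39")),
])
overlap = fix_sessions([
    ("C", parse("14:00:00"), parse("14:05:09")),
    ("D", parse("14:02:37"), parse("14:20:00")),
])
```

In the first call A is extended to finish at 12:15:02, when B starts; in the second, C is trimmed to finish at 14:02:37, when D starts. The last session in the list is left unchanged because nothing follows it.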
Thanks to the Spotless team for all their hard work in making this new update into a reality!
You can read our introduction to using our API to validate your data. You can take advantage of our offer of 500 MB of free data cleaning to see how much you like our service. If you haven't already done so, you can sign up using your email address or your Facebook, Google or GitHub account. You may also view our video on cleaning the data in an EPG file, which also explains how to use our API.
We use the HTTPS protocol, guaranteeing that your data are secure while they are in our care and that they are not accessible by any third party, a responsibility we take very seriously.
Here is a quick link to our FAQ. You can also check out our range of subscription packages and pricing. If you would like to contact us, you can speak to one of our team by pressing on the white square icon with a smile within a blue circle, which you can find in the bottom right-hand corner of any of the web pages on our site.
If your data quality is an issue, or you have known sources of dirty data but your files are just too big and the problems too numerous to fix manually, please do log in and try Spotless now.