Eliminating rogue data from numbers is one of the first steps towards ensuring trustworthy data validation
Validating numbers is always going to be important in data cleaning if you want data quality you can trust. The Spotless Data Science team has therefore been working to ensure you can clean any list of numbers which may have errors in it. This type of dirty data can, with a single error, mess up an entire dataset.
Suppose a list contains several thousand entries, each with two separate columns, and for some reason the second column in one of the entries is blank due to an inaccuracy in the data. The result may be that every item listed after this entry is inaccurate as well, because the first and second columns no longer line up.
To give a simple example, consider transactional data which describes the latest purchases of products and the customers who bought them. Each customer is assigned a unique Natural Key in column 1 to ensure that two customers with the same name and/or address are not confused with each other, while column 2 holds the product purchased, including the number of items of that product. If a customer purchases two types of goods, as is the case with the third customer in the example below, two purchases are recorded. The transactional data might look like this:
Col 1 Col 2
pfjig309747 2 air pumps
wevfy490892 1 RIB Inflatable boat
aghrt401563 4 lifejackets
aghrt401563 1 spare fuel can
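As a rough illustration of validating the numbers in rows like these, here is a minimal Python sketch. It is not the Spotless API. The row pattern (a key of five lowercase letters and six digits, then a quantity, then a product name) is only an assumption inferred from the sample rows above, and the names ROW_PATTERN and parse_row are hypothetical.

```python
import re

# Hypothetical row pattern, inferred from the sample rows only:
# a key of 5 lowercase letters + 6 digits, a quantity, then the product name.
ROW_PATTERN = re.compile(r"^([a-z]{5}\d{6})\s+(\d+)\s+(.+)$")


def parse_row(line):
    """Return (customer_key, quantity, product), or None if the row is malformed."""
    match = ROW_PATTERN.match(line.strip())
    if match is None:
        return None  # blank purchase, missing quantity, or unexpected key format
    key, quantity, product = match.groups()
    quantity = int(quantity)
    if quantity < 1:
        return None  # a purchase of zero items is treated as dirty data
    return key, quantity, product


rows = [
    "pfjig309747 2 air pumps",
    "wevfy490892 1 RIB Inflatable boat",
    "aghrt401563 4 lifejackets",
    "aghrt401563 1 spare fuel can",
]
parsed = [parse_row(row) for row in rows]
```

A row whose second column is blank, such as "pfjig309747" on its own, makes parse_row return None, so it can be flagged for review instead of being passed on silently.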
However, if for some reason the first customer's purchase appears as a blank which is not recognized by the software reading the data, every purchase after it shifts up one row and is matched to the wrong customer, so the first three entries would read like this:
Col 1 Col 2
pfjig309747 1 RIB Inflatable boat
wevfy490892 4 lifejackets
aghrt401563 1 spare fuel can
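The shift can be reproduced in a few lines of Python. This is only a sketch under the assumption that the two columns are read as separate lists and that the reading software silently drops blank values. The function names naive_pair and checked_pair are hypothetical and are not part of any Spotless product.

```python
keys = ["pfjig309747", "wevfy490892", "aghrt401563", "aghrt401563"]
# The first customer's purchase has been lost and arrives as a blank.
purchases = ["", "1 RIB Inflatable boat", "4 lifejackets", "1 spare fuel can"]


def naive_pair(keys, purchases):
    """Drop blanks, then pair by position: every later purchase shifts up a row."""
    non_blank = [p for p in purchases if p.strip()]
    return list(zip(keys, non_blank))


def checked_pair(keys, purchases):
    """Refuse to pair the columns if any purchase is blank or the lengths differ."""
    blank_rows = [i for i, p in enumerate(purchases) if not p.strip()]
    if blank_rows or len(keys) != len(purchases):
        raise ValueError(f"cannot pair columns safely; blank rows: {blank_rows}")
    return list(zip(keys, purchases))


mismatched = naive_pair(keys, purchases)
```

Here mismatched pairs pfjig309747 with the inflatable boat, exactly as in the table above, while checked_pair raises an error on the same input so that the problem surfaces before any record is written.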
The second customer's purchase is by far the most expensive and, let us say, comes with a 2-year guarantee. When this customer calls the company, for perfectly legitimate reasons, to demand that it honour the inflatable boat guarantee, the employee checks the records. According to the inaccurate records, however, this customer supposedly bought 4 lifejackets, while the first customer, who actually bought 2 air pumps, is recorded as the one able to claim under the inflatable boat guarantee.
This is a disaster all round and bound to generate low levels of customer satisfaction, especially given that this one mistake can affect thousands of records and hence thousands of customers. These same dirty records may also be used to replace stock, so before it knows it the company has run out of air pumps because it thought it had 50 more in stock than it actually did.
We at Spotless understand how important getting numbers right is, so that you can trust your data quality and avoid the kind of mistakes described above.
To get started, see our introduction to using our API. You can also view our videos on data cleaning an EPG file and data cleaning a genre column, which explain how to use our API. You can sign up for Spotless Data using your email address, Facebook, Google or GitHub accounts.
We're giving away 500MB of free data cleaning for you to test out our service and see for yourself how well it actually works. We guarantee that your data will be secure and not available to any third parties while it is in our care. If issues with the data do arise, an automated flag will alert our data science team, who will then review the issue manually and, when necessary, contact you via your log-in details to discuss the problem.
Here is a quick link to our FAQ. You can also check out our range of subscription packages and pricing.
If your data quality is an issue, or you know of sources of dirty data but your files are too big and the problems too numerous to fix manually, please do log in and try now.