Making Data Spotless Blog

Data integration and Natural Key Validation

When you’re integrating multiple datasets together you need to be able to match them together using a specific field, either a primary key, or more frequently what we call a natural key. Unfortunately it’s quite common for such natural keys to not match quite as often as we want, or for database primary keys to mismatch between different systems.

At Spotless, one of our most important type of validation rules is the reference rule which can be used to validate that one field exists in a separate file. For example, if you’ve got a list of records from one system with a unique keys like CPACK222555, PACK32256263, CPACK 2852080, etc… you can validate that each of these keys exists in a master reference list.

So what do you do if you find a key that doesn’t exist? The best option in this case would be to exclude the mismatching record from the file before you ingest the data and then to escalate. With Spotless you get an exception report every time a job is completed so you can easily report the problems to the owner of the system providing the bad data and get them to fix it.

If the matching is not based on a primary key but based on a natural key, for example if you are matching user accounts based on addresses, if an address doesn’t match you may want to compare it against it’s closest matches to see if the data integration failure is caused by a typo rather than actual invalid data. In this case, you can see the Spotless rule to use the “closest match” fallback mode and let the algorithm take care of the rest.

If you've any questions about Spotless Data or would like to speak with our team, please email team@spotlessdata.com