Validating against a separate dataset
Some fields can be validated with reference to a known dataset. This includes unique identifiers, primary keys in database ETL, and categorical datasets.
To do this, pass the additional reference dataframe(s) to FileProcessor.run() in the df_dict parameter and define a rule with rule_type="Lookup".
The lookup rules have the following parameters:
- original_reference - specifies the entry in the df_dict dictionary containing the reference dataframe
- reference_field - the name or number of the column in the reference DataFrame that contains the values that this field should be validated against.
- attempt_closest_match - specifies whether entries that do not validate should be replaced with the value of the closest matching record in the reference DataFrame. If no sufficiently close match is found, as determined by string_distance_threshold, the fallback_mode is applied instead. Defaults to True.
- string_distance_threshold - the distance threshold used to decide whether a closest match is applied. The distance is a variant of the Jaro-Winkler distance. Defaults to 0.7.
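
The interaction between attempt_closest_match, string_distance_threshold and the fallback can be illustrated with a minimal, self-contained sketch. This is not the library's implementation: the function names closest_match and lookup_validate are hypothetical, the reference values are made up, and the standard library's difflib.SequenceMatcher ratio is used as a stand-in for the Jaro-Winkler variant, so the scores will differ from those the library computes.

```python
from difflib import SequenceMatcher


def closest_match(value, reference_values, threshold=0.7):
    """Return the most similar reference value, or None if none reaches threshold.

    Similarity here is difflib's ratio (0.0 to 1.0), a stand-in for the
    Jaro-Winkler variant mentioned in the text.
    """
    best_value, best_score = None, 0.0
    for candidate in reference_values:
        score = SequenceMatcher(None, value.lower(), candidate.lower()).ratio()
        if score > best_score:
            best_value, best_score = candidate, score
    return best_value if best_score >= threshold else None


def lookup_validate(value, reference_values,
                    attempt_closest_match=True, threshold=0.7):
    """Hypothetical sketch of a lookup rule applied to a single value.

    Returns (value, status), where status is "valid", "replaced" or
    "fallback". What the fallback does is left to the caller.
    """
    if value in reference_values:
        return value, "valid"
    if attempt_closest_match:
        match = closest_match(value, reference_values, threshold)
        if match is not None:
            return match, "replaced"
    return value, "fallback"


# Made-up reference column and input values for illustration only.
reference = ["London", "Manchester", "Birmingham"]
results = [lookup_validate(v, reference) for v in ["London", "Lodnon", "Paris"]]
```

In this sketch "London" validates directly, "Lodnon" is close enough to "London" to be replaced, and "Paris" has no sufficiently close match, so it would be handed to whatever fallback is configured.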