If the data does not validate, a fallback option is applied according to the following parameters:
- fallback_mode - Specifies what should be done if the data does not comply with the rules. It can take the following values:
  - "remove_entry" - The default value. The record is removed and quarantined in a dataframe sent to the Logger object passed to the FileProcessor class.
  - "use_default" - The invalid value is replaced with whatever is specified in the "default_value" field.
  - "do_not_replace" - The record is left unchanged but is still logged to the Logger object.
- default_value - The value that records that do not validate are set to when fallback_mode is "use_default". Defaults to ''.
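
To make the three modes concrete, here is a minimal sketch in plain Python. It is not the library's implementation: the function `apply_fallback`, its arguments, and the use of plain lists in place of the dataframe and the Logger object are all hypothetical. The source does not say whether "use_default" replacements are also logged, so reporting them here is an assumption.

```python
def apply_fallback(values, is_valid, fallback_mode="remove_entry", default_value=""):
    """Illustrative only: apply a fallback mode to every invalid value.

    Returns (cleaned_values, flagged_values). The flagged list stands in
    for what would be sent to the Logger object.
    """
    cleaned, flagged = [], []
    for value in values:
        if is_valid(value):
            cleaned.append(value)
            continue
        # Assumption: every invalid value is reported, whatever the mode.
        flagged.append(value)
        if fallback_mode == "use_default":
            cleaned.append(default_value)   # swap in the default
        elif fallback_mode == "do_not_replace":
            cleaned.append(value)           # keep the value as it is
        elif fallback_mode != "remove_entry":
            raise ValueError(f"unknown fallback_mode: {fallback_mode!r}")
        # "remove_entry": the value is dropped from the cleaned output
    return cleaned, flagged


values = ["10", "abc", "7"]
print(apply_fallback(values, str.isdigit))
print(apply_fallback(values, str.isdigit, "use_default", default_value="0"))
print(apply_fallback(values, str.isdigit, "do_not_replace"))
```

In this sketch "remove_entry" shortens the output, whereas the other two modes keep it the same length as the input.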
For String, Number, and Date fields, there are two additional methods that can be applied for handling invalid data.
The best match option searches the other values in the field that meet the validation rule and replaces the invalid value with one that is sufficiently similar. It is controlled by these two parameters:
- attempt_closest_match - Specifies whether entries that do not validate should be replaced with the value of the closest matching record in the dataset. If a sufficiently close match, as specified by string_distance_threshold, is not found, the fallback_mode is still applied. Defaults to True.
- string_distance_threshold - Specifies the distance threshold that a closest match must meet to be applied. The distance is a variant of the Jaro-Winkler distance. Defaults to 0.7.
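
The closest-match idea can be sketched as follows. This is only an approximation under stated assumptions: the library uses a variant of the Jaro-Winkler distance, whereas this sketch swaps in `difflib.SequenceMatcher` from the Python standard library as a stand-in similarity score, and it treats the threshold as a minimum similarity between 0 and 1. The function name `closest_valid_match` is hypothetical.

```python
from difflib import SequenceMatcher


def closest_valid_match(invalid_value, valid_values, string_distance_threshold=0.7):
    """Illustrative only: return the most similar valid value, or None.

    SequenceMatcher.ratio() is a stand-in for the Jaro-Winkler variant
    described above; it is not the measure the library uses.
    """
    best_value, best_score = None, 0.0
    for candidate in valid_values:
        score = SequenceMatcher(None, invalid_value, candidate).ratio()
        if score > best_score:
            best_value, best_score = candidate, score
    # Only accept the match if it is close enough; otherwise the caller
    # would go on to apply fallback_mode.
    if best_score >= string_distance_threshold:
        return best_value
    return None


cities = ["London", "Paris", "Berlin"]
print(closest_valid_match("Londn", cities))  # a close match exists
print(closest_valid_match("xyz", cities))    # nothing close enough
print(closest_valid_match("", cities))       # blank values never match
```

A return value of `None` corresponds to the case where no sufficiently close match is found and the fallback_mode still applies.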
Best matching is not possible where the invalid data is blank. In this case an additional option is available to replace the field with a value from the record that is most similar in its other fields. This is known as a lookalike model and is most useful for filling in blank records.
It is enabled with a single parameter:
- lookalike_match - Specifies whether entries that do not validate should be replaced with the value from the record that looks most similar based on its other fields. This implements a nearest neighbor algorithm based on the similarity of the other fields in the dataset. It is useful for filling in blank records and defaults to False.
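
As a rough illustration of the lookalike idea, the sketch below fills a blank field from the record that agrees with it on the most other fields. It is a simplified stand-in, not the library's algorithm: the function `lookalike_fill`, the use of a list of dicts instead of a dataframe, and the similarity measure (a count of exactly matching fields) are all assumptions made for the example.

```python
def lookalike_fill(records, field):
    """Illustrative only: fill blank `field` values from the nearest neighbor.

    Similarity is the number of other fields on which two records agree,
    a deliberately simple measure chosen for this sketch.
    """
    def is_blank(value):
        return value is None or value == ""

    # Only records that already have a value can donate one.
    donors = [r for r in records if not is_blank(r.get(field))]
    filled = []
    for record in records:
        if not is_blank(record.get(field)) or not donors:
            filled.append(dict(record))
            continue
        other_fields = [k for k in record if k != field]
        best = max(
            donors,
            key=lambda donor: sum(record[k] == donor.get(k) for k in other_fields),
        )
        # Copy the record so the input list is left untouched.
        filled.append({**record, field: best[field]})
    return filled


records = [
    {"city": "London", "country": "UK", "currency": "GBP"},
    {"city": "Paris", "country": "France", "currency": "EUR"},
    {"city": "Leeds", "country": "UK", "currency": ""},
]
print(lookalike_fill(records, "currency")[2])
```

Here the Leeds record shares its country with the London record and nothing with the Paris record, so it takes its currency from London.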