This tool is written with type hints, which were introduced in Python 3.5.
- What this tool can do
- Validation schema
- Validation rules
- How to install & run
- How to add a custom column validation rule
The purpose of this tool is to validate comma-separated value (CSV) files. The user provides a validation schema as a JSON file, plus either the path of a single file to validate or a folder path to validate multiple files in one run against the provided schema.
The validation schema is a JSON file. Let's have a closer look at a real-life example:
```json
{
    "file_metadata":{
        "file_value_separator":",",
        "file_value_quote_char": "\"",
        "file_row_terminator":"\n",
        "file_has_header":true
    },
    "file_validation_rules":{
        "file_name_file_mask":"SalesJ",
        "file_extension":"csv",
        "file_size_range":[0,1],
        "file_row_count_range":[0,1000],
        "file_header_column_names":[
            "Transaction_date",
            "Product",
            "Price",
            "Payment_Type",
            "Name",
            "City",
            "State",
            "Country",
            "Account_Created",
            "Last_Login",
            "Latitude",
            "Longitude"
        ]
    },
    "column_validation_rules":{
        "Transaction_date":{
            "allow_data_type": "datetime.%m/%d/%y %H:%M"
        },
        "Country":{
            "allow_fixed_value_list":[
                "Norway",
                "United States"
            ],
            "allow_regex":"[a-zA-Z].+",
            "allow_substring":"sub",
            "allow_data_type":"str"
        },
        "Price":{
            "allow_int_value_range":[0, 100000],
            "allow_fixed_value":1000,
            "allow_data_type":"int"
        },
        "Latitude":{
            "allow_float_value_range": [-42.5, 90.5]
        }
    }
}
```

Mandatory objects in the validation schema JSON are:
- `file_metadata` object containing these 4 keys:

  ```json
  "file_metadata":{
      "file_value_separator":",",
      "file_value_quote_char": "\"",
      "file_row_terminator":"\n",
      "file_has_header":true
  }
  ```

- at least one defined rule
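As an illustration of how these mandatory pieces could be checked before validation starts, here is a minimal sketch; the function name `check_schema` is invented for this example and is not part of the tool:

```python
# The four keys the file_metadata object must contain.
REQUIRED_METADATA_KEYS = {
    "file_value_separator",
    "file_value_quote_char",
    "file_row_terminator",
    "file_has_header",
}

def check_schema(schema: dict) -> list:
    """Return a list of problems found in a validation schema dict."""
    problems = []
    metadata = schema.get("file_metadata")
    if metadata is None:
        problems.append("missing mandatory object: file_metadata")
    else:
        missing = REQUIRED_METADATA_KEYS - metadata.keys()
        problems.extend(f"file_metadata is missing key: {key}"
                        for key in sorted(missing))
    # At least one rule must be defined somewhere in the schema.
    if not (schema.get("file_validation_rules")
            or schema.get("column_validation_rules")):
        problems.append("schema defines no validation rules")
    return problems
```

Run it on a loaded schema, e.g. `check_schema(json.load(open("config.json")))`; an empty result means the mandatory pieces are present.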
If validating a file that has a header, set the `file_has_header` key to true and use the column names as the keys in the column validation rules.
```json
{
    "file_metadata":{
        "file_value_separator":",",
        "file_value_quote_char": "\"",
        "file_row_terminator":"\n",
        "file_has_header":true
    },
    "file_validation_rules":{
        "file_name_file_mask":"SalesJ",
        "file_extension":"csv",
        "file_size_range":[0,1],
        "file_row_count_range":[0,1000],
        "file_header_column_names":[
            "Transaction_date",
            "Product",
            "Price",
            "Payment_Type",
            "Name",
            "City",
            "State",
            "Country",
            "Account_Created",
            "Last_Login",
            "Latitude",
            "Longitude"
        ]
    },
    "column_validation_rules":{
        "Transaction_date":{
            "allow_data_type": "datetime.%m/%d/%y %H:%M"
        },
        "Country":{
            "allow_fixed_value_list":[
                "Norway",
                "United States"
            ],
            "allow_regex":"[a-zA-Z].+",
            "allow_substring":"Norwayz",
            "allow_data_type":"str",
            "allow_fixed_value":"value"
        },
        "Price":{
            "allow_int_value_range":[0, 100000],
            "allow_fixed_value":1000,
            "allow_data_type":"int"
        },
        "Latitude":{
            "allow_float_value_range": [-42.5, 90.5]
        }
    }
}
```

If validating a file that has no header, set the `file_has_header` key to false and use zero-based column indexes as the keys in the column validation rules, so the first column is "0".
```json
{
    "file_metadata":{
        "file_value_separator":",",
        "file_value_quote_char": "\"",
        "file_row_terminator":"\n",
        "file_has_header":false
    },
    "file_validation_rules":{
        "file_name_file_mask":"SalesJ",
        "file_extension":"csv",
        "file_size_range":[0,1],
        "file_row_count_range":[0,1000]
    },
    "column_validation_rules":{
        "0":{
            "allow_data_type": "datetime.%m/%d/%y %H:%M"
        },
        "1":{
            "allow_fixed_value_list":[
                "Norway",
                "United States"
            ],
            "allow_regex":"[a-zA-Z].+",
            "allow_substring":"xzy",
            "allow_data_type":"str"
        },
        "2":{
            "allow_int_value_range":[0, 100000],
            "allow_fixed_value":1000,
            "allow_data_type":"int"
        },
        "10":{
            "allow_float_value_range": [-42.5, 90.5]
        }
    }
}
```

- File level validation rules:
    - `file_name_file_mask` : checks the file name matches the provided file mask regex pattern
    - `file_extension` : checks the file extension is an exact match with the provided value
    - `file_size_range` : checks the file size in MB is in the range of the provided values
    - `file_row_count_range` : checks the file row count is in the range of the provided values
    - `file_header_column_names` : checks the file header is an exact match with the provided list of column names
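As an illustration (not the tool's actual implementation), the size and row-count checks could look roughly like this; the helper names are invented for this sketch:

```python
import os

def file_size_in_range(path: str, mb_range: list) -> bool:
    """file_size_range: check the file size in MB falls within [low, high]."""
    size_mb = os.path.getsize(path) / (1024 * 1024)
    return mb_range[0] <= size_mb <= mb_range[1]

def row_count_in_range(path: str, row_range: list) -> bool:
    """file_row_count_range: check the number of rows falls within [low, high]."""
    with open(path, newline="") as handle:
        rows = sum(1 for _ in handle)
    return row_range[0] <= rows <= row_range[1]
```

Both helpers take the two-element lists straight from `file_size_range` and `file_row_count_range` in the schema.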
- Column level validation rules:
    - `allow_data_type` : checks column values are of the allowed data type (allowed options: `str`, `int`, `float`, `datetime`, `datetime.<<format>>`)
    - `allow_int_value_range` : checks integer column values are in the range of the provided values
    - `allow_float_value_range` : checks float column values are in the range of the provided values
    - `allow_fixed_value_list` : checks column values are in the provided value list
    - `allow_regex` : checks column values match the provided regex pattern
    - `allow_substring` : checks column values are a substring of the provided value
    - `allow_fixed_value` : checks column values are an exact match with the provided value
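A hedged sketch of how two of these rules could be evaluated against raw CSV strings; the helper names `value_matches_type` and `int_value_in_range` are invented for this illustration and are not the tool's actual functions:

```python
from datetime import datetime

def value_matches_type(column_value: str, allowed: str) -> bool:
    """allow_data_type: check a raw CSV string against str/int/float/datetime."""
    if allowed == "str":
        return True  # every raw CSV value is already a string
    if allowed == "int":
        return column_value.lstrip("+-").isdigit()
    if allowed == "float":
        try:
            float(column_value)
            return True
        except ValueError:
            return False
    if allowed.startswith("datetime"):
        # "datetime.<<format>>" carries a strptime format after the first dot;
        # the fallback for a bare "datetime" is an assumed ISO date format
        _, _, fmt = allowed.partition(".")
        try:
            datetime.strptime(column_value, fmt or "%Y-%m-%d")
            return True
        except ValueError:
            return False
    return False

def int_value_in_range(column_value: str, bounds: list) -> bool:
    """allow_int_value_range: check an integer value falls within [low, high]."""
    try:
        return bounds[0] <= int(column_value) <= bounds[1]
    except ValueError:
        return False
```

`bounds` is the two-element list from the schema, e.g. `[0, 100000]` for the `Price` column above.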
- ideally create and activate a virtual environment or pipenv in order to safely install dependencies from `requirements.txt` using `pip install -r requirements.txt`
- set PYTHONPATH, from Windows CMD for example:

  ```
  set PYTHONPATH=%PYTHONPATH%;C:\csv_file_validator
  ```

- run using a command, for example:

  ```
  python C:\csv_file_validator\csv_file_validator -fl C:\csv_file_validator\tests\files\csv\with_header\SalesJan2009_with_header_correct_file.csv -cfg C:\csv_file_validator\tests\files\configs\config_with_header.json
  ```
- in the `settings.conf` file:
    - you can set the variable `RAISE_EXCEPTION_AND_HALT_ON_FAILED_VALIDATION` to `True` or `False`; this variable drives whether the tool stops validating after it hits a failed validation or not
    - you can set the variable `SKIP_COLUMN_VALIDATIONS_ON_EMPTY_FILE` to `True` or `False`; this variable drives whether the tool bypasses the column level validations on a file that has no rows or not
- `-fl` <string: mandatory> single file absolute path, or an absolute folder location (in case you need to validate multiple files from a directory in one app run)
- `-cfg` <string: mandatory> configuration JSON file location absolute path
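The two arguments above could be parsed with `argparse` along these lines; this is a sketch of the command-line interface, not the tool's actual parser:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the tool's two mandatory arguments."""
    parser = argparse.ArgumentParser(
        description="Validate CSV files against a JSON validation schema")
    parser.add_argument("-fl", required=True,
                        help="absolute path of a single file, or of a folder "
                             "of files to validate in one run")
    parser.add_argument("-cfg", required=True,
                        help="absolute path of the configuration JSON file")
    return parser
```

For example, `build_parser().parse_args(["-fl", "data.csv", "-cfg", "config.json"])` yields a namespace with `fl` and `cfg` attributes.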
Column validation rule interface:

The keyword argument `validation_value` is the value in the config.json file, describing the allowed values for the validation rule.

The keyword argument `column_value` is the value in the corresponding column of the .csv file being validated.
- Create a function in the `/csv_file_validator/validation_functions.py` module and decorate it with `@logging_decorator`, like this:

  ```python
  @logging_decorator
  def my_new_validation_function(**kwargs):
      # example condition that validates the exact match of a validation_value and a column_value:
      if kwargs.get('validation_value') == kwargs.get('column_value'):
          # your validation condition success returns 0
          return 0
      # your validation condition fail returns 1
      return 1
  ```
- Add your function name to the registered validation name key - function mapping dictionary `_ATTRIBUTE_FUNC_MAP`. This dictionary is located at the end of the file `/csv_file_validator/validation_functions.py`:

  ```python
  "my_new_validation_function": my_new_validation_function
  ```
- For a column you wish to evaluate using this new validation rule, set up a validation function to validation value mapping in config.json:

  ```json
  "my_column_name": {
      "my_new_validation_function": "some_validation_value"
  }
  ```
- If you need to define regex patterns in regex validation rules, check https://regex101.com/
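The custom-rule contract from the steps above can be exercised in isolation. The sketch below uses a stand-in `logging_decorator` (the real one lives in `/csv_file_validator/validation_functions.py`; this one only forwards the call) to show the 0/1 return convention:

```python
import functools

def logging_decorator(func):
    """Stand-in for the project's decorator; it just forwards the call."""
    @functools.wraps(func)
    def wrapper(**kwargs):
        return func(**kwargs)
    return wrapper

@logging_decorator
def my_new_validation_function(**kwargs):
    # success returns 0, failure returns 1
    if kwargs.get('validation_value') == kwargs.get('column_value'):
        return 0
    return 1

print(my_new_validation_function(validation_value="Norway", column_value="Norway"))  # 0
print(my_new_validation_function(validation_value="Norway", column_value="Sweden"))  # 1
```

The tool calls each registered rule with the `validation_value` / `column_value` keyword arguments described in the interface section above.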