Data validation is relatively simple, but it’s a tricky problem to solve at scale. Poor or non-existent data validation is the downfall of many data science and engineering projects!
Take a simple example of a hotel that records its bookings by hand. The hotel writes down all the information about each booking on a manual booking form, which includes various entities (like the customer) and events (like the check-in date).
When it comes to interpreting this data in some way, various formatting issues come to the fore. A typical example of troublesome data is dates, which can be written in many different ways:
Dates of birth are the same, and might be written as either text or dates:
When a human interprets this information, they might be able to unravel some of these issues. But even then, mixing up the month and day, for example, could prove a critical error when looking up someone’s information.
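To make the month/day mix-up concrete, here is a minimal Python sketch. The date string is a made-up example, not taken from any real booking form. It shows that the same handwritten date parses without error under both conventions, so the mistake would go unnoticed unless the expected format is agreed in advance.

```python
from datetime import datetime

raw = "03/04/2023"  # hypothetical handwritten check-in date

# The same text is a perfectly valid date under both conventions...
day_first = datetime.strptime(raw, "%d/%m/%Y").date()
month_first = datetime.strptime(raw, "%m/%d/%Y").date()

# ...but the two readings are a month apart.
print(day_first.isoformat())    # 2023-04-03
print(month_first.isoformat())  # 2023-03-04
```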
The above example is an issue of manual data validation – the hotel needs to establish a definitive way of structuring data, and then a protocol or process for comparing ingested data against that structure to validate it.
Data validation is the process of comparing ingested data to a pre-configured or pre-defined set of rules to ensure that it conforms to requirements. This process involves running a series of checks, called check routines. Check routines range from simple checks (such as ensuring that a date of birth only includes numbers) to advanced routines built from structured conditional checks.
Data validation (when done properly) ensures that data is clean, usable and accurate. Only validated data should be stored, imported or used. Using unvalidated data can result in failing applications, inaccurate outcomes (e.g. when training models on poor data) or other potentially catastrophic issues.
For example, when working with machine learning models, feeding non-validated data to a model that was trained on very well-structured, perfectly validated data (e.g. from a dataset) can produce unexpected outcomes.
It’s worth mentioning that data validation and verification are often confused, but they are not the same. Data verification pertains to data accuracy, e.g. if someone uses their card to make a payment, the card information will be verified against the data held by the bank.
If the card details match the records, the card is verified. In addition, matching a membership ID against a purchase or membership plan or tier is also verification.
Verification is also crucial when working with multiple copies of the same data. For example, when data is migrated from one place to another, the new data has to be verified against the old data to check for issues. If a field was deleted during migration, or another change was made to the data format, the new data set is inaccurate. In this situation, it’s crucial to match the data in the new system with the data in the old system to ensure that the two are identical in every respect.
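A migration check like this can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`row_fingerprint`, `verify_migration`), the `id` key and the sample rows are all hypothetical, and a real migration would compare data read from the two systems rather than in-memory lists.

```python
import hashlib
import json

def row_fingerprint(row: dict) -> str:
    """Hash a row's fields in a stable key order."""
    canonical = json.dumps(row, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def verify_migration(old_rows: list, new_rows: list, key: str) -> dict:
    """Compare two datasets keyed by `key` and report the differences."""
    old = {r[key]: row_fingerprint(r) for r in old_rows}
    new = {r[key]: row_fingerprint(r) for r in new_rows}
    return {
        "missing": sorted(old.keys() - new.keys()),
        "unexpected": sorted(new.keys() - old.keys()),
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }

# Hypothetical sample data: the "tier" field was lost for id 2.
old_rows = [
    {"id": 1, "name": "Ada", "tier": "gold"},
    {"id": 2, "name": "Grace", "tier": "silver"},
]
new_rows = [
    {"id": 1, "name": "Ada", "tier": "gold"},
    {"id": 2, "name": "Grace"},
]

report = verify_migration(old_rows, new_rows, key="id")
print(report)  # {'missing': [], 'unexpected': [], 'changed': [2]}
```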
Data validation is extremely important in most data science and engineering projects. In a business context, the cost of invalid data can be enormous. According to Gartner and IBM, poor data costs businesses in the US trillions of dollars every year, and HBR similarly cites a yearly cost of an eye-watering $3 trillion in the US. Invalid data is one part of this enormous problem.
Not only is invalid data costly, but it can also pose a business risk if it prevents a business from upholding its regulatory or legal obligations.
Here are 4 examples of when poor data validation affects businesses:
Some of the above issues are data verification issues as well as data validation issues, but the point still stands that invalid data has a potent knock-on effect that can harm a business. For those training predictive models, inconsistencies, including invalid data, will hurt your model once you expose it to real data.
While validating data in simple systems is generally straightforward, problems arise when ingesting data from multiple sources using multiple methods. Once you combine data from multiple sources into one database, data validation suddenly becomes incredibly important.
Most forms of data validation involve checks. These checks, usually run in sequence, simply ‘test’ the data for its validity. Once data passes the checks, it either passes through the connecting stage or is logged in the database.
Sometimes the application needs to be told about the invalid data (e.g. an invalid card payment sent back to the payment client). Other times, invalid data is flagged and actioned in the processing framework (e.g. an IoT sensor sends some invalid data; this is processed and filtered before being logged in the database). Invalid data can also be rejected at source (e.g. when a customer enters an invalid email address).
The most common data checks are below:
1: Data Type Check
Data types are fundamental, and whilst different programming languages have different data types (e.g. Python has the dictionary data type), most are fundamentally the same (e.g. numeric data types). The most common example here is when numeric data such as dates or DOBs incorporates letters or punctuation marks. This would render that data invalid if the field is designed only for a numeric data type.
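As a rough illustration, a data type check might look like the Python sketch below. The helper names `check_type` and `is_digits_only` are made up for this example, and the digits-only rule assumes a field that stores dates as plain digits such as 19900403.

```python
def check_type(value, expected_type) -> bool:
    """Return True if value is an instance of expected_type."""
    # bool is a subclass of int in Python, so reject it for integer fields.
    if expected_type is int and isinstance(value, bool):
        return False
    return isinstance(value, expected_type)

def is_digits_only(text: str) -> bool:
    """Return True if text consists solely of the ASCII digits 0-9."""
    return text.isascii() and text.isdigit()

print(check_type(42, int))           # True
print(check_type("42", int))         # False
print(is_digits_only("19900403"))    # True
print(is_digits_only("1990-04-03"))  # False
```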
2: Code Check
Similarly, some data involves codes, like country codes, states or postcodes. These codes will often need to comply with a unified, acceptable standard. For example, you would not be able to process a shipping label through an automated application if you receive a country or state with the wrong code – the application would reject that data. As such, code checks performed at source ensure the compliance of data with other applications downstream.
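A code check usually boils down to a lookup against an agreed list. The sketch below assumes two-letter ISO 3166-1 country codes and uses a deliberately tiny, illustrative subset. A real system would load the full list from a maintained source.

```python
# Illustrative subset only, not the full ISO 3166-1 alpha-2 list.
ALLOWED_COUNTRY_CODES = {"GB", "US", "FR", "DE"}

def check_country_code(code: str) -> bool:
    """Return True only for an exact match against the allowed codes."""
    return code in ALLOWED_COUNTRY_CODES

print(check_country_code("GB"))              # True
print(check_country_code("UK"))              # False, the ISO code is GB
print(check_country_code("United Kingdom"))  # False
```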
3: Range Check
Ranges cannot be infinite. There must be a specified range within which data has to fall to be valid. For example, a latitude value should be between -90 and 90, while a longitude value should be between -180 and 180. Values outside the specified range are invalid.
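The latitude and longitude example translates almost directly into code. This is a minimal sketch. The function name `check_coordinates` is hypothetical, and it treats the boundary values themselves as valid.

```python
def check_coordinates(lat: float, lon: float) -> bool:
    """Return True if both values fall inside their permitted ranges."""
    return -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0

print(check_coordinates(51.5, -0.12))  # True
print(check_coordinates(91.0, 0.0))    # False, latitude out of range
print(check_coordinates(0.0, 181.0))   # False, longitude out of range
```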
4: Format Check
Format checks are crucial for addresses, dates and other strings or numerical values that can be written in many different ways. Again, if an application ingests poorly formatted information, that data will be invalid and will cause issues downstream. Formatting also applies to measurement units. Temperature, for example, may be recorded in Celsius, Fahrenheit or Kelvin, and storing the same number regardless of whether a length was measured in cm, m or km will cause serious issues.
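Here is one way a format check could be sketched in Python. It assumes the agreed date format is ISO 8601 (YYYY-MM-DD) and that lengths are stored in metres. Both are choices made for this example, and the names `check_date_format` and `to_metres` are hypothetical.

```python
import re
from datetime import datetime

ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")

def check_date_format(text: str) -> bool:
    """Return True if text is a real calendar date written as YYYY-MM-DD."""
    if not ISO_DATE.fullmatch(text):
        return False
    try:
        datetime.strptime(text, "%Y-%m-%d")
    except ValueError:  # right shape, impossible date (e.g. 30 February)
        return False
    return True

# Conversion factors to the single unit we store (metres).
TO_METRES = {"cm": 0.01, "m": 1.0, "km": 1000.0}

def to_metres(value: float, unit: str) -> float:
    """Convert a length to metres, rejecting units we do not recognise."""
    if unit not in TO_METRES:
        raise ValueError(f"unsupported unit: {unit!r}")
    return value * TO_METRES[unit]

print(check_date_format("2023-04-03"))  # True
print(check_date_format("03/04/2023"))  # False
print(check_date_format("2023-02-30"))  # False
print(to_metres(3, "km"))               # 3000.0
```

Storing the converted value alongside a single agreed unit avoids the situation where 3 could mean centimetres in one row and kilometres in another.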
5: Consistency Check
Consistency checks test data for logical consistency. For example, for data to be validated as consistent, events might need to take place in the correct chronological order. Someone cannot log out before they have logged in, or check out before they’ve checked in.
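For the hotel example, a consistency check can be as small as comparing two dates. The sketch below assumes that checking out on the same day as checking in is allowed. That is a business rule you would need to confirm, and `check_stay_order` is a made-up name.

```python
from datetime import date

def check_stay_order(check_in: date, check_out: date) -> bool:
    """Return True if the guest does not check out before checking in."""
    return check_in <= check_out

print(check_stay_order(date(2024, 5, 1), date(2024, 5, 3)))  # True
print(check_stay_order(date(2024, 5, 3), date(2024, 5, 1)))  # False
```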
6: Uniqueness Check
Uniqueness checks ensure that input data is not duplicated elsewhere. These may be conditional, e.g. flagging when the road name is accidentally input in both the first and second line of the address.
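Both kinds of uniqueness check can be sketched briefly. The helpers below are hypothetical: `find_duplicates` reports values that occur more than once in a field that should be unique, and `address_lines_repeat` covers the conditional address case, ignoring case and surrounding whitespace.

```python
def find_duplicates(values) -> set:
    """Return the set of values that appear more than once."""
    seen, duplicates = set(), set()
    for value in values:
        if value in seen:
            duplicates.add(value)
        else:
            seen.add(value)
    return duplicates

def address_lines_repeat(line1: str, line2: str) -> bool:
    """Return True if a non-empty first line is repeated in the second."""
    first = line1.strip().casefold()
    second = line2.strip().casefold()
    return bool(first) and first == second

print(find_duplicates(["a@x.com", "b@x.com", "a@x.com"]))  # {'a@x.com'}
print(address_lines_repeat("12 High Street", "12 high street "))  # True
print(address_lines_repeat("12 High Street", "Flat 2"))           # False
```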
Most of these examples are easily visualised by imagining a user inputting data into a set of fields, but this is certainly not the only time when data is validated. Data should be validated when ingested from any source for analysis or use downstream.
ETL pipelines will often do some data validation for you during the transformation process, but any transformed data should be tested for validity to make sure that the ingested values match the expected values – though this should be done before any data is moved. Data validation during ETL usually takes place during the testing phase.
The ETL test phase includes checking for duplicate data (to make sure that multiple rows don’t share the same combination of column values), validating data according to mapping docs or rules, validating fields against some of the checks listed above, and checking data against expected file formats.
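The duplicate check in an ETL test phase could be sketched as follows. The column names and bookings are invented for the example, and the choice of which columns together identify a row is an assumption you would take from your own mapping docs.

```python
from collections import Counter

def duplicate_keys(rows: list, key_columns: tuple) -> list:
    """Return the key combinations shared by more than one row."""
    counts = Counter(tuple(row[col] for col in key_columns) for row in rows)
    return [key for key, n in counts.items() if n > 1]

# Hypothetical transformed rows: the first and third are the same booking.
rows = [
    {"guest_id": 1, "check_in": "2024-05-01", "room": 12},
    {"guest_id": 2, "check_in": "2024-05-01", "room": 14},
    {"guest_id": 1, "check_in": "2024-05-01", "room": 12},
]

print(duplicate_keys(rows, ("guest_id", "check_in")))  # [(1, '2024-05-01')]
```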
There are 2 broad approaches to validation:
Data can be validated using scripts written in programming languages. Here, a key example is validation driven by type hints in Python, which lets developers surface validation errors at runtime.
For example, Pydantic enforces type hints and greatly assists in validation, enabling developers to check whether any data given matches a predefined schema model. This is an excellent way to ensure data validity at runtime. Models can also inherit from other models, and the inherited fields are subject to the same validation.
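To show the idea without requiring a third-party install, the sketch below uses only Python’s standard library. It is not Pydantic’s API. It is a deliberately simplified stand-in that checks each field against its type hint when an object is created, and it only handles plain classes as hints (not generics such as `list[int]`). The names `ValidatedModel`, `SchemaError`, `Guest` and `Booking` are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import get_type_hints

class SchemaError(TypeError):
    """Raised when a field's value does not match its type hint."""

@dataclass
class ValidatedModel:
    def __post_init__(self):
        # get_type_hints also collects hints inherited from parent models.
        for name, expected in get_type_hints(type(self)).items():
            value = getattr(self, name)
            if not isinstance(value, expected):
                raise SchemaError(
                    f"{name}: expected {expected.__name__}, "
                    f"got {type(value).__name__}"
                )

@dataclass
class Guest(ValidatedModel):
    name: str
    date_of_birth: date

@dataclass
class Booking(Guest):  # inherits Guest's fields and their validation
    check_in: date
    check_out: date

ok = Booking("Ada", date(1990, 4, 3), date(2024, 5, 1), date(2024, 5, 3))

try:
    Booking("Ada", "03/04/1990", date(2024, 5, 1), date(2024, 5, 3))
except SchemaError as err:
    print(err)  # date_of_birth: expected date, got str
```

A library such as Pydantic goes much further than this (for instance, coercing compatible input and reporting every failing field at once), so treat this purely as an illustration of the principle.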
If you need to validate large volumes of data ingested from multiple sources, modern automated tools provide accurate validation at scale and across an array of inputs. For example, Segment’s Protocols feature validates events and properties against a predefined tracking plan. Invalid or non-conforming data can then be actioned, e.g. with enforcement actions.
What happens to invalid data depends on the use case or application. For example, if the invalid data is input into a form, then the most common response is enforcement action.
Enforcement action requests that a user inputs valid data (e.g. writes their DOB in a certain format). The data may also be automatically changed to conform (e.g. the system accepts the DOB and converts it into a conforming format).
Advisory action requests that the user changes the data, but doesn’t enforce it. This might be better when collecting the data is a maximum priority and any invalid data can be fixed in the processing stage.
Verification action asks the user to check the data again because it looks invalid but might not actually be invalid. E.g. a house name is not invalid just because it lacks a number, but it might look as if the user accidentally input a road name in the house number field.
The invalid data might also simply be logged so it can be checked later.
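The “automatically changed to conform” flavour of enforcement action can be sketched like this. The list of accepted formats is an assumption for the example (it reads ambiguous dates day-first), and `normalise_dob` is a hypothetical helper. When nothing matches, it returns `None` so the caller can fall back to asking the user to re-enter the value or to logging it.

```python
from datetime import datetime
from typing import Optional

# Assumption: these are the only input formats we agree to accept,
# and slash- or dash-separated dates are read day-first.
ACCEPTED_DOB_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d-%m-%Y")

def normalise_dob(text: str) -> Optional[str]:
    """Convert an accepted DOB into YYYY-MM-DD, or return None if invalid."""
    for fmt in ACCEPTED_DOB_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None

print(normalise_dob("03/04/1990"))  # 1990-04-03
print(normalise_dob("1990-04-03"))  # 1990-04-03
print(normalise_dob("April 3rd"))   # None
```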
Most of the above actions occur before data is connected or loaded to an application or database, but this is not always the case.
Take the example of a card transaction. Here, the transaction has to be properly validated via an API call to the bank or card provider. The card information is both validated and verified. As soon as the data is ingested and validated, the result is sent back to the client (e.g. data valid or invalid). If the string is incorrect or invalid then the client is alerted immediately and nothing else happens (though the failed payment will be logged).
Alternatively, validation will take place in processing, probably at the same time as cleaning, transforming or enriching the data. Take the example of an IoT weather sensor. This sensor collects data from the environment, and that data is ingested via an API. Of course, the data could be validated at the connect phase, but what would be the point in informing the IoT sensor that the data is invalid?
Instead, it’s necessary to validate the data after you’ve confirmed receiving it. Then, when you process the data, you test its validity. If it looks invalid, that data is filtered out. The invalid data doesn’t get saved into a database or used, but it isn’t sent back to anyone either.
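A processing-stage filter for the weather sensor might look like the sketch below. The field name `temperature_c` and the plausible range of -90 to 60 degrees Celsius are assumptions made for this example. Invalid readings are logged and dropped rather than reported back to the sensor.

```python
import logging

logger = logging.getLogger("sensor_ingest")

def is_valid_reading(reading: dict) -> bool:
    """Return True if the reading has a plausible numeric temperature."""
    temp = reading.get("temperature_c")
    if isinstance(temp, bool) or not isinstance(temp, (int, float)):
        return False
    return -90.0 <= temp <= 60.0  # assumed plausible bounds

def process(readings: list) -> list:
    """Keep valid readings; log and discard the rest."""
    valid = []
    for reading in readings:
        if is_valid_reading(reading):
            valid.append(reading)
        else:
            logger.warning("dropping invalid reading: %r", reading)
    return valid

readings = [
    {"temperature_c": 21.5},
    {"temperature_c": 999},   # out of range
    {"temperature_c": None},  # missing value
]
print(process(readings))  # [{'temperature_c': 21.5}]
```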
Data validation is well worth the time. Once you produce a tracking plan, it’s best practice to note the data types you’ll use and your expected values. If you do that, building conforming ingestion pipelines will be much simpler.
Whilst tools such as Pydantic are excellent for bespoke data validation in many use cases, validation software vastly simplifies the process of validating data ingested from multiple sources using different techniques, and with different entities, properties and events.
Enterprise tools like Segment contain their own automated data validation services. Otherwise, you can use validation libraries and scripts such as Pydantic.
What is data validation?
Data validation is the process of checking data against a set of pre-defined rules. This ensures that any and all data ingested into a system conforms to what that system expects, which prevents errors and data integrity issues.