Ensuring Data Integrity in DWH/BI Systems with 3 Types of Testing
Author:
Wayne Yaddow
Data migration / integration / ETL consultant
Date: Jul. 11, 2018
Editor’s note: Wayne Yaddow is an independent consultant with over 20 years’ experience leading data migration/integration/ETL testing projects at organizations including J.P. Morgan Chase, Credit Suisse, Standard and Poor’s, AIG, Oppenheimer Funds, IBM, and Achieve3000. Additionally, Wayne has taught IIST (International Institute of Software Testing) courses on data warehouse, ETL, and data integration testing. He continues to lead numerous ETL testing and coaching projects on a consulting basis. You can contact him at wyaddow@gmail.com.
Data warehousing and business intelligence (DWH/BI) users assume, and need, trustworthy data.
In the Gartner IT Glossary, data integrity and data integrity testing are defined as follows:
Data Integrity: the quality of the data residing in data repositories and database objects; the measure that users consider when analyzing the value and reliability of the data.
Data Integrity Testing: verification that moved, copied, derived, and converted data is accurate and functions correctly within a single subsystem or application.
Data integrity processes should not only help you understand a project's data integrity, but also help you gain and maintain the accuracy and consistency of data over its lifecycle. This includes data management best practices, such as preventing data from being altered each time it is copied or moved, and establishing processes that maintain DWH/BI data integrity at all times. Data in its production state is the driving force behind industry decision making. Data integrity errors commonly arise from human error, noncompliant operating procedures, errors in data transfers, software defects, compromised hardware, and physical damage to devices.
This blog focuses on DWH/BI "data integrity testing," that is, testing processes that support:
All data warehouse sources and target schemas
ETL processes
Business intelligence components/front-end applications
We cover how data integrity verification is addressed in each of the above categories.
Other categories of DWH/BI and ETL testing (e.g., functional, performance, security, scalability, system and integration, and end-to-end testing) are not a focus here.
Classifications of Data Integrity for DWH/BI Systems
Data Integrity is an umbrella term that refers to the consistency, accuracy, and correctness of data stored in a database. There are three primary types of data integrity:
Entity Integrity ensures that each row in a table (for example) is uniquely identified and without duplication. Entity integrity is often enforced by placing primary key or unique constraints on specific columns. Testing may be achieved, for example, by attempting to load duplicate or null key values during test.
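As an illustration, a minimal entity-integrity check might look like the sketch below, written in Python with the standard sqlite3 module. The database file, the customer_dim table, and its customer_id key column are hypothetical names used only for this example.

    import sqlite3

    conn = sqlite3.connect("dwh_target.db")  # hypothetical target database
    cur = conn.cursor()

    # Rows whose business key is NULL violate entity integrity.
    null_keys = cur.execute(
        "SELECT COUNT(*) FROM customer_dim WHERE customer_id IS NULL"
    ).fetchone()[0]

    # Keys that appear more than once indicate duplicate rows.
    dup_keys = cur.execute(
        "SELECT customer_id, COUNT(*) FROM customer_dim "
        "GROUP BY customer_id HAVING COUNT(*) > 1"
    ).fetchall()

    assert null_keys == 0, f"{null_keys} rows have a NULL customer_id"
    assert not dup_keys, f"duplicate customer_id values: {dup_keys[:5]}"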
Domain Integrity requires that every value in a column falls within a defined, permissible range or set. Examples of domain rules include correct data type, format, and data length; values that fall within the range defined for the system; null status; and permitted value sizes. Testing may be accomplished, in part, by submitting null, default, and invalid values.
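Domain-integrity checks can follow the same pattern. The sketch below assumes a hypothetical orders table whose order_amount must lie between 0 and 100,000 and whose status must come from a small permitted set; the table, columns, and ranges are illustrative only.

    import sqlite3

    conn = sqlite3.connect("dwh_target.db")  # hypothetical target database
    cur = conn.cursor()

    # Values outside the permitted numeric range violate domain integrity.
    out_of_range = cur.execute(
        "SELECT COUNT(*) FROM orders "
        "WHERE order_amount IS NULL OR order_amount < 0 OR order_amount > 100000"
    ).fetchone()[0]

    # Values outside the permitted code set also violate domain integrity.
    bad_status = cur.execute(
        "SELECT COUNT(*) FROM orders "
        "WHERE status NOT IN ('NEW', 'SHIPPED', 'CANCELLED')"
    ).fetchone()[0]

    assert out_of_range == 0, f"{out_of_range} rows violate the order_amount range"
    assert bad_status == 0, f"{bad_status} rows have an unexpected status value"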
Referential Integrity is concerned with keeping the relationships between tables consistent. Referential integrity is often enforced with primary key (PK) and foreign key (FK) constraints. It may be tested, for example, by deleting parent rows or child rows in related tables and verifying that the relationship rules hold.
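Referential integrity is often tested by looking for orphans: child rows whose foreign key has no matching parent row. The sketch below assumes a hypothetical sales_fact table whose customer_id values must exist in the customer_dim parent table; again, all names are illustrative.

    import sqlite3

    conn = sqlite3.connect("dwh_target.db")  # hypothetical target database
    cur = conn.cursor()

    # Child rows with no matching parent row are referential-integrity orphans.
    orphans = cur.execute(
        "SELECT COUNT(*) FROM sales_fact f "
        "LEFT JOIN customer_dim d ON f.customer_id = d.customer_id "
        "WHERE d.customer_id IS NULL"
    ).fetchone()[0]

    assert orphans == 0, f"{orphans} fact rows reference a missing customer_id"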
Verifying Data Integrity in Schemas, ETL Processes, and BI Reports
The framework in Figure 1 illustrates the major DWH/BI components that are generally tested in all categories of end-to-end DWH/BI testing. Data integrity testing often requires considerable time and resources.
Figure 1: General Framework for DWH/BI End-to-end Data Verifications