Wednesday, 2 May 2012

The Quality of Data

With the introduction of SQL Server 2012 we now have a dedicated facility for the management of data quality, aptly named Data Quality Services. So what exactly does this entail?

The first question to ask is “what exactly is data quality?”. SQL Server is a relational database, and by its very design it maintains relationships between tables through foreign key constraints and ensures the uniqueness of data columns using unique constraints, primary keys and indexes. We have data types to ensure that the correct kind of value is entered into each column, and we can add check constraints and triggers to apply logical validations. So what could possibly go wrong with data quality?
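To make those structural safeguards concrete, here is a minimal T-SQL sketch; the table and column names are illustrative, not taken from any real schema.

CREATE TABLE dbo.Customer
(
    CustomerId  INT IDENTITY(1,1) PRIMARY KEY,       -- uniqueness via the primary key
    Email       NVARCHAR(256) NOT NULL UNIQUE,       -- uniqueness via a unique constraint
    FullName    NVARCHAR(100) NOT NULL,              -- the data type constrains what can be stored
    CreatedOn   DATETIME2 NOT NULL
        CONSTRAINT CK_Customer_CreatedOn
        CHECK (CreatedOn <= SYSDATETIME())           -- a logical validation via a check constraint
);

CREATE TABLE dbo.[Address]
(
    AddressId   INT IDENTITY(1,1) PRIMARY KEY,
    CustomerId  INT NOT NULL
        CONSTRAINT FK_Address_Customer
        REFERENCES dbo.Customer (CustomerId),        -- the relationship via a foreign key
    Street      NVARCHAR(200) NOT NULL,
    Postcode    NVARCHAR(10) NULL
);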

Well, interestingly, quite a lot can still go wrong. Let’s take the user registration database for a typical ecommerce site as an example. Users register when they wish to make a purchase, and the system records the email address, client name, and delivery and billing addresses. So far so good: SQL Server will ensure that this data is all stored correctly in the database. Then six months later the user returns to the site to buy something else, forgets that they already registered, and registers again, with a different email address, a middle initial added to the name, and the address formatted slightly differently, with Street instead of St and no postcode. SQL Server stores the new user and address and all looks good, but all is not good. The user’s details have been stored twice, the address has been duplicated and the formatting of the fields is inconsistent.

The example may sound trivial, but it is a big issue for online retailers and just about every other business. The database server is rigorous about storing structured data, but it cannot handle the inconsistencies of the users who enter that data. Having many thousands of duplicates in a database is not uncommon.
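Against the illustrative Customer table sketched above, the failure looks like this (the values are invented): both inserts satisfy every constraint, yet the two rows describe the same person.

INSERT INTO dbo.Customer (Email, FullName, CreatedOn)
VALUES (N'jsmith@example.com', N'John Smith', SYSDATETIME());

-- Six months later: a different email address and a middle initial,
-- so the unique constraint and every type check pass happily.
INSERT INTO dbo.Customer (Email, FullName, CreatedOn)
VALUES (N'john.smith@example.com', N'John A Smith', SYSDATETIME());

-- A crude probe with SOUNDEX hints at the duplicate, but a real
-- solution needs proper matching rules.
SELECT a.CustomerId, b.CustomerId, a.FullName, b.FullName
FROM dbo.Customer AS a
JOIN dbo.Customer AS b
  ON a.CustomerId < b.CustomerId
WHERE SOUNDEX(a.FullName) = SOUNDEX(b.FullName);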

Data Quality Services is designed to address the above issues. It comes as a server component that manages the data and a client application, Data Quality Client, used to manage the data quality process. Part of the client user interface is shown below.

[Screenshot: part of the Data Quality Client user interface]

The stages of use are:
1. First, rules are established in the knowledge base. Rules can be imported from a standard set, derived from sample data, or set up by the user. They include things such as:
   - spell checking data;
   - validating or standardising data;
   - defining matching rules that test the similarity of data to check for duplicates (a rough sketch of this idea follows the list).
2. Second, a Data Quality project is created to apply the rules in the knowledge base to a particular set of data in a database. The project applies the rules to cleanse the data and then provides an interactive facility for the user to approve or reject its changes. It can also apply matching rules to the data to detect and resolve duplicates.
3. Finally, Data Quality Services reports on data completeness and quality through its Profiling and Notification facilities.
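DQS matching rules are configured interactively in the Data Quality Client rather than written by hand, but conceptually a matching rule is a weighted comparison across several fields with a threshold. The query below sketches that idea in T-SQL against the illustrative tables above; the weights and the threshold are invented, and this is not how DQS implements matching internally.

SELECT IdA, IdB, MatchScore
FROM (
    SELECT ca.CustomerId AS IdA,
           cb.CustomerId AS IdB,
           -- similar-sounding names contribute 60 points
           CASE WHEN SOUNDEX(ca.FullName) = SOUNDEX(cb.FullName) THEN 60 ELSE 0 END
           -- streets that agree once 'Street' is standardised to 'St' contribute 40
         + CASE WHEN REPLACE(aa.Street, N'Street', N'St')
                   = REPLACE(ab.Street, N'Street', N'St') THEN 40 ELSE 0 END
           AS MatchScore
    FROM dbo.Customer  AS ca
    JOIN dbo.[Address] AS aa ON aa.CustomerId = ca.CustomerId
    JOIN dbo.Customer  AS cb ON cb.CustomerId > ca.CustomerId
    JOIN dbo.[Address] AS ab ON ab.CustomerId = cb.CustomerId
) AS scored
WHERE MatchScore >= 60;  -- invented threshold: flag these pairs for human review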

Data cleansing is not a one-time activity. Once the knowledge base and data quality projects have been created, they are used repeatedly and evolved over time to keep data quality levels high.

Data Quality Services comes as part of SQL Server 2012 Enterprise and Business Intelligence Editions.
