What is Data Quality?
By Enoch Moeller
Ask 10 people what "data quality" means and you're going to get 10 different (perhaps wildly different) answers. Data quality can cover a scope anywhere from point of data entry to an overall enterprise-wide approach to data. Those organizations that understand the importance of data quality typically consider it to be just one component of an overall strategy called data management.
Data quality is an approach to enterprise information that ensures your data is consistent, accurate and reliable. When considered within the context of data management, data quality is the step after you figure out what you've got, and before you integrate it with existing data. Data quality typically encompasses the following four steps:
- Cleanse. The first step is to make sure that your data is "clean" - meaning that it follows certain rules for content, field length, data types, etc. An example of a cleansing exercise would be looking at an incoming file and identifying any non-alphabetic values passed in the field that is supposed to contain the customer's first name. The resulting action is that records are either programmatically updated or flagged for manual investigation.
- Standardize. Standardization is the process of taking potentially disparate values and converting them into a "standard" set of values for each field. The most common example of this is standardizing name and address information to meet postal standards (e.g., changing Street to St, Road to Rd, 123 North Main Street to 123 N Main St, etc.), but this could also encompass product data, survey responses and so forth.
- Validate. Validation is the exercise of confirming that data being loaded is "valid" and falls within an acceptable range, meets certain rules, etc. Once again, the most common example is in postal applications, where the U.S. Postal Service's Coding Accuracy Support System (CASS) certification process could be applied to validate that an address is deliverable. Validation could also be used to enforce business rules such as "valid birth dates must be between January 1, 1899 and today."
- Match. One of the most common data quality issues is duplication of customer records. It is quite easy for data entry or record lookup errors to result in multiple records across different contact channels (Web, call center, point of sale, etc.) for the same person. Matching can be used to consolidate identical but duplicated records into a single record. Typical matching algorithms would utilize portions of name, address, email address and/or phone number. Householding is a type of matching where customer records are rolled up to a "household" level (all members of a family at an address, all employees of a company, etc.).
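As a rough illustration, the first three steps might look like the following Python sketch. The function names, the tiny abbreviation table and the sample values are assumptions made for this example; real postal standardization and CASS validation rely on much larger, vendor-maintained rule sets.

```python
from datetime import date

# Hypothetical abbreviation table; real postal rule sets are far larger.
ADDRESS_ABBREVIATIONS = {"street": "St", "road": "Rd", "north": "N"}


def cleanse_first_name(value):
    """Cleanse: trim the value and flag it for review if it is not purely alphabetic."""
    trimmed = value.strip()
    needs_review = not trimmed.isalpha()
    return trimmed, needs_review


def standardize_address(address):
    """Standardize: replace known words with their abbreviations."""
    words = address.split()
    return " ".join(
        ADDRESS_ABBREVIATIONS.get(word.lower().strip(".,"), word) for word in words
    )


def is_valid_birth_date(birth_date, today=None):
    """Validate: birth dates must fall between January 1, 1899 and today."""
    today = today or date.today()
    return date(1899, 1, 1) <= birth_date <= today


print(cleanse_first_name("J0hn"))
print(standardize_address("123 North Main Street"))
print(is_valid_birth_date(date(1898, 12, 31)))
```

Note that a strict alphabetic check also flags legitimate names such as "O'Brien", which is why flagged records go to manual investigation rather than being rejected outright. Matching is more involved and usually builds on standardized values like these.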
Why is Data Quality Important?
At the lowest level, data quality helps to ensure that your data is accurate and usable. Strategically, data quality standards help improve efficiency, reduce costs and provide the foundation for a better customer experience. In most modern organizations, data is the foundation of all operational systems. Poor quality or "dirty" data can have a significant impact across the enterprise, and that impact can be a financial one.
Let's look at some examples of the impact of data quality on an organization and its customers:
Example A
The ABC Widget Corp. has multiple purchase channels - online orders through the Web site, phone orders through their call center system and point of sale in retail stores. Chuck D. Controller would like to know how many units of ABC's Xtreme Widget V2.01 product have been sold across all channels. Unfortunately, the source systems have different descriptions for the same product and no consistent SKU across systems. Currently, when Chuck tries to pull his report, he winds up with the following:

Figure 1
Chuck may or may not be aware of all of the iterations of the product, so producing a report on the product will require additional consolidation, and may potentially result in inaccurate results if one or more records are inadvertently omitted. Now let's assume that a data quality process is implemented (using cleansing and standardization) to ensure that incoming product definitions are consistent. As a result, all of the products in Figure 1 are consolidated to one - Xtreme Widget V2.01:

Figure 2
Chuck can now easily run a report and determine the breakdown by channel of Xtreme Widget V2.01 units sold in the last month. Oh, the wonders of data quality!
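One way to picture the consolidation in Example A is a translation table that maps each incoming product description to a single canonical name before the report is run. The sketch below is only an assumption about how this could be done: the variant descriptions, channel names and unit counts are invented for illustration and are not taken from the figures.

```python
import re
from collections import defaultdict

CANONICAL_NAME = "Xtreme Widget V2.01"


def normalize(description):
    """Lowercase the description and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", description.lower())


# Hypothetical translation table, keyed by normalized description.
PRODUCT_TRANSLATION = {
    "xtremewidgetv201": CANONICAL_NAME,
    "xtrmwidget201": CANONICAL_NAME,
    "extremewidgetv201": CANONICAL_NAME,
}


def units_by_channel(sales, product):
    """Sum units per channel for every row that translates to the given product."""
    totals = defaultdict(int)
    for channel, description, units in sales:
        if PRODUCT_TRANSLATION.get(normalize(description)) == product:
            totals[channel] += units
    return dict(totals)


# Invented sample rows: (channel, description as stored in the source system, units).
sales = [
    ("web", "Xtreme Widget V2.01", 10),
    ("call center", "XTRM WIDGET 2.01", 4),
    ("retail", "Extreme Widget v2.01", 7),
    ("web", "xtreme-widget v2.01", 3),
    ("retail", "eWidget Lite", 5),
]

print(units_by_channel(sales, CANONICAL_NAME))
```

In practice, descriptions that are missing from the table would be logged for review so the table can grow over time instead of silently dropping sales from the report.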
Example B
ABC Widget's chief marketing officer has approved an outbound marketing campaign that will attempt to upgrade owners of the eWidget Lite to the eWidget Premium! product. Once again, ABC has historically not had a data quality process in place to consolidate information from various channels. As a result, there are many individuals in the customer database who have multiple records from different channels. For example, Samantha MacIntosh has the following records:

Figure 3
The campaign is executed, and poor Samantha receives three identical pieces of direct mail for the same offer. She receives all three offers on the same day, and frustrated at being bombarded with duplicate mailings, she decides to call ABC Widget and remove herself from their mailing list. The call center receives Samantha's request, does a lookup and finds the record for S. McIntosh, and opts her out of future communications. Unfortunately for Samantha, her other records do not get updated, and the next wave of the campaign sends her two more direct mail offers. By this point, Samantha is fed up with ABC Widget, and decides to take her business to their competitor, SuperWidget Manufacturing Company of America (SWMCA). Not only did ABC spend unnecessary marketing funds on duplicate mail pieces, but their lack of data quality resulted in the loss of a valuable customer to their competitor.
ABC's CMO looks at the results of their marketing campaign and sees an increase in customer attrition shortly after the second wave of the campaign executes. After some investigation, she determines that a lack of data quality has contributed to the attrition. A push is made for the implementation of a standardization and matching process on the load of all customer records from the various source systems. Samantha's name and address data is standardized and matched based on an intelligent algorithm, and her records are consolidated to a single master record. Later, a win-back campaign reaches Samantha, and she is so overwhelmed by the customer service that she once again becomes a loyal ABC customer.
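The standardize-then-match idea in Example B can be sketched as follows. This is not the "intelligent algorithm" a vendor would supply; it is a minimal illustration that assumes a match means the same standardized address, the same first initial and a similar last name. The field names, sample records and the 0.8 threshold are all hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical abbreviation table used to standardize addresses before comparing.
ABBREVIATIONS = {"street": "st", "road": "rd", "north": "n"}


def normalize_address(address):
    """Lowercase, drop periods and commas, and abbreviate known words."""
    words = address.lower().replace(".", "").replace(",", "").split()
    return " ".join(ABBREVIATIONS.get(word, word) for word in words)


def is_same_customer(a, b, threshold=0.8):
    """Match on standardized address, first initial and a similar last name."""
    same_address = normalize_address(a["address"]) == normalize_address(b["address"])
    same_initial = a["first"][:1].lower() == b["first"][:1].lower()
    similarity = SequenceMatcher(None, a["last"].lower(), b["last"].lower()).ratio()
    return same_address and same_initial and similarity >= threshold


def consolidate(records):
    """Greedy grouping: each record joins the first master record it matches."""
    masters = []
    for record in records:
        for master in masters:
            if is_same_customer(master, record):
                master["source_ids"].append(record["id"])
                # An opt-out on any source record applies to the master record.
                master["opted_out"] = master["opted_out"] or record["opted_out"]
                break
        else:
            masters.append({**record, "source_ids": [record["id"]]})
    return masters


# Invented records from three channels plus one unrelated customer.
records = [
    {"id": 1, "first": "Samantha", "last": "MacIntosh",
     "address": "123 North Main Street", "opted_out": False},
    {"id": 2, "first": "S.", "last": "McIntosh",
     "address": "123 N Main St", "opted_out": True},
    {"id": 3, "first": "Sam", "last": "Macintosh",
     "address": "123 N. Main St.", "opted_out": False},
    {"id": 4, "first": "Chuck", "last": "Controller",
     "address": "9 Ledger Rd", "opted_out": False},
]

masters = consolidate(records)
print(len(masters), masters[0]["source_ids"], masters[0]["opted_out"])
```

Carrying the opt-out flag up to the master record is the design choice that would have spared Samantha the second wave of mail. A rule this simple would also merge two different people who share an address, an initial and a similar surname, which is one reason the accuracy of any matching algorithm deserves close inspection.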
Part of a successful implementation of data quality controls within the enterprise will involve calculating the ROI of the solution. In many cases, this will be difficult to gauge, as the costs of poor data quality tend to be hidden; but keep this in mind as you plan your initiative.
How Can I Implement Data Quality in my Enterprise?
First, start by figuring out what you've got. A discovery or profiling exercise will prove invaluable as you start to design your data quality processes. This will help you identify where issues exist, find deviation from expected values and help in building a case for the all-important ROI.
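A profiling exercise can start very small. The sketch below summarizes a single column (how many values are missing, how many distinct values exist, and which values occur most often), which is often enough to reveal deviations from expected values. The function name and the sample state values are assumptions for illustration, not output from any particular profiling tool.

```python
from collections import Counter


def profile_column(values):
    """Summarize one column: volume, missing values, distinct values and top values."""
    non_empty = [value.strip() for value in values if value and value.strip()]
    counts = Counter(non_empty)
    return {
        "total": len(values),
        "missing": len(values) - len(non_empty),
        "distinct": len(counts),
        "max_length": max((len(value) for value in non_empty), default=0),
        "top_values": counts.most_common(3),
    }


# Invented sample: a "state" field that should hold one two-letter code.
states = ["NC", "nc", "North Carolina", "", "NC", None, "N.C."]
profile = profile_column(states)
print(profile)
```

Four distinct spellings of a single state in seven rows is the kind of finding that points directly to a standardization rule, and counts like these can feed the ROI case.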
Once you have a solid understanding of your data and where work is needed, it's time to start planning the processes that will actually ensure data quality. Remember, you can take a tactical approach to data quality (applying standardization to key tables, performing CASS Certification, etc.), but the largest overall benefit will come from a strategic data quality plan across the enterprise. Think about how data enters your organization and how it flows through the system(s) over time. How is that data transformed over its life? How is the data used? If the data will drive a reporting system, how would standardization benefit those reports? How are customer records consolidated from your various customer touchpoints? Is your organization subject to regulatory compliance, and, if so, how can clean data facilitate this compliance? Try to tie a dollar value to the solution to help justify the expense of implementing the solution.
There are quite a few vendors to consider. The vendor offerings fall along a spectrum from comprehensive data management suites to piecemeal code plug-ins that provide limited scope functionality to solve specific problems.
If you have a talented development team, it may make sense to develop your own processes for data standardization and cleansing, and potentially supplement the effort with the inclusion of pre-built modules for CASS certification or matching, for example. Building an address validation and certification process can be extremely complex and time-consuming, and many vendors offer good solutions that can be integrated into data flows with limited effort. For your own sanity, if you plan to build everything else, at least consider purchasing modules to handle those tasks.
At the other end of the spectrum, you may have done a great job justifying the ROI of a data quality initiative, and thus have a big budget to go out and buy a tool suite that handles the overall data management process. In this case, there are a number of comprehensive tool vendors that would be more than happy to spend time with you describing the benefits of their products and the weaknesses of all the others. Take advantage of the salespeople - get them involved in your strategic planning:
- Ask the vendors to assist with profiling exercises.
- Give the vendors a sample customer list (after signing an NDA, of course) and ask them to return a standardized, validated and matched file.
- Compare the results from each vendor to understand the strengths and weaknesses of each suite.
- Spend some extra time to thoroughly investigate the accuracy of the matching algorithms. The vendors should be more than happy to go through these exercises with you.
- Do a thorough comparison of features, functionality, support options, methods of integration, cost and performance.
- Make certain that you involve people from different groups across your organization - IT, marketing, finance, operations, etc.
Once you have chosen your approach, it's time to actually implement the solution. This will be an iterative process, and your rules will change as you identify more areas of focus. Don't expect to catch it all in the first phase. The chosen solution will drive the method of integration, but don't just look at the data. Look at overall processes and determine how best to use the capabilities enabled by your chosen tools.
Where Should it be Implemented?
Think about all of the sources of data that enter your organization. Each of those sources could potentially have data quality processes implemented. Any system that provides a method of manual data entry is a top candidate. Online forms and call center data entry screens can have standardization rules applied as the form is submitted. Free-form text values can be converted to parameterized fields. Translation tables can be built to consolidate incoming values into smaller, more accurate lists of usable information.
Now, consider all of the processes that move data from one system to another. Do you have a data warehouse? Do you have a system of record for customer information? Do you have an operational data store that feeds limited-focus data marts? The key is to try to apply the processes at the first possible stage to limit the impact of "dirty" data in your systems. The closer to the point of entry in your data flow that the data quality rules are applied, the less time your resources will have to spend correcting problems. The other factor to consider is effort of implementation. It may require much more effort to build in a process within a mainframe application than to implement the rules as data moves into or out of the mainframe to/from other systems. Weigh the implementation effort against the impact of "bad" data as you decide where your rules need to be applied.
What's Next?
Now you've got a world-class data quality process implemented across your enterprise. Data comes in clean from the points of entry, matching is performed, records are consolidated, and reporting is easier and more accurate. You've saved your organization both time and money, and have gotten yourself a big bonus and promotion as a result. Don't stop there! Data quality is an ongoing, iterative process, and there will always be room for improvement. The most effective implementation will be one that includes a closed-loop data management strategy - from discovery through data quality processes, to integration, to measurement and back to ongoing discovery. Diligence in this process and periodic reevaluation will result in a greater long-term benefit for your organization.

Enoch Moeller is VP, director of SpringBoard Operations for Quaero Corp. He has a diverse skill set that includes system administration, network design, application, Web and database development, database administration, telephony design and implementation, and enterprise application integration. He may be reached at moellere@quaero.com.