Cleaning Up Dirty Data

Many companies try to mine their mountains of data for new marketing ventures and better customer information. But are they finding gold in there — or garbage?

Information is power, as the saying goes, and if that’s true, bad information is worse than no information at all. Bad information tricks you into thinking you know something, when you often might be better off admitting that you don’t. Consider corporate databases. In an attempt to cut costs and increase efficiency, many companies are dropping buckets of cash into ambitious projects that try to “mine” a company’s database, looking for valuable information that can be turned into new products or services. The problem is that many companies are digging through the virtual equivalent of a garbage heap instead of a gold mine.

If you’ve ever received more than one copy of the same catalog, you have an inkling of what can go wrong. Due to some computer fluke or data entry error, you get listed twice in a customer database, perhaps as “J. Smith” in one place and “Jane Smith” in another. The database thinks you’re two separate people, and the company blindly sends each of you a copy. Then the company sells its mailing list to someone else, and pretty soon the rest of your junk mail starts arriving in duplicate too. For you, it’s annoying (a few extra items for the recycling heap), but for companies dropping tens of thousands of items in a single mailing, such problems add up to a lot of expense and waste.
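To make the "J. Smith" problem concrete, here is a minimal sketch of how a mailing list might be screened for likely duplicates. It is only an illustration: the record fields (`id`, `name`, `address`) and the matching rule (first initial, last name, and a normalized address) are assumptions made for this example, not a description of any vendor's method, and a rule this crude will also produce false matches that a person would need to review.

```python
from collections import defaultdict


def match_key(name, address):
    """Build a crude matching key: first initial, last name, normalized address."""
    parts = name.replace(".", "").split()
    # Collapse repeated spaces and ignore case so "12  elm st" equals "12 Elm St".
    addr = " ".join(address.lower().split())
    if not parts:
        return ("", "", addr)
    return (parts[0][0].lower(), parts[-1].lower(), addr)


def find_possible_duplicates(records):
    """Group records that share a matching key; return groups of two or more."""
    groups = defaultdict(list)
    for rec in records:
        groups[match_key(rec["name"], rec["address"])].append(rec)
    return [group for group in groups.values() if len(group) > 1]


# Hypothetical sample data for illustration only.
customers = [
    {"id": 1, "name": "J. Smith", "address": "12 Elm St"},
    {"id": 2, "name": "Jane Smith", "address": "12  elm st"},
    {"id": 3, "name": "Bob Jones", "address": "9 Oak Ave"},
]

duplicates = find_possible_duplicates(customers)
for group in duplicates:
    print("Possible duplicate:", [rec["id"] for rec in group])
```

The output of a check like this is best treated as a list of candidates for review rather than records to merge automatically, since "J. Smith" at a given address could just as easily be John.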

And it’s not just direct marketers that face these issues. According to a recent PricewaterhouseCoopers survey of 600 IT directors, fully 75 percent of companies polled have experienced problems due to faulty data — everything from missed payments and overlooked accounts receivable to flawed assumptions about their customers. According to the survey, only 37 percent of the IT directors at established businesses (54 percent at startups) felt “very confident” about the quality of their company’s information.

Software engineers have a half-joking explanation for the mistakes in databases — they attribute them to “bit rot,” as if the data physically decomposed over time, like unrefrigerated meat or old banana peels. But unless your hard drives are bathed in gamma rays or strong magnetic fields, that kind of corruption is unlikely. Chances are, the glitches come from human error, such as data entry gaffes or database engineers who don’t always follow consistent design guidelines.

Human nature being what it is, the project of cleaning up this “dirty data” typically gets put off for as long as possible — often until there’s some pressing IT project that absolutely requires reliable information. In one extreme example, the U.S. Bureau of Land Management had to start cleaning up 225 years’ worth of land-use records in the course of a Y2K compliance project. There were problems just about everywhere the staff looked, says Leslie Cone, a project manager for the bureau. Much of the old data, collected long before there were such things as electronic databases, was incomplete or inconsistent. And field offices didn’t always follow instructions when sending records to the main file — for instance, Cone says, the date field in a given record might contain nonsensical numbers like “2525” or “999999.”
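Nonsense values like the ones Cone describes are the kind of thing a simple automated check can catch. The sketch below is a hedged illustration, not the bureau's actual process: the accepted date formats and the earliest plausible year (1775, roughly 225 years before the Y2K deadline) are assumptions chosen for this example.

```python
from datetime import date, datetime

# Assumed formats for illustration; real records may use others.
ASSUMED_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")
# Assumed lower bound: roughly 225 years before the year 2000.
EARLIEST = date(1775, 1, 1)


def parse_plausible_date(raw, latest=None):
    """Return a date if raw parses in a known format and falls in range, else None."""
    latest = latest or date.today()
    for fmt in ASSUMED_FORMATS:
        try:
            parsed = datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue  # Not this format; try the next one.
        if EARLIEST <= parsed <= latest:
            return parsed
    return None


samples = ["1987-06-15", "2525", "999999", "03/04/1921"]
flagged = [s for s in samples if parse_plausible_date(s) is None]
print("Needs review:", flagged)
```

A check like this only flags suspicious entries. Deciding what the correct date should have been still takes someone who knows the records.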

The BLM used data profiling software from San Francisco-based Evoke Software to partially automate the process of finding and correcting such mistakes. Evoke’s system can flag obvious errors such as a driver’s license number entered where a Social Security number should be, or a customer whose address has been entered twice in the same file. It can also help discover relationships between databases — for instance, notifying you when two divisions of your company, keeping separate records, have a customer in common.
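The two kinds of checks described above can be sketched in a few lines. This is not Evoke's algorithm, which the article does not describe. It is a simplified illustration in which the field names (`ssn`, `name`, `zip`), the Social Security number pattern, and the rule for matching customers across divisions are all assumptions.

```python
import re

# Assumed shape of a Social Security number: 9 digits, optionally hyphenated 3-2-4.
SSN_PATTERN = re.compile(r"\d{3}-?\d{2}-?\d{4}")


def flag_bad_ssns(records):
    """Return records whose 'ssn' field does not look like a Social Security number."""
    return [r for r in records if not SSN_PATTERN.fullmatch(r.get("ssn") or "")]


def customer_key(record):
    """Assumed matching rule: same name (ignoring case) and same ZIP code."""
    return (record["name"].strip().lower(), record["zip"].strip())


def shared_customers(division_a, division_b):
    """Return keys for customers that appear in both divisions' records."""
    keys_a = {customer_key(r) for r in division_a}
    keys_b = {customer_key(r) for r in division_b}
    return sorted(keys_a & keys_b)


# Hypothetical sample data for illustration only.
sales = [
    {"id": 1, "name": "Jane Smith", "zip": "94105", "ssn": "123-45-6789"},
    {"id": 2, "name": "Bob Jones", "zip": "10001", "ssn": "D1234567"},  # license number
]
support = [
    {"id": 7, "name": "JANE SMITH ", "zip": "94105", "ssn": "123456789"},
]

flagged = flag_bad_ssns(sales)
shared = shared_customers(sales, support)
print("Suspect SSN fields:", [r["id"] for r in flagged])
print("Customers in both divisions:", shared)
```

Commercial profiling tools go well beyond pattern checks like these, but the principle is the same: describe what valid data should look like, then report everything that does not fit.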

The bad news is that products from Evoke and its competitors cost about $400,000 (or more, depending on the size of your database), so it’s unlikely your company will put one on the shopping list unless you’re undertaking a larger project that requires clean data. (Some other data profiling vendors may charge less than Evoke. See our page on data management for more information.) For the rest of us, PricewaterhouseCoopers recommends good data management practices: setting clear, companywide policies for data collection and use. In other words, everyone from the CEO on down needs to be aware of the importance of maintaining accurate information. And even if you can’t go through and correct every mistake in your existing files, an easy first step is to make sure all new data gets entered correctly. After all, your company’s information can’t give you a competitive edge unless it’s accurate enough to rely on.
