Cleaning Up Dirty Data

Many companies try to mine their mountains of data for new marketing ventures and better customer information. But are they finding gold in there — or garbage?


Information is power, as the saying goes, and if that’s true, bad information is worse than no information at all. Bad information tricks you into thinking you know something, when you often might be better off admitting that you don’t. Consider corporate databases. In an attempt to cut costs and increase efficiency, many companies are dropping buckets of cash into ambitious projects that try to “mine” a company’s database, looking for valuable information that can be turned into new products or services. The problem is that many companies are digging through the virtual equivalent of a garbage heap instead of a gold mine.

If you’ve ever received more than one copy of the same catalog, you have an inkling of what can go wrong. Due to some computer fluke or data entry error, you get listed twice in a customer database, perhaps as “J. Smith” one place and “Jane Smith” somewhere else. The database thinks you’re two separate people, and the company blindly sends each of you a copy. Then the company sells its mailing list to someone else, and pretty soon the rest of your junk mail starts arriving in duplicate too. For you, it’s annoying (a few extra items for the recycling heap), but for the companies dropping tens of thousands of items in a single mailing, such problems add up to a lot of expense and waste.
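
To make the "J. Smith" versus "Jane Smith" problem concrete, here is a minimal sketch of how a program might spot likely duplicates. The matching rule (same last name, same first initial, same address) and the sample records are assumptions made for illustration, not a description of any vendor's method.

```python
from collections import defaultdict

def match_key(record):
    """Assumed heuristic: last name + first initial + whitespace-normalized address."""
    parts = record["name"].replace(".", "").split()  # assumes a non-empty name
    first_initial = parts[0][0].lower()
    last_name = parts[-1].lower()
    address = " ".join(record["address"].lower().split())
    return (last_name, first_initial, address)

def find_likely_duplicates(records):
    """Group records that share a match key and return groups with more than one entry."""
    groups = defaultdict(list)
    for record in records:
        groups[match_key(record)].append(record)
    return [group for group in groups.values() if len(group) > 1]

# Hypothetical mailing list entries
customers = [
    {"name": "J. Smith", "address": "12 Elm St"},
    {"name": "Jane Smith", "address": "12  Elm St"},
    {"name": "John Doe", "address": "9 Oak Ave"},
]
duplicates = find_likely_duplicates(customers)
```

A rule this loose would also pair Jane Smith with a John Smith at the same address, so the groups it returns are candidates for a person to review rather than records to merge automatically.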

And it’s not just direct marketers that face these issues. According to a recent PricewaterhouseCoopers survey of 600 IT directors, fully 75 percent of companies polled have experienced problems due to faulty data — everything from missed payments and overlooked accounts receivable to flawed assumptions about their customers. According to the survey, only 37 percent of the IT directors at established businesses (54 percent at startups) felt “very confident” about the quality of their company’s information.

Software engineers have a half-joking explanation for the mistakes in databases — they attribute them to “bit rot,” as if the data physically decomposed over time, like unrefrigerated meat or old banana peels. But unless your hard drives are bathed in gamma rays or strong magnetic fields, that kind of corruption is unlikely. Chances are, the glitches come from human error, such as data entry gaffes or database engineers who don’t always follow consistent design guidelines.

Human nature being what it is, the project of cleaning up this “dirty data” typically gets put off for as long as possible — often until there’s some pressing IT project that absolutely requires reliable information. In one extreme example, the U.S. Bureau of Land Management had to start cleaning up 225 years’ worth of land-use records in the course of a Y2K compliance project. There were problems just about everywhere the staff looked, says Leslie Cone, a project manager for the bureau. Much of the old data, collected long before there were such things as electronic databases, was incomplete or inconsistent. And field offices didn’t always follow instructions when sending records to the main file — for instance, Cone says, the date field in a given record might contain nonsensical numbers like “2525” or “999999.”
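
A check for nonsensical date fields like the ones Cone describes could look something like the sketch below. The date format and the acceptable year range are assumptions chosen for illustration; the article does not say how the bureau's records were laid out.

```python
from datetime import datetime

def is_plausible_date(raw, fmt="%Y-%m-%d", earliest_year=1775, latest_year=2000):
    """Return True if raw parses in the assumed format and falls in the assumed year range."""
    try:
        parsed = datetime.strptime(raw, fmt)
    except ValueError:
        # Not a date at all in the expected format, e.g. "2525" or "999999"
        return False
    return earliest_year <= parsed.year <= latest_year

# Hypothetical date fields from incoming records
date_fields = ["1862-05-20", "2525", "999999", "2525-01-01"]
suspect = [value for value in date_fields if not is_plausible_date(value)]
```

The range check matters as much as the format check: a value can be a perfectly well-formed date and still be impossible for the records in question.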

The BLM used data profiling software from San Francisco-based Evoke Software to partially automate the process of finding and correcting such mistakes. Evoke’s system can flag obvious errors such as a driver’s license number entered where a Social Security number should be, or a customer whose address has been entered twice in the same file. It can also help discover relationships between databases — for instance, notifying you when two divisions of your company, keeping separate records, have a customer in common.
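
The two kinds of checks described above can be sketched in a few lines. This is not Evoke's software, whose internals the article does not describe; the field names, the sample records, and the choice of the Social Security number as the shared identifier are all assumptions.

```python
import re

# U.S. Social Security numbers are conventionally written as 3-2-4 digits.
SSN_PATTERN = re.compile(r"\d{3}-\d{2}-\d{4}")

def looks_like_ssn(value):
    """True if the value has the shape of a Social Security number."""
    return SSN_PATTERN.fullmatch(value.strip()) is not None

def flag_bad_ssn_fields(records):
    """Return records whose 'ssn' field holds something else, such as a license number."""
    return [r for r in records if not looks_like_ssn(r["ssn"])]

def customers_in_common(division_a, division_b, key="ssn"):
    """Return division_b records whose key also appears in division_a."""
    ids_a = {r[key] for r in division_a}
    return [r for r in division_b if r[key] in ids_a]

# Hypothetical records kept separately by two divisions
sales = [
    {"name": "Jane Smith", "ssn": "123-45-6789"},
    {"name": "John Doe", "ssn": "D1234567"},  # license number in the wrong field
]
support = [{"name": "J. Smith", "ssn": "123-45-6789"}]

flagged = flag_bad_ssn_fields(sales)
shared = customers_in_common(sales, support)
```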

The bad news is that products from Evoke and its competitors cost about $400,000 (or more, depending on the size of your database), so it’s unlikely your company will put one on the shopping list unless you’re undertaking a larger project that necessitates clean data. (However, some data profiling vendors may charge less than Evoke. See our page on data management for more information.) For the rest of us, PricewaterhouseCoopers recommends good data management practices: setting clear, companywide policies for data collection and use. In other words, everyone from the CEO on down needs to be aware of the importance of maintaining accurate information. And even if you can’t go through and correct the possible mistakes in existing files, an easy first step is to make sure all new data gets entered correctly. After all, your company’s information can’t give you a competitive edge unless it’s accurate enough to rely on.


HAL 9000 Is Ready to Take Your Order

In the 1968 film 2001: A Space Odyssey, a sentient (if ultimately haywire) shipboard computer named HAL 9000 converses with astronauts. Computers today aren’t smart enough to second-guess our actions (thankfully) or to carry on long, rambling discussions. But speech recognition software has gotten good enough to be adopted by major companies ranging from Merrill Lynch and T. Rowe Price to British Airways. These days, when you phone your broker or airline, there’s a good chance that the voice at the other end will belong to a computer — and that you’ll be able to command that computer with ordinary sentences, such as “What’s my account balance?” and “Buy 2,000 shares of IBM at $102.”

This technology is having a big financial impact on many companies’ call centers, where the software has replaced human operators in handling repetitive, informational calls asking for things like store locations and hours. It’s more consistent than even the most dedicated employee, it’s cheaper in the long term, and it never takes sick days. Moreover, it’s faster and more flexible than the previous generation of so-called interactive voice response, or IVR, systems — those annoying recordings that ask you to push 1 for sales, 2 for customer service, and so on.

How does it work? Essentially, speech recognition software listens for words rather than complete sentences. It first analyzes a stream of speech to identify phonemes — the sounds that make up words. Next it compares the phonemes to prestored sound patterns in its database to figure out what you actually said (e.g., “account” will likely be in a bank’s database, but “weather” will not). Finally the software assembles those words into a meaningful command or request for information.

For example, on a stock-trading site, if you say “buy” or “sell,” the system will then start scanning your speech for the name of a stock, a quantity, and a price. If something is missing or unclear, the software will ask you to repeat it. These programs are sophisticated enough that, with a bit of fine-tuning by your IT staff, they’re rarely flummoxed by background noise, accents, or poor pronunciation. According to Nuance, one of three major producers of voice recognition software, the technology is about 95 percent accurate at interpreting human speech on the first try, provided it’s within a well-defined subject area (such as the stock market). In practice, when the software can ask speakers for word confirmations and clarifications, the success rate rises to nearly 100 percent. For example, if you say you want to buy 100 shares of Cisco, the system might ask whether you mean tech vendor Cisco Systems or food-services company Sysco Corp. before executing the trade.
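
The last step, turning recognized words into a trade order and asking for whatever is missing, can be sketched roughly as follows. The sketch assumes the words have already been recognized and arrive as text; the tiny stock vocabulary, the function names, and the prompts are hypothetical and do not come from any vendor's product.

```python
import re

# Hypothetical vocabulary; a real system would load this from the brokerage's database.
KNOWN_STOCKS = {"ibm", "cisco"}
SOUND_ALIKES = {"cisco": ["Cisco Systems", "Sysco Corp."]}

def fill_slots(utterance):
    """Scan recognized text for an action, a quantity, a stock name, and a price."""
    slots = {"action": None, "quantity": None, "stock": None, "price": None}
    text = utterance.lower().replace(",", "")
    action = re.search(r"\b(buy|sell)\b", text)
    if action:
        slots["action"] = action.group(1)
    quantity = re.search(r"\b(\d+) shares?\b", text)
    if quantity:
        slots["quantity"] = int(quantity.group(1))
    price = re.search(r"\$(\d+(?:\.\d+)?)", text)
    if price:
        slots["price"] = float(price.group(1))
    for word in re.findall(r"[a-z]+", text):
        if word in KNOWN_STOCKS:
            slots["stock"] = word.upper()
            break
    return slots

def next_prompt(slots):
    """Return a follow-up question, or None when the order is complete and unambiguous."""
    missing = [name for name, value in slots.items() if value is None]
    if missing:
        return "Please repeat the " + missing[0] + "."
    candidates = SOUND_ALIKES.get(slots["stock"].lower())
    if candidates:
        return "Did you mean " + " or ".join(candidates) + "?"
    return None

order = fill_slots("Buy 2,000 shares of IBM at $102")
```

Limiting the vocabulary to one subject area is what keeps a scan like this tractable: the program only has to decide among the handful of words it expects to hear.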

As you might expect, technology like this doesn’t come cheap. A full-scale voice recognition system for a major corporation can cost as much as several million dollars, depending on call volume and what features you want. But because of the speed and efficiency gains, it typically pays for itself in about six months, says William Meisel, president of speech-industry consulting firm TMA Associates. In addition to Nuance (NUAN), both SpeechWorks (SPWX) and IBM (IBM) have systems on the market, but you’ll most likely buy the technology through a telecom or IVR vendor such as Nortel (NT), Edify (SONE), or InterVoice-Brite (INTV). These companies license the underlying speech recognition technologies and can integrate them into your existing telecom setup. You’ll still need human operators to handle more complicated customer requests and to deal with cases where you want to provide white-glove customer service. But for simple transactions, routine account questions, order status inquiries, and the like, speech recognition is a safe bet.

One caveat: Don’t bother putting your money into speech-activated websites just yet. Many telecom carriers, as well as startups like TellMe and BeVocal, are making bold promises about a “voice Web,” where we will be able to surf the Internet by telephone. But these sites are still in their infancy and not yet widespread enough to be useful, says TMA Associates’ Meisel. Besides, who really wants to listen to a computer read a webpage over the phone?

As voice recognition technology matures, new uses for it will continue to evolve. In the meantime, today’s software may not be good enough to carry on breezy discussions a la HAL 9000, but it can help you keep costs down and improve customer service — and that’s an idea whose time definitely has come.

Mea Culpa: In a previous version, we mistakenly called the HAL 9000 the HAL 2000.


Are Home PCs a Backdoor Into Your Corporate Network?

When it comes to network security, your corporate IT department probably has the company’s computers locked up in the technological equivalent of a medieval fortress. Your systems are likely ensconced behind a firewall, with antivirus monitoring software patrolling the perimeters and an array of additional security measures fending off unwanted intruders. But what you may not know is that, all the while, your employees’ home computers and laptops are sitting out in the middle of the battlefield, unarmed and unprotected.

That’s the recent assessment of corporate computer security made by computer scientists at Carnegie Mellon’s CERT Coordination Center, which tracks computer security threats and disseminates information on how to protect against attacks. According to CERT, the number of hacker attacks on home computers has risen sharply this year. In many cases, hackers aren’t going after personal files but are using the computers to gain access to your corporate network.

Home computers aren’t inherently more vulnerable than work PCs. The trouble is that home users generally don’t keep their systems up-to-date with the latest security fixes and antivirus software, according to CERT. Without a firewall and proper protective software, telecommuters can unknowingly infect their systems with a pernicious virus simply by opening an e-mail.

Even worse, many home users don’t understand that their broadband connections (DSL services or cable modems) make them more vulnerable to hackers. That’s because these “always-on” connections leave your computer attached to the Internet 24 hours a day. And unlike dial-up modem connections, which assign your computer a new “address” on the Internet each time you connect, broadband providers often assign a permanent Internet address to each customer. That makes broadband-connected computers sitting ducks, because hackers can easily target and retarget them. As the number of broadband users increases, you can count on one thing: more and more hacker attacks on home PCs.

What’s an IT manager to do? First of all, make sure your employees have the latest software security patches. Windows users can visit Microsoft’s Windows Update page and download a program that will automatically check for security patches and other updates to the Windows operating system. Those with Macs (AAPL) can check Apple’s security page for guidelines and links to updated software.

Second, telecommuters should use antivirus software. The most popular programs are made by Symantec and Network Associates; their programs sell for about $30 to $40. In a pinch, you can also use free online virus scanners that run on your browser. Symantec (SYMC) and Network Associates (NETA) both offer these on their websites.

The third step, if you’ve got a broadband connection to the Internet, is to install a firewall. This can be either a hardware-based firewall box or a software program. Both accomplish the same thing, which is to prevent hackers from connecting to your computer, but software is less expensive (although it will use some of your PC’s resources while it’s running). The above-mentioned antivirus vendors sell firewall software (Network Associates’ McAfee Firewall is $30 and Symantec Personal Firewall is $50), and I’ve also had good results with a product called BlackICE Defender ($40 from Network ICE).

Finally, CERT’s website offers additional tips for increasing the security of your home PC. It’s worth spending 15 or 20 minutes to read through.

While no security measure can offer 100 percent protection, taking a few simple steps can greatly reduce the chance that a home PC will fall prey to hackers. More important, it just might save your company network.
