Mining the catalog

RLG's RedLightGreen Project

What happens when you take a massive database of bibliographic descriptions and redesign it for the Web, not just as a resource for librarians, but as a tool for undergraduate students and the public at large? To put it another way: "How can we strip away the ‘librariness’ of the catalog so it looks more like what students expect?" says Jim Michalko, president of RLG.

Those are among the questions that RLG’s ambitious RedLightGreen project aims to address. After more than a year of effort, the project has already generated some intriguing answers.

One thing is clear: the finished application will look and act a lot more like Google or Amazon.com than a traditional library catalog, even though it will have catalog records at its heart. It will also open up new uses for bibliographic data. Instead of simply retrieving call numbers and shelf status, RedLightGreen can function as a potent tool for information discovery and for identifying the most authoritative research sources.

The RedLightGreen project is a whole new way of thinking about library catalogs. Catalogs today are optimized for inventory control and transaction management, not necessarily information discovery. By taking the large, multi-institution database that is the RLG Union Catalog and mining it for conceptual relationships and holdings data, RedLightGreen aims to go beyond what either catalogs or Internet search engines can provide.

Getting results

Although development is still underway (see timeline), RLG has already learned volumes from the project. The first lesson is that the RLG Union Catalog is a rich, untapped lode of data on which books and authors are authoritative sources of research information.

What we’ve learned from the RedLightGreen project

We’ve learned more about what undergraduate users want from online information resources
We’ve learned how data mining software can uncover valuable new information hidden in the RLG Union Catalog
We’ve learned how to provide access to a wealth of complex information through a simple, easy-to-use interface
We’ve discovered new opportunities for using bibliographic data to help end users find authoritative sources of research information
We’ve learned what’s involved in the complicated process of transforming MARC records to XML
We’ve learned about Functional Requirements for Bibliographic Records (FRBR), an emerging model for distinguishing between a work and its various editions
We’ve confirmed how RLG can take a leading role in envisioning, building, and testing innovative new information services

RLG’s 23-year-old Union Catalog encompasses more than 126 million bibliographic records, representing 42 million unique titles. It provides unparalleled coverage across subjects and material types in more than 370 languages, from hundreds of libraries worldwide.

RedLightGreen can use this data to put the most widely held items near the top of any search results list—helping users to zero in on the most credible books and authors quickly. If a book appears in dozens of libraries’ collections, it’s a good bet that the book is considered an important source of information in its subject area: its selection by dozens of librarians is an implicit endorsement. By contrast, an item held by only one library may be of interest to Ph.D. candidates and specialists, but is probably less interesting to a general audience.
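The holdings-based ordering described above can be sketched in a few lines of Python. This is only an illustration, not RLG's implementation: the function name, the input shape (pairs of title and holding library), and the sample data are all assumptions made for the example.

```python
from collections import defaultdict


def rank_by_holdings(holdings):
    """Order titles by how many distinct libraries hold them, most widely held first."""
    holders = defaultdict(set)
    for title, library in holdings:
        holders[title].add(library)
    # Sort by descending number of holding libraries; break ties alphabetically.
    return sorted(holders, key=lambda title: (-len(holders[title]), title))


# Hypothetical (title, holding library) pairs, invented for illustration.
sample_holdings = [
    ("A Widely Held Survey", "Library A"),
    ("A Widely Held Survey", "Library B"),
    ("A Widely Held Survey", "Library C"),
    ("A Specialist Monograph", "Library A"),
]

ranked = rank_by_holdings(sample_holdings)
print(ranked)
```

Counting distinct libraries, rather than raw records, keeps a library that catalogs the same title twice from inflating that title's rank.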

RLG Union Catalog data can also be used to evaluate the relevance of various items to particular keyword searches and to organize search results. For example, one book might be listed under three different subject headings at three different libraries. By comparing these three records and the words used in the book’s title, one can discover implicit relationships between all of the words. Data mining, provided by Recommind Inc.’s MindServer™ software, extends that process to the entire database of 42 million items, grouping related words together and making new connections between subject headings, titles, and keywords.
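To make the idea of implicit word relationships concrete, here is a deliberately naive co-occurrence count in Python. It is not how MindServer works, since the article does not describe that software's internals. It simply pools the subject headings that different libraries assigned to the same title and links each title word to each heading word. The function name, record shape, and sample data are assumptions.

```python
from collections import Counter, defaultdict


def link_title_words_to_headings(records):
    """Count how often a title word appears alongside a subject-heading word."""
    pooled = defaultdict(set)  # title -> heading words from every library's record
    for title, heading in records:
        pooled[title].update(heading.lower().split())

    links = Counter()
    for title, heading_words in pooled.items():
        for title_word in set(title.lower().split()):
            for heading_word in heading_words:
                links[(title_word, heading_word)] += 1
    return links


# Hypothetical: one book cataloged under three headings at three libraries.
sample_records = [
    ("Riots in the City", "Conscription"),
    ("Riots in the City", "New York History"),
    ("Riots in the City", "Civil War"),
]

links = link_title_words_to_headings(sample_records)
print(links[("riots", "conscription")], links[("riots", "york")])
```

Run over millions of titles, counts like these would begin to show which keywords tend to travel with which headings, though a production system would also need stop-word handling and some statistical weighting.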

For example, a student might enter a search for the keywords "Civil War" without specifying the American, Spanish, or other civil wars. Using Recommind, RedLightGreen can organize the results in clusters of related items, letting the student pick which civil war interests her. At the same time the application can insert more specific, scholarly subject classification terms into the search that have been derived from the MindServer data. A search for "New York riots" would turn up records pertaining to the Irish draft riots of 1863, with subject headings such as "New York—History—Civil War, 1861-1865" and "Civil War, 1861-1865—Fiction"—even though those headings don’t match the keywords exactly.

In the finished application, both holdings information and Recommind-generated relevance data can be used to sort search results. RLG expects that this will produce more pertinent, authoritative search results than can be delivered by any single institution’s catalog—or by an Internet search engine.
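One simple way to combine the two signals is a weighted blend, sketched below in Python. The article does not say how RedLightGreen will weigh holdings against relevance, so the formula, the `weight` parameter, the field names, and the sample values are all assumptions.

```python
def sort_results(results, weight=0.5):
    """Sort results by a blend of relevance (0..1) and normalized holdings count."""
    if not results:
        return []
    # Scale holdings to 0..1 so both signals are comparable; avoid dividing by zero.
    max_holdings = max(r["holdings"] for r in results) or 1

    def score(result):
        popularity = result["holdings"] / max_holdings
        return weight * result["relevance"] + (1 - weight) * popularity

    return sorted(results, key=score, reverse=True)


# Hypothetical search results with made-up relevance scores and holdings counts.
sample_results = [
    {"id": "a", "relevance": 0.9, "holdings": 2},
    {"id": "b", "relevance": 0.6, "holdings": 40},
    {"id": "c", "relevance": 0.2, "holdings": 10},
]

blended_order = [r["id"] for r in sort_results(sample_results)]
relevance_only = [r["id"] for r in sort_results(sample_results, weight=1.0)]
print(blended_order, relevance_only)
```

With equal weighting, the widely held item "b" moves ahead of the slightly more relevant but rarely held item "a". Setting `weight=1.0` falls back to pure relevance order.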

What students want

Of course, mining the information latent within the RLG Union Catalog and making practical use of it are two different things. However, early indications suggest that RedLightGreen’s enhanced searches will interest undergraduates.

RLG conducted two separate user studies with undergraduates, using mockups of the RedLightGreen application. The results showed clearly that undergraduates’ needs are very different from those of the librarians and scholars who have, up to now, been RLG’s primary end users. "Students live in a different universe than we do," says RLG information architect Arnold Arcolio.

Undergraduates don’t use the precise terminology of Library of Congress classifications. They may be thrown by library terminology: Several students in the study thought that a link to "scores" pertained to sports scores, not musical scores, while others thought a link to "maps" would provide maps and directions to the library, not items from map collections.

Although students aren’t discriminating about library classifications, they are concerned about finding sources for their research, and Web searches aren’t meeting that need. In this respect, RLG’s user studies echo the findings of a 2002 study by the Digital Library Federation and Outsell, Inc. That study found that existing information sources available to students are falling far short of their expectations in terms of quality, ease of access, subject coverage, and other aspects they considered most important. Ninety-one percent of students felt that high-quality information sources were very important; only 51 percent felt that available electronic information sources are meeting that need. A study published by the Online Computer Library Center (OCLC) in June 2002 confirms these findings, showing that accuracy is the most important attribute of online information for students—yet only half of those surveyed found that information on the Web is acceptable in this regard.

Delivering the goods

According to RLG’s user studies, undergraduates are not concerned about detailed edition information: What matters to them is getting the most recent English edition, and getting it as quickly as possible.

Therefore, RLG needed a way to organize the vast number of records in the RLG Union Catalog without overwhelming users with a deluge of information about different editions. RLG turned to the Functional Requirements for Bibliographic Records (FRBR), an emerging model proposed by the International Federation of Library Associations and Institutions. FRBR distinguishes between a work, its expressions (e.g. translations), manifestations of those expressions (specific editions), and items (specific copies). The RedLightGreen database collapses FRBR’s four levels into just two, displaying a work and various manifestations of that work. This approach will reduce a potentially overwhelming number of editions into a smaller, more manageable set of works that match a user's search terms.
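The two-level display can be pictured as a grouping step, sketched here in Python. Matching editions to a work is much harder in practice (translations, for instance, carry different titles), and the article does not describe RLG's matching rules. The normalized author-and-title key, the field names, and the sample records are assumptions.

```python
from collections import defaultdict


def group_by_work(records):
    """Group edition-level records under a simple (author, title) work key."""
    works = defaultdict(list)
    for record in records:
        key = (record["author"].strip().lower(), record["title"].strip().lower())
        works[key].append(record)
    # Within each work, list the most recent edition first.
    for editions in works.values():
        editions.sort(key=lambda r: r["year"], reverse=True)
    return dict(works)


# Hypothetical edition records, invented for illustration.
sample_editions = [
    {"author": "Doe, Jane", "title": "A History of Bridges", "year": 1980},
    {"author": "Doe, Jane", "title": "A history of bridges ", "year": 2001},
    {"author": "Roe, Richard", "title": "Tunnels", "year": 1995},
]

works = group_by_work(sample_editions)
for (author, title), editions in works.items():
    print(author, "|", title, "|", [e["year"] for e in editions])
```

Sorting each group newest-first reflects the user-study finding that undergraduates mainly want the most recent edition.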

Many RedLightGreen users will be facing imminent deadlines, so shelf status and location information are particularly urgent. If a needed book is on the shelf in the university’s library, students indicated that they would go get it. In this respect, RedLightGreen’s weighting of search results may have a second benefit: Widely held items are more likely to appear in the student’s home library. "We’re hoping that it will not only establish the credibility of sources, but will also establish a certain level of accessibility," says RLG software development manager Judith Bush. Shelf status and location are not recorded in the RLG Union Catalog, however, so RedLightGreen will need to connect with local online public access catalogs to retrieve this information.

RLG’s user studies showed that the undergraduates aren’t likely to use interlibrary loan—many weren’t even aware that this resource was available to them. They might, however, go to a neighboring college or public library if they knew the book was available. In this, RLG’s test subjects echoed the OCLC study, which found that over half of surveyed students wanted some way to search other libraries’ collections, and that 72 percent of those would go to a nearby library to pick up a book. By contrast, RLG found, undergraduates are unlikely to order the book from an online bookseller, even if the application provides a link to do so.

Students were particularly intrigued by the possibility of creating formatted, ready-to-use bibliographies from their search results. RedLightGreen will support this feature with a variety of citation formats: University of Chicago (Turabian), Modern Language Association (MLA), and American Psychological Association (APA).

Students are particularly keen to have access to journal articles in addition to monographs. This feature is outside the scope of RedLightGreen, as the RLG Union Catalog does not include journal citations. However, it may be possible to link RedLightGreen searches to journal databases—or even to the broader Web—in future versions of the application.

Catalogs: The next generation

RedLightGreen aims to deliver useful responses to broad, unsophisticated keyword queries. The interface is simplified and easy to use, less like an online catalog query form with many different fields and more like a search engine, with a single keyword entry field and a "search" button. But behind this interface is the power and depth of one of the world’s largest bibliographic databases.

For students as well as Web users in general, RedLightGreen will be a window into information not previously available online. "The RLG Union Catalog’s unique value is the breadth of unusual things it contains," says independent search engine consultant Avi Rappoport, of SearchTools.com. Many of the resources RedLightGreen will offer are unmatched by any search engine. "In using the Web, you start getting annoyed by how much is missing," says Rappoport. RedLightGreen can help fill in the gaps by supplying detailed information on authoritative research sources.

With RedLightGreen, RLG and its members are in a unique position to redefine the library catalog—and to create a brand new kind of information resource. No one is currently providing this kind of research information on the public Web, without fees, and with the kind of usable interface expected by Web users. With RedLightGreen, RLG is creating an innovative new application—and in the process is discovering new opportunities for using bibliographic data.

"This is the library community’s chance to reinvent what a catalog is," says Michalko.

Last updated 15 January 2003

Timeline

Through most of its 23-year history, the RLG Union Catalog has been used primarily by librarians and scholars within academic or research institutions. It is currently available to subscribing institutions through a variety of interfaces—RLG's Web-based Eureka® interface, telnet connections to RLIN®, and a Z39.50 gateway.

A grant from The Andrew W. Mellon Foundation enabled RLG, in late 2001, to begin envisioning what form the RLG Union Catalog might take if it were a Web application aimed at undergraduates and nonspecialist researchers. "Our experimental instincts were encouraged by the Mellon Foundation," says RLG president Jim Michalko, who notes the foundation’s support for a wide range of academic information-sharing projects. Brainstorming, research, and design work by RLG staff and outside consultants marked the early phases of the project, then known as the "Union Catalog on the Web."

RLG began sketching more definite outlines for the application, nicknamed "RedLightGreen" by RLG staffers, in spring 2002. Working in part with outside consultants, RLG developed a functional specification, a database design, and a customized XML expression of the MARC data structure used in the RLG Union Catalog. RLG and outside consultants designed "wireframes" (mockups) of the Web application, and conducted two rounds of user tests with students from Stanford, San Jose State, and Santa Clara Universities. RLG also contracted with Recommind to use that company's MindServer software in the RedLightGreen pilot project. MindServer will enable fast keyword searching, enhance relevance ranking for search results, and provide the ability to expand searches to related categories.

As of late 2002, RLG was in the process of extracting about 4 million records from the RLG Union Catalog (a subset of the data) and loading them into the RedLightGreen database in order to test MindServer's performance, XML conversion, and other technical functions. RLG is also building the Web application that will be RedLightGreen's public face. Pilot deployments using the full RLG Union Catalog database are scheduled to begin in fall 2003 at Swarthmore College, New York University, and Columbia University.

Under the Hood

RedLightGreen is built on an IBM DB2 database containing XML data. XML offers many advantages: it is an adaptable, extensible data format, and it is widely accepted. "Once you have data in XML format you have lots of flexibility in using that data internally or delivering it to outside partners," says RLG dataloads specialist Joe Altimus.

However, extracting bibliographic data from the RLG Union Catalog's current mainframe databases to an XML format was "far less straightforward than we expected," according to RLG software development manager Judith Bush. Character set encoding was one problem: Data is stored in the mainframe database in EBCDIC format and covers many European, Asian, and Middle Eastern scripts (including Arabic, Hebrew, Cyrillic, Chinese, Japanese, and Korean). That data needed to be converted to UTF-8, an encoding of the Unicode™ character set suited to 8-bit environments and XML. "What RLG learned was that it was a very complex process to write all the rules needed to translate the Union Catalog EBCDIC data into UTF-8, particularly for the Asian and Middle Eastern scripts," says Altimus.
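For plain Latin-script text, the EBCDIC-to-UTF-8 step is straightforward, as this minimal Python sketch shows. It assumes the standard `cp037` code page (EBCDIC, US/Canada) that ships with Python. The article does not say which code pages or custom rules RLG used, and the non-Latin scripts it mentions are exactly what a one-line conversion like this does not cover.

```python
def ebcdic_to_utf8(data: bytes, codec: str = "cp037") -> bytes:
    """Decode EBCDIC bytes to text, then re-encode that text as UTF-8."""
    text = data.decode(codec)
    return text.encode("utf-8")


# Simulate mainframe data by encoding a sample string as EBCDIC (code page 037).
mainframe_bytes = "Union Catalog café".encode("cp037")

utf8_bytes = ebcdic_to_utf8(mainframe_bytes)
print(mainframe_bytes != utf8_bytes)  # the byte values differ between encodings
print(utf8_bytes.decode("utf-8"))
```

The hard part RLG describes lies in the mapping rules for scripts that have no ready-made codec, not in the decode-then-encode pattern itself.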

Also challenging was the process of coming up with an XML format, defined in a document type definition (DTD), that supported the full range of features needed by RedLightGreen. RLG started with the Library of Congress MARC XML format. However, this format couldn't effectively validate many of the records stored in the mainframe database, because RLG's database adds more than 40 custom fields to those defined by the MARC 21 standard. In adding these fields to the DTD, RLG learned that some mainframe database element names could not be used because they violated XML rules for element names.
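The element-name problem is easy to reproduce. XML names may not begin with a digit or contain spaces, among other restrictions. The Python sketch below checks a simplified, ASCII-only subset of those rules (the full XML specification also allows many non-ASCII characters, as well as colons for namespaces). The candidate names are hypothetical and are not RLG's actual field names.

```python
import re

# Simplified ASCII-only rule: start with a letter or underscore,
# then letters, digits, periods, hyphens, or underscores.
_ASCII_XML_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*")


def is_valid_element_name(name: str) -> bool:
    """Return True if name fits the simplified ASCII subset of XML name rules."""
    return _ASCII_XML_NAME.fullmatch(name) is not None


# Hypothetical candidate element names, invented for illustration.
candidates = ["title", "245", "call number", "_local", "field-950"]
checked = {name: is_valid_element_name(name) for name in candidates}
print(checked)
```

A purely numeric name such as "245" fails because it starts with a digit, which is why a numeric field tag cannot serve directly as an XML element name.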

Eventually, RLG developed its own XML DTD for MARC records—an iterative process that required modifying the initial Library of Congress DTD, testing it on sample data, and then modifying it again. The current DTD (version 16 of RLG's XML format for MARC records) is somewhat "looser" than the Library of Congress's, according to Altimus, but works more effectively with the data actually contained within the RLG Union Catalog.

With more than 2,000 subfield elements removed, RLG's DTD is also about 20 percent the size of the original Library of Congress DTD, so data conversion and XML validation go correspondingly faster. (Because RLG has already done MARC validation of the field tags, indicators, and subfields in its mainframe databases, it wasn't necessary to repeat this validation when migrating the data to XML. Therefore, the LC MARC subfields could be removed without presenting problems. At some point in the future, however, if data no longer enters the RedLightGreen database from a prevalidated MARC source such as the RLG Union Catalog, RLG will have to develop a more rigorous MARC validation system.)

While the RLG Union Catalog is unusually large and heterogeneous, many other libraries may face similar issues when migrating their catalogs from legacy database systems to XML, particularly if their records weren't validated against the MARC standard as they were entered. Indeed, the Library of Congress has recently published a simplified XML schema and a number of migration tools to assist libraries with solving such migration problems.

Glossary of terms

Data mining: The process of analyzing large amounts of data in order to extract new kinds of useful information (such as implicit relationships between different pieces of information).
DTD: Document Type Definition. A file that defines the elements and data structure contained in an XML document.
EBCDIC: Extended Binary-Coded-Decimal Interchange Code. The code used on IBM mainframe computers to represent characters (such as letters, numbers, and symbols).
Eureka®: RLG's Web-based search interface for both novice and experienced users. Eureka provides access to an array of RLG information resources, including the RLG Union Catalog.
FRBR: Functional Requirements for Bibliographic Records. A model for organizing bibliographic information proposed by the International Federation of Library Associations and Institutions. FRBR distinguishes between a work, its expressions (e.g. translations), manifestations of those expressions (specific editions), and items (specific copies).
MARC: MAchine-Readable Cataloging, a standard for storing bibliographic data in electronic databases. The standard, which first emerged from a Library of Congress initiative in the 1960s, is used by most library catalogs today. The current version is MARC 21.
OPAC: Online Public Access Catalog. Any electronic library catalog that can be used by library patrons.
RedLightGreen: An application in development by RLG that aims to use bibliographic data as a way of providing undergraduates with authoritative sources of research information.
RLG Union Catalog: RLG's 23-year-old bibliographic database, containing records on more than 42 million different titles at hundreds of institutions.
RLIN®: Research Libraries Information Network. A bibliographic information system offered by RLG that provides access to a number of RLG resources, including the RLG Union Catalog. RLIN is scheduled to be replaced by a Windows-based client in 2003.
Unicode™: A widely used standard for digitally representing characters in a variety of Western, Middle Eastern, and Asian languages.
UTF-8: A way of encoding Unicode characters for use on computer systems such as Unix and Linux.
Validation: The process of making sure that the data within an XML document is formatted properly according to a DTD.
XML: eXtensible Markup Language. A widely used standard from the World Wide Web Consortium (W3C) that facilitates the interchange of data between computer applications. XML is similar to the language used for Web pages, the HyperText Markup Language (HTML), in that both use markup codes (tags). Computer programs can automatically extract data from an XML document, using its associated DTD as a guide.
Z39.50: An international standard, maintained by the Library of Congress, for searching and retrieving information from remote databases.
