Dylan Tweney

Mining the catalog



RLG's RedLightGreen Project

Mining the Catalog




What happens when you take a massive database of bibliographic descriptions     and redesign it for the Web, not just as a resource for librarians, but as     a tool for undergraduate students and the public at large? To put it another     way: "How can we strip away the ‘librariness’ of the catalog so it looks     more like what students expect?" says Jim Michalko, president of RLG.

Those are among the questions that RLG’s ambitious            RedLightGreen project aims to address. After more than a year of            effort, the project has already generated some intriguing answers.

One thing is clear: the finished application will look and act a            lot more like Google or Amazon.com than a traditional library            catalog, even though it will have catalog records at its heart. It            will also open up new uses for bibliographic data. Instead of            simply retrieving call numbers and shelf status, RedLightGreen can            function as a potent tool for information discovery and for            identifying the most authoritative research sources.

The RedLightGreen project is a whole new way of thinking about library catalogs. Catalogs today are optimized for inventory control and transaction management, not necessarily information discovery. By taking the large, multi-institution database that is the RLG Union Catalog and mining it for conceptual relationships and holdings data, RedLightGreen aims to go beyond what either catalogs or Internet search engines can provide.

Getting results

Although development is still underway (see timeline), RLG has already learned volumes from the project. The first lesson is that the RLG Union Catalog is a rich, untapped lode of data on which books and authors are authoritative sources of research information.

What we’ve learned from the RedLightGreen project

  • We’ve learned more about what undergraduate users want from online information resources
  • We’ve learned how data mining software can uncover valuable new information hidden in the RLG Union Catalog
  • We’ve learned how to provide access to a wealth of complex information through a simple, easy-to-use interface
  • We’ve discovered new opportunities for using bibliographic data to help end users find authoritative sources of research information
  • We’ve learned what’s involved in the complicated process of transforming MARC records to XML
  • We’ve learned about Functional Requirements for Bibliographic Records (FRBR), an emerging standard for distinguishing between various editions
  • We’ve confirmed how RLG can take a leading role in envisioning, building, and testing innovative new information services

RLG’s 23-year-old Union Catalog encompasses more than 126 million bibliographic     records, representing 42 million unique titles. It provides unparalleled coverage     across subjects and material types in more than 370 languages, from hundreds     of libraries worldwide.

RedLightGreen can use this data to put the most widely held items            near the top of any search results list—helping users to zero            in on the most credible books and authors quickly. If a book            appears in dozens of libraries’ collections, it’s a            good bet that the book is considered an important source of            information in its subject area: its selection by dozens of            librarians is an implicit endorsement. By contrast, an item held by            only one library may be of interest to Ph.D. candidates and            specialists, but is probably less interesting to a general            audience.
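The holdings-based ordering described above can be sketched in a few lines of Python. This is only an illustration, not RLG's implementation: the `Title` class, its field names, and the sample titles and counts are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Title:
    """One unique title plus the number of libraries holding it (hypothetical shape)."""
    name: str
    holding_libraries: int


def rank_by_holdings(titles):
    """Return titles ordered from most widely held to least widely held."""
    # sorted() is stable, so titles with equal counts keep their input order.
    return sorted(titles, key=lambda t: t.holding_libraries, reverse=True)


# Made-up sample data, for illustration only.
results = [
    Title("Specialist monograph", 1),
    Title("Standard survey", 48),
    Title("Regional study", 7),
]
ranked = rank_by_holdings(results)
print([t.name for t in ranked])
```

A production system would presumably blend this signal with keyword relevance rather than sort on holdings alone.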

RLG Union Catalog data can also be used to evaluate the relevance of various     items to particular keyword searches and to organize search results. For example,     one book might be listed under three different subject headings at three different     libraries. By comparing these three records and the words used in the book’s     title, one can discover implicit relationships between all of the words. Data     mining, provided by Recommind Inc.’s MindServer™ software, extends     that process to the entire database of 42 million items, grouping related     words together and making new connections between subject headings, titles,     and keywords.
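To make the idea of "implicit relationships" concrete, here is a minimal co-occurrence count in Python. It is a deliberately simple stand-in, not a description of how MindServer works (the article does not say), and the record layout and sample records are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(records):
    """Count how often two terms appear together on the same record."""
    counts = Counter()
    for record in records:
        # Pool the title words and subject headings of one record into a set of terms.
        terms = {w.lower() for w in record["title"].split()}
        terms.update(h.lower() for h in record["subjects"])
        # Every unordered pair of terms on the record counts as one co-occurrence.
        for pair in combinations(sorted(terms), 2):
            counts[pair] += 1
    return counts


# Three hypothetical records for the same book, catalogued under different headings.
records = [
    {"title": "Draft Riots", "subjects": ["new york history"]},
    {"title": "Draft Riots", "subjects": ["civil war 1861-1865"]},
    {"title": "Draft Riots", "subjects": ["conscription"]},
]
counts = cooccurrence_counts(records)
print(counts.most_common(3))
```

Even this toy version links the title words to three different headings, which is the kind of connection that keyword matching alone would miss.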

For example, a student might enter a search for the keywords "Civil War" without specifying the American, Spanish, or other civil wars. Using Recommind, RedLightGreen can organize the results in clusters of related items, letting the student pick which civil war interests her. At the same time, the application can insert into the search more specific, scholarly subject classification terms derived from the MindServer data. A search for "New York riots" would turn up records pertaining to the New York draft riots of 1863, with subject headings such as "New York--History--Civil War, 1861-1865" and "Civil War, 1861-1865--Fiction", even though those headings don’t match the keywords exactly.
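One simple way to picture the clustering step is to group results by the leading element of each subject heading. The Python sketch below does exactly that; it assumes headings are written with "--" between subdivisions, uses invented records, and is not meant to represent Recommind's actual clustering method.

```python
from collections import defaultdict


def cluster_by_heading(records):
    """Group record titles under the first element of each subject heading."""
    clusters = defaultdict(list)
    for record in records:
        for heading in record["subjects"]:
            # "United States--History--Civil War, 1861-1865" -> "United States"
            top = heading.split("--")[0].strip()
            if record["title"] not in clusters[top]:
                clusters[top].append(record["title"])
    return dict(clusters)


# Hypothetical results for the keyword search "civil war".
hits = [
    {"title": "Book A", "subjects": ["United States--History--Civil War, 1861-1865"]},
    {"title": "Book B", "subjects": ["Spain--History--Civil War, 1936-1939"]},
    {"title": "Book C", "subjects": ["United States--History--Civil War, 1861-1865--Fiction"]},
]
clusters = cluster_by_heading(hits)
print(clusters)
```

The student would then see one group per country and could pick the civil war she has in mind.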

In the finished application, both holdings information and            Recommind-generated relevance data can be used to sort search            results. RLG expects that this will produce more pertinent,            authoritative search results than can be delivered by any single            institution’s catalog—or by an Internet search engine.

What students want

Of course, mining the information latent within the RLG Union            Catalog and making practical use of it are two different things.            However, early indications suggest that RedLightGreen’s            enhanced searches will interest undergraduates.

RLG conducted two separate user studies with undergraduates, using            mockups of the RedLightGreen application. The results showed            clearly that undergraduates’ needs are very different from            those of the librarians and scholars who have, up to now, been            RLG’s primary end users. "Students live in a different            universe than we do," says RLG information architect Arnold            Arcolio.

Undergraduates don’t use the precise terminology of Library            of Congress classifications. They may be thrown by library            terminology: Several students in the study thought that a link to            "scores" pertained to sports scores, not musical            scores, while others thought a link to "maps" would            provide maps and directions to the library, not items from map            collections.

Although students aren’t discriminating about library            classifications, they are concerned about finding sources for their            research, and Web searches aren’t meeting that need. In this            respect, RLG’s user studies echo the findings of a 2002 study            by the Digital Library Federation and Outsell, Inc. That study            found that existing information sources available to students are            falling far short of their expectations in terms of quality, ease            of access, subject coverage, and other aspects they considered most            important. Ninety-one percent of students felt that high-quality            information sources were very important; only 51 percent felt that            available electronic information sources are meeting that need. A            study published by the Online Computer Library Center (OCLC) in            June 2002 confirms these findings, showing that accuracy is the            most important attribute of online information for            students—yet only half of those surveyed found that            information on the Web is acceptable in this regard.

Delivering the goods

According to RLG’s user studies, undergraduates are not            concerned about detailed edition information: What matters to them            is getting the most recent English edition, and getting it as            quickly as possible.

Therefore, RLG needed a way to organize the vast number of records in the     RLG Union Catalog without overwhelming users with a deluge of information     about different editions. RLG turned to the Functional Requirements for Bibliographic     Records (FRBR), an emerging model proposed     by the International Federation of Library Associations and Institutions.     FRBR distinguishes between a work, its expressions (e.g. translations), manifestations     of those expressions (specific editions), and items (specific copies). The     RedLightGreen database collapses FRBR’s four levels into just two, displaying     a work and various manifestations of that work. This approach will reduce     a potentially overwhelming number of editions into a smaller, more manageable     set of works that match a user's search terms.
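The two-level display can be illustrated with a small grouping routine in Python. The matching rule used here (normalized author plus title) is an assumed heuristic chosen for brevity; the article does not describe how RedLightGreen decides that two records belong to the same work, and this rule would not catch translations published under a different title.

```python
from collections import defaultdict


def normalize(text):
    """Lowercase and collapse whitespace so trivial differences do not split a work."""
    return " ".join(text.lower().split())


def work_key(record):
    """Build a crude grouping key from author and title (an assumed heuristic)."""
    return (normalize(record["author"]), normalize(record["title"]))


def group_into_works(records):
    """Collapse edition-level records into works, each listing its manifestations."""
    works = defaultdict(list)
    for record in records:
        works[work_key(record)].append(record)
    # Newest edition first, matching the students' preference for recent editions.
    for editions in works.values():
        editions.sort(key=lambda r: r["year"], reverse=True)
    return dict(works)


# Hypothetical edition-level records.
editions = [
    {"author": "Doe, Jane", "title": "An Example Title", "year": 1985},
    {"author": "Doe, Jane", "title": "An  example title", "year": 1999},
    {"author": "Roe, Richard", "title": "Another Book", "year": 1990},
]
works = group_into_works(editions)
print(len(works), "works from", len(editions), "records")
```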

Many RedLightGreen users will be facing imminent deadlines, so shelf status     and location information are particularly urgent. If a needed book is on the     shelf in the university’s library, students indicated that they would go get     it. In this respect, RedLightGreen’s weighting of search results may have     a second benefit: Widely held items are more likely to appear in the student’s     home library. "We’re hoping that it will not only establish the credibility     of sources, but will also establish a certain level of accessibility,"     says RLG software development manager Judith Bush. Shelf status and location     are not recorded in the RLG Union Catalog, however, so RedLightGreen will need     to connect with local online public access catalogs     to retrieve this information.

RLG’s user studies showed that undergraduates aren’t likely to use interlibrary loan; many weren’t even aware that this resource was available to them. They might, however, go to a neighboring college or public library if they knew the book was available. In this, RLG’s test subjects echoed the OCLC study, which found that over half of surveyed students wanted some way to search other libraries’ collections, and that 72 percent of those would go to a nearby library to pick up a book. By contrast, RLG found, undergraduates are unlikely to order the book from an online bookseller, even if the application provides a link to do so.

Students were particularly intrigued by the possibility of creating formatted, ready-to-use bibliographies from their search results. RedLightGreen will support this feature with a variety of citation formats: University of Chicago (Turabian), Modern Language Association (MLA), and American Psychological Association (APA).
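As a rough picture of what bibliography generation involves, the Python sketch below renders one book record in two simplified patterns loosely modeled on MLA and APA book citations. The record fields and the sample book are hypothetical, and real style guides cover many cases (multiple authors, editions, italics) that this sketch ignores.

```python
def format_citation(record, style):
    """Format one book record in a simplified MLA-like or APA-like pattern."""
    last, first = record["author_last"], record["author_first"]
    title, city = record["title"], record["city"]
    publisher, year = record["publisher"], record["year"]
    if style == "mla":
        return f"{last}, {first}. {title}. {city}: {publisher}, {year}."
    if style == "apa":
        # APA-like pattern: initial instead of first name, year right after the author.
        return f"{last}, {first[0]}. ({year}). {title}. {city}: {publisher}."
    raise ValueError(f"unsupported style: {style}")


# Hypothetical record, for illustration only.
book = {
    "author_last": "Doe",
    "author_first": "Jane",
    "title": "An Example Title",
    "city": "New York",
    "publisher": "Example Press",
    "year": 2001,
}
print(format_citation(book, "mla"))
print(format_citation(book, "apa"))
```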

Students are particularly keen to have access to journal articles            in addition to monographs. This feature is outside the scope of            RedLightGreen, as the RLG Union Catalog does not include journal            citations. However, it may be possible to link RedLightGreen            searches to journal databases—or even to the broader            Web—in future versions of the application.

Catalogs: The next generation

RedLightGreen aims to deliver useful responses to broad,            unsophisticated keyword queries. The interface is simplified and            easy to use, less like an online catalog query form with many            different fields and more like a search engine, with a single            keyword entry field and a "search" button. But behind            this interface is the power and depth of one of the world’s            largest bibliographic databases.

For students as well as Web users in general, RedLightGreen will be            a window into information not previously available online.            "The RLG Union Catalog’s unique value is the breadth of            unusual things it contains," says independent search engine            consultant Avi Rappoport, of SearchTools.com. Many of the resources            RedLightGreen will offer are unmatched by any search engine.            "In using the Web, you start getting annoyed by how much is            missing," says Rappoport. RedLightGreen can help fill in the            gaps by supplying detailed information on authoritative research            sources.

With RedLightGreen, RLG and its members are in a unique position to            redefine the library catalog—and to create a brand new kind            of information resource. No one is currently providing this kind of            research information on the public Web, without fees, and with the            kind of usable interface expected by Web users. With RedLightGreen,            RLG is creating an innovative new application—and in the            process is discovering new opportunities for using bibliographic            data.

"This is the library community’s chance to reinvent            what a catalog is," says Michalko.

Last updated 15 January 2003

Timeline


Through most of its 23-year history, the RLG Union Catalog has been used     primarily by librarians and scholars within academic or research institutions.     It is currently available to subscribing institutions through a variety of     interfaces—RLG's Web-based Eureka®     interface, telnet connections to RLIN®,     and a Z39.50 gateway.

A grant from The Andrew W. Mellon Foundation enabled RLG, in late 2001,     to begin envisioning what form the RLG Union Catalog might take if it were     a Web application aimed at undergraduates and nonspecialist researchers. "Our     experimental instincts were encouraged by the Mellon Foundation," says     RLG president Jim Michalko, who notes the foundation’s support for a wide     range of academic information-sharing projects. Brainstorming, research, and     design work by RLG staff and outside consultants marked the early phases of     the project, then known as the "Union Catalog on the Web."

RLG began sketching more definite outlines for the application, nicknamed     "RedLightGreen" by RLG staffers, in spring 2002. Working in part     with outside consultants, RLG developed a functional specification, a database     design, and a customized XML expression of     the MARC data structure used in the RLG Union     Catalog. RLG and outside consultants designed "wireframes" (mockups)     of the Web application, and conducted two rounds of user tests with students     from Stanford, San Jose State, and Santa Clara Universities. RLG also contracted     with Recommind to use that company's MindServer software in the RedLightGreen     pilot project. MindServer will enable fast keyword searching, enhance relevance     ranking for search results, and provide the ability to expand searches to     related categories.

As of late 2002, RLG was in the process of extracting about 4            million records from the RLG Union Catalog (a subset of the data)            and loading them into the RedLightGreen database in order to test            MindServer's performance, XML conversion, and other technical            functions. RLG is also building the Web application that will be            RedLightGreen's public face. Pilot deployments using the full            RLG Union Catalog database are scheduled to begin in fall 2003 at            Swarthmore College, New York University, and Columbia University.

Under the Hood


RedLightGreen is built on an IBM DB2 database containing XML     data. The advantages of XML are many, since it is an adaptable, extensible     data format and is widely accepted. "Once you have data in XML format     you have lots of flexibility in using that data internally or delivering it     to outside partners," says RLG dataloads specialist Joe Altimus.

However, extracting bibliographic data from the RLG Union Catalog's current mainframe databases to an XML format was "far less straightforward than we expected," according to RLG software development manager Judith Bush. Character set encoding was one problem: The mainframe database stores data in EBCDIC format and supports many European, Asian, and Middle Eastern scripts (including Arabic, Hebrew, Cyrillic, Chinese, Japanese, and Korean). That data needed to be converted to UTF-8, an encoding of the Unicode™ character set suited for 8-bit environments and XML. "What RLG learned was that it was a very complex process to write all the rules needed to translate the Union Catalog EBCDIC data into UTF-8, particularly for the Asian and Middle Eastern scripts," says Altimus.
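For plain Latin-script text, the basic shape of such a conversion looks like the Python sketch below: decode the EBCDIC bytes to Unicode, then encode the result as UTF-8. The choice of `cp037`, one of the IBM EBCDIC code pages that Python ships with, is an assumption; the article does not say which encodings RLG's mainframe used, and the non-Latin scripts it mentions would need the custom rules Altimus describes rather than a single built-in codec.

```python
def ebcdic_to_utf8(raw, codepage="cp037"):
    """Decode EBCDIC bytes with a Python codec and re-encode them as UTF-8."""
    text = raw.decode(codepage)  # EBCDIC bytes -> Unicode string
    return text.encode("utf-8")  # Unicode string -> UTF-8 bytes


# Build a small EBCDIC sample by encoding a known string (illustration only).
sample = "Catálogo".encode("cp037")
converted = ebcdic_to_utf8(sample)
print(converted.decode("utf-8"))
```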

Also challenging was the process of coming up with an XML format (known     as a document type definition, or DTD) that     supported the full range of features needed by RedLightGreen. RLG started     with the Library of Congress MARC XML format. However, this format couldn't     effectively validate many of the records stored in the mainframe database,     because RLG's database adds more than 40 custom fields to those defined     by the MARC 21 standard. In adding these     fields to the DTD, RLG learned that some mainframe database element names     could not be used because they violated XML rules for element names.
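The element-name problem is easy to reproduce. XML names may not begin with a digit or contain spaces, so a purely numeric label such as a MARC field tag cannot be used directly as an element name. The Python check below covers only a simplified ASCII subset of the XML 1.0 naming rule, and the names tested are hypothetical; the article does not list the mainframe names that caused trouble.

```python
import re

# Simplified, ASCII-only version of the XML 1.0 Name rule: the first character
# is a letter, underscore, or colon; later characters may also be digits,
# hyphens, or periods.
_XML_NAME = re.compile(r"[A-Za-z_:][A-Za-z0-9_:.\-]*")


def is_valid_element_name(name):
    """Return True if name is a legal XML element name (ASCII subset only)."""
    return _XML_NAME.fullmatch(name) is not None


# Hypothetical candidate names, for illustration only.
for candidate in ["datafield", "245", "title-main", "my field"]:
    print(candidate, is_valid_element_name(candidate))
```

This is why XML renderings of MARC typically carry the numeric tag in an attribute rather than in the element name itself.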

Eventually, RLG developed its own XML DTD for MARC records—an iterative     process that required modifying the initial Library of Congress DTD, testing     it on sample data, and then modifying it again. The current DTD (version 16     of RLG's XML format for MARC records) is somewhat "looser" than     the Library of Congress's, according to Altimus, but works more effectively     with the data actually contained within the RLG Union Catalog.

By removing more than 2,000 subfield elements, RLG's DTD is also about 20 percent of the size of the original Library of Congress DTD, so data conversion and XML validation go correspondingly faster. (Because RLG has already done MARC validation of the field tags, indicators, and subfields in its mainframe databases, it wasn't necessary to repeat this validation when migrating the data to XML. Therefore, the LC MARC subfields could be removed without presenting problems. At some point in the future, however, if data no longer enters the RedLightGreen database from a prevalidated MARC source such as the RLG Union Catalog, RLG will have to develop a more rigorous MARC validation system.)

While the RLG Union Catalog is unusually large and heterogeneous, many other     library catalogs may face similar issues when migrating from legacy database     systems to XML—particularly if their catalog's records haven't     been validated against the MARC standard as they were entered. Indeed, the     Library of Congress has recently published a simplified XML schema and a number     of migration tools to assist libraries with solving such migration problems.

Glossary of terms


  • Data mining: The process of analyzing large amounts of data in order to extract new kinds of useful information (such as implicit relationships between different pieces of information).
  • DTD: Document Type Definition. A file that defines the elements and data structure contained in an XML document.
  • EBCDIC: Extended Binary-Coded-Decimal Interchange Code. The code used on IBM mainframe computers to represent characters (such as letters, numbers, and symbols).
  • Eureka®: RLG's Web-based search interface for both novice and experienced users. Eureka provides access to an array of RLG information resources, including the RLG Union Catalog.
  • FRBR: Functional Requirements for Bibliographic Records. A model for organizing bibliographic information proposed by the International Federation of Library Associations and Institutions. FRBR distinguishes between a work, its expressions (e.g. translations), manifestations of those expressions (specific editions), and items (specific copies).
  • MARC: MAchine-Readable Cataloging, a standard for storing bibliographic data in electronic databases. The standard, which first emerged from a Library of Congress initiative in the 1960s, is used by most library catalogs today. The current version is MARC 21.
  • OPAC: Online Public Access Catalog. Any electronic library catalog that can be used by library patrons.
  • RedLightGreen: An application in development by RLG that aims to use bibliographic data as a way of providing undergraduates with authoritative sources of research information.
  • RLG Union Catalog: RLG's 23-year-old bibliographic database, containing records on more than 42 million different titles at hundreds of institutions.
  • RLIN®: Research Libraries Information Network. A bibliographic information system offered by RLG that provides access to a number of RLG resources, including the RLG Union Catalog. RLIN is scheduled to be replaced by a Windows-based client in 2003.
  • Unicode™: A widely used standard for digitally representing characters in a variety of Western, Middle Eastern, and Asian languages.
  • UTF-8: A way of encoding Unicode characters for use on computer systems such as Unix and Linux.
  • Validation: The process of making sure that the data within an XML document is formatted properly according to a DTD.
  • XML: eXtensible Markup Language. A widely used standard from the World Wide Web Consortium (W3C) that facilitates the interchange of data between computer applications. XML is similar to the language used for Web pages, the HyperText Markup Language (HTML), in that both use markup codes (tags). Computer programs can automatically extract data from an XML document, using its associated DTD as a guide.
  • Z39.50: An international standard, maintained by the Library of Congress, for searching and retrieving information from remote databases.
