Naked journalism.

Posting an unedited transcript, as I just did, is a disconcerting experience. It feels a little like taking off my journalistic clothes in public. For one thing, it reveals just how inarticulate I am when talking. Unlike Brian Lamb or Michael Krasny, I rarely ask questions that sound well-informed, well-rounded, and eloquent. Instead, my questions sound like this: “Well, so, um, what about that other thing?” And then the respondent goes on to talk for ten minutes.

Part of that is intentional: I figure I’m there to interview the subject, not the other way around, and once they start talking, it’s better if I just shut up and get out of the way. (The other part is that I’m just not that eloquent when speaking, in person — I’m much more comfortable expressing myself in writing.)

In the published version of an interview, I clean things up and insert questions that sound smarter — but they also help the reader, by setting up the interviewee’s replies better. Frankly, I find unedited transcripts a little hard to follow sometimes — you miss out on a lot of the nuance and gesture that forms the context of a live conversation. Editing attempts to make up for that, and that’s one of the reasons I like to let the edited interviews stand on their own.

But there’s a deeper reason why I don’t usually post transcripts.

Transcript of Doctorow interview.

Cory Doctorow fans and others might be interested in the full, unedited transcript of my interview with him.

Doctorow is a fast talker, so even though the interview lasted less than 45 minutes, this transcript is almost 8,500 words. I’ve done almost no editing — it’s pretty much a straight transcription of the audio recording. I didn’t record every single “um” and “you know” but otherwise the transcript is pretty faithful.

The edited version of this interview was published on 1/23/2003 by SFGate.

Quarterly envy.

Washington Post book columnist Michael Dirda indulges a fantasy that afflicts nearly every book lover at some time or another — the desire to run a literary magazine. I know this temptation well. Sure, I know, the Internet gives everyone the equivalent of a printing press and blogging tools are the equivalent of a free photocopy machine with an unlimited budget for stamps — in other words, a zine publisher’s dream come true. But there’s still something compelling about a nice, chunky literary quarterly printed on a nice, off-white, toothy paper and deep black ink. It’s a form of nostalgia, I suppose. Dirda understands the impulse and gives it free rein here. Then he closes with an unapologetically sentimental hymn of praise to the literary arts.

I do believe in the great W’s: the whimsical and the wistful, the witty and the worldly. But then knowledge of the arts really should be a source of personal amusement and satisfaction. You memorize poetry, said Anthony Burgess, so that you can belt out appropriate verse when drunk, just as art in general, as Samuel Johnson reminded us, should allow people to better enjoy life or to better endure it. (via bookslut)

Meanwhile, if you want to see how some poets are using the web to revitalize (vitalize?) language poetry, check out what poet Ron Silliman has to say. (via Blogistan)


Googling the library.

I’ve been working with RLG for the past couple of months, writing and editing content for their web site and print publications. RLG is a consortium of research libraries, archives, museum libraries, and the like. One of the things I like about this gig has been that I get to learn about the high-tech information management tools that lurk in the background of these big libraries.

RLG is trying to take some of the library world’s background technology and bring it to the fore with a new Web application (still in development) dubbed RedLightGreen. If it works, it could do for the library stacks what Google did for the Web.

Q&A with Cory Doctorow.

My interview with Cory Doctorow for SFGate.com has just been published. Take a look!

It would make me pretty happy if this book contributed in some way to the idea that reading books on the screen is good. I know that there’s a meme that floats around that says, oh, reading off a screen is hard, and no one wants to do it and so on — despite all the evidence to the contrary. Most of the people I know read off a screen for 12 hours a day.


Q&A: Cory Doctorow

San Francisco, California, USA – Cory Doctorow is a true believer in the power of technology.

His first novel, “Down and Out in the Magic Kingdom,” is one of the first works to be published under the Creative Commons license — an agreement that lets people copy and redistribute the book freely so long as they credit the author. That move puts Doctorow at the forefront of a growing digital rights movement.

Doctorow’s novel is like a love letter to Napster, Google and Walt Disney World. It’s a rollicking, fast-paced story and is entertainingly inventive without bogging down in the impressive array of future technologies it imagines. “Down and Out in the Magic Kingdom” (also published in hardcover this month by Tor Books) is set in a future where death has been eliminated, energy and raw materials are freely available in limitless quantities (much like MP3 files on KaZaA today) and people’s nervous systems are wired directly into the Internet. The protagonist, Julius, works at Disney World, and the novel chronicles his struggles to protect the theme park’s Haunted Mansion from being shut down by an ad hoc group of designers who have developed a technology for “flash baking” theme-park experiences directly into parkgoers’ brains.

In his day job, Doctorow is outreach coordinator for the Electronic Frontier Foundation (EFF). He’s also one of the primary contributors to the popular techie weblog BoingBoing, he co-founded a dot-com, OpenCola, and he has another science-fiction novel and a short-story collection due out later this year.

Like his character Julius, Doctorow is an archetypal geek, from his nerdy Drew Carey-style glasses to the bright yellow cell phone dangling from his cargo pants. I caught up with the prolific (and apparently highly caffeinated) Toronto native in his office at the EFF, where a blueprint of the Haunted Mansion hangs over his desk.

This will make me sound like I’m behind the times, but this is actually the first time I’ve read an entire novel on screen.

It would make me pretty happy if this book contributed in some way to the idea that reading books on the screen is good. I know that there’s a meme that floats around that says, oh, reading off a screen is hard, and no one wants to do it and so on — despite all the evidence to the contrary. Most of the people I know read off a screen for 12 hours a day.

I won’t deny that there’s a sentimental frisson of good feeling you get when you pick up a physical, paper book, especially one with your name on it. Books are nice, but they’re not as nice as we make out.

I think that, ultimately, the role of books in the world of electronic publishing will be much like the role of live music in the world of recorded-music publishing. We’ll still have plenty of paper books, but that will be dwarfed by the enormous size of the electronic-book universe.

You’ve written several novels, you’re at work on two more, you work for the EFF and you’ve got a popular blog where you post 10 or more items a day. Where do you find the time?

Well, sleep is for the weak. I’ll sleep when I’m dead.

The thing about it is that there is synergy. The stuff that I do for BoingBoing is basically research in support of EFF and the writing, and the blog is how I keep track of it. By doing it in public, I get lots of suggestions, and I also get a lot of feedback. BoingBoing is a net time saver because I get more research done with less effort, and I keep track of it better than I would if I were doing it privately.

The research that I do on EFF issues is also feeding the fiction. I published a story last August on Salon called “0wnz0red,” about digital rights management and trusted computing. That came straight out of a briefing I got here.

In your book, you have a sort of alternate currency called Whuffie. The characters are constantly checking one another’s Whuffie scores and looking for ways to earn more Whuffie. Can you explain the idea?

Well, currency is a way of keeping score today. Whuffie is how much esteem people hold you in. Currency is a really rough approximation of Whuffie. You can’t really get a job without esteem. You generally can’t get a mortgage with no esteem.

In the book, I have this sort of magical McGuffin technology, which is something that can automatically find out how you feel about everything that you have an opinion on. Then, someone who has a high opinion about me can ask me — without any kind of conscious intervention — how I feel about you. They can just ask the network, “How is it that Cory feels about you?” And that gives them some idea of how much time of day they should give you.

It sounds a little like walking around with your bank balance displayed in a box above your head at all times.

Well, it’s true. Except, you know, we already do this, in some way. As currency is a rough approximation of your Whuffie, the things that currency affords, like your style of dress, your haircut, all the semiotics of your presentation, are descended from Whuffie. It’s just that Whuffie’s harder to [fake].

The Internet has made us very socially deviant, in the sense that social norms are enforced by groups. If you have some incredibly strange idea of, for instance, wearing underwear on your head, generally speaking, there is social disapprobation that keeps that factor in check. But on the Internet, you can basically exist in the communication spheres of people who have the same value system as yours, no matter how weird it may be. On the Internet, you don’t get that pressure to return to a norm. In some ways, Whuffie is a way to make you more socially normative. It’s not necessarily a good thing.

Why did you call it “Whuffie”?

The word is what we used in high school instead of “brownie points.” A friend of mine pointed out, given the era that I went to high school in, that it almost certainly came from “The Arsenio Hall Show”: “Woof, woof, woof.”

Most of the book takes place at Disney World, and the plot centers around various factions’ attempts to control the Haunted Mansion there. You seem a little fascinated — almost obsessed — with Disney.

(points out a large collection of Disney paraphernalia in his office) Yeah, I’m a little obsessed. There’s so much to love and so much to hate about Disney World and about the Disney corporation that it’s the perfect obsessive material for someone who wants to mine the cultural space.

I was raised by schoolteachers, and my grandparents were snowbirds. Every winter they would fly south to Fort Lauderdale to a gate-guarded, seniors-only community called Century Village that my dad likes to call “Cemetery Village.” We took Christmas breaks in Lauderdale, and it was just about as dull as you can imagine for an eight- or nine-year-old. So we would get in the big, gas-guzzling land yacht that my grandfather drove, and we would go to Disney World for a couple of days. Christmas weekends every year, during my whole adolescence, were spent at Disney World, and I just became completely obsessed with it.

Walt’s genius was that he would come up with incredibly novel, innovative things that could only be imitated after a couple of years. Meanwhile, he would have this very healthy margin until his competition figured out what he was doing and drove the price down to a competitive level. Then he would do the next thing. But when Walt died [in 1966], they just stopped doing that. They just started doing the same thing. They basically built a twin of Disneyland in Disney World, but bigger.

There’s lots you can say about [Disney chairman and CEO Michael] Eisner that isn’t very flattering, but the one thing you can say is that under Eisner’s leadership, there has been a definite focus on innovation, at least in Florida. At Disneyland, unfortunately, they brought in these idiot McKinsey consultants, they stopped spending any money on R&D and they bought all these off-the-shelf midway rides, with Ferris wheels, for the California Adventure. They built this incredibly dreary, boring, banal theme park that is like an extremely clean but less fun version of the Santa Monica pier, and, unsurprisingly, it’s a ghost town. You could fire a cannon down the main drag without hitting a tourist.

In the world of “Down and Out in the Magic Kingdom,” there’s no death, there are unlimited resources, nanotechnology can create any object you desire (including a clone of yourself) and energy is free. What were you trying to accomplish by setting the story in that kind of world?

I wanted to clarify my own thinking about what a non-scarce economics looks like. Keynes and Marx and the great economic thinkers are all concerned with the management of resources that are scarce. If it’s valuable, it needs to be managed, because the supply of it will dwindle. You need to avert the tragedy of the commons [the notion that self-interested individuals, such as sheepherders, will always use as much of a common resource as possible, such as a grassy pasture, until that resource is totally depleted].

Today, with things that can be represented digitally, we have the opposite. In the Napster universe, everyone who downloads a file makes a copy of it available. This isn’t a tragedy of the commons, this is a commons where the sheep s*** grass — where the more you graze, the more commons you get. So I took the idea of nanotechnology as the means whereby any good can be reproduced infinitely, at zero marginal cost, and tried to use that as a metaphor for the online world we actually live in.

The other side of it is this notion that you never really run out of scarcity. There are always limits on your time and attention, there are only so many people who can fit in a restaurant, only so many people who can converse at once. When you are beset on all sides by entertainment, figuring out which bits are worthwhile requires a level of attention that quickly burns all your idle cycles. When everyone watched Jackie Gleason on Thursdays at 9:30, it was a lot easier – television watching required a lot less effort than whipping out your TiVo and figuring out which shows you want to prerecord.

What’s your approach to writing?

It’s really quotidian. I write a page a day, basically. With novels, once I get the first 20 or 25 percent on paper and an outline done, I usually make that semipublic. I have a list of about 200 or 300 first readers, and I e-mail them my page, every day, even before I spell-check it, hot off the word processor. They keep me really honest. When I miss a day, they e-mail me and nudge me.

I had a really successful experience doing that with my second book, “Eastern Standard Tribe” [due out in November 2003]. I wrote that between Aug. 1 and Dec. 12 of 2001, 60,000 words in five months, and actually managed to sell it within a week of my finishing it.

You’ve lived in San Francisco a while now. How do you like it here?

I’ve lived here since September of 2000. Right at the height of the boom.

I really miss Toronto. San Francisco’s a really dysfunctional place. It has a lot of the downsides of living in a small town and a lot of the downsides of living in a big city, and it misses a lot of the upsides of both of those places. It’s very hard to get from one place to another. The mass transit is so-so. Going from the Mission to downtown on foot feels about 10 times as long as it actually is. It’s a Jane Jacobs nightmare of freeway overpasses and single-use neighborhoods.

The weather’s OK, although it would be nice if the buildings were insulated, because when it’s 40 degrees at night and you don’t have insulation or central heating, damn, it’s cold.

The thing about San Francisco that keeps me here is the people, the technology. This is ground zero for technologists. This is geek mecca, it’s nirvana. But I heartily miss the Northeast. You can see the bones of a great city in San Francisco, and there are pockets of it that are like nothing else on Earth, but taken as a whole, it’s really dysfunctional.

Also, I can’t get my head around the private-medicine thing. I think this explains a lot about the various geek cultures of the U.K., Canada and the U.S. In the U.S., there are tons of venture capital, so everyone went out and started a company. In Canada, there are tons of socialized medicine, so everyone became a freelancer. If you’re a freelancer [in the U.S.], and you’re in poor health, and you can’t get insured, you are embarking on a kind of slow suicide. And then, in the U.K., they had tons of arts grants, so all the geeks became Net artists, and that’s why there’s all this kind of strange, situationist, British Net art.

I’m told that Canada spends less money per capita giving away health care than the U.S. spends regulating it. So you’re spending more money keeping the HMOs honest than it would cost you to give it away. That’s a big difference between the American and Canadian mind-sets.

How many downloads of your book have there been so far?

There have been 47,334 from my site. Ninety downloads since we started talking. I hope to break 50,000 today.

That’s just moving right along.

Hell, yeah!


Mining the catalog


RLG's RedLightGreen Project

Mining the Catalog


What happens when you take a massive database of bibliographic descriptions and redesign it for the Web, not just as a resource for librarians, but as a tool for undergraduate students and the public at large? To put it another way: "How can we strip away the ‘librariness’ of the catalog so it looks more like what students expect?" says Jim Michalko, president of RLG.

Those are among the questions that RLG’s ambitious RedLightGreen project aims to address. After more than a year of effort, the project has already generated some intriguing answers.

One thing is clear: the finished application will look and act a lot more like Google or Amazon.com than a traditional library catalog, even though it will have catalog records at its heart. It will also open up new uses for bibliographic data. Instead of simply retrieving call numbers and shelf status, RedLightGreen can function as a potent tool for information discovery and for identifying the most authoritative research sources.

The RedLightGreen project is a whole new way of thinking about library catalogs. Catalogs today are optimized for inventory control and transaction management—not necessarily information discovery. By taking the large, multi-institution database that is the RLG Union Catalog, and mining it for conceptual relationships and holdings data, RedLightGreen aims beyond what both catalogs and Internet search engines can provide.

Getting results

Although development is still underway (see timeline), RLG has already learned a great deal from the project. The first lesson is that the RLG Union Catalog is a rich, untapped lode of data on which books and authors are authoritative sources of research information.

What we’ve learned from the RedLightGreen project

  • We’ve learned more about what undergraduate users want from online information resources
  • We’ve learned how data mining software can uncover valuable new information hidden in the RLG Union Catalog
  • We’ve learned how to provide access to a wealth of complex information through a simple, easy-to-use interface
  • We’ve discovered new opportunities for using bibliographic data to help end users find authoritative sources of research information
  • We’ve learned what’s involved in the complicated process of transforming MARC records to XML
  • We’ve learned about Functional Requirements for Bibliographic Records (FRBR), an emerging standard for distinguishing between various editions
  • We’ve confirmed how RLG can take a leading role in envisioning, building, and testing innovative new information services

RLG’s 23-year-old Union Catalog encompasses more than 126 million bibliographic records, representing 42 million unique titles. It provides unparalleled coverage across subjects and material types in more than 370 languages, from hundreds of libraries worldwide.

RedLightGreen can use this data to put the most widely held items near the top of any search results list—helping users to zero in on the most credible books and authors quickly. If a book appears in dozens of libraries’ collections, it’s a good bet that the book is considered an important source of information in its subject area: its selection by dozens of librarians is an implicit endorsement. By contrast, an item held by only one library may be of interest to Ph.D. candidates and specialists, but is probably less interesting to a general audience.
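The holdings-based ordering described above can be illustrated with a short Python sketch. It is not RLG's implementation: the function name, the tie-breaking rule, and the sample data are all invented for illustration, and the only idea taken from the report is that items held by more libraries sort nearer the top.

```python
from collections import defaultdict

# Invented sample data: (title, holding library) pairs, not real catalog records.
SAMPLE_HOLDINGS = [
    ("Widely Held Survey", "Library A"),
    ("Widely Held Survey", "Library B"),
    ("Widely Held Survey", "Library C"),
    ("Specialist Monograph", "Library A"),
    ("Course Reader", "Library B"),
    ("Course Reader", "Library C"),
]


def rank_by_holdings(holdings):
    """Return titles ordered by number of distinct holding libraries, most first.

    Ties are broken alphabetically (an assumption made for this sketch).
    """
    libraries = defaultdict(set)
    for title, library in holdings:
        libraries[title].add(library)
    return sorted(libraries, key=lambda title: (-len(libraries[title]), title))


if __name__ == "__main__":
    for title in rank_by_holdings(SAMPLE_HOLDINGS):
        print(title)
```

Counting distinct libraries rather than raw records keeps a title from being boosted merely because one library cataloged it twice.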

RLG Union Catalog data can also be used to evaluate the relevance of various items to particular keyword searches and to organize search results. For example, one book might be listed under three different subject headings at three different libraries. By comparing these three records and the words used in the book’s title, one can discover implicit relationships between all of the words. Data mining, provided by Recommind Inc.’s MindServer™ software, extends that process to the entire database of 42 million items, grouping related words together and making new connections between subject headings, titles, and keywords.
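To make the idea of implicit relationships between title words and subject headings concrete, here is a minimal Python sketch that simply counts how often each title word appears alongside each subject heading. This is plain co-occurrence counting, chosen for illustration; it is not a description of how MindServer works, and the record layout and sample data are assumptions.

```python
from collections import Counter
from itertools import product

# Invented sample records; the field names are assumptions for this sketch.
SAMPLE_RECORDS = [
    {"title": "City Riots", "subjects": ["Urban unrest", "Local history"]},
    {"title": "Riots and Reform", "subjects": ["Urban unrest"]},
]


def cooccurrence(records):
    """Count how often each lowercased title word appears with each subject heading."""
    counts = Counter()
    for record in records:
        words = set(record["title"].lower().split())
        for word, subject in product(words, record["subjects"]):
            counts[(word, subject)] += 1
    return counts


def related_subjects(counts, word):
    """List (subject, count) pairs for a keyword, most frequent first."""
    pairs = [(subject, n) for (w, subject), n in counts.items() if w == word]
    return sorted(pairs, key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    print(related_subjects(cooccurrence(SAMPLE_RECORDS), "riots"))
```

With counts like these, a keyword search could be expanded with the subject headings that most often accompany the keyword, even when the headings do not contain it.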

For example, a student might enter a search for the keywords "Civil War" without specifying the American, Spanish, or other civil wars. Using Recommind, RedLightGreen can organize the results in clusters of related items, letting the student pick which civil war interests her. At the same time the application can insert more specific, scholarly subject classification terms into the search that have been derived from the MindServer data. A search for "New York riots" would turn up records pertaining to the Irish draft riots of 1863, with subject headings such as "New York—History—Civil War, 1861-1865" and "Civil War, 1861-1865—Fiction"—even though those headings don’t match the keywords exactly.

In the finished application, both holdings information and Recommind-generated relevance data can be used to sort search results. RLG expects that this will produce more pertinent, authoritative search results than can be delivered by any single institution’s catalog—or by an Internet search engine.

What students want

Of course, mining the information latent within the RLG Union Catalog and making practical use of it are two different things. However, early indications suggest that RedLightGreen’s enhanced searches will interest undergraduates.

RLG conducted two separate user studies with undergraduates, using mockups of the RedLightGreen application. The results showed clearly that undergraduates’ needs are very different from those of the librarians and scholars who have, up to now, been RLG’s primary end users. "Students live in a different universe than we do," says RLG information architect Arnold Arcolio.

Undergraduates don’t use the precise terminology of Library of Congress classifications. They may be thrown by library terminology: Several students in the study thought that a link to "scores" pertained to sports scores, not musical scores, while others thought a link to "maps" would provide maps and directions to the library, not items from map collections.

Although students aren’t discriminating about library classifications, they are concerned about finding sources for their research, and Web searches aren’t meeting that need. In this respect, RLG’s user studies echo the findings of a 2002 study by the Digital Library Federation and Outsell, Inc. That study found that existing information sources available to students are falling far short of their expectations in terms of quality, ease of access, subject coverage, and other aspects they considered most important. Ninety-one percent of students felt that high-quality information sources were very important; only 51 percent felt that available electronic information sources are meeting that need. A study published by the Online Computer Library Center (OCLC) in June 2002 confirms these findings, showing that accuracy is the most important attribute of online information for students—yet only half of those surveyed found that information on the Web is acceptable in this regard.

Delivering the goods

According to RLG’s user studies, undergraduates are not concerned about detailed edition information: What matters to them is getting the most recent English edition, and getting it as quickly as possible.

Therefore, RLG needed a way to organize the vast number of records in the RLG Union Catalog without overwhelming users with a deluge of information about different editions. RLG turned to the Functional Requirements for Bibliographic Records (FRBR), an emerging model proposed by the International Federation of Library Associations and Institutions. FRBR distinguishes between a work, its expressions (e.g. translations), manifestations of those expressions (specific editions), and items (specific copies). The RedLightGreen database collapses FRBR’s four levels into just two, displaying a work and various manifestations of that work. This approach will reduce a potentially overwhelming number of editions into a smaller, more manageable set of works that match a user's search terms.
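The collapse from four FRBR levels to two can be sketched in Python as a grouping step. The `Record` type, its field names, and the choice to fold the expression into the manifestation label are assumptions made for this example, not details of the RedLightGreen database.

```python
from collections import defaultdict
from typing import NamedTuple


class Record(NamedTuple):
    """One catalog record tagged with all four FRBR levels (hypothetical layout)."""
    work: str
    expression: str      # e.g. a particular translation
    manifestation: str   # e.g. a specific edition
    item: str            # e.g. a specific copy


def collapse_to_two_levels(records):
    """Group records by work, listing each distinct manifestation once.

    Item-level detail is dropped and the expression is folded into the
    manifestation label, leaving only work -> manifestations.
    """
    works = defaultdict(set)
    for record in records:
        works[record.work].add(f"{record.manifestation} ({record.expression})")
    return {work: sorted(labels) for work, labels in works.items()}


# Invented sample data for illustration.
SAMPLE_RECORDS = [
    Record("Work X", "English", "1st ed., 1990", "copy 1"),
    Record("Work X", "English", "1st ed., 1990", "copy 2"),
    Record("Work X", "French translation", "2nd ed., 1995", "copy 1"),
    Record("Work Y", "English", "1st ed., 2001", "copy 1"),
]

if __name__ == "__main__":
    for work, editions in collapse_to_two_levels(SAMPLE_RECORDS).items():
        print(work, editions)
```

Four records become two works here, which is the kind of reduction that keeps a results list from being dominated by near-duplicate editions and copies.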

Many RedLightGreen users will be facing imminent deadlines, so shelf status and location information are particularly urgent. If a needed book is on the shelf in the university’s library, students indicated that they would go get it. In this respect, RedLightGreen’s weighting of search results may have a second benefit: Widely held items are more likely to appear in the student’s home library. "We’re hoping that it will not only establish the credibility of sources, but will also establish a certain level of accessibility," says RLG software development manager Judith Bush. Shelf status and location are not recorded in the RLG Union Catalog, however, so RedLightGreen will need to connect with local online public access catalogs to retrieve this information.

RLG’s user studies showed that the undergraduates aren’t likely to use interlibrary loan—many weren’t even aware that this resource was available to them. They might, however, go to a neighboring college or public library if they knew the book was available. In this, RLG’s test subjects echoed the OCLC study, which found that over half of surveyed students wanted some way to search other libraries’ collections, and that 72 percent of those would go to a nearby library to pick up a book. By contrast, RLG found, undergraduates are unlikely to order the book from an online bookseller, even if the application provides a link to do so.

Students were particularly intrigued about the possibility of creating formatted, ready-to-use bibliographies from their search results. RedLightGreen will support this feature with a variety of citation formats: University of Chicago (Turabian), Modern Language Association (MLA), and American Psychological Association (APA).

Students are particularly keen to have access to journal articles in addition to monographs. This feature is outside the scope of RedLightGreen, as the RLG Union Catalog does not include journal citations. However, it may be possible to link RedLightGreen searches to journal databases—or even to the broader Web—in future versions of the application.

Catalogs: The next generation

RedLightGreen aims to deliver useful responses to broad, unsophisticated keyword queries. The interface is simplified and easy to use, less like an online catalog query form with many different fields and more like a search engine, with a single keyword entry field and a "search" button. But behind this interface is the power and depth of one of the world’s largest bibliographic databases.

For students as well as Web users in general, RedLightGreen will be a window into information not previously available online. "The RLG Union Catalog’s unique value is the breadth of unusual things it contains," says independent search engine consultant Avi Rappoport, of SearchTools.com. Many of the resources RedLightGreen will offer are unmatched by any search engine. "In using the Web, you start getting annoyed by how much is missing," says Rappoport. RedLightGreen can help fill in the gaps by supplying detailed information on authoritative research sources.

With RedLightGreen, RLG and its members are in a unique position to redefine the library catalog—and to create a brand new kind of information resource. No one is currently providing this kind of research information on the public Web, without fees, and with the kind of usable interface expected by Web users. With RedLightGreen, RLG is creating an innovative new application—and in the process is discovering new opportunities for using bibliographic data.

"This is the library community’s chance to reinvent what a catalog is," says Michalko.

Last updated 15 January 2003

Timeline


Through most of its 23-year history, the RLG Union Catalog has been used primarily by librarians and scholars within academic or research institutions. It is currently available to subscribing institutions through a variety of interfaces—RLG's Web-based Eureka® interface, telnet connections to RLIN®, and a Z39.50 gateway.

A grant from The Andrew W. Mellon Foundation enabled RLG, in late 2001, to begin envisioning what form the RLG Union Catalog might take if it were a Web application aimed at undergraduates and nonspecialist researchers. "Our experimental instincts were encouraged by the Mellon Foundation," says RLG president Jim Michalko, who notes the foundation’s support for a wide range of academic information-sharing projects. Brainstorming, research, and design work by RLG staff and outside consultants marked the early phases of the project, then known as the "Union Catalog on the Web."

RLG began sketching more definite outlines for the application, nicknamed "RedLightGreen" by RLG staffers, in spring 2002. Working in part with outside consultants, RLG developed a functional specification, a database design, and a customized XML expression of the MARC data structure used in the RLG Union Catalog. RLG and outside consultants designed "wireframes" (mockups) of the Web application, and conducted two rounds of user tests with students from Stanford, San Jose State, and Santa Clara Universities. RLG also contracted with Recommind to use that company's MindServer software in the RedLightGreen pilot project. MindServer will enable fast keyword searching, enhance relevance ranking for search results, and provide the ability to expand searches to related categories.

As of late 2002, RLG was in the process of extracting about 4 million records from the RLG Union Catalog (a subset of the data) and loading them into the RedLightGreen database in order to test MindServer's performance, XML conversion, and other technical functions. RLG is also building the Web application that will be RedLightGreen's public face. Pilot deployments using the full RLG Union Catalog database are scheduled to begin in fall 2003 at Swarthmore College, New York University, and Columbia University.

Under the Hood


RedLightGreen is built on an IBM DB2 database containing XML data. The advantages of XML are many, since it is an adaptable, extensible data format and is widely accepted. "Once you have data in XML format you have lots of flexibility in using that data internally or delivering it to outside partners," says RLG dataloads specialist Joe Altimus.

However, extracting bibliographic data from the RLG Union Catalog's current mainframe databases to an XML format was "far less straightforward than we expected," according to RLG software development manager Judith Bush. Character set encoding was one problem: Data is stored in the mainframe database in EBCDIC format, and the database supports many European, Asian, and Middle Eastern scripts (including Arabic, Hebrew, Cyrillic, Chinese, Japanese, and Korean). That data needed to be converted to UTF-8, a transformation format of the Unicode™ character set suited for 8-bit environments and XML. "What RLG learned was that it was a very complex process to write all the rules needed to translate the Union Catalog EBCDIC data into UTF-8, particularly for the Asian and Middle Eastern scripts," says Altimus.
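For readers who want a feel for the simplest form of this conversion, here is a minimal sketch in Python. It assumes the source bytes use code page 037, a common single-byte EBCDIC variant; the article does not say which EBCDIC variants RLG's mainframe uses, and the function name is hypothetical. A single-byte code page like this one only covers Latin characters, so it does not touch the hard part Altimus describes, which is writing rules for the Asian and Middle Eastern scripts.

```python
# Sketch only: transcode EBCDIC bytes to UTF-8.
# Assumption: the input is code page 037 (Python codec "cp037").
# RLG's actual mainframe encoding is not specified in the article.

def ebcdic_to_utf8(raw: bytes, codepage: str = "cp037") -> bytes:
    """Decode EBCDIC bytes to text, then re-encode the text as UTF-8."""
    text = raw.decode(codepage)
    return text.encode("utf-8")


# Build a sample EBCDIC byte string, then convert it.
sample = "Union Catalog".encode("cp037")
converted = ebcdic_to_utf8(sample)
print(converted.decode("utf-8"))
```

The two-step shape (decode to text, then encode) is the point: every conversion rule lives in the decoding step, which is where custom handling for non-Latin scripts would have to go.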

Also challenging was the process of coming up with an XML format (defined in a document type definition, or DTD) that supported the full range of features needed by RedLightGreen. RLG started with the Library of Congress MARC XML format. However, this format couldn't effectively validate many of the records stored in the mainframe database, because RLG's database adds more than 40 custom fields to those defined by the MARC 21 standard. In adding these fields to the DTD, RLG learned that some mainframe database element names could not be used because they violated XML rules for element names.
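The naming problem is easy to demonstrate. The sketch below checks candidate names against a simplified, ASCII-only approximation of the XML rule that an element name must start with a letter, underscore, or colon, and may continue with letters, digits, periods, hyphens, underscores, or colons. The candidate names are made up for illustration and are not RLG's actual field names; the full XML rule also allows many non-ASCII characters that this check leaves out.

```python
import re

# Simplified, ASCII-only approximation of the XML 1.0 "Name" rule.
# The real rule also permits many non-ASCII letters.
_XML_NAME = re.compile(r"[A-Za-z_:][A-Za-z0-9._:\-]*")


def is_valid_element_name(name: str) -> bool:
    """Return True if the whole string matches the simplified Name rule."""
    return _XML_NAME.fullmatch(name) is not None


# Hypothetical candidate names, not taken from RLG's database.
candidates = ["datafield", "245", "local-note", "call number", "_rlg.id"]
for name in candidates:
    print(f"{name!r}: {is_valid_element_name(name)}")
```

A name that begins with a digit or contains a space fails the check, which is the kind of rule a legacy database field name can easily break.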

Eventually, RLG developed its own XML DTD for MARC records—an iterative process that required modifying the initial Library of Congress DTD, testing it on sample data, and then modifying it again. The current DTD (version 16 of RLG's XML format for MARC records) is somewhat "looser" than the Library of Congress's, according to Altimus, but works more effectively with the data actually contained within the RLG Union Catalog.

Because RLG removed more than 2,000 subfield elements, its DTD is also about 20 percent the size of the original Library of Congress DTD, so data conversion and XML validation go correspondingly faster. (Because RLG has already done MARC validation of the field tags, indicators, and subfields in its mainframe databases, it wasn't necessary to repeat this validation when migrating the data to XML. Therefore, the LC MARC subfields could be removed without presenting problems. At some point in the future, however, if data no longer enters the RedLightGreen database from a prevalidated MARC source such as the RLG Union Catalog, RLG will have to develop a more rigorous MARC validation system.)

While the RLG Union Catalog is unusually large and heterogeneous, many other library catalogs may face similar issues when migrating from legacy database systems to XML, particularly if their catalogs' records haven't been validated against the MARC standard as they were entered. Indeed, the Library of Congress has recently published a simplified XML schema and a number of migration tools to assist libraries with solving such migration problems.
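To illustrate why a simplified format can be so much smaller, here is a rough sketch of a MARC record expressed with a few generic elements, where the field tag and subfield code are attributes rather than separate element names. The element and attribute names are assumptions loosely modeled on the Library of Congress's simplified schema, the namespace is omitted, and the record content is invented. This is not RLG's DTD.

```python
import xml.etree.ElementTree as ET


def build_record(title: str, author: str) -> ET.Element:
    """Build a tiny MARC-like record from generic field and subfield elements."""
    record = ET.Element("record")
    # Field 100: main author entry; field 245: title statement.
    author_field = ET.SubElement(
        record, "datafield", {"tag": "100", "ind1": "1", "ind2": " "}
    )
    ET.SubElement(author_field, "subfield", {"code": "a"}).text = author
    title_field = ET.SubElement(
        record, "datafield", {"tag": "245", "ind1": "1", "ind2": "0"}
    )
    ET.SubElement(title_field, "subfield", {"code": "a"}).text = title
    return record


# Invented sample data, serialized to UTF-8 and parsed back.
record = build_record("An Example Title", "Author, Example")
xml_bytes = ET.tostring(record, encoding="utf-8")
parsed = ET.fromstring(xml_bytes)

# Pull the title out of field 245, subfield a.
titles = [
    sf.text
    for df in parsed.findall("datafield[@tag='245']")
    for sf in df.findall("subfield[@code='a']")
]
print(titles)
```

With this shape, adding a new field or subfield means using a new attribute value, not declaring a new element, so the format definition stays short no matter how many fields the catalog uses.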

Glossary of terms


Data mining
The process of analyzing large amounts of data in order to extract new kinds of useful information (such as implicit relationships between different pieces of information).
DTD
Document Type Definition. A file that defines the elements and data structure contained in an XML document.
EBCDIC
Extended Binary-Coded-Decimal Interchange Code. The code used on IBM mainframe computers to represent characters (such as letters, numbers, and symbols).
Eureka®
RLG's Web-based search interface for both novice and experienced users. Eureka provides access to an array of RLG information resources, including the RLG Union Catalog.
FRBR
Functional Requirements for Bibliographic Records. A model for organizing bibliographic information proposed by the International Federation of Library Associations and Institutions. FRBR distinguishes between a work, its expressions (e.g. translations), manifestations of those expressions (specific editions), and items (specific copies).
MARC
MAchine-Readable Cataloging, a standard for storing bibliographic data in electronic databases. The standard, which first emerged from a Library of Congress initiative in the 1960s, is used by most library catalogs today. The current version is MARC 21.
OPAC
Online Public Access Catalog. Any electronic library catalog that can be used by library patrons.
RedLightGreen
An application in development by RLG that aims to use bibliographic data as a way of providing undergraduates with authoritative sources of research information.
RLG Union Catalog
RLG's 23-year-old bibliographic database, containing records on more than 42 million different titles at hundreds of institutions.
RLIN®
Research Libraries Information Network. A bibliographic information system offered by RLG that provides access to a number of RLG resources, including the RLG Union Catalog. RLIN is scheduled to be replaced by a Windows-based client in 2003.
Unicode™
A widely used standard for digitally representing characters in a variety of Western, Middle Eastern, and Asian languages.
UTF-8
A way of encoding Unicode characters as sequences of 8-bit bytes, widely used on computer systems such as Unix and Linux and in XML documents.
Validation
The process of making sure that the data within an XML document is formatted properly according to a DTD.
XML
eXtensible Markup Language. A widely used standard from the World Wide Web Consortium (W3C) that facilitates the interchange of data between computer applications. XML is similar to the language used for Web pages, the HyperText Markup Language (HTML), in that both use markup codes (tags). Computer programs can automatically extract data from an XML document, using its associated DTD as a guide.
Z39.50
An international standard, maintained by the Library of Congress, for searching and retrieving information from remote databases.

Link: Mining the catalog


Down and out.

Science fiction writer, EFF evangelist, BoingBoing blogger, and former dot-commie Cory Doctorow has just published his first novel, Down and Out in the Magic Kingdom. It’s a real book that you can buy in bookstores, but he’s also making it available for free download in a variety of formats on his web site. Doctorow says that there’ve been more than 20,000 downloads since its Jan. 9 launch, which qualifies as a smash hit in the SF novel world for sure.

MICRO REVIEW added 1/20/03: Doctorow’s novel (topping 50,000 downloads now) is like a love letter to Napster, Google and Walt Disney World. It’s a rollicking, fast-paced story and is entertainingly inventive without bogging down in the impressive array of future technologies it imagines. Down and Out in the Magic Kingdom is set in a future where death has been eliminated, energy and raw materials are freely available in limitless quantities (much like MP3 files on KaZaA today) and people’s nervous systems are wired directly into the Internet. The protagonist, Julius, works at Disney World, and the novel chronicles his struggles to protect the theme park’s Haunted Mansion from being shut down by an ad hoc group of designers who have developed a technology for “flash baking” theme-park experiences directly into parkgoers’ brains.

The novel gets off to a cracking good start, it’s a fast read, and it manages to be entertaining and thought-provoking at the same time. Doctorow is a whiz at description, plot, and inventiveness. I’m looking forward to his next book.
