Googling blogs: A proposal

As much as Google rocks, there is one area where it really sucks: Searching weblogs. That’s because it’s not particularly intelligent about separating or summarizing weblog entries, making it a pretty blunt tool for finding specific information on a blog.

The problem: Google looks at the web in terms of pages, and weblogs don’t map neatly onto pages. An index page may hold many posts, and the same posts are then repeated on archive pages. Sometimes it’s one post to a page, sometimes a whole month’s worth. There’s no standardization. When you do a Google search on someone’s weblog, the results are a mishmash of single pages and agonizingly long archive or index pages. As a result, you’re forced to repeat the same query using your browser’s Find feature (Ctrl-F in Internet Explorer) in order to zero in on the exact spot where the search term appears. Repeat Ctrl-F until you’ve exhausted the current page, then go back to Google and look at the next search result.

Never mind that Google’s index is often several weeks out of date while weblogs get updated daily, if not more often.

On top of that, Google’s default summaries aren’t very good at capturing the essence of a post. Since it doesn’t know the difference between a post and a page, that’s not surprising.

The upshot: As I’ve argued before, weblogs, in their current form, are great for recording information but really suck at information retrieval.

One possible solution: As it turns out, we do have a couple of data formats that understand the difference between a post and a page, include useful summary data, and even include handy pointers back to the exact archive location of a post. They’re called RSS and RDF (the latter meaning RSS 1.0, the RDF-based flavor of RSS).

These syndication formats are used to aggregate news, but they could be useful indexing tools too. What if Google (or Daypop, once they can afford to buy a few new hard drives) collected RSS and RDF feeds — and then archived them in a searchable index?

Instead of news stories scrolling off into oblivion when they get to the bottom of a feed, they’d enter a permanent index where they could be used for information retrieval later.
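To make the archiving idea concrete, here is a rough sketch of what "collect the feed, keep every item" might look like. Everything in it is my own assumption rather than anything a search engine actually does: the `posts` table, the `open_archive` and `archive_feed` names, and the choice of SQLite are all hypothetical. It also only handles plain, un-namespaced RSS items with `title`, `link`, `description`, and `pubDate` elements; RDF-based RSS 1.0 feeds use XML namespaces and would need extra handling.

```python
import sqlite3
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime


def open_archive(path=":memory:"):
    """Open (or create) the hypothetical permanent archive of feed items."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS posts (
               link      TEXT PRIMARY KEY,  -- permalink back to the post
               blog      TEXT NOT NULL,
               title     TEXT,
               summary   TEXT,
               published TEXT               -- ISO date, e.g. 2002-05-06
           )"""
    )
    return conn


def archive_feed(conn, blog, rss_xml):
    """Store every item in the feed; return how many were new."""
    before = conn.total_changes
    root = ET.fromstring(rss_xml)
    for item in root.iter("item"):
        link = item.findtext("link")
        if not link:
            continue  # without a permalink there is nothing to point back to
        pub = item.findtext("pubDate")
        published = parsedate_to_datetime(pub).date().isoformat() if pub else None
        # The permalink is the key, so polling the same feed again
        # adds only the items we have not seen before.
        conn.execute(
            "INSERT OR IGNORE INTO posts VALUES (?, ?, ?, ?, ?)",
            (link, blog, item.findtext("title"),
             item.findtext("description"), published),
        )
    conn.commit()
    return conn.total_changes - before
```

Run on a schedule against each feed, something like this would keep items around after they scroll off the bottom of the feed itself, since nothing is ever deleted from the table.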

The benefits: A search engine could let you do searches against the archived feeds — and could display the article summaries that are included in the feeds themselves, guaranteeing that these summaries would be appropriate and relevant.

You could display all matching results, along with their summaries (or full text, where available), on a single page, making it much easier to scan the results. And you’d have the links back to the archival versions, where you could see each post in its full, formatted glory.

You could add search criteria such as date of publication, letting you retrieve all matching posts from a specific year or month.

You could search a single blog, you could search several specific blogs at once, or you could search all indexed blogs — and in each case, all matching results would appear on a single page.
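Here is a minimal sketch of the kind of search described above, assuming the archived feed items are already loaded as simple records. The `Post` record, the `search` and `render` functions, and their parameters are hypothetical names of my own, and the matching is a naive case-insensitive substring test rather than anything resembling real search-engine ranking.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Post:
    blog: str
    title: str
    summary: str   # the summary supplied by the feed itself
    link: str      # permalink to the archived post
    published: date


def search(posts, term, blogs=None, year=None, month=None):
    """Return matching posts, newest first.

    blogs=None searches every indexed blog; pass a set of names to
    restrict the search to one blog or to several specific blogs.
    """
    term = term.lower()
    hits = []
    for p in posts:
        if blogs is not None and p.blog not in blogs:
            continue
        if year is not None and p.published.year != year:
            continue
        if month is not None and p.published.month != month:
            continue
        if term in p.title.lower() or term in p.summary.lower():
            hits.append(p)
    return sorted(hits, key=lambda p: p.published, reverse=True)


def render(hits):
    """Put every hit, with its summary and permalink, on a single page."""
    return "\n\n".join(
        f"{p.title} ({p.blog}, {p.published.isoformat()})\n{p.summary}\n{p.link}"
        for p in hits
    )
```

Calling `search(posts, "weblogs", blogs={"alpha"}, year=2002)` would then return only the 2002 posts from a blog indexed as "alpha", and `render` would lay them out as one scannable page of summaries and links.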

Is anyone doing this now? I’d love to hear about it.