blog.kfish.org

My name is Conrad Parker, and I live in Kyoto, Japan. I am working towards a PhD in Computer Science at Kyoto University, finishing September 2009. I also work on some free software projects including the Sweep sound editor and the Annodex media system, and various smaller projects which you can read about here.

Monday, 28 March 2005

Reading Project Gutenberg

Project Gutenberg provides "free plain vanilla electronic texts", readable by "both humans and computers", scanned or copied from thousands of public domain books. The emphasis on plain text is great for the archive, but I'd like to browse and read these texts with something prettier than plaintext formatting.

Debian contains a fairly old Qt program called gutenbrowser, which unfortunately has a fairly critical bug: it provides an interface for browsing the online catalog to choose a book (Screenshot), but reliably crashes while retrieving your selection. There was also gutenbook, which lacked any browsing of the online catalog and only ever claimed to be a prototype (Screenshot); it was recently removed from Debian unstable. Of course, the ideal would be some Python classes to manage Project Gutenberg ebooks -- something that could interface with Twisted and support a modern user interface. There are a few such projects floating around, like pybook, but they all seem to be in a "currently broken" phase.

Why are there so many broken Project Gutenberg ebook readers? What can possibly be so hard about reformatting plaintext in a nice graphical interface? There are a couple of reasons for this recurring brokenness. One is that the books do not follow a strictly machine-readable format, so parsing meta-information from them is a common problem. This problem persists, and can perhaps only be properly handled by ad-hoc methods.
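To give a feel for what "ad-hoc" means here, below is a rough sketch in Python of pulling meta-information out of the top of a book. The labels it looks for (Title, Author, Language, Release Date), the sample text, and the name parse_header are all assumptions made for illustration; plenty of real texts use other layouts or omit these lines entirely, which is exactly the problem.

```python
import re

# Assumed header labels; real texts vary widely and many lack these lines.
FIELD_RE = re.compile(r"^(Title|Author|Language|Release Date):\s*(.+?)\s*$")


def parse_header(text, max_lines=100):
    """Collect 'Label: value' pairs from the first max_lines lines of a text."""
    meta = {}
    for line in text.splitlines()[:max_lines]:
        match = FIELD_RE.match(line)
        if match:
            # Keep only the first occurrence of each label.
            meta.setdefault(match.group(1), match.group(2))
    return meta


# A made-up header for illustration, not copied from any real ebook.
SAMPLE = """\
The Project Gutenberg EBook of An Example Book, by A. N. Author

Title: An Example Book

Author: A. N. Author

Language: English

Chapter 1
"""

if __name__ == "__main__":
    print(parse_header(SAMPLE))
```

A real reader would probably need a stack of fallbacks behind this (guessing the title from the first line, for instance), and would still have to cope with books where nothing matches.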

Another problem is that the catalog of available books used to be available only as a difficult-to-scan text file, which made it hard to build a reliable, searchable catalog. However, Project Gutenberg now provides machine-readable feeds of the catalog, including both a daily RSS feed and a daily dump of the full catalog in RDF/XML. The keen developer could also make use of the full-text search of the collection.
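As a sketch of how little code a machine-readable feed needs, here is a Python fragment that reads the items out of an RSS 2.0 document and does a naive title search. The sample feed is invented and only uses the standard RSS item/title/link elements; I haven't checked which fields the actual Project Gutenberg feed carries, and the names read_items and search are purely illustrative.

```python
import xml.etree.ElementTree as ET

# An invented RSS 2.0 feed for illustration; the real feed may differ.
SAMPLE_RSS = """\
<rss version="2.0">
  <channel>
    <title>Example catalog feed</title>
    <item>
      <title>An Example Book by A. N. Author</title>
      <link>http://example.org/etext/1</link>
    </item>
    <item>
      <title>Another Story by B. Writer</title>
      <link>http://example.org/etext/2</link>
    </item>
  </channel>
</rss>
"""


def read_items(rss_text):
    """Return a list of (title, link) pairs, one per <item> in the feed."""
    root = ET.fromstring(rss_text)
    return [
        (item.findtext("title", default=""), item.findtext("link", default=""))
        for item in root.iter("item")
    ]


def search(items, term):
    """Case-insensitive substring match on item titles."""
    term = term.lower()
    return [(title, link) for title, link in items if term in title.lower()]


if __name__ == "__main__":
    for title, link in search(read_items(SAMPLE_RSS), "story"):
        print(title, link)
```

The full RDF/XML dump would need namespace-aware parsing and some care with memory, but the general shape would be similar.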

So, if you're a hacker who enjoys reading, this could be a fine area to consider. Apart from simply having a nice desktop reader, there's a lot more that could be done with the catalog. If you're already working on it, or if I've missed something out in the above survey, I'd like to hear from you.
