
Python, XML, and databases

People who read this blog regularly (all two of you) might know that my day job involves lots of “enterprise” level technology such as Java, JEE, Spring, heavy-duty messaging systems, and web services, both of the limber REST species as well as the, well, lumbering SOAP type. The more attentive reader of the pair of you might also be aware that I am doing a PhD in the decidedly non-comp.sci discipline of Classics & Ancient History. This has given me the opportunity to combine these two areas in what’s called “Digital Classics” (a species of Digital Humanities), which is the application of computer technology to solving problems in Classics.

And although it’s not in any way part of my PhD (which is a rather traditional historical-literary type of Classics thesis), I’ve also been using this combination to expand my technical areas of expertise, i.e., by learning new languages. My first effort in that area was the simple Latin word parsing tool and dictionary, LatinOWL, written in Objective-C and running on iOS devices, available on the App Store (for free). For my second effort, which I will discuss in this and some future posts, I’ve been using Python and Django, the Python-based web framework, to build the “crowd-sourced” commentary web app for ancient texts, De Commentariis. It’s my intention to use this blog to document and share the technical experiences I’ve had in learning the Python language and the Django framework, re-engaging with web application design, figuring out issues of application hosting, and the like.

XML data sources

The first problem I faced building De Commentariis was how to access the XML format of the original text data sources. The source of most of my primary data (the ancient texts) would be the TEI-schema XML files from the Perseus Digital Library of ancient texts. I work a lot with XML in my day job. Rather than decompose the XML documents into a relational database, or even a non-relational datastore, I decided, for the most part, to work directly with the raw XML files.

I used Python’s lxml extension to read the XML with XPath queries. There are two parts to the XML parsing. The first is the “off line” processing of the XML data to extract metadata about the document, e.g. title and author. This stage also involves an examination of the TEI XML headers to discover the underlying document structure. Each ancient text has a different structure. Some works are organised by a single dimension, e.g. by chapter. Others have a more complex division imposed onto them, for example, by book, chapter, and section. Poems are typically organised by book and line, or book, poem, and line, and so forth. My pre-parser reads the structure data from each parsable file, then counts through the sections to determine how many of each division the document contains, and stores the extracted metadata, as well as the newly built table of contents, into a database structure.
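A minimal sketch of what that pre-parsing stage might look like with lxml, assuming TEI P5 files whose citation structure is declared with refState elements inside a refsDecl (the element names and paths here are illustrative; real Perseus files vary between TEI versions):

```python
# Hypothetical pre-parser sketch: extract title, author, and the declared
# citation structure from a TEI file. Paths assume TEI P5 conventions.
from lxml import etree

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

def extract_metadata(path):
    tree = etree.parse(path)

    # Title and author live in the teiHeader.
    title = tree.xpath("//tei:teiHeader//tei:titleStmt/tei:title/text()",
                       namespaces=TEI_NS)
    author = tree.xpath("//tei:teiHeader//tei:titleStmt/tei:author/text()",
                        namespaces=TEI_NS)

    # The refsDecl describes the document's divisions, e.g. book/chapter/section.
    units = tree.xpath("//tei:encodingDesc//tei:refsDecl//tei:refState/@unit",
                       namespaces=TEI_NS)

    return {
        "title": title[0].strip() if title else None,
        "author": author[0].strip() if author else None,
        "structure": list(units),   # e.g. ['book', 'chapter', 'section']
    }

if __name__ == "__main__":
    print(extract_metadata("some_perseus_text.xml"))
```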

However, once a user selects a text to view, the stored metadata for that text is used to apply an XPath query which extracts the data directly from the original XML. I have found that XPath exhibits perfectly good performance in this case, and seems to cope well under what little load-testing I have subjected the application to.
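A sketch of that runtime lookup under the same assumptions, with the text body marked up as nested div elements whose n attributes carry the reference numbers (again, the markup details are assumed rather than taken verbatim from the real files):

```python
# Hypothetical runtime lookup: turn a citation like ['1', '1', '1'] into an
# XPath that descends one <div> per citation level and return its text.
from lxml import etree

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

def get_passage(path, citation):
    """citation is a list like ['1'] or ['1', '1', '1']."""
    tree = etree.parse(path)
    # Build an XPath that descends one div per citation level.
    steps = "".join("/tei:div[@n='%s']" % n for n in citation)
    nodes = tree.xpath("//tei:body" + steps, namespaces=TEI_NS)
    # Return the raw text content of the matching division(s), if any.
    return " ".join(node.xpath("string()") for node in nodes)

# e.g. get_passage("some_perseus_text.xml", ["1", "1", "1"])
```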

I rejected the idea of putting all the text data directly into the database for a number of reasons, but mostly because it would involve either a complex system to glue the text sections back together, or duplicating the sections in the database. The latter would vastly increase my disk usage footprint, because the texts can be accessed at multiple levels. Consider that a text which uses the book, chapter, section style of division can be accessed either as “1” (the whole of book 1), “1,1” (book 1, chapter 1), or “1,1,1” (book 1, chapter 1, section 1). And for some texts there is text attached at, say, the book level, without that text belonging to any sub-division, so it’s no good to store all the leaf sections and concatenate them when someone asks for a parent.
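To illustrate that last point, here is a hedged sketch, using the same assumed TEI markup as above, of how one might detect text that a parent division carries outside of any of its child divisions; it is exactly this “orphan” text that a store of leaf sections would lose:

```python
# Hypothetical check: does a <div> contain text that belongs to no sub-div?
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

def has_direct_text(div):
    """True if this <div> has text content outside any child <div>."""
    # Immediate text nodes of the div, plus text inside non-div children
    # (headings, paragraphs), but nothing nested inside sub-divisions.
    direct = div.xpath("text() | *[not(self::tei:div)]//text()",
                       namespaces=TEI_NS)
    return any(chunk.strip() for chunk in direct)

def books_with_direct_text(tree):
    # Book-level divisions that carry their own text.
    books = tree.xpath("//tei:body/tei:div", namespaces=TEI_NS)
    return [div.get("n") for div in books if has_direct_text(div)]
```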

The commentary items that users enter, and all the other management data, like voting data, are currently stored in a traditional database. I’ve been thinking about throwing away as much of the SQL database as possible and using raw XML to store as much of the data as I can, but for the moment this idea is on the backburner while I add features more directly useful to my users.
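For a rough idea of what that database side might look like, here is a purely hypothetical Django model sketch; the actual models and field names in De Commentariis may well differ:

```python
# Hypothetical Django models for user commentary and votes; illustrative only.
from django.contrib.auth.models import User
from django.db import models

class CommentaryItem(models.Model):
    text_id = models.CharField(max_length=255)    # which text the note belongs to
    reference = models.CharField(max_length=64)   # e.g. "1.1.1"
    author = models.ForeignKey(User, on_delete=models.CASCADE)
    body = models.TextField()
    created = models.DateTimeField(auto_now_add=True)

class Vote(models.Model):
    item = models.ForeignKey(CommentaryItem, on_delete=models.CASCADE)
    voter = models.ForeignKey(User, on_delete=models.CASCADE)
    value = models.SmallIntegerField()            # +1 / -1
```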

People have asserted to me that “databases are faster than files” but I think this isn’t strictly true (in fact I know it’s not). A database’s i/o is typically constrained by the disc speeds that underlie it (which is why a lot of high-performance database tricks aim to serve as much of the data as possible from RAM rather than disc). A database gets its performance by enabling quick access to data via indices, which use additional disc space to store pointers to the location of each record in the main database file. Alternatively, and much better for large datasets, there are the distributed data-processing techniques headlined by Google’s MapReduce. But if you know the key by which you’re going to directly access the dataset (i.e. the filename), then a disc-bound database is never going to be faster than a good filesystem. Perhaps in the future I should replace the database with a file-based structure and move the remaining pre-parsed metadata off into a MapReduce-style system (possibly Amazon Elastic MapReduce).
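The “filesystem as key-value store” point boils down to something like the following sketch (the paths and names are illustrative, not the app’s actual layout): if the key is known up front, a direct open() needs no index lookup at all.

```python
# Hypothetical direct-access-by-key sketch: the text identifier maps straight
# onto a file name, so the filesystem's directory lookup does the job that an
# index does in a database.
import os

TEXT_ROOT = "/srv/decommentariis/texts"   # illustrative path

def load_raw_xml(text_id):
    with open(os.path.join(TEXT_ROOT, text_id + ".xml"), "rb") as handle:
        return handle.read()
```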

However, for the moment, a SQL database (Postgres, for the record, my favorite open source relational database) was the simplest possible thing that could work.

In my next post, I hope to write about my choice to use Python 3.3 rather than 2.x, and how that affected my hosting choices to the point that I put the app on Amazon Web Services, a decision I have not (yet) once regretted.