Qball's Weblog

Database in a large XML file.

Tags General  gmpc 

I wrote a jamendo plugin for gmpc a while back. To get the data I used the json interface. This is slow and not very flexible. (because of this, the plugin was not very usable).

Downloading the database would be better, but it is very large. 4.8mb zipped, 36mb unzipped and around 370 mb fully loaded using libxml.

Last night/this morning I managed to reduce this.

**Database **

Using the xml file as “database” for the jamendo plugin is obviously not the right choise. So I switched to sqlite, I download the xml file, parse it into an sqlite db. So the normal memory usage is not a problem anymore.

But I still see a huge (370mb) spike when parsing the xml file.

Reading subdocuments

To reduce this spike I have to parse the xml file in chunks.

The database (with more then 8000 artist) is structured like this.

?View Code XML

(sorry wordpress is screwing this up)
So parsing the file artist for artist, should be a nice solution.

Libxml has a nice system for this, xmlReader. xmlReader allows you to step through your xml file withouth completely parsing it and building an in-memory tree. If you are in a node, you can tell it to parse the subtree. So stepping throught the xml file, and making it parse the ““  tags in a subtree, reduces the maximum memory usage by libxml to 1mbyte. ( see http://pastebin.ca/1306980 (will be offline in 30 days ))

So the total memory spike now dropped to +- 38 mb. Namely the in memory xml file.
Unzipping on the fly**

The xmlReader has an xmlReaderForIO this allow you to specify a Read and Close function. The read function request a buffer of X bytes and does a linear read. So a few lines later I had xmlReader calling zlib that uncompresses  the needed part on the fly. This reduces the memory usage to 4.8 mb of the in-memory compressed xml file + 1mb of that libxml uses.

So say around 6 mb compared to 370 mb, that is a nice improvement.

(in all the above memory usages I haven’t taken the memory used by sqlite in account.)

So a several hours hacking and debugging (mostly spend on something stupid and obvious) I managed to reduce the memory usage more then 50 times.  Using the sqlite database also has the nice advantage that you can now search jamendo from gmpc’s Search browser. (see http://images.sarine.nl/gmpc_jamendo_seach.png)

I will put the code online soon, when I have it cleaned up.