My mission was simple - to crawl all of our artist databases, extract names and email addresses from a couple of different tables, and spin out some custom XML.
The tables - one for complete user profiles and another for users who have signed up to receive emails only - may or may not exist, depending on the artist.
And the XML is to be consumed by an outside marketing vendor, for use in generating email campaigns. (We do some of our own email marketing, but most of our bulk mailing is handled by this third party.) So I got no spec - didn't need one, really - just some sample output.
Writing up a quick XML generator with minidom was a piece of cake.
But then I realized that some of these artists have more than a million records between their combined mailing list and profile tables. We have at least two artists with more than a million user records in each one. (In the end, some of the XML files I'm generating are as large as 40Mb.)
So that's obviously dealing with waaay too much data to keep in memory and throw to a prettily-formatted XML file all at once. It hummed along nicely for some of our artists with, ahem, smaller followings, but once I got to some of the heavy-hitters the whole process began to bog down.
(For the record - morbid curiosity and all - I did try writing each record to the file one at a time. And yes, it was as stupid and dog-slow as you might imagine it to be.)
What I needed was to be able to stream results to an XML output file. So I looked at some of the other XML packages out there (against the recommendations of a boyfriend who, I might add, is very opinionated when it comes to which XML libraries suck and which don't).
From all the benchmarking results I could find, lxml and libxml2 seem to perform pretty well when it comes to parsing.
But parsing is not my problem - I'll let someone on the receiving end worry about that. My only concern is creating it as simply and quickly as possible, and I didn't find a lot in the way of results for generating XML.
So I decided to try cElementTree. It wasn't an entirely arbitrary decision - not needing to install anything new definitely swayed me (it's been part of Python since 2.5).
I didn't do any formal benchmarking of my own, just got some approximate numbers - it looks like I'm covering my total 3.5 million records at 0.13 ms per. So, considering what I was getting out of minidom, cElementTree is lightning fast.
A few other notes on odd things you might see in the script below:
#!/usr/bin/python # usage: ./mylittlexmlscript.py import shutil, sys, MySQLdb as Database from xml.etree import cElementTree as ET # the completed files should be moved out of the way as they're generated # well, sure, they don't *have* to be - I just like keeping things neat and tidy processed = "/someplace/toput/allthat/xml" db = Database.connect(host="myhostname", port=0000, user="myusername", passwd="mypassword", db="") cursor = db.cursor() def walktheserver(): """ walk the server, generate a list of dbs """ sql = """SELECT table_schema FROM information_schema.tables GROUP BY table_schema""" cursor.execute(sql) dbnames = cursor.fetchall() return dbnames def gettables(dbname): """ return all the table names for a specific db """ tables =  sql = """SHOW TABLES FROM %s"""%(dbname) cursor.execute(sql) rows = cursor.fetchall() for table in rows: ltable = list(table) tables.append(ltable) return tables def generate_xml(dbname): """ check to see if the db has either mailing list or user table and generate the xml accordingly """ tables = gettables(dbname) if "mailinglist" in tables: create_xml(dbname, 'ml') # ml = mailing list if "users" in tables: create_xml(dbname, 'up') # up = user profiles def create_xml(dbname, type): """ here's where we're actually going to put cElementTree to work """ outputfile = dbname+"_"+type+".xml" # name the output file BSWrapper = ET.Element("BSWrapper") # create the root element of the XML tree Header = ET.SubElement(BSWrapper, "Header") # create the first subelement, and give it a few attributes Header.set("AccountID", "996") Header.set("ListID", "885") Header.set("UserID", "774") sql = """USE %s"""%(dbname) cursor.execute(sql) if type is 'fcm': sql = """SELECT fcm.email AS Email, fcm.first_name AS FirstName FROM mailinglist AS fcm GROUP BY fcm.email ORDER BY Email""" if type is 'w3t': sql = """SELECT w3t.U_Email AS Email, w3t.U_First_Name AS FirstName FROM users AS w3t GROUP BY Email ORDER BY Email""" cursor.execute(sql) rows = cursor.fetchall() for record in rows: data = ET.SubElement(BSFImport, "data") Email = ET.SubElement(data, "Email") Firstname = ET.SubElement(data, "Firstname") try: email = record.encode('iso-8859-1') except AttributeError: email = '' except UnicodeDecodeError: email = '' try: name = record.encode('iso-8859-1') except AttributeError: name = '' except UnicodeDecodeError: name = '' Email.text = email Firstname.text = name tree = ET.ElementTree(BSFImport) tree.write(outputfile) shutil.move(outputfile, processed) def main(): dblist = walktheserver() # they're not "bad", per se - I would never pass that kind of judgement on a database # these are just the dbs to be excluded from consideration badlist = ['information_schema', 'artist1', 'artist2', 'artist3'] for dbname in dblist: if dbname not in badlist: # print dbname newxml = generate_xml(dbname) cursor.close() db.close() if __name__ == "__main__": main()
For the record, the XML output looks like this - no pretty formatting, alas, but at least all the nodes are where they ought to be, every element accounted for. If formatting is a problem for the vendor, I'll deal with it in a few days and let you know what I came up with:
<BSWrapper><Header AccountID="996" ListID="885" UserID="774" /><data><Email>firstname.lastname@example.org</Email> <Firstname>christian</Firstname></data><data><Email>email@example.com</Email><Firstname>Daniel</Firstname> </data></BSWrapper>