MS Excel and Python's csv module

Python    2012-06-29

You may have noticed that I haven't posted in quite a while. Almost a year ago I started on some API work for an upcoming iPad app for a recording artist that has taken most of my focus and left me with not much time to participate in other projects. Unfortunately I'm not at liberty to talk about the specifics yet, as it is not due to release until the end of July.

Aside from coding the API, I've been doing a lot of database work on this project - schema changes and occasional massive record inserts/updates. In line with that, I'm working with a group on the business side who don't have much experience with how data is structured. I spend a lot of time cleaning up poorly formatted spreadsheets and turning the data rows into something that can easily be inserted into a database - against my will and better judgement, I've gotten very familiar with Excel.

Last week I got a spreadsheet that looked like this, with set list data from several years' worth of concerts:

And I needed to turn it into something like this (with a few other fields not shown here - basically, I needed to break the list of songs in a single cell into multiple rows of individual songs, with each song retaining the date and id metadata from the original row):

The songs in each cell were separated by a comma - I used this awesome Excel function to replace those with newlines (strictly optional - I just wanted to get a quick visual of how many lines of data I would be dealing with).

	=SUBSTITUTE(A1,",",CHAR(10))

The list started out hundreds of lines long - broken out, it became tens of thousands of lines of song data.

To get to what I needed, my first step was to export from Excel to a .csv. The script I ran against the delimited file uses Python's csv module to read each line, then give each element from each cell its own row that contains the date and other metadata from the original line.

The script is stupid simple, I really just wanted a place to save it:

#!/usr/bin/python

# usage: ./parserone.py

import csv

# the source file
infile = '~/set_lists.csv'
# the target file for all those new lines
outfile = '~/new_song_records.csv'

def parse(infile):
    ofile  = open(outfile, "wb")
    writer = csv.writer(ofile)

    # open the source file in universal-newline mode
    with open(infile, 'rU') as f:
        reader = csv.reader(f)
        # for each row in the file
        for row in reader:
            # the last cell is the one with the song data
            # split on newline character to generate multiple elements
            songs = row[5].split("\n")
            for song in songs:
                # each of those new elements gets a row
                # with the metadata from the original row prepended
                line = [row[0], row[1], row[2], row[3], row[4], song.lstrip()]
                # write to the target file
                writer.writerow(line)
    # done
    ofile.close()

def main():
    parse(infile)

if __name__ == "__main__":
    main()