Coding for Good: Working with the Sunlight Labs APIs

Python    2012-08-29

If you're looking to get your feet wet when it comes to working with open U.S. government data, I can think of no better place to start than with the Sunlight Labs APIs. They're not kidding when they say that using their APIs is absurdly easy.

Sunlight Labs is a project of the Sunlight Foundation, an organization that has been working for several years to open up public government data** - the kind of data that is freely available on state and federal web sites, but that is buried behind a Byzantine series of links or is just poorly formatted for analytical use. Sunlight has done the hard work of finding and collecting that data, and Sunlight Labs has created the tools that make it accessible for all of us to use.

**(Their other projects include Sunlight Reporting Group, Sunlight Live and the Open House Project.)

Currently they provide five APIs accessible with Python:

  1. Sunlight Congress API: returns information about legislators at the federal level
  2. Open States API: exposes similar information at the state level
  3. Capitol Words API: gives you a look at the most-used words in Congressional sessions
  4. Transparency Data API: specific data sets, such as campaign contributions and lobbying records
  5. Real Time Congress API: data such as floor updates, committee hearings, floor video, bills, votes, amendments, and various documents

This script example uses the openstates API to

  • get all available data about legislators at the state level
  • parse out only what's needed to do a summary count of party affiliations per state
  • and return that information as:
    • a JSON object that can be used for visualizations
    • a table suitable for embedding into an HTML page

To use any of the libraries, you'll first need to get an API key:

It only took a few minutes for my key to arrive by email. Once you've got it, you have a few options for setting it (I used ~/.sunlight.key):
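
As a minimal sketch of the file-based option (the default path, and the `sunlight.config.API_KEY` alternative, are per my reading of the python-sunlight docs - double-check against the current README):

  # Option 1: drop the key into the file the module checks by default.
  # 'your-api-key-here' is a placeholder, not a real key.
  import os

  with open(os.path.expanduser('~/.sunlight.key'), 'w') as f:
      f.write('your-api-key-here')

  # Option 2 (set it in code, before making any calls):
  # import sunlight
  # sunlight.config.API_KEY = 'your-api-key-here'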

Then install the sunlight module (this won't apply to the Transparency Data and Real Time Congress APIs) using either pip install or checking out the project from Github:

With all that done, you're ready to go. Let's pop open an interpreter and play around with the given example:

>>> import sunlight
>>> nc_legs = sunlight.openstates.legislators(state='nc')

As you'll see, this returns a list of dicts, each dict containing a lot of publicly available information - such as name, district, office address, party affiliation, and in some cases even a picture - about each state legislator in North Carolina:

>>> nc_legs
[{u'leg_id': u'NCL000242', u'first_name': u'Barbara', 
u'last_name': u'Lee', u'middle_name': u'', 
u'district': u'12', u'chamber': u'lower', 
u'created_at': u'2012-08-10 02:06:05', u'updated_at': u'2012-08-29 02:09:04', 
u'email': u'', u'+notice': u'[\xa0Appointed\xa008/06/2012\xa0]', 
u'state': u'nc', u'offices': [{u'fax': None, u'name': u'Capitol Office', 
u'phone': u'919-733-5995', 
u'address': u'NC House of Representatives\n300 N. Salisbury Street, Room 613\n\nRaleigh, NC 27603-5925', 
u'type': u'capitol', u'email': None}], 
u'full_name': u'Barbara Lee', u'active': True, u'party': u'Democratic', 
u'suffixes': u'', u'id': u'NCL000242', 
u'photo_url': u''}, 
...]

One simple but powerful API call and we've already got so much information at our fingertips. So what can we do with all that data? Well, since the ultimate goal is to get a count of party affiliations per state, let's start by creating a list of state abbreviations. Then for each state in that list, we can make the same API call to get all the legislative data, and write a subset of that data - the state, the representative's full name, and their party affiliation - to a new dict.

import sunlight

states = ["AL", "AK", "AZ", "AR", ...]

def find_state_reps():
    # Start by instantiating the new dict:
    statereps = {}

    for s in states:
        legs = sunlight.openstates.legislators(state=s)
        # If you print 'legs', you'll see a dict with loads of
        # contact information for each state representative.
        # For my purposes, I'm only collecting name and
        # party affiliation.

        # This dict will hold {name: party} pairs for each state
        l = {}
        for leg in legs:
            name = leg['full_name']
            try:
                party = leg['party']
            except KeyError:  # In some cases, 'party' is missing
                party = None
            l[name] = party
        statereps[s] = l

    # At this point, the 'statereps' dict contains:
    # {'state': {'rep_name': 'party_affiliation'}}
    # for each state.

But you know what? Sunlight Labs is providing this API as a free resource, and I don't want to take advantage of their hard work by pounding their servers with a new set of 50 requests every time I run this script. So I'm going to write the dict to a file so that data doesn't have to be pulled from the API again.

    outfile = 'state_reps_list.txt'
    f = open(outfile, 'w')
    f.write(str(statereps))
    f.close()

Now, as I'm developing, I can just check to see if I have that file in place and use the dict from there. And when it's time to refresh the data, I can just delete the file and hit the API again to rebuild the statereps dict from scratch:

    import os.path

    # If we've already got the list stored in a file,
    # just refer to that file
    # instead of hitting the API again:
    if os.path.exists(outfile):
        # Get the file content and return it as the statereps dict
        f = open(outfile, 'r')
        statereps = eval(
        f.close()
    else:
        # Hit the API for the data
        statereps = find_state_reps()

    return statereps
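
Incidentally, eval() on a file you wrote yourself works, but the json module is a safer drop-in for caching the same dict. A sketch of the same check-the-file-first pattern (the file name and function name here are my own, not from the script):

  import json
  import os.path

  CACHE = 'state_reps_list.json'  # any path works

  def load_or_fetch(fetch):
      # Reuse the cached copy if it exists...
      if os.path.exists(CACHE):
          with open(CACHE) as f:
              return json.load(f)
      # ...otherwise call the (expensive) fetch function and cache the result.
      data = fetch()
      with open(CACHE, 'w') as f:
          json.dump(data, f)
      return data

Here `fetch` would be `find_state_reps`, and deleting the cache file forces a refresh, exactly like the eval-based version.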

My statereps dict looks something like this, but obviously contains a lot more data:

 'WA': {u'Bruce Chandler': u'Republican', u'Derek Kilmer': u'Democratic', ...}, 
 'WV': {u'Mike Green': u'Democratic', u'Mark Wills': u'Democratic', ...}, 

Now I can pass that data into another function that returns the summary count of party affiliations among state legislators per state (e.g., state: dems=x, repubs=y, other=z):

import re

def partycount(reps_dict):

  partycount = {}

  for s in reps_dict:

    # Create lists to hold the party members on a per-state basis:
    demlist = []
    replist = []
    otherlist = []

    for k in reps_dict[s]:
      # s -> state abbreviation
      # k -> full name
      # reps_dict[s][k] -> party affiliation

      if reps_dict[s][k]:
        # Use the re module to determine if either of these strings
        # appears in the party affiliation value

        dem ='Dem', reps_dict[s][k])
        rep ='Repub', reps_dict[s][k])

        # And funnel those values into the appropriate lists
        if dem:
          # If the legislator's party affiliation contains the substring 'Dem',
          # add their name to the 'dem' list:
          demlist.append(k)
        elif rep:
          # If the legislator's party affiliation contains the substring 'Repub',
          # add their name to the 'rep' list:
          replist.append(k)
        else:
          # If neither substring appears in the legislator's party affiliation,
          # add their name to the 'other' list:
          otherlist.append(k)
      else:
        # No recorded party affiliation at all also counts as 'other':
        otherlist.append(k)
    c = {}
    # Get the length of each list and you have a count of
    # dems vs. repubs vs. other for this state:
    c['Democrats'] = len(demlist)
    c['Republicans'] = len(replist)
    c['Other'] = len(otherlist)
    partycount[s] = c

  return partycount

And now we've got (yet another) dict that looks like this:

 'WA': {'Republicans': 64, 'Other': 0, 'Democrats': 83}, 
 'DE': {'Republicans': 22, 'Other': 0, 'Democrats': 40},
 'DC': {'Republicans': 0, 'Other': 2, 'Democrats': 10},
 'WI': {'Republicans': 74, 'Other': 1, 'Democrats': 55}, 

Before I return that partycount dict, I can insert this somewhat ugly bit of code into the function to generate an HTML page with all that data embedded in a table:

  # This count data could just as easily be output as
  # a template context object, or printed to stdout
  output = "<html><body><table>"
  output += "<tr><td><b>STATE</b></td><td><b>Republicans</b></td> \
<td><b>Democrats</b></td><td><b>Other</b></td></tr>\n"

  # Let's sort the keys while we're at it, 
  # so the states appear in alphabetical order:
  for key in sorted(partycount.iterkeys()):
    output += "<tr><td align='center'>%s</td>" % (key)

    # Iterate over a fixed key order so the columns always match the
    # header (plain dict iteration order isn't guaranteed):
    for k in ('Republicans', 'Democrats', 'Other'):
      output += "<td align='center'>%s</td>" % (partycount[key][k])

    output += "</tr>\n"
  output += "</table></body></html>"

  f = open('redvblue.html', 'w')
  f.write(output)
  f.close()

One other thing - I can also take that first statereps dict and convert it to json - that might be handy for doing visualizations down the road:

import simplejson as json

def converttojson(reps_dict):
  """Take a dict object and convert it to JSON."""
  result = json.dumps(reps_dict, sort_keys=False, indent=4)
  return result
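
A quick round-trip check of that function - using the stdlib json module, which has the same dumps/loads interface if simplejson isn't installed (the sample dict is made up):

  import json

  def converttojson(reps_dict):
      """Take a dict object and convert it to JSON."""
      return json.dumps(reps_dict, sort_keys=False, indent=4)

  sample = {'WA': {'Bruce Chandler': 'Republican'}}
  as_json = converttojson(sample)

  # The string parses back into an equivalent dict:
  assert json.loads(as_json) == sample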

Some resources for doing visualizations with the resulting JSON object:

Here are a few more things that I could see adding to this script:

  • Add a line reading "Data current as of [date]" to the top of the html - use the filesystem date of the 'state_reps_list.txt' file (or the current date if you're getting the data fresh from the API):
  • Get unemployment data (source: US Department of Labor, Bureau of Labor Statistics) and compare on a per-state basis to see if there is any correlation between unemployment rates and dominance of any particular party at the state level:

  • Use the Transparency Data API to see how campaign contributions compare from state to state
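
The first of those ideas is nearly a one-liner with os.path.getmtime; a sketch (the function name is my own):

  import datetime
  import os.path

  def data_as_of(path):
      # Use the cache file's modification time if it exists;
      # otherwise the data came fresh from the API, so use "now":
      if os.path.exists(path):
          ts = datetime.datetime.fromtimestamp(os.path.getmtime(path))
      else:
          ts =
      return "Data current as of %s" % ts.strftime('%Y-%m-%d %H:%M:%S')

The returned line could be prepended to the HTML output just before the table.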

My complete script, minus the changes mentioned above (which I have already implemented locally), can be found here:

And incidentally, here's that table output:


State legislative data current as of 2012-08-29 11:48:26