Being the nerds we are here, we try to look at even our smaller projects as a chance to play around with new ideas. A small project that just ran over the holiday weekend gave us a great opportunity to do just that.
The project was a collaboration between California Watch and the Sacramento Bee, which, for the first time, analyzed a selection of support and opposition data from California legislative analyses to get a sense of who the big legislative “winners” were this year and why. We also built a searchable database application to help readers dig into the details.
To parse the information from the free-text analysis files, we once again relied on Amazon's Mechanical Turk service and followed that up with a ton of quality control. But we've done that before – that's not the innovation. In addition, we incorporated three techniques that you don't often see in news apps.
Because this app was so simple on its face, we were looking for excuses to add a bonus feature to make things more interesting. We settled on using a simple recommendation algorithm to identify and recommend similar organizations to encourage further exploration of the dataset.
The algorithm we used comes straight out of Data Science 101. In fact, it's just another application of a technique CAR reporters have been using for a long time: the Pearson correlation coefficient.
Our approach took two steps. First, we represented each group's bill registrations numerically (-1 means oppose, 1 means support, etc.). Second, we ran a series of pairwise comparisons between groups' bill sets, using Pearson to identify the most similar organizations: about 25 million comparisons overall. We then pared the results down to about 7,000 significant comparisons that we actually use in the app.
The Python implementation of the algorithm comes from Chapter 2 of O'Reilly's Programming Collective Intelligence, by Toby Segaran. It's an extremely simple way to add value to many datasets and to give your users another layer of understanding and exploration.
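For readers who want to see the shape of the idea, here is a minimal sketch of the two steps described above. It is not the code from the book or from our app. The sample groups and bills are made up, and the min_shared and min_score cutoffs are assumptions standing in for whatever "significant" means for your dataset.

```python
from itertools import combinations
from math import sqrt

# Hypothetical sample data: organization -> {bill: position},
# using the encoding described above (1 = support, -1 = oppose).
positions = {
    "Group A": {"AB 1": 1, "AB 2": -1, "SB 3": 1, "SB 4": -1},
    "Group B": {"AB 1": 1, "AB 2": -1, "SB 3": 1, "SB 4": 1},
    "Group C": {"AB 1": -1, "AB 2": 1, "SB 3": -1, "SB 4": 1},
}


def pearson(a, b, min_shared=3):
    """Pearson correlation over the bills both groups registered on.

    Returns None when the score is undefined or not meaningful:
    too few shared bills (min_shared is an assumed cutoff), or one
    group took the same position on every shared bill (zero variance).
    """
    shared = sorted(set(a) & set(b))
    n = len(shared)
    if n < min_shared:
        return None
    xs = [a[bill] for bill in shared]
    ys = [b[bill] for bill in shared]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / sqrt(var_x * var_y)


def similar_pairs(positions, min_score=0.5):
    """Compare every pair of groups and keep the strongest matches.

    min_score is an assumed threshold for illustration only.
    """
    results = []
    for (name_a, a), (name_b, b) in combinations(positions.items(), 2):
        score = pearson(a, b)
        if score is not None and score >= min_score:
            results.append((name_a, name_b, score))
    return sorted(results, key=lambda row: row[2], reverse=True)


for name_a, name_b, score in similar_pairs(positions):
    print(f"{name_a} ~ {name_b}: {score:.2f}")
```

One thing to watch for with data like this: a group that only ever registers support has no variance in its positions, so Pearson is undefined for it. The sketch simply skips those pairs.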
This project also represents the first live implementation of our django-littlebro event tracking system (which is very much still an alpha product). Specifically, we're tracking the organizations people search for and click on, the bills that they're looking up, and how often they use the “similar groups” feature that we built using the recommender algorithm.
The information we gather will help us gauge how people respond to our recommendation system and take the pulse of our user base: How deeply do they explore the app? Which groups seem to be getting the most attention? As always, we're measuring these things anonymously and looking at the results in the aggregate.
Littlebro is now up and running on our projects server and dumping results to a MongoDB instance every few minutes, using the excellent Celery distributed task queue.
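The general pattern here is to buffer events, flush them in batches on a schedule, and then count them in the aggregate. The sketch below shows that pattern using only the standard library. It is not django-littlebro's actual API: EventBuffer, track, flush and aggregate are hypothetical names, a plain list stands in for the MongoDB collection, and the flush that a scheduled Celery task would perform is called by hand.

```python
import time
from collections import Counter


class EventBuffer:
    """Collects anonymous events in memory until they are flushed in a batch."""

    def __init__(self):
        self._events = []

    def track(self, event_type, target):
        # No user identifier is stored: only what was looked at, and when.
        self._events.append(
            {"type": event_type, "target": target, "ts": time.time()}
        )

    def flush(self, sink):
        """Hand the pending batch to sink and empty the buffer.

        Returns the number of events written.
        """
        batch, self._events = self._events, []
        if batch:
            sink(batch)
        return len(batch)


def aggregate(events):
    """Count events by (type, target), ignoring timestamps."""
    return Counter((event["type"], event["target"]) for event in events)


store = []  # stand-in for a database collection
buffer = EventBuffer()
buffer.track("org_search", "Hypothetical Org")
buffer.track("org_click", "Hypothetical Org")
buffer.track("org_click", "Hypothetical Org")

flushed = buffer.flush(store.extend)  # in production, a scheduled task would do this
print(flushed, "events flushed")
print(aggregate(store).most_common(1))
```

Batching keeps the tracking calls cheap during a page request and moves the database write into the background job.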
Data work aside, this app was designed and built primarily during two hour-long train rides. Django always helps to make backend development extremely simple, but this time Twitter's Bootstrap project served to do the same on the design side.
The app was designed to be embeddable in both our CMS and the Sacramento Bee's, and a set of flexible Bootstrap layouts made it simple to do so.