Wouldn't you know it, the nerds here at CIR this week actually managed to turn computer science into news.
Using MapReduce and our pairwise document similarity algorithm, we thought it would be fun to compare all the bills proposed in the California Legislature during the 2009-2010 legislative session – Republican Arnold Schwarzenegger's last session as governor – with those introduced this year, when Democrat Jerry Brown came back for his third go-round in the governor's office.
Sure enough, we found about three dozen examples of bills that Schwarzenegger had vetoed but that were reintroduced this year and signed by Brown. Not bad for a day's work.
It's not unusual for lawmakers to resurrect bills between sessions, especially when a new administration might be more likely to sign them. But easy-to-track markers like bill titles, sponsors and even key elements of the language often change, making resurrected bills difficult to spot without an insider's knowledge of what's going on. We wanted to see whether our algorithm would have better luck.
Our comparison system used TF-IDF weighting and cosine similarity to score how alike two documents are, an approach that comes with a list of caveats as long as your arm but generally does a pretty good job of identifying similar documents.
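For readers who haven't seen TF-IDF and cosine scoring before, here is a minimal sketch of the idea in plain Python. It is not our production code: the whitespace tokenizer, the smoothed IDF formula and the three sample "bills" are all assumptions made up for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Crude whitespace tokenizer; a real pipeline would also strip punctuation.
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse {term: weight} vector per document."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF (an assumption): rarer terms get larger weights,
        # and terms found in every document still keep a weight of 1 each.
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors (1.0 = same direction)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical bill texts, invented for this example.
docs = [
    "an act relating to school funding and teacher pay",
    "an act relating to teacher pay and school funding",
    "an act relating to water rights in the central valley",
]
vectors = tfidf_vectors(docs)
reworded = cosine_similarity(vectors[0], vectors[1])
unrelated = cosine_similarity(vectors[0], vectors[2])
```

The first two texts contain the same words in a different order, so they score as near-identical, which hints at one of the caveats: this kind of scoring ignores word order entirely.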
The process, which involved about 60 million document comparisons, took about three hours to run on four mid-size Amazon EC2 instances. In other words, it was pretty much MapReduce or nothing if we wanted to get it done.
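To give a feel for why this fits MapReduce, here is a rough in-memory sketch of one common way to structure the job: the map step emits each document's term weights keyed by term, and the reduce step multiplies weights for every cross-session pair that shares a term, then sums the partial products. This is an illustration of the general pattern, not necessarily how our job was laid out, and the document IDs, session labels and vectors are hypothetical.

```python
from collections import defaultdict
from itertools import product


def map_postings(doc_id, session, vector):
    # Map: emit (term, posting) so documents sharing a term meet in one reducer.
    for term, weight in vector.items():
        yield term, (doc_id, session, weight)


def reduce_term(postings):
    # Reduce: for one term, pair every old-session doc with every new-session doc.
    old = [(doc_id, w) for doc_id, session, w in postings if session == "old"]
    new = [(doc_id, w) for doc_id, session, w in postings if session == "new"]
    for (old_id, old_w), (new_id, new_w) in product(old, new):
        yield (old_id, new_id), old_w * new_w


def pairwise_scores(docs):
    """docs maps doc_id -> (session, unit-length {term: weight} vector)."""
    grouped = defaultdict(list)  # stands in for the shuffle phase
    for doc_id, (session, vector) in docs.items():
        for term, posting in map_postings(doc_id, session, vector):
            grouped[term].append(posting)
    scores = defaultdict(float)
    for postings in grouped.values():
        for pair, partial in reduce_term(postings):
            scores[pair] += partial  # summed partials = dot product = cosine
    return dict(scores)


# Hypothetical, already-normalized vectors.
docs = {
    "old-1": ("old", {"school": 0.6, "funding": 0.8}),
    "new-1": ("new", {"school": 0.6, "funding": 0.8}),
    "new-2": ("new", {"water": 1.0}),
}
scores = pairwise_scores(docs)
```

Pairs that share no terms never show up in the output at all, which saves a lot of pointless work when most bills have nothing to do with each other.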
The end result was a text file with tens of millions of rows, so we did some sampling and hand-checking to figure out at what level our similarity scores were likely to identify actual matches. With that, we were left with a list of a few hundred bill pairs that actually mattered, which we then verified by hand.
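The sampling and cutoff step might look something like the sketch below: pull a few pairs from each score band, label them by hand, then find the lowest score at which the pairs at or above it are still mostly real matches. The number of bands, the sample size, the 80 percent precision target and the labels are all assumptions for illustration, not the values we used.

```python
import random
from collections import defaultdict


def sample_by_bucket(scored_pairs, n_buckets=10, per_bucket=20, seed=0):
    """Draw up to per_bucket pairs from each score band for hand-checking."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for pair, score in scored_pairs:
        # Scores are assumed to lie in [0, 1]; a score of 1.0 joins the top band.
        band = min(int(score * n_buckets), n_buckets - 1)
        buckets[band].append((pair, score))
    return {
        band: rng.sample(items, min(per_bucket, len(items)))
        for band, items in sorted(buckets.items())
    }


def pick_threshold(labeled, target_precision=0.9):
    """labeled: (score, is_match) tuples from hand-checking.

    Returns the lowest score such that the labeled pairs at or above it
    meet the target precision, or None if no cutoff qualifies.
    """
    best = None
    matches = 0
    ranked = sorted(labeled, reverse=True)
    for seen, (score, is_match) in enumerate(ranked, start=1):
        matches += is_match
        if matches / seen >= target_precision:
            best = score
    return best


# Hypothetical hand-check results: (similarity score, was it a real match?)
labeled = [
    (0.95, True), (0.9, True), (0.8, True),
    (0.7, False), (0.6, True), (0.4, False),
]
cutoff = pick_threshold(labeled, target_precision=0.8)
```

Everything scoring at or above the cutoff then goes on the short list for full verification by a human.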
The key lesson here is about use cases. A large document comparison job like this isn't going to magically turn raw data into a story. What it will do is take a giant problem – 60 million document comparisons – and turn it into a much smaller and more manageable problem that you can deal with using traditional CAR or reporting techniques.
From start to finish, this analysis took about a day. The post got great feedback, and it spotted a lot of bills that otherwise wouldn't have come close to making headlines. We have much more ambitious plans for this technique, but for now we're happy to see it working so well on a small scale.