For decades, the tools used by data scientists have provided the backbone for important stories that effected change.
The Center for Investigative Reporting is now in the midst of its sixth TechRaking conference on data and journalism, and we’re holding this one in Toronto with Google Canada and The Canadian Press.
As we start to explore ways that technologists and journalists can come together to bring about change and better inform the public, let’s review some key historical moments in what’s now known as data journalism.
In 1952, Walter Cronkite reported stories based on an analysis of election results on one of the first mainframe computers. That allowed journalists to call the election for Dwight Eisenhower before the polls closed, despite polling that indicated Illinois Gov. Adlai Stevenson would win.
During the 1967 Detroit race riots, reporter Philip Meyer did something few journalists had done before: He adopted the techniques of social scientists to show the underlying causes of the riots. When the Detroit Free Press won the Pulitzer Prize for local general reporting in 1968, the newspaper cited Meyer’s analysis of survey data. Meyer went on to write “Precision Journalism” (now called “The New Precision Journalism”), considered the bible of computer-assisted reporting.
Despite the difficulty of managing large databases back in those days, more reporters started using them to do stories that no one else could.
In 1989, the Pulitzer Prize for public service was awarded to the Anchorage Daily News for Richard Mauer’s analysis of the deaths of Alaska Natives. Mauer found that an Alaska Native boy had a 250 times greater chance of committing suicide than a boy like him living elsewhere in the United States. After Mauer’s stories ran, suicide prevention programs were launched and the battle against alcohol abuse was strengthened.
More recently, a team of reporters from CIR’s California Watch project used multiple databases to show that state regulators “routinely failed to enforce California’s landmark earthquake safety law for public schools.” As a result, California lawmakers ordered audits and investigations, and the State Allocation Board made it easier for schools to get funding for seismic repairs.
Data has been the foundation for many investigations. It has provided a view of entire populations, rather than anecdotes. But news organizations still could not give readers all the information. Even with large news holes in the 1990s, you could not print a database.
In 1999, the San Jose Mercury News did an investigation based on decades of animal transfer records showing that animals at some of the nation’s top zoos were dumped with dealers or ended up on hunting ranches. The newspaper, among the first to publish online, put up a tool so readers could search by zoo.
Today, spreadsheets and databases are commonplace in newsrooms and on Web sites. Government agencies regularly release information in electronic format. The tools to analyze data and make it available online are cheaper, faster and more powerful.
But challenges remain.
Nearly 20 years after the passage of the Electronic Freedom of Information Act, which, among other things, required agencies to provide records electronically, federal agencies still balk at releasing databases.
In the U.S. and Canada, federal agencies release data in hard-to-process PDF files. If anything, the practice is more common today than it once was.
At the local level, state and provincial laws vary widely. So you might be able to get data about day care centers in Texas easily, but, as CIR recently experienced, getting that data in California is virtually impossible.
Some state open records laws don’t address access to electronic records, so interpretation often is left to individual agencies or courts, if the requester has the resources to sue.
The New Hampshire open records law allows an agency to provide a printout of a public record rather than the underlying computer file. Courts there have been mixed on whether an agency must hand over a database. California’s Public Records Act did not address electronic records until 2001.
In an age in which information means money, some agencies, particularly on the local level, try to put hefty price tags on public data. At the same time, newsroom budgets are shrinking, making it harder to pay for data or fight for access to data in court.
The rush to put information online sometimes means that data is not checked thoroughly or that it is posted without the context users need to make sense of what they’re seeing. Simply regurgitating a government database in its original form may drive click traffic, but it’s our job as journalists to bring meaning to the data. What’s more, most government databases are flawed, and finding those flaws is another part of our job.
When the nonprofit news organization ProPublica obtained data from the U.S. Department of Education’s Office for Civil Rights in 2011 to build its Opportunity Gap project, reporters found significant problems in the data, including schools that reported more teachers than students and schools listing more than 1,000 Advanced Placement courses when fewer than 40 such courses exist.
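The kind of flaw-hunting described above can be automated with simple sanity checks. Here is a minimal Python sketch using invented sample records (the school names and numbers are hypothetical, chosen to mirror the impossibilities ProPublica found, not drawn from the actual data set):

```python
# Hypothetical school records illustrating the kinds of flaws
# ProPublica reported: teachers outnumbering students, and AP course
# counts exceeding the fewer-than-40 courses that actually exist.
records = [
    {"school": "Lincoln High", "students": 1200, "teachers": 60, "ap_courses": 12},
    {"school": "Dataville Academy", "students": 25, "teachers": 40, "ap_courses": 8},
    {"school": "Sunnydale High", "students": 900, "teachers": 45, "ap_courses": 1024},
]

def sanity_check(rec, max_ap_courses=40):
    """Return plain-language flags for one record's implausible values."""
    flags = []
    if rec["teachers"] > rec["students"]:
        flags.append("more teachers than students")
    if rec["ap_courses"] > max_ap_courses:
        flags.append("implausible AP course count")
    return flags

for rec in records:
    for flag in sanity_check(rec):
        print(f'{rec["school"]}: {flag}')
```

Checks like these won’t catch every error, but running them before publication turns “the data is flawed” from a surprise into a routine part of the reporting workflow.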
Traditionally, data was stored in rows and columns, but that’s not always the case today. Data arrives as free-form text, PDFs or even things that don’t begin as data at all. Much work is being done on the data science side to develop tools to manage “big data,” and that work can be a great resource for journalists trying to analyze piles of documents. To deal with some of these issues, journalists must move from spreadsheets and databases to more powerful tools. Involving data scientists in journalism and developing interdisciplinary programs will increase newsrooms’ ability to bring about change – and produce some pretty cool projects in the process.