Monday, September 10, 2007

A True Congressional Record

I received an email last week from Derek Willis @ the Washington Post concerning a post I did last week. He works on their Congressional Votes database and has been kind enough to let me reprint his email. First off here's what he was responding to:

If print media wanted to cement their place in 21st century information streams they would work together to provide such semantic web reporting services. The Washington Post's US Congress Votes database is a start in this direction, even if it's crude and doesn't dive below the surface of what's really going on on Capitol Hill. Maybe I'm just super cynical, but I've always felt that the votes made in Congress are just misdirection in the three-card monte game that is public governance.

And here's Derek's email:

Chris,

Saw your posting on media and the growth of the semantic web, and I'm interested in hearing your ideas about how to improve our votes database (although I don't agree that it is crude and barely scratches the surface, given my experience as a reporter for Congressional Quarterly). If you've got some specific suggestions on ways we could make it a better service, I'm all ears.

Derek
--
Derek Willis
Database Editor
washingtonpost.com

First off, I want to apologize for coming off so snarky in talking about work that I think is really cool and obviously took a lot of effort on their part. That was wrong of me and a sin common online. The Washington Post is clearly leading the way in an area that I think is an extremely significant step in 21st century democracy. However, leading the way doesn't mean that you've completed the journey. In hindsight I could have chosen a much better way to describe something that I think is really innovative from both technology and media perspectives.

For me the interesting aspect of blogging is the interaction between streams of thought, much more so than just writing in a vacuum. As such I'll confess that one of my tactics as a writer/debater is to throw off quick confrontational asides in order to provoke reactions from people. This opens up a much more dynamic discussion of issues than what one would usually see in academic circles. And so, here's why it's crude and doesn't scratch below the surface ;-)

In examining the Washington Post site I'm going to use the two criteria that I think are most significant in terms of evaluating a semantic web application: depth and plasticity.

Depth. How deep is the data? How much does it reflect what's really going on?

For example: I'm playing a gig with my band.

OK... When? Where? What band? Who's in the band? What do they sound like? These would be the minimum data dimensions that I would think needed to be provided in order for people to be able to understand what's going on. Does the online listing for that gig reflect the true depth of what's going on?

Plasticity. How flexible and reusable is the data?

With the above example the basic thing would be the event having some kind of meta-data so that visitors can easily import it into their calendars. Adding hCalendar Microformat annotations to the listing on the page is one way to do this. Then if I had the Operator extension for Firefox I could import the event to my Google Calendar with the click of a button. Automated spiders looking for that kind of meta-data could also add it to their listings, making it easier for other people to find out about my band.

I could go even further by offering a FOAF file for the band, listing contact information and relationships that the group has with other bands. An hCalendar-annotated blog would make it much easier for people to keep up to date with our future gigs.
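To make the hCalendar idea concrete, here is a rough sketch of how a gig listing could be generated with the annotations in place. The class names (vevent, summary, dtstart, location) come from the hCalendar microformat; the function name, the band, the date, and the venue are all made up for illustration.

```python
from html import escape


def hcalendar_event(summary, start_iso, location):
    """Return an HTML snippet for one event, marked up with hCalendar class names.

    The helper itself is hypothetical; only the class names are hCalendar's.
    """
    return (
        '<div class="vevent">'
        f'<span class="summary">{escape(summary)}</span>, '
        # The machine-readable start time goes in the title attribute of an abbr.
        f'<abbr class="dtstart" title="{escape(start_iso)}">{escape(start_iso)}</abbr>, '
        f'<span class="location">{escape(location)}</span>'
        '</div>'
    )


# Invented example data.
snippet = hcalendar_event("Example Band live", "2007-09-15T21:00", "Example Club, Anytown")
print(snippet)
```

A tool that understands the microformat only needs to look for those class names, which is what makes the listing reusable outside the page it was written for.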

The easier it is for tools to export information into other tools, the more plastic it is. Plasticity is what the semantic web is all about.

So let's do the same thing with votes in Congress.

The Washington Post US Congress Votes Database focuses on specific votes. It provides an RSS feed for each member that you can subscribe to, allowing you to find out about the votes as they come in. This is a great way to keep up with what your member of Congress is doing. It lists the name and number for the bill, which way the member voted, and provides links so that you can see more voting information about the bill and even go to the Library of Congress to view the text of the bill.

As much as I love the site and think that it's very innovative and an indicator of things to come, it isn't deep, and it isn't plastic.

How is it not deep? First off there are the actors. From the database's perspective there is only one type of actor involved with the bills in Congress: Congressmen. But we know that by the time a bill goes to the floor a lot of people have had their fingers in the legislative pie. Lobbyists, corporations, constituents, activists, and members of other government entities all play a part in the magical journey of a bill becoming a law. If we want our information to be deep we need to have those actors classified and provide a way to link them back to the legislation.

The other primary dimension to legislation besides actors is impact. Now it's difficult to classify such dimensions as the social impact of legislation, but when you get right down to it, there's only one dimension that really defines what everyone cares about: money. In order to go deep into the impact of legislation you need to be able to easily examine its financial impact. Earmarks... effects on national debt... benefits to corporations... what it is that the money is buying... etc... these things need to be spelled out if people want to really have a deep understanding of what legislation is all about.

Now this is no small undertaking, and I am not criticizing the Washington Post in any way, shape, or form. I am simply saying that if we want to have a real fundamental understanding of what the legislative process is all about we need to be able to examine these dimensions.

So how do we pinpoint who the primary actors are in the legislative process? Right now it's a dark science relegated to the back alleys of our nation's capital. Who benefits... how much does it really cost... how much is it really worth; finding the answers to these questions is difficult even when you ask the people creating the laws, let alone voting on them. More often than not, those who do know would rather that you didn't, seeing this information as a privilege coming from their exalted status. Information is power, and in our age power equals capital.

Until the time comes when we can make truly open government the standard by which our representatives operate, there are several ways that we can look under the hood and deduce who's who and what's what. The biggest clue is what happens in the committees. What are the votes? Who is testifying? What is their role? While all of this information is theoretically available, it is difficult to find and never in one place.

How is it not plastic?
To truly empower citizens with the information they need in order to keep track of what their representatives are doing, all public governmental information needs to be classified in a standard way so that they can analyze the information, relationships and their impact. Beyond that, there needs to be a central, easily annotated record of what's said and done in Congress. We are paying for it, but you need a high-priced staff to be able to figure out what we are really paying for.

One would think that that would be the Congressional Record, but sadly, that is not the case. This brings up one of my biggest complaints against Congress. Their actions are deliberately obfuscated online. You can't just easily link to it. Everything is walled away behind temporary links that you can only get to by filling out forms, and the only way to see what the content really is is through a PDF reproduction of the printed version. Sure, they broadcast the floor debates and a few of the committee hearings, but live video streams are the polar opposite of something that can easily be annotated. Congress makes it as difficult as possible for people to use their words online. (The REST style of architecture is designed as a good solution to these sorts of problems. - Chris)

In order for it to be plastic, it has to be textual. The modern age is a revolution of type. There would be no Reformation or American Revolution without movable type. Being able to easily copy textual ideas is a cornerstone of modern civilization. Yet, the only simple way to get a textual representation of what is said and done in Congress is to use a private service such as LexisNexis. This gives a huge advantage to big money interests over public citizens. Even then it is a crude version, without any sort of cross referencing or annotation.

A true Congressional Record would be one that included all committee activities and was organized in such a way that it could be easily annotated. You should be able to link directly to its sections, and it shouldn't take millions of dollars to figure out what impact the words really have. With this Congressional Record a true political revolution would take place that would go beyond the trivialities of partisan gamesmanship to true democratic empowerment of every American citizen.

Imagine a Congressional Record that was as plastic as the entries in the Wikipedia. There I can see exactly who made changes to an entry. I can create tools to figure out who are the best editors. I can hold people accountable when they try to hide something. Wouldn't it be nice to be able to easily diff between versions of a bill and see who took what parts out?
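As a rough sketch of what diffing a bill could look like once the text is freely available, here is a comparison of two versions using Python's standard difflib module. The bill text, the version labels, and the function name are invented for illustration. Note that a plain diff only shows what changed; showing who changed it would need author information attached to each revision, the way Wikipedia's history has it.

```python
import difflib


def diff_bill_versions(old_text, new_text, old_label="introduced", new_label="amended"):
    """Return unified-diff lines between two plain-text versions of a bill."""
    return list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile=old_label,
            tofile=new_label,
            lineterm="",
        )
    )


# Invented bill text, purely for illustration.
introduced = (
    "Section 1. Funds are appropriated.\n"
    "Section 2. Reports are due annually.\n"
    "Section 3. Records are public."
)
amended = (
    "Section 1. Funds are appropriated.\n"
    "Section 2. Reports are due annually."
)

changes = diff_bill_versions(introduced, amended)
for line in changes:
    print(line)
```

Lines beginning with a minus sign were taken out between the two versions, and lines beginning with a plus sign were added.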

What are some other data dimensions that we can add to our Utopian Congressional record? The most important thing has got to be to show me the money. What is being spent on a specific piece of legislation? Who benefits by it? If they do they should be identified so that we can track their relationship to the legislation. That will include information on industry, company health, and geographic locations. Most of this can be easily garnered from other places.

In order for a data feed to be useful, it needs to be defined. This is my biggest criticism of the current Washington Post database. It uses RSS, but it doesn't use RDF. If I want to parse automatically what's going on I've got to write my own scraper and munge the numbers. The semantic web is all about eliminating the need for that step and providing annotations so that you can more easily find matching patterns of data. With that, any other tool can easily lock onto the data and have fun with it.
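Here is a sketch of the kind of scraping and munging I mean. The feed below is invented, and the title format ("Voted Yes on H.R. 1234: ...") is an assumption made for the example, not the actual layout of the Washington Post feed. The point is that the vote and the bill number have to be fished out of a human-readable string with a regular expression, and that breaks as soon as the wording changes.

```python
import re
import xml.etree.ElementTree as ET

# Invented sample feed; the real feed's structure and wording may differ.
SAMPLE_RSS = """<rss version="2.0"><channel>
<title>Votes by a hypothetical member</title>
<item>
<title>Voted Yes on H.R. 1234: Example Appropriations Act</title>
<link>http://example.org/votes/1</link>
</item>
</channel></rss>"""

# Assumed title layout: "Voted <position> on <bill number>: <bill name>".
TITLE_PATTERN = re.compile(r"Voted (?P<position>\w+) on (?P<bill>.+?): (?P<name>.+)")


def scrape_votes(rss_text):
    """Pull vote records out of item titles by pattern matching."""
    root = ET.fromstring(rss_text)
    votes = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        match = TITLE_PATTERN.match(title)
        if match:  # titles that don't fit the assumed layout are silently skipped
            votes.append(match.groupdict())
    return votes


votes = scrape_votes(SAMPLE_RSS)
print(votes)
```

If the feed carried the position, the bill, and the member as separately labeled values, none of the pattern matching would be needed.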

That means we're going to need an RDF Schema for legislative data. If we had that, then we could turn the whole Internet into a collective tool for analyzing and shaping legislation. Show me which industries in a Congressman's district benefit from legislation. Add census data to the mix so that we can compare money spent in a district to its economic health.
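To give a feel for the kind of question such a schema would let a tool answer, here is a toy sketch. It is not real RDF or RDF Schema, just plain Python tuples standing in for subject-predicate-object statements. Every identifier and predicate name (leg:represents, leg:votedYesOn, leg:benefits, leg:locatedIn) is made up for the example; no such vocabulary exists that I know of.

```python
# Each statement is a (subject, predicate, object) triple. All names are hypothetical.
TRIPLES = {
    ("member:smith", "leg:represents", "district:XX-01"),
    ("member:smith", "leg:votedYesOn", "bill:hr1234"),
    ("bill:hr1234", "leg:benefits", "industry:widgets"),
    ("bill:hr1234", "leg:benefits", "industry:gadgets"),
    ("industry:widgets", "leg:locatedIn", "district:XX-01"),
    ("industry:gadgets", "leg:locatedIn", "district:XX-02"),
}


def objects(subject, predicate):
    """Return every object linked to the subject by the given predicate."""
    return {o for s, p, o in TRIPLES if s == subject and p == predicate}


def local_beneficiaries(member):
    """Industries in the member's own district that benefit from bills the member supported."""
    districts = objects(member, "leg:represents")
    result = set()
    for bill in objects(member, "leg:votedYesOn"):
        for industry in objects(bill, "leg:benefits"):
            # Keep the industry only if it sits in one of the member's districts.
            if objects(industry, "leg:locatedIn") & districts:
                result.add(industry)
    return result


print(local_beneficiaries("member:smith"))
```

Once the statements share a vocabulary, anyone can add their own (census figures, campaign contributions) and the same kind of query keeps working across all of them.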

I could see this as being a big part of the future of media. If entities got together to work out a standard for reporting on legislation, they could collectively use that to bring value to their products. Creating hooks into data is easy. Annotating what it means is harder. That's the real value in today's media. That's what big companies are paying a lot of money for.

I could see a large part of a 21st century reporter's job being to connect the semantic dots of the legislative process. They could even add their readers and others online into the mix, becoming a collective swarm analysis tool. Things like that are already happening online, but just like the Washington Post site, our work is crude and we are barely scratching the surface. Still, it is amazing.

2 comments:

Derek said...

Chris,

Thanks for detailing your thoughts here. A couple of quick reactions:

First, on depth. Given the current state of disclosure of even the two areas required by law - lobbying and campaign finance - it would be very difficult to assemble a timely picture of everything happening with a piece of legislation. Lobbying activity is reported twice a year - in February and again in August - covering the previous six months. Lobbyists can be specific in regards to the bills and issues they're working on, but many are not. The reporting requirements for foreign lobbyists are better, but they only recently began posting reports online. The Senate Office of Public Records posts lobbying forms, but not in what many would call a timely manner. Campaign finance data also has a time-lag element, although not nearly as much as lobbying information. However, you're right in suggesting that more could be done in regards to information posted on interest group websites.

Second, on plasticity. It's true that the Congressional Record posted on Thomas is not a terribly Web-friendly piece of information. But a few very talented folks like Josh Tauberer have essentially reverse-engineered it for GovTrack.us and you'll see more on that front from a variety of people. Frankly, the best way to change the nature of the data produced by Congress is to convince elected representatives that they should do so. House officials are working on XML (though not RDF) standards for legislative information, while the Senate has so far only experimented in this area.

Derek Willis
Database Editor
washingtonpost.com

Chris Baker said...

Derek,

First of all, thanks for commenting.

Govtrack.us is certainly the gold standard of this subject matter.

I guess the main thing I'm wondering about is how we can add a semantic web layer on top of all this so that we can automate and decentralize the information.