The Semantic Caucus: September 2007

Tuesday, September 18, 2007

On The Radar

Open Congress > "Congress, I'm Watching" Widget
Cute little blog widget that lets you show what bills you're monitoring. I like the direction it's going in.

The Fix > Dems Seek Upper Hand With Atlas Project
Boy, I'd love to get a gander at this baby!

Tuesday, September 11, 2007

The wonderful Open House project has a post up of a video presentation by open government pioneer Carl Malamud on his Washington Bridge project given at a Google Tech Talk on May 24, 2006.

There's also a web page for the talk that includes a business plan for the venture. The idea was to provide streaming archived video of Congressional committee hearings.

I haven't been able to find anything recent about the proposed plan. Since then, however, he has sent an unsolicited report to the Speaker of the House on streaming committee data, and on August 3rd he provided an update that does look promising:

... The analysis of this short-term solution by the Advanced Business Solutions unit has concluded: “from a technical stand point we now know this is very easy and inexpensive to do.”

3. As a long-term strategy, the Office of the Speaker has conducted a large number of meetings, as has the Committee on House Administration, the Chief Administrative Officer, and several other groups. There is a concrete, funded set of initiatives to finish the wiring of the rooms so that all hearings have video coverage, and it is clear from a technical point of view that it is possible to achieve the goal of broadcast-quality video for download on the Internet by the end of the 110th congress. The recommendation to adopt that goal is currently awaiting action from the Office of the Speaker and the Chairman of the Committee on House Administration.

The thing that's really interesting about the video is how the Google people (25' in) latch onto the problem that fascinates me the most: annotating real time data streams. There aren't enough hours in the day for me to listen to my collection of old time radio shows let alone watch everything that is going on in Congress. How do I make it easy for the public to collectively annotate information in real time so that it can be easily processed and reused?

It's hard as hell to annotate video, and unless video is annotated it's hard as hell to leverage it with automated tools. You need to be able to make a media clip start at a specific moment rather than playing the whole thing. You need to be able to easily sync up the video and transcripts using some sort of marker. I'd like to be able to drag a clip onto my blog editor and have it automatically embed the clip at that point as well as add default folksonomy tags. Beyond that, I'd like synchronized transcripts so that I can do quick quotes. As he points out, the transcripts are out there, but they take months to be made available to the public, if they ever are. How do we synchronize video and audio to transcripts so that people can easily quote from them?
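One way to picture the "marker" idea is a transcript stored as timestamped cues, so that a moment in the video maps to a quotable line and a quotable line maps back to a start time. This is only a sketch under assumptions: the `Cue` structure, the made-up cue text, and the `t` query parameter for starting playback at an offset are all hypothetical, not something any particular player or Congressional feed is known to support.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class Cue:
    start: float  # seconds from the start of the recording
    speaker: str
    text: str


def cue_at(cues, seconds):
    """Return the cue being spoken at `seconds`, or None before the first cue.

    Assumes `cues` is sorted by start time.
    """
    starts = [c.start for c in cues]
    i = bisect_right(starts, seconds) - 1
    return cues[i] if i >= 0 else None


def clip_link(base_url, cue):
    """Build a link meant to start playback at the cue.

    The `t` parameter is a hypothetical convention, not a real player API.
    """
    return f"{base_url}?t={int(cue.start)}"


# Made-up cues for illustration only.
cues = [
    Cue(0.0, "Chair", "The committee will come to order."),
    Cue(12.5, "Witness", "Thank you for the invitation."),
    Cue(47.0, "Chair", "We will now hear questions."),
]

cue = cue_at(cues, 30.0)
print(cue.speaker, "-", cue.text)
print(clip_link("http://example.org/hearing.mp4", cue))
```

With cues like these, a quick quote and a clip that starts at the right moment come from the same record, which is the kind of thing a blog editor could embed automatically.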

We need tags that work against a common ontology so that everyone is working off of the same page. When I say HR1515, am I talking about Representative Harris Fawell's amendment to the Balanced Budget Act of 1997, the 110th Congress's bill to amend the Housing and Community Development Act of 1974, or the Georgia General Assembly's bill honoring the life of Linton Webster Eberhardt, Jr.?
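The HR1515 collision can be sketched as a tag that carries its jurisdiction and session along with the bill number. The `BillTag` fields and the `us/110/hr1515` layout below are assumptions made up for illustration, not an existing standard, and the Georgia session is left as a placeholder because the post doesn't give it.

```python
from typing import NamedTuple


class BillTag(NamedTuple):
    jurisdiction: str  # hypothetical codes, e.g. "us" or "us-ga"
    session: str       # e.g. "110" for the 110th Congress
    number: str        # normalized bill number, e.g. "HR1515"

    def tag(self):
        """Return a fully qualified tag string (made-up layout)."""
        return f"{self.jurisdiction}/{self.session}/{self.number}".lower()


def candidates(bare_tag, known_bills):
    """Return every known bill that a bare tag like 'H.R. 1515' could mean."""
    wanted = bare_tag.replace(" ", "").replace(".", "").upper()
    return [b for b in known_bills if b.number == wanted]


known_bills = [
    BillTag("us", "110", "HR1515"),                 # session stated in the post
    BillTag("us-ga", "session-unknown", "HR1515"),  # placeholder: session not given
]

for bill in candidates("H.R. 1515", known_bills):
    print(bill.tag())
```

The bare tag matches more than one bill, while the qualified tags stay distinct, so a tool that fills in the qualified form spares the writer from having to think about it.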

Ideally, websites and tools should take care of those headaches for people. For instance, if I drag a bill in Thomas onto my blog editor, it knows that that is the one I am writing about, grabs the embedded meta-data from the page, and adds it as a folksonomy tag.

Monday, September 10, 2007

A True Congressional Record

I received an email last week from Derek Willis @ the Washington Post concerning a post I did last week. He works on their Congressional Votes database and has been kind enough to let me reprint his email. First off here's what he was responding to:

If print media wanted to cement their place in 21st century information streams they would work together to provide such semantic web reporting services. The Washington Post's US Congress Votes database is a start in this direction, even if it's crude and doesn't dive below the surface of what's really going on on Capitol Hill. Maybe I'm just super cynical, but I've always felt that the votes made in Congress are just misdirection in the three-card monte game that is public governance.

And here's Derek's email:

Chris,

Saw your posting on media and the growth of the semantic web, and I'm interested in hearing your ideas about how to improve our votes database (although I don't agree that it is crude and barely scratches the surface, given my experience as a reporter for Congressional Quarterly). If you've got some specific suggestions on ways we could make it a better service, I'm all ears.

Derek
--
Derek Willis
Database Editor
washingtonpost.com

First off, I want to apologize for coming off so snarky in talking about work that I think is really cool and obviously took a lot of effort on their part. That was wrong of me and a sin common online. The Washington Post is clearly leading the way in an area that I think is an extremely significant step in 21st century democracy. However, leading the way doesn't mean that you've completed the journey. In hindsight I could have chosen a much better way to describe something that I think is really innovative from both technology and media perspectives.

For me the interesting aspect of blogging is the interaction between streams of thought, much more so than just writing in a vacuum. As such, I'll confess that one of my tactics as a writer/debater is to throw off quick confrontational asides in order to provoke reactions from people. This opens up a much more dynamic discussion of issues than what one would usually see in academic circles. And so, here's why it's crude and doesn't scratch below the surface ;-)

In examining the Washington Post site I'm going to use the two criteria that I think are most significant in terms of evaluating a semantic web application: depth and plasticity.

Depth. How deep is the data? How much does it reflect what's really going on?

For example: I'm playing a gig with my band.

OK... When? Where? What band? Who's in the band? What do they sound like? These would be the minimum data dimensions that I would think needed to be provided in order for people to be able to understand what's going on. Does the online listing for that gig reflect the true depth of what's going on?

Plasticity. How flexible and reusable is the data?

With the above example the basic thing would be the event having some kind of meta-data so that visitors can easily import it into their calendars. Adding hCalendar microformat annotations to the listing on the page is one way to do this. Then if I had the Operator extension for Firefox I could import the event to my Google Calendar with the click of a button. Automated spiders looking for that kind of meta-data could also add it to their listings, making it easier for other people to find out about my band.
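As a rough illustration, an hCalendar listing is ordinary HTML with agreed-upon class names (`vevent`, `summary`, `dtstart`, `location`) wrapped around the when, where, and what. The helper below is a minimal sketch; the function name and the gig details are made up for the example.

```python
from datetime import datetime
from html import escape


def hcalendar_event(summary, start, location):
    """Render one event as an hCalendar-annotated HTML snippet."""
    iso = start.strftime("%Y-%m-%dT%H:%M")        # machine-readable start time
    human = start.strftime("%B %d, %Y %I:%M %p")  # what the visitor reads
    return (
        '<div class="vevent">\n'
        f'  <span class="summary">{escape(summary)}</span>\n'
        f'  <abbr class="dtstart" title="{iso}">{human}</abbr>\n'
        f'  <span class="location">{escape(location)}</span>\n'
        '</div>'
    )


# Made-up gig details for illustration.
snippet = hcalendar_event(
    "Bob & Friends live",
    datetime(2007, 9, 22, 21, 0),
    "The Corner Bar",
)
print(snippet)
```

The page looks the same to a human visitor, but a tool that knows the class names can pull the date, place, and title out without guessing.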

I could go even further by offering a FOAF file for the band, listing contact information and relationships that the group has with other bands. An hCalendar-annotated blog would make it much easier for people to keep up to date with our future gigs.

The easier it is for tools to export information into other tools, the more plastic it is. Plasticity is what the semantic web is all about.

So let's do the same thing with votes in Congress.

The Washington Post US Congress Votes Database focuses on specific votes. It provides an RSS feed for each member that you can subscribe to, letting you find out about the votes as they come in. This is a great way to keep up with what your member of Congress is doing. It lists the name and number for the bill, which way the member voted, and provides links so that you can see more voting information about the bill and even go to the Library of Congress to view the text of the bill.

As much as I love the site and think that it's very innovative and an indicator of things to come, it isn't deep, and it isn't plastic.

How is it not deep? First off, there are the actors. From the database's perspective there is only one type of actor involved with the bills in Congress: Congressmen. But we know that by the time a bill goes to the floor a lot of people have had their thumbs in the legislative pie. Lobbyists, corporations, constituents, activists, and members of other government entities all play a part in the magical journey of a bill becoming a law. If we want our information to be deep we need to have those actors classified and provide a way to link them back to the legislation.

The other primary dimension to legislation besides actors is impact. Now it's difficult to classify such dimensions as the social impact of legislation, but when you get right down to it, there's only one dimension that really defines what everyone cares about: money. In order to go deep into the impact of legislation you need to be able to easily examine its financial impact. Earmarks... effects on national debt... benefits to corporations... what it is that the money is buying... etc... these things need to be spelled out if people want to really have a deep understanding of what legislation is all about.

Now this is no small undertaking, and I am not criticizing the Washington Post in any way, shape, or form. I am simply saying that if we want to have a real fundamental understanding of what the legislative process is all about we need to be able to examine these dimensions.

So how do we pinpoint who the primary actors are in the legislative process? Right now it's a dark science relegated to the back alleys of our nation's capital. Who benefits... how much does it really cost... how much is it really worth; finding the answers to these questions is difficult even when you ask the people creating the laws, let alone those voting on them. More often than not, those who do know would rather that you didn't, seeing this information as a privilege coming from their exalted status. Information is power, and in our age power equals capital.

Until the time comes when we can make truly open government the standard by which our representatives operate, there are several ways that we can look under the hood and deduce who's who and what's what. The biggest clue is what happens in the committees. What are the votes? Who is testifying? What is their role? While all of this information is theoretically available, it is difficult to find and never in one place.

How is it not plastic? To truly empower citizens with the information they need in order to keep track of what their representatives are doing, all public governmental information needs to be classified in a standard way so that they can analyze the information, its relationships, and their impact. Beyond that, there needs to be a central, easily annotated record of what's said and done in Congress. We are paying for it, but you need a high-priced staff to be able to figure out what we are really paying for.

One would think that that would be the Congressional Record, but sadly, that is not the case. This brings up one of my biggest complaints against Congress. Their actions are deliberately obfuscated online. You can't just easily link to it. Everything is walled away behind temporary links that you can only get to by filling out forms, and the only way to see what the content really is is through a PDF reproduction of the printed version. Sure, they broadcast the floor debates and a few of the committee hearings, but live video streams are the polar opposite of something that can easily be annotated. Congress makes it as difficult as possible for people to use their words online. (The REST style of architecture is designed as a good solution to these sorts of problems. - Chris)

In order for it to be plastic, it has to be textual. The modern age is a revolution of type. There would be no Reformation or American Revolution without movable type. Being able to easily copy textual ideas is a cornerstone of modern civilization. Yet the only simple way to get a textual representation of what is said and done in Congress is to use a private service such as LexisNexis. This gives a huge advantage to big money interests over public citizens. Even then it is a crude version, without any sort of cross-referencing or annotation.

A true Congressional Record would be one that included all committee activities and was organized in such a way that it could be easily annotated. You should be able to link directly to its sections, and it shouldn't take millions of dollars to figure out what impact the words really have. With this Congressional Record a true political revolution would take place that would go beyond the trivialities of partisan gamesmanship to true democratic empowerment of every American citizen.

Imagine a Congressional Record that was as plastic as the entries in the Wikipedia. There I can see exactly who made changes to an entry. I can create tools to figure out who are the best editors. I can hold people accountable when they try to hide something. Wouldn't it be nice to be able to easily diff between versions of a bill and see who took what parts out?
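If bill text were published as plain text, the "what changed" half of that question is already easy. Here is a minimal sketch using Python's standard `difflib`; the two bill versions are invented for illustration, and the "who" half would still need revision metadata that a diff alone cannot supply.

```python
import difflib


def bill_diff(old_text, new_text, old_label="introduced", new_label="amended"):
    """Return unified-diff lines between two versions of a bill's text."""
    return list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile=old_label,
            tofile=new_label,
            lineterm="",
        )
    )


# Invented bill text for illustration only.
introduced = "\n".join([
    "Sec. 1. Short title.",
    "Sec. 2. Funds are limited to $1,000,000.",
    "Sec. 3. Effective date.",
])
amended = "\n".join([
    "Sec. 1. Short title.",
    "Sec. 3. Effective date.",
])

for line in bill_diff(introduced, amended):
    print(line)
```

Lines that were taken out show up prefixed with `-` and lines that were added with `+`, which is the same view wiki history pages give for article edits.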

What are some other data dimensions that we can add to our Utopian Congressional record? The most important thing has got to be to show me the money. What is being spent on a specific piece of legislation? Who benefits by it? If they do they should be identified so that we can track their relationship to the legislation. That will include information on industry, company health, and geographic locations. Most of this can be easily garnered from other places.

In order for a data feed to be useful, it needs to be defined. This is my biggest criticism of the current Washington Post database. It uses RSS, but it doesn't use RDF. If I want to automatically parse what's going on, I've got to write my own scraper and munge the numbers. The semantic web is all about eliminating the need for that step and providing annotations so that you can more easily find matching patterns of data. With that, any other tool can easily lock onto the data and have fun with it.

That means we're going to need an RDF Schema for legislative data. If we had that, then we could turn the whole Internet into a collective tool for analyzing and shaping legislation. Show me which industries in a Congressman's district benefit from legislation. Add census data to the mix so that we can compare money spent in a district to its economic health.
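To make that a little more concrete, here is a sketch of one vote expressed as RDF triples in N-Triples form, built with nothing but string formatting. The `http://example.org/legis#` vocabulary and all of the example URIs are hypothetical stand-ins for a schema that doesn't exist yet; only the `rdf:type` URI is a real one.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
LEGIS = "http://example.org/legis#"  # hypothetical vocabulary, not a real schema


def uri(value):
    """Format a URI as an N-Triples term."""
    return f"<{value}>"


def literal(value):
    """Format a plain string literal, escaping backslashes and quotes."""
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def vote_triples(vote_uri, member_uri, bill_uri, position):
    """Describe one recorded vote as a list of N-Triples lines."""
    subject = uri(vote_uri)
    return [
        f"{subject} {uri(RDF_TYPE)} {uri(LEGIS + 'Vote')} .",
        f"{subject} {uri(LEGIS + 'member')} {uri(member_uri)} .",
        f"{subject} {uri(LEGIS + 'bill')} {uri(bill_uri)} .",
        f"{subject} {uri(LEGIS + 'position')} {literal(position)} .",
    ]


# Made-up identifiers for illustration.
triples = vote_triples(
    "http://example.org/vote/1",
    "http://example.org/member/example-rep",
    "http://example.org/bill/us/110/hr1515",
    "Yea",
)
print("\n".join(triples))
```

Once votes, members, and bills are all named by URI like this, joining them against district or census data published the same way becomes a query rather than a scraping job.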

I could see this as being a big part of the future of media. If entities got together to work out a standard for reporting on legislation, they could collectively use that to bring value to their products. Creating hooks into data is easy. Annotating what it means is harder. That's the real value in today's media. That's what big companies are paying a lot of money for.

I could see a large part of a 21st century reporter's job being to connect the semantic dots of the legislative process. They could even add their readers and others online into the mix, becoming a collective swarm analysis tool. Things like that already are happening online, but just like the Washington Post site, our work is crude and we are barely scratching the surface. Still, it is amazing.

Saturday, September 8, 2007

Bonk

I came up with a name for the project I'm working on about a year ago. I thought it was such a kewl name that I didn't want to say it anywhere for fear that it would get hijacked. (Paranoia is strong in my gene pool.) The basic idea is an open-source embeddable distributed engine for building and binding Folksonomy and other meta-data annotated content applications. Rather than building yet another awesome framework for sharing information about your bottle cap collection with other enthusiasts, this would be something that does all of the back end stuff that these Web 2.0 frameworks do over and over again, plus add some cool extra stuff.

So I bought all the core domains but never used them. .COM, .NET and .ORG. Every now and then I'd google the word just to make sure that everything was quiet. For a year it was.

Then last week BONK, lo and behold, up come a few hits. But instead of coming from the programming space it was coming from the Danish folk music space. Makes sense. I could see why someone would name a band that.

Anyway I'm going to start officially calling it what I call it. I guess I'll have to TM it at the end just to be safe. That gives me six months to cross my eyes and dot my tees. Luckily I've already got someone who's willing to let me use their business as a guinea pig.

So without further ado here's my new favorite band Folk Engine that also has the same name as the business/project I'm working on called FolkEngine(tm). It's the main reason (besides the boy) that I won't be blogging much these days.

This is a big load off my mind. I can finally start using the tag FolkEngine instead of code terms. :-) Now all I have to do is get it done before I drive everyone I know mad.

Monday, September 3, 2007

Baselines

The first moment I looked at the RDF specification I had a problem with it: everything is URI based. If you want to annotate a thing, you have to refer to its URI.

What I'm working on completely separates content from a specific location. A link to a text document, a thumbnail image linking to a text document, a summary of a text document, and the text document are all facets of the same thing, and I can place those things in many places.

For instance, take a technical document with many sections. Now I can display those sections all on one web page for easy printing, or I can separate them out into many pages for easier reading. Also, I can place the document on many servers or offer a PDF version of the document. Which version is "the" document? Which URL do I point to? What if I make a new version of the document and I don't want to remove the old one?

In P2P frameworks content doesn't exist in any one place. It floats. I can grab it from many places. That makes the content extremely plastic. I want my content to have the same flexible characteristics.

This brings out what I consider to be the key weakness of what the W3C does. They create standards that define how the web works. The problem is that they do everything purely from the context of the web. Systems that store, display and annotate data don't just exist in a web context. I create a document on my computer that I want to place on the web. It has a local networked path and a system path. How do I annotate it with RDF if I haven't given it a home yet? What if after I place it in one place we decide to move it? Binding myself to URIs makes my data brittle when I want it flexible.

This also dovetails into another problem that is inherent in any framework: how do you identify people? I have many email addresses that change every now and then. I move from place to place and use different variations of my name depending upon context. What defines me digitally? Right now nothing.

A Friend of a Friend (FOAF) file could be said to be such a thing, but it has information that changes over time. Also, my FOAF from work would be totally different from one that I exchange with my old drinking buddies.

I think conceptually I've got the solution: baselines.

A baseline is a collection of fields that define unchanging aspects of a thing.

This makes defining who I am digitally rather simple.

<baseline type="person">
<name>Robert Bob</name>
<mother>Momma Bob</mother>
<father>Billy Bo Bob</father>
<birthday>1-1</birthday>
<birthplace>The Moon</birthplace>
</baseline>

This creates a SHA-1 hash of f54634c2c982500c67d254d2afa44c618104bfee

Now I've got an identifier that defines me without containing any information that I'd consider private. I can do the same thing for any content by creating fields that define it. Description, creation date, created by etc...

With a baseline I have a key to an object that isn't bound by its specific content, context, or location.
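Here is a minimal sketch of how such a key might be computed. The post doesn't say exactly which bytes get hashed, so the canonical form below (sorted field names, collapsed whitespace, one `name=value` pair per line) is an assumption, and `baseline_id` is a hypothetical helper; it is not expected to reproduce the hash shown above.

```python
import hashlib


def baseline_id(kind, fields):
    """Return a SHA-1 hex digest for a baseline.

    Assumed canonical form: a type line followed by the fields sorted by
    name, with runs of whitespace in each value collapsed to one space.
    """
    parts = [f"type={kind}"]
    for key in sorted(fields):
        value = " ".join(str(fields[key]).split())
        parts.append(f"{key.lower()}={value}")
    canonical = "\n".join(parts)
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


person = {
    "name": "Robert Bob",
    "mother": "Momma Bob",
    "father": "Billy Bo Bob",
    "birthday": "1-1",
    "birthplace": "The Moon",
}

print(baseline_id("person", person))
```

Because the fields are sorted and normalized before hashing, two tools that hold the same baseline fields in a different order or with stray whitespace arrive at the same 40-character key, wherever the content itself happens to live.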

Sunday, September 2, 2007

The Blogosphere Revolution is a Semantic Web Revolution

I'm sure I'm not saying anything new but I wanted to get this down.

I've been separating out the concept of content syndication from the semantic web, but that is a mistake. Weblog content syndication, which is the technological force that has powered the current wave of online Progressive activism, is the original semantic web application. An RSS file is a collection of metadata that points to people's posts. That metadata is combined with tools allowing people to easily discover and distribute information.
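To see how little it takes for a tool to discover posts from that metadata, here is a sketch that pulls titles, links, and dates out of an RSS 2.0 document with Python's standard library. The feed content is invented for illustration, and `feed_items` is just a name chosen for this example.

```python
import xml.etree.ElementTree as ET

# A tiny made-up RSS 2.0 feed.
FEED = """<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <link>http://example.org/</link>
    <item>
      <title>First post</title>
      <link>http://example.org/first</link>
      <pubDate>Sat, 01 Sep 2007 12:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.org/second</link>
      <pubDate>Sun, 02 Sep 2007 12:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def feed_items(rss_text):
    """Return the metadata an RSS feed carries about each post."""
    root = ET.fromstring(rss_text)
    return [
        {
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "published": item.findtext("pubDate", default=""),
        }
        for item in root.iter("item")
    ]


for post in feed_items(FEED):
    print(post["published"], post["title"], post["link"])
```

Every aggregator and blog search engine does some version of this loop; the posts themselves never have to move for the information about them to spread.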

While the "blogs" get all the credit for what's been going on, it wouldn't be anything without the semantic web as the conceptual distribution mechanism. Take away the semantic web and all you'd be left with is bulletin boards, which we've had for a long time.

The semantic web as a concept has taken a lot of hits for being unworkable. What's already been done online with political activism is proof positive that it isn't.

In the end the real difficulty is with making the RDF specification something that can easily interact with content in various formats. The solution by makers of blog software has been to not use RDF and create specs that focus entirely on the 1 dimensional streams that are blogs.

The future has got to be with breaking away from this 1-D prison.

Saturday, September 1, 2007

Blogs and the Media

The Daily Bellwether has an interesting thread up reflecting on Jill Miller Zimon's post concerning relations between bloggers and the print media. This relates to rumors of the PD planning on hiring several bloggers including Jill.

It's going to be interesting watching how old media handles the growth of the semantic web. So far they've been mainly reactionary, which is always a dangerous sign. Both the New York Times and the Washington Post have been adapting in interesting ways.

I'm thinking that eventually we'll see things adjust to a multi tier approach... blog feeds giving up to the minute information and old world print existing to provide overviews. The problem with adding more and more information into a system is that it gets harder and harder to process that information. Old media could do a much better job helping people with that. The problem is that they'd first have to understand what's actually going on, which means dropping a lot of their 20th century definitions for what news is.

One of my favorite examples of this is the Dayton Daily News' Get on the Bus blog that focuses on education. They haven't done a very good job of promoting it, but it does provide a lot of valuable up to the minute information.

One of the key steps of leveraging the semantic web in order to take on corruption in areas such as public education will be in creating standards for reporting public budgets at all levels of government. I can think of few things more important. Follow the money.

If print media wanted to cement their place in 21st century information streams they would work together to provide such semantic web reporting services. The Washington Post's US Congress Votes database is a start in this direction, even if it's crude and doesn't dive below the surface of what's really going on on Capitol Hill. Maybe I'm just super cynical, but I've always felt that the votes made in Congress are just misdirection in the three-card monte game that is public governance.
