Switchboard Update

As part of my master's thesis at Georgia Tech, I created a library for Processing called Switchboard. The goal was to “create a conceptual level interface to as many web and network related services as possible.” On a more fundamental level, I wanted to let people use live data sources to create art as easily as possible: to take data from one live source, combine it with another, and then three more, and make some sort of deduction about the people who put that data there in the first place, namely everyone in the world (I acknowledge the grossly inaccurate implication that everyone in the world creates content on the web). Grabbing this data in real time (as the work is being viewed) allows authors to create work that transcends cultural and historical setting, because the data sources driving the work reflect the trends at the time and place where the user views it. In other words, a painting of the Backstreet Boys is going to mean something completely different to someone in China, or to someone 20 years from now. So instead of using “The Backstreet Boys”, we can abstract out the idea of “famous pop band” and create the work dynamically, based on the time and place where it is viewed. (This idea is illustrated in If Warhol Had Switchboard.)

An ideal implementation of Switchboard would have generic functionality. There would be a websearch function to find webpages, a photos function to fetch photos, a products function, and perhaps headlines, people, etc. This way, it wouldn’t even be linked to any one service, but would rather be one of several possible implementations of a toolkit with this type of functionality. Unfortunately, I am not there yet. Unfortunately, all I have is a pile of barely-working code and a lot of people yelling at me to make something better.
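To make the "generic functionality" idea concrete, here is a rough sketch of what such a service-independent interface might look like. Every name in it (LiveDataSource, CannedSource, the method signatures) is hypothetical and invented for illustration; none of it is taken from the actual Switchboard code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical interface: a sketch of the "generic functionality" idea,
// not the real Switchboard API.
interface LiveDataSource {
    List<String> websearch(String query); // identifiers of matching web pages
    List<String> photos(String query);    // identifiers of matching photos
    List<String> products(String query);  // identifiers of matching products
}

// A canned stand-in for a real service. A real implementation would call a
// search engine, a photo site, a store, and so on. Callers only ever see
// the interface, so one provider can be swapped for another.
final class CannedSource implements LiveDataSource {
    public List<String> websearch(String query) { return result("page", query); }
    public List<String> photos(String query)    { return result("photo", query); }
    public List<String> products(String query)  { return result("product", query); }

    private static List<String> result(String kind, String query) {
        List<String> out = new ArrayList<String>();
        out.add(kind + ":" + query);
        return out;
    }
}
```

A sketch written against LiveDataSource could ask for photos("famous pop band") without knowing or caring which service answers.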

So today, in addition to talking to nearly every broker in Brooklyn, I have been working on a newer, leaner version of Switchboard, and the stupidest problem is driving me to drink. You see, one of the fundamental elements of this new Switchboard is a kick-ass Browser class, built on top of the Apache HttpClient. You can use it in an application and tell it to open web pages, submit forms, accept cookies, and do all of the normal stuff that browsers can do, but it all happens behind the scenes. It has no renderer, so you can’t actually see the pages, but you can use XPath to scrape data out of the source code of the pages. For the uninitiated, this means you can easily ask the browser to do things like “get the path to every image”, or “get all text within paragraph tags”, or whatever. In the old version, I also had Lucene support for fuzzy text matching, but I’m not sure if that is going to remain, because the library is too big. Since many of the services that are part of Switchboard are either REST or SOAP or screen-scraped, this Browser class is essential to the whole toolkit.
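For a sense of what “get the path to every image” looks like in practice, here is a minimal sketch using only the XPath support built into Java 5 and later. It is not the Browser class itself: the ImageScraper name and the sample page are made up for illustration, and it assumes the markup is already well-formed.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Switchboard.
final class ImageScraper {
    // Returns the src attribute of every <img> element, in document order.
    // Assumes the input is well-formed XML; anything else is rejected.
    static List<String> imagePaths(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                    "//img/@src", doc, XPathConstants.NODESET);
            List<String> paths = new ArrayList<String>();
            for (int i = 0; i < nodes.getLength(); i++) {
                paths.add(nodes.item(i).getNodeValue());
            }
            return paths;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse or query page", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><p>hi</p>"
                + "<img src=\"a.png\"/><img src=\"b.jpg\"/></body></html>";
        System.out.println(imagePaths(page)); // prints [a.png, b.jpg]
    }
}
```

Swapping the expression for "//p//text()" would pull the text inside paragraph tags instead.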

But now I have a problem. And as every good delusional schizophrenic will tell you, the problem isn’t with me, it’s with the world. To use XPath, the first thing you need is a valid XML document, and the fact is, most HTML out there isn’t even close. Probably 99% of the people out there who put HTML onto the web in one form or another have no idea what W3C specifications are, much less follow them. There are two popular solutions to this problem. The most popular is called Tidy (JTidy is its Java port), and another Java solution is called TagSoup. The old version of Switchboard was cobbled together quickly and somewhat carelessly, so I believe that it uses both. This was necessary because there are some formatting issues that JTidy refuses to solve. However, for this version, I hope to use only one or the other. But each has problems.
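To see why real-world HTML is a problem here, the sketch below feeds a few snippets to the stock Java XML parser and reports whether each one parses. The WellFormedCheck class is a hypothetical helper written for this post, not part of Tidy, TagSoup, or Switchboard.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration only.
final class WellFormedCheck {
    // True if the markup parses as XML, false if the parser rejects it.
    static boolean isWellFormed(String markup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            return false; // the parser rejected the markup
        } catch (IOException e) {
            throw new IllegalStateException(e);
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<p>one<br/>two</p>")); // prints true
        System.out.println(isWellFormed("<p>one<br>two</p>"));  // prints false
    }
}
```

The second snippet is perfectly ordinary HTML, but the unclosed br tag is enough to make the XML parser give up, which is why a cleanup pass has to come before any XPath query.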

The problem with Tidy that I have been wrestling with all evening is the “missing quotemark for attribute value” problem. This problem occurs when Tidy tries to process an HTML page where (surprise surprise) a tag attribute is missing a quote mark. For instance, in a well-formed XML document, a tag looks like this:

<a href="http://www.jeffcrouse.info">link to my site!</a>

But when Tidy encounters this:

<a href="http://www.jeffcrouse.info>link to my site!</a>

It freaks out and quits. See the difference? There’s a quote missing after the “.info”. This seems small, but in fact it is a serious problem because the syntax fixer doesn’t know where to put the end quote. Should the entire remaining portion of the page be inside that quote? Or just up to the “>”? What if a scripting language threw an error in that particular attribute, and spewed out a bunch of HTML, so there are actually several lines of markup inside that attribute? How do you know which “>” is the right one? As you can see, it’s all very frustrating. The approach I chose was to cycle through every single tag (defined as anything that starts with < and ends with >) in the document and count the number of quotes inside. If it is an odd number, then I put a quote somewhere inside and hope that Tidy is cool enough to figure out the rest. There will definitely be some data loss there, but at least we will have a valid document. But so far I haven’t been able to come up with the right regular expressions and it is all tears and sadness.
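One possible shape for that quote-counting pass is sketched below. It is an illustration of the idea, not working Switchboard code: the QuoteBalancer name is made up, and it makes several simplifying assumptions. It only counts double quotes, it treats a tag as everything from a < to the next >, and when the count is odd it puts the extra quote right before the closing > (or />). A > inside an attribute value, or a URL that really ends in a slash, will fool it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical pre-processing step to run before handing a page to Tidy.
final class QuoteBalancer {
    // A "tag" is anything that starts with '<' and ends at the next '>'.
    private static final Pattern TAG = Pattern.compile("<[^<>]*>");

    static String balanceQuotes(String html) {
        Matcher m = TAG.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String tag = m.group();
            int quotes = 0;
            for (int i = 0; i < tag.length(); i++) {
                if (tag.charAt(i) == '"') {
                    quotes++;
                }
            }
            if (quotes % 2 != 0) {
                // Odd count: guess that the quote belongs just before the
                // closing '>' (or '/>' for a self-closing tag).
                int end = tag.endsWith("/>") ? tag.length() - 2 : tag.length() - 1;
                tag = tag.substring(0, end) + "\"" + tag.substring(end);
            }
            // quoteReplacement keeps '$' and '\' in the tag from being
            // interpreted as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(tag));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String broken = "<a href=\"http://www.jeffcrouse.info>link to my site!</a>";
        System.out.println(balanceQuotes(broken));
        // prints <a href="http://www.jeffcrouse.info">link to my site!</a>
    }
}
```

Tags that already have an even number of quotes pass through untouched, so the pass is safe to run on pages that were fine to begin with.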

Of course, I could always go over to TagSoup, which explicitly states that it is not a replacement for Tidy because it “does not convert presentation HTML to CSS or anything similar”, but all I want is a valid document, so this really isn’t an issue. The problem is that TagSoup won’t compile in Java 5 without some external libraries that add up to several megabytes. And if I use the TSaxon distribution, then I don’t get to use the built-in Java 5 XPath. However, if I do use the TSaxon library, I don’t need to be using Java 5 anyway… So is this my solution?

This entry was posted in News, Project Updates.