No More Jargon


The coalescence of thoughts on technical subjects in the areas of software design and computer languages.


    Monday, December 18, 2006

    Dear Lord, Let Us Express in Code That Which Belongs in Code

    This past semester I took a class called Advanced Systems Analysis and Design and in this class we ostensibly learned all about designing and managing the creation of a technical system from start to finish. I say ostensibly because I feel like I'm worse off now, if what I'm supposed to be able to do is design and manage such a system. Maybe it's my fault though. Maybe I should've started running for the hills when I saw that the course was listed under an MGMT heading in the course catalog.

    Now, to be fair, a few of the management bits in the course made sense and I think they might've conveyed valid information. And if most of the course had been filled with that sort of content, I'd probably be a happy camper (a little bored, mind you, since it's material I find necessary but not interesting) but not angry, as I am now.

    No, friends, the bulk of this course was about designing these systems. I'm not going to bore you with all the details of what went so god awful horribly wrong. Suffice it to say, we were getting a Manager's perspective on System Design: the perspective of someone so far removed from the technical details of a system that those details hold no importance, and who thus has no business submitting design specifications to an engineer. However, there is one element from the course that I must elaborate on, if only for the sheer idiocy of it all: the Base Structural Grammar.

    The first thing you have to know about the Base Structural Grammar is that no one else on the Internet knows jack or shit about it. Google searches for "Base Structure Grammar" and "Base Structural Grammar" turn up one result each. This led me and the rest of my teammates on the project to conclude that the BSG (as it was called for six weeks before the acronym was actually expanded) was either a fancy acronym for Battlestar Galactica or something the Professor made up on his own, in isolation, without anyone to tell him just how full of bullshit it was.

    At this point, you might be thinking to yourself, "Oh, I'm sure it wasn't that bad." Well friend, let me tell you just why this was so fucking stupid. It was a formal grammar for the description of control flow within a system that lacked BRANCHING CAPACITY. Okay? Get it? It was an attempt to describe operations that will be implemented in a Turing-complete programming system, in a language that, itself, is NOT TURING COMPLETE.

    How were different courses of execution handled, you might ask? By rewriting the entire case, changed to account for the different operations that would need to be called in that circumstance. As a comparison, if you were trying to do this while actually programming, you would need to rewrite each of your functions 2^N times, where N is the number of if statements in it, and then dispatch to each of these variants based on the values of the arguments (whose conditions you have to express magically, because there's no way to describe them in the grammar).
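
    To make the absurdity concrete, here's a contrived sketch (my own invention, nothing from the course): a single function with two independent conditions, and what it turns into when the language describing it can't branch. The cost and fee helpers are hypothetical stand-ins.

    # Hypothetical helpers, just so the sketch runs.
    def base_cost(order): return order["price"]
    def rush_fee(order):  return 5
    def wrap_fee(order):  return 2

    # What you'd write in any real programming language: one function, two ifs.
    def process(order, rush, gift):
        cost = base_cost(order)
        if rush:
            cost += rush_fee(order)
        if gift:
            cost += wrap_fee(order)
        return cost

    # The branch-free equivalent: 2^2 = 4 rewrites of the same case...
    def process_plain(order):     return base_cost(order)
    def process_rush(order):      return base_cost(order) + rush_fee(order)
    def process_gift(order):      return base_cost(order) + wrap_fee(order)
    def process_rush_gift(order): return (base_cost(order) + rush_fee(order)
                                          + wrap_fee(order))

    # ...plus a dispatcher that has to express the conditions anyway.
    DISPATCH = {(False, False): process_plain,
                (True,  False): process_rush,
                (False, True):  process_gift,
                (True,  True):  process_rush_gift}

    def process_no_branches(order, rush, gift):
        return DISPATCH[(rush, gift)](order)

    print(process({"price": 20}, rush=True, gift=False))    # 25
    print(process_no_branches({"price": 20}, True, False))  # 25

    Two conditions already mean four copies; every additional if doubles the count again.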

    When he explained this, I wanted to curb stomp his fat head. If I had been in the same room, I would've become physically violent. Luckily, I guess, I was watching the lecture remotely. I think I banged my head against my desk for about 10 minutes, but my memory from then is a little fuzzy.

    Why, why, why, why couldn't a programming language be used for that? Something like a pared down Scheme would be perfect for those sorts of expressions! And then the code could actually be USED (if it, in fact, worked) after the lower level operations were implemented. Heck, you could actually get a jump start on that if a half-decent testing system existed for this theoretical design language!

    I dunno, the mind boggles.

    Monday, December 04, 2006

    Metadata: The Darker Side of Meta

    This series of posts is part of a short paper I am writing for Communication Design for the WWW.

    The garden of metadata is not all sunshine and roses, though. There are thorns on our flowers of knowledge. Cory Doctorow, prolific Internet auteur, long ago released a general list of problems that will exist on the semantic web whenever it materializes. I will mention but a few of these, the ones I consider the most worrying: poisoned metadata and a lack of investment.

    It has been established that attention is the most valuable commodity on the Internet aside from hard currency (with the majority of concern about attention going towards converting it to said currency). In underhanded attempts to gather attention, people will make all sorts of audacious claims and use any technique to grab it wherever they may. You need only look in your spam folder to see this reality. With the capacity to add new information about information, a less than scrupulous user may attempt to place their content somewhere it does not belong. Within the context of tagging, this has been unimaginatively named "tag spam", a prime example of which can be seen here. The keywords that have been placed on this photograph have little, if anything, to do with the subject of the picture and seem to serve only promotional purposes.

    Spam of this sort seems not to have taken off yet, perhaps in part because the spammers are forced to work with applications that require user registration, which limits the number of anonymous entries they can deploy. This low volume has left the signal-to-noise ratio relatively high, and thus it is not much of a problem yet. But when metadata moves towards non-federated creation, there will be no such guarantee of a central executor to punish those who seed poisoned data.

    Apathy is what I would consider the next largest problem. Despite an inevitable trend towards a more technologically savvy population, there remains a large segment, even among younger users, that couldn't give a damn. Even though adding a tag takes but a few keystrokes, and adding geotags involves a few clicks on a map, it still does require work. Work that people just aren't interested in doing. For now, this is not a large concern, since the majority of users do in fact care about adding metadata to their content.

    Perhaps it won't be a problem though. If the visionaries are to be believed, not adding metadata will mean that not only will your content not be findable, it won't even be usable, even by yourself. Self-interested utility may be what drives metadata creation.

    In many senses, this is already the case. When a user tags a link on del.icio.us they typically do so because they found the link noteworthy and would like to be able to find it again. When a Flickr user tags a photo and adds geotags, they do so because they would like to find a specific picture again, or see all the pictures they have taken someplace.

    Will it fly in Peoria though? I'm not sure.

    Sunday, December 03, 2006

    Metadata: State of the Art

    This series of posts is part of a short paper I am writing for Communication Design for the WWW.

    Metadata's most modern incarnations exist in myriad forms. Of those, a few that rank highest on the intersection of hype and usefulness include tagging, geotagging, and microformats. These forms of metadata overlap somewhat in their concerns, but they are distinct in the way they are employed by users.

    Tagging is not a complicated concept, and once it is understood, some people dismiss it as too simple to be of serious use. Despite these doubts, tagging has proven to be an effective and easy way to add metadata to content.

    Tagging is simply attaching an explicit keyword to some data. It is different from categorization in two important ways: as many tags as desired can be added, and there is no fixed vocabulary for tags. This is typically where tagging is dismissed as overly simplistic. However, when the uses of tags begin to be explored, what initially seemed like a simple system begins to gain some exciting emergent properties.

    Tagging, of course, allows for findability, and that is how the vast majority of tag use is considered. Especially for image content, tags have been a boon to people searching for particular subjects, photography styles, and even colors. With the additional metadata of a Creative Commons license attached, it becomes trivial to find appropriate content from creators who are willing to share their work with you (as has been requested of me three times since I began using Flickr). Tag searches can use as many tags to filter results as is desired, both by including and excluding terms. While this is the most popular, and most familiar, use of tags, it's probably the least interesting.
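
    The mechanics of that kind of filtering are trivial to sketch, for the curious (a toy in-memory model, not any real service's implementation; the photos and tags are made up):

    # A toy tag search over an in-memory collection: results must carry
    # all the included tags and none of the excluded ones.
    photos = {
        "p1": {"sunset", "beach", "red"},
        "p2": {"sunset", "city"},
        "p3": {"beach", "family"},
    }

    def tag_search(collection, include, exclude=()):
        include, exclude = set(include), set(exclude)
        return [item for item, tags in collection.items()
                if include <= tags and not (exclude & tags)]

    print(tag_search(photos, include={"sunset"}, exclude={"city"}))  # ['p1']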

    Because tags are explicitly added metadata, and because the systems holding the content track entries as they appear, it is possible, using syndication technology, to maintain a subscription to a particular keyword search and be notified of new entries as they are added. This allows a user of Technorati's blog tag search feature to track interest in, and discussion of, particular topics across all weblogs on the Internet.

    Where this capacity truly becomes interesting is when tags are created for a single specific purpose: either to uniquely identify a concept that a limited subset of users is aware of, or, instead of attempting to define the subject of the data, to describe how, why, or for what purpose the data is to be used. Most visibly, this tends to emerge around events such as conferences, festivals, and expos (see sxsw2006 for example), but other examples can be seen in meme-like activities such as 10placesofmycity or infiniteflickr. The uses for purposes other than subject specification are also fascinating: many people (the author included) have tagged things that they are interested in, but do not have the time to explore, as todo or toread. Content that a user feels would be interesting to another user will be tagged as for:username.

    Of course, tagging is not without its difficulties, chief among them being semantic confusion between tags that are homonyms and syntactic disparity between tags that are synonyms of each other. One solution that has begun to gain ground on this particular problem, however, is tag clustering. The basic concept is that a given tag will likely have a number of other tags it is commonly seen with. Groups of tags, clusters, tend to emerge that are linked to a particular tag but are syntactically and semantically distinct from one another.
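
    The co-occurrence counting at the heart of clustering is easy to sketch (a bare-bones version; a real system would run proper cluster analysis over these counts):

    from collections import Counter
    from itertools import combinations

    # Count how often each pair of tags appears together. "turkey" seen
    # alongside "istanbul" versus "turkey" seen alongside "thanksgiving"
    # is exactly the signal that pulls a homonym apart.
    taggings = [
        {"turkey", "istanbul", "travel"},
        {"turkey", "thanksgiving", "dinner"},
        {"istanbul", "travel", "mosque"},
    ]

    cooccurrence = Counter()
    for tags in taggings:
        for pair in combinations(sorted(tags), 2):
            cooccurrence[pair] += 1

    print(cooccurrence.most_common(3))  # ('istanbul', 'travel') leads with 2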

    Geotagging is a concept similar to, but distinct from, tagging, as can be inferred by the name. It still involves the addition of specific keywords to a piece of content, but those keywords are very strict and have a direct correlation to a physical location on the globe, either as a recognizable location name or as latitude and longitude coordinates. The practice of geotagging probably emerged from the pseudo-sport of Geocaching, but it has a wider appeal in its use.

    Geotagging first began to emerge when the Google Maps API was hacked and people began producing mashups against existing databases that had locational information. One of the earliest and most striking of these mashups was Chicago Crime, which culled information from police reports and showed incidents against a map of the city. Another, less serious, example is overplot, a mashup of Overheard in New York (where all entries include a street address).

    Flickr also began seeing use of geotagging on photos with an informally specified set of tags, "geo:lat=xx.xxxx", "geo:lon=xx.xxxx", and "geotagged", which could be pulled from Flickr's databases using their tag access APIs. These tags were collected on external websites and allowed visitors to see tagged photos within a specific geographic area, as well as determine precisely where a picture was taken. Flickr has since added a built-in geotagging tool.
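
    Extracting machine-usable coordinates from that informal convention takes only a few lines (a sketch of the idea; the tag list and coordinates are invented):

    # Pull latitude and longitude out of a Flickr-style machine tag list.
    def coordinates(tags):
        coords = {}
        for tag in tags:
            for key in ("geo:lat", "geo:lon"):
                if tag.startswith(key + "="):
                    coords[key] = float(tag.split("=", 1)[1])
        return coords

    tags = ["geotagged", "geo:lat=42.6526", "geo:lon=-73.7562", "albany"]
    print(coordinates(tags))  # {'geo:lat': 42.6526, 'geo:lon': -73.7562}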

    Of particular interest with regards to geotagging is the automatic creation of geolocational information by devices that are involved in the creation process themselves. At least one camera has supported an integrated GPS recording feature and there is a system available to add the capacity to dSLRs.

    Of course, neither tagging nor geotagging addresses the question of how automated agents will go about actually using this metadata. While the applications where all these data are stored allow programmatic access through APIs, designing an agent capable of accessing and understanding all of those APIs would be nigh impossible. That is where microformats step up. Microformats are an especially simple conception; they don't even attempt to address new types of metadata. A microformat is simply a specifically structured, valid XHTML fragment that conforms to one of the predefined (micro)formats.

    As an example, let's look at the hCard microformat, which corresponds to the vCard contact format that has gained popularity amongst communication applications. Here is a simple hCard:


    <div class="vcard">
      <a class="url fn" href="http://nomorejargone.blogspot.com/">Daniel Nugent</a>
      <a class="email" href="mailto:nugend@fakemail.com">nugend@fakemail.com</a>
      <div class="adr">
        <div class="street-address">999 Madeup Street</div>
        <span class="locality">Springfield</span>,
        <span class="region">NY</span>,
        <span class="postal-code">00001</span>
        <span class="country-name">USA</span>
      </div>
      <div class="tel">518-867-5309</div>
    </div>


    and here is how the hCard appears when the HTML tags are rendered rather than escaped:


    Daniel Nugent


    999 Madeup Street

    Springfield, NY, 00001
    USA

    518-867-5309



    Not exactly the prettiest-looking output on the block, but that can be corrected with some style sheets. More importantly, this text is easily machine-parseable because it is in a commonly accepted format, and it is also human-readable because it is clear text in a common layout.
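
    To see just how machine-parseable, here is a bare-bones extraction sketch (not a real microformat parser; it leans on the fragment being well-formed XHTML and only pulls a few of the fields):

    import xml.etree.ElementTree as ET

    hcard = """<div class="vcard">
    <a class="url fn" href="http://nomorejargone.blogspot.com/">Daniel Nugent</a>
    <a class="email" href="mailto:nugend@fakemail.com">nugend@fakemail.com</a>
    <div class="tel">518-867-5309</div>
    </div>"""

    # Walk the fragment and collect the text of elements whose class
    # attribute names an hCard field we care about.
    def parse_hcard(xhtml, wanted=("fn", "email", "tel")):
        fields = {}
        for elem in ET.fromstring(xhtml).iter():
            for name in elem.get("class", "").split():
                if name in wanted:
                    fields[name] = "".join(elem.itertext()).strip()
        return fields

    print(parse_hcard(hcard))
    # {'fn': 'Daniel Nugent', 'email': 'nugend@fakemail.com', 'tel': '518-867-5309'}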

    Microformats exist for addresses, calendar entries, content licenses, and tags among others, with formats for resumes, reviews, and geotagging being developed.

    Some people have raised the question of why they should bother with microformats now if the whole-hog semantic web is going to wipe the floor with them tomorrow. The answer is this: data and metadata stored in microformats will be easily convertible to the official semantic web formats when those are finally decided upon. As an added bonus, robots developed to work with microformats will be able to recognize this data immediately and enhance the utility of content.

    Next: The Darker Side of Meta

    Metadata: The Vision

    This series of posts is part of a short paper I am writing for Communication Design for the WWW.

    What does having all this extra data in the form of amateur publications do for us, though, that it ought to be organized in the first place? Aside from the intrinsic reward of having data in a coherent order, we gain the potential for software to better discover the data we need for various reasons and to assemble that data in a meaningful way.

    This paper was actually researched with the use of del.icio.us's metadata search tools. While I could've used Google to do searches for the specific keywords related to metadata, what I could not do was limit those searches' results primarily to conference proceedings, papers, or serious web articles (as opposed to blog postings, forum chatter, and news clippings).

    While even the early returns are promising, the real vision of metadata lies in what the Semantic Web has the ability to bring us:

    The agent promptly retrieved information about Mom's prescribed treatment from the doctor's agent, looked up several lists of providers, and checked for the ones in-plan for Mom's insurance within a 20-mile radius of her home and with a rating excellent or very good on trusted rating services. It then began trying to find a match between available appointment times (supplied by the agents of individual providers through their Web sites) and Pete's and Lucy's busy schedules.

    We are, however, quite a ways away from the technology for medical information to be correlated automatically with personal scheduling (to say nothing of the legal issues with such information being open and accessible to machine reasoners). The reality of cutting-edge metadata is much less striking (and less unsettling), though still fairly useful.

    Next: State of the Art

    Metadata: Browsing the Web is Hard Work

    This series of posts is part of a short paper I am writing for Communication Design for the WWW.

    If Metadata is so important to organizing and finding data, why has it only recently become a topic under significant discussion?

    To answer this question properly, a brief history of the World Wide Web must be explored.

    In the summer of 1991, Tim Berners-Lee published the first web page, released the HTTP specification, and made available the first web browser and WYSIWYG editor. Berners-Lee's original vision for the web was as a collaborative medium where all visitors were content creators and everyone had access to a publishing space of their own. Due to a number of technological, social, and other circumstances, however, web publishers were initially limited to an elite set of advanced users and business interests.

    Because these publishers were primarily concerned with content of a technical or business nature, they could rely on existing structures of information to categorize or organize the content they wanted to create. In situations where there was no existing structure, either the data was not important enough to properly categorize, or an Information Architect could be employed to create a new taxonomy or hierarchy for it. In addition, compared to the content creation rates of today, there was a minuscule influx of new data to organize. This allowed the data that was created to be structured by hand.

    Also of import is that the data being published was largely textual in nature. This allowed search engines to perform latent semantic analysis on web pages to obtain a general meaning of the words on a page. Google further refined this technique by exploiting a previously unconsidered set of metadata inherent in the structure of the web itself: by counting the incoming links to a page, Google could determine the esteem in which a page was held with regard to its subject and return better results for keyword searches.
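
    The core idea fits in a few lines (a toy power-iteration sketch of link-based ranking, emphatically not Google's production algorithm):

    # Rank pages by incoming links: a page is esteemed if esteemed pages
    # link to it. links maps each page to the pages it links out to.
    def rank_pages(links, damping=0.85, iterations=50):
        n = len(links)
        rank = {page: 1.0 / n for page in links}
        for _ in range(iterations):
            fresh = {page: (1.0 - damping) / n for page in links}
            for page, outgoing in links.items():
                for target in outgoing:
                    fresh[target] += damping * rank[page] / len(outgoing)
            rank = fresh
        return rank

    web = {"a": ["b"], "b": ["a", "c"], "c": ["a"]}
    print(rank_pages(web))  # "a", linked to by both other pages, ranks highest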

    Google's PageRank was likely the last stopgap against the torrent of new web content, though. In the last five years, the barriers to individual content creation on the web have begun to fall one by one. Technical knowledge, financial barriers, and connection requirements have been eliminated with the advent of free, ad-sponsored publishing platforms like Blogger, Flickr, YouTube, Odeo, and a galaxy of other sites.

    How does this change anything?

    One, most of the content being published is undifferentiated except by its format. People don't restrict themselves to a single topic when they write, or take pictures, or make podcasts. They can less easily rely on a taxonomy to describe their content, nor do many users feel compelled to create content regarding a single overarching subject. An amateur photographer on Flickr may be taking snapshots of their family one day and creating experimental Photoshop collages from those very same snapshots the next.

    Two, much of the new content being created isn't textual. Computers have gotten better at recognizing objects in pictures and spoken words, but they're still lagging far behind their capacity to read digital text. We can't yet rely on Google to search through terabytes of images, video, and audio without supplementing that data with text.

    Three, by giving every John Q. Public and his brother the capacity to publish, the amount of content created daily has increased at an exponential rate. No one could do the job the old way even if they wanted to.

    Metadata, data about data, is suddenly very important.

    Next: The Vision

    Metadata: Machine Accessibility

    This series of posts is part of a short paper I am writing for Communication Design for the WWW.

    A recent article in the New York Times heralded the arrival of what it called "Web 3.0", or the Semantic Web. This caused quite a bit of tittering among commentators on the Internet, mostly because the paint was still fresh on Web 2.0 (whatever that actually means), but also because the Semantic Web is nothing new.

    The Semantic Web is a format and specification project that has been underway for almost a decade. Its stated goal is the creation of a knowledge format that will allow machine intelligences to comprehend and reason about a wide and constantly evolving range of data. The format, and formats that will be derived from it, are what is known as metadata.

    The McGraw-Hill Dictionary of Scientific and Technical Terms defines metadata as:

    A description of the data in a source, distinct from the actual data; for example, the currency by which prices are measured in a data source for purchasing goods.

    Of course, this definition has no strict association with computer data. By all rights, metadata has existed for centuries, the Dewey Decimal System being the most widely known and rigorous example. But even a convention as simple as alphabetical ordering by author, then title, in the non-fiction section is a use of metadata. The data is the work itself, the text on the pages, and the metadata is the author and title.

    This brings me to what I feel is an important point about metadata: although it ostensibly exists to allow mechanical interaction with data, the chief beneficiaries are ultimately humans.

    To show this, consider a simple thought experiment:
    1. Tear off the cover of every book in a library.
    2. Try to find a book written by your favorite author.


    Next: Browsing the Web is Hard Work

    About Me

    Truly a Simple Minded Fool