No More Jargon

The coalescence of thoughts with regards to technical subject matters in the areas of software design and computer languages.


    Sunday, December 03, 2006

    Metadata: State of the Art

    This series of posts is part of a short paper I am writing for Communication Design for the WWW.

    Metadata's most modern incarnations exist in myriad forms. Of those, a few that rank highest on the cross section of hype and usefulness include tagging, geotagging, and microformats. These forms of metadata overlap in their concerns somewhat, but they are distinct in the way that they are employed by users.

    Tagging is not a complicated concept, and once it is understood some people dismiss it as too simple to be of serious use. Despite these doubts, tagging has proven to be an effective and easy way to add metadata to content.

    Tagging is simply attaching an explicit keyword to some data. It differs from categorization in two important ways: any number of tags can be added to an item, and there is no fixed vocabulary from which tags must be drawn. This is typically where tagging is dismissed as overly simplistic. However, once the uses of tags are explored, what initially seemed like a simple system begins to show some exciting emergent properties.

    Tagging of course aids findability, and that is the context in which the vast majority of tag use is considered. Especially for image content, tags have been a boon to people searching for particular subjects, photography styles, and even colors. With the additional metadata of Creative Commons licenses attached, it becomes trivial to find appropriate content from creators who are willing to share their work with you (as has been requested of me three times since I began using Flickr). Tag searches can filter results with as many tags as desired, both by including and by excluding terms. While this is the most popular and most familiar use of tags, it's probably the least interesting.
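
    The include/exclude filtering described above can be sketched in a few lines of Python. This is only an illustration: the PHOTOS catalogue, its tags, and the search_by_tags function are made up for the example and do not reflect how Flickr or any other service implements search.

```python
# Hypothetical in-memory catalogue: item name -> set of tags.
PHOTOS = {
    "sunset.jpg": {"sunset", "beach", "orange", "cc-by"},
    "skyline.jpg": {"city", "night", "cc-by"},
    "pier.jpg": {"beach", "night", "blackandwhite"},
}


def search_by_tags(items, include=(), exclude=()):
    """Return names of items carrying every `include` tag and no `exclude` tag."""
    wanted = set(include)
    unwanted = set(exclude)
    return sorted(
        name
        for name, tags in items.items()
        # wanted <= tags: all wanted tags present; unwanted & tags: any overlap
        if wanted <= tags and not (unwanted & tags)
    )
```

    With this sample data, search_by_tags(PHOTOS, include=["beach"], exclude=["night"]) returns only sunset.jpg. Because tags are unordered and uncontrolled, plain set operations are all the filter needs.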

    Because tags are explicitly added metadata, and the systems that host content can track entries as they appear, it is possible, using syndication technology, to maintain a subscription to a particular keyword search and be notified of new entries as they are added. This allows a user of Technorati's blog tag search feature to track interest in and discussion of particular topics across the weblogs it indexes.
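
    As a rough sketch of what such a subscription involves, the Python below reads an RSS 2.0 document and reports only the items it has not seen before. It assumes the tag search is published as a plain RSS 2.0 feed; the SAMPLE_FEED text, its example.com URLs, and the new_entries function are invented for illustration, and fetching the feed over the network is left out.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed body, standing in for the result of a tag search.
SAMPLE_FEED = """<rss version="2.0"><channel>
<title>Posts tagged metadata</title>
<item><title>First post</title><link>http://example.com/1</link><guid>http://example.com/1</guid></item>
<item><title>Second post</title><link>http://example.com/2</link><guid>http://example.com/2</guid></item>
</channel></rss>"""


def new_entries(feed_xml, seen_ids):
    """Return (title, link) pairs for items not seen before and remember them."""
    root = ET.fromstring(feed_xml)
    fresh = []
    for item in root.iter("item"):
        # Use <guid> as the identity of an entry, falling back to <link>.
        ident = item.findtext("guid") or item.findtext("link")
        if ident is None or ident in seen_ids:
            continue
        seen_ids.add(ident)
        fresh.append(
            (item.findtext("title", default=""), item.findtext("link", default=""))
        )
    return fresh
```

    Calling new_entries repeatedly with the same seen_ids set mimics a feed reader polling the search: the first call returns both sample items and a second call on the unchanged feed returns nothing.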

    Where this capacity truly becomes interesting is when tags are created for a single specific purpose: either to uniquely identify a concept that only a limited subset of users is aware of or, instead of attempting to define the subject of the data, to describe how, why, or for what purpose the data is to be used. The first kind tends to emerge around events such as conferences, festivals, and expos (see sxsw2006 for example), but other examples can be seen in meme-like activities such as 10placesofmycity or infiniteflickr. The non-subject uses are also fascinating. Many people (the author included) tag things that they are interested in but do not have the time to explore as todo or toread. Content that one user feels would interest another is tagged as for:username.

    Of course, tagging is not without its difficulties. Chief among them are semantic confusion between tags that are homonyms and syntactic disparity between tags that are synonyms of each other. One solution that has begun to gain ground on this problem is tag clustering. The basic concept is that a given tag will likely have a number of other tags it is commonly seen with. Groups of tags, or clusters, tend to emerge that are linked to a particular tag but are syntactically and semantically distinct.
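
    One simple way to picture tag clustering is to count how often tags appear together and then group a tag's frequent companions by whether they also appear with each other. The Python sketch below does exactly that and nothing more; it is not the algorithm used by any particular site, and the function names, the min_count threshold, and the sample items are assumptions made for the example.

```python
from collections import Counter, defaultdict
from itertools import combinations


def cooccurrence(tagged_items):
    """Count how often each pair of tags appears on the same item."""
    counts = defaultdict(Counter)
    for tags in tagged_items:
        for a, b in combinations(sorted(set(tags)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def related_tags(counts, tag, min_count=2):
    """Tags seen together with `tag` at least `min_count` times."""
    return [t for t, n in counts[tag].most_common() if n >= min_count]


def clusters_around(counts, tag, min_count=2):
    """Group the companions of `tag` that also co-occur with one another."""
    neighbours = set(related_tags(counts, tag, min_count))
    unvisited = set(neighbours)
    clusters = []
    while unvisited:
        stack = [min(unvisited)]  # deterministic starting point
        group = set()
        while stack:
            current = stack.pop()
            if current in group:
                continue
            group.add(current)
            for other in neighbours:
                if other not in group and counts[current][other] >= min_count:
                    stack.append(other)
        unvisited -= group
        clusters.append(sorted(group))
    return sorted(clusters)


# Hypothetical items tagged with the homonym "apple".
ITEMS = [
    {"apple", "fruit", "orchard"},
    {"apple", "fruit", "pie"},
    {"apple", "orchard", "fruit"},
    {"apple", "mac", "laptop"},
    {"apple", "mac", "ipod"},
    {"apple", "laptop", "mac"},
]
```

    For the sample items, clusters_around(cooccurrence(ITEMS), "apple") yields two groups, one containing fruit and orchard and one containing laptop and mac, which separates the two senses of the homonym.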

    Geotagging is a similar but distinct concept from tagging, as can be inferred from the name. It still involves the addition of specific keywords to a piece of content, but those keywords are very strict and correspond directly to a physical location on the globe, either as a recognizable location name or as latitude and longitude coordinates. The practice of geotagging probably emerged from the pseudo-sport Geocaching, but it has a wider appeal in its use.

    Geotagging first began to emerge when the Google Maps API was hacked and people began producing mashups against existing databases that had locational information. One of the earliest and most striking of these mashups was Chicago Crime, which culled information from police reports and showed incidents against a map of the city. Another, less serious, example is overplot, a mashup of Overheard in New York (where all entries include a street address).

    Flickr also began seeing use of geotagging on photos with an informally specified set of tags, "geo:lat=xx.xxxx", "geo:lon=xx.xxxx" and "geotagged", which could be pulled from Flickr's databases using their tag access APIs. These tags were collected on external websites and allowed visitors to see tags within a specific geographic area, as well as determine precisely where a picture was taken. Flickr has since added a built-in geotagging tool.
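
    To make the convention concrete, here is a small Python sketch that pulls coordinates out of a list of tags written in the geo:lat= / geo:lon= style and checks them against a bounding box. The function names and sample coordinates are assumptions for illustration; this is not the code any of the external sites used.

```python
def parse_geotags(tags):
    """Return (lat, lon) from geo:lat= / geo:lon= tags, or None if absent or invalid."""
    lat = lon = None
    for tag in tags:
        key, sep, value = tag.partition("=")
        if not sep:
            continue  # ordinary tag such as "geotagged"
        try:
            if key == "geo:lat":
                lat = float(value)
            elif key == "geo:lon":
                lon = float(value)
        except ValueError:
            return None  # coordinate value was not a number
    if lat is None or lon is None:
        return None
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        return None
    return lat, lon


def within_box(point, south, west, north, east):
    """True if the point lies in the box (boxes crossing the 180th meridian are not handled)."""
    lat, lon = point
    return south <= lat <= north and west <= lon <= east
```

    For example, parse_geotags(["geotagged", "geo:lat=42.7284", "geo:lon=-73.6918"]) returns the pair (42.7284, -73.6918), which within_box can then test against an area of interest.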

    Of particular interest with regard to geotagging is the automatic creation of geolocational information by the devices that are involved in the creation process themselves. At least one camera has supported an integrated GPS recording feature, and there is a system available to add the capacity to dSLRs.

    Of course, neither tagging nor geotagging addresses how automated agents will actually go about using this metadata. While the applications where all this data is stored allow programmatic access through APIs, designing an agent capable of accessing and understanding all those APIs would be nigh impossible. That is where microformats step up. Microformats are an especially simple conception; they don't even attempt to address new types of metadata. A microformat is simply a specifically structured, valid XHTML fragment that conforms to one of the predefined (micro)formats.

    As an example, let's look at the hCard microformat, which corresponds to the vCard contact format that has gained popularity amongst communication applications. Here is a simple hCard:

    <div class="vcard">
    <a class="url fn" href="">Daniel Nugent</a>
    <a class="email" href=""></a>
    <div class="adr">
    <div class="street-address">999 Madeup Street</div>
    <span class="locality">Springfield</span>,
    <span class="region">NY</span>,
    <span class="postal-code">00001</span>
    <span class="country-name">USA</span>
    </div>
    <div class="tel">518-867-5309</div>
    </div>

    and how the hCard appears without escaping the HTML tags:

    Daniel Nugent

    999 Madeup Street

    Springfield, NY, 00001


    Not exactly the prettiest-looking output on the block, but that can be corrected with some style sheets. More importantly, this text is easily machine-parseable because it is in a commonly accepted format, and it is also human-parseable because it is plain text in a common layout.
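
    To give a feel for how little work a robot has to do, the Python sketch below reads hCard properties out of markup using only the standard library's html.parser module. It is a simplification rather than a conforming hCard parser: it assumes well-formed markup with every element closed and no void elements such as br, it only looks at the class names listed in HCARD_PROPERTIES, and the HCardParser and parse_hcard names are made up for the example. The sample markup repeats the card above with its div elements closed.

```python
from html.parser import HTMLParser

# hCard class names this sketch understands (a subset chosen for the example).
HCARD_PROPERTIES = {
    "fn", "street-address", "locality", "region",
    "postal-code", "country-name", "tel",
}


class HCardParser(HTMLParser):
    """Collect the text of elements whose class names are hCard properties."""

    def __init__(self):
        super().__init__()
        self.card = {}
        self._open = []  # one entry per open element: the properties it carries

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        self._open.append([c for c in classes if c in HCARD_PROPERTIES])

    def handle_endtag(self, tag):
        if self._open:
            self._open.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text or not self._open:
            return
        # Text belongs to the innermost open element only.
        for prop in self._open[-1]:
            self.card[prop] = (self.card.get(prop, "") + " " + text).strip()


def parse_hcard(html):
    """Return a dict mapping hCard property names to their text."""
    parser = HCardParser()
    parser.feed(html)
    parser.close()
    return parser.card


SAMPLE_HCARD = """
<div class="vcard">
  <a class="url fn" href="">Daniel Nugent</a>
  <div class="adr">
    <div class="street-address">999 Madeup Street</div>
    <span class="locality">Springfield</span>,
    <span class="region">NY</span>,
    <span class="postal-code">00001</span>
    <span class="country-name">USA</span>
  </div>
  <div class="tel">518-867-5309</div>
</div>
"""
```

    Calling parse_hcard(SAMPLE_HCARD) gives a dictionary keyed by property name, so the name is available under "fn" and the city under "locality" without the program knowing anything about the page beyond the hCard class names.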

    Microformats exist for addresses, calendar entries, content licenses, and tags among others, with formats for resumes, reviews, and geotagging being developed.

    Some people have raised the question of why they should bother with microformats now if the full-hog semantic web is going to wipe the floor with them tomorrow. The answer is this: data and metadata stored in microformats will be easily convertible to the official semantic web formats when they are finally decided upon. As an added bonus, robots that are developed to work with microformats will be able to recognize this data immediately and enhance the utility of content.

    Next: The Darker Side of Meta
