No More Jargon

The coalescence of thoughts with regards to technical subject matters in the areas of software design and computer languages.


    Sunday, December 03, 2006

    Metadata: Browsing the Web is Hard Work

    This series of posts is part of a short paper I am writing for Communication Design for the WWW.

    If Metadata is so important to organizing and finding data, why has it only recently become a topic under significant discussion?

    To answer this question properly, a brief history of the World Wide Web must be explored.

    In the summer of 1991, Tim Berners-Lee published the first web page, released the HTTP specification, and made available the first web browser and WYSISWYG editor. Sir Berners-Lee's original vision for the web was as a collaborative medium where all visitors where content creators and everyone had access to a space to publish on of their own. Due to a number of technological, social, and other kinds of circumstances however, web publishers were initially limited to an elite set of advanced users and business interests.

    Because these publishers were primarily concerned with content of a technical or business nature, they could rely on existing structures of information to categorize or organize the content they wanted to create. In situations where there was no existing structure, the data might not have been important enough to properly categorize or an Information Architect could be employed to create a new taxonomy or hierarchy for the new data. In addition, compared to the content creation rates of today, there was a minuscule influx of new data to properly organize. This allowed the data that was created to be structured by hand.

    Also of import is that the data being published was largely textual in nature. This allowed for search engines to perform latent semantic analysis on web pages to obtain a general meaning of the words on a page. Google further refined on this technique by exploiting a previously unconsidered set of metadata inherent in the structure of the web itself: by counting the incoming links to a page, Google could determine the esteem that a page held with regards to its subject and return better results for keyword searches.

    Google's PageRank was likely the last stop gap against the torrent of new web content though. In the last five years, the barriers to individual content creation on the web have begun to fall one by one. Technical knowledge, financial barriers, and connection requirements have been eliminated with the advent of free, ad-sponsored publishing platforms, like Blogger, Flickr, YouTube, Odeo, and a galaxy of other sites.

    How does this change anything?

    One, most of the content that is being published is undifferentiated except by the actual format. People don't restrict themselves to a single topic when they write, or take pictures, or make podcasts. They can less easily rely on using a taxonomy to describe their content, nor do many users feel compelled to create content regarding a single overarching subject. An amateur photographer on Flickr may be taking snapshots of their family one day and creating experimental Photoshop collages from those very same snapshots the next.

    Two, much of the new content being created isn't textual. Computers have gotten better at recognizing objects in pictures and spoken words, but they're still lagging far behind their capacity to read digital text. We can't yet rely on Google to search through terabytes of images, video, and audio without supplementing that data with text.

    Three, by giving every John Q Public and his brother the capacity to publish, the amount of content created daily has increased at an exponential rate. No one could do the job the old way even if they wanted to.

    Metadata, data about data, is suddenly very important.

    Next: The Vision

    No comments:

    About Me

    My photo
    Truly a Simple Minded Fool