In Web 2.0, with its open access to creating and modifying webpages, editing rules are impossible to enforce. User-generated content is usually not curated by professionals and is, in addition, often written in an ad hoc manner, with many typographical errors and non-standard vocabulary. These changes in the material to be indexed have to be taken into account in the search process.
Tagging as a substitute for full-text search suffers from the same problems (ad hoc setting of tags, typographical errors, non-standard vocabulary), and in an even more aggravated form. Tags are usually single words and therefore in many cases highly variable and ambiguous70. As Adam Mathes illustrated with a search for the tag filtering on del.icio.us, completely different meanings are subsumed under one tag. Below is the list of webpage abstracts that resulted from searching for filtering71:
• Last.FM - Your personal music network - Personalized online radio station
• InfoWorld: Collaborative knowledge gardening
• Wired 12.10: The Long Tail
• “Oh My God It Burns!” Practical Applications of the Philosophers stone. For drunks. Brita filter makes bad vodka into good vodka
• Introduction to Bayesian Filtering
Tags are not only ambiguous in many cases, but also exhibit orthographic, morphological and semantic variation. Searching for accomodation and accommodation, or for video recorder and vcr, yields disparate results on current large user-generated content sites with a tagging system, such as flickr.com, del.icio.us or youtube.com. As the distribution of tags follows the classic Zipf curve, there are many rarely used tags, which are often just variations of commonly used tags. Tapping into the content labeled by rare tags, for example for search refinement or drill-down, would provide a much richer retrieval experience to the user.
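The idea of treating rare tags as variations of common ones can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name fold_rare_tags, the frequency threshold min_count, the similarity cutoff and the sample counts are all hypothetical, and the similarity measure is the generic one from Python's standard difflib module rather than a method proposed here.

```python
from collections import Counter
from difflib import get_close_matches


def fold_rare_tags(tag_counts, min_count=5, cutoff=0.85):
    """Map each rare tag to its most similar frequent tag, if one exists.

    min_count and cutoff are illustrative assumptions, not tuned values.
    """
    frequent = [tag for tag, count in tag_counts.items() if count >= min_count]
    mapping = {}
    for tag, count in tag_counts.items():
        if count >= min_count:
            continue  # frequent tags are kept as they are
        # Best candidate among the frequent tags, by string similarity only.
        match = get_close_matches(tag, frequent, n=1, cutoff=cutoff)
        if match:
            mapping[tag] = match[0]
    return mapping


# Invented sample counts, for illustration only.
counts = Counter({
    "accommodation": 120,
    "video recorder": 40,
    "accomodation": 3,
    "vcr": 2,
})
print(fold_rare_tags(counts))  # {'accomodation': 'accommodation'}
```

The orthographic variant accomodation is folded into accommodation, whereas vcr stays unmapped: a purely string-based measure cannot detect the semantic relation to video recorder, which would require additional knowledge such as a synonym list.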
The need for normalizing and cleaning tags is even greater if tags are to be used to build up any kind of ontology. Letting users set up a shared ontology by themselves has considerable advantages, especially considering that professionally curated ontologies are costly and often deviate from what people expect them to look like. However, if a user-generated ontology abounds in redundancy and inconsistencies, its contributions will be of very limited worth72.
While Web 2.0 is often seen in connection with the Semantic Web, it certainly deviates from the Semantic Web’s original concept of applying semantics to content. In contrast to the Semantic Web, which is hardly separable from ontologies and inference mechanisms73, Web 2.0 builds upon collaborative classifications, often called folksonomies. Well-known examples of folksonomies are social bookmarking sites such as del.icio.us or the category system on Wikipedia (see above).
70 David Crystal points out nice examples such as depression: era, geographical formation or psychic state.
71 http://www.adammathes.com/academic/computer-mediated-communication/folksonomies.html
72 If multiple users generate content or ontologies, adding nodes, especially in popular areas, will happen quickly, but systematic modifications and balance do not follow automatically.
The main advantage of tagging over a priori ontologies is often seen in its simplicity, its robustness and its origin in actual usage rather than in a rigid normative ontology. There are even voices that see the lack of variance recognition as an innate strength of folksonomies:
“Aside: I think the lack of hierarchy, synonym control and semantic precision are precisely why it works. Free typing loose associations is just a lot easier than making a decision about the degree of match to a pre-defined category (especially hierarchical ones). Its like 90% of the value of a proper taxonomy but 10 times simpler.”74
However, the lack of explicit relationships often leaves folksonomies containing nothing more than loosely associated single words. For example, searching for notebook on del.icio.us results in the following related tags:
Among the related tags presented here are a synonym of notebook (laptop) and its singular variant, but also very general tags such as shopping, software or blog. Sometimes even “junk” tags such as laquo (obviously originating from the HTML entity) are produced (searching for northwest airlines on Oct. 1, 2006). The quality of the related tags thus differs greatly. A variance handling routine could work without any intrusion, for example by just grouping equivalent tags without forcing any control on the usage of tag variants.
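A non-intrusive grouping of this kind might look as follows. The sketch is a rough illustration, not a description of any existing tagging system: the names tag_key, group_tags and ENTITY_JUNK, the naive plural rule and the hand-supplied synonym table are all assumptions made for the example. The stored tags are never rewritten; they are only collected under a shared key.

```python
import re
from collections import defaultdict

# Hypothetical blocklist of tags that are leftover HTML entity names.
ENTITY_JUNK = {"laquo", "raquo", "nbsp", "amp", "quot"}


def tag_key(tag):
    """Compute a grouping key; the tag itself is left untouched."""
    key = re.sub(r"[^a-z0-9]", "", tag.lower())
    # Naive plural folding, assumed sufficient for illustration only.
    if len(key) > 3 and key.endswith("s") and not key.endswith("ss"):
        key = key[:-1]
    return key


def group_tags(tags, synonyms=None):
    """Group tags under a shared key, skipping entity junk."""
    synonyms = synonyms or {}
    groups = defaultdict(set)
    for tag in tags:
        key = tag_key(tag)
        if not key or key in ENTITY_JUNK:
            continue  # drop junk such as "laquo"
        key = synonyms.get(key, key)  # optional hand-made synonym table
        groups[key].add(tag)
    return dict(groups)


related = ["notebook", "Notebooks", "note-book", "laptop", "shopping", "laquo"]
groups = group_tags(related, synonyms={"laptop": "notebook"})
print(groups["notebook"])
```

Because every user-chosen spelling is still present inside its group, a search can expand to all variants of a tag without any control being imposed on how users tag.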
Tagging is often considered to be an approximation of how human minds store and access knowledge. Often just a small piece of information is needed to retrieve a memory that is inaccessible via a systematic search. For example, if one is looking for the translation of a term into a foreign language one once learned (say, the English word for Strassenbahn), the key to retrieving the translation is usually a phonetic chunk, but might also be a related term (even if weirdly related on the basis of associations or personal memories; in the case of streetcar, maybe a scene with Vivien Leigh from the movie “A Streetcar Named Desire”). The possibility that the mind starts the retrieval process by scanning an ontology, starting with “entities”, moving down to “concrete objects”, to “means of transport”, to “public transportation” and so on, is hardly convincing. An empirical indication of this lies in the low usage frequency of many highest-level ontology labels, such as “entity”, “concrete objects” or “means of transport”, especially compared to concepts of medium granularity, such as “car” or “train”.
73 See for example [Davies/Fensel/van Harmelen 2003].
74 Stewart Butterfield’s Sylloge blog, see http://www.sylloge.com/personal/2004/08/folksonomy-social-classification-great.html [Nov. 1, 2006].
Equally, literal string matching between single-word tags is not a plausible model of how the mental tagging process could work. This renders current tag search applications a very crude approximation of how humans would organize knowledge and memories75.