The overarching purpose of this study is to improve our understanding of the information space of social bookmarking. The approach taken is to conceptualize the space as the aggregation of personal information spaces, each of which is constructed by the bookmarking activities of an individual user. The structure of the information space can then be studied in terms of unions and intersections of personal information spaces. In the framework of social network analysis there is a special kind of network, called an affiliation network, that is well suited to representing this conceptual picture. An affiliation network provides a representation of the theoretical concept of intersecting social circles, and allows investigation of relations among people based on their joint participation in groups of some sort. Analogous to social circles, the bookmarked information objects comprise information spaces, and users are connected by intersecting information spaces. Together with this abstract representation, the methods of social network analysis provide the analytic framework for this study.
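The picture of unions and intersections of personal information spaces can be made concrete with a small sketch. The user names, URLs, and function names below are hypothetical and chosen only for illustration; they are not taken from the study's data or tooling. Each personal information space is modeled as the set of URLs a user has bookmarked.

```python
from itertools import combinations

# Hypothetical toy data: each user's personal information space is
# modeled as the set of URLs that user has bookmarked.
personal_spaces = {
    "alice": {"http://a.example", "http://b.example", "http://c.example"},
    "bob": {"http://b.example", "http://c.example", "http://d.example"},
    "carol": {"http://e.example"},
}


def aggregate_space(spaces):
    """Return the union of all personal information spaces."""
    return set().union(*spaces.values())


def shared_items(spaces):
    """Map each user pair to the URLs both have bookmarked.

    Pairs with an empty intersection are omitted, so the keys of the
    result are exactly the user pairs connected in the affiliation sense.
    """
    shared = {}
    for u, v in combinations(sorted(spaces), 2):
        common = spaces[u] & spaces[v]
        if common:
            shared[(u, v)] = common
    return shared


if __name__ == "__main__":
    print(len(aggregate_space(personal_spaces)), "distinct URLs in the aggregate space")
    for pair, urls in shared_items(personal_spaces).items():
        print(pair, sorted(urls))
```

In this toy example only one pair of users is connected, because the third user's space does not intersect with either of the others.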
The main focus of the study is on the problem of identifying and characterizing shared interest space(s) within the large-scale information space of a social bookmarking site. The basic underlying assumption is that choices people made in the past can serve as implicit indicators of their interests or preferences, and that non-random patterns emerge in aggregation that provide a basis for identifying similar people. That is, we can infer shared interests among users of a social bookmarking site based on their past bookmarking behavior. The well-established research areas of citation analysis and collaborative filtering, reviewed in the previous chapter (under the heading of social information space), provide theoretical and empirical support for this assumption.
In order to address the research problem of a shared interest space, this study is designed to be carried out in three phases, each asking one of three separate yet closely related questions. First, to what extent do bookmarking activities accumulate, and how much overlap is there in a social bookmarking site? Second, can users of a social bookmarking site be connected based on their shared interests (their common possession of bookmarks) and, if so, how? Third, is it possible to identify communities of interest within the network?
As a setting for the study, a popular social bookmarking site, delicious.com,⁹ was chosen. Delicious.com is known as the first and one of the most successful instances of social bookmarking. With its relatively long history and broad user base, the site can serve as a strong example of the aggregated information space of social bookmarking.

⁹ Delicious.com (formerly del.icio.us; http://delicious.com) is a “social bookmark manager,” where registered users save their bookmarks on the shared web site. When users add a bookmark, the URL and title of the web page as well as the creation time of the bookmark are recorded. In addition, users can choose to “tag” the bookmark.
When a bookmark entry is created, it is immediately shown on the front page of the site, where several of the most recent posts are displayed. Here, anyone can see the entry, not only the user who posted the bookmark. Each entry consists of the link to the web page with the title as link text, the list of tags, the username of the person who created it, the number of other people who have saved the same page (URL), and the time at which it was added. From the point of view of the user who added the entry, the moment he/she posts a bookmark, he/she can see how many other users bookmarked the same page and, further, how they tagged it and when they added it.
Each user has a personal page, where all the bookmarks they have added are displayed in reverse-chronological order, along with the list of tags they have used. Compiled from individual user accounts, each and every tag in the system has a tag page where all the bookmarks tagged with that term by any user are listed. Similarly, for each unique item identified by a URL, there is a page listing all the bookmark entries made on the item.
Delicious.com was founded by Joshua Schachter in 2003 and acquired by Yahoo! in 2005. By the end of 2008, delicious.com claimed about 5.3 million users

Finally, an important decision made in designing this study needs to be mentioned: the decision to draw a network based only on bookmark posting behavior and not on tagging behavior. It may seem intuitively appealing to use all the available information, both the information objects bookmarked and the tags assigned to them, to build connections among users. Ideally, if we find someone who is interested in the same material and also classifies that material in a way similar to our tagging, his/her interests are probably closely related to ours. In fact, there have been studies representing social bookmarking data as a tripartite graph (Lambiotte & Ausloos, 2006), which allows the presentation of all three entities (people, information objects, and tags) and their interconnections. However, dealing with a tripartite graph is computationally complex and demanding, and there is little tool support for studying tripartite graphs. Therefore, it is not feasible to explore a large-scale dataset with a tripartite graph. The dominant practice is to reduce the complexity by transforming a tripartite graph into bipartite graphs, each of which consists of two distinct kinds of entities and connections between them. A bipartite graph is called an affiliation network in social network analysis.

Given the necessity of choosing one entity, either information objects (URLs) or tags, to represent in addition to people (users), the question is which would represent the relationships between entities more reliably. Although tags have their own merits, relying on tags can introduce non-negligible noise for a number of interrelated reasons. First, tags in social bookmarking systems are not controlled. The problems due to the uncontrolled nature of tagging systems, including polysemy and synonymy, have been pointed out and, indeed, reported to be abundant in social bookmarking data. This makes it challenging to process tags. Second, categorization research in cognitive science has documented strong empirical evidence that the categorization process is highly context dependent and subjective. Similarly, in the area of personal information management, it has been shown that, in both physical and digital environments, people’s organization behavior is significantly influenced by various contextual factors. This means that tags can vary depending on specific tasks or situations and, thus, without knowing the context, there can be many cases where it is difficult to decide whether two instances of the same tag (the same string) used by different users (or even by the same user at different points in time) represent the same or a similar interest. Third, empirical studies collecting data from a social bookmarking site commonly report that a large portion of items are saved without any tags. This finding suggests that people exhibit ‘piling’ behavior in this environment too. Considering these factors, it was decided that the URLs bookmarked provide a better indicator of the interests that will connect users.
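The reduction from a tripartite graph to bipartite graphs described above can be illustrated with a minimal sketch. It assumes each post is recorded as a (user, URL, tags) triple; the sample posts and function names are hypothetical and are not drawn from the study's dataset. The sketch derives a user-URL edge set and a user-tag edge set from the same posts.

```python
# Hypothetical posts, each a (user, url, tags) triple.
posts = [
    ("alice", "http://a.example", ["python", "tutorial"]),
    ("bob", "http://a.example", []),            # saved without any tags
    ("bob", "http://b.example", ["python"]),
    ("carol", "http://c.example", ["python"]),  # same string, possibly a different sense
]


def user_url_edges(posts):
    """Bipartite user-URL edges: who bookmarked which information object."""
    return {(user, url) for user, url, _tags in posts}


def user_tag_edges(posts):
    """Bipartite user-tag edges: who used which tag string.

    Untagged posts contribute nothing here, so that part of a user's
    bookmarking behavior is invisible in this reduction.
    """
    return {(user, tag) for user, _url, tags in posts for tag in tags}


if __name__ == "__main__":
    print(sorted(user_url_edges(posts)))
    print(sorted(user_tag_edges(posts)))
```

In the user-URL reduction, the two users who saved the same page are linked through that page even though one of them saved it without tags. In the user-tag reduction, all three users are linked through the same tag string, whether or not they mean the same thing by it, and the untagged post leaves no trace.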
The decision to exclude tags was a practical choice and is not, by any means, meant to refute the value of tags. In fact, one of the motivations for studying the shared interest structure within the network comes from the recognition that, given the highly subjective and variable nature of categorization, documented in both cognitive science and information science, it would be beneficial if we could identify homogeneous communities of interest first and study tagging behaviors within those communities. One of the potential contributions of this study would be laying the groundwork for a comparative study of tag usage within and across communities of interest within the broad information space of social bookmarking.
The remainder of this chapter describes the data collection and sampling methods, and the measures and tools that were used for the analysis in each phase of the study. Note that each phase was designed to build upon the previous phase, with increasingly specific goals. The first phase evaluated the extent to which bookmark postings accumulate and overlap in relation to both entities of interest: users and information objects. Three separate datasets, each of which captures a different portion of the information space, were used to properly represent the entire information space of delicious.com. In the second phase, the investigation focused on a specific part of the information space, by creating and exploring a network of the most active users. By deriving relations (links) among users based on their intersecting personal information spaces (i.e., common bookmarks), the network represented users with shared interests. The overall structure of the network was examined with various network analytic measures. Finally, the third phase narrowed the focus further to a specific structure of the network, i.e., a community structure: a structure consisting of densely connected sub-regions, possibly representing coherent areas of interest.
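As a rough illustration of the first two phases, the following sketch assumes the data are available as (user, URL) pairs. The sample pairs and function names are hypothetical, and the measures shown (the share of URLs saved by more than one user, and link weights equal to the number of common bookmarks) are simple stand-ins, not necessarily the measures used in the study.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical (user, url) bookmark postings.
posts = [
    ("alice", "u1"), ("alice", "u2"), ("alice", "u3"),
    ("bob", "u2"), ("bob", "u3"),
    ("carol", "u3"), ("carol", "u4"),
]


def users_per_url(posts):
    """Inverted index: for each URL, the set of users who bookmarked it."""
    index = defaultdict(set)
    for user, url in posts:
        index[url].add(user)
    return index


def overlap_share(index):
    """Fraction of distinct URLs bookmarked by more than one user."""
    if not index:
        return 0.0
    shared = sum(1 for users in index.values() if len(users) > 1)
    return shared / len(index)


def project_users(index):
    """Weighted user-user links derived from the user-URL affiliation data.

    The weight of a link is the number of URLs the two users have in common.
    """
    weights = Counter()
    for users in index.values():
        for pair in combinations(sorted(users), 2):
            weights[pair] += 1
    return weights


if __name__ == "__main__":
    index = users_per_url(posts)
    print("share of URLs saved by more than one user:", overlap_share(index))
    for pair, weight in sorted(project_users(index).items()):
        print(pair, weight)
```

Iterating over the inverted index rather than over all user pairs means that only users who share at least one URL are ever compared, which matters when most pairs in a large dataset have nothing in common.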