The goal of personalization is to record a history of user interests and to use this history to determine the context of a user query. The natural question is how exactly this can be accomplished. There are a number of approaches to user modelling and, in the context of web searching, most of them rely on keywords or combinations of keywords to represent user interests. A pure keyword or keyword-combination approach raises a few concerns.
Keywords used on their own tend to be fairly isolated, as the relationship between words and the topics they describe is not always clearly defined. Another issue is the synonym and homonym problem: keyword-only approaches do not always consider that different words may share a meaning, or that one word may have several meanings. Clearly, what is needed is a more structured approach to keyword modelling to help address some of these concerns. Some authors use a tree-based organization as the basis for user modelling [67]. A tree-based approach could therefore be beneficial for modelling users and their interests.
In their paper, Tanudjaja and Mui describe a tree-based approach to user modelling that could be adapted for use in the user agent of this section [84]. In brief, the basic idea is to use a tree data structure and then annotate or “colour” the nodes with additional information, typically keywords or key phrases. This adaptation of their idea is briefly discussed below.
The Open Directory Project (ODP) tree
The Open Directory Project (ODP) is a large and quite comprehensive human-edited web directory. The main purpose of the ODP is to list and categorize web sites. To accomplish this, its creators have defined a tree structure for categorizing the websites submitted to the project. The ODP tree structure is defined in Resource Description Framework (RDF) format and is freely available for use. The ODP also associates URLs with nodes in the tree. For the adaptation presented here, the URL associations with nodes of the ODP tree are less important than the structure of the tree itself.
Node annotation approach
The ODP tree is a multiply connected general tree, where each node can have multiple children and multiple parents. The uppermost nodes of the tree describe general topics (such as arts, business and science) and their children describe further divisions within each topic. Each node of the tree can be annotated with information regarding the specific division described by the node.
In the approach presented here, internal nodes in the tree denote topics and leaf nodes denote keywords or key phrases that are associated with the specific topic. Each node carries annotations such as a name, a unique identifier, a usage count (for a leaf node, the number of times the keyword was used in query formulations; for an internal node, the number of visits to the node) and a weight between -1 and 1 denoting its relevance to the topic. Note that the relevance weight can be negative, in which case it denotes the degree to which the term is not relevant to the topic and should be avoided. This scheme is illustrated in figure 8.2.
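The annotations listed above can be pictured as fields on a node record. The following is only a minimal sketch under stated assumptions: the class name `Node`, its field names and the example values are hypothetical and are not taken from [84] or from the ODP data format. A node with several parents can be represented by adding the same object under more than one parent.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Node:
    """One annotated tree node (all names here are illustrative assumptions)."""
    node_id: int                 # unique identifier
    name: str                    # topic name or keyword/key phrase
    is_keyword: bool = False     # True for leaf nodes that hold a keyword
    count: int = 0               # keyword: uses in queries; topic: visits
    weight: float = 0.0          # relevance to the topic, within [-1, 1]
    children: Dict[str, "Node"] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Reject weights outside the range described in the text.
        if not -1.0 <= self.weight <= 1.0:
            raise ValueError("weight must lie between -1 and 1")

    def add_child(self, child: "Node") -> "Node":
        # Keyword nodes are leaves, so they may not receive children.
        if self.is_keyword:
            raise ValueError("keyword nodes cannot have children")
        self.children[child.name] = child
        return child


# Build a small fragment: TOP -> business -> books -> "management".
top = Node(0, "TOP")
business = top.add_child(Node(1, "business"))
books = business.add_child(Node(2, "books"))
term = books.add_child(
    Node(3, "management", is_keyword=True, count=4, weight=0.8)
)
```

Storing children in a dictionary keyed by name keeps the lookup of a path such as business/books simple; any other container would serve equally well.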
The main benefit of a pre-defined taxonomy like the ODP structure is that keywords or key phrases can be classified into separate categories, leveraging the inherent semantics of the taxonomy [84]. Keywords are no longer treated in isolation but as part of a larger structure that could assist in predicting user interest.
Figure 8.2: Example of the ODP-tree approach
As an example of this, consider figure 8.2. If a user queries a term that does not have any information associated with it (the term “industrialization” in the example), the tree can be searched for other instances of the term. In our example, the system will find two nodes, one in business/books/industrialization and one in arts/books/history/industrialization. The system can then determine the parent of each term node and consider the weight ratings of other terms in the same category. In the example, the system will find that the user has indicated in the past that terms relating to business books are more satisfactory to him/her than terms relating to art history books. The system can then take this into account when results are analyzed for presentation to the user.
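The lookup described in this example can be sketched as follows. This is an illustration under stated assumptions, not the method of [84]: the text does not say how the weights of sibling terms are combined, so the sketch simply averages them, and the sibling keywords, the weight values and the name `rank_categories` are all invented for the example.

```python
from statistics import mean
from typing import Dict, List, Optional, Tuple

Path = Tuple[str, ...]
# Hypothetical profile: keyword paths mapped to relevance weights in [-1, 1].
# None marks a keyword for which no information has been recorded yet.
Profile = Dict[Path, Optional[float]]

profile: Profile = {
    ("business", "books", "industrialization"): None,
    ("business", "books", "management"): 0.8,       # assumed value
    ("business", "books", "economics"): 0.6,        # assumed value
    ("arts", "books", "history", "industrialization"): None,
    ("arts", "books", "history", "renaissance"): -0.4,  # assumed value
}


def rank_categories(profile: Profile, term: str) -> List[Tuple[Path, float]]:
    """Rank the parent categories of `term` by the mean weight of its
    rated siblings (averaging is an assumption of this sketch)."""
    ranked = []
    for path in profile:
        if path[-1] != term:
            continue
        parent = path[:-1]
        # Collect weights of other rated keywords under the same parent.
        sibling_weights = [
            w for p, w in profile.items()
            if p[:-1] == parent and p != path and w is not None
        ]
        score = mean(sibling_weights) if sibling_weights else 0.0
        ranked.append((parent, score))
    # Highest scoring category first.
    return sorted(ranked, key=lambda item: item[1], reverse=True)


ranking = rank_categories(profile, "industrialization")
best_category = ranking[0][0]   # ("business", "books") for the data above
```

With these invented weights the business/books category ranks above arts/books/history, which mirrors the outcome described in the example.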
Keyword selection In the model presented above, it is of key importance to define what kinds of words or phrases adequately describe a topic. In their paper, Pazzani, Muramatsu and Billsus define two classes of words that could be useful for describing a topic, namely discriminating words and common words [66].
• Discriminating words are words that are useful in distinguishing instances of a given topic, but do not necessarily describe the topic [66].
• Common words are words that are useful for defining a specific topic. These are words that are generally regarded to be terms associated with a specific topic [66].
Discriminating words are typically the terms a user sees as useful for describing a certain topic of interest to him/her. This implies that two different users may have different discriminating words for describing the same topic. Discriminating words are therefore the class of words of greater interest in the context of user profiling.
Common words, while perhaps not as important for the personalization task, are of great interest as well, since they typically represent what a community of users considers the defining terms of a specific topic to be. Common words can therefore be used when a user’s profile does not contain sufficient data for query personalization. This idea will be expanded later in this section and in the following chapter.
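The fallback from discriminating words to common words can be sketched as below. The text does not define what counts as "sufficient data", so the threshold `min_terms`, the function name `context_terms` and the rule of skipping common words the user has rated negatively are assumptions made only for illustration.

```python
from typing import Dict, List


def context_terms(
    user_terms: Dict[str, float],
    common_terms: List[str],
    min_terms: int = 3,
) -> List[str]:
    """Return the user's positively weighted discriminating words, adding
    community-level common words when fewer than `min_terms` are known."""
    # Positively weighted user terms, strongest first.
    preferred = sorted(
        (t for t, w in user_terms.items() if w > 0),
        key=lambda t: user_terms[t],
        reverse=True,
    )
    if len(preferred) >= min_terms:
        return preferred
    # Sparse profile: append common words, skipping duplicates and
    # any term the user has rated negatively.
    fallback = [
        t for t in common_terms
        if t not in preferred and user_terms.get(t, 0.0) >= 0.0
    ]
    return preferred + fallback


rich = context_terms({"a": 0.9, "b": 0.5, "c": 0.1}, ["x"])
sparse = context_terms({"a": 0.9, "z": -0.5}, ["x", "z", "a"])
```

In the second call the profile holds only one positive term, so the common word "x" is added, while "z" is left out because the user rated it negatively.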
User profile The user profile maintained for each user is essentially a mapping of user queries to nodes in the ODP tree. The goal of this mapping is to associate a user query with discriminating words for the topic(s) of the query. With this simple scheme, context can be introduced into the user query through the structure of the ODP-tree taxonomy as well as the keywords or key phrases stored at the leaf nodes.
The ODP-tree taxonomy is quite large, and if a user agent has the task of profiling many users, it would be inefficient to keep and maintain an entire ODP-tree taxonomy for each user.
It would be more efficient to store an individualized ODP tree for each user, containing only the discriminating words or phrases that are of interest to that user, with the structure of the tree obtained from the general ODP taxonomy. Mappings between queries and nodes in the individualized ODP tree are also stored and constitute the actual user profile. In this manner the user profile scales with each user’s individual interests while remaining relatively small for practical purposes.
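One way to picture the individualized tree is as a pruned copy of the general taxonomy that keeps only the topics under which the user has keywords, together with their ancestors. The sketch below assumes a path-based representation; the names `individualize`, `taxonomy`, `keywords` and `query_map` and all example data are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Path = Tuple[str, ...]


def individualize(
    taxonomy: Set[Path], user_keywords: Dict[Path, List[str]]
) -> Set[Path]:
    """Keep only the topic paths, with their ancestors, under which the
    user has stored keywords. Structure comes from the shared taxonomy."""
    kept: Set[Path] = set()
    for topic in user_keywords:
        if topic not in taxonomy:
            raise KeyError(f"unknown topic: {'/'.join(topic)}")
        # Add the topic itself and every ancestor prefix of its path.
        for depth in range(1, len(topic) + 1):
            kept.add(topic[:depth])
    return kept


# A tiny stand-in for the general taxonomy, shared by all users.
taxonomy: Set[Path] = {
    ("arts",), ("arts", "books"), ("arts", "books", "history"),
    ("business",), ("business", "books"),
    ("science",), ("science", "physics"),
}

# Per-user data: keywords per topic and query-to-node mappings.
keywords = {("business", "books"): ["management", "economics"]}
tree = individualize(taxonomy, keywords)
query_map: Dict[str, List[Path]] = {
    "industrialization": [("business", "books")],
}
```

Here the per-user tree holds two paths instead of seven, and the general taxonomy only needs to be stored once for all users.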
The mappings representing the user’s profile, as well as the other information, are stored in a profile database. The user profile database is discussed in section 8.3.
Relevance feedback The user profiling strategy presented in this subsection relies on relevance feedback from the user to be effective. When results are ultimately returned to the user, some mechanism must be in place for the user agent to receive feedback from its user on the usefulness of those results. In the framework of the approach presented here, this means that the agent revises its belief about which keywords or key phrases are associated with which topics, and even which keywords are specifically unwanted by a user in the context of a given topic. The ability to learn what its user does not want allows an agent to learn from its mistakes and attempt to avoid them in the future. The collection of positive as well as negative feedback is therefore of great interest in the profiling approach presented here. User feedback is discussed in more detail at the end of this chapter, in section 8.5.
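How a relevance weight might be revised from feedback can be sketched as follows. The text does not prescribe an update rule, so the one used here (moving the weight a fixed fraction of the way toward the feedback value) is purely an assumption, as are the function name `update_weight` and the parameter `rate`.

```python
def update_weight(weight: float, feedback: float, rate: float = 0.25) -> float:
    """Move `weight` a fraction `rate` of the way toward `feedback`.

    `feedback` is assumed to lie in [-1, 1]: positive for results the
    user found useful, negative for results the user did not want.
    """
    if not -1.0 <= feedback <= 1.0:
        raise ValueError("feedback must lie between -1 and 1")
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must lie between 0 and 1")
    new_weight = weight + rate * (feedback - weight)
    # Clamp to guard against rounding drift outside the valid range.
    return max(-1.0, min(1.0, new_weight))


# One positive judgement raises the weight of a previously neutral term.
raised = update_weight(0.0, 1.0, rate=0.5)

# Repeated negative judgements push a once-liked term toward -1.
w = 0.8
for _ in range(50):
    w = update_weight(w, -1.0, rate=0.5)
```

Because the update is a weighted average of two values in [-1, 1], the result stays in that range, and a term the user keeps rejecting ends up with a strongly negative weight that marks it as one to avoid.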