II. THEORETICAL FRAMEWORK
2.6. JUSTIFICATION
The hybrid social recommender system ranks the recommended attendees by their relevance to the target user. It is a content-based recommender system that personalizes information according to the user's interests. The system uses five separate recommender engines (models) that rank other attendees along five dimensions. I select the similarity measure according to the nature of each recommendation model. First, Publication Similarity is the text similarity of two attendees' academic publications. I use cosine similarity to compare two term vectors; it is a common measure of document similarity in text mining. Second, Topic Similarity is calculated from research interests. I use Jaccard similarity to compare sets of topical words, which are generated by topic modeling. Third, Co-authorship Similarity captures the overlap and distance in the co-authorship network. I measure it through network distance and overlap. Fourth, CN3 Interest Similarity is the similarity of the attendees' bookmarks in the Conference Navigator system. I use Jaccard similarity to compare two sets of bookmarked items. Fifth, Geographic Distance is the distance between the attendees' affiliations. I use an ad hoc approach to measure the geographic distance between two locations.
The recommendation models are described below:
1. Publication Similarity is determined by the degree of publication similarity between two attendees using cosine similarity [89, 138]. The function is defined as:

$$\mathrm{SimPublication}(x, y) = \frac{t_x \cdot t_y}{\lVert t_x \rVert \, \lVert t_y \rVert} \tag{3.1}$$

where $t_x$ and $t_y$ are the word vectors for users x and y. I used TF-IDF (Term Frequency-Inverse Document Frequency) to create the vectors, with a word-frequency upper bound of 0.5 and lower bound of 0.01 to eliminate both common and rarely used words. The TF-IDF method is widely used in information retrieval and content-based recommender systems. The formulas are shown below:
$$\mathrm{tf}(t, d) = f_{t,d} \tag{3.2}$$

$$\mathrm{idf}(t, D) = \log \frac{N}{|\{d \in D : t \in d\}|} \tag{3.3}$$
where t represents a word, d a document, and D a set of documents; $f_{t,d}$ is the number of times t occurs in d, and N is the total number of documents in D. “tf” is the frequency of a word in a document, and “idf” is the inverse of the word's document frequency across the whole corpus. The purpose of TF-IDF is to highlight the importance of a word in a document. For example, a word that appears in all documents is likely a function word, such as a preposition, that carries no actual meaning. I therefore keep only words whose frequency ratio lies between 0.01 and 0.5, which eliminates both common and rarely used words.
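As an illustration, the following minimal Python sketch combines Eqs. 3.2 and 3.3 with cosine similarity. It is not the dissertation's implementation: it assumes the 0.01 and 0.5 bounds are ratios of documents containing a word, it uses the standard product tf × idf as the vector weight, and the function names and toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs, min_df=0.01, max_df=0.5):
    """Build sparse TF-IDF vectors (dicts) from tokenized documents.

    Assumption: min_df and max_df are ratios of documents containing a word.
    """
    n = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for doc in docs for word in set(doc))
    # Keep only words whose document ratio lies within [min_df, max_df].
    vocab = {w for w, count in df.items() if min_df <= count / n <= max_df}
    vectors = []
    for doc in docs:
        tf = Counter(w for w in doc if w in vocab)  # raw counts, Eq. 3.2
        # Weight = tf * idf, with idf = log(N / df) as in Eq. 3.3.
        vectors.append({w: f * math.log(n / df[w]) for w, f in tf.items()})
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy corpus: one "publication text" per attendee.
docs = [
    "social recommender systems for conference attendees".split(),
    "hybrid recommender interfaces for conference navigation".split(),
    "deep learning for image segmentation".split(),
    "graph algorithms for network analysis".split(),
]
vectors = tfidf_vectors(docs)
sim_close = cosine_similarity(vectors[0], vectors[1])  # share two kept words
sim_far = cosine_similarity(vectors[0], vectors[2])    # share only "for"
```

In this toy corpus the word "for" occurs in every document, so the upper bound removes it and the first and third documents end up with no terms in common.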
2. Topic Similarity is a metric that measures the distance between topic distributions [31]. The approach assumes that a document (a string of words) is generated from a mixture of topics, where each topic is a distribution over topical words. In my dissertation, the topics were generated from the attendees' publication text with a topic model, Latent Dirichlet Allocation (LDA) [31]. A higher topic similarity means a shorter distance between two scholars' research interests, i.e., the two scholars share more research topics.
3. Co-authorship Similarity approximates the social similarity between the target and recommended users by combining co-authorship network distance and common-neighbor similarity from publication data. In the pre-study, I adopted the depth-first search (DFS) method to calculate the shortest path p [121] and the common neighborhood (CN) [95], that is, the number n of overlapping co-authors within two degrees for users x and y.
Depth-first search (DFS) is an algorithm for traversing or searching tree or graph data structures. I plan to formalize the co-authorship network as a graph. DFS will determine the shortest distance from the original user to the target user, which measures how closely two scholars are linked. Formally, let G = (V, E) be a graph with n vertices. For a list $\alpha = (v_1, \ldots, v_m)$ of distinct elements of V and a vertex $v \in V \setminus \{v_1, \ldots, v_m\}$, let $\nu_\alpha(v)$ be the greatest i such that $v_i$ is a neighbor of v, if such an i exists, and 0 otherwise.
The common neighborhood (CN) [95] is the intersection of the neighbor sets of two given authors. Here I define an author's set of neighbors as all co-authors observed at time t. Formally, let G = (V, E) be a graph with n vertices, and let $(v_1, \ldots, v_m)$ be a list of distinct elements of V. The common neighborhood graph (congraph) of G is the graph with vertex set $\{v_1, \ldots, v_m\}$ in which two vertices are adjacent if they have at least one common neighbor in G. The measure returns the total number of common neighbors, or 0 if there are none. I consider only the one-degree relationship, although the measure can be extended to more degrees based on the system's needs.
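The two co-authorship measures can be sketched in Python as follows. The text names DFS for the shortest path; this sketch swaps in breadth-first search (BFS), because on an unweighted graph BFS reaches nodes in order of distance, so the first time it reaches the target the path is a shortest one. The function names and the toy author lists are hypothetical.

```python
from collections import deque


def build_graph(papers):
    """Undirected co-authorship graph: author -> set of co-authors."""
    graph = {}
    for authors in papers:
        for a in authors:
            graph.setdefault(a, set()).update(b for b in authors if b != a)
    return graph


def shortest_path_length(graph, source, target):
    """Number of edges on a shortest path (BFS), or None if unreachable."""
    if source not in graph or target not in graph:
        return None
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in graph[node]:
            if neighbor == target:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


def common_neighbors(graph, x, y):
    """Number of one-degree co-authors shared by x and y (0 if none)."""
    return len(graph.get(x, set()) & graph.get(y, set()))


# Hypothetical data: each inner list is the author list of one paper.
papers = [["ann", "bob"], ["bob", "cat"], ["cat", "dan"],
          ["ann", "eve"], ["eve", "cat"]]
graph = build_graph(papers)
path_ann_dan = shortest_path_length(graph, "ann", "dan")  # ann-bob-cat-dan
shared_ann_cat = common_neighbors(graph, "ann", "cat")    # bob and eve
```

How the path length and the neighbor count are combined into a single score is not specified here, so the sketch returns them separately.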
In studies 5-6, I further extend the method to Personalized Hitting Time [85]. The method is based on random-walk theory, which provides a more sophisticated ranking of the recommendations. Given a weighted digraph G, let $(X_t)_{t \ge 0}$ be a standard random walk on G. Define the random variable $\rho_j = \inf\{t : X_t = j\}$. The hitting time between two nodes i and j is

$$\mathrm{SimHittingTime}(i, j) = \mathbb{E}(\rho_j \mid X_0 = i) \tag{3.5}$$
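The expectation in Eq. 3.5 can be computed with the first-step recurrence h(j) = 0 and h(i) = 1 + Σ_k P(i, k) h(k) for i ≠ j, where P(i, k) is the edge weight from i to k divided by the total outgoing weight of i. The Python sketch below iterates this recurrence to a fixed point. It covers only the plain hitting time; any personalization step from [85] is not modeled. It also assumes every node has at least one outgoing edge and that the target is reachable from every node. All names and the toy graph are hypothetical.

```python
def transition_matrix(weights):
    """Normalize outgoing edge weights into transition probabilities.

    Assumption: every node has at least one outgoing edge.
    """
    probs = {}
    for node, edges in weights.items():
        total = sum(edges.values())
        probs[node] = {k: w / total for k, w in edges.items()}
    return probs


def expected_hitting_times(probs, target, tol=1e-12, max_iter=100_000):
    """Return h[i] = E[rho_target | X_0 = i] for every node i.

    Iterates h[i] = 1 + sum_k P[i][k] * h[k] (with h[target] = 0)
    until the largest update falls below tol.
    """
    h = {node: 0.0 for node in probs}
    for _ in range(max_iter):
        delta = 0.0
        for node, row in probs.items():
            if node == target:
                continue
            new_value = 1.0 + sum(p * h[k] for k, p in row.items())
            delta = max(delta, abs(new_value - h[node]))
            h[node] = new_value
        if delta < tol:
            break
    return h


# Hypothetical weighted digraph: a <-> b <-> c, all weights equal.
weights = {"a": {"b": 1.0}, "b": {"a": 1.0, "c": 1.0}, "c": {"b": 1.0}}
P = transition_matrix(weights)
hitting = expected_hitting_times(P, target="c")
```

For this chain the recurrence gives h(b) = 1 + 0.5·h(a) and h(a) = 1 + h(b), so h(b) = 3 and h(a) = 4 steps. A smaller hitting time means the target is easier to reach from the starting node.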
4. The CN3 Interest Similarity is determined by the number of co-bookmarked papers and co-connected authors within the experimental social system [10]. The function is defined as

$$\mathrm{SimCN3}(x, y) = |b_x \cap b_y| + |c_x \cap c_y| \tag{3.6}$$

where $b_x$, $b_y$ represent the paper bookmarks of users x and y, and $c_x$, $c_y$ represent the friend connections of users x and y.
The Jaccard Coefficient (JC) [21] measures the similarity between finite neighbor sets. Here I define neighbor sets as the co-bookmark or co-connection sets at time t. For any two given authors, it is the size of the intersection of their neighbor sets divided by the size of the union of their neighbor sets. It is computed as $\mathrm{SimJC} = |\Gamma(x) \cap \Gamma(y)| / |\Gamma(x) \cup \Gamma(y)|$, where x and y are the given authors and $\Gamma(\cdot)$ represents the set of bookmarks or connections they have.
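A minimal Python sketch of the count in Eq. 3.6 and of the Jaccard coefficient, using built-in sets. The function names and the item identifiers are hypothetical, and returning 0.0 when both sets are empty is a convention chosen for this sketch, not something the text specifies.

```python
def sim_cn3(bookmarks_x, bookmarks_y, connections_x, connections_y):
    """Co-bookmarked papers plus co-connected authors (Eq. 3.6)."""
    return len(bookmarks_x & bookmarks_y) + len(connections_x & connections_y)


def jaccard(a, b):
    """|a & b| / |a | b|; defined here as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Hypothetical bookmark and connection sets for two attendees.
bookmarks_x = {"p1", "p2", "p3"}
bookmarks_y = {"p2", "p3", "p4"}
connections_x = {"u7", "u9"}
connections_y = {"u9"}

count = sim_cn3(bookmarks_x, bookmarks_y, connections_x, connections_y)
ratio = jaccard(bookmarks_x, bookmarks_y)
```

Here the two attendees share two bookmarks and one connection, so the count is 3, while the Jaccard coefficient of the bookmark sets is 2/4 = 0.5. The count grows with activity, whereas the Jaccard coefficient is normalized to the range [0, 1].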
5. The Geographic Distance is a measure of the geographic distance between attendees. I retrieve longitude and latitude data based on attendees' affiliation information and use the Haversine formula to compute the geographic distance between any pair of attendees [138].

$$\mathrm{SimDistance}(x, y) = \mathrm{Haversine}(Geo_x, Geo_y) \tag{3.7}$$
where $Geo_x$ and $Geo_y$ are the latitude and longitude coordinates of users x and y, determined from the users' affiliation data. For instance, for a scholar from the University of Pittsburgh, the latitude and longitude coordinates are (40.440625, −79.995886). I use the Google Maps API to convert the affiliation information (city, country) into latitude and longitude.
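The distance in Eq. 3.7 can be sketched in Python with the haversine function hav(θ) = sin²(θ/2), recovering the distance as d = 2r·arcsin(√h). The sketch assumes a mean Earth radius of 6371 km and coordinates given in degrees. The function names are hypothetical, and the second point is simply the Pittsburgh coordinate shifted one degree north, not a real affiliation.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def hav(theta):
    """Haversine function: sin^2(theta / 2), theta in radians."""
    return math.sin(theta / 2.0) ** 2


def haversine_km(geo_x, geo_y, radius=EARTH_RADIUS_KM):
    """Great-circle distance between two (latitude, longitude) pairs in degrees."""
    lat1, lon1 = map(math.radians, geo_x)
    lat2, lon2 = map(math.radians, geo_y)
    # Haversine of the central angle between the two points.
    h = hav(lat2 - lat1) + math.cos(lat1) * math.cos(lat2) * hav(lon2 - lon1)
    # Invert hav: d = 2 * r * asin(sqrt(h)); clamp h against rounding error.
    return 2.0 * radius * math.asin(math.sqrt(min(1.0, h)))


pittsburgh = (40.440625, -79.995886)        # coordinates quoted in the text
one_degree_north = (41.440625, -79.995886)  # hypothetical second point
distance = haversine_km(pittsburgh, one_degree_north)
```

Two points on the same meridian one degree apart are separated by r·π/180, roughly 111 km for the assumed radius, which gives a quick sanity check on the result.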
For any two points on a sphere, the Haversine formula gives the haversine of the central angle between them:

$$\mathrm{hav}\left(\frac{d}{r}\right) = \mathrm{hav}(\rho_2 - \rho_1) + \cos(\rho_1)\cos(\rho_2)\,\mathrm{hav}(\lambda_2 - \lambda_1) \tag{3.8}$$

where hav is the haversine function, $\mathrm{hav}(\theta) = \sin^2\left(\frac{\theta}{2}\right) = \frac{1 - \cos(\theta)}{2}$. d is the distance between the two points (along a great circle of the sphere), r is the radius of the sphere. $\rho_1, \rho_2$ are latitude of point 1 and latitude of point 2, in radians. $\lambda_1, \lambda_2$ are longitude of