CHAPTER 3. THEORETICAL FRAMEWORK
3.1 Supply Chain Management
3.1.6 Supply Chain Integration
3.1.6.2 Objective
In this section we share the technical details and challenges involved in mapping social media datasets to the five facets we chose based on the PMEST classification scheme. As shown in Figure 2, given any type of social media dataset, we extract the content using the extractors and APIs available in the Content Capture module. This module is also responsible for removing unwanted metadata and links from content pages, and we clean the extracted information by removing unnecessary tags from the post body. The next module, Feature Engineering, cleans the content further through a series of steps: word stemming, removal of frequently occurring terms (stop word elimination), and N-gram extraction. We use these steps because they are well-documented methods used by most document processing systems. We consider each social media page as a document with some uniform characteristics, namely the presence of a permalink, a post id, a unique identifier and the post body.
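The Feature Engineering steps described above can be sketched roughly as follows. This is a minimal illustration, not the system's actual code: the stop-word list, the suffix-stripping `crude_stem` function (a stand-in for a real stemmer such as Porter's), and all function names are assumptions made for the example.

```python
import re

# Hypothetical, deliberately tiny stop-word list (assumption for illustration).
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "is", "are", "we", "by"}


def strip_tags(post_body):
    """Remove HTML tags and links from a raw post body."""
    text = re.sub(r"<[^>]+>", " ", post_body)
    return re.sub(r"https?://\S+", " ", text)


def crude_stem(word):
    """Very rough suffix stripping; a placeholder for a real word stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def extract_features(post_body, n=2):
    """Return (unigrams, n-grams) after tag removal, stemming and stop-word removal."""
    words = re.findall(r"[a-z]+", strip_tags(post_body).lower())
    stemmed = [crude_stem(w) for w in words]
    tokens = [w for w in stemmed if w not in STOP_WORDS]
    ngrams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return tokens, ngrams
```

For example, `extract_features("<p>We are cleaning the posts</p>")` yields the unigrams `clean` and `post` and the single bigram `clean post`.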
Fig. 2. Flowgraph of our System
Once the features are extracted, we use the cleaned content to perform Topic Learning, Purpose Identification, Geo-Tagging and Profile Creation. We believe the techniques used for extracting topics can also be used to learn the purpose, provided the appropriate terms are used to model the learning process.
3.1 Topic Extraction
Topic extraction is a well-studied field, and several methods have been proposed to extract topics from documents. Since we make no assumptions about the domain, the knowledge of the writers, or the interests of the users, we use a completely unsupervised, data-driven approach to automatically extract topics from blogs.
Topics Using LDA. In particular, we use the Latent Dirichlet Allocation (LDA) model [15], which considers each topic to be a multinomial probability distribution over the vocabulary of all the blogs under consideration. This model has the flexibility that the topics can be learned in a completely unsupervised manner.
If there are $D$ blogs under consideration such that the $d$-th blog has $N_d$ words, represented as $w_d$ and picked from a vocabulary of size $V$, and the total number of topics discussed by bloggers is $K$, then the generative process is as follows:
1. For each topic $k \in \{1, \ldots, K\}$, choose the topic as a $V$-dimensional multinomial distribution $\phi_k$ over the $V$ words. This is drawn from a $V$-dimensional Dirichlet distribution $\beta$.
2. For each blog $d \in \{1, \ldots, D\}$ having $N_d$ words:
   (a) Choose the topics talked about in that blog as a $K$-dimensional multinomial $\theta_d$ over the $K$ topics. This is drawn from a $K$-dimensional Dirichlet distribution $\alpha$.
   (b) For each word position $j \in \{1, \ldots, N_d\}$ in blog $d$:
       i. Select the topic $z_{jd} \in \{1, \ldots, K\}$ for the position from the multinomial $\theta_d$.
       ii. Select the word $w_{jd}$ for this topic, drawn from the multinomial $\phi_{z_{jd}}$.
In the above process $\phi$ and $\theta$ are unknown parameters of the model. We use Gibbs sampling [16] to find these parameters. Since the Dirichlet distribution is the conjugate prior of the multinomial distribution, we can collapse the parameters $\theta$ and $\phi$ and sample only the topic assignments $z$.
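To make the collapsed sampling step concrete, the following is a minimal sketch of a collapsed Gibbs sampler for LDA. It assumes symmetric scalar hyperparameters `alpha` and `beta`, documents given as lists of integer word ids, and point estimates of $\theta$ and $\phi$ read off the final sample; the function name, defaults and iteration count are illustrative assumptions rather than the settings used in our system.

```python
import random


def lda_gibbs(docs, K, V, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA.

    docs: list of documents, each a list of word ids in range(V).
    Returns (theta, phi): per-document topic mixtures and per-topic word distributions.
    """
    rng = random.Random(seed)
    n_dk = [[0] * K for _ in docs]      # words in document d assigned to topic k
    n_kw = [[0] * V for _ in range(K)]  # occurrences of word w assigned to topic k
    n_k = [0] * K                       # total words assigned to topic k

    # Random initial topic assignment z for every word position.
    z = []
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(K)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
        z.append(z_d)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for j, w in enumerate(doc):
                # Remove the current assignment from the counts.
                k = z[d][j]
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # Conditional of z_jd given all other assignments (theta, phi integrated out).
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][w] + beta) / (n_k[t] + V * beta)
                    for t in range(K)
                ]
                k = rng.choices(range(K), weights=weights)[0]
                z[d][j] = k
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1

    theta = [
        [(n_dk[d][k] + alpha) / (len(doc) + K * alpha) for k in range(K)]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(n_kw[k][w] + beta) / (n_k[k] + V * beta) for w in range(V)]
        for k in range(K)
    ]
    return theta, phi
```

Each row of `theta` and `phi` is a probability distribution, so the top words of topic `k` can be read by sorting `phi[k]` in decreasing order.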
Topics Based on Word Clouds. A word cloud is a visual representation of words, where the importance of each word is represented by its font size, with the most important words the largest. Here each word is representative of a topic. An entry $t$ (such as a blog post or tweet) can be decomposed into its constituent words (unigrams). One way to determine the importance of each word $t_w$ in the entry is by computing its TF-IDF score. The importance score of $t_w$ is
$$I(t_w) = \frac{N_w}{N} \times \log \frac{|T|}{|T_w|}, \tag{1}$$
where $N_w$ is the number of times the word $w$ has occurred in $T$, $N$ is the vocabulary size of the corpus, $|T| = M$ is the total number of entries, and $|T_w|$ is the number of entries that contain the word $w$. The first part of the product is the term frequency (TF) score of the word and the second part is the inverse document frequency (IDF) score. Given a desired maximum font size $F_{\max}$ and the score $I(t_w)$, the font size $f_{t_w}$ of word $t_w$ is
$$f_{t_w} = F_{\max} \times I(t_w). \tag{2}$$
Computing IDF requires too many queries to the underlying storage system, resulting in a high I/O cost. In our system, we are more concerned with identifying frequent words that help convey the topic expressed in the entries, so we do not compute IDF scores and simply use TF after the removal of stop words.
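The TF-only sizing can be sketched as below. Note one assumption that departs from the definition above: the sketch divides each count by the total number of retained word occurrences rather than by the vocabulary size, so that the score stays in $[0, 1]$ and no font exceeds $F_{\max}$. The function name, whitespace tokenisation and default `f_max` are likewise illustrative.

```python
from collections import Counter


def tf_font_sizes(entries, stop_words, f_max=48.0):
    """Map each non-stop word to a font size proportional to its term frequency.

    entries: list of strings (blog posts or tweets).
    Assumption: TF is normalised by the total count of retained words.
    """
    counts = Counter(
        word
        for entry in entries
        for word in entry.lower().split()
        if word not in stop_words
    )
    total = sum(counts.values())
    return {word: f_max * count / total for word, count in counts.items()}
```

With the entries `"the cat sat"` and `"the cat ran"` and `the` as the only stop word, `cat` accounts for two of the four retained words and is therefore rendered at half of `f_max`.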
3.2 Extracting Author Profiles from Social Media
One of the dimensions along which we want to enable social media search is the profile of the author creating the content. Our interest is in identifying profiles or roles that can span many domains, e.g. expert/specialist, generalist, self-blogger, observer/commentator, critic, etc. The easiest to identify is whether a person is an expert in a given area or has an opinion on everything (a generalist). We believe experts would have less noise or randomness in their posts. Entropy [17] estimation is used in Information Theory to judge the disorder in a system. Shannon's entropy is defined as follows. Let $X$ be a discrete random variable on a finite set $\mathcal{X} = \{x_1, \ldots, x_n\}$, with probability distribution function $p(x) = \Pr(X = x)$. The entropy $H(X)$ of $X$ is
$$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x).$$
Table 1. Topic Labels learned from Wiki Categories
Topic  Keywords                          Mid-Sup            Max-Sup
1      just, christmas, fun, home        Band               Album
2      war, being, human, actor          Book               Actor
3      things, looking, world            Supernatural       Place
4      rss, search, score, news          Single             Place
5      users, pretty, little             Book               Place
6      people, police, year, canada      Company            Place
7      university, body, quizfarm        Bird of Suriname   Place
8      free, body, news, pictures        Place              Place
9      need, home, work, hand            Military Conflict  Vehicle
10     charitycam, kitty, mood, friends  Political Entity   Album
In our case, the system of interest is the social media page or document. For a blog or a Twitter account, one can assume that as the number of posts or tweets increases, the disorder, and hence the entropy, would increase unless the author devotes themselves to commenting on a small subset of topics. The set of topics that we extracted as described above is the finite set $\mathcal{X}$ that we use. For each blog post or tweet we then find the probability of mapping it to one of the topics $x \in \mathcal{X}$ by counting the number of keywords in the post or tweet which fall under $x$. Thus, by using entropy we can identify bloggers as specialists who focus on a few concepts or as generalists who discuss varied concepts.
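A rough sketch of this entropy-based profiling follows. It reflects one possible reading of the procedure: keyword hits are pooled over all of an author's posts and normalised into a distribution over topics before the entropy is computed. The pooling, the base-2 logarithm, the whitespace tokenisation and the function names are assumptions made for illustration.

```python
import math
from collections import Counter


def topic_distribution(posts, topic_keywords):
    """Estimate p(x) over topics by counting keyword hits across an author's posts.

    posts: list of strings; topic_keywords: dict mapping topic -> set of keywords.
    """
    hits = Counter()
    for post in posts:
        for token in post.lower().split():
            for topic, keywords in topic_keywords.items():
                if token in keywords:
                    hits[topic] += 1
    total = sum(hits.values())
    if total == 0:
        return {}
    return {topic: count / total for topic, count in hits.items()}


def shannon_entropy(distribution):
    """Shannon entropy in bits; log base 2 is an assumption of this sketch."""
    return sum(p * math.log2(1.0 / p) for p in distribution.values() if p > 0)
```

An author whose keyword hits all fall under a single topic obtains an entropy of zero, while an author whose hits are spread evenly over two topics obtains one bit; a threshold separating specialists from generalists would have to be chosen empirically.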
3.3 Mapping Location and Date
The task of detecting and mapping the Location and Date of a social media entry initially seemed straightforward to us. Most social content authoring platforms capture the location and the timestamp at which the content was submitted to the system. However, we found that the content itself often talks about events that happened at a location distant from where the author was posting. The time of the incident described could also differ from the posting time. Capturing these and using them in the mapping requires advanced NLP techniques. We are building annotators to extract and use such information. In this paper, we do not go into the details of those methods.