Based on the ranking functions discussed above, we now describe the data set that we use and the system architecture for extracting the data and processing queries on the resulting implicit network of entities. For an overview of the system architecture, see Figure 4.1.
Data pre-processing
The extraction of an implicit network is possible from any document collection in which some type of entities can be identified. Here, in order to demonstrate the feasibility of the approach for large document collections and to provide comprehensive query choices, we again use the unstructured text of all English Wikipedia articles, from the dump of May 1, 2016. In contrast to the previous network in Chapter 3.4, we do not use automated named entity recognition but instead follow Wikipedia links to their Wikipedia pages, from which we can extract a Wikidata identifier. This largely avoids the imprecision that is inherent to
named entity recognition, since Wikipedia links are essentially manual annotations by the Wikipedia editors. One possible problem with this approach is the policy of linking only the first mention of an entity on a Wikipedia page, which we address by a subsequent string search for the surface forms of already encountered Wikipedia links and their Wikidata labels. We then link discovered entities to Wikidata identifiers, which resolves all mentions of any one entity to its common Wikidata entry. Using these identifiers, we classify the Wikidata entities into the desired classes of location, organization, and actor. For the extraction and normalization of dates, we use HeidelTime [188]. We construct an implicit (LOAD) network from all discovered entities of types location, organization, actor, and date with a maximum window size of c = 5 sentences. The resulting graph is constructed from 4.5M Wikipedia articles with 43.6M sentences that have at least one entity, and contains 2.0M named entities, 5.2M distinct terms, and 1.3B edges. In addition to the implicit network structure between the entities, terms, sentences, and documents, we now also have Wikidata information for all discovered entities. Specifically, we can use the Wikidata labels as unique entity labels, and retrieve short entity descriptions from the knowledge base to display entity information to the user.
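To illustrate the role of the window size c, the following minimal sketch enumerates candidate edges between entity mentions that lie at most c sentences apart. It is not the implementation used for the network described here: the class and field names are hypothetical, the window is interpreted as a maximum sentence distance (an assumption), and the edge weighting of the implicit network is omitted entirely.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only: window-based co-occurrences of entity mentions. */
final class WindowCooccurrence {

    /** One entity mention: an identifier plus the index of the sentence it occurs in. */
    static final class Mention {
        final String entityId;
        final int sentenceIndex;

        Mention(String entityId, int sentenceIndex) {
            this.entityId = entityId;
            this.sentenceIndex = sentenceIndex;
        }
    }

    /** A candidate edge between two distinct entities and their distance in sentences. */
    static final class Edge {
        final String a;
        final String b;
        final int distance;

        Edge(String a, String b, int distance) {
            this.a = a;
            this.b = b;
            this.distance = distance;
        }
    }

    /**
     * Returns one candidate edge for every pair of mentions of distinct entities
     * whose sentence distance is at most c. The mentions of a document must be
     * sorted by sentence index.
     */
    static List<Edge> extract(List<Mention> mentions, int c) {
        List<Edge> edges = new ArrayList<>();
        for (int i = 0; i < mentions.size(); i++) {
            Mention left = mentions.get(i);
            for (int j = i + 1; j < mentions.size(); j++) {
                Mention right = mentions.get(j);
                int distance = right.sentenceIndex - left.sentenceIndex;
                if (distance > c) {
                    break; // sorted input: all later mentions are even farther away
                }
                if (!left.entityId.equals(right.entityId)) {
                    edges.add(new Edge(left.entityId, right.entityId, distance));
                }
            }
        }
        return edges;
    }
}
```

In a full pipeline, the candidate edges of all documents would then be aggregated per entity pair and converted into edge weights, which this sketch deliberately leaves out.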
Application layer
As discussed in Chapter 3.3, an in-memory representation of the implicit network of the entire English Wikipedia is possible and allows extremely fast queries on the order of milliseconds. However, due to the high memory requirements (around 200 GB for full efficiency), this is infeasible for a long-running, non-commercial application. Conversely, query speeds on the order of a few milliseconds are not necessary for an interactive user experience, where speeds of a few hundred milliseconds are sufficient. Therefore, we rely on a fully external storage architecture by storing the data in a MongoDB database, with separate collections for entities, terms, sentences, documents, and edges. This architecture then runs on desktop-level hardware such as the Core i7 with 32 GB main memory and an SSD drive that we use to run the Web demonstration. Since entities are enriched with Wikidata information to obtain entity descriptions and canonical labels, a text index on the English canonical label can be used for searching entities in the database by their label, and for compiling a list of entity suggestions to the user, based on input strings. An alternative solution is the use of prefix tries [94], which are faster than an index. However, this would have provided little improvement, since entity retrieval by label is not a bottleneck even for millions of entities, and it would have reduced the portability of the implementation. We thus rank entity suggestions by the text match score and break ties by the number of
connected sentences in the network (that is, by frequency). Graph edges are stored with precomputed directed importance weights $\vec{\omega}$. For edges that involve sentences, a collapsed storage format that contains both the sentence and the respective document of the entity or term allows faster retrieval at the cost of storing one additional integer per sentence edge. The query processing routines described above are implemented in Java, and enable query processing speeds on the order of a few seconds or less for all but the experimental subgraph queries. Although queries that contain multiple query entities are easy to parallelize by input entity, we do not include a parallel implementation for single queries. Instead, to allow the application to serve queries from multiple users simultaneously, query processing allocates one thread per query per user. To avoid system overload in the case of multiple users that spam queries, an internal mapping of queries to browser fingerprints allows us to limit the number of active queries per user.
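The limit on active queries per user can be sketched as a small thread-safe counter keyed by browser fingerprint. This is a minimal illustration under stated assumptions, not the code of the system described here: the class name, the method names, and the representation of a fingerprint as a plain string are all hypothetical.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Illustrative sketch only: cap the number of concurrently active queries per fingerprint. */
final class ActiveQueryLimiter {

    private final int maxActivePerUser;
    private final ConcurrentMap<String, Integer> active = new ConcurrentHashMap<>();

    ActiveQueryLimiter(int maxActivePerUser) {
        this.maxActivePerUser = maxActivePerUser;
    }

    /** Tries to register a new query; returns false if the user is already at the limit. */
    boolean tryAcquire(String fingerprint) {
        boolean[] granted = {false};
        // compute() runs atomically per key, so check and increment cannot interleave.
        active.compute(fingerprint, (key, count) -> {
            int current = (count == null) ? 0 : count;
            if (current >= maxActivePerUser) {
                return count; // limit reached: leave the counter unchanged
            }
            granted[0] = true;
            return current + 1;
        });
        return granted[0];
    }

    /** Marks one query of the user as finished; drops the entry when none remain. */
    void release(String fingerprint) {
        active.computeIfPresent(fingerprint, (key, count) -> count > 1 ? count - 1 : null);
    }

    /** Number of queries currently registered for the given fingerprint. */
    int activeCount(String fingerprint) {
        return active.getOrDefault(fingerprint, 0);
    }
}
```

A query thread would call tryAcquire before it starts processing, reject the request if the call returns false, and call release in a finally block so that failed queries do not permanently occupy a slot.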
Presentation layer
The web interface of EVELIN is implemented via HTML and JavaScript, and serves two primary purposes: classifying input terms and entities according to their entity type, and visualizing the output of ranked entity lists and subgraphs. For handling entity input and sending queries to the application layer, we use jQuery and pass entity information in both directions as JSON objects. The Bootstrap library [33] and Mustache web templates [210] are used for the responsive layout and for displaying data tables. To recognize, classify, and color input entities as they are entered, we use the tags-input and typeahead libraries of Bootstrap, which are extended to include the required functionality for color-coding nodes according to their (entity) type. The interactive visualization of subgraphs is handled by the D3 JavaScript library [34], which uses a combination of HTML, CSS, and SVG to display data. Graphs are visualized with a force-directed layout. The web server itself is realized on top of the Java Spark micro framework [213], and is directly integrated with the application layer into a single application. The communication between the user interface and the server is built on AJAX and uses JSON for transmitting entity information in both directions, including input query entities, output entity rankings, and graph data. Due to the Bootstrap library, EVELIN is fully compatible with mobile devices. Examples of the interface for inputting query entities are shown in Figure 4.2.
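As an illustration of how graph data might be passed from the server to the visualization, the following sketch serializes a small subgraph into a JSON document with a node array and a link array, a shape that force-directed layouts in D3 commonly consume. The field names (id, label, type, source, target, weight) and the class itself are assumptions made for this example; the actual message format of EVELIN is not specified here. The sketch uses only the Java standard library and assumes finite edge weights.

```java
import java.util.List;

/** Illustrative sketch only: serialize a subgraph as a nodes/links JSON document. */
final class GraphJson {

    /** A graph node with an identifier, a display label, and an entity type. */
    static final class Node {
        final String id;
        final String label;
        final String type;

        Node(String id, String label, String type) {
            this.id = id;
            this.label = label;
            this.type = type;
        }
    }

    /** A weighted link between two node identifiers (weight assumed to be finite). */
    static final class Link {
        final String source;
        final String target;
        final double weight;

        Link(String source, String target, double weight) {
            this.source = source;
            this.target = target;
            this.weight = weight;
        }
    }

    /** Builds the JSON text for the given nodes and links. */
    static String toJson(List<Node> nodes, List<Link> links) {
        StringBuilder sb = new StringBuilder("{\"nodes\":[");
        for (int i = 0; i < nodes.size(); i++) {
            Node n = nodes.get(i);
            if (i > 0) sb.append(',');
            sb.append("{\"id\":").append(quote(n.id))
              .append(",\"label\":").append(quote(n.label))
              .append(",\"type\":").append(quote(n.type)).append('}');
        }
        sb.append("],\"links\":[");
        for (int i = 0; i < links.size(); i++) {
            Link l = links.get(i);
            if (i > 0) sb.append(',');
            sb.append("{\"source\":").append(quote(l.source))
              .append(",\"target\":").append(quote(l.target))
              .append(",\"weight\":").append(l.weight).append('}');
        }
        return sb.append("]}").toString();
    }

    /** Minimal JSON string escaping for quotes, backslashes, and control characters. */
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            switch (ch) {
                case '"': sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (ch < 0x20) sb.append(String.format("\\u%04x", (int) ch));
                    else sb.append(ch);
            }
        }
        return sb.append('"').toString();
    }
}
```

In practice, a JSON library would replace the manual string building; it is written out here only to keep the example free of external dependencies.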