
In this section, we discuss our novel entity augmentation system REA². It implements the abstract method of consistent top-k set covering for the specific case of Web table-based entity augmentation. This implementation requires the following components:

• Candidate Source Selection: a method for generating the set D of candidate Web tables for use in the set covering algorithms, given a specific augmentation query QEA.

• Source Relevance Estimation: an instantiation of rel : D → [0, 1] for Web tables, i.e., a function that scores a Web table with respect to QEA.

• Source Consistency Estimation: an instantiation of sim : D × D → [0, 1] for Web tables, i.e., a function that calculates the consistency of two Web tables with respect to QEA.

In Figure 2.5 we give an architectural overview of our proposed system fulfilling these requirements. The system consists of a series of layers providing increasingly higher-level services.

The bottom layer, the Data Source Management system, provides storage and indexing facilities for Web tables, enabling the higher layers to retrieve raw Web tables based on keyword matches in data, schema or other metadata. The next layer contains a Schema and Instance Matching System for generating mappings between the query attributes and entities and those found in the Web tables. We will detail the operation of these two systems in Section 2.4.1. This layer also contains a Knowledge Repository for managing and accessing external knowledge sources that are employed in the retrieval and matching process. Currently it offers synonym lookups via WordNet³, Web domain popularity and category information via Alexa Web Services⁴, and term frequencies extracted from the DWTC⁵ (Eberius et al., 2015a), which we use for tf-idf scoring throughout the matching process.

All these lower-level services are utilized in the Candidate Source Selection component, which orchestrates them to create a candidate dataset D given an augmentation query QEA. For this task, it implements the rel and sim functions, which we will detail in Sections 2.4.2 and 2.4.3. Using the candidate set D including relevance and consistency scores, the Entity Augmentation system then creates the top-k augmentation result as described in Section 2.3.2. Finally, REA includes a JSON-based REST API, which enables other systems to easily integrate with it and pose entity augmentation queries.

2.4.1 Web tables Retrieval and Matching

REA uses both relational Web tables, also called “entity-value” tables, and entity tables, also called “attribute-value” tables, which focus on one entity only. This allows consistent results for domains where relational tables are available, but increases recall in domains where no or only few such tables are ever published.

The retrieval of candidate Web tables works as follows: we build several indices on the corpus of available Web tables, covering attribute names (AI), content cells (CI), and metadata, e.g., page title and most frequent terms in the context (MI). For attributes, we optimistically assume that, after basic cleaning, the attribute names are located in the first row for relational tables, or the first column for entity tables. There are more precise ways of analyzing tables to identify attributes and concepts (see Section 2.6), but they are not the focus of this work. We then use a simple but effective approach to identify relational and entity candidate tables in the corpus: instead of a single attribute index, we build one on the first row (ARI) and one on the first column (ACI). For each entity augmentation query Q(a, E) we then run two queries against the corpus: one query ARI(a) ∧ CI(E) to identify relational candidate tables for the input set of entities, and one query ACI(a) ∧ ∀e∈E MI(e) to identify entity candidate tables for each entity.
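The two index queries can be illustrated with a minimal sketch. It is not the REA implementation: the in-memory dictionaries, the field names (`first_row`, `first_col`, `cells`, `meta`), the helper functions and the toy corpus are all hypothetical, terms are matched exactly rather than through a keyword search engine, and CI(E) is assumed to mean "contains at least one of the queried entities".

```python
from collections import defaultdict


def build_index(tables, field):
    """Map each lowercased term found in `field` to the ids of the tables containing it."""
    index = defaultdict(set)
    for table_id, table in tables.items():
        for term in table[field]:
            index[term.lower()].add(table_id)
    return index


def lookup(index, term):
    """Return the ids of all tables indexed under `term` (empty set if none)."""
    return index.get(term.lower(), set())


def relational_candidates(ari, ci, attribute, entities):
    """ARI(a) AND CI(E): attribute in the first row, some queried entity in the cells."""
    with_entity = set().union(*(lookup(ci, e) for e in entities))
    return lookup(ari, attribute) & with_entity


def entity_candidates(aci, mi, attribute, entities):
    """ACI(a) AND MI(e), evaluated separately for every queried entity e."""
    by_attribute = lookup(aci, attribute)
    return {e: by_attribute & lookup(mi, e) for e in entities}


# Hypothetical two-table corpus: t1 is relational, t2 is an entity table about "acme".
tables = {
    "t1": {"first_row": ["company", "revenue"],
           "first_col": ["company", "acme", "globex"],
           "cells": ["acme", "10", "globex", "20"],
           "meta": ["largest", "companies"]},
    "t2": {"first_row": ["revenue", "5"],
           "first_col": ["revenue", "employees"],
           "cells": ["5", "100"],
           "meta": ["acme", "profile"]},
}
ari, aci = build_index(tables, "first_row"), build_index(tables, "first_col")
ci, mi = build_index(tables, "cells"), build_index(tables, "meta")

relational = relational_candidates(ari, ci, "revenue", ["acme", "globex"])
per_entity = entity_candidates(aci, mi, "revenue", ["acme", "globex"])
```

In this toy corpus the attribute lookup on the first row alone would also return the entity table t2, which is why the conjunction with the content-cell index is needed to isolate relational candidates.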

³ http://wordnet.princeton.edu/

⁴ http://aws.amazon.com/awis/

The respective candidate table sets are then processed in two separate pipelines, which we only sketch here, as many parts correspond to state-of-the-art techniques in schema and instance matching. String distance functions and synonym dictionaries are used to determine where the entities and attributes are located in the table, which enables extraction of values for augmentation. Specifically, we use a measure which we call WeightedTokenLevenshtein (WTL), which is similar in style to the one used in (Chaudhuri et al., 2003). The WTL function decomposes strings into token sets, weights the tokens according to their inverse document frequency in our table corpus, and then compares the individual tokens using a regular thresholded Levenshtein distance. Matching token pairs contribute their IDF weight to the final matching score for the two original strings.

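\[
\mathrm{WTL}(a, b) = \frac{\sum_{(t_1, t_2) \in tok(a) \times tok(b)} \mathrm{tokSim}(t_1, t_2)}{\sum_{t \in tok(a) \cup tok(b)} idf(t)},
\qquad
\mathrm{tokSim}(t_1, t_2) =
\begin{cases}
idf(t_1) + idf(t_2) & \text{if } lev(t_1, t_2) \geq th_{sim} \\
0 & \text{otherwise}
\end{cases}
\tag{2.12}
\]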

We found that WTL outperforms traditional string distances such as edit distance or n-gram distance, especially for multi-word entity names such as “agricultural bank of china” and “industrial bank of china”. This is due to the IDF weighting, which places emphasis on tokens that strongly differentiate strings and ignores generic terms.
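The following sketch shows how Equation 2.12 could be computed. It rests on several assumptions that are not stated in the text: lev is taken to be a Levenshtein similarity normalized to [0, 1] (so that the "≥ th_sim" test makes sense), the denominator sums the IDF weights of the tokens of both strings (a multiset reading of the union, under which identical strings score 1.0), tokens are split on whitespace, and the threshold and IDF values are made up for illustration.

```python
def levenshtein(s, t):
    """Classic edit distance, computed row by row with dynamic programming."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (cs != ct)))  # substitution or match
        prev = cur
    return prev[-1]


def lev_sim(s, t):
    """Edit distance normalized into a similarity in [0, 1] (assumed reading of lev)."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1.0 - levenshtein(s, t) / longest


def wtl(a, b, idf, th_sim=0.8, default_idf=1.0):
    """WeightedTokenLevenshtein score of two strings under the assumptions above."""
    tok_a, tok_b = set(a.lower().split()), set(b.lower().split())

    def weight(token):
        return idf.get(token, default_idf)

    # Numerator: every matching token pair contributes idf(t1) + idf(t2).
    matched = sum(weight(t1) + weight(t2)
                  for t1 in tok_a for t2 in tok_b
                  if lev_sim(t1, t2) >= th_sim)
    # Denominator: IDF mass of both token sets (multiset reading of the union).
    total = sum(weight(t) for t in tok_a) + sum(weight(t) for t in tok_b)
    return matched / total if total else 0.0


# Hypothetical IDF weights: rare, distinguishing tokens weigh more than generic ones.
idf = {"agricultural": 3.0, "industrial": 3.0, "bank": 1.0, "of": 0.1, "china": 0.5}
score_different = wtl("agricultural bank of china", "industrial bank of china", idf)
score_identical = wtl("agricultural bank of china", "agricultural bank of china", idf)
```

With these weights, the three shared tokens "bank", "of" and "china" carry little IDF mass, so the two bank names receive a low score even though they differ in a single token, whereas a plain character-level edit distance would rate them as fairly similar.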

In the case of entity tables, the entity itself is often not present in the table at all, so we attempt to locate it in the URL or page title to verify that the table is about the queried entity. While simplistic, this approach has been shown to be reasonably effective in related work (Yin et al., 2011), and proved sufficient for our needs of retrieving data sources for our covering algorithms. Additionally, the matching step also supplies parts of the scoring information through the distance functions’ confidence values, and also through the location of the matches. We detail scoring of candidates in the next section (2.4.2).

The matching step also discards candidates if the attributes or entities do not match closely enough or if no values could be extracted. For example, an index lookup for “Bank of China” might return datasets about the “Industrial and Commercial Bank of China” or the “Agricultural Bank of China”, or a column named “Revenue” might be present, but the corresponding content cell might be empty, or not contain a number.

2.4.2 Relevance Scoring

As described in Section 2.3.2, ranked consistent set covering approaches require all data sources to be relevance scored. After the candidates have been processed and filtered as described in the last section, we score them according to the following criteria:

• Quality of the schema-/instance match: The confidences returned by the string distance functions for the matches of both the query's attributes and its entities, as described in the previous section.

• Quality of the metadata match: We use set distances between top terms in the query and the table’s metadata, such as top terms extracted from page context, from the title and URL.

• Quality of the data source: A perfect match to an untrustworthy source may be less desirable than an almost perfect one to an established source. There are many techniques to measure trust in a Web page, such as the well-known PageRank algorithm, but these are out of scope for the REA system. We approximate source quality using the popularity scores returned by the Alexa Web information service available through Amazon Web Services⁶.

We use a weighted sum of these scores as the final relevance score of each Web table.
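As a minimal sketch of this weighted sum, assume that each of the three criteria has already been normalized to [0, 1]. The function name, the parameter names and the weights are placeholders chosen for illustration; the text does not state the weights REA actually uses.

```python
def relevance(match_confidence, metadata_similarity, source_quality,
              weights=(0.5, 0.3, 0.2)):
    """Weighted sum of the three relevance criteria, normalized to [0, 1]."""
    scores = (match_confidence, metadata_similarity, source_quality)
    if any(not 0.0 <= s <= 1.0 for s in scores):
        raise ValueError("all component scores must lie in [0, 1]")
    # Dividing by the weight total keeps the result in [0, 1] for any positive weights.
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


# A perfect match from an unpopular source versus a near-perfect match
# from an established one (all component scores are made up).
perfect_but_obscure = relevance(1.0, 0.5, 0.1)
good_and_established = relevance(0.9, 0.5, 0.9)
```

Under these placeholder weights the established source ranks higher, which mirrors the trade-off described for the source quality criterion.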

2.4.3 Web table Similarity

The second requirement for applying the algorithms discussed in Section 2.3.2 to Web tables is a similarity function that can be used to calculate consistency of covers, and by extension, diversity of a number of covers.

• Attribute Similarity: As mentioned in Section 2.4.1, an entity augmentation query for an attribute “revenue” might return sources with many variants of the attribute, such as “revenue 2010” or “revenue change”. Since we are aiming for covers that are internally consistent but diverse among each other, it makes sense to compare their extracted attribute names in order to group similar attribute names into one cover. Specifically, we remove the queried attribute name, spelling variations, and expanded synonyms, e.g. “sales” for “revenues”, and then compute string distances between the leftover tokens of the candidate sources to identify similar attribute variants.

• Value Similarity: While different data sources may be similar in the attribute name they specify, they may still capture different aspects of the attribute, e.g. change in the attribute instead of absolute values, or use different units. We therefore also compute similarities between the value sets extracted from the data sources. Since we aim to support analytical queries, we currently only query numeric attributes and can therefore use numeric set similarity measures. In our current implementation we compare the orders of magnitude of the value sets’ averages as a similarity measure. When comparing string-valued attribute sets, measures such as average string length, length standard deviation, or number of words could be used.

• Metadata Similarity: We use the Jaccard distance between the term sets extracted for two candidates as another similarity measure. Our extraction pipeline extracts most frequent terms from the table’s page context, but we also compare terms extracted from page title and URL separately.

• Tag Similarity: Similarly to (Zhang and Chakrabarti, 2013), we use extractors for frequent attribute variations, such as years, units of measurement or currencies. These extractors attach tags to candidate Web tables, and we use Jaccard distances between the extracted tags of both candidates as a further measure of similarity.

• Domain Similarity: The intuition here is that tables from similar Web domains are more likely to be consistent with each other than those from different domains. Since Web domain similarity is not the focus of this thesis, we employ a simple model in which two sources have a domain similarity of 1.0 if they stem from the exact same domain. If not, we compare domain keywords and categories as returned by the same Alexa Web information service we used in Section 2.4.2.

Again, we employ a weighted sum of these factors for the final similarity between two Web tables, which is then used as the similarity function sim(d1, d2) required by the top-k consistent set covering algorithms introduced in Section 2.3.2.
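The five factors and their combination can be sketched as follows. This is an illustration under stated assumptions, not the REA code: candidates are represented as hypothetical dictionaries (`leftover`, `values`, `terms`, `tags`, `url`, `keywords`), attribute similarity uses Python's `difflib.SequenceMatcher` as a stand-in for the unspecified string distance, the mapping from order-of-magnitude difference to a similarity in (0, 1] is our own choice, domain keywords are passed in directly instead of being fetched from a Web service, and the equal weights are placeholders.

```python
import math
from difflib import SequenceMatcher
from urllib.parse import urlparse


def jaccard(a, b):
    """Jaccard similarity of two term collections; two empty sets count as identical."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 1.0


def attribute_similarity(leftover_a, leftover_b):
    """String similarity of the attribute tokens left after removing the queried name."""
    return SequenceMatcher(None, " ".join(leftover_a), " ".join(leftover_b)).ratio()


def magnitude_similarity(values_a, values_b):
    """Compare the orders of magnitude of the averages of two non-empty value sets."""
    mean_a = abs(sum(values_a) / len(values_a))
    mean_b = abs(sum(values_b) / len(values_b))
    if mean_a == 0 or mean_b == 0:
        return 1.0 if mean_a == mean_b else 0.0
    # One order of magnitude apart yields 0.5, two yield 1/3, and so on.
    return 1.0 / (1.0 + abs(math.log10(mean_a) - math.log10(mean_b)))


def domain_similarity(d1, d2):
    """1.0 for the exact same Web domain, otherwise the overlap of domain keywords."""
    if urlparse(d1["url"]).netloc == urlparse(d2["url"]).netloc:
        return 1.0
    return jaccard(d1["keywords"], d2["keywords"])


def sim(d1, d2, weights=(0.2, 0.2, 0.2, 0.2, 0.2)):
    """Weighted sum of the five factors; the weights here are placeholders."""
    factors = (
        attribute_similarity(d1["leftover"], d2["leftover"]),
        magnitude_similarity(d1["values"], d2["values"]),
        jaccard(d1["terms"], d2["terms"]),
        jaccard(d1["tags"], d2["tags"]),
        domain_similarity(d1, d2),
    )
    return sum(w * f for w, f in zip(weights, factors)) / sum(weights)


# Hypothetical candidates for the query attribute "revenue".
d1 = {"leftover": ["2010"], "values": [1200.0, 3400.0],
      "terms": {"bank", "revenue", "ranking"}, "tags": {"year:2010", "currency:usd"},
      "url": "http://example.com/banks", "keywords": {"finance"}}
d2 = {"leftover": ["2010"], "values": [1500.0, 2900.0],
      "terms": {"bank", "revenue", "list"}, "tags": {"year:2010", "currency:usd"},
      "url": "http://example.com/top-banks", "keywords": {"finance"}}
d3 = {"leftover": ["change"], "values": [0.02, 0.05],
      "terms": {"growth", "quarterly"}, "tags": {"unit:percent"},
      "url": "http://example.org/growth", "keywords": {"news"}}

consistent_pair = sim(d1, d2)    # same attribute variant, magnitude and domain
inconsistent_pair = sim(d1, d3)  # "revenue change" in percent from another domain
```

In this toy example the two “revenue 2010” tables score far higher with each other than with the “revenue change” table, so a consistent cover would prefer to combine d1 and d2.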

Having introduced the REA Web table retrieval and matching system, we can now evaluate our top-k consistent set covering algorithms in the next section.
