Information retrieval systems

In the previous section, we obtained a dataset containing all the information present in a legal document, ideally classified so that each row corresponds to one article of the document, while the columns hold the relevant information characterising each article (article title, section, chapter, text...). With this differentiated dataset, we can move on to the next step in creating a complete information retrieval system. This step, which is dealt with in this section, is based on the search for and study of an information retrieval system capable of identifying which dataset rows are most similar to a given sentence. Specifically, our objective is to create an advanced search system capable of matching a question asked by the user with one of these rows, with sufficient similarity to assume that the corresponding article contains the answer to that question.

As analysed in section 2.2, one of the solutions that best fits this project is ElasticSearch, for several reasons:

• Its search engine uses, among other algorithms, TF-IDF, which was one of the main algorithms we wanted to study and apply in this project. In turn, as we have seen in the previous section, this algorithm makes more efficient use of the created dataset than other options.

• It is a dynamic database focused on the indexed search of its rows in response to a query. This makes it especially attractive for our use case, since it adapts well to our needs.

• It is widely adopted by the cloud services available today. In addition, the provider chosen for this project, AWS, natively offers a service that integrates this search engine, which greatly facilitates its deployment and execution within the entire system.

Therefore, as mentioned before, ElasticSearch is the option that best suits our use case and the one that will be applied to create the information retrieval module on which the final solution is based.

ElasticSearch [45] is a distributed search and analytics engine that enables near real-time search of all types of data. This data can be structured, unstructured, numerical or even geospatial. Due to its versatility, it is used in multiple use cases, including search boxes for websites or applications, real-time data analysis and storage of data produced by workflows in automated systems, among many others.

Many consider ElasticSearch (ES) as a dynamic database that applies text similarity algorithms to retrieve entries quickly. However, ES stores information as complex serialised data structures in JSON format. So, when you store a document within ES, this document is indexed to the data structure set up from the beginning, allowing you to access the information much more efficiently than in a standard database. The structure of each index is established at the outset when setting the ES parameters.

Each index can therefore be thought of as an optimised collection of documents and each document as a collection of fields in key-value format. This type of serialised structure is very beneficial for the use case of this project, as the initial legal document was disaggregated into a dataset that can be perfectly serialised in a key-value format. Furthermore, this structure allows for differentiation between algorithms applied for each field type. Thus, textual fields are stored following an inverted index structure, while numeric and geo fields are stored in BKD trees.
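To make the inverted index idea concrete, the following minimal Python sketch builds one for a single text field. It is a simplified illustration, not how ES implements it internally; the sample documents and the `build_inverted_index` helper are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(documents, field):
    """Map each term of `field` to the sorted ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, doc in documents.items():
        # Naive analysis for illustration: lowercase and split on whitespace.
        for term in doc[field].lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Hypothetical documents: each one is a collection of key-value fields.
documents = {
    1: {"title": "Height of facades", "section": "A"},
    2: {"title": "Facades in urban areas", "section": "B"},
}

inverted = build_inverted_index(documents, "title")
print(inverted["facades"])  # ids of the documents whose title contains "facades"
```

Looking up a term then costs a single dictionary access instead of a scan over every document, which is the property that makes this structure attractive for textual fields.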

Another great benefit of this system is that it has an SDK developed for Python to interact with the system. This SDK will be used to perform queries to the database, as well as to update the ES indexes with new documents.
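As a rough sketch of how the dataset rows could be prepared for indexing, the snippet below turns each row into an action dictionary with the `_index`, `_id` and `_source` keys accepted by the bulk helper of the Python client. The index name, the field names and the sample rows are assumptions made for illustration, and no connection to a cluster is opened here.

```python
def rows_to_actions(rows, index_name):
    """Yield one bulk-indexing action per dataset row (one article per row)."""
    for row_id, row in enumerate(rows):
        yield {
            "_index": index_name,  # hypothetical index name
            "_id": row_id,         # row position used as document id (assumption)
            "_source": dict(row),  # the row's columns become the document fields
        }


# Hypothetical rows; the real dataset columns are described in the previous section.
rows = [
    {"title": "Article 1", "section": "1", "text": "Sample text of the first article"},
    {"title": "Article 2", "section": "1", "text": "Sample text of the second article"},
]

actions = list(rows_to_actions(rows, "legal-articles"))
print(len(actions), actions[0]["_index"])
```

With a running cluster, these actions could then be passed to `elasticsearch.helpers.bulk(client, actions)`.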

Queries made to ElasticSearch are made through the REST API enabled within the search engine itself, which allows complex, structured queries or a combination of several fields in one to be handled. This point will be addressed later, and we will study the best combination of queries for the specific use case.

We will now explain in detail how the ES search engine works, what kind of data structure can be defined in each index, and the specific parameters that can be included.

3.3.1 Indexes and search engine

As mentioned above, ElasticSearch is composed of indexes, which offer a serialised and user-customisable data structure where documents will be stored in the different fields.

This structure is used by the search engine, established when the service is created, to compare the queries with the stored index fields. The ES workflow is shown in figure 3.4.

Figure 3.4: ElasticSearch WorkFlow

To start using the ES service, it is first necessary to define the index used for the documents. In this case, each document in the index is one of the articles previously separated by the processing of the legal document. In turn, the dataset columns describing each article become the fields defined in the data structure of the index.

The number of settings this service allows to set up goes far beyond this project's scope; a list of all available settings for an index can be found in [46]. The parameters set for this use case are the following:

• Analysis: Settings to define analyzers, tokenizers, token filters and character filters. ES uses this setting to process unstructured text into structured text, applying text cleaning techniques such as stemming and stopwords removal, among other parameters. This processing will be applied when indexing new text or searching text fields.

• Mapping: Enable or disable dynamic mapping for an index. This setting is used to define how a document, and its constituent fields, are indexed and stored in a given index.

• Similarities: Configure custom similarity settings to customize how search results are scored. This setting represents the search engine used by ES to compare the queries made with the fields of the indexed document.

Therefore, to obtain an index that worked correctly, the different settings of the ES service were modified to find the best combination for this project. We should note that these settings work correctly for this particular use case, i.e. analysis of legal documents. The language used in this case is very different from that used in other use cases (e.g. analysis of customer comments in e-commerce or analysis of tweets on a trending topic), so these settings should be studied again for each specific use case.

Analyzers

To analyse which parameters are best suited for this use case, we will explore the settings presented above. We start with the analysis settings, which cover text cleaning and preprocessing techniques.

ES performs what is called a full-text search. This means that it returns the most relevant results, those most similar to the query, rather than only those that exactly match the words and structure of the query. Take a query like "Where can I find information about tall facades in urban environments?". We can of course expect it to return a document containing the terms "urban environments", but it would also be interesting to obtain, in the same search, documents containing words such as "urban information" or "urban facades". To obtain documents that present similar information more precisely, ES applies a parser (an analyzer, in ES terminology). This parser is applied both to the query and to the fields of the indexed documents, and is mainly composed of three blocks:

• Character filters: A character filter receives the original text as a stream of characters and can transform the stream by adding, removing, or changing characters. In this case, character filtering is performed outside ES, so we will not use any character filter inside the ES service. Outside this service, character filtering is performed using the regular expression library, explained in the previous section, eliminating special characters, double spacing and punctuation, and replacing accented vowels with unaccented vowels.

• Tokenizers: A tokeniser receives a stream of characters, breaks it up into individual tokens (usually individual words), and outputs a stream of tokens. The tokeniser we will use for this case is the one that separates character streams by whitespace. Thus, a sentence like "Información urbanistica en Madrid" will be separated into ["Información", "urbanistica", "en", "Madrid"]; lowercasing and the removal of stopwords such as "en" are handled afterwards by the token filters.

• Token filters: A token filter receives the token stream and may add, remove, or change tokens. This is the most heavily used block in this project. We applied a total of 5 filters: lowercase (converts uppercase to lowercase), stopwords (removes very frequent words from the text), stemmer (reduces each word to its root), ASCII folding (converts alphanumeric characters and symbols that are not in the Basic Latin Unicode block to their ASCII equivalent) and finally synonyms (converts specific terms to other standard terms).

After applying these filters, we obtain a list of tokens that will be stored in their respective fields. It should be noted that this treatment is only applied to text fields.
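The three blocks described above can be sketched as the "analysis" part of an index body, written here as a Python dictionary. This is an illustrative configuration under stated assumptions, not the exact one used in the project: the filter and analyzer names are hypothetical, the Spanish stopword list and stemmer are assumed from the language of the examples, and the synonym entry is a placeholder.

```python
# Sketch of the "analysis" block of an index body (names are hypothetical).
analysis_settings = {
    "analysis": {
        "filter": {
            "spanish_stop": {"type": "stop", "stopwords": "_spanish_"},
            "spanish_stemmer": {"type": "stemmer", "language": "light_spanish"},
            "legal_synonyms": {
                "type": "synonym",
                # Placeholder entry; the real synonym list is not given in the text.
                "synonyms": ["casa, vivienda"],
            },
        },
        "analyzer": {
            "legal_analyzer": {
                "type": "custom",
                # No char_filter: character cleaning happens outside ES.
                "tokenizer": "whitespace",
                # The five token filters, in the order listed above.
                "filter": [
                    "lowercase",
                    "spanish_stop",
                    "spanish_stemmer",
                    "asciifolding",
                    "legal_synonyms",
                ],
            }
        },
    }
}

filter_chain = analysis_settings["analysis"]["analyzer"]["legal_analyzer"]["filter"]
print(filter_chain)
```

Because the same analyzer is referenced by the text fields, it runs both when a document is indexed and when a query is issued against those fields.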

Mappers

As mentioned above, the mapping process within the ElasticSearch ecosystem is the process of defining how a document and its respective fields are stored and indexed within the dynamic database.

This parameter, within all ES settings, is therefore in charge of defining the type of data stored in each field of each document. Mapping can take two forms: explicit, where the fields are determined when the dynamic database is created and have a fixed data type, and dynamic, where new fields are added automatically when a document containing them is indexed. The latter is useful when you want to experiment and explore the data that can be stored within ES. However, explicit mapping is preferable when more control is needed over what type of data is stored and in what form. Therefore, explicit mapping will be used for this use case.

Once the type of mapping is defined, the fields to be stored in each document will be the same as those mentioned in section 3.2.3. The data stored in these fields will be of three different types: text, integer and keyword. The first two are the basic types assigned to textual and numeric data. The third, keyword, is the most particular data type. It has been assigned to the fields "section" and "condition" and is used for structured content that identifies the element stored in the database. These are identifying fields with few possible values, so they are clear candidates for this data type.
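An explicit mapping along these lines could look like the sketch below. Only the "section" and "condition" keyword fields are named in the text; the remaining field names (in particular the integer field `article_number`) are assumptions for illustration.

```python
# Sketch of an explicit mapping; field names other than "section" and
# "condition" are hypothetical.
mappings = {
    "dynamic": "strict",  # explicit mapping: documents with unknown fields are rejected
    "properties": {
        "title": {"type": "text"},
        "text": {"type": "text"},
        "article_number": {"type": "integer"},  # hypothetical numeric field
        "section": {"type": "keyword"},
        "condition": {"type": "keyword"},
    },
}

keyword_fields = sorted(
    name for name, spec in mappings["properties"].items() if spec["type"] == "keyword"
)
print(keyword_fields)
```

Keyword fields are not analysed, so they are matched as exact values, which suits identifying fields with few possible values.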

After adjusting the mapping for this index, we move on to the most critical point within the index parameters, the similarity engine.

Similarities

This last parameter within the ES index configuration is the one that defines which search engine will be used when comparing the queries made with the fields of the stored documents.

As discussed in section 2.2, the default search engine used by ES is based on the TF-IDF term-weighting technique. Specifically, the default search engine is BM25 [47].
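For reference, the textbook Okapi BM25 score can be sketched in a few lines of Python. This is a didactic version under stated assumptions: the toy corpus is hypothetical, documents are already tokenised, and the implementation inside ES differs in details. The defaults `k1=1.2` and `b=0.75` are the default values ES uses for BM25.

```python
import math


def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Score one tokenised document against a tokenised query with Okapi BM25."""
    n_docs = len(corpus)
    avg_len = sum(len(doc) for doc in corpus) / n_docs
    score = 0.0
    for term in query_terms:
        tf = doc_terms.count(term)  # term frequency in this document
        if tf == 0:
            continue
        df = sum(1 for doc in corpus if term in doc)  # document frequency
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        length_norm = 1 - b + b * len(doc_terms) / avg_len
        score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
    return score


# Hypothetical tokenised corpus.
corpus = [
    ["urban", "facades", "height"],
    ["urban", "environment"],
    ["parking", "rules"],
]
query = ["urban", "facades"]
scores = [bm25_score(query, doc, corpus) for doc in corpus]
print(scores)
```

The first document matches both query terms and therefore ranks highest, while the rarer term "facades" contributes more than the more common "urban".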

However, ES offers a variety of options to choose from as search engines. They are based to a greater or lesser extent on information retrieval techniques, and they all offer algorithm customisation parameters (e.g. normalisation parameters). Therefore, to choose the search engine best suited to this specific case, a performance test was carried out, creating an ES database for each available search engine and filling it with the articles obtained in section 3.2.3. After that, a battery of simple queries was run against each database to see how effective the results of each one were. At the end of the test, the most effective search engine was the one based on the divergence from randomness (DFR) algorithm [48]. Thus, the search engine for this use case was established after some readjustment of its customisation parameters.
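A custom DFR similarity is declared in the index settings and then referenced from the text fields. The sketch below shows the general shape of such an index body; the similarity name, the field names and the parameter values are illustrative placeholders, not the values selected in this project.

```python
def dfr_index_body(basic_model="g", after_effect="l", c="3.0"):
    """Build an index body whose text fields are scored with a DFR similarity."""
    similarity_name = "legal_dfr"  # hypothetical name
    return {
        "settings": {
            "index": {
                "similarity": {
                    similarity_name: {
                        "type": "DFR",
                        "basic_model": basic_model,
                        "after_effect": after_effect,
                        "normalization": "h2",
                        "normalization.h2.c": c,  # normalisation parameter
                    }
                }
            }
        },
        "mappings": {
            "properties": {
                # Hypothetical field names for the article title and body.
                "title": {"type": "text", "similarity": similarity_name},
                "text": {"type": "text", "similarity": similarity_name},
            }
        },
    }


body = dfr_index_body()
print(body["settings"]["index"]["similarity"]["legal_dfr"]["type"])
```

Creating one such body per candidate similarity, and one index per body, reproduces the kind of side-by-side test described above.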

3.3.2 ElasticSearch queries

Once the best configuration for the index that will store the document has been established, we can address which type of query will be used for this particular case. Like the indexes, the queries used to retrieve the indexed documents can also be customised [49]. Specifically, two of the parameters available in the query have been used.

The first one is the filter applied to the query. This parameter restricts the set of documents to be queried by fixing the values of specific document fields. As mentioned above, the use case in question is limited to title 8 of the legal document. This legal document is structured so that each residence to which the rule applies can be categorised by title and chapter. Thus, a single zoning rule, identified by its title and chapter, applies to a home located on a particular street. In turn, one of the technological systems offered by the Madrid City Council, called Callejero, makes it possible to obtain the title and chapter to which each address corresponds. Therefore, by making an API call to this system with the address of the residence about which information is wanted, we obtain the title and chapter that correspond to that address. With this information, it is possible to filter the articles to which the query applies, which improves the performance of the ES system.
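Such a filter can be expressed with a `bool` query whose `filter` clauses restrict the candidate articles before scoring. In the sketch below, the field names `title_number`, `chapter`, `title` and `text` are hypothetical, and the title and chapter values are assumed to have been obtained beforehand from the Callejero lookup, which is not reproduced here.

```python
def build_filtered_query(question, title_number, chapter):
    """Build a query that only scores articles of the given title and chapter."""
    return {
        "query": {
            "bool": {
                # Scored part: textual comparison with the user's question.
                "must": [
                    {"multi_match": {"query": question, "fields": ["title", "text"]}}
                ],
                # Non-scored part: exact restrictions on identifying fields.
                "filter": [
                    {"term": {"title_number": title_number}},
                    {"term": {"chapter": chapter}},
                ],
            }
        }
    }


query_body = build_filtered_query("maximum facade height", "8", "2")
print(query_body["query"]["bool"]["filter"])
```

Filter clauses do not contribute to the score, so they narrow the search space without altering the ranking of the remaining articles.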

The second one is related to the number of fields used to obtain the comparisons and the relations between these fields. ES offers a wide variety of options regarding which document fields to use for the comparison. Therefore, given that the queries would be textual, it was decided that the fields to be studied and which would provide the most information would be the title of the article and the body of the article.

In addition, of all the options offered by ES for queries, those based on text comparisons have been used. Within these types of queries, ES offers a wide variety [50], but three categories have been studied:

• Match query: Returns documents that match a provided text value. The provided text is compared with a specific field.

• Multi-match query: Returns documents that match a provided text value. The provided text is compared with multiple fields. At the same time, this option offers multiple configuration options:

– Best fields: Finds documents which match any field but uses the score value from the best field to determine which document is returned.

– Cross fields: Treats fields with the same analyser as though they were one big field. In other words, each word of the provided text must be present in at least one field for a document to match.

– Most fields: Finds documents which match any field and combines the score value from each field to determine which document is returned.

• Simple query string: Returns documents based on a provided query string, using a parser with a limited but fault-tolerant syntax.
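The three query categories above correspond to the following request bodies, sketched here as small Python builders. The field names `title` and `text` are assumptions standing for the article title and the article body.

```python
FIELDS = ["title", "text"]  # hypothetical names for article title and body


def match_query(question, field):
    """Compare the question with a single field."""
    return {"query": {"match": {field: {"query": question}}}}


def multi_match_query(question, match_type="best_fields"):
    """Compare the question with several fields.

    match_type can be "best_fields", "cross_fields" or "most_fields".
    """
    return {
        "query": {
            "multi_match": {"query": question, "fields": FIELDS, "type": match_type}
        }
    }


def simple_query_string(question):
    """Parse the question with the limited, fault-tolerant query syntax."""
    return {"query": {"simple_query_string": {"query": question, "fields": FIELDS}}}


question = "maximum facade height"
candidates = {
    "match_title": match_query(question, "title"),
    "match_text": match_query(question, "text"),
    "best_fields": multi_match_query(question, "best_fields"),
    "cross_fields": multi_match_query(question, "cross_fields"),
    "most_fields": multi_match_query(question, "most_fields"),
    "simple_query_string": simple_query_string(question),
}
print(sorted(candidates))
```

Keeping the candidate bodies in one dictionary makes it straightforward to run the same set of questions against every configuration.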

It should be noted that the same parser established during the creation of the index is applied to each query, so that the query text is processed in exactly the same way as the text stored in each of the two fields.

As in the previous section, all the options presented were tested using a battery of questions, with the index configuration selected in the previous section. The results of this benchmark are shown in figure 3.5.

Figure 3.5: Queries ES

As can be seen, each graph consists of a two-bar histogram where the first bar indicates the number of errors and the second bar shows the number of correct answers. A question counts as a hit when the returned articles correspond to the requested information; otherwise, it counts as an error. Among the 15 questions asked, three configurations obtained a 100% success rate: multi-match with best fields, multi-match with most fields, and simple query string. Given that these three configurations show similar results, we decided to use the best fields configuration, since it is the default configuration for multi-field queries and is therefore expected to be the best optimised for the ES search engine.
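The hit and error counts of such a benchmark can be computed with a small harness like the one below. It is a sketch under stated assumptions: the questions, the expected article ids and the `keyword_stub` search function are hypothetical, and a real run would replace the stub with a function that sends the query to ES and returns the id of the top-ranked article.

```python
def evaluate(search, questions):
    """Count hits and errors for (question, expected_article_id) pairs."""
    hits = sum(1 for question, expected_id in questions if search(question) == expected_id)
    return {"hits": hits, "errors": len(questions) - hits}


def keyword_stub(question):
    """Toy stand-in for one ES query configuration (hypothetical)."""
    return 12 if "facade" in question.lower() else 3


# Hypothetical benchmark questions with the article id expected as the answer.
questions = [
    ("How tall can a facade be?", 12),
    ("Where can I park?", 3),
    ("What is the maximum plot size?", 7),
]

result = evaluate(keyword_stub, questions)
print(result)  # {'hits': 2, 'errors': 1}
```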

Finally, it should be noted that each result obtained from the comparison has a score value that indicates the level of similarity between the returned document and the query performed. If we collect these scores from the set of queries performed and plot them, we obtain the results shown in figure 3.6. As we can see, there are questions where the score is below 4 points despite giving a correct result. In later sections, we will look at how to deal with such low scores and what strategies to follow to avoid erroneous results. For the time being, by setting a numerical threshold, we could automatically classify the answers obtained as wrong or correct.
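A threshold-based classification of this kind can be sketched as follows. The scores and the threshold value of 4.0 are illustrative assumptions; as noted above, a fixed threshold would reject some correct answers with low scores.

```python
def classify_by_score(scored_answers, threshold):
    """Label each (question, score) pair as accepted or rejected."""
    return [
        {"question": question, "score": score, "accepted": score >= threshold}
        for question, score in scored_answers
    ]


# Hypothetical scores and threshold, for illustration only.
scored_answers = [("question A", 9.3), ("question B", 3.1)]
labelled = classify_by_score(scored_answers, threshold=4.0)
print([answer["accepted"] for answer in labelled])
```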

Figure 3.6: Score obtained from queries

As can be seen, ES is an excellent tool for indexing large numbers of documents and for creating complex information retrieval and question answering systems. Throughout this section, we have tested multiple configurations for indexing documents and creating efficient queries suited to our use case. As a result, we have obtained a system that, with very high efficiency, extracts from a large document the minimum portion of information where the answer to the user's question can be found.

The problem now lies in how to obtain a concrete answer from that portion of information. This problem will be addressed in the following section.

In document Development of a Question Answering System for Legal Urban Information Retrieval (pages 36-43)
