Question Answering systems - Development of a Question Answering System for Legal Urban Informa

• Multilingual XLM-Roberta: This model [55] is based on the improved version of the BERT model called RoBERTa. In turn, it is trained using the SQUAD dataset [13]. This model allows Question Answering tasks to be carried out in different languages, including Spanish.

• Multilingual BERT fine-tuned on XQUAD dataset: This model [56] is based on the official version of BERT. However, this model is fine-tuned using XQUAD dataset [57], specifically designed for multilingual Question Answering tasks.

• Multilingual BERT fine-tuned on SQUADv1.1: This model [58], among all the models listed, is the closest in terms of performance to the official BERT model released by Google. It is a multilingual model, so it works perfectly well in Spanish.

• Multilingual BERT fine-tuned on Tydi QA dataset: This last model [59] is based on the previously mentioned version of the BERT model fine-tuned on XQUAD. However, this model is additionally fine-tuned using the Tydi QA dataset [60], a question-answering dataset covering 11 typologically diverse languages with 204K question-answer pairs.

It is worth clarifying the term fine-tuning in this context. Fine-tuning a model means taking a model that has already been trained on an existing dataset (e.g. the SQUAD dataset) and retraining it on an additional dataset supplied by the user. After being retrained on this customised dataset, the model adapts much better to the use case that the dataset addresses.

Once the candidate models have been chosen, a test environment can be created to measure the performance and efficiency of each one against the project’s requirements. First, it is necessary to understand how these models behave. To use them correctly, three basic elements must be understood:

• context: This is the information from which you will extract the answer to the question asked. In this case, it will be the article itself obtained from ES.

• input: This is the input necessary to obtain an output from the model. In this case, it is the question asked by the user.

• output: This is the outcome obtained from the model once it has processed the given input and context. In this case, it is the possible answer to the user question.

Thus, a standard behaviour of these models is shown in figure 3.7.
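The three elements above can be illustrated with a minimal sketch. It assumes the models are loaded through the Hugging Face `transformers` question-answering pipeline, which the text does not state explicitly; the helper names `load_qa_model` and `answer_question` are hypothetical and only serve the illustration.

```python
from typing import Any, Callable, Dict


def load_qa_model(model_name: str) -> Callable[..., Dict[str, Any]]:
    """Load a question-answering model by name.

    Assumption: the Hugging Face `transformers` library is installed.
    The import is done lazily so the rest of the sketch has no
    third-party dependency.
    """
    from transformers import pipeline

    return pipeline("question-answering", model=model_name)


def answer_question(qa_model: Callable[..., Dict[str, Any]],
                    question: str,
                    context: str) -> Dict[str, Any]:
    """Pass the input (question) and the context (article text) to the
    model and keep the two output fields of interest: the extracted
    answer and the score the model assigns to it."""
    result = qa_model(question=question, context=context)
    return {"answer": result["answer"], "score": float(result["score"])}
```

Here `context` would be the article text retrieved from ES and `question` the user's question; any callable that accepts `question` and `context` keyword arguments and returns a dictionary with `answer` and `score` keys can stand in for the model.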

Since the QA system is expected to analyse an article obtained from the ES database and extract an answer from it, it makes sense to create a test set that uses this type of article as the context and questions related to it as the input. Thus, a test set has been developed, including 15 questions related to 11 different articles of the legal document. The test set can be found at 5.2. Running this set through each of the models described above yields the results shown in table 3.1.

Figure 3.7: Model Behaviour

Model                                            Load time (s)  Avg score  Correct answers
Spanish DistBERT                                 6.681          0.1659     2/15
Multilingual XLM-Roberta                         8.890          0.3119     10/15
Multilingual BERT fine-tuned on XQUAD dataset    21.933         0.3982     6/15
Multilingual BERT fine-tuned on SQUADv1.1        9.398          0.5068     10/15
Multilingual BERT fine-tuned on Tydi QA dataset  9.0957         0.2473     6/15

Table 3.1: BERT models performance used for QA purpose

Table 3.1 reports three parameters with which the performance of each model has been measured. The first one is the loading time of each model, that is, the time the machine needs to load the model and start running it. The second parameter is the average score obtained over the 15 questions asked to each model. The higher the value, the more likely it is that the answer obtained is correct. Finally, the number of correct answers indicates how many of the responses received actually answer the question asked. Analysing the results obtained from this test bench, we can draw several conclusions:

• Regarding the loading time, we can observe that the models based on the distilled version of BERT take slightly less time to load (a few seconds less) than their original counterparts. However, this is not the case for the improved and enhanced version of the BERT model, called RoBERTa, which has a much longer loading time than the average (around 20 seconds). This fact practically rules out this model as a possible option since such a long waiting time is not acceptable for users who enquire through either of the two available communication channels.

• As for the average score obtained by each model, we can see that all models fall within a similar range, with none exceeding a score of one. The best performers concerning this parameter are the RoBERTa model and the original BERT model, with scores of 0.3119 and 0.5068, respectively.

• Regarding the correct answers obtained by each model, some models perform poorly for this use case, such as DistBERT or the multilingual BERT models fine-tuned with the Tydi QA and XQUAD datasets. On the other hand, there are two models whose results are more than acceptable, with ten correct answers out of 15. These two are, again, the RoBERTa model and the original BERT model.

• Another point that can be drawn from this test is that the QA models tested work well with direct questions where the information is found directly in the text. An example of such a question could be: What is the height of the ledge? On the other hand, these models perform poorly with more open questions where a more in-depth text analysis is needed to extract a correct answer. An example of this kind of question is: Can I build in the setback space? As can be seen, this question requires the model to analyse the text and the conditions of the property being asked about to generate a concise yes or no answer.
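The three parameters reported in table 3.1 could be collected with a small test bench along the following lines. This is only a sketch under stated assumptions: each model is a callable that takes `question` and `context` and returns a dictionary with `answer` and `score` keys, and the correctness check is passed in as a function, since the text does not say how answers were judged. The name `benchmark_model` and the test-case keys are hypothetical.

```python
import time
from typing import Any, Callable, Dict, List


def benchmark_model(load_model: Callable[[], Callable[..., Dict[str, Any]]],
                    test_set: List[Dict[str, str]],
                    is_correct: Callable[[str, Dict[str, str]], bool]
                    ) -> Dict[str, Any]:
    """Measure loading time, average score and number of correct answers."""
    # 1) Loading time: how long it takes to get the model ready to run.
    start = time.perf_counter()
    model = load_model()
    load_time_s = time.perf_counter() - start

    scores: List[float] = []
    correct = 0
    for case in test_set:
        result = model(question=case["question"], context=case["context"])
        # 2) Score returned by the model for this question.
        scores.append(float(result["score"]))
        # 3) Correctness is decided by the caller-supplied check
        #    (assumption: this could also be a manual review).
        if is_correct(result["answer"], case):
            correct += 1

    avg_score = sum(scores) / len(scores) if scores else 0.0
    return {
        "load_time_s": load_time_s,
        "avg_score": avg_score,
        "correct": correct,
        "total": len(test_set),
    }
```

Passing the loader as a function keeps the timing of the load step separate from the timing of the questions, which matches the way the loading time is described as its own parameter.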

After this analysis, the model that best fits the specific use case in terms of performance and execution time is the original BERT model trained with the SQUAD dataset made by the Google team. However, this model and the others have difficulty with questions that require elaborating a more complex answer instead of finding the answer in the text. Therefore, to mitigate this limitation, just as it has been decided to create a threshold value for the articles obtained in ES, a threshold value will also be defined to indicate to the system whether the answer is good enough. If it is not, only the item obtained will be given as the response.
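The answer threshold described above can be sketched as follows. The function name `select_response` and the threshold values in the usage lines are hypothetical, and the sketch assumes that the "item obtained" is the article text retrieved from ES; the text does not give a concrete threshold.

```python
from typing import Any, Dict


def select_response(qa_result: Dict[str, Any],
                    article: str,
                    score_threshold: float) -> str:
    """Return the extracted answer when its score reaches the threshold;
    otherwise fall back to the retrieved article on its own."""
    if float(qa_result["score"]) >= score_threshold:
        return str(qa_result["answer"])
    # The answer is not considered good enough: reply with the article.
    return article


# Illustrative values only; the real threshold would be tuned on the test set.
confident = select_response({"answer": "3 m", "score": 0.8}, "Article text", 0.5)
fallback = select_response({"answer": "3 m", "score": 0.2}, "Article text", 0.5)
```

With these illustrative inputs, `confident` holds the extracted answer and `fallback` holds the article text.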

This chapter has addressed the creation and testing of the three main modules that support the NLP pipeline proposed for this project. The first was the generation of a dataset composed of atomised and easily analysable elements from a legal document. The second of these modules consisted of creating an information retrieval system that used an intelligent search engine based on word embedding techniques to index these atomised elements and perform smart searches. The last module consisted of creating an advanced QA system using the latest techniques in Deep Learning to obtain a concrete answer to a question from these atomised elements. However, these elements alone do not add much value, so in the next chapter, we will see how to connect all these modules and deploy them in the cloud for use by users.

Chapter 4

Cloud deployment

The previous chapter dealt with the complete development of the NLP pipeline to be applied to provide a Question-Answering system to answer questions about urban information at a given address. However, all this development has been done locally, testing each implemented module’s scope and performance. Once the solution’s effectiveness has been tested separately, it is time to put all the modules together and upload them to the cloud to offer it to the end-user through the proposed communication channels.

This chapter will therefore address the deployment of the final solution. It will explain the components needed to support the proposed solution for the information retrieval/question answering system and the different web services required to provide extra functionalities to the user. It will also discuss the creation of the two proposed communication channels for the user to interact with the system and the management platform created to configure and maintain the whole system.
