Voice interface through Alexa - Development of a Question Answering System for Legal Urban Info

4.2.1 What is Amazon Alexa?

As mentioned, Amazon Alexa [78] is a set of devices and software on which developers can build applications, called skills, that the user interacts with mainly through voice. Unlike standard mobile applications, the user experience has to resolve every query through voice, so this type of skill must be designed in a more human way, as if the user were talking to another person.

Two elements must be distinguished to understand how the whole system works: the physical device (the hardware) and the developed skill (the software). A skill can run on a wide variety of devices. The most basic ones are the devices called Dots. More advanced devices add features and improve the hardware; some even include a touch screen that enriches the voice experience with on-screen messages or actions. In this use case, the skill has been designed to run on an Echo Dot, a standard device within the range offered by Amazon. It has all the functionality needed to run a skill, but no touch screen to reinforce the user experience. Thus, as mentioned above, the entire user experience will be voice-driven.

With the target device for communicating with the user established, we can now explain what an Alexa skill is composed of.

During the development of an Alexa skill, it is necessary to differentiate between multiple key elements:

• Utterances: These elements represent the sentence spoken by the user, that is, literally what the user says.

• Intents: These elements represent the intention captured by the skill during automatic speech recognition (ASR). They must be defined so that the skill can detect the user’s intention and act accordingly.

• Slots: These elements correspond to specific terms or phrases within the intent that must be obtained as additional information.

• Invocation Name: This element defines the name the user says so that the device wakes up the skill.

• Response: This element corresponds to the response delivered by the skill. This response may have been constructed by the system’s backend or may even be declared in the development console of the skill itself.

These are the essential elements within any skill and the ones on which the explanation will focus. A diagram is used to show how a skill detects and manages a complete conversation using these elements. Figure 4.3 shows an example of a conversation.

As shown in figure 4.3, the first interaction the user has to perform is to wake up the skill they want to use. The most frequently used utterance is "Alexa, open" followed by the invocation name. After that, the skill starts with an initial phrase built in its back end. This part will be covered later, but all skills can access certain user data, either natively through the Echo device configuration (with the necessary permissions) or as attributes obtained during past conversations and stored in the DynamoDB database.

Figure 4.3: Example of a conversation with Alexa skill and detection of basic elements

Regardless of its origin, this data can be used to construct a response, as in the case of the first response offered by the skill, which contains the user’s name. After that, the user indicates their intention to inquire about urban planning information on a particular street.

This highlights several interesting aspects. The first one concerns the intent detected in this second interaction. Within a skill, you can declare as many intents as you want.

Some of them are already defined by the Alexa team itself (they are exposed with the AMAZON prefix, for example AMAZON.YesIntent). These include:

• NoIntent: User says no.

• YesIntent: User says yes.

• HelpIntent: User asks for help.

• StopIntent: User wants to stop and exit the skill.
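As a rough illustration of how a back end might tell these predefined intents apart, the sketch below routes an incoming request by intent name and wraps the reply in the JSON envelope a skill returns to Alexa. The reply texts and the helper names `speak` and `handle_builtin` are hypothetical and are not taken from the skill described here.

```python
def speak(text, end_session=False):
    """Wrap a sentence in the JSON response envelope returned to Alexa."""
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": text},
            "shouldEndSession": end_session,
        },
    }


# Placeholder replies (assumption): (text, whether the session ends).
BUILTIN_REPLIES = {
    "AMAZON.YesIntent": ("Great, let us continue.", False),
    "AMAZON.NoIntent": ("All right.", False),
    "AMAZON.HelpIntent": ("You can ask for urban information about an address.", False),
    "AMAZON.StopIntent": ("Goodbye.", True),
}


def handle_builtin(event):
    """Return a response for a built-in intent, or None for any other request."""
    request = event.get("request", {})
    if request.get("type") != "IntentRequest":
        return None
    name = request.get("intent", {}).get("name")
    if name not in BUILTIN_REPLIES:
        return None
    text, end_session = BUILTIN_REPLIES[name]
    return speak(text, end_session)
```

Returning `None` for anything that is not a built-in intent leaves custom intents, such as the ones discussed next, to a separate handler.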

In turn, we can declare custom intents. Unlike the built-in ones, for these it must be defined which user utterances the Alexa ASR needs to capture to launch the intent. Thus, for the intent called InfoUrbanIntent, utterances such as "I want urban information", "Give me urban information", and "I need urban information" have been defined, among other examples. At the same time, inside an intent, you can define the slots you want to extract from the utterances, that is, the additional information needed to complete a given function. These slots can be obtained in several ways.

For example, in figure 4.3, we can see two possible ways to get the slots.

The first one is to obtain the slots’ information directly from the user’s utterances. For this, it is necessary to mark the slots you expect to get inside the utterances belonging to that intent. In this case, one of the utterances that was defined is: "I would like to know more urban information about {street} number {number}". The two slots, street and number, have been obtained because they matched the utterance made by the user. It should be noted that all slots, as well as the intents, must be defined beforehand, declaring what kind of information each intent and slot will capture. In this case, number captures a cardinal number, while street captures one of the address names previously declared for this slot. Specifically, in the street slot, we have defined all addresses present in Madrid territory. Thus, after declaring both the intent InfoUrbanIntent and the slots street and number, it is possible to capture all this information through a single interaction and build a response based on it.
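To make the declaration more concrete, the following sketch writes an interaction model as a Python dictionary that mirrors the JSON structure edited in the Alexa developer console, together with a small helper that reads the filled slots from an incoming request. The intent name, the slot names and the sample utterances come from the text, with curly braces marking where a slot is expected; the invocation name, the custom type name `STREET_NAME`, its two example values and the helper `slot_values` are assumptions made only for illustration.

```python
# Interaction model as a Python dict mirroring the console JSON.
# Invocation name, STREET_NAME and its values are placeholders (assumptions).
INTERACTION_MODEL = {
    "interactionModel": {
        "languageModel": {
            "invocationName": "urban information",
            "intents": [
                {
                    "name": "InfoUrbanIntent",
                    "slots": [
                        {"name": "street", "type": "STREET_NAME"},
                        # AMAZON.NUMBER is the built-in type for cardinal numbers.
                        {"name": "number", "type": "AMAZON.NUMBER"},
                    ],
                    "samples": [
                        "I want urban information",
                        "Give me urban information",
                        "I need urban information",
                        "I would like to know more urban information "
                        "about {street} number {number}",
                    ],
                }
            ],
            "types": [
                {
                    "name": "STREET_NAME",
                    "values": [
                        {"name": {"value": "Montalbaz"}},
                        {"name": {"value": "Gran Via"}},
                    ],
                }
            ],
        }
    }
}


def slot_values(event):
    """Collect the slots Alexa filled in an IntentRequest as {name: value}."""
    slots = event["request"]["intent"].get("slots") or {}
    return {name: slot["value"] for name, slot in slots.items() if slot.get("value")}
```

Slot values arrive as strings, so a captured number such as "2" would still need converting before it is used in an address lookup.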

The other way to capture a slot is to declare, within the Alexa console, that the skill should ask for this slot separately. That is the case in the third interaction, where the skill asks which question the user wants to ask. The user then responds with that question. In this case, the declared slot is the whole question asked by the user. Once this interaction is finished, we have all the information required to construct a concrete answer. So, the skill passes all this information to the system's back end, which, after applying all the necessary processes, returns the information required by the user.
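In the skill described here this prompting is configured in the Alexa console. For illustration only, the sketch below shows the equivalent mechanism driven from the back end, where the response carries a `Dialog.ElicitSlot` directive naming the slot that Alexa should ask for. The slot name `question`, the prompt text and the function names are assumptions.

```python
def elicit_slot(slot_name, prompt):
    """Build a response asking the user for one missing slot."""
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": prompt},
            "directives": [
                {"type": "Dialog.ElicitSlot", "slotToElicit": slot_name}
            ],
            # Keep the session open so the user's reply reaches the skill.
            "shouldEndSession": False,
        },
    }


def next_step(event):
    """Ask for the question if it is missing; return None once it is known."""
    slots = event["request"]["intent"].get("slots") or {}
    question = slots.get("question", {}).get("value")
    if not question:
        return elicit_slot(
            "question", "What would you like to know about this address?"
        )
    return None  # every slot is filled; the back end can take over
```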

This whole process describes a basic interaction of the user with the system. This process is executed and carried out by the system’s front end, i.e. by the Alexa skill itself.

However, the responses generated and the dialogue logic are, for the most part, located in the back end of the system. Figure 4.2 shows the services that come into play in such an interaction. Thus, we proceed to the explanation of the back end of the system.

4.2.2 System backend

In the previous section, we discussed what could be called the front end of the system, that is, the part of the system that interacts directly with the user. It was explained that this part is an Alexa skill that uses voice to interact with users. In turn, an example of how the skill handles the interactions with the user has been given, i.e., how it can detect the user’s intention and act accordingly.

In this section, we will deal with the back end of the system, which starts by receiving from the skill both the user’s intention and the information obtained through the slots. This information reaches the Lambda service, which, as discussed in previous sections, is a serverless service offered by AWS that allows services to be programmed and created without managing the underlying hardware. This lambda will be the orchestrator of the entire back end of the system. One of its initial steps is to call the DynamoDB database to retrieve information about the user and the final state of the last conversation.

With this information retrieved from the database and the information obtained from the skill, it is possible to start constructing a response.
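A minimal sketch of this retrieval step is given below, assuming a table keyed by a `user_id` attribute that stores the user's name and the last conversation state. The key name, the stored attributes and the function names are hypothetical. The `table` argument is expected to behave like a boto3 DynamoDB `Table` resource, whose `get_item` call returns a dictionary that contains an `Item` entry only when the key exists.

```python
def user_id_from(event):
    """Read the Alexa user identifier carried in the request's session block."""
    return event["session"]["user"]["userId"]


def load_user_state(table, user_id):
    """Fetch the stored item for one user, or an empty dict if there is none.

    Inside the lambda, `table` could be obtained with
    boto3.resource("dynamodb").Table(<table name>); the key name
    "user_id" is an assumption.
    """
    result = table.get_item(Key={"user_id": user_id})
    return result.get("Item", {})


def greeting(state):
    """Build the opening phrase, personalised when a name is stored."""
    name = state.get("name")
    return f"Welcome back, {name}." if name else "Welcome."
```

Passing the table in as an argument keeps the functions independent of AWS credentials, so the same code can be exercised with an in-memory stand-in during development.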

Depending on the user’s intention, the steps to follow are different. As mentioned at the beginning of this thesis, this project was conceived as an extension of a previous project where a series of intents were set. In this new project, a new intent has been added, the urban information search. So, depending on whether the interaction belongs to the old set of intents or to the newly added intent, step 3.b will be included or not.

In any case, regardless of the detected intent, the first step is to obtain the information related to the requested address. In the example of figures 4.3 and 4.2, the address is Montalbaz street, number 2, so this address is looked up in the database of the city council through an API request to its servers. The information retrieved includes a large amount of urban data about that address: buildability and protection status (if it is a property of cultural or patrimonial interest), among other parameters. However, to ask for more information about this address, it is necessary to use the newly implemented functionality.

This new functionality is included in step 3.b of figure 4.2 and represents all the development done in chapter 3. Thus, after obtaining the general urban information of the address given by the user, it is possible to pass this data and the question asked by the user to the NLP pipeline. This pipeline comprises the same elements discussed in chapter 3; here, however, the developed modules are deployed in the services necessary for their use.
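The branching described above can be sketched as follows. The address lookup and the NLP pipeline are passed in as callables so that the example stays self-contained. Treating `InfoUrbanIntent` as the newly added intent, the slot name `question` and the shape of the returned dictionary are assumptions made for illustration.

```python
NEW_INTENT = "InfoUrbanIntent"  # assumption: name of the newly added intent


def build_answer(intent_name, slots, fetch_address_info, run_nlp_pipeline):
    """Look up the address for every intent; run step 3.b only for the new one."""
    # The address lookup happens regardless of the detected intent.
    address_info = fetch_address_info(slots["street"], slots["number"])

    if intent_name != NEW_INTENT:
        # Previously existing intents stop after the address lookup.
        return {"address_info": address_info, "answer": None}

    # Step 3.b: hand the question and the address data to the NLP pipeline.
    answer = run_nlp_pipeline(slots["question"], address_info)
    return {"address_info": address_info, "answer": answer}
```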

Thus, the interaction would proceed as follows:

• Both the general information and the question asked by the user are sent to the first service, OpenSearch, which offers the same features as Elasticsearch within the AWS ecosystem. Once the parameters of this database have been pre-configured (discussed in 3.3), a query can be performed with the question and the chapter to which the queried domain corresponds (this information is in the database of the city council). The query returns an OpenSearch response, and how the response construction continues depends on the score obtained.

• After obtaining the item corresponding to the consulted address, this information is sent to the EC2 instance where the Deep Learning model studied in 3.4 is hosted.

In this case, a configurable EC2 instance with the necessary computational requirements has been chosen since specific software is required for the model to work. Once configured, this instance waits for input and a context (in this case, question and article, respectively) and returns the answer to the question asked by the user. Finally, this answer is delivered back to the lambda.
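The two steps above can be sketched as three small functions: one builds the search body, one keeps the best hit only if its score is high enough, and one prepares the input for the model. The query uses the standard `bool` query with a `match` clause and a `term` filter, and the response is read from `hits.hits`, where results are ordered by score by default. The field names `text` and `chapter`, the score threshold, the payload keys and the choice to stop when the score is too low are assumptions, since the source does not specify them.

```python
def build_search_body(question, chapter, size=1):
    """Full-text match on the question, restricted to one chapter."""
    return {
        "size": size,
        "query": {
            "bool": {
                "must": [{"match": {"text": question}}],
                "filter": [{"term": {"chapter": chapter}}],
            }
        },
    }


def best_article(search_response, min_score):
    """Return the text of the top hit, or None if it is missing or scores too low."""
    hits = search_response.get("hits", {}).get("hits", [])
    if not hits or hits[0]["_score"] < min_score:
        return None
    return hits[0]["_source"]["text"]


def build_qa_payload(question, article):
    """Pair the question with the retrieved article as the model's context."""
    return {"question": question, "context": article}
```

A caller would send the body to the OpenSearch index, pass the response to `best_article`, and forward the resulting payload to the service running on the EC2 instance only when an article was found.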

After all these steps, the lambda function can construct a concise answer that is returned to the user by the Alexa device. If the user requires more information, the skill can detect this intent, and the lambda can construct a message with all the detailed knowledge of the queried address. This message will be sent from the lambda to the user’s email through Amazon’s email service, SES (Simple Email Service).
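A minimal sketch of this last step is shown below. The `ses_client` argument is expected to behave like a boto3 SES client, whose `send_email` call takes `Source`, `Destination` and `Message` arguments. The subject line, the plain-text layout of the body and the function name are assumptions made for illustration.

```python
def send_details(ses_client, sender, recipient, address, details):
    """Email the detailed urban information of an address as plain text.

    Inside the lambda, `ses_client` could be created with boto3.client("ses").
    """
    # One "key: value" line per retrieved parameter (layout is an assumption).
    body = "\n".join(f"{key}: {value}" for key, value in details.items())
    return ses_client.send_email(
        Source=sender,
        Destination={"ToAddresses": [recipient]},
        Message={
            "Subject": {"Data": f"Urban information for {address}"},
            "Body": {"Text": {"Data": body}},
        },
    )
```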

These steps make up the complete interaction of a user with the system, from the initial user interaction to the delivery of the entire information through direct messages.

In document Development of a Question Answering System for Legal Urban Information Retrieval (pages 53-57)