
Crowdsourcing systems (CS) have given researchers and practitioners opportunities to solve a wide variety of problems in Web-related contexts. Wikipedia and Linux are among the well-known examples in which a crowd of users explicitly collaborates to build a long-lasting artifact that benefits the whole community (Doan et al., 2011). In other crowdsourcing systems, users collaborate implicitly. For instance, in the ESP game (Von Ahn and Dabbish, 2004), users label images as a side effect of playing the game. Another class of crowdsourcing systems, such as Amazon Mechanical Turk (2009), also relies on users, but the users come together for a particular task: the goal is to solve a problem, nothing long-lasting is built, and no community exists. Crowdsourcing systems can be classified along many dimensions; the details can be found in the work by Doan et al. (2011). In particular, Doan et al. define a crowdsourcing system as follows:

A system is a CS system if it enlists a crowd of humans to help solve a problem defined by the system owners, and if in doing so, it addresses the following four fundamental challenges: How to recruit and retain users? What contributions can users make? How to combine user contributions to solve the target problem? How to evaluate users and their contributions?

The remainder of this section and the next address these issues for the crowdsourcing process used in this thesis, whose purpose is to annotate queries along various dimensions of query intent.

According to Amazon², “Mechanical Turk is based on the idea that there are still tasks that human beings can do much more effectively than computers, such as identifying objects in a photo or video, transcribing audio recordings”, or, in the case of this research work, manually labeling queries. Amazon calls these tasks HITs (human intelligence tasks). A HIT represents a single, self-contained task that a so-called worker can work on, submit an answer to, and collect a reward for completing.

In order to obtain a larger set of labeled queries as ground truth for training and evaluation, we selected an additional set of 3000 queries from set A(1) to augment the existing set of 1700 manually labeled queries from the previous section. Starting from an arbitrary point in the sorted impression file (approximately 15 of the length of the file from the beginning), 3000 queries were selected, each of which was contained in set A(1) and was not among the previously labeled 1700 queries. This selection approach ensures that the 3000 queries come from a continuous period of time in set A(1) (similar to the previous set of 1700 queries). We refer to this set as the MTurk set.

In addition, 1000 queries were randomly selected from the manually labeled queries as a seed set, used to validate the results obtained from Mechanical Turk. Consequently, a total of 4000 queries were prepared for labeling on Mechanical Turk and for eventual use in training and evaluation.

Figure 3.2: The labeling process through the Amazon Mechanical Turk.

The entire set of selected queries (i.e., 4000 queries) was then divided into 40 batches of 100 queries, each containing 25 seed queries and 75 MTurk queries. The labeling process is depicted in Figure 3.2. Each batch was submitted to Mechanical Turk as a single HIT, to be labeled according to the instructions provided to the annotators. The annotators were asked to judge the presumed intent of the search queries from the perspective of a general user as follows:

If the presumed purpose of submitting a query is to make an immediate or future purchase of a product or service, the query is labeled as “commercial”. Otherwise, it is labeled as “noncommercial”. If the presumed purpose of a query is to locate a specific Website, the query is labeled as “navigational”. Everything else is considered “informational”.
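The batch construction described before the labeling instructions can be sketched as follows. This is a minimal illustration rather than the procedure actually used: the function name `build_batches`, the placeholder query strings, and the shuffling of seed and MTurk queries within each batch are all assumptions, since the text only specifies the batch counts and sizes.

```python
import random

def build_batches(seed_queries, mturk_queries, n_batches, rng=None):
    """Split both pools evenly across n_batches and mix them within each batch."""
    if len(seed_queries) % n_batches or len(mturk_queries) % n_batches:
        raise ValueError("both pools must divide evenly across the batches")
    rng = rng or random.Random(0)  # fixed seed so the sketch is reproducible
    seed_per = len(seed_queries) // n_batches
    mturk_per = len(mturk_queries) // n_batches
    batches = []
    for i in range(n_batches):
        batch = (seed_queries[i * seed_per:(i + 1) * seed_per]
                 + mturk_queries[i * mturk_per:(i + 1) * mturk_per])
        # Assumption: seed queries are interleaved so annotators cannot spot them.
        rng.shuffle(batch)
        batches.append(batch)
    return batches

# Placeholder data standing in for the 1000 seed and 3000 MTurk queries.
seed = [f"seed-{i}" for i in range(1000)]
mturk = [f"mturk-{i}" for i in range(3000)]
batches = build_batches(seed, mturk, n_batches=40)
```

With these inputs every batch holds 100 queries, 25 of them from the seed set, which matches the 40 x (25 + 75) layout described in the text.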

For each batch labeled and submitted by an annotator, the labels assigned to the seed queries of the batch were compared against the actual labels of those queries (previously determined by the three local annotators). If the agreement of the annotator with the local annotators was found to be above 60%, the labels assigned by this annotator were accepted. Otherwise, the labels were ignored and the same batch of queries was submitted for an extra round of labeling. If the agreement was found to be above 75%, a bonus was awarded to the annotator.
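The acceptance rule above amounts to two thresholds on the annotator's agreement with the seed labels. The sketch below assumes that agreement is the fraction of seed queries in the batch whose submitted label exactly matches the reference label for a single labeling dimension; the text does not state how agreement was computed, and the function and parameter names are hypothetical.

```python
def review_submission(submitted, gold, accept_above=0.60, bonus_above=0.75):
    """Compare an annotator's labels for the seed queries with the reference labels.

    submitted: {query: label} for the whole batch, as returned by the annotator.
    gold:      {query: label} for the seed queries only (reference labels).
    """
    matches = sum(1 for query, label in gold.items() if submitted.get(query) == label)
    agreement = matches / len(gold)
    return {
        "agreement": agreement,
        "accepted": agreement > accept_above,  # "above 60%": strict comparison
        "bonus": agreement > bonus_above,      # "above 75%": strict comparison
    }

# Hypothetical batch with 25 seed queries, 16 of which the annotator matched.
gold = {f"q{i}": "commercial" for i in range(25)}
submitted = {q: ("commercial" if i < 16 else "noncommercial")
             for i, q in enumerate(gold)}
result = review_submission(submitted, gold)
```

In this example the annotator agrees on 16 of 25 seed queries (64%), so the batch would be accepted without a bonus; a rejected batch would simply be resubmitted as a new HIT.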

This process continued until every batch had been successfully labeled by five different annotators. The final label of each query was assigned based on the majority of the labels obtained for the query. In the end, 42% of queries were labeled as commercial and 58% as noncommercial, while 55% of queries were labeled as navigational and 45% as informational.
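The majority-vote aggregation can be illustrated with a short sketch. It assumes each accepted submission is available as a mapping from query to label for one labeling dimension; the helper names and the sample queries are hypothetical. Because each dimension is binary and five annotators vote, a tie cannot occur.

```python
from collections import Counter

def majority_label(labels):
    """Return the most frequent label among the annotators' votes."""
    label, _count = Counter(labels).most_common(1)[0]
    return label

def aggregate(accepted_submissions):
    """Combine accepted {query: label} submissions into one label per query."""
    votes = {}
    for submission in accepted_submissions:
        for query, label in submission.items():
            votes.setdefault(query, []).append(label)
    return {query: majority_label(labels) for query, labels in votes.items()}

# Hypothetical votes from five annotators on two queries.
submissions = [
    {"cheap flights": "commercial", "weather today": "noncommercial"},
    {"cheap flights": "commercial", "weather today": "noncommercial"},
    {"cheap flights": "noncommercial", "weather today": "noncommercial"},
    {"cheap flights": "commercial", "weather today": "commercial"},
    {"cheap flights": "commercial", "weather today": "noncommercial"},
]
final_labels = aggregate(submissions)
```

The same aggregation would be run separately for each dimension (commercial vs. noncommercial, navigational vs. informational), since the two judgments are independent.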

A similar process was repeated to obtain labeled queries for the specific sub-categories of commercial intent. A set of 510 manually labeled commercial queries served as the seed set, while a set of 1500 queries, all labeled as commercial in the previous process, served as the MTurk set. A total of 15 batches of 134 queries were created, each containing 34 seed queries and 100 MTurk queries. These batches were submitted to Mechanical Turk, each as a single HIT, in three rounds. Each round corresponded to one of the three sub-categories of commercial intent: product, brand, and retailer. The annotators were asked to judge the presumed intent of the search queries from the perspective of a general user as follows:

If the query is related to a specific product, it is labeled as a “specific product”, otherwise as a “broad category of products”. If the query is related to a specific retailer, it is labeled as a “specific retailer”, otherwise as an “unknown retailer”. If the query is related to a specific brand, it is labeled as a “specific brand”, otherwise as an “unknown brand”.

For instance, the query “Walmart” is considered to represent a broad category of products, with unknown brand, but retailer specific. The query “used car” is considered to be product specific, with unknown brand and unknown retailer. The same strategy for accepting or ignoring the HITs was used, and the final label of each query was assigned based on the majority of the labels obtained for the query in each category.
