3.2.3 Qualitative Synthesis: SWOT Diagrams
SWOT No. 3 – CARLOS
3.3. QUANTITATIVE ANALYSIS: INDICATORS (II, V)
There were a total of 500 worker-generated facts distributed across the 10 dialogs, with an average of 5 facts per worker (σ = 2.87). We manually annotated each worker fact as either a true positive (it matches a ground truth fact) or a false positive (it does not match any ground truth fact). When averaged across all conversations, individual workers’ precision and recall are 44% and 42%, respectively. Precision and recall are also consistent across topics, with no one topic outperforming all others, as seen in Figure 3.3.
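As a minimal sketch of how such per-worker figures could be computed, the snippet below assumes that precision is the fraction of a worker's facts that match a ground truth fact, that recall is the fraction of ground truth facts the worker recovered (with each true positive matching a distinct ground truth fact), and that both are then averaged with every worker weighted equally. The function names and the sample counts are hypothetical and are not taken from the study data.

```python
from statistics import mean


def precision_recall(true_pos, false_pos, n_ground_truth):
    """Return (precision, recall) for one worker on one dialog."""
    n_worker_facts = true_pos + false_pos
    # Guard against division by zero when a worker wrote no facts
    # or a dialog has no ground truth facts.
    precision = true_pos / n_worker_facts if n_worker_facts else 0.0
    recall = true_pos / n_ground_truth if n_ground_truth else 0.0
    return precision, recall


def average_scores(rows):
    """Average per-worker precision and recall, weighting workers equally."""
    scores = [precision_recall(*row) for row in rows]
    return mean(p for p, _ in scores), mean(r for _, r in scores)


# Hypothetical (true_pos, false_pos, n_ground_truth) triples for three workers.
sample = [(2, 3, 5), (3, 1, 5), (1, 4, 4)]
avg_precision, avg_recall = average_scores(sample)
print(f"precision={avg_precision:.2f}, recall={avg_recall:.2f}")
```

Because this kind of average weights each worker equally, it need not equal the pooled share of true positives over all facts, which may be why an averaged precision can differ slightly from a pooled one.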
3.6.1 Characterizing Worker Errors
Of the 500 worker-generated facts, 214 were classified as true positives (42.8%), and the remaining 286 (57.2%) were considered false positives. To present a systematic analysis of workers’ errors, we manually categorized each of the false positive facts into six categories by error type (see Table 3.2).
Figure 3.3: Precision and recall for individual dialogs. Worker performance is relatively consistent across all the different dialogs.
A: Completely wrong
13 facts, or 4.6% of the 286 false positives, were classified as completely wrong. These facts can be gleaned from pieces in the dialog, but no useful summarization took place. An example of a Category A fact: “very fact to the myself.”
B: Missing context
58 facts, or 20.3%, were missing a key piece of context for the fact. For instance, the fact “Green Bay, Wisconsin” is missing the critical piece of “located in” or “lives in”; the facts are not relevant without these critical pieces. Indeed, if the key pieces of information had been included in the worker summaries, these Category B facts would all be considered true positives.
C: Statement about the world
18 facts, or 6.3%, were statements about the world at large, rather than specifically about the requester. An example of a Category C fact: “tanks are like a glass or plastic cages that have no holes in them, you can keep fish in them.”
Category   Description                                  Makeup
A          Completely wrong                             13 (4.6%)
B          Missing context                              58 (20.3%)
C          Statement about the world                    18 (6.3%)
D          About requester, but short-term only         121 (42.3%)
E          About requester, but not in ground truth     45 (15.7%)
F          Presupposed information                      31 (10.8%)
Table 3.2: Summary stats for all 10 dialogs. There are two scenarios per dialog topic. Length = number of lines in each dialog; WF = the average number of worker facts for that dialog; and GTF = the number of ground truth facts.
D: Statement about the requester, but short-term relevance only
121 facts, or 42.3% of all false positives, were statements about (or involving) the requester, but were classified as relevant only in the short term. These facts include any specific preferences the requester expressed, as well as general information from the requester’s conversation. An example: “Spilt his coffee over laptop, the laptop is fried.”
E: Statement about the requester, but not picked up in ground truth
45 facts, or 15.7%, were statements about the requester, but neither author saved this information as part of the ground truth. Though we count these facts as wrong for our evaluation, we nevertheless recognize that they are important. These are facts that workers recognized as relevant for the future even though the authors did not, so they could be viewed as belonging to the true positive pile as well. An example: “Gives 20% tips for good service.”
F: Presupposed information
31 facts, or 10.8%, were summaries that contain presupposed information necessary for the facts in the ground truth. We defined “presupposed information” as background or expositional information about the requester that relates to the topic currently under discussion but is not helpful for future conversations. An example: “lives in an apartment.”
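The category breakdown above can be checked with a short tally. The sketch below uses only the counts reported in Table 3.2 and the total of 500 worker facts; the variable names are illustrative, and the final "lenient" figure is a hypothetical what-if that treats Category E as true positives, not a result reported in this study.

```python
# False positive counts per error category, copied from Table 3.2.
false_positive_counts = {"A": 13, "B": 58, "C": 18, "D": 121, "E": 45, "F": 31}
total_worker_facts = 500

total_false_pos = sum(false_positive_counts.values())
total_true_pos = total_worker_facts - total_false_pos


def share(count, total):
    """Percentage of `total` represented by `count`, rounded to one decimal."""
    return round(100 * count / total, 1)


# Share of each category among all false positives.
category_shares = {
    category: share(count, total_false_pos)
    for category, count in false_positive_counts.items()
}

# Illustrative what-if (an assumption, not a reported result):
# count Category E facts as true positives as well.
lenient_true_pos = total_true_pos + false_positive_counts["E"]
lenient_share = share(lenient_true_pos, total_worker_facts)

print(total_false_pos, total_true_pos)
print(category_shares)
print(lenient_share)
```

The six counts sum to the 286 false positives, leaving 214 true positives. Plain rounding gives 4.5% for Category A, marginally below the 4.6% listed in the table; the other shares match the table as listed.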
3.6.2 Case Study: Focusing Workers on Specific Time Frames
We see that worker responses contain numerous false positives, but not all of them are completely wrong. Some facts were marked as irrelevant only because they belonged to a shorter time frame. What happens to worker performance, then, if we already have a time frame in mind for relevance?
We investigate whether focusing workers on a specific time frame through guided instructions can improve performance, and we choose a time frame of six months (roughly the average of the two long-term conditions of one month and one year). For this experiment, workers use Mnemo’s interface to create facts, but they are no longer shown the TimeSelect view; rather, they are explicitly instructed at the beginning of the task to save facts that will still be relevant six months from now or later. We randomly selected Topic 1 and reran data collection for the two scenarios, again with 10 workers each.
Table 3.3 shows the results from this experiment: worker performance improves when workers are given a focused time frame in the task query, as they capture more true positives while creating fewer facts overall. Compared with the facts for Topic 1 in the “TimeSelect” condition, several worker error categories (A, B, and F) no longer appear in the “FocusedSelect” condition.