• Lifelog Dataset: This dataset consists of the lifelog of one participant and contains three media types: (1) the transcribed audio recordings of all conversations that the participant had in his daily life, (2) images captured at regular 30-second intervals using the Narrative Clip camera, and (3) the GPS coordinates of the participant's location when each conversation took place. The dataset covers 60 days of social interactions, during which our participant had conversations with 32 people. The conversations were recorded with the full awareness and consent of all the people involved. The text data (i.e., the transcriptions of the audio recordings) comprise 375 conversations totalling nearly 21 hours, with 27,476 unique tokens. Each conversation was manually transcribed by an online professional transcription service at a price of 1 USD per minute. The transcribers occasionally marked words whose transcription they were not confident about; using these markers, we corrected those words by hand. For the image data, 783 images were captured per day on average, and 2,493 images were captured during conversations overall. The Narrative Clip does not take photos when its lens is blocked, hence the variability in the number of captured images.
• Meetings Dataset: Our meetings dataset consists of two media types: (1) the transcribed audio recordings of all conversations of each participant, and (2) images captured at regular 30-second intervals by each participant’s wearable camera (the Narrative Clip). This camera clips onto a shirt and captures first-person-view images.
Our participants form five groups, each comprising two fixed individuals, with a third individual participating on rare occasions. In total, we recorded the data of nine unique participants in our dataset
39 3.5 Evaluation
                                       Group 1   Group 2   Group 3   Group 4   Group 5
Total # of Words                         21336      8642     14376     22122     24411
Ave. # of Words (per meeting)             4267    2160.5      3594      5530      6102
Total # of Unique Words                   4240      2225      3043      4152      3029
Ave. Duration of a Meeting (Seconds)      2337      1037      2511      2539      3079
Figure 3.4. Basic statistics of our dataset
ings using an online transcription service¹ at a cost. According to the service, the transcription error (due to human error) is 1%. The transcriptions of the conversations are time-stamped at fixed intervals of one minute. Later in this section, we explain how we use the time stamps to synchronize the transcribed text with other signals.
Basic statistics: Figure 3.4 presents some basic statistics of our dataset. It reports per-group statistics, such as the total number of words in all four meetings, the average number of words per meeting, and the number of unique words in all four meetings. Since our dataset was collected in a real-world setting, there are visible differences between the statistics and meeting behavior of different groups. This will enable us to examine the effectiveness of our memory augmentation system in a real-world setting.
Extracting Text Segments: We first extract text segments using the TextTiling [65] segmentation algorithm. This algorithm uses word co-occurrence patterns in sentences to detect changes of topic.
TextTiling [65] is “a technique for subdividing texts into multi-paragraph units that represent passages or subtopics”. It utilizes patterns of lexical co-occurrence and distribution as discourse cues for identifying major subtopic shifts. We note that the TextTiling algorithm cuts documents into segments only at sentence endings; therefore, each segment contains at least one sentence. Our purpose in using TextTiling is to split a conversation into topically coherent segments, so that we can assess the similarity of each segment of a conversation to what a participant recalls about that conversation.
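The idea behind this segmentation step can be illustrated with a much simplified sketch. It is not the TextTiling implementation used in this work: it only compares the vocabulary of small sentence blocks on either side of every sentence gap and cuts where the lexical similarity drops most sharply. The stop-word list, the block size, the use of the mean depth score as a cutoff, and the toy text are all assumptions made for illustration; the published algorithm additionally smooths the similarity curve and derives its cutoff from the mean and standard deviation of the depth scores.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the list used in the thesis).
STOP = {"the", "a", "an", "for", "is", "we", "on", "at", "with", "of", "and", "to"}

def split_sentences(text):
    # Naive splitter: break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def bag_of_words(sentences):
    counts = Counter()
    for s in sentences:
        counts.update(w for w in re.findall(r"[a-z]+", s.lower()) if w not in STOP)
    return counts

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def depth_score(gaps, g):
    # Climb to the nearest similarity peak on each side of gap g.
    left = gaps[g]
    for v in reversed(gaps[:g]):
        if v < left:
            break
        left = v
    right = gaps[g]
    for v in gaps[g + 1:]:
        if v < right:
            break
        right = v
    return (left - gaps[g]) + (right - gaps[g])

def segment(text, block=2):
    sents = split_sentences(text)
    # Lexical similarity across the gap between sentence i-1 and sentence i.
    gaps = [
        cosine(bag_of_words(sents[max(0, i - block):i]), bag_of_words(sents[i:i + block]))
        for i in range(1, len(sents))
    ]
    if not gaps:
        return [" ".join(sents)] if sents else []
    depths = [depth_score(gaps, g) for g in range(len(gaps))]
    cutoff = sum(depths) / len(depths)  # simplification: mean depth as cutoff
    # A boundary at gap g means a new segment starts at sentence g + 1.
    starts = [0] + [g + 1 for g, d in enumerate(depths) if d > cutoff] + [len(sents)]
    return [" ".join(sents[a:b]) for a, b in zip(starts, starts[1:])]

# Hypothetical conversation with two topics.
toy_text = (
    "The budget for the project is tight. We need more funding for the project budget. "
    "The funding deadline for the budget is Friday. The camera battery died on the trip. "
    "We charged the camera battery at the hotel. The trip ended with a new camera."
)
toy_segments = segment(toy_text)
```

Because cuts are placed only at sentence gaps, every returned segment contains at least one full sentence, which mirrors the property noted above.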
Computing memorability: We recorded four meetings for each group over four weeks. Immediately before the start of each meeting, we interviewed each participant, asking them to describe everything they remembered from their previous meeting. Thus, one week after each meeting, we held what we call a recall session, in which each participant described everything they could recall while being audio recorded. As with the meetings, the recordings of the recall sessions were then transcribed. Next, we created a topic model of each meeting by applying Latent Semantic Indexing (LSI) [67] to all segments of that meeting (after preprocessing steps such as stop-word removal and converting all words to lower case). The number of topics for each conversation was set to 20 so that the results would be comparable. Since the topic model serves only to compare meetings with recall sessions, we kept the number of topics the same across all meetings and did not experiment with other values. Subsequently, by querying the model with the corresponding recall session, we automatically computed how memorable each segment of a meeting was for each participant involved: every segment of the meeting was compared with the corresponding recall session in the LSI topic space, and their similarity was computed as the cosine similarity. By computing the semantic similarity between each segment and the corresponding recall session in this way, we produce objective labels of how much a participant remembered or forgot.
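The following is a minimal sketch of this computation, assuming NumPy is available. It builds the LSI space with a plain truncated SVD of a term-by-segment count matrix and is not necessarily the toolkit used in this work. The function names, the stop-word list, and the toy segments are hypothetical, and the number of topics is capped by the number of segments in such a small example.

```python
import re
import numpy as np

# Small illustrative stop-word list (an assumption).
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "we", "in", "on", "for", "is", "it"}

def tokenize(text):
    # Lower-case, keep alphabetic tokens, drop stop words.
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def build_lsi(segments, num_topics=20):
    vocab = sorted({w for seg in segments for w in tokenize(seg)})
    index = {w: i for i, w in enumerate(vocab)}
    # Term-by-segment count matrix.
    counts = np.zeros((len(vocab), len(segments)))
    for j, seg in enumerate(segments):
        for w in tokenize(seg):
            counts[index[w], j] += 1
    u, s, _ = np.linalg.svd(counts, full_matrices=False)
    k = min(num_topics, len(s))  # cannot exceed the rank of a tiny matrix
    return index, u[:, :k]

def project(text, index, u_k):
    # Fold a text into the topic space; words outside the vocabulary are ignored.
    q = np.zeros(u_k.shape[0])
    for w in tokenize(text):
        if w in index:
            q[index[w]] += 1
    return u_k.T @ q

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def memorability(segments, recall, num_topics=20):
    # One score per meeting segment: cosine similarity to the recall session.
    index, u_k = build_lsi(segments, num_topics)
    recall_vec = project(recall, index, u_k)
    return [cosine(project(seg, index, u_k), recall_vec) for seg in segments]

# Hypothetical meeting segments and recall transcript.
meeting_segments = [
    "we discussed the budget for the new project and the funding deadline",
    "the camera battery failed during the field trip last week",
    "students asked about the exam schedule and grading",
]
recall_session = "I remember the project budget and the funding"
scores = memorability(meeting_segments, recall_session)
```

In this toy case the recall transcript shares vocabulary only with the first segment, so that segment receives the highest score; with real transcripts the scores are graded rather than all-or-nothing.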
Finally, we average the similarity scores of all segments of each meeting to obtain one similarity score per participant per meeting. Moreover, for each set of four meetings of two participants, we use a softmax function to normalize the similarity scores against one another. The softmax function takes a number of input scores and normalizes each of them to a score between 0 and 1, such that the output scores sum to 1. By doing so, we compute the similarity scores presented in Tables 3.2 and 3.2. We note that the scores reported in the two tables represent the average scores of two similar conditions (e.g., two meetings with condition B).
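As a small illustration of the averaging and normalization step, the sketch below averages hypothetical per-segment similarity scores for each of four meetings and then applies a softmax across the four meeting-level scores. The numbers are invented for illustration and are not results from our dataset.

```python
import math

def mean(values):
    return sum(values) / len(values)

def softmax(scores):
    # Subtracting the maximum does not change the result but avoids overflow.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical per-segment similarity scores of one participant for four meetings.
segment_scores = [
    [0.40, 0.55, 0.31],
    [0.22, 0.35],
    [0.61, 0.48, 0.57, 0.50],
    [0.30, 0.42, 0.39],
]
meeting_scores = [mean(s) for s in segment_scores]  # one score per meeting
normalized = softmax(meeting_scores)                # each in (0, 1), summing to 1
```

Softmax preserves the ranking of the meetings, so the meeting with the highest average similarity also has the highest normalized score.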