For our investigation of meeting search, we assume the scenario of a meeting participant who wants to find all locations in meetings where the topic of a PowerPoint slide was discussed, regardless of whether the slide was projected at that time.
The PowerPoint slides provided with the AMI corpus were used in multiple meetings, thus creating different instances of conversations by different participants about the same topic. The topic on a projected slide could also have been discussed at other points in the same or other meetings while the slide was not being projected, leading to further conversations about the same topic, possibly by the same people.
Query set
For our investigation, we took a subset of 35 of the PowerPoint slides provided with the AMI corpus as a topic set, selected according to the following criteria: the text of the slide should be reasonably detailed (more than 15 content words); the slides should be diverse in structure (lists of actions, sentences describing work to be done) and in situation of use (beginning or closing of the meeting); and they should have differing numbers of possible relevant documents (from slides used only once, as in a known-item search, to slides that were used in almost every set of 4 meetings). Figure 4.3 shows an example of the contents of a slide used as a search topic. We split this set into 10 queries for the development set and 25 queries for the test set; all of them are listed in Appendix C.
The search task was to retrieve all segments relevant to the topic discussed in the slide. It is thus a recall-focused search task, which aims to support meeting participants looking for all discussed material relevant to each query slide. Missing even one instance of the relevant content is considered a failure to provide the user with the requested information: any individual relevant segment may be the one that the user is looking for, and finding only some of the points in meetings where the relevant information is mentioned may not fulfill the task goal if the target information is not among the retrieved material. The target relevant material may come from discussions by the same participants, or from a discussion by other participants examining topically related issues.
Relevance Assessment
In order to carry out our search experiments, manual relevance assessments identifying the relevant content for each slide topic were generated using a pooling procedure. It would have been impractical and overly time-consuming to build the pool from documents containing whole meetings and then to look for the relevant parts within them. We therefore started by segmenting the content using varying strategies, and then carried out retrieval on the resulting versions of the test collection. The initial runs created to collect the union for the pooling assessment had to be representative of our further experiments, so we used varying segmentation methods that correspond to the approaches we are interested in exploring. A special tool was written in Java that mapped retrieval results from different runs back to the manual transcript of the original document and highlighted the portions that needed to be checked manually for relevance. All transcript words have time stamps; therefore, once a relevant region is defined, it can be used to assess any other run with potentially different segment boundaries, as long as the time information is preserved for the new run. In the rest of this section we describe the segmentation methods and the retrieval procedure used to create the initial results, and how these results were assessed using the pooling procedure, which was previously used for the assessment of broadcast news SCR results, as mentioned in Section 3.1.
Segmentation. The AMI collection as provided already contains manually created topic segmentations of the transcript. Topics and subtopics form a hierarchical structure, where labels have been assigned by annotators choosing tags from a list of suggestions. This topic segmentation was made on the basis of the manual transcripts, but it does not cover all of the meetings in the dataset: these segments are provided for only a subset of 139 of the total of 173 meetings. Since our goal is to investigate the impact of the segmentation of spoken material on retrieval results where no manual segmentations are provided, we decided to automatically segment the AMI meeting transcripts ourselves. We segmented the manual transcript using simple time- or length-based methods, and content-based algorithms. For the content-based segmentation we used Choi's popular C99 algorithm (Choi, 2000), Hearst's TextTiling algorithm (Hearst, 1997), Minimum Cut (Malioutov and Barzilay, 2006), and the method of Hsueh and Moore (Hsueh and Moore, 2006). All these methods work on the level of sentences. However, current ASR systems do not by default provide punctuation in their transcript output. Thus we needed to use pseudo-sentences of reasonable length for this collection. The length value was calculated as the average length of the sentences in the manual transcript.
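The pseudo-sentence construction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the tooling used in the experiments: the function names `average_sentence_length` and `pseudo_sentences` are hypothetical, length is assumed to be counted in word tokens, and the average is assumed to be rounded to the nearest integer.

```python
def average_sentence_length(sentences):
    """Average number of tokens per sentence, rounded to an integer.

    `sentences` is a list of token lists taken from a punctuated
    (manual) transcript. Rounding is an assumption of this sketch.
    """
    if not sentences:
        raise ValueError("at least one sentence is required")
    return round(sum(len(s) for s in sentences) / len(sentences))


def pseudo_sentences(words, length):
    """Cut an unpunctuated token stream into consecutive chunks of
    `length` tokens; the final chunk may be shorter."""
    if length <= 0:
        raise ValueError("length must be positive")
    return [words[i:i + length] for i in range(0, len(words), length)]


# Hypothetical toy data: two manual sentences of 3 and 5 tokens.
manual = [["we", "start", "now"],
          ["the", "remote", "needs", "fewer", "buttons"]]
unit = average_sentence_length(manual)

# An unpunctuated token stream standing in for ASR output.
stream = "so the first item on the agenda is the budget".split()
chunks = pseudo_sentences(stream, unit)
```

The resulting chunks can then be passed to a sentence-level segmentation algorithm in place of true sentences.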
Retrieval. The segments obtained from the manual transcripts using each segmentation technique were indexed for search using a version of the SMART information retrieval system extended to use language modelling (a multinomial model with Jelinek-Mercer smoothing) with a uniform document prior probability (Hiemstra, 2001), as introduced in equation 2.3 in Section 2.1.2. The retrieval model used λ_i = 0.3 for all q_i, the value having been optimized on the TREC-8 ad hoc retrieval dataset. Stopwords were removed using the standard SMART stopword list, and the remaining content words were stemmed using a variant of the Lovins stemmer (Lovins, 1968), which is packaged in SMART by default.
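To make the scoring concrete, the following is a minimal sketch of query-likelihood ranking with Jelinek-Mercer smoothing. It is not the SMART extension itself. It assumes that λ weights the segment (document) model and 1 - λ the collection model, that the uniform prior can be dropped because it adds the same constant to every segment, and that query terms absent from the whole collection are skipped. The function name `jm_log_score` and the toy data are hypothetical; stopword removal and stemming are omitted.

```python
import math
from collections import Counter


def jm_log_score(query, segment, collection_tf, collection_len, lam=0.3):
    """Log query likelihood of `segment` under a multinomial model
    with Jelinek-Mercer smoothing.

    Assumption: `lam` weights the segment model and (1 - lam) the
    collection model. The uniform prior is omitted (rank-invariant).
    """
    seg_tf = Counter(segment)
    seg_len = len(segment)
    score = 0.0
    for term in query:
        p_seg = seg_tf[term] / seg_len if seg_len else 0.0
        p_col = collection_tf.get(term, 0) / collection_len
        p = lam * p_seg + (1.0 - lam) * p_col
        if p == 0.0:
            continue  # term unseen in the collection: skipped (assumption)
        score += math.log(p)
    return score


# Hypothetical toy collection of two segments.
segments = {
    "s1": ["remote", "control", "design"],
    "s2": ["budget", "meeting", "agenda"],
}
collection_tf = Counter(t for seg in segments.values() for t in seg)
collection_len = sum(collection_tf.values())

query = ["remote", "design"]
scores = {sid: jm_log_score(query, seg, collection_tf, collection_len)
          for sid, seg in segments.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

Smoothing with the collection model keeps the score finite for segments that lack a query term, which matters for short segments such as those produced by fine-grained segmentation.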
Pooling procedure. To carry out the relevance assessments, the following pooling procedure was adopted.
1. Retrieval runs were carried out for each topic using segments created using the different segmentation schemes.
2. The top 50 retrieved results for each run were collected and compiled into a pool for each of the topics. (We chose the number of top-ranked documents to be assessed empirically, assuming that it is reasonable to expect the user in a real-case scenario to try to browse through this number of retrieved documents.)
3. An interactive application was developed which highlighted the union of the retrieved segments in the original documents they belong to, i.e. if one meeting had several different segments in the pool, the whole area between the beginning of the first of the segments and the end of the last of the segments was highlighted for assessment.
4. Relevant regions between the beginning of the first segment of each segment group and the end of the last one were marked manually by an assessor.
5. After defining the relevant region for the manual transcript for each topic, this information was projected onto each segment unit based on time correspondence in order to create individual relevance files for each of the segmentation techniques. Thus for each segment in each segment set, we know the beginning and end points of all assessed relevant content.
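Steps 2, 3 and 5 of the procedure above can be sketched for a single topic as follows. This is an illustration under stated assumptions, not the Java tool described in this section: the function names, the tuple layouts and the toy time stamps (in seconds) are hypothetical, and a segment is treated as containing relevant content whenever its time span overlaps a marked region.

```python
def build_pool(runs, depth=50):
    """Pool the top `depth` results of each run for one topic.

    `runs` maps a run name to a ranked list of
    (meeting_id, start, end) segments. For each meeting the result is
    the span from the earliest start to the latest end of any pooled
    segment, i.e. the area highlighted for the assessor.
    """
    spans = {}
    for ranked in runs.values():
        for meeting, start, end in ranked[:depth]:
            lo, hi = spans.get(meeting, (start, end))
            spans[meeting] = (min(lo, start), max(hi, end))
    return spans


def project_relevance(segments, relevant_regions):
    """Project assessed regions onto one segmentation of one meeting.

    `segments` is a list of (segment_id, start, end);
    `relevant_regions` is a list of (start, end) marked by the assessor.
    Returns, per overlapping segment, the begin and end points of the
    relevant content that falls inside it.
    """
    projected = {}
    for seg_id, seg_start, seg_end in segments:
        overlaps = []
        for rel_start, rel_end in relevant_regions:
            lo, hi = max(seg_start, rel_start), min(seg_end, rel_end)
            if lo < hi:  # non-empty time overlap
                overlaps.append((lo, hi))
        if overlaps:
            projected[seg_id] = overlaps
    return projected


# Hypothetical runs from two segmentation schemes.
runs = {
    "c99": [("m1", 10.0, 20.0), ("m2", 0.0, 5.0)],
    "texttiling": [("m1", 15.0, 40.0)],
}
pool = build_pool(runs)

# Hypothetical fixed-length segmentation of meeting m1 and one
# relevant region marked inside the highlighted span.
fixed = [("a", 0.0, 30.0), ("b", 30.0, 60.0), ("c", 60.0, 90.0)]
qrels = project_relevance(fixed, [(25.0, 35.0)])
```

Because the projection relies only on time stamps, the same marked regions can be reused for any new segmentation without further manual assessment.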
Figure 4.4: Number of terms with collection frequency equal to 1-10.
While this pooling procedure only uses results generated using one retrieval model, the diversity of the segmentation schemes applied to the content means that the pool contains a wide variety of content originating in segments generated using different methods. Additionally, since the relevance labelling tool requires the assessor to examine all data from the beginning of the first retrieved content for a meeting to the end of the last one, the assessor also looks at content not retrieved by any scheme, giving the relevance assessment greater coverage than would otherwise be the case.