Classification of texts according to their genres can be achieved by extracting a range of higher-level features, such as combinations of POS tags, parse trees or rhetorical relations (Santini, 2007). However, lower-level features based on character n-grams have been found to offer a surprisingly efficient method for detecting genres without requiring heavy linguistic resources (Kanaris and Stamatatos, 2007). In a comparative evaluation, the performance of character n-grams can exceed what is achieved by resource-heavier approaches. For example, plain n-grams can successfully generalise over dates (.*day for yesterday, today, Friday), which are typical of reporting, as well as over nominalisations (.*tion) and passives (.*ed by), which are typical of scientific discourse (Forsyth and Sharoff, 2014).
The frequencies of character n-grams can be used directly as features in machine learning algorithms. Support Vector Machines (Smola and Schölkopf, 2004) or Relevance Vector Machines (Tipping, 2001) can be used to experiment with classification parameters.
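As an illustration of how such features can be derived, the following minimal sketch computes relative character n-gram frequencies for a single text. The function names, the lowercasing and whitespace normalisation, and the sample sentence are assumptions made for the example; they are not taken from the procedure used in this study.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of a text.

    Lowercasing and whitespace normalisation are illustrative choices.
    """
    text = " ".join(text.lower().split())
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_frequencies(text, n=3):
    """Map each character n-gram to its relative frequency in the text."""
    grams = char_ngrams(text, n)
    total = len(grams)
    if total == 0:
        return {}
    return {gram: count / total for gram, count in Counter(grams).items()}


# Hypothetical sample sentence, not drawn from the corpus.
sample = "The application was submitted on Friday and reviewed by the board today."
freqs = ngram_frequencies(sample, n=3)

# The trigram "day" is shared by "Friday" and "today", so a single
# feature covers both date expressions.
day_count = Counter(char_ngrams(sample, n=3))["day"]
```

The resulting dictionary can be turned into a fixed-length feature vector by fixing an n-gram vocabulary over the training set, which is the form a classifier such as an SVM or RVM would expect.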
The advantage of RVM is its ability to produce a small number of relevance vectors (the RVM counterpart of support vectors), leading to better learning generalisation in the case of relatively sparse data – for instance, only 25 positive examples were identified for A9 (legal texts). The task is to predict whether a webpage scores high in each FTD. Table 4.2 reports the commonly used F1 measure, obtained with cross-validation, for the detection of each FTD.
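For reference, the F1 measure is the harmonic mean of precision and recall. The sketch below shows one way of computing it for a binary task, together with a simple generator of cross-validation splits. The number of folds, the interleaved fold assignment and the toy labels are assumptions made for the example, since the exact cross-validation setup is not specified here.

```python
def f1_score(gold, predicted):
    """F1 for binary labels: harmonic mean of precision and recall."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)
    fp = sum(1 for g, p in pairs if not g and p)
    fn = sum(1 for g, p in pairs if g and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def k_fold_splits(n_items, k):
    """Yield (train_indices, test_indices) pairs for k interleaved folds."""
    folds = [list(range(start, n_items, k)) for start in range(k)]
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(n_items) if i not in held]
        yield train, held_out


# Toy labels (hypothetical): 1 = page scores high on the dimension.
gold = [1, 1, 1, 0, 0]
predicted = [1, 1, 0, 1, 0]
score = f1_score(gold, predicted)  # precision = recall = 2/3

splits = list(k_fold_splits(10, 5))
```

In a cross-validated evaluation, a classifier would be trained on each set of training indices and scored on the corresponding held-out indices, and the per-fold F1 values would then be averaged.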
Once reliable classifiers have been produced for each dimension, they are applied to the entire corpus of academic webpages, excluding the texts that were used in the training set.
Table 4.1 Excerpts from positive examples for each FTD in the training set.
FTD Positive example
A7 – instruct Enrollment procedure. Before departure Latin American students who wish to apply to the Inclinados hacia América Latina must submit a pre-enrollment application to the Italian diplomatic authority competent for their geographic region, specifying an interest to participate in the project.
A8 – hardnews Deeyah Khan, artist and champion of women’s rights, is awarded the University of Oslo’s Human Rights Award. Deeyah Khan has shed an important light on women’s rights and freedom of speech. As a young Norwegian-Pakistani musician in Norway she experienced being threatened to silence by conservative forces in the Pakistani environment and had to leave Norway at the age of 17.
A9 – legal III. Requirements, assignment and admissions procedure. Art.2. Those students to which these regulations apply may not be first year students at their home university, nor join the first course that leads to no UAM qualification. Art.3. Students from overseas centres must have at least an intermediate level of Spanish in order to be able to study at UAM, except for those disciplines where the centre to which they wish to be assigned considers basic knowledge sufficient.
A12 – compuff The University of Freiburg is one of only six universities in Germany to be distinguished in the “Excellent Teaching” competition organized in October 2009 by the Stifterverband and the Standing Conference of the Ministers of Education and Cultural Affairs. The reasons for the university’s success in this competition are the quality of its existing course offerings and its overall concept for instructional development, “Windows for Higher Education.”
A14 – academ Trends in biodiversity dynamics are studied in a broader context of geological history, with a special emphasize on the Quaternary period. Phenomena such as speciation, hybridization, genome evolution, phenotypic plasticity, developmental processes, and changes of behavioural or morphological traits are investigated using modern -omics techniques in combination with morphological, ecological and behavioural approaches.
A16 – info The Managing Board is an administrative body that primarily decides on economic matters and ensures the smooth material operations of the university. The Managing Board has a classification committee and may set up other committees and working bodies as required. The Board consists of nine members.
A21 – narrate Geology was first introduced at Moscow State University at the beginning of the 19th century. In 1804 the Chair of Mineralogy and Rural Home Economics was established at the Department of Physical and Mathematical Sciences. Within the same year the Department of Natural History and the Mineralogy museum were founded.

Table 4.2 Manual annotation of the training set and F-measure.
                   A7    A8    A9    A12   A14   A16   A21
% in training set  8.4   5.0   3.2   8.5   6.3   13.6  5.5
F-measure          0.95  0.92  0.96  0.85  0.93  0.79  0.94

Each classifier predicts a score ranging from 0 to 1 for each page in the test set. A crucial part here is to translate a numerical output into meaningful data. In other words, classifiers produce a score that has to be interpreted in order to establish which pages score on each dimension with minimal noise outside the training set. The closer to 1, the more likely that a file scores highly on that dimension, but each dimension may have its own threshold for establishing confidence. For example, the A7 classifier (instruct) assigned a 0.99 value to the text below, which is indeed instructing users on how to apply to university on paper:
Please send an email to [email protected] requesting a copy of the paper application form. Please include a brief explanation of why a paper application form is required; in many cases the Graduate Admissions Office may be able to suggest a more preferable application method.4
However, there is no straightforward way of establishing which specific value differentiates between pages that highly represent that dimension and pages that do not include that dimension at all. There are two ways of performing this task. One can experiment with thresholds to achieve the desired precision – for instance through a post-hoc evaluation of texts – or fix an arbitrary threshold with no subjective interpretation at all. The former implies a second round of human evaluation, which may lead to a circular process, since automatic classification includes human ratings in the first place. The latter can be accomplished, for example, by choosing the top-n values as highly representative of any dimension. Although this second approach may not account for natural variation in the distribution of text types over university websites (e.g. it is likely that legal texts account for a very low proportion of pages as compared to instructional texts), it can be viewed as an unbiased way of establishing reliable thresholds. Therefore, the scores produced for each dimension have been divided into deciles, where the first decile (1) identifies pages scoring highest on that dimension and the last decile (10) identifies pages scoring lowest on that dimension. Each decile accounts for about 3,000 texts. Functional Text Dimensions (FTDs) and their corresponding values (translated into deciles) were encoded in the corpus as metadata (Figure 4.1).
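The translation of classifier scores into deciles can be sketched as follows. This is a minimal rank-based illustration that assumes equal-sized bins and ties broken by input order; the function name and the toy scores are hypothetical and do not reproduce the exact procedure used to annotate the corpus.

```python
from collections import Counter


def assign_deciles(scores):
    """Map each score to a decile rank.

    Decile 1 holds the highest-scoring tenth of the pages and decile 10
    the lowest-scoring tenth. Ties are broken by input order (assumption).
    """
    n = len(scores)
    # Indices of the pages, ordered from highest to lowest score.
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    deciles = [0] * n
    for rank, index in enumerate(order):
        deciles[index] = rank * 10 // n + 1
    return deciles


# Toy scores for 20 pages (hypothetical values between 0 and 1).
scores = [i / 20 for i in range(20)]
deciles = assign_deciles(scores)
bin_sizes = Counter(deciles)  # two pages per decile in this toy case
```

Because the bins are defined by rank rather than by score value, every dimension receives the same number of pages per decile, regardless of how its scores are distributed.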
Section 4.5 provides an exploratory post-hoc evaluation of top-ranked pages in the promotional dimension and a few examples of low-scoring pages.
4 Full text available at: http://www.graduate.study.cam.ac.uk/applying-paper [last consulted on 15 December 2017].