This section describes the internal composition of the corpus with respect to the design criteria introduced in Section 3.1.
The final corpus contains approximately 21M tokens and 35K texts (see Table 3.2). The number of texts is almost equally split between ELF and native English countries (17,383 texts in ELF vs. 17,562 texts in native English), even though there are 27 countries in the ELF subcorpus against 2 countries in the native English one. This might be because English-language webpages are many times more numerous in English-speaking countries than in ELF countries. As for the number of tokens, the native subcorpus is somewhat larger than its ELF counterpart (about 11.8M vs. 9.4M tokens), which may suggest that (some) ELF texts are somewhat shorter.
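The suggestion that ELF texts are somewhat shorter can be checked with simple arithmetic on the totals reported in Table 3.2. The following is a minimal sketch in Python. The variable and function names are illustrative assumptions, not part of the corpus tooling, and a mean says nothing about how text lengths are actually distributed.

```python
# Totals copied from Table 3.2 (tokens and texts per subcorpus).
TABLE_3_2 = {
    "ELF": {"tokens": 9_375_739, "texts": 17_383},
    "NAT": {"tokens": 11_813_692, "texts": 17_562},
}


def mean_tokens_per_text(stats):
    """Return the average text length, in tokens, for one subcorpus."""
    return stats["tokens"] / stats["texts"]


# Illustrative name: average text length per subcorpus.
mean_length = {name: mean_tokens_per_text(s) for name, s in TABLE_3_2.items()}

for name, value in mean_length.items():
    print(f"{name}: {value:.0f} tokens per text on average")
```

On these totals the averages come to roughly 540 tokens per text for ELF and roughly 670 for native English, which is consistent with the interpretation given above.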
As for the number of universities per country, an initial sample of 100 universities was selected on a proportional basis, according to the number of universities listed for each country in the QS World University Rankings (cf. Section 3.2). Approximately 10 universities were discarded during crawling and post-processing, either because their website homepage could not be fetched or because the retrieved webpages were discarded during language identification. The final corpus includes 91 universities, 78 of which come from ELF countries, whereas 13 come from European countries where English is a native language.
A map is also provided in Figure 3.2, displaying all universities included in the corpus.
From a geographical point of view, it is not surprising that most universities are located in capital cities, which host two or even three universities, while other regions remain completely uncovered. Yet geographical distribution within single countries is not a priority of this project, which aims at analysing English texts produced by leading universities in Europe, all of which are ranked among the top positions worldwide.
In addition to the ELF vs. NAT perspective, the corpus can also be split by language family. The language family of each country’s official language may help to understand whether and how any difference or similarity in the use of English is related to the authors’ native language.
10 The corpus is available at: https://corpora.dipintra.it/ [last consulted on 15 December 2017]. Permission can be requested from the author.
3.4 Corpus composition 67
Fig. 3.2 Map of universities included in the corpus.
Table 3.2 Corpus statistics by English language variety (ELF and native English).
              ELF         NAT         Total
Tokens        9,375,739   11,813,692  21,189,431
Texts         17,383      17,562      34,945
Universities  78          13          91
Countries     27          2           29
Table 3.3 offers detailed information on corpus composition, while Figure 3.3 illustrates the percentage of tokens in the corpus by language family. ELF content produced in countries whose official language is of Germanic origin accounts for roughly 20% of the whole corpus, while textual content produced in Romance-language countries accounts for approximately 10%. Uralic and Slavic languages are each represented by about 5% of corpus tokens, while the Hellenic (Greece) and Baltic (Lithuania and Latvia) regions each account for less than 1% of the tokens in the corpus.
Table 3.3 Detailed corpus statistics. Native texts are highlighted in bold.
Language family    Country          Status  N. of texts  N. of tokens
Baltic             Lithuania        ELF     39           36,552
Baltic             Latvia           ELF     161          72,568
Germanic           Austria          ELF     352          185,224
Germanic           Germany          ELF     2,674        1,269,884
Germanic           Denmark          ELF     1,059        779,139
Germanic           Netherlands      ELF     1,845        801,244
Germanic           Norway           ELF     657          283,059
Germanic           Sweden           ELF     941          680,928
Germanic           United Kingdom   NAT     13,773       9,069,383
Germanic-Celtic    Ireland          NAT     3,789        2,744,309
Germanic-Romance   Belgium          ELF     722          408,088
Hellenic           Greece           ELF     30           14,881
Romance            Spain            ELF     1,155        603,882
Romance            France           ELF     1,258        633,523
Romance            Italy            ELF     1,263        620,940
Romance            Portugal         ELF     234          117,919
Romance            Romania          ELF     111          58,915
Romance-Germanic   Switzerland      ELF     1,767        807,456
Slavic             Belarus          ELF     81           46,291
Slavic             Czech Republic   ELF     299          183,370
Slavic             Poland           ELF     96           63,443
Slavic             Serbia           ELF     83           40,606
Slavic             Russia           ELF     554          530,522
Slavic             Slovenia         ELF     123          95,309
Slavic             Slovakia         ELF     18           7,905
Slavic             Ukraine          ELF     44           30,632
Uralic             Estonia          ELF     324          176,162
Uralic             Finland          ELF     1,382        771,860
Uralic             Hungary          ELF     111          55,437
Fig. 3.3 N. of tokens (%) by language family.
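The token shares by language family can be recomputed directly from the figures in Table 3.3. The sketch below is one possible way of doing so in Python, not the procedure used to produce Figure 3.3. The names `ROWS` and `token_share_by_family` are assumptions introduced for illustration. Native and ELF content are kept apart, and the mixed families (Germanic-Celtic, Germanic-Romance, Romance-Germanic) are treated as separate groups, as in the table.

```python
from collections import defaultdict

# (language family, country, status, N. of tokens), copied from Table 3.3.
ROWS = [
    ("Baltic", "Lithuania", "ELF", 36_552),
    ("Baltic", "Latvia", "ELF", 72_568),
    ("Germanic", "Austria", "ELF", 185_224),
    ("Germanic", "Germany", "ELF", 1_269_884),
    ("Germanic", "Denmark", "ELF", 779_139),
    ("Germanic", "Netherlands", "ELF", 801_244),
    ("Germanic", "Norway", "ELF", 283_059),
    ("Germanic", "Sweden", "ELF", 680_928),
    ("Germanic", "United Kingdom", "NAT", 9_069_383),
    ("Germanic-Celtic", "Ireland", "NAT", 2_744_309),
    ("Germanic-Romance", "Belgium", "ELF", 408_088),
    ("Hellenic", "Greece", "ELF", 14_881),
    ("Romance", "Spain", "ELF", 603_882),
    ("Romance", "France", "ELF", 633_523),
    ("Romance", "Italy", "ELF", 620_940),
    ("Romance", "Portugal", "ELF", 117_919),
    ("Romance", "Romania", "ELF", 58_915),
    ("Romance-Germanic", "Switzerland", "ELF", 807_456),
    ("Slavic", "Belarus", "ELF", 46_291),
    ("Slavic", "Czech Republic", "ELF", 183_370),
    ("Slavic", "Poland", "ELF", 63_443),
    ("Slavic", "Serbia", "ELF", 40_606),
    ("Slavic", "Russia", "ELF", 530_522),
    ("Slavic", "Slovenia", "ELF", 95_309),
    ("Slavic", "Slovakia", "ELF", 7_905),
    ("Slavic", "Ukraine", "ELF", 30_632),
    ("Uralic", "Estonia", "ELF", 176_162),
    ("Uralic", "Finland", "ELF", 771_860),
    ("Uralic", "Hungary", "ELF", 55_437),
]


def token_share_by_family(rows):
    """Return {(family, status): percentage of all corpus tokens}."""
    total = sum(row[3] for row in rows)
    per_group = defaultdict(int)
    for family, _country, status, tokens in rows:
        per_group[(family, status)] += tokens
    return {group: 100 * tokens / total for group, tokens in per_group.items()}


shares = token_share_by_family(ROWS)

# Print the groups from the largest to the smallest share.
for (family, status), pct in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{family:<17} {status}  {pct:5.1f}%")
```

The token counts in `ROWS` add up to the corpus total given in Table 3.2, which offers a quick consistency check between the two tables.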
The next Chapter describes automatic classification of texts according to Functional Text Dimensions (FTDs) and reports on the post-hoc evaluation that was carried out for texts scoring high on the promotional dimension.
Chapter 4
Automatic classification
4.1 Introduction
This Chapter deals with the classification of webpages in the corpus by measuring their linguistic distance according to pre-defined criteria that performed well in previous experiments (Forsyth and Sharoff, 2014). In particular, it aims at applying Sharoff’s method for quantifying text similarity using human judgment as a reference standard (Sharoff, 2018).
The classification is meant to increase the usability of the corpus, as well as to provide input for the analysis of promotional language described in Chapter 6.
Traditional approaches to document classification adopt either internal (linguistic) criteria or external (situational) criteria. As already discussed in Section 2.4.2, the former are associated with bottom-up classifications – e.g. the Multi-Dimensional analysis conducted by Biber (1988) – the latter with top-down procedures where texts are categorised according to non-linguistic parameters such as author, recipient, field, and genre category. This latter approach is frequently chosen within sociolinguistics and discourse/rhetorical studies (e.g. Swales (1990); see Biber et al. (2007) for a full account). Experiments conducted in Forsyth and Sharoff (2014) and Sharoff (2018) have moved in a new direction by adopting statistical measures to quantify linguistic (dis)similarity across documents and comparing the output with a text-external standard, i.e. human judgement. This approach is particularly relevant to this PhD project, which has a strong focus on readers’ perception and on how language impacts upon readers’ choices, rather than on how language reflects university power structures and their underlying ideologies. Therefore, the same methodology and criteria used in Sharoff (2018) can be applied to university webpages, so as to identify text types as perceived by website users. The main stages include:
1. development of a set of relevant descriptors that can represent the main text types of university websites (Section 4.2);
2. manual annotation of a random sample of academic pages serving as a training set for automatic classification (Section 4.3);
3. automatic classification of the remaining pages (test set) on each functional dimension (Section 4.4).
Each of these stages will be described in the next Sections.
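The division of pages between stages 2 and 3 can be pictured with a short sketch: a random sample is drawn for manual annotation (training set) and every remaining page is left for automatic classification (test set). This is only an illustration in Python under stated assumptions. The page identifiers, the sample size, the seed and the function name `split_for_annotation` are all hypothetical, and the sketch does not reproduce the sampling procedure actually adopted in Section 4.3.

```python
import random


def split_for_annotation(page_ids, sample_size, seed=0):
    """Split pages into a random annotation sample and the remainder.

    Returns (training, test): `training` is a sorted random sample of
    `sample_size` page ids; `test` holds all other ids in original order.
    """
    rng = random.Random(seed)  # fixed seed so that the sample is reproducible
    training = set(rng.sample(sorted(page_ids), sample_size))
    test = [pid for pid in page_ids if pid not in training]
    return sorted(training), test


# Hypothetical page identifiers and sample size, for illustration only.
pages = [f"page_{i:05d}" for i in range(1000)]
training_set, test_set = split_for_annotation(pages, sample_size=100)

print(len(training_set), "pages to annotate manually")
print(len(test_set), "pages left for automatic classification")
```

Fixing the seed keeps the annotated sample stable across runs, which matters if annotation is spread over several sessions.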