To obtain the data, that is to say, the individual examples of uses of conditionals in scientific writing which are analysed in Chapter 5, searches for the relevant particles have been made in the selected subcorpora of the CC with the help of the Coruña Corpus Tool.
The Coruña Corpus Tool (CCT) is a corpus management tool developed by the IRLab of the Universidade da Coruña in collaboration with the MuStE Research Group. The CCT has been specifically designed under the supervision of the compilers of the Coruña Corpus for use with the texts of the CC “to help linguists to extract and condense valuable information for their research” (Parapar & Moskowich 2007: 290). Even though, as already explained in Section 1.3 above, the design of the CCT has influenced the process of text computerisation by imposing some limits, it has been designed so as to fulfil the criteria of faithfulness and usability to the utmost degree.
The main feature of the CCT is that it creates an index from the set of samples compiled in the CC and, in order to conduct searches, it works with this index and not with plain text, as most concordance programmes do (Lareo & Moskowich 2012: 8). Moreover, the CCT can extract information using several criteria (date, genre, or discipline of the samples, among others), as it has been built with a multi-field index based on the metadata file accompanying each sample.
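The difference between searching an index and scanning plain text can be illustrated with a minimal Python sketch. It does not reproduce the CCT's implementation: the sample identifiers, the metadata fields, the texts and the helpers `build_index` and `search` are all hypothetical, chosen only to show how a term lookup can be combined with metadata criteria.

```python
from collections import defaultdict

# Hypothetical samples: (identifier, metadata, text). Not real CC data.
SAMPLES = [
    ("s1", {"discipline": "Astronomy", "year": 1717}, "if the comet be seen"),
    ("s2", {"discipline": "Philosophy", "year": 1802}, "if the mind be free"),
    ("s3", {"discipline": "Astronomy", "year": 1830}, "the comet was seen"),
]

def build_index(samples):
    """Map each term to the (sample id, word position) pairs where it occurs."""
    index = defaultdict(list)
    for sample_id, _meta, text in samples:
        for position, word in enumerate(text.split(), start=1):
            index[word.lower()].append((sample_id, position))
    return index

def search(index, samples, term, **filters):
    """Look the term up in the index and keep hits whose sample matches the filters."""
    metadata = {sample_id: meta for sample_id, meta, _text in samples}
    return [
        (sample_id, position)
        for sample_id, position in index.get(term.lower(), [])
        if all(metadata[sample_id].get(key) == value for key, value in filters.items())
    ]

INDEX = build_index(SAMPLES)
print(search(INDEX, SAMPLES, "comet", discipline="Astronomy"))
print(search(INDEX, SAMPLES, "if"))
```

In this sketch the texts are read only once, when the index is built; every later query is a dictionary lookup followed by a metadata check.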
Once in the CCT, after the process of launching the Corpus is completed94, the user finds three main tabs: “Search”, “Tags” and “Info”.
3.1. Info tab
The Info tab allows users to see a tree in which the different samples are presented, organised according to the subcorpus they belong to. Once any sample is selected, two files will be shown, called Metadata and Document.
The Metadata file, shown in Figure 4.9 below, presents information about the author and the sample.
This includes demographic data (sex, date of birth and death, occupation and place of education), a short biography of the author, and bibliographical information about the sample (data sampled, genre, year and place of publication, age of the author at publication and further information about the text, especially concerning its edition).
94 See Lareo & Moskowich (2012: 3-4).
Figure 4.9: Metadata file of William Whiston’s 1717 sample.
Figure 4.10: Document file of Curson’s 1702 sample.
The Document file, as shown in Figure 4.10 above, presents the computerised output of the file, representing the pertinent sample following the editorial decisions explained in Section 1.3 above.
This is “where the team’s editorial work is seen best” (Lareo & Moskowich, 2012: 7), as it features the maximum faithfulness towards the original text.
3.2. Search tab
The Search tab is the main interface of the CCT. As can be seen in Figure 4.11 below, most of it is occupied by a large window in which results are shown.
Figure 4.11: View of the Search window of the CCT.
Figure 4.12: View of the subset selection window of the CCT.
This results window is surrounded by the two main options of the Search tool, which allow users to perform searches and to generate term lists, in a series of menus above and below the results window, respectively. In both features, the leftmost pull-down menu allows users to restrict the number of text samples to be analysed. Users can choose to conduct searches in the whole index, in a single document from the list, or in a subset. If the option “subset” is selected, a new window pops up, as shown in Figure 4.12 above.
This new window allows users to select different samples manually, as well as to select them with a series of tools. Among these, “select all” and “deselect all” are self-explanatory. Clicking “selection with metadata” opens another window (shown in Figure 4.13 below), which allows users to select texts according to a series of parameters present in the metadata of the samples. These parameters are based on information about both the author and the documents. As regards the documents, selections can be made according to their discipline or genre, or by setting a timespan for their publication. As regards the authors, texts can be selected by setting timespans for their dates of birth and death, or by choosing authors of a given age, sex, or place of education. For this last parameter, texts can be selected on a three-level basis. The first level (“Place 3”) identifies the continent, the second level (“Place 2”) identifies the country or US state, and the last level (“Place 1”) identifies the city or county.
Figure 4.13: Selection by metadata window of the CCT.
Finally, the “clone selection” tab copies the subset of samples which has been selected, in order to export that selection to other features of the corpus. For instance, if this tab is pressed after selecting a subset to be searched with the Search Tool, it will copy the subset so that it can be used with the Generate Term List Tool, and vice versa.
Users can run both simple and advanced searches with the Search Function:
3.2.1. Simple searches
In simple term searches a single word is searched for, and all the occurrences of that word in the set of documents selected by the researcher are shown. These occurrences are grouped according to the sample to which they belong and, as shown below in Figure 4.14, provide, from left to right, information about the sample, the numeric position the word occupies in the sample, the page of the sample, the position (expressed as a percentage) of the word on that page, and the context of the occurrence to the left and right, respectively.
Figure 4.14: Search results window.
Figure 4.15: Context location window.
If any of these results is clicked on, a new window will pop up showing the occurrence, highlighted, in the document in which it appears. In this way, users can see more context than is shown in the main results window. This can be seen in Figure 4.15 above.
The last row in the results window displays a summary of the results. It shows the total number of occurrences rendered by the search, the different types the search has found and the number of occurrences for each type95. If this last row is clicked on, a new window will appear (see Figure 4.16 below), showing the results grouped according to their type and to the sample in which they appear.
These results are tabulated, so that they can be easily pasted into a spreadsheet.
Figure 4.16: Results summary window.
3.2.2. Advanced searches
A series of advanced searches can also be carried out. These include searches for multiple terms, searches with wildcards, and searches on tags and editorial marks. Everything which has been explained above for simple searches also applies to advanced ones.
a) Multiple terms
Multiple terms can be searched by introducing them in the search query. By default, only occurrences in which the terms appear consecutively are returned. In order to search for near (non-contiguous) occurrences of these terms, one should select the maximum number of words which may appear between the two terms in the right-most pull-down menu, named “Gap” (see Figure 4.11 above). The CCT will then show all occurrences in which the two words being searched appear within the selected number of words of each other (that is, with a gap of 3, results will include occurrences in which there are three, two, one or no words between the terms queried). The maximum gap allowed by the CCT is nine words.
95 This is interesting because, as explained above, the CCT renders all the different spelling variants of a word when it is searched. For instance, after introducing a query such as “phenomena”, the CCT renders not only results for “phenomena” but also for “phœnomena” and “phænomena” (Lareo & Moskowich, 2012: 15).
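The effect of the “Gap” setting can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the CCT's own algorithm: the helper `gap_search` and the token list are hypothetical, and the sketch assumes that the second term must follow the first, which the description above does not specify.

```python
def gap_search(words, first, second, gap=0):
    """Return (i, j) index pairs where `second` follows `first`
    with at most `gap` words in between (assumed ordered search)."""
    hits = []
    for i, word in enumerate(words):
        if word != first:
            continue
        # The second term may sit from 1 to gap + 1 positions after the first.
        for j in range(i + 1, min(i + gap + 2, len(words))):
            if words[j] == second:
                hits.append((i, j))
    return hits

tokens = "in case the weather in any case should change".split()

print(gap_search(tokens, "in", "case", gap=0))  # contiguous occurrences only
print(gap_search(tokens, "in", "case", gap=1))  # up to one intervening word
```

With a gap of 0 only the contiguous “in case” is found; with a gap of 1 the sequence “in any case” is returned as well, which mirrors the cumulative behaviour described above.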
b) Wildcards
The CCT allows a number of wildcards in searches, which can furthermore be combined in order to make more complex searches possible.
A dot (.) can be used to stand for any single character, but just for one. For instance, if “ma.e” is queried, examples both of “made” and “make” will appear, among others, but not of “machine”, as each dot stands for a single character.
An asterisk (*) after a given character indicates that the character may appear zero or more times. For instance, if “be*” is queried, the results will include “be” and “bee”, but also “b” and “beee”. This wildcard is useful to search for differences in spelling, as for instance “armou*r” would render results both for “armor” and “armour”. In combination with a dot, an asterisk can be used to replace any string before or after a series of characters. For instance, a scholar studying prefixes could find all words including the prefix “dis-” by querying “dis.*”, although this would also include words such as “disc”, which they would have to discard.
A plus sign (+) after a given character indicates that the character may appear one or more times. Following on from the previous example, if “be+” is queried, the results will include “be”, “bee” and “beee”, but not “b”, as the character before the plus must appear at least once.
Parentheses are used to make the characters they enclose an indivisible cluster. This is useful in combination with other wildcards such as the asterisk: if “(counter-)*espionage” is searched, the results will include “espionage”, “counter-espionage” and “counter-counter-espionage”.
The vertical bar (|) stands for “either…or”, that is, it allows for the presence of only one of the options given. For instance, if “t(y|i)re” is queried, results will include “tyre” and “tire” but not “tore”.
Finally, [^X]* (substituting X by any string) restricts the presence of a given string in a search, with all other strings of characters being possible. For instance, if “dr[^o]*w.*” is queried, results such as “draw”, “drew” and “drawing” appear, but the restriction on the presence of the string “o” between “dr” and “w” excludes results such as “drowsy” or “drown”, which would otherwise have appeared. A summary of these wildcards can be seen in Table 4.5 below.
Wildcard    Effect
.           Stands for any (but only one) character.
*           Stands for the previous character, which may appear zero or more times.
+           Stands for the previous character, which may appear one or more times.
()          Characters enclosed function as an indivisible cluster.
|           Either one or the other, used in combination with strings between brackets.
[^X]*       Restricts a string.
Table 4.5: Wildcards accepted in the CCT (taken from Lareo & Moskowich, 2012: 18-24)
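The wildcards listed above coincide with standard regular-expression syntax, so the examples given in this section can be checked with Python's `re` module. The sketch below is not a description of how the CCT evaluates queries: the helper `matching_terms` is hypothetical, and it assumes that a query must match a whole term, which is consistent with “ma.e” not returning “machine”. In Python, `[^o]` is a character class that excludes the single character “o”.

```python
import re

def matching_terms(pattern, terms):
    """Return the terms that the pattern matches in full."""
    compiled = re.compile(pattern)
    return [term for term in terms if compiled.fullmatch(term)]

# Queries taken from the examples in the text, each with a few candidate terms.
EXAMPLES = {
    "ma.e": ["made", "make", "machine"],
    "armou*r": ["armor", "armour", "armr"],
    "be+": ["b", "be", "bee"],
    "(counter-)*espionage": ["espionage", "counter-espionage",
                             "counter-counter-espionage"],
    "t(y|i)re": ["tyre", "tire", "tore"],
    "dr[^o]*w.*": ["draw", "drew", "drawing", "drown", "drowsy"],
}

for pattern, terms in EXAMPLES.items():
    print(pattern, "->", matching_terms(pattern, terms))
```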
c) Editorial marks and tags
Finally, searches can also locate particular editorial marks and tags in the text. The set of editorial marks and tags which the CCT is capable of searching is shown in Table 4.6 below. In order to search for a given editorial mark it is necessary to introduce a backslash before each square bracket. For instance, if one wants to find all the fragments marked as [fragmentgreek], the query must be \[fragmentgreek\]. To search for tags it is necessary to go to the Tags96 tab, in which a pull-down menu shows the different tags which can be searched.
Editorial marks                             Tags
[quotation]                                 <abbr>
[quotlat], [quotgreek] etc.                 <emph>
[fragment]                                  <note>
[fragmentlat], [fragmentgreek] etc.
[note], [endnote], [margin note]
[figure], [table], [poem], [formula]
[unclear]
[omitted pages], [missing page(s)]
Table 4.6: Editorial marks and tags which the CCT identifies in searches
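The need for a backslash before each square bracket follows from the wildcard syntax, in which unescaped square brackets delimit a set of characters rather than standing for themselves. The short Python sketch below illustrates this with the standard `re` module; the sentence used as text is invented for the purpose, and the sketch makes no claim about the internals of the CCT.

```python
import re

# Invented text containing one editorial mark.
text = "the word [fragmentgreek] appears in the margin"

# re.escape places a backslash before each square bracket.
escaped = re.escape("[fragmentgreek]")
print(escaped)

# Escaped: the brackets are literal, so the whole mark is found.
marks = re.findall(escaped, text)
print(marks)

# Unescaped: the brackets delimit a character class, so single letters match.
letters = re.findall("[fragmentgreek]", text)
print(letters[:5])
```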
3.3. Term lists
To access the term-list generation function, the user must go to the clickable button at the bottom right of the window (see Figure 4.11 above). In this function, apart from the left-most pull-down menu which, as already explained, allows the selection of the samples to be analysed, there is a second pull-down menu, called “for”, in which users can decide to show the full term list or to restrict it to terms starting with a single character. There is also the option to select “special”, which will restrict the list to terms whose initial characters are not part of the present-day alphabet, that is, numbers, symbols, square-bracketed words and characters no longer used, such as <ſ> or <æ>.
96 This tab will also be used to allow for morphological searches once the Coruña Corpus is POS-tagged.
Once a term list is generated, it will pop up in a new window (Figure 4.17 below), showing the set of samples being analysed at the top, and then each term appearing in the set of samples selected with the information about the total number of times it appears.
Figure 4.17: Term list window.
The terms are ordered so that words whose first characters are symbols appear first, words whose first character is a number second, followed by square-bracketed words, words ordered alphabetically, and, finally, words starting with a character not used in present-day English. It is interesting to note that, contrary to searches, in this view words with more than one spelling throughout time will appear as separate entries.
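The ordering just described can be approximated with a sort key, as in the following Python sketch. It is only an approximation under stated assumptions: the helpers `group_rank` and `term_list` and the token list are hypothetical, terms are classified by their first character alone, and the order within each group is plain code-point order, which need not coincide with that of the CCT.

```python
from collections import Counter
import string

def group_rank(term):
    """Rank a term by its first character, following the order described above."""
    first = term[0]
    if first == "[":
        return 2  # square-bracketed words
    if first.isdigit():
        return 1  # numbers
    if first.lower() in string.ascii_lowercase:
        return 3  # words starting with a present-day letter
    if first.isalpha():
        return 4  # letters no longer in use, such as ſ or æ
    return 0      # remaining symbols

def term_list(tokens):
    """Count tokens and return (term, frequency) pairs in display order."""
    counts = Counter(tokens)
    return sorted(counts.items(), key=lambda item: (group_rank(item[0]), item[0]))

tokens = ["phenomena", "æther", "1717", "[unclear]", "&", "phænomena", "phenomena"]
for term, frequency in term_list(tokens):
    print(term, frequency)
```

Since the terms are counted exactly as spelled, “phenomena” and “phænomena” are listed as separate entries, as happens in the term list view.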
4. Methodology
The final section of this chapter is concerned with the process followed in order to obtain and classify the cases which are analysed in Chapter 5 below. It is divided into two sections. Section 4.1 explains the process followed to obtain the cases and Section 4.2 presents the parameters used to analyse the results and organise the study, together with the different variables in each of them and a short explanation of the way they will be used.
4.1. Searching and disambiguation
In order to obtain the data to analyse in Chapter 5 below, the selected particles presented in Section 2 of Chapter 2 have been searched in the selected subcorpora of the CC with the help of the Coruña Corpus Tool, using the functions of simple and multiple-term search. The spelling variants of each of these particles have also been searched for when data from the Oxford English Dictionary indicated that they were still in use during the period under study.
However, the Coruña Corpus is not POS-tagged and, consequently, it is impossible to refine the results of the searches propter hoc. That is, when conducting a query, all the occurrences fulfilling that query will appear, independently of their different meanings or functions. This was the case with the searches in this dissertation, in which, after searching for conditional cases with the selected particles, a good proportion of the results did not show any conditional reading.
Consequently, the list of occurrences obtained with the CCT has had to undergo manual disambiguation, in a laborious process in which all non-conditional uses of the particles have been eliminated. Among others, discarded occurrences include all the uses of inversion triggers in which these particles do not introduce conditionals, the uses of if as an interrogative, as long as and so long as when they were part of a comparative structure, in case of when followed by a noun phrase, or all the cases of if as part of the conjunctive locution as if.
After this process was finished, the number of occurrences of conditionals in the corpus was 3,735.
These examples were copied with sufficient context both before and after the occurrence of the searched query word. This was done in order to make possible a pragmatic analysis such as the one needed to ascertain the function the conditional was playing in discourse. After this, each occurrence was classified according to a number of parameters, which are explained in what follows.
4.2. Parameters of the study
The categories used as parameters in this study can be classified into two different groups: extra-linguistic and linguistic parameters.
4.2.1. Extra-linguistic parameters
The first group includes categories which are dependent on the extra-linguistic reality. They affect the results at the sample-level and they are used as criteria to classify different samples into sets whose combined results are compared. These categories and the variables in each of them have been provided by the compilers of the corpus (as explained in Sections 1 and 2 above), but are discussed here to indicate how they have been applied in the treatment of the data. Apart from the Corpus as a whole, which is considered to be a representation of the whole scientific discourse of the eighteenth and nineteenth centuries, there are five such categories:
a) Classification according to diachronic variation
The first parameter used is diachronic evolution. This parameter is analysed by means of three different groupings of samples, which are not all used in every case but are chosen according to the requirements of the analysis. The first of them is the century in which each sample was written, which provides a quick dichotomous comparison between the uses in eighteenth- and nineteenth-century samples. A second level of classification takes into consideration the date (more specifically, the year) of publication of each sample on its own, which makes it possible to run analyses of correlation with other factors97. Finally, there is a third, intermediate level of analysis, in which the samples are grouped according to the decade in which they were written. This third grouping provides sets of samples of c. 60,000 words in which individual idiosyncrasies wane, but which allow more fine-grained analyses than the classification according to the century in which each sample was written.
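The century and decade groupings amount to rounding the year of publication down to the start of its period and adding up the results of the samples which fall together. A minimal Python sketch follows; the helper `group_by` is hypothetical and the (year, count) pairs are invented for illustration, not taken from the corpus.

```python
from collections import defaultdict

# Invented (year of publication, number of conditionals) pairs, one per sample.
SAMPLES = [(1702, 31), (1717, 24), (1719, 40), (1802, 18), (1855, 27)]

def group_by(samples, span):
    """Sum the counts of samples whose years fall in the same `span`-year period."""
    totals = defaultdict(int)
    for year, count in samples:
        # Integer division rounds the year down to the start of its period.
        totals[year // span * span] += count
    return dict(totals)

print(group_by(SAMPLES, 100))  # grouping per century
print(group_by(SAMPLES, 10))   # grouping per decade
```

The per-year level needs no grouping at all, since each sample is then treated as a separate measure.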
b) Classification per discipline of the text
As explained above, the corpus used in this dissertation includes material from three different subcorpora, CETA, CEPhiT, and CELiST, which contain samples of texts about Astronomy, Philosophy, and Life Sciences, respectively. Thus, by comparing the uses of conditionals in each of the subcorpora, the particular uses of each discipline might be discovered.
This parameter will also be used in combination with that of diachronic variation, separating the eighteenth and nineteenth century sections of each of the three subcorpora and thus distinguishing six different subsets of samples. This allows analyses about the diachronic variation in a given discipline or about disciplinary differences over a given period of time.
c) Classification per genre of the text
The results will also be examined from the point of view of the genre of each text sample. As explained above, this parameter presents eight variables: Treatises, Textbooks, Essays, Articles, Dialogues, Lectures, Letters and Others (which, in this case, is a sample taken from a dictionary). This parameter allows the analysis of possible genre-motivated differences, contributing to the study of the author-audience relationship by comparing different types of readership according to the different genres.
97 Correlation analyses require a high number of individual measures to be effective and reliable, thus advising the consideration of each sample as a variable in the analysis, rather than bigger groups.
d) Classification per sex of the author
A further parameter is that determined by the sex of authors. Using this parameter two different subsets are distinguished: the one formed by Male-authored samples, which, as explained above, contains 110 samples (c. 1,080,000 words), and the subset of Female-authored samples, containing twelve samples (c. 120,000 words). This parameter is particularly significant, as the differences between male and female samples may reflect certain strategies of use in order to avoid discrimination, as explained in Chapter 1.
e) Classification per origin of the author
The final extra-linguistic parameter taken into account is the geographical origin of the author, useful