
Twelve Grade 4 subject textbooks provided the textbook corpus from which the HFW were generated. The textbooks cut across the four content subject areas offered in Grade 4, namely Mathematics, Natural and Social Sciences, Life Skills, and Technology. They were in current use as core texts in the sampled schools. The researcher took part in administering the questionnaire for the contextual profiling of the 60 schools in which the large consortium project worked, and used that opportunity both to negotiate access to the schools that later participated in the study and to survey the most commonly used textbooks in each subject area. The teachers were simply asked to note down the titles of the books they used in the specific subject areas.

The HFW generated from the textbooks would be indicative of the vocabulary demands of the Grade 4 textbooks, which is also the vocabulary required of Grade 3 learners by the end of third grade. Expository or informational texts were used on the understanding that they represent the concept of ‘reading to learn’, which is the focus of reading in Grade 4, better than narrative texts do. Grade 3 should prepare learners for reading such texts by equipping them with the general and academic vocabulary in which content is embedded. An understanding of the general and academic vocabulary in the texts would position learners better for working out the meanings of specialist or technical vocabulary with the teacher’s mediation. The present study’s assumption or hypothesis is that these content area textbooks are currently not used optimally in the Grade 4 English First Additional Language classrooms because of learners’ limited vocabulary base in English.

The textbook corpus needed to be prepared with care so that valid claims could be made that the HFW derived from it were representative of the vocabulary demands of the textbooks.

5.3.1.1 Preparation of the textbook corpus for HFW generation

The generation of HFW from the textbook corpus was not achieved with a few clicks in the word frequency count software. Substantial effort went into preparing the textbook corpus before the HFW were generated. Much of the need for elaborate preparation resulted from the failure of the Readiris³ software program, purchased to convert the scanned PDF textbook files, which the word frequency counter read as pictures, into readable text. After several futile attempts to get assistance from specialists within the university, as well as from the suppliers’ support services, the ultimate decision was to type the textbooks.

The textbooks were typed by four typists, each of whom typed three books. The typists proofread each other’s typing before the researcher received the typed files at the end of the day and combed through them for any errors that had escaped the typists’ attention. Running parts of the typed texts through the AntConc 3.2.4 word frequency count software⁴ revealed misspellings. The software program would show words like togeter, diferent and impotant and the frequency of their occurrence. These were obviously misspellings of together, different and important, and they were corrected in the texts. In some cases, as with the word ‘color’, the textbook concerned was checked to confirm whether it used that spelling or ‘colour’. The frequency of a misspelt form indicated how many occurrences needed correction.

³ The software program meant to read the scanned textbooks and generate word frequencies read the files as pictures, not as text, and so could not perform the word counts. There was therefore a need for another software program, Readiris, to convert the scanned texts into readable text so that the word frequency counter could generate word frequencies. Readiris is powerful optical character recognition (OCR) software designed to convert paper documents, images or PDF files into editable and searchable digital text.

⁴ AntConc 3.2.4 is a concordancer named after Laurence Anthony, who developed it: ‘Ant’ for Anthony and ‘conc’ for concordancer. A concordancer is a computer program that automatically generates a list of words, phrases, or distributed structures, along with their immediate contexts, from a corpus or other collection of texts assembled for language study. It can search, access and analyse language from a corpus. It allows one to enter a word or phrase and search for multiple examples of how that word or phrase is used in the corpus. AntConc 3.2.4 is a freeware concordance program. Among its several uses are generating word lists from a corpus and showing all the instances in which a specific word appears.

Although the word frequency count software revealed some of the errors in the typed texts, other errors could only be detected by reading through the typed files. Most such errors involved inaccurate word spacing, which confused everyone with every one, all together with altogether, sometimes with some times, and may be with maybe. The context helped the researcher to determine which of the forms was correct. The effort and vigilance the process required testified to the need to combine hardware (computers), software (programs), and wetware (our brains), as Nation (2012) advises. The electronic analysis needed to be complemented by human effort to generate a valid HFW list. To ease the identification of such errors, small amounts of text were entered into the concordancer at a time so that the researcher could comb through the text for misrepresented words. In the process of trying to ensure the accuracy of the typed material, it became apparent that some words needed to be excluded from the textbook corpus because they would not affect comprehension of the texts in any significant way.

Decisions about exclusion of some words from the textbook corpus

From one of the trial word frequency lists generated to reveal anomalies in the typing, the researcher’s attention was drawn to the high frequency of the word ‘south’, which by far exceeded that of ‘north’. A reading of the script showed that ‘South Africa’ and ‘South African’ recurred many times in the text. This had pushed up the frequency of ‘south’, although in these occurrences the word did not refer to the cardinal point of the compass. There was, therefore, a need to read the typed files with the express purpose of identifying words that would give a misleading picture of word frequencies. That process led to the elimination of all names of people, places, countries, cities, and so forth. Names of some of the provinces of South Africa, like Free State, Eastern Cape, Northern Cape, and Western Cape, which appeared quite frequently in some texts, were omitted. These names are made up of two words, each of which has an independent meaning, and their inclusion would have needlessly increased the frequency of their individual forms. If a text made constant reference to the province Free State, one could end up thinking that the words free and state were part of the vocabulary learners needed for reading the text with understanding when in reality they were not.
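The inflation of ‘south’ by ‘South Africa’ can be illustrated with a minimal Python sketch. It is not the procedure followed in the study, where the names were identified and removed by reading the typed files; the exclusion list, the sample sentence and the function names are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical exclusion list; the study's full list of names is not reported.
EXCLUDED_NAMES = ["South African", "South Africa", "Free State", "Eastern Cape"]

def remove_names(text, names=EXCLUDED_NAMES):
    """Blank out multiword proper names so their parts are not counted."""
    # Longest names first, so a longer name is never left half-removed.
    for name in sorted(names, key=len, reverse=True):
        pattern = r"\b" + re.escape(name) + r"\b"
        text = re.sub(pattern, " ", text, flags=re.IGNORECASE)
    return text

def count_tokens(text):
    """Count lower-cased runs of letters (an assumed token definition)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

sample = ("South Africa is south of the equator. The Free State is in "
          "South Africa. Walk north, then south.")
before = count_tokens(sample)
after = count_tokens(remove_names(sample))
```

In this invented sample, half of the occurrences of ‘south’ come from the country name, and ‘free’ disappears altogether once the province name is removed, while ‘north’ is unaffected.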

In preparing the corpus, words in the contents page, acknowledgements, glossary and index sections were also excluded. In the researcher’s experience as a teacher and as a student, these are sections to which learners and teachers do not pay specific attention. Because of that, knowledge or ignorance of the vocabulary in these sections is not critical to comprehending the manifest content of a textbook. Also omitted were numerals, symbols, and non-English words. The labels and words on pictures and diagrams were, however, included, as they determined the degree to which learners would comprehend the pictures and diagrams as well as the text related to them. Some words were repeated throughout the textbooks, but learners could ignore them without losing much, if anything, of the content. Examples were headings such as Unit 1, Unit 2, Activity 1, Activity 2, Let’s Talk and Let’s Write. These were excluded as they did not form part of the core content learners were obliged to comprehend. Once the words that were not part of the critical vocabulary learners needed had been eliminated, there remained the challenge that compound words, some possessive forms and contracted forms posed when the texts were loaded onto the word frequency counter.

Challenges related to the inclusion of some word forms in the textbook corpus

The challenge of word exclusions was less complex than that of the inclusion of some words. Some words needed to be included in the corpus because they affected textual comprehension, but including them in their orthographic forms would produce inaccurate forms in the word frequency list, overrepresenting some words while underrepresenting other similar word forms. These included compound forms, possessive forms and contracted forms. The nature of the challenge they posed is discussed, with examples, in this section.

The first challenge presented by compound words was that the frequency counter read them as two separate words when they were actually single words. Compounds are word groups comprising two or more parts that express a single specific concept. Although some compounds, like ‘ice cream’, denote a single object, the word frequency counter read them as two separate words. The two words in combination normally did not retain the meanings they have as separate words, so counting them separately gave a misleading idea of the frequency of the two words making up the compound form. Ice cream, for example, would push up the word frequencies of ‘ice’ and ‘cream’. There was, therefore, a need to capture compound words as one word. For both hyphenated and non-hyphenated compound words, the two parts were written as one word by removing the hyphen or the space between them. This was because the software program ignored all punctuation marks and so read hyphenated words as two words. Words whose combination conveyed a single meaning, and which lost that meaning when considered separately, were typed as one. Examples were speechbubbles, foodweb, foodchain, selftiming, Tshirt, overspending, fourdigit, crosssection, and doublestorey. Short forms like km/hr were written as kmhr. The word counter could then read them as single words. The challenge of single words being read as two separate words was not confined to compound forms but arose wherever the apostrophe was used.
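The closing-up of compounds can be sketched in Python as follows. This is only an illustration under stated assumptions: in the study the compounds were identified by reading the texts and were retyped by hand, whereas the sketch removes every hyphen between word characters and relies on hypothetical lists of open compounds and short forms.

```python
import re

# Hypothetical lists for illustration; the study compiled its own by reading.
OPEN_COMPOUNDS = ["ice cream", "food web", "food chain", "speech bubbles"]
SHORT_FORMS = {"km/hr": "kmhr"}

def join_compounds(text):
    """Write compounds and listed short forms as single unbroken words."""
    # Hyphenated forms: "T-shirt" -> "Tshirt", "four-digit" -> "fourdigit".
    text = re.sub(r"(?<=\w)-(?=\w)", "", text)
    # Short forms containing punctuation: "km/hr" -> "kmhr".
    for short, joined in SHORT_FORMS.items():
        text = text.replace(short, joined)
    # Open compounds: "ice cream" -> "icecream".
    for compound in OPEN_COMPOUNDS:
        parts = [re.escape(part) for part in compound.split()]
        pattern = r"\b" + r"\s+".join(parts) + r"\b"
        text = re.sub(pattern, compound.replace(" ", ""), text,
                      flags=re.IGNORECASE)
    return text

joined = join_compounds("A T-shirt and ice cream at 60 km/hr")
```

Removing every hyphen is a simplification. The criterion described in the text, joining only those words whose combination carries a single meaning, requires human judgement and is not captured by a pattern.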

Possessive forms and contracted forms were read as two words by the word counter since it did not recognise the apostrophe. The concordancer read the apostrophe as a space, treating the part of the word before the apostrophe as one word and the part after it as another. While generating a trial word frequency list, the researcher noticed that ‘t’ and ‘s’ appeared as words in their own right, and with high frequency. The list also contained words like ‘don’. It became apparent that some of the ‘t’ letters that stood as independent forms had come from the word don’t, which had been read as don and t. This was misleading in that a single word was counted as two separate words, and also in that it generated non-existent words. The apostrophe had to be removed wherever it appeared. The word don’t was therefore written as dont. Although the removal of apostrophes resolved the problem for some words, it introduced a complication for others.
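The splitting behaviour described here can be reproduced with a short Python sketch. The tokeniser below is a stand-in that treats every non-letter as a word break; it is assumed to behave like the word counter as described in the text, not to replicate AntConc itself. The sample sentence and function names are invented.

```python
import re
from collections import Counter

def letter_tokens(text):
    """Stand-in tokeniser: any non-letter, including the apostrophe, splits."""
    return re.findall(r"[a-z]+", text.lower())

def strip_apostrophes(text):
    """Remove straight and curly apostrophes so each form stays one token."""
    return re.sub(r"['’]", "", text)

sample = "Don't run. The dog's bowl isn't full."
raw = Counter(letter_tokens(sample))
cleaned = Counter(letter_tokens(strip_apostrophes(sample)))
```

Counted as typed, the sample yields the spurious tokens ‘t’, ‘s’, ‘don’ and ‘isn’. Once the apostrophes are stripped, dont, dogs and isnt are each counted once as single tokens.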

The most notable challenge was how to differentiate between its (the pronoun) and it’s (the contracted form of ‘it is’). Removing the apostrophe from the contracted form would make it identical to the pronoun. All the instances of the pronoun and of the contracted form would then be counted as the pronoun. This would make the frequency of its erroneously high while denying the contracted form a single occurrence. Such anomalies would affect the validity of the corpus significantly. To circumvent that, the contracted form it’s was written as itis, without a space, to differentiate it from the two-word form it is. A list of such changes was kept so that, after the generation of the HFW list, the recoded forms could be converted back to their original orthographic forms.

A related but more complex case was that of distinguishing, once the apostrophe was removed, between words like other’s, as in each other’s, and others, the plural form of other. In this case, the researcher appended an abbreviation of the word class to the less frequent form, here ‘otherspos’ for other’s (the possessive form). Marking the less frequent form meant that fewer words had to be altered than if the more frequent form had been marked. A similar but even greater challenge was distinguishing between three forms of a word, as in boys, boy’s and boys’. The complication lay in the addition of a third form. Although all three forms qualified to be regarded as one word according to the notion of word adapted for the present study in Chapter 3, the word frequency generation was based on the token as the unit of counting. Letting the three forms be counted as one word at this stage would give an erroneous result for the number of tokens in the corpus. Removing the apostrophes from the two possessive forms would mean that all three forms were entered as the same word, boys. To address this challenge, boy’s was entered as boyspos and boys’ as boyss, as in boys’s. The same was done for friend’s/friends/friends’. Again, all such changes were noted down so that the words would be recognisable in the frequency list. The same principle was applied to distinguish the following pairs of words once the punctuation marks were removed:

coordinates/co-ordinates, we’re/were, side’s/sides, hour’s/hours, year’s/years, among others.

For short forms with dual functions, like st., which can stand for saint or street, the word was entered as stsaint or ststreet. The short form of ‘for example’, e.g., was written without the full stops, which would otherwise have split it into ‘e’ and ‘g’.
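The recoding of ambiguous forms, and their restoration after the list has been generated, can be sketched in Python. The table holds only forms named in the text (itis, otherspos, boyspos, boyss); the function names, the sample sentence and the use of a lookup table are assumptions, since the study made these changes by hand and recorded them in a list.

```python
import re
from collections import Counter

# Forms named in the text, mapped to the codes used in the typed corpus.
RECODE = {
    "it's": "itis",
    "other's": "otherspos",
    "boy's": "boyspos",
    "boys'": "boyss",
}
RESTORE = {coded: original for original, coded in RECODE.items()}

def recode(text):
    """Replace ambiguous apostrophe forms, then drop remaining apostrophes."""
    text = text.lower().replace("’", "'")
    for original, coded in RECODE.items():
        # Match the form only as a whole word, not inside a longer one.
        pattern = r"(?<![a-z'])" + re.escape(original) + r"(?![a-z'])"
        text = re.sub(pattern, coded, text)
    return text.replace("'", "")

def restore(freqs):
    """Map coded forms in a frequency list back to their original spelling."""
    return Counter({RESTORE.get(word, word): n for word, n in freqs.items()})

sample = ("It's true that its tail is long. "
          "The boy's hat, the boys' hats and the boys.")
coded_counts = Counter(re.findall(r"[a-z]+", recode(sample)))
restored = restore(coded_counts)
```

Because each coded form is a unique letter string, its, it’s, boys, boy’s and boys’ are all counted separately and can be recovered from the frequency list. The sketch cannot tell a plural possessive from a plural followed by a closing single quotation mark; that distinction still needs a human reader.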

Much time was invested in preparing the textbook corpus before the actual word list generation, as the quality of the resultant list would only be as good as that of the corpus from which it derived. The word elimination, the word combination, and the alteration of the orthographic form of some words were meant to ensure that the output of the word list generation process would mirror the corpus. In retrospect, the failure of the Readiris software program was a blessing in disguise, as typing the textbooks revealed the need for all the eliminations, combinations and reconfigurations of word structures. The highly mechanised process of loading the textbook corpus onto the software program and generating the HFW with a single click, which the researcher had envisaged at the beginning, would have grossly compromised the resultant word list. The extensive ‘cleaning up’ of the corpus prepared it for the next stage, the generation of the HFW.

5.3.1.2 Generation of HFW

The files from the twelve textbooks were converted into plain text (txt) format and merged into a single file. All the words were converted to lower case so that a word beginning with an upper-case letter would not be read as a different word from the same word beginning with a lower-case letter. From this file, a word frequency list was then generated using the AntConc 3.2.4 software. The token was used as the unit of counting because no software was available to measure word frequency according to the unit of word adapted for the present study, as discussed in Chapter 4. The frequency list ranked the words from the most frequent down to the word tokens occurring only once in the corpus, and captured the frequency with which each word occurred. The corpus yielded 6 748 word types and 141 063 word tokens. This was after all the exclusions and the merging of some word forms described earlier.
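The distinction between types and tokens, and the effect of lower-casing, can be shown with a minimal Python sketch. It stands in for the word list function of the concordancer and is not the software used in the study. A token is assumed here to be any unbroken run of letters, which may differ from the token definition used by AntConc, and the two sample ‘files’ are invented.

```python
import re
from collections import Counter

def frequency_list(texts):
    """Merge texts, lower-case them, and return (ranked list, types, tokens)."""
    merged = "\n".join(texts).lower()
    tokens = re.findall(r"[a-z]+", merged)  # assumed token definition
    freqs = Counter(tokens)
    # most_common() ranks from the most frequent word down to single occurrences.
    return freqs.most_common(), len(freqs), len(tokens)

ranked, n_types, n_tokens = frequency_list(
    ["The cat saw the dog.", "The dog ran."]
)
```

In the sample, ‘The’ and ‘the’ are counted together because of the lower-casing, so the eight running words (tokens) reduce to five distinct forms (types).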

The purpose of the word frequency list was to determine the critical vocabulary needs of Grade 3 learners transitioning to Grade 4 on the basis of how frequently words occurred. The first screening measure was the frequency with which words appeared in the corpus, with the high frequency words meriting inclusion in the list of learners’ required vocabulary. This necessitated a decision on the appropriate cut-off point below which words would be considered too infrequent to count as critical vocabulary.

Several figures have been proposed for the number of times a word should be heard and/or seen for it to be acquired incidentally from context. Khatib & Nourzadeh (2012, pp. 4-5) note that:

[S]ome researchers suggest that 6 encounters to an unknown word would be enough while some other researchers argue in support of 8 encounters (Horst et al. 1998). Abundant evidence has been found for 10 (or more) encounters, both in L1 (Jenkins et al. 1984) and L2 (Saragi et al. 1978; Webb, 2007).

The lack of consensus on the number of word recurrences sufficient for word acquisition arises because the acquisition process depends on individual proficiency levels, the nature of word exposure, context and many other confounding variables. Considering the small size of the present study’s corpus, thirty occurrences of a word within the corpus were considered frequent enough for the word to have high frequency status. All the word forms occurring fewer times than that cut-off point were considered too infrequent to merit inclusion in the HFW list and were discarded. The cut-off point of 30 occurrences yielded a total of 633 types from the 6 748 types. The 633 word types were too many to treat as the critical vocabulary needs of the learners to which teachers would need to give explicit attention. The study also tested learners’ knowledge of the words identified as representing their critical needs, and 633 types were too many to test on the learners. Although frequency was ‘the’ screening criterion, there was a need to augment it with other criteria.
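The frequency screening can be expressed as a simple filter, sketched here in Python. The cut-off of 30 and the figures 633 and 6 748 come from the text; the toy counts and the function name are invented for illustration, and the sketch assumes that a word with exactly 30 occurrences was retained.

```python
from collections import Counter

CUTOFF = 30  # minimum number of occurrences, as stated in the text

def high_frequency(freqs, cutoff=CUTOFF):
    """Keep only the word types occurring at least `cutoff` times."""
    return {word: n for word, n in freqs.items() if n >= cutoff}

# Invented counts, not taken from the study's corpus.
toy = Counter({"the": 412, "water": 57, "measure": 30, "orbit": 29, "gill": 1})
kept = high_frequency(toy)

# Proportion of types retained in the study, from the reported figures.
share = 633 / 6748
```

On the reported figures, the cut-off retained a little over 9% of the word types in the corpus.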

5.3.1.3 Criteria and process for narrowing the vocabulary needs of the learners

This sub-section describes the different criteria that were used and the screening stages that were followed in order to arrive at what could be regarded as the core vocabulary needs of
