
Content analysis aims to systematically classify segments of text (the unit of analysis may be words, phrases or other units) according to substantive themes that are usually pre-determined by the researcher(s). Coding is designed to ensure that the segments coded to each category share the same or a similar interpretation (Weber, 1990). This thesis aimed to analyse volume crime data, and the number of cases in the data necessitated automated or computer-assisted analysis. However, as noted above, the richness and flexibility of natural language present a number of complications for automated processes, notably in determining whether two segments of text genuinely have the same or similar meaning. Content analysis provides a number of tools that can assist with these complications, and can include both symbolic and statistical approaches to data analysis. The allocation of codes can be determined by the meaning of the text or, as is increasingly the case with computer-assisted methods, the process can be regarded as the application of a set of rules without the need for the computer to recognise the meanings on which the coding frame is based. These rules are embedded within a coding frame which stipulates the conditions for assigning a code to any given segment of text. Coding frames can be regarded as dictionaries (or ontologies) with a definition provided for each conceptual code. Standard, “off-the-shelf” dictionaries are available which can considerably reduce the burden of coding frame development; however, the analysis benefits when standardised tools are revised and tailored to suit the data and research questions in hand. Any work undertaken to refine dictionaries (as in this thesis) can then act as an improved starting point for future work on similar databases.

A commonly used approach applied to large volumes of data, which formed the starting point of analysis for this thesis, is to identify the frequency of words in a segment of text based on the underlying assumption that terms used most frequently will reveal the subject matter of the text.

This approach, known in natural language processing as N-gram analysis, is also referred to as a ‘bag-of-words’ technique as the analysis can be conducted without a concern for word order. In this approach, the ‘codes’ bear a close resemblance to the words in the raw data. In an automated analysis, words will only be counted as ‘similar’ if they are exactly the same. To cope with the inherent flexibility of natural language it is, therefore, necessary to standardise and simplify the text (a basic form of coding) prior to analysis. The processes applied to standardise the text prior to analysis are depicted in Figure 4-1 below.

Initial steps in standardisation might require the identification and correction of spelling errors, for which there are readily available English language checkers. It is also established practice to omit stop words, i.e. words that occur so frequently in the language that they are not expected to add anything to the analysis. Stop word lists commonly include ‘function’ or ‘structure’ words such as the, is, at, which and on. Open source, validated lists of stop words are readily available for use, although these lists may require inspection and modification to suit the specific research question and data sources used.
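The frequency-based starting point described above can be sketched in a few lines of Python. This is a minimal illustration rather than the procedure used in the thesis: the stop word list, the function names and the example records are all assumptions made for the sketch, and a real analysis would begin from a validated stop word list and would also correct spelling.

```python
import re
from collections import Counter

# Illustrative stop word list only (an assumption for this sketch);
# a real analysis would start from a validated, inspected list.
STOP_WORDS = {"the", "is", "at", "which", "on", "a", "and", "to", "of", "was"}

def standardise(text):
    """Lower-case the text, split it into words and drop stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def word_frequencies(segments):
    """Count how often each remaining word occurs across all segments."""
    counts = Counter()
    for segment in segments:
        counts.update(standardise(segment))
    return counts

# Invented example records, not taken from the thesis data.
records = [
    "The offender smashed the window at the rear of the house.",
    "Offender entered via the rear window, which was open.",
]
freqs = word_frequencies(records)
print(freqs.most_common(5))
```

Because words are only matched when they are identical, ‘Offender’ and ‘offender’ are counted together here solely because of the lower-casing step; every further standardisation step discussed in this section plays the same role.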

Figure 4-1 The data preparation process

The next essential step in standardising the text is a process known variously as stemming or tokenisation. This thesis will adopt the terms tokenisation and tokens in acknowledgement of the fact that the standardisation of text included steps additional to basic stemming. In stemming, a word is reduced to its stem, or root form. A lemma is the canonical form of a set of words (Leetaru, 2012). For example, walk, walks, and walking are defined as different lexemes that share the same lemma, i.e. walk. In the analysis of text, words are frequently reduced to a stem that represents this lemma by removing affixes, although this approach can be crude and does not successfully standardise all lexemes (run and ran, for example). Fortunately, algorithms for stemming have been studied in computer science since the 1960s and there are a number of sophisticated algorithms available for use. The analysis in this thesis utilised the widely used English language version of the stemmer ‘Snowball’ (Porter, 2001), which has become regarded as the standard approach to stemming in a wide range of languages (Willet, 2006).
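The limitation of affix removal noted above can be seen in a deliberately crude suffix stripper. The sketch below is not the Snowball stemmer used in the thesis; the suffix list and the minimum stem length are assumptions chosen only to reproduce the walk and run examples.

```python
# Assumed, deliberately minimal rule set; Snowball applies far richer rules.
SUFFIXES = ("ing", "ed", "s")

def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["walk", "walks", "walking", "walked", "run", "ran"]:
    print(w, "->", crude_stem(w))
```

The regular forms of walk all collapse to the same stem, whereas run and ran remain distinct because the difference between them is not carried by an affix.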

Individual words may not be the most appropriate unit of analysis as it is not always possible to derive the meaning of phrases and idioms from their component words. For example, to “change my mind” constitutes a single unit of meaning that does not directly relate to the individual words. In a similar vein, the presence of negating words can radically alter meanings within text: for example, in the case of police data it is important to distinguish ‘no force used’ from ‘force used’. Again, the development of dictionaries has provided flexible approaches for handling phrases, but these approaches are imperfect and dependent on the availability of an appropriate dictionary for the language/domain in question. An alternative approach is to create a list of the most common multi-word combinations within a dataset. The advantage of this approach is that the list is based on the corpus to be analysed and is therefore of greater relevance to the data in hand. This thesis utilised the online software TerMine (Frantzi, Ananiadou, & Mima, 2000)18, developed at the University of Manchester. This program can be used to automatically identify frequently used multi-word phrases in a dataset. Although such software offers considerable time savings, ultimately, the inspection of this list and the decision regarding which phrases need to be treated as a single unit of analysis must be conducted by the analyst. Once the multi-word phrases relevant to the dataset have been identified, they can be treated as if they were a single word simply by conducting a ‘find and replace’ to swap the spaces between words with an underscore (e.g. ‘no force’ becomes ‘no_force’). These phrases are then treated as one token.
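The two steps described above, listing frequent multi-word combinations and then joining the selected phrases with underscores, can be sketched as follows. The sketch uses a plain count of adjacent word pairs, which is a much simpler technique than the method implemented in TerMine; the function names, the threshold and the example records are assumptions made for illustration.

```python
import re
from collections import Counter

def frequent_bigrams(segments, min_count=2):
    """Return adjacent word pairs that occur at least min_count times."""
    counts = Counter()
    for segment in segments:
        words = re.findall(r"[a-z']+", segment.lower())
        counts.update(zip(words, words[1:]))
    return [(" ".join(pair), n) for pair, n in counts.most_common() if n >= min_count]

def join_phrases(text, phrases):
    """Replace the spaces inside each selected phrase with underscores."""
    # Longer phrases are handled first so 'no force used' wins over 'no force'.
    for phrase in sorted(phrases, key=len, reverse=True):
        pattern = r"\b" + re.escape(phrase) + r"\b"
        text = re.sub(
            pattern,
            lambda m: m.group(0).replace(" ", "_"),
            text,
            flags=re.IGNORECASE,
        )
    return text

# Invented example records, not taken from the thesis data.
records = [
    "No force used to gain entry.",
    "Suspect left, no force used.",
    "Force used on rear door.",
]
print(frequent_bigrams(records))
print(join_phrases(records[0], ["no force"]))
```

The candidate list is only a prompt for the analyst: in this example both ‘force used’ and ‘no force’ are reported, and deciding which of them should become a single token remains a manual judgement.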

As noted above, natural language is flexible, meaning that different vocabulary can be applied to describe similar events. Treating different but related vocabulary as the same requires extending the process of tokenisation to identify and group together synonyms, hypernyms, hyponyms and other words and phrases that indicate the same or a similar class of object or action. This process is more akin to traditional coding, as applied during qualitative analysis of text, in that groups of words that present a similar meaning are labelled under the same ‘code’. Throughout this thesis, the term ‘token’ refers to a code applied to a group of words, terms or phrases that share a similar meaning.

18 http://www.nactem.ac.uk/software/termine/ (last checked 6/08/2015)

Again, open source thesauri are available to assist with the identification of synonyms. These include WordNet, an online lexical database of English developed at Princeton University19. Alternatively, researchers can modify existing dictionaries or develop entirely new dictionaries that are specific to their research concerns. The identification of words with similar meaning is not an objective process and is guided by the analyst's interests, which should themselves be informed by the research questions. For example, one research project may simply be interested in identifying physical assaults, whereas another may need to draw distinctions between different types (kick, punch, use of weapon etc.) or severities of physical assault. These distinctions need to be reflected in the coding frames. Although the WordNet database was used to assist in the identification of synonyms, much of this work was accomplished via inspection of the data: a bespoke dictionary was created, then tested and refined through a number of iterations.
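The dependence of the coding frame on the research question can be illustrated with two small dictionaries applied to the same words. The frames below are hypothetical and are not the bespoke dictionary developed for the thesis; the word lists and token names are assumptions chosen to mirror the physical assault example.

```python
# Hypothetical fine-grained frame distinguishing types of assault.
FINE_FRAME = {
    "kick": {"kick", "kicked", "booted"},
    "punch": {"punch", "punched", "thumped"},
    "weapon": {"knife", "bat", "bottle"},
}

# Hypothetical coarse frame for a project that only flags physical assault.
COARSE_FRAME = {"physical_assault": set().union(*FINE_FRAME.values())}

def build_lookup(frame):
    """Invert a coding frame into a word -> token lookup table."""
    return {word: token for token, words in frame.items() for word in words}

def code_words(words, frame):
    """Replace each word with its token; unlisted words pass through unchanged."""
    lookup = build_lookup(frame)
    return [lookup.get(w, w) for w in words]

words = ["victim", "punched", "then", "booted"]
print(code_words(words, FINE_FRAME))
print(code_words(words, COARSE_FRAME))
```

Keeping the frame as data rather than as code makes it straightforward to inspect, revise and re-apply the dictionary over successive iterations.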

A more challenging stage of data preparation is the correct identification and standardisation of homonyms, defined above as words with the same spelling but different meanings which, under a basic frequency count, could be subject to misclassification. This requires a process known as disambiguation to clarify whether similar text really does warrant the same interpretation. This process is harder to automate, although software can produce a list of known ambiguous words and highlight them for further inspection. These potentially ambiguous words or phrases can then be displayed in the context of the surrounding text. This is important as the meaning of some words can only be determined from the rest of the phrase in which they are contained. Key Words in Context (KWIC) lists, also known as a concordance view, assist in the removal of ambiguity and the accurate categorisation of text segments, which can then be replaced with an unambiguous token. Such amendments can be time consuming and although, once again, dictionaries can provide rules for disambiguation for use in automated processes, there is still a risk of error. The disambiguation of homonyms is aided, to a degree, when working with a local grammar, in this case crime reports, rather than truly natural language, as the potential number of meanings for a word is restricted by the context of being recorded by the police as part of a crime report. For example, within police data, the potential meanings of ‘stalk’ are restricted and so there is a greater probability, but no certainty, that it will refer to the verb form. When the crime type of a case is known, then the meanings within the text of an individual MO field are likely to be yet more constrained.

19 https://wordnet.princeton.edu/ (last checked 6/08/2015)
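A Key Words in Context listing of the kind described above can be produced with a short function that returns each occurrence of a keyword together with a few words on either side. This is a minimal sketch: the function name, the context width and the example records are assumptions, and the final decision on meaning is still left to the analyst.

```python
import re

def kwic(segments, keyword, width=3):
    """Return (case index, left context, keyword, right context) for each hit."""
    rows = []
    for i, segment in enumerate(segments):
        words = re.findall(r"[a-z'_]+", segment.lower())
        for pos, word in enumerate(words):
            if word == keyword:
                left = " ".join(words[max(0, pos - width):pos])
                right = " ".join(words[pos + 1:pos + 1 + width])
                rows.append((i, left, word, right))
    return rows

# Invented example records, not taken from the thesis data.
records = [
    "Offender continued to stalk the victim after work.",
    "Plant damaged, stalk snapped at the base.",
]
for case, left, word, right in kwic(records, "stalk"):
    print(f"{case}: {left:>25} [{word}] {right}")
```

Aligning the keyword in a single column allows the verb and noun uses of ‘stalk’ to be told apart at a glance, after which each occurrence can be replaced by an unambiguous token.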

The aim of quantitative content analysis is to convert coded segments of text into numerical variables which represent features within the text. These numerical variables can then be subjected to traditional statistical analysis. This allows the generation of inferences about the importance of certain themes in a text, and for comparison between different texts. This can include the analysis of differences between texts produced at different points in time. When numerical variables are produced based on tokens, care needs to be taken about inferences made based on multiple mentions of the same token; a token does not necessarily carry the same weight each time it is used.

It is important to consider the unit of analysis, and in this research, the occurrence of a token within each case is important – repeated tokens within the same case are not treated with any greater importance. While frequently used tokens are important, the majority of tokens, in any document, are used sparsely; they may occur only once (hapaxes) and for some research questions, including the questions posed by the current research, important insights may be gained through a consideration of infrequently used tokens or absent tokens. This relates to Information Theory (Shannon, 1948) which maintains that rarer occurrences contain more information than commonly repeated occurrences.
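Counting each token at most once per case, and relating rarity to information, can be sketched as below. The self-information formula, -log2(p), is the standard measure from information theory; its application to the proportion of cases containing a token, along with the function names and example cases, is an assumption made for illustration rather than a description of the thesis analysis.

```python
import math
from collections import Counter

def case_frequencies(cases):
    """Count the cases containing each token (repeats within a case are ignored)."""
    counts = Counter()
    for tokens in cases:
        counts.update(set(tokens))
    return counts

def self_information(count, n_cases):
    """Self-information, in bits, of a token found in count of n_cases cases."""
    return -math.log2(count / n_cases)

# Invented tokenised cases, not taken from the thesis data.
cases = [
    ["window", "smashed", "window", "rear"],
    ["window", "forced", "rear"],
    ["door", "forced", "front"],
    ["window", "open", "rear"],
]
df = case_frequencies(cases)

# Tokens found in a single case, the case-level analogue of hapaxes.
single_case_tokens = sorted(t for t, n in df.items() if n == 1)
print(single_case_tokens)

for token, n in df.most_common():
    print(token, n, round(self_information(n, len(cases)), 2))
```

The repeated mention of ‘window’ in the first case does not raise its count, and tokens found in only one case receive the highest information value.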

Word frequencies do not provide any information about the relationships between words. It was noted above that, for N-gram analysis, consideration of word order is not necessary; however, it is possible to extend this technique through the inspection of the sequencing of words within a segment of text. Words that are used together (directly) or in proximity may help to inform research hypotheses and practice-orientated hypotheses (Chainey, 2014). The inspection of the words that most frequently precede or succeed a specific word further helps to provide context and to interpret the meaning of word occurrences in the data. Data can be queried to identify collocations and co-occurrences of words within texts. As with individual word frequencies, it may be the rarely occurring or unusual combinations that are of interest.
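The inspection of the words that precede or succeed a specific word can be sketched by counting immediate neighbours. This is a minimal illustration restricted to directly adjacent words; the function name and the example records are assumptions, and proximity within a wider window would need a small extension.

```python
import re
from collections import Counter

def neighbours(segments, target):
    """Count the words immediately before and after each occurrence of target."""
    before, after = Counter(), Counter()
    for segment in segments:
        words = re.findall(r"[a-z'_]+", segment.lower())
        for pos, word in enumerate(words):
            if word != target:
                continue
            if pos > 0:
                before[words[pos - 1]] += 1
            if pos + 1 < len(words):
                after[words[pos + 1]] += 1
    return before, after

# Invented example records, not taken from the thesis data.
records = [
    "Rear window smashed with brick.",
    "Front window smashed, rear door forced.",
    "Rear window forced open.",
]
before, after = neighbours(records, "window")
print("before:", before.most_common())
print("after:", after.most_common())
```

Sorting the same counts in ascending order would surface the rarely occurring combinations, which may be the ones of greatest interest.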
