In sections 3.2 – 3.4 we have focused on the different functions which corpus tools fulfil. We now look in more detail at software architectures.

In the early days of corpus software development, the typical case was a program designed and written ‘in house’ at the users’ institution, intended to perform a single task. Naturally enough, some of this software became widely used and distributed, and provided a model for further software development. The next development was the ‘corpus workbench’, consisting of a group of programs. We have already seen, in section 3.4, the CLAN software (written by Leonid Spektor of Carnegie Mellon University), originally for use with the CHILDES database (MacWhinney and Snow 1990, MacWhinney 1991). A more advanced cluster of the same general kind is the Lexa software suite developed by Hickey (1993a, 1993b), which includes corpus pre-processing, annotation, and text retrieval. These ‘toolkits’ take quite a significant step from single-function to multi-function software development, the latter also illustrated by Brodda’s (1991) PC Beta software.

After the move from single-task to multi-task software development, the next logical step is to aim for a modular, integrated architecture. The development of tools to build and exploit corpora which may run to hundreds of millions of words is an expensive task in terms of time and money. It is hardly surprising, therefore, that concepts such as reusability have been adapted to the field of corpus-based language engineering from the field of software engineering. A useful metaphor here is ‘software Lego’. Programming practices should allow small programs to be slotted together to form larger and altogether more useful programs according to need. Developing software for new functions then need not require going back to the drawing board: a couple of pieces of ‘Lego’ fitted to the existing architecture may be all that is required. Two initiatives which have this modular type of design are (a) the MULTEXT project (as previously mentioned in section 3.2) and (b) the GATE architecture (Cunningham et al. 1996) developed at Sheffield in the UK. In the MULTEXT work, as in related work at Edinburgh (Thompson and McKelvie 1996), the unifying principle is that it should be possible for a text stream in a standard (SGML-based) format to be pipelined between any one module and another without hindrance. Cunningham et al. (2000) describe the various software requirements that guided the implementation of GATE.
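The ‘software Lego’ idea can be illustrated with a minimal sketch in Python. It is not taken from MULTEXT or GATE: the module names (`tokenise`, `lowercase`, `mark_up`) and the `<w>`/`<num>` elements are hypothetical, chosen only to show small programs that each read a text stream and pass a new stream on to whichever module comes next.

```python
from functools import reduce
from typing import Callable, Iterable, Iterator

# A module is any function that consumes a text stream and yields a new one.
# Because every module shares this shape, modules can be slotted together freely.
Module = Callable[[Iterable[str]], Iterator[str]]


def tokenise(lines: Iterable[str]) -> Iterator[str]:
    """Hypothetical module: split each line on whitespace, one token per item."""
    for line in lines:
        yield from line.split()


def lowercase(tokens: Iterable[str]) -> Iterator[str]:
    """Hypothetical module: normalise case."""
    for token in tokens:
        yield token.lower()


def mark_up(tokens: Iterable[str]) -> Iterator[str]:
    """Hypothetical module: wrap each token in a simple SGML-style element."""
    for token in tokens:
        element = "num" if token.isdigit() else "w"
        yield f"<{element}>{token}</{element}>"


def pipeline(*modules: Module) -> Module:
    """Chain modules so that the output stream of one feeds the next."""
    def run(stream: Iterable[str]) -> Iterator[str]:
        return reduce(lambda current, module: module(current), modules, stream)
    return run


run = pipeline(tokenise, lowercase, mark_up)
result = list(run(["The corpus has 100 words"]))
print(result)
```

Adding a new function then means writing one more module with the same stream-in, stream-out shape and naming it in the call to `pipeline`; the existing modules are left unchanged.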

The openNLP57 initiative has some overlap with GATE and is intended to act as a coordinating structure for several open source projects in Natural Language Processing.

3.6 Summary

In this chapter, the tools needed for the creation and exploitation of corpora, in particular annotated corpora, have been categorised into three major groups: corpus development (the input of information into a corpus), corpus editing (changing information in a corpus), and information extraction (the output of information from a corpus). We have looked at features and given examples of software in each of the three groups, focussing particularly on software falling into the third category.

Choosing one package over another involves decisions about machine operating system type, as not many packages are supported across the main platforms (UNIX, Linux, PC DOS, PC Windows, Apple Macintosh). Considerable advantage can be gained by using web interfaces and off-the-shelf software such as commercial database packages. Making use of a web interface for corpus software will save the end-user some of the cost of the learning curve in adopting new software, since they will usually be familiar with web browsers which provide access from most platforms. There will often be no extra software to install for the end-user since web browsers are pre-installed along with the operating system. Pioneering concordance services have been provided using the web interface to Stuttgart’s xkwic58, or the simple search of BNC Online59, TACTweb60 (Bradley and Rockwell 1995, Rockwell et al 1997) and BNCweb61 (Lehmann et al, 2000). These, however, usually require separate server machines, and in the case of xkwic, for example, this server is limited to a Unix/Linux operating system. An obvious disadvantage of this approach is the requirement that the user’s computer is connected to a suitable network with access to the corpus server. We mentioned one instance of commercial database packages in section 3.2, in discussing corpus storage: the database of the Spoken English Corpus. The database architecture has the advantage of using the fast indexing and data management functions already available in a commercial database package. Not all the software in the corpus toolbox has to have been developed for, and dedicated to, corpus-based research.
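As a rough illustration of the database approach, the sketch below stores a tiny annotated corpus in a relational table and relies on the database’s own indexing for retrieval. It uses Python’s built-in `sqlite3` module purely as a stand-in for a commercial package; the table layout, the sample tokens, the tag labels and the helper names `frequency_list` and `positions_of` are all assumptions made for the example, not a description of the Spoken English Corpus database.

```python
import sqlite3

# In-memory database standing in for a commercial database package.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE token (position INTEGER PRIMARY KEY, word TEXT, tag TEXT)"
)
# The index lets the database locate every occurrence of a word quickly.
conn.execute("CREATE INDEX idx_token_word ON token (word)")

# Hypothetical sample data: (word, part-of-speech tag) pairs in text order.
sample = [
    ("the", "DET"), ("cat", "NOUN"), ("sat", "VERB"),
    ("on", "PREP"), ("the", "DET"), ("mat", "NOUN"),
]
conn.executemany("INSERT INTO token (word, tag) VALUES (?, ?)", sample)


def frequency_list(connection):
    """Return (word, frequency) rows, most frequent first, ties in A-Z order."""
    query = (
        "SELECT word, COUNT(*) AS freq FROM token "
        "GROUP BY word ORDER BY freq DESC, word"
    )
    return connection.execute(query).fetchall()


def positions_of(connection, word):
    """Return the corpus positions at which a word occurs (via the index)."""
    query = "SELECT position FROM token WHERE word = ? ORDER BY position"
    return [row[0] for row in connection.execute(query, (word,))]


print(frequency_list(conn))
print(positions_of(conn, "the"))
```

The point of the design is that counting, sorting and lookup are delegated to general-purpose database functions instead of being reimplemented in dedicated corpus software.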

58 Available for browsing (using username and password) the ICAME corpus collection online at http://www.hd.uib.no/icame.html and the Slovene concordance service at http://nl2.ijs.si/corpus/ provided by Tomaž Erjavec.

59 BNC Online simple search is located at http://sara.natcorp.ox.ac.uk/lookup.html

60 See website at http://tactweb.humanities.mcmaster.ca/

The functionality and usability of search and retrieval packages have been enhanced over recent years to the extent that a number of quite sophisticated functionalities are now commonplace and expected. In this chapter, we have summarised the inclusion or exclusion of twelve important features in nine of the most widely cited retrieval software packages in corpus linguistics and related research. Many of the tools are very capable of producing word frequency lists and KWIC concordances. However, only one (WordSmith) is capable of statistical comparison of word frequency lists. None of the tools combine the annotation-awareness capability with the comparison of frequency lists. It is this combination of two features that we see as vital in defining a practical data-driven approach as discussed in section 1.1. In section 4.4 of the next chapter, we will show a worked example to illustrate why this combination of features is particularly useful in corpus studies.
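To make ‘statistical comparison of word frequency lists’ concrete, the following Python sketch ranks items by how strongly their frequencies differ between two corpora. The choice of statistic is an assumption: the text above does not name one, and the log-likelihood ratio is used here only because it is one commonly applied measure. The function names are hypothetical and the sketch does not reproduce the behaviour of any package surveyed.

```python
import math
from collections import Counter


def log_likelihood(freq_a, size_a, freq_b, size_b):
    """Log-likelihood ratio for one item seen freq_a times in a corpus of
    size_a tokens and freq_b times in a corpus of size_b tokens."""
    total = freq_a + freq_b
    # Expected frequencies if the item were spread evenly across both corpora.
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    score = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            score += observed * math.log(observed / expected)
    return 2 * score


def compare_frequency_lists(items_a, items_b):
    """Rank every item by how strongly its frequency differs between corpora."""
    counts_a, counts_b = Counter(items_a), Counter(items_b)
    size_a, size_b = len(items_a), len(items_b)
    scores = {
        item: log_likelihood(counts_a[item], size_a, counts_b[item], size_b)
        for item in counts_a.keys() | counts_b.keys()
    }
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


# Toy data only; real frequency lists would come from full corpora.
ranked = compare_frequency_lists(
    "the the the cat".split(), "the dog dog dog".split()
)
print(ranked[0][0])
```

Because the comparison only counts items, the same functions could be given annotation labels (for example part-of-speech tags) in place of word forms, which is one simple way of combining annotation-awareness with frequency list comparison.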