Chapter II Autonomous learning
5. Presentation of the final project
Gathering and analysing data is vital when it comes to surveillance. Electronic surveillance, both in the disciplinary and liberal sense, typically generates large amounts of data. For the most part these data are stored in databases for future reference. As a result the amount of surveillance data available in countless disparate data sources worldwide is immense and growing at an exponential rate. Though these data might contain a wealth of knowledge, information overload and a lack of integration between databases make it hard to discover that knowledge. In order to enhance the effectiveness of surveillance, technologies for integrating databases and finding information contained therein are in increasing demand. The popular term for these technologies, while not entirely accurate, is data mining.
4.1.1 Implementation
Data mining is a technology whereby useful information is mined from large quantities of data, much like the process of extracting minerals from the earth. Over the past two decades it has evolved from an experimental technology into an important instrument to help overcome the problem of information overload. Data mining allows the automatic analysis of databases and the recognition of important trends and behavioural patterns (Ména 2004, p. 29). A data mining exercise differs from a standard database query as it is aimed at finding previously unknown information in existing data. A standard database query returns information consisting of data from individual fields or records contained in the database (Taipale 2003, p. 22). The answer to a standard database query is always explicit, because it is a data item in the database. By contrast, data mining is aimed at finding implicit information, such as patterns or relations in data, that were not previously identified and are thus not themselves data items (Taipale 2003, p. 23).
While the term data mining is used to describe the entire process of knowledge discovery in databases, it is actually only a step in the process. The broader process of finding useful information in large quantities of data is known as knowledge discovery in databases (KDD). A definition of knowledge discovery in databases is: “the nontrivial extraction of implicit, previously unknown, and potentially useful information from data” (Frawley et al. 1992, p. 58). Still, data mining and knowledge discovery in databases have roughly become synonyms, with data mining being the most widely used term.
We can break the knowledge discovery process down into several distinct phases (Sietsma, Verbeek, Van den Herik 2002, p. 23).
1 Pre-processing
The first phase in the knowledge-discovery process is pre-processing. This stage involves steps such as goal definition, data collection, selection, and warehousing. When the goals of the data-mining exercise have been defined, the necessary data can be selected and collected. In traditional data mining, data collected from various sources is assembled into a single dataset that is stored in a data warehouse. The data-mining algorithms are applied to this data warehouse. But before the collected data can be mined effectively it must be cleansed: errors must be removed and missing fields must be completed.
2 Data mining
The actual data mining itself involves the application of particular algorithms to the data warehouse in order to elicit, identify, or discover certain previously unknown characteristics of the data, including descriptive and predictive patterns or relationships (Fayyad et al. 1996).
3 Post-processing
Post-processing consists of interpreting and evaluating the discovered patterns and determining their usefulness within the applicable domain context (Taipale 2003, p. 30). An important element of post-processing is determining to whom the results of the knowledge-discovery process should be addressed (Sietsma, Verbeek, Van den Herik 2002, p. 24).
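The three phases described above can be illustrated with a short Python sketch. It is a minimal example under stated assumptions, not the procedure of any of the cited authors: the record layout, the function names (`preprocess`, `mine_pairs`, `postprocess`), the cleansing rules, and the choice of pair co-occurrence counting as the mining step are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def preprocess(sources):
    """Phase 1: assemble records from several sources and cleanse them.

    Assumed rules: a record without a subject counts as an error and is
    dropped; a missing item list is completed with an empty list.
    """
    warehouse = []
    for source in sources:
        for record in source:
            if not record.get("subject"):
                continue  # remove the erroneous record
            items = record.get("items") or []  # complete the missing field
            warehouse.append(
                {"subject": record["subject"], "items": sorted(set(items))}
            )
    return warehouse


def mine_pairs(warehouse):
    """Phase 2: count how often two items occur together in one record."""
    counts = Counter()
    for record in warehouse:
        counts.update(combinations(record["items"], 2))
    return counts


def postprocess(counts, min_support):
    """Phase 3: keep only the patterns frequent enough to be reported."""
    return {pair: n for pair, n in counts.items() if n >= min_support}


source_a = [
    {"subject": "s1", "items": ["flight", "car rental"]},
    {"subject": None, "items": ["flight"]},  # erroneous: no subject
]
source_b = [
    {"subject": "s2", "items": ["car rental", "flight", "hotel"]},
    {"subject": "s3", "items": None},  # missing field
]

warehouse = preprocess([source_a, source_b])
patterns = postprocess(mine_pairs(warehouse), min_support=2)
print(patterns)
```

The pair that survives post-processing is not stored as a data item in either source; it only becomes visible once the cleansed records are analysed together, which is the sense in which the result is implicit information.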
Mining data from distributed sources
So far, we have discussed the classic, centralised data-mining process, but this approach is not always feasible. Advances in computing have resulted in countless heterogeneous and distributed data sources. Oftentimes the data in these data sources cannot be gathered into a single repository for processing due to privacy concerns, problems with scalability, or the fact that owners of the databases do not wish to share an entire dataset. Furthermore, the structure of these databases may differ and the data contained therein is not necessarily consistent. The field of distributed data mining (DDM) deals with this problem. In a distributed data-mining approach most of the processing takes place at the local database, with only aggregate data being sent to a central server.
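The division of labour in a distributed approach can be sketched as follows. This is an illustrative sketch only: the function names `local_summary` and `central_merge`, the example records, and the use of simple value counts as the aggregate are assumptions made for the example.

```python
from collections import Counter


def local_summary(records, field):
    """Runs at the site that owns the data; returns counts, never rows."""
    return Counter(r[field] for r in records if field in r)


def central_merge(summaries):
    """Runs at the central server; it only ever sees aggregate counts."""
    total = Counter()
    for summary in summaries:
        total.update(summary)
    return total


site_1 = [{"name": "A", "city": "Leiden"}, {"name": "B", "city": "Utrecht"}]
site_2 = [{"name": "C", "city": "Leiden"}, {"name": "D"}]  # inconsistent record

merged = central_merge(local_summary(site, "city") for site in (site_1, site_2))
print(merged.most_common())
```

In this sketch the names in the individual records never leave their site; the central server receives one small counter per database, whatever the size of the underlying dataset.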
Agent technology can play a key role in distributed data mining. In an agent-enabled distributed data-mining approach, software agents are sent to prepare and mine databases in different remote locations over open and closed networks (Ména 2004, p. 30). Software agents act as mediators between different data sources by providing logical and semantic interoperability. Furthermore, software agents can aid subject-based inquiries by removing the need for manually querying different data sources. In this approach the agent functions as a query broker and handler that can continuously, or at predefined intervals, query remote data sources in (near) real-time without the presence of a human operator.
One way in which agents may perform such tasks is by associating with every database a database agent that is responsible for accessing the database and by associating with every client a client agent that is responsible for gathering the information requested by the client. These agents can provide the client with the requested information. The agents that make up a multi-agent system collaborate in order to: (1) determine which agent can provide the requested information, (2) map a client’s request onto database queries, (3) combine the information in such a way that the communication load is minimised, and (4) deal with inconsistencies among different databases.
Ontologies play an important role in multi-agent systems. An ontology provides a formal specification of a shared conceptualisation. The ontology can therefore be used for producing a description of the knowledge stored in a database. Such a description is necessary to determine which database can provide which part of the requested information. Moreover, the ontology needed for specifying the database content must be closely tied to the ontology used for communication and the ontology of the language in which the client formulates his/her request (Van den Herik, Wiesman and Roos 2001).
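The interplay between client agents, database agents, and an ontology can be made concrete with a small sketch. The classes `DatabaseAgent` and `ClientAgent`, the column names, and the example data are hypothetical; the ontology is reduced here to a plain mapping from shared concepts to local column names. The sketch covers only the first two collaboration tasks (finding an agent that can answer and mapping the request onto a local query); minimising the communication load and resolving inconsistencies are left out.

```python
from dataclasses import dataclass


@dataclass
class DatabaseAgent:
    """Wraps one database and describes its content in shared concepts."""

    name: str
    rows: list
    ontology: dict  # shared concept -> local column name

    def can_answer(self, concept):
        return concept in self.ontology

    def query(self, concept, key_concept, key_value):
        # Translate the shared concepts into this database's own columns.
        key_col = self.ontology[key_concept]
        col = self.ontology[concept]
        return [r[col] for r in self.rows if r.get(key_col) == key_value]


class ClientAgent:
    """Gathers the requested information on behalf of one client."""

    def __init__(self, database_agents):
        self.database_agents = database_agents

    def request(self, concepts, key_concept, key_value):
        answer = {}
        for concept in concepts:
            for agent in self.database_agents:
                # Task 1: find an agent whose ontology covers the request.
                if agent.can_answer(concept) and agent.can_answer(key_concept):
                    # Task 2: map the request onto a local database query.
                    answer.setdefault(concept, []).extend(
                        agent.query(concept, key_concept, key_value)
                    )
        return answer


vehicles = DatabaseAgent(
    "vehicles",
    [{"owner_id": "p1", "plate": "AB-12-CD"}, {"owner_id": "p2", "plate": "EF-34-GH"}],
    {"person": "owner_id", "vehicle": "plate"},
)
phones = DatabaseAgent(
    "phones",
    [{"pid": "p1", "msisdn": "555-0100"}],
    {"person": "pid", "phone": "msisdn"},
)

client = ClientAgent([vehicles, phones])
result = client.request(["vehicle", "phone"], "person", "p1")
print(result)
```

The client formulates the request purely in shared concepts (`person`, `vehicle`, `phone`) and never needs to know that one database calls the person identifier `owner_id` and the other `pid`.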
The application of agent technology in the field of distributed data mining is in particular considered for the purposes of homeland security and anti-terrorism (Baird et al. 2003, p. 23). But the private sector also employs (distributed) data mining for purposes such as risk management and market research.
Taipale (2003) distinguishes between three discrete applications for automated analysis of data in the context of domestic security. The first application is subject-oriented link analysis, where data mining is used to learn more about a particular data subject, its relationships, associations, and actions. The second application is pattern analysis, whereby a descriptive or predictive model is developed based on previously discovered patterns. The third application is pattern matching, whereby a predictive or descriptive model is used against new data in order to identify related data subjects such as people, places, relationships, et cetera (Taipale 2003, p. 34). In other words, data mining can be used to conduct either subject-based inquiries or pattern-based inquiries. These different types of inquiries influence privacy and individual liberty in different ways, so I shall discuss them separately. But first I shall elaborate on the value of data mining for law enforcement and anti-terrorism purposes.
In the surveillant assemblage personal data regarding individuals can be obtained from a variety of public and private-sector databases. Agent technology presents ample opportunity for aggregating and integrating information across a variety of heterogeneous and disparate data sources. The power of data aggregation and integration within the surveillant assemblage was clearly shown in the investigation that followed the September 11 terrorist attacks. Within a few days an accurate record of the last days of Mohammed Atta, the alleged ringleader of the September 11 hijackers, was compiled from data gathered from public, but predominantly private-sector, surveillance systems. The data included CCTV footage, credit card receipts, cell phone information, and airline tickets. Of course, since the information on Mohammed Atta was garnered ex post it was of no value in preventing the terrorist attacks. Had the information of these different surveillance systems been aggregated and integrated ex ante it might have led to the discovery of Atta’s plans. This is the idea behind ‘connecting the dots’ and the driving force behind the wish to integrate surveillance systems. As described earlier, connecting the dots is perceived as vital when it comes to combating terrorism and other forms of serious crime. However, intensive aggregation and integration of data also poses a potential risk to privacy and individual liberty.
Subject-based inquiries
A subject-based inquiry is aimed at gaining a more complete picture of a specified data subject (for instance an individual or an organisation). Through a subject-based inquiry additional information regarding a particularised subject such as links, associations, history, and actions can be distilled from the available data.
Pattern-based inquiries
A pattern-based inquiry is a non-particularised search, that is to say, it is not aimed at a particular data subject or data subjects. By mining data in a non-particularised fashion, previously undiscovered relationships between individual data items can be discovered. Pattern analysis may aid law enforcement agencies and the intelligence community in developing descriptive or predictive models of deviant behaviour. A pattern-based inquiry can be used to develop a descriptive model or predictive model based on discovered patterns in existing data (pattern analysis or data mining in a narrow sense). Once a descriptive or predictive model has been developed it can be applied to new data in order to find similar or related data subjects (pattern matching).
The general idea is that the planning and organisation of a terrorist attack (or for that matter any crime that requires sufficient organisation) leaves behind a trail in surveillance data that makes up a distinctive pattern. By matching a discovered pattern to a new data set, suspicious behaviour not previously apparent can be found. By matching the predictive or descriptive model against new data, similar patterns can be detected at an earlier stage, thus enabling a more pro-active method of investigation.
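The distinction between pattern analysis and pattern matching can be illustrated with a deliberately simple sketch. The functions `derive_pattern` and `match_pattern`, the abstract event labels, and the rule that a pattern consists of the events shared by the known cases are assumptions made for this example; real descriptive or predictive models are considerably more elaborate.

```python
def derive_pattern(known_cases, min_share=1.0):
    """Pattern analysis: keep events present in at least min_share of cases."""
    events = set().union(*known_cases)
    return {
        e
        for e in events
        if sum(e in case for case in known_cases) / len(known_cases) >= min_share
    }


def match_pattern(pattern, new_data, threshold=1.0):
    """Pattern matching: flag subjects whose events cover the pattern."""
    if not pattern:
        return {}  # nothing to match against
    hits = {}
    for subject, events in new_data.items():
        score = len(pattern & set(events)) / len(pattern)
        if score >= threshold:
            hits[subject] = score
    return hits


# Existing data: the event trails of two previously investigated cases.
known_cases = [{"e1", "e2", "e3"}, {"e1", "e2", "e4"}]
pattern = derive_pattern(known_cases)

# New data: event trails of subjects that were not part of the analysis.
new_data = {"x": ["e1", "e2", "e9"], "y": ["e1"], "z": ["e2", "e1"]}
hits = match_pattern(pattern, new_data)
print(pattern, hits)
```

Note that the search over `new_data` is non-particularised: no subject is named in advance, and subjects `x` and `z` are singled out only because their trails resemble the derived pattern.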
4.1.2 Current examples
Data mining is used extensively throughout society. The use of data mining does not restrict itself to the private sector but is also prolific in the public sector. For instance, a survey conducted in 2004 by the Government Accountability Office of the United States among 128 federal departments and agencies showed that 52 agencies were using or were planning to use data mining. These departments and agencies reported 199 data-mining efforts, of which 68 were planned and 131 were operational, some of which involved the processing of personal data (GAO Report 04-548). Though data mining was used for law enforcement purposes well before September 11, 2001, the terrorist attacks have definitely acted as a catalyst for the development of data mining in the area of law enforcement and homeland security. I shall restrict myself to giving examples of agent-enabled data-mining applications that are used for these purposes.
COPLINK²
COPLINK is a good example of how an agent-enabled data-mining application can make law enforcement more efficient and effective. COPLINK is a system used by law enforcement agencies in the United States to aid in criminal investigations. The COPLINK system was developed to provide a solution to the lack of integration in law enforcement information systems. COPLINK software organises and analyses vast quantities of structured and seemingly unrelated data, housed in various incompatible databases and record management systems, over an intranet-based platform (Knowledge Computing Corporation 2004). COPLINK integrates different data sources and facilitates subject-based inquiries.
Apart from integrating disparate databases, COPLINK uses a collaboration and notification tool called ‘Active Agent’. This component of the COPLINK system is a tool that can be set to watch for new data meeting user-specified parameters and then automatically notify the user(s) when such data is migrated into COPLINK (Knowledge Computing Corporation 2004). The COPLINK Active Agent thus automates the task of running repetitive or periodic database queries. The Active Agent also allows an investigator to collaborate with others who are conducting similar queries. If collaboration is set as active, the agent notifies other investigators running similar queries. This can quickly bring together incidents involving the same suspect or other database objects that are under investigation by different investigators, or by different jurisdictions (Knowledge Computing Corporation 2004).
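The watch-and-notify behaviour described above can be sketched in general terms. The following is not COPLINK code and does not reflect its actual interface: the classes `StandingQuery` and `Watcher`, the field names, and the simplification that two queries are ‘similar’ only when their criteria are identical are all assumptions made for illustration.

```python
class StandingQuery:
    """A user-specified watch: notify the owner when a new record matches."""

    def __init__(self, owner, criteria, collaborate=False):
        self.owner = owner
        self.criteria = criteria  # field -> required value
        self.collaborate = collaborate

    def matches(self, record):
        return all(record.get(k) == v for k, v in self.criteria.items())


class Watcher:
    """Holds the standing queries and collects the resulting notifications."""

    def __init__(self):
        self.queries = []
        self.notifications = []  # (recipient, message) pairs

    def register(self, query):
        # Collaboration: inform investigators who run the same query,
        # provided both sides have set collaboration as active.
        for other in self.queries:
            if (
                other.criteria == query.criteria
                and other.collaborate
                and query.collaborate
            ):
                self.notifications.append(
                    (other.owner, f"{query.owner} runs a similar query")
                )
                self.notifications.append(
                    (query.owner, f"{other.owner} runs a similar query")
                )
        self.queries.append(query)

    def ingest(self, record):
        # Called whenever new data is migrated into the system.
        for query in self.queries:
            if query.matches(record):
                self.notifications.append((query.owner, f"new match: {record}"))


watcher = Watcher()
watcher.register(StandingQuery("inv_a", {"plate": "XYZ"}, collaborate=True))
watcher.register(StandingQuery("inv_b", {"plate": "XYZ"}, collaborate=True))
watcher.ingest({"plate": "XYZ", "city": "Tucson"})
watcher.ingest({"plate": "OTHER"})

for recipient, message in watcher.notifications:
    print(recipient, "<-", message)
```

The investigator registers the query once; every later record is checked against it automatically, which is what removes the need to rerun the same database query by hand.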
InferAgent³
While COPLINK is designed specifically for law enforcement purposes, many commercial parties also provide ‘off the shelf’ data-mining solutions that can be used for law enforcement and homeland security. One such program is InferX Software’s InferAgent. The InferAgent software suite uses agent technology to look for patterns and behaviours in networks made up of disparate databases. The agent technology used by InferX allows for (1) the automatic analysis of separate, unlinked databases, (2) the recognition of important trends, and (3) the recognition of behavioural patterns. These trends and behaviour patterns may identify suspicious activities and events related to fraud, terrorism, and theft. As conditions change in remote databases, the InferX software agents detect the changes around them and communicate their findings to a centralised controller, allowing for the discovery of potential threats, fraud, and risks.
2 <http://www.coplink.net>
3 <http://www.inferx.com>
ANITA⁴
In the area of subject-based inquiries, the ANITA project being conducted by research groups from the universities of Groningen, Utrecht, Maastricht, and Leiden is of particular interest. The ANITA project aims to design an agent framework wherein administrative agents will decide, based on norms, whether to allow transactions of police data.
Currently, the information infrastructure of the Dutch police does not allow for complicated search queries in the police registries that deal with serious forms of crime. Agent technology can provide a solution to this problem (Koelewijn and Kielman 2006). By setting up a national registry on serious crime (beheersindex) that is only accessible to software agents, a fast and flexible query system is created. Privacy risks are avoided since humans have no access to the system and the software agents that have access to the system base their behaviour on pre-determined rules.
4.1.3 Future
The inability of the intelligence community to predict and prevent the September 11 terrorist attacks underlined the importance of the ability to ‘connect the dots’ when it comes to (surveillance) data. Judging from the number of data-mining applications currently being considered, developed, and deployed, a great deal of faith is placed in data mining to ensure security. Whether this faith is justified remains a subject of debate, but as the ‘war on terrorism’ goes on, the drive towards more effective ways to integrate databases is likely to continue.
It is important to recognise the role of the surveillant assemblage when it comes to the future of data mining. Governments, most notably the United States government, actively pursue ways to use data contained in private-sector databases for public purposes, such as law enforcement and homeland security. The most prominent evidence of this intention is DARPA’s Total Information Awareness initiative. Although the Total Information Awareness programme was discontinued, it did offer us a glimpse into the (planned) future of data mining. Two proposed programmes, the electronic evidence and link discovery (EELD) programme and the GENISYS programme, were aimed at bringing data mining to the next level. Of these two programmes the GENISYS programme is most relevant to the subject matter of this thesis, for it would use agent technology as a primary tool for integrating disparate databases.
4 Administrative Normative Information Transaction Agents (ANITA), NWO/ToKeN project no. 634.000.017.
The GENISYS programme sought to produce technology for integrating and broadening databases and other information sources to support effective intelligence analysis aimed at preventing terrorist attacks on the citizens, institutions, and property of the United States (DARPA 2003, p. 5). The technology to be developed would enable many physically disparate heterogeneous databases to be queried as if they were one logical ‘virtually’ centralised database (DARPA 2003, p. A-11). GENISYS was discontinued in September 2003 along with the other programmes that made up the Total Information Awareness programme. Still, it is interesting to study the GENISYS programme in more depth as it is a prime example of the use of software agents for distributed data mining.
GENISYS would address the problems of current database technologies, which have their roots in the mid-1970s. In a time when processing power, disk space, and bandwidth were expensive, time and space efficiencies were stressed at the expense of flexibility and ease of use, making automation a difficult task. Furthermore, human operators using a database have to know a great deal about the design of a particular database (for instance, how data items compare to real-world objects or people) in order to make sense of the data contained therein (Dyer 2003). An additional problem is the fact that database design differs from database to database, hampering effective integration.
In order to overcome these problems GENISYS was aimed at achieving three separate but related goals. The first goal was to enable the integration and restructuring of existing legacy databases. The second goal was to increase the coverage of vital information by making it easy to create new databases and attach new information feeds automatically. The third and final goal was to create a brand new database technology based on simple, scalable, distributed information stores known as repositories (Dyer 2003).
One of the technologies driving GENISYS would be software-agent mediation. Software agents would relieve human analysts of the difficult tasks of having to know (1) all the details about how to access different databases, (2) the precise definition of terms, (3) the internal structure of the database, (4) how to join information from different sources, and (5) how to optimise queries for performance. Instead, this information would be encoded and managed by software agents (DARPA 2003, p. A-11). In this way mediation agents would provide logical and semantic interoperability of previously disparate data sources.
DARPA’s data-mining efforts within the TIA programme anticipated the further evolution of information and communication technology, a development characterised by a trend towards ubiquity, interconnection, intelligence, delegation, and human-orientation that will continue well into the future. These trends will eventually culminate in the next ICT paradigm, that of ambient intelligence (Ahola 2001). In a world where we are surrounded by ubiquitous computing and intelligence no single part of our lives will by default be able to seclude itself from digitisation (Langheinrich 2001).
Since the data generated by our intelligent environment will dwarf current volumes of data, automated and intelligent processing of data is a prerequisite for effective surveillance. For example, it is estimated that the EPCglobal Net-