Segunda parte: Marco empírico
5.3 Material y método
5.3.4 Descripción de los instrumentos de medida
Conceptual models generally contain attributes that have generic types. These types are well enough descriptive but not specific to any implementation.
Figure 8.9: Transforming a inheritance hierarchy: one table per hi- erarchy
Generic type Description
Boolean Used to represent the logical values of True or False
Currency Used to declare variables capable of holding fixed- point numbers with 15 digits to the left of the decimal point and 4 digits to the right
Date Used to hold date and time values
Double Used to declare variables capable of holding real numbers with 15-16 digits of precision
Integer Used to declare whole numbers with up to 4 digits of precision
Long Used to hold numbers with up to 10 digits of pre- cision
Single Used to declare variables capable of holding real numbers with up to 7 digits of precision
String Used to hold an unlimited number of characters
8.6. Conclusions 131
Generic type Datatype
Boolean Bit
Currency Decimal
Date Date
Double Double Precision Integer Integer
Long Decimal Single Decimal
String Char
Table 8.2: Generic types mapped to ANSI SQL datatypes
In Table 8.1, we show some of the generic types that are commonly used. These generic types are available in mostCASEtools.
In Table 8.2, we show a mapping of the generic types shown in Table 8.1 to specific types defined inStructured Query Language
(SQL)[11]. These table is only an example, because designers should define their own mappings for every project. For example, Integer may be also mapped to Numeric, Decimal, or Smallint.
The previous figures, we have included some mapping from at- tribute types to column datatypes. For example, in Figure 8.5, the Stringtype is transformed intoChar(50)andChar(255), whereas the Currency type is turned into Decimal(10,2). In Figure 8.6 and Fig- ure 8.7, theBooleantype is turned intoChar(1).
8.6
Conclusions
In this chapter, we have explained how we tackle the logical modeling of data sources andDW. Our approach in based on theUML Profile for Database Design [90], created by Rational Software Corporation for use when designing a database. From the conceptual model of the data sources and theDW, we have shown how to proceed to the logical model applying some mappings.
Chapter 9
Modeling ETL Processes
in Data Warehouses
ETL processes are responsible for theextraction of data from het- erogeneous operational data sources, their transformation (conversion, cleaning, normalization, etc.) and their loading into DW. ETL pro- cesses are a key component ofDWbecause incorrect or misleading data will produce wrong business decisions, and therefore, a correct design of these processes at early stages of aDW project is absolutely necessary to improve data quality. However, not much research has dealt with the modeling of ETL processes. In this chapter, we present our approach that allows us to accomplish the modeling of these ETL processes to- gether with the logical schema of the targetDWin an integrated manner. We provide the necessary mechanisms for an easy and quick specification of the common operations defined in theseETL processes such as, the integration of different data sources, the transformation between source and target attributes, the generation of surrogate keys and so on. More- over, our approach allows the designer a comprehensive tracking and documentation of entire ETL processes, which enormously facilitates the maintenance of these processes. Another advantage of our proposal is the use of the UML (standardization, ease-of-use and functionality) and the seamless integration of the design of the ETL processes with the DW logical schema. Finally, we show how to use our integrated
approach by using a well-known modeling tool such as Rational Rose.
Contents
9.1 Introduction . . . 135 9.2 ETL . . . 136 9.3 Modeling ETL processes . . . 137
9.3.1 Aggregation . . . 138 9.3.2 Conversion . . . 140 9.3.3 Log . . . 142 9.3.4 Filter . . . 142 9.3.5 Join . . . 143 9.3.6 Loader . . . 144 9.3.7 Incorrect . . . 145 9.3.8 Merge . . . 145 9.3.9 Wrapper . . . 146 9.3.10 Surrogate . . . 146 9.4 ETL Examples . . . 147
9.4.1 Transform Columns into Rows . . . 148 9.4.2 Merging Two Different Data Sources and
Multi-target Loading . . . 148 9.4.3 Aggregate and Surrogate Key Process . 150
9.1. Introduction 135
9.1
Introduction
Recalling Bill Inmon’s DW definition [57], “A data warehouse is a subject-oriented, integrated, time-variant, nonvolatile collection of data in support of management’s decisions”. ADW is “integrated”
For more infor- mation about Bill Inmon’sdata warehouse def- inition, consult Section 2.1, pp. 13.
because data are gathered into theDWfrom a variety of sources and merged into a coherent whole. ETLprocesses are responsible for the extraction of data from heterogeneous operational data sources, their transformation (conversion, cleaning, normalization, etc.) and their loading intoDW.
It is highly recognized that the design and maintenance of these
ETLprocesses is a key factor of success inDWprojects for several reasons [123, 124]. DWare usually populated by data from different and heterogeneous operational data sources such as legacy systems, relational databases, COBOL files, Internet (XML, web logs) and so on. Therefore, different routines have to be defined and configured for accessing these heterogeneous systems and loading the correct data into the commonDW repository.
Moreover, data from the operational systems are usually specified in different schemas and have to be extracted and transformed to collect them into a commonDW repository [104]. Some of the more common technical tasks that have to be accomplished with these data are as follows. Data coming from different sources have to be joined into a unique target in theDW. Data usually have to be aggregated in order to facilitate the definition of the queries and improve the performance of theDW. Data are usually in different types and for- mats and they need to be converted into a common format. Data in the operational systems are usually managed by different primary keys representing for example, product or store codes and so on. In
DW we usually use surrogate keys, and therefore, we need an effi- cient mechanism to assign surrogate keys to the operational data in theDW repository. Furthermore, as data are coming from different sources, we usually need to check the different primary and foreign keys to assure a high quality of data. Moreover, we also need a high number of filters to verify the right data to be uploaded in theDW
and many more problems.
Due to the high difficulty in designing and managing these ETL
processes, there has lately been a proliferation in the number of avail- ableETLtools that try to simplify this task [45, 67]. During 2001, theETLmarket grew to about $667 million [5]. Currently, compa- nies expend more than thirty percent out of the total budget forDW
projects in expensiveETLtools, but “It’s not unusual for the ETL effort to occupy 60 percent to 80 percent of a data warehouse’s im- plementation effort” [124]. Nevertheless, it is widely recognized that the design and maintenance of theseETLprocesses has not yet been
solved [5].
Therefore, we argue that a model and method is needed to help the design and maintenance of these ETL processes from the early stages of a DW project; as Kimball states, “Our job as data ware- house designers is to star with existing sources of used data” [63]. Unfortunately, little effort has been dedicated to propose a logical model that allows us to formally define theseETLprocesses.
In this chapter, we present a logical model based on theUML[97] for the design of ETLprocesses which deals with the more common technical problems above-presented. As the UMLhas been widely accepted as the standard forOOanalysis and design, we believe that our approach will minimize the efforts of developers in learning new diagrams or methods for modelingETLprocesses. Furthermore, the logical modeling of these ETLprocesses is totally integrated in our global approach. Therefore, our approach reduces the development time of aDW, facilitates managing data repositories,DWadminis- tration, and allows the designer to perform dependency analysis (i.e. to estimate the impact of a change in the data sources in the global
DW schema).
The rest of this chapter is organized as follows. Section 9.2 pro- vides an overview of ETLprocesses and their surrounding data qual- ity problems. Section 9.3 describes in detail how to accomplish the logical modeling of ETLprocesses using our proposal. Then, Sec- tion 9.4 shows how some interestingETLproblems are solved apply- ing our approach. Finally, Section 9.5 presents the main conclusions.
9.2
ETL
In an ETL process, the data extracted form a source system pass through a sequence of transformations before they are loaded into a
DW. The repertoire of source systems that contribute data to aDW
is likely to vary from standalone spreadsheets to mainframe-based systems many decades old. Complex transformations are usually im- plemented in procedural programs, either outside the database (in C, Java, Pascal, etc.) or inside the database (by using any 4GL). The design of anETLprocess is usually composed of six tasks:
1. Select the sources for extraction: the data sources to be used in theETLprocess are defined. It is very common in anETL
process to access different heterogeneous data sources.
2. Transform the sources: once the data have been extracted from the data sources, they can be transformed or new data can be derived. Some of the common tasks of this step are: filtering data, converting codes, performing table lookups, calculating