
We know which data we want to pull into the data warehouse, and we know the destination tables for that data. There are several bits of information to be logged during an ETL process. First, you'll track the data count (the number of rows) transferred at each step in the process. This is useful information for a few good reasons:

We can make sure no rows are "lost" in the ETL process.

We can detect abnormal data; if we process 1000 rows one day, but the next day we only process 10, we'll know that something probably went wrong.

In the future we can use this captured data to predict future capacity needs.

Other useful bits of information are the start and end times of the process:

They can be used to alert us to problems.

They also allow us to plan for future capacity.
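The row-count checks listed above can be sketched in a few lines of Python. This is only an illustration, not part of the course's ETL tooling: the function names, the per-step dictionary of counts, and the 50% drop threshold are all assumptions made for the example.

```python
def find_lost_rows(step_counts):
    """Return (step, previous_count, count) for each step with fewer rows than the step before it."""
    losses = []
    steps = list(step_counts.items())
    for (_, prev_count), (name, count) in zip(steps, steps[1:]):
        if count < prev_count:
            losses.append((name, prev_count, count))
    return losses


def is_abnormal(today_count, previous_count, max_drop=0.5):
    """Flag a run whose row count fell by more than max_drop (a fraction) versus the previous run."""
    if previous_count == 0:
        return False  # nothing to compare against
    drop = (previous_count - today_count) / previous_count
    return drop > max_drop


# Hypothetical counts logged at each step of one run.
counts = {"extract": 1000, "transform": 1000, "load": 990}
print(find_lost_rows(counts))   # [('load', 1000, 990)]
print(is_abnormal(10, 1000))    # True
print(is_abnormal(950, 1000))   # False
```

A step that filters rows on purpose (for example, excluding test accounts) will also show up as a "loss" here, so a real check would need to account for expected drops.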

Logging is not always complex. While we could use a single table to track this information, we'll split it into two tables: etlRuns and etlLog. Switch to a terminal, log into your account, then connect to your personal database. Run this command against your personal database:

CODE TO TYPE:

CREATE TABLE etlRuns (
  run_id integer NOT NULL AUTO_INCREMENT,
  start_time datetime NOT NULL,
  end_time datetime,
  PRIMARY KEY(run_id)
);

Next we'll create the etlLog table, which will be used to log messages and statistics. Many of these columns are TOS-specific (Talend Open Studio-specific; we'll explain more about Talend in the next lesson). We will see them again when we implement logging in a later lesson. Some of this information won't be useful for every warehouse; it is up to you to decide the amount and type of logging you need. Run the following command against your personal database:

CODE TO TYPE:

CREATE TABLE etlLog (
  run_id integer NOT NULL,
  moment datetime NOT NULL,
  pid varchar(20),
  father_pid varchar(20),
  root_pid varchar(20),
  system_pid double,
  project varchar(50),
  job varchar(50),
  job_repository_id varchar(255),
  job_version varchar(255),
  context varchar(50),
  priority int,
  origin varchar(255),
  message_type varchar(255),
  message varchar(255),
  code int,
  duration double,
  count int,
  reference int,
  thresholds varchar(255),
  key(run_id)
);

With these tables in place, we are ready to tackle auditing.
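To make the role of the two tables concrete, here is a minimal Python sketch of a job recording its row counts. It is an illustration under several assumptions: it uses an in-memory SQLite database as a stand-in for your personal MySQL database, it keeps only a handful of the etlLog columns, and the helper log_count, the job name load_dimCustomer, and the 'STAT' message type are made up for the example.

```python
import sqlite3
from datetime import datetime

conn = sqlite3.connect(":memory:")  # stand-in for the personal MySQL database
conn.executescript("""
    CREATE TABLE etlRuns (
        run_id     INTEGER PRIMARY KEY AUTOINCREMENT,
        start_time TEXT NOT NULL,
        end_time   TEXT
    );
    CREATE TABLE etlLog (            -- only a subset of the lesson's columns
        run_id       INTEGER NOT NULL,
        moment       TEXT NOT NULL,
        job          TEXT,
        message_type TEXT,
        message      TEXT,
        "count"      INTEGER
    );
""")


def log_count(conn, run_id, job, message, count):
    """Insert one statistics row into etlLog for the given run (hypothetical helper)."""
    moment = datetime.now().isoformat(sep=" ", timespec="seconds")
    conn.execute(
        'INSERT INTO etlLog (run_id, moment, job, message_type, message, "count") '
        "VALUES (?, ?, ?, 'STAT', ?, ?)",
        (run_id, moment, job, message, count),
    )


# Open a run, then log the number of rows seen at two steps of a job.
cur = conn.execute("INSERT INTO etlRuns (start_time) VALUES (datetime('now'))")
run_id = cur.lastrowid
log_count(conn, run_id, "load_dimCustomer", "rows extracted", 589)
log_count(conn, run_id, "load_dimCustomer", "rows loaded", 589)

rows = conn.execute(
    'SELECT message, "count" FROM etlLog WHERE run_id = ? ORDER BY rowid', (run_id,)
).fetchall()
print(rows)  # [('rows extracted', 589), ('rows loaded', 589)]
```

Every log row carries the run_id, so all messages and counts from one execution can be pulled back together later.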

Why do we need auditing? Suppose we come into work one day to find that our daily sales jumped overnight from $10,000 to $1,000,000. You know that the company did not sell $1,000,000 in one day, but how do you track down the problem?

You can use the auditing features of the data warehouse to debug the problem. Auditing allows us to link rows in tables with specific "runs" via the run_id column in etlLog. Each row in our fact and dimension tables will have a run_id column, letting us know exactly when that data was added to the warehouse.

We implemented our dimension and fact tables in the prior lesson, and those tables don't have any columns related to auditing. We did create some dimensions as type 2 SCDs, but they can't be used for auditing purposes. Instead we'll need to add a run_id column to all of those tables.

CODE TO TYPE:

-- Add a run_id audit column to each dimension and fact table.
-- Note: MySQL parses but ignores an inline REFERENCES clause, so these
-- statements document the link to etlRuns without enforcing it.
ALTER TABLE dimCustomer ADD run_id int not null REFERENCES etlRuns(run_id);
ALTER TABLE dimMovie ADD run_id int not null REFERENCES etlRuns(run_id);
ALTER TABLE dimStore ADD run_id int not null REFERENCES etlRuns(run_id);
ALTER TABLE dimStaff ADD run_id int not null REFERENCES etlRuns(run_id);
ALTER TABLE factSales ADD run_id int not null REFERENCES etlRuns(run_id);
ALTER TABLE factCustomerCount ADD run_id int not null REFERENCES etlRuns(run_id);
ALTER TABLE factRentalCount ADD run_id int not null REFERENCES etlRuns(run_id);
ALTER TABLE factRentalDuration ADD run_id int not null REFERENCES etlRuns(run_id);
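Once every row carries a run_id, the $1,000,000 mystery can be investigated by grouping the suspect day's sales by run. The sketch below is only an illustration: it uses an in-memory SQLite database, a simplified factSales table with made-up columns (sale_date, amount), and invented sample data in which one day was accidentally loaded by two different runs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the warehouse database
conn.executescript("""
    CREATE TABLE etlRuns (run_id INTEGER PRIMARY KEY, start_time TEXT, end_time TEXT);
    -- Simplified, hypothetical fact table: the real factSales has more columns.
    CREATE TABLE factSales (sale_date TEXT, amount REAL, run_id INTEGER NOT NULL);

    INSERT INTO etlRuns VALUES (1, '2008-05-20 01:00:00', '2008-05-20 01:10:00');
    INSERT INTO etlRuns VALUES (2, '2008-05-21 01:00:00', '2008-05-21 01:12:00');
    INSERT INTO etlRuns VALUES (3, '2008-05-21 03:00:00', '2008-05-21 03:11:00');

    INSERT INTO factSales VALUES ('2008-05-19', 10000.0, 1);
    INSERT INTO factSales VALUES ('2008-05-20', 10000.0, 2);
    INSERT INTO factSales VALUES ('2008-05-20', 990000.0, 3);
""")

# Break the suspicious day's total down by the run that loaded each row.
by_run = conn.execute("""
    SELECT f.run_id, r.start_time, COUNT(*), SUM(f.amount)
    FROM factSales f
    JOIN etlRuns r ON r.run_id = f.run_id
    WHERE f.sale_date = '2008-05-20'
    GROUP BY f.run_id, r.start_time
    ORDER BY f.run_id
""").fetchall()

for row in by_run:
    print(row)
# (2, '2008-05-21 01:00:00', 1, 10000.0)
# (3, '2008-05-21 03:00:00', 1, 990000.0)
```

In this invented data the breakdown points straight at run 3, which tells us which execution (and which etlLog messages) to examine next.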

Note

We won't add auditing to dimDate since it will only be loaded once.

ETL processes themselves are typically broken into three parts:

1. Initial housekeeping, such as creating a "run" or clearing temp files and tables.
2. Extracting, transforming, and loading data.
3. Final housekeeping, such as ending a "run," sending email, or clearing temp files and tables.
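The three parts can be pictured as a small driver function. This is a hedged Python sketch, not how the course's ETL tool structures its jobs: run_etl and the placeholder step functions are hypothetical names, and the steps only record what they would do.

```python
def run_etl(start_run, load_steps, end_run):
    """Run the three phases in order and return the run id used."""
    run_id = start_run()          # 1. initial housekeeping: open a "run"
    for step in load_steps:       # 2. extract, transform, and load
        step(run_id)
    end_run(run_id)               # 3. final housekeeping: close the "run"
    return run_id


# Placeholder steps that only record what they would do.
events = []


def start_run():
    events.append("start")
    return 1


def load_customers(run_id):
    events.append(f"load dimCustomer (run {run_id})")


def load_movies(run_id):
    events.append(f"load dimMovie (run {run_id})")


def end_run(run_id):
    events.append("end")


run_etl(start_run, [load_customers, load_movies], end_run)
print(events)
# ['start', 'load dimCustomer (run 1)', 'load dimMovie (run 1)', 'end']
```

If a step raises an exception, this sketch leaves the run open. Whether a failed run should be closed or resumed is a design decision for your own warehouse.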

To do the initial housekeeping we will use a stored procedure called etl_StartRun. This procedure will be used to populate the etlRuns table and return the run_id to be used in all ETL processes. It will return the same run_id each time it is called, until the corresponding "final housekeeping" procedure etl_EndRun is called. Run this command against your personal database:

CODE TO TYPE:

DELIMITER //
CREATE PROCEDURE etl_StartRun()
BEGIN
  DECLARE current_run_id INTEGER;

  -- Look for a run that has been started but not yet ended.
  SELECT max(run_id) INTO current_run_id
  FROM etlRuns
  WHERE end_time IS NULL;

  -- No open run: create one and remember its id.
  IF current_run_id IS NULL THEN
  BEGIN
    INSERT INTO etlRuns (start_time) VALUES (now());
    SELECT LAST_INSERT_ID() INTO current_run_id;
  END;
  END IF;

  SELECT 'run_id' AS "key", current_run_id AS value;
END;
//
DELIMITER ;

With that procedure out of the way, we can think about the last part of the process: a procedure to perform final housekeeping. Run this command against your personal database:

CODE TO TYPE:

DELIMITER //
CREATE PROCEDURE etl_EndRun()
BEGIN
  -- Close every run that is still open.
  UPDATE etlRuns SET end_time = now() WHERE end_time IS NULL;
END;
//
DELIMITER ;
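The behavior of the two procedures (the same run_id is handed out until the run is ended) can be checked with a small Python sketch. It mirrors the logic above using an in-memory SQLite database as a stand-in for MySQL; the functions start_run and end_run are illustrative equivalents, not the stored procedures themselves.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the personal MySQL database
conn.execute("""
    CREATE TABLE etlRuns (
        run_id     INTEGER PRIMARY KEY AUTOINCREMENT,
        start_time TEXT NOT NULL,
        end_time   TEXT
    )
""")


def start_run(conn):
    """Return the open run's id, creating a new run only if none is open."""
    (run_id,) = conn.execute(
        "SELECT max(run_id) FROM etlRuns WHERE end_time IS NULL"
    ).fetchone()
    if run_id is None:
        cur = conn.execute("INSERT INTO etlRuns (start_time) VALUES (datetime('now'))")
        run_id = cur.lastrowid
    return run_id


def end_run(conn):
    """Close every open run by stamping its end_time."""
    conn.execute("UPDATE etlRuns SET end_time = datetime('now') WHERE end_time IS NULL")


first = start_run(conn)
again = start_run(conn)    # same run: it has not been ended yet
end_run(conn)
second = start_run(conn)   # the previous run is closed, so a new one is created
print(first, again, second)  # 1 1 2
```

Because every job in one execution receives the same id, all rows loaded together can later be traced back to a single entry in etlRuns.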

That looks great! Now we're ready to look at our source data.

Getting Data into the Warehouse

Now that we have our data warehouse set up, it's time to review our source data. We defined the data we're putting into our warehouse in the previous lessons, and we have some understanding of the data source. But do we know everything about our source data? What information can we provide for our business users?

Part of ETL is Transformation: cleaning and transforming source data so it's easier to understand and more useful. Transformation can alter source data to make it better by:

Changing customer status codes, such as O, D, C to "OK," "Deleted Customer," and "Account in Collections."

Handling known source data errors or test data, such as all accounts with the prefix "TST_01."

Splitting data in one column into multiple columns, for example, splitting a single field for "R:2008-05-20" into "Rental" and "2008-05-20."
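The three kinds of transformation listed above can be sketched as a single Python function. This is an illustration only: the field names (account, status, event), the EVENT_TYPES lookup, and the "Unknown" fallback for unmapped codes are assumptions, not rules from any real source system.

```python
STATUS_LABELS = {"O": "OK", "D": "Deleted Customer", "C": "Account in Collections"}
EVENT_TYPES = {"R": "Rental"}  # hypothetical lookup for the one-letter prefix


def transform(row):
    """Apply the three example rules to one source row (a dict); return None to drop the row."""
    if row["account"].startswith("TST_01"):
        return None  # known test data: exclude it from the warehouse
    type_code, _, date_text = row["event"].partition(":")  # "R:2008-05-20" -> "R", "2008-05-20"
    return {
        "account": row["account"],
        "status": STATUS_LABELS.get(row["status"], "Unknown"),
        "event_type": EVENT_TYPES.get(type_code, type_code),
        "event_date": date_text,
    }


source_rows = [
    {"account": "ACME_17", "status": "D", "event": "R:2008-05-20"},
    {"account": "TST_01_A", "status": "O", "event": "R:2008-05-21"},
]
cleaned = [t for t in map(transform, source_rows) if t is not None]
print(cleaned)
# [{'account': 'ACME_17', 'status': 'Deleted Customer', 'event_type': 'Rental', 'event_date': '2008-05-20'}]
```

Keeping the mappings in plain dictionaries makes them easy to compare against the documented transformation rules.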

Often business analysts and users are responsible for determining and documenting transformation and mapping rules. Other times, those responsibilities fall upon the programmer. But in any of those situations, it's important to have clear documentation. We want no confusion about the reasons customer codes of "D" are getting translated to "Deleted Customer" in the data warehouse.

Note

Most companies have data scattered across many different systems, databases, and files. We'll keep things simple for this course by restricting our data sources. No matter where your data originates from, the process for getting it into the data warehouse is the same.

So, how will we document our transformation and mapping rules? We'll use the easiest and most useful method available. This might be a Word document in some situations, or a spreadsheet in others. For this course we'll just use plain text documents to describe our transformations.

dimDate

Our date dimension doesn't really have a source other than a calendar. So how do we come up with the data? Programmers will commonly use one of these methods:

Create a program to populate the date table.

Create a spreadsheet with date data in it.

Copy the date dimension from an existing data warehouse.

Suppose one of the business users is handy with Excel, and has offered to create a spreadsheet for you. The spreadsheet will already contain all of the required information, including holidays and weekends. In this case, most of the work is done for us. We only need to load the data (which we'll do in the next lessons).
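If no spreadsheet were available, the first method (a program that populates the date table) might look like the following Python sketch. The column names here are assumptions made for illustration and may not match your dimDate table, and holidays are left out because they depend on a business calendar.

```python
from datetime import date, timedelta

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def date_rows(start, end):
    """Yield one dict per calendar day from start to end, inclusive."""
    day = start
    while day <= end:
        yield {
            "date_key": int(day.strftime("%Y%m%d")),   # e.g. 20080520
            "full_date": day.isoformat(),
            "year": day.year,
            "month": day.month,
            "day": day.day,
            "day_of_week": DAY_NAMES[day.weekday()],   # weekday(): Monday is 0
            "is_weekend": day.weekday() >= 5,          # Saturday is 5, Sunday is 6
        }
        day += timedelta(days=1)


rows = list(date_rows(date(2008, 5, 17), date(2008, 5, 20)))
for r in rows:
    print(r["date_key"], r["day_of_week"], r["is_weekend"])
# 20080517 Saturday True
# 20080518 Sunday True
# 20080519 Monday False
# 20080520 Tuesday False
```

Each generated dict could then be inserted into the date table by whatever loading method you choose.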

dimCustomer

Let's take a closer look at dimCustomer. Back in lesson three we discovered that a customer record is stored in several tables in the sakila database: customer, address, city and country. We're not planning on doing any transformations on the data, but suppose a business user informs us that rows in the customer table where customer_id <= 10 are actually test accounts that should be excluded from the data warehouse.

Let's write the query we need to extract the data from the customer table. Switch to the second terminal, and log into the sakila database. Run this command against the sakila database:

CODE TO TYPE:

SELECT
  c.customer_id, c.first_name, c.last_name, c.email,
  a.address, a.address2, a.district,
  ci.city,
  co.country,
  postal_code,
  a.phone, c.active, c.create_date
FROM customer c
JOIN address a on (c.address_id = a.address_id)
JOIN city ci on (a.city_id = ci.city_id)
JOIN country co on (ci.country_id = co.country_id)
WHERE customer_id > 10;

OBSERVE:

mysql> SELECT
    -> c.customer_id, c.first_name, c.last_name, c.email,
    -> a.address, a.address2, a.district,
    -> ci.city,
    -> co.country,
    -> postal_code,
    -> a.phone,c.active, c.create_date
    -> FROM customer c
    -> JOIN address a on (c.address_id = a.address_id)
    -> JOIN city ci on (a.city_id = ci.city_id)
    -> JOIN country co on (ci.country_id = co.country_id)
    -> WHERE customer_id > 10;
+---+---+---+---+---+---+---+---+---+---+---+---+---+
| customer_id | first_name | last_name | email | address | address2 | district | city | country | postal_code | phone | active | create_date |
+---+---+---+---+---+---+---+---+---+---+---+---+---+
| 218 | VERA | MCCOY | [email protected] | 1168 Najafabad Parkway | | Kabol | Kabul | Afghanistan | 40301 | 886649065861 | 1 | 2004-03-19 00:00:00 |
| 441 | MARIO | CHEATHAM | [email protected] | 1924 Shimonoseki Drive | | Batna | Batna | Algeria | 52625 | 406784385440 | 1 | 2004-10-07 00:00:00 |
| 69 | JUDY | GRAY | [email protected] | 1031 Daugavpils Parkway | | Bchar | Bchar | Algeria | 59025 | 107137400143 | 1 | 2004-02-25 00:00:00 |
| 176 | JUNE | CARROLL | [email protected] | 757 Rustenburg Avenue | | Skikda | Skikda | Algeria | 89668 | 506134035434 | 1 | 2004-08-11 00:00:00 |
| 320 | ANTHONY | SCHWAB | [email protected] | 1892 Nabereznyje Telny Lane | | Tutuila | Tafuna | American Samoa | 28396 | 478229987054 | 1 | 2004-07-20 00:00:00 |
| 528 | CLAUDE | HERZOG | [email protected] | 486 Ondo Parkway | | Benguela | Benguela | Angola | 35202 | 105882218332 | 1 | 2004-01-24 00:00:00 |
...lines omitted...
| 303 | WILLIAM | SATTERFIELD | WILLIAM.SATTERFIELD@sakilacustomer.org | 687 Alessandria Parkway | | Sanaa | Sanaa | Yemen | 57587 | 407218522294 | 1 | 2004-04-22 00:00:00 |
| 213 | GINA | WILLIAMSON | [email protected] | 1001 Miyakonojo Lane | | Taizz | Taizz | Yemen | 67924 | 584316724815 | 1 | 2004-08-02 00:00:00 |
| 553 | MAX | PITT | [email protected] | 1917 Kumbakonam Parkway | | Vojvodina | Novi Sad | Yugoslavia | 11892 | 698182547686 | 1 | 2004-02-09 00:00:00 |
| 438 | BARRY | LOVELACE | [email protected] | 1836 Korla Parkway | | Copperbelt | Kitwe | Zambia | 55405 | 689681677428 | 1 | 2004-09-24 00:00:00 |
+---+---+---+---+---+---+---+---+---+---+---+---+---+
589 rows in set (0.04 sec)

It looks like this is a good query to use to extract customer information. Save this query; we will use it in a future lesson.

dimMovie

The next table we will be populating is dimMovie. Data for this table comes from two tables: film and language. We will have to join on language twice, however, since the film table joins to language on both language_id and original_language_id.

Let's write the query needed to extract the data from the film and language tables. Run this command against the sakila database:

CODE TO TYPE:

SELECT f.film_id, f.title, f.description, f.release_year,
  l.name as language, orig_lang.name as original_language,
  f.rental_duration, f.length, f.rating, f.special_features
FROM film f
JOIN language l on (f.language_id = l.language_id)
JOIN language orig_lang on (f.original_language_id = orig_lang.language_id);

Try executing the query. If you typed everything correctly, you will see the following:

OBSERVE:

mysql> SELECT f.film_id, f.title, f.description, f.release_year,
    -> l.name as language, orig_lang.name as original_language,
    -> f.rental_duration, f.length, f.rating, f.special_features
    -> FROM film f
    -> JOIN language l on (f.language_id=l.language_id)
    -> JOIN language orig_lang on (f.original_language_id = orig_lang.language_id);
Empty set (0.01 sec)

What happened to the data? We don't have a WHERE clause, so that can't be the problem. But we do have two joins. Let's write another query to find out which join is failing us. Run this command against the sakila database:

CODE TO TYPE:

SELECT count(distinct language_id), count(distinct original_language_id) FROM film f;

Run the query, and observe the results:

OBSERVE:

mysql> SELECT count(distinct language_id), count(distinct original_language_id)
    -> FROM film f;
+---+---+
| count(distinct language_id) | count(distinct original_language_id) |
+---+---+
| 1 | 0 |
+---+---+
1 row in set (0.00 sec)

It looks like we don't have any films that have been translated. COUNT ignores NULLs, so a count of 0 means original_language_id is NULL in every row, and the inner join to language on that column matches nothing. Perhaps this is a feature in progress, or an old feature that has since been abandoned. Whatever the reason, we will need to alter our SELECT query to use a LEFT JOIN instead of a normal join. Run this command against the sakila database:

CODE TO TYPE:

SELECT f.film_id, f.title, f.description, f.release_year,
  l.name as language, orig_lang.name as original_language,
  f.rental_duration, f.length, f.rating, f.special_features
FROM film f
JOIN language l on (f.language_id = l.language_id)
LEFT JOIN language orig_lang on (f.original_language_id = orig_lang.language_id);

As long as you typed everything correctly, you will see lots of results:

OBSERVE:

mysql> SELECT f.film_id, f.title, f.description, f.release_year,
    -> l.name as language, orig_lang.name as original_language,
    -> f.rental_duration, f.length, f.rating, f.special_features
    -> FROM film f
    -> JOIN language l on (f.language_id=l.language_id)
    -> LEFT JOIN language orig_lang on (f.original_language_id = orig_lang.language_id);
+---+---+---+---+---+---+---+---+---+---+
| film_id | title | description | release_year | language | original_language | rental_duration | length | rating | special_features |
+---+---+---+---+---+---+---+---+---+---+
| 1 | ACADEMY DINOSAUR | A Epic Drama of a Feminist And a Mad Scientist who must Battle a Teacher in The Canadian Rockies | 2006 | English | NULL | 6 | 86 | PG | Deleted Scenes,Behind the Scenes |
| 2 | ACE GOLDFINGER | A Astounding Epistle of a Database Administrator And a Explorer who must Find a Car in Ancient China | 2006 | English | NULL | 3 | 48 | G | Trailers,Deleted Scenes |
| 3 | ADAPTATION HOLES | A Astounding Reflection of a Lumberjack And a Car who must Sink a Lumberjack in A Baloon Factory | 2006 | English | NULL | 7 | 50 | NC-17 | Trailers,Deleted Scenes |
...lines omitted...
| 998 | ZHIVAGO CORE | A Fateful Yarn of a Composer And a Man who must Face a Boy in The Canadian Rockies | 2006 | English | NULL | 6 | 105 | NC-17 | Deleted Scenes |
| 999 | ZOOLANDER FICTION | A Fateful Reflection of a Waitress And a Boat who must Discover a Sumo Wrestler in Ancient China | 2006 | English | NULL | 5 | 101 | R | Trailers,Deleted Scenes |
| 1000 | ZORRO ARK | A Intrepid Panorama of a Mad Scientist And a Boy who must Redeem a Boy in A Monastery | 2006 | English | NULL | 3 | 50 | NC-17 | Trailers,Commentaries,Behind the Scenes |
+---+---+---+---+---+---+---+---+---+---+
1000 rows in set (0.02 sec)

This looks great!
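The difference between the two joins is easy to reproduce on a tiny data set. The following Python sketch uses an in-memory SQLite database with cut-down, hypothetical versions of the film and language tables (two films, one language); it is a stand-in for the sakila database, not a copy of it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE language (language_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE film (
        film_id              INTEGER PRIMARY KEY,
        title                TEXT,
        language_id          INTEGER NOT NULL,
        original_language_id INTEGER      -- NULL for every film, as in the lesson's data
    );
    INSERT INTO language VALUES (1, 'English');
    INSERT INTO film VALUES (1, 'ACADEMY DINOSAUR', 1, NULL);
    INSERT INTO film VALUES (2, 'ACE GOLDFINGER', 1, NULL);
""")

# The same query twice; only the join type for original_language_id changes.
QUERY = """
    SELECT f.title, l.name, orig_lang.name
    FROM film f
    JOIN language l ON f.language_id = l.language_id
    {join} language orig_lang ON f.original_language_id = orig_lang.language_id
    ORDER BY f.film_id
"""

inner = conn.execute(QUERY.format(join="JOIN")).fetchall()
left = conn.execute(QUERY.format(join="LEFT JOIN")).fetchall()
print(inner)  # []
print(left)   # [('ACADEMY DINOSAUR', 'English', None), ('ACE GOLDFINGER', 'English', None)]
```

A NULL never equals anything in a join condition, so the inner join discards every film, while the LEFT JOIN keeps each film and fills the missing language name with NULL (None in Python).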

dimStore

The last table we'll work on populating is dimStore. Data for this table comes from many tables: store, staff, address, city, and country. Run this command against the sakila database:

CODE TO TYPE:

SELECT s.store_id, a.address, a.address2, a.district,
  c.city, co.country, a.postal_code, s.region,
  st.first_name as manager_first_name,
  st.last_name as manager_last_name
FROM store s
JOIN staff st on (s.manager_staff_id = st.staff_id)
JOIN address a on (s.address_id = a.address_id)
JOIN city c on (a.city_id = c.city_id)
JOIN country co on (c.country_id = co.country_id);

Run the query, and observe the results:

OBSERVE:

mysql> SELECT s.store_id, a.address, a.address2, a.district,
    -> c.city, co.country, a.postal_code, s.region,
    -> st.first_name as manager_first_name,
    -> st.last_name as manager_last_name
    -> FROM
    -> store s
    -> JOIN staff st on (s.manager_staff_id = st.staff_id)
    -> JOIN address a on (s.address_id = a.address_id)
    -> JOIN city c on (a.city_id = c.city_id)
    -> JOIN country co on (c.country_id = co.country_id)
    -> ;
+---+---+---+---+---+---+---+---+---+---+
| store_id | address | address2 | district | city | country | postal_code | region | manager_first_name | manager_last_name |
+---+---+---+---+---+---+---+---+---+---+
| 1 | 47 MySakila Drive | NULL | Alberta | Lethbridge | Canada | | West | Mike | Hillyer |
| 2 | 28 MySQL Boulevard | NULL | QLD | Woodridge | Australia | | East | Jon | Stephens |
+---+---+---+---+---+---+---+---+---+---+
2 rows in set (0.00 sec)

This looks great too!

Great job so far. In the next lesson we'll practice writing an ETL job. See you then!

Copyright © 1998-2014 O'Reilly Media, Inc.

This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. See http://creativecommons.org/licenses/by-sa/3.0/legalcode for more information.
