• No se han encontrado resultados

HERRAMIENTAS PARA EL DESARROLLO DE OVA

3. DISEÑO Y CONSTRUCCIÓN DE OVA

3.4 HERRAMIENTAS PARA EL DESARROLLO DE OVA

Using Hadoop Distributed File System (HDFS) Tables 94

date date, amount float4, category text, desc1 text ) LOCATION ('gpfdists://etlhost-1:8081/*.txt',

'gpfdists://etlhost-2:8082/*.txt') FORMAT 'TEXT' ( DELIMITER '|' NULL ' ') ; Example 4—Single gpfdist instance with error logging

Uses the gpfdist protocol to create a readable external table, ext_expenses, from all files with the txt extension. The column delimiter is a pipe ( | ) and NULL (‘ ‘) is a space.

Access to the external table is single row error isolation mode. Input data formatting errors are written to the error table, err_customer, with a description of the error. Query err_customer to see the errors, then fix the issues and reload the rejected data.

If the error count on a segment is greater than five (the SEGMENT REJECT LIMIT

value), the entire external table operation failes and no rows are processed.

=# CREATE EXTERNAL TABLE ext_expenses ( name text,

date date, amount float4, category text, desc1 text ) LOCATION ('gpfdist://etlhost-1:8081/*.txt',

'gpfdist://etlhost-2:8082/*.txt') FORMAT 'TEXT' ( DELIMITER '|' NULL ' ')

LOG ERRORS INTO err_customer SEGMENT REJECT LIMIT 5;

To create the readable ext_expenses table from CSV-formatted text files:

=# CREATE EXTERNAL TABLE ext_expenses ( name text,

date date, amount float4, category text, desc1 text ) LOCATION ('gpfdist://etlhost-1:8081/*.txt',

'gpfdist://etlhost-2:8082/*.txt') FORMAT 'CSV' ( DELIMITER ',' )

LOG ERRORS INTO err_customer SEGMENT REJECT LIMIT 5; Example 5—TEXT Format on a Hadoop Distributed File Server

Creates a readable external table, ext_expenses, using the gphdfs protocol. The column delimiter is a pipe ( | ).

=# CREATE EXTERNAL TABLE ext_expenses ( name text,

date date, amount float4, category text, desc1 text ) LOCATION ('gphdfs://hdfshost-1:8081/data/filename.txt') FORMAT 'TEXT' (DELIMITER '|');

Note: gphdfs requires only one data path.

For examples of reading and writing custom formatted data on a Hadoop Distributed File System, see “Reading and Writing Custom-Formatted HDFS Data” on page 84.

Example 6—Multiple files in CSV format with header rows

Creates a readable external table, ext_expenses, using the file protocol. The files are

CSV format and have a header row.

date date, amount float4, category text, desc1 text ) LOCATION ('file://filehost:5432/data/international/*', 'file://filehost:5432/data/regional/*'

'file://filehost:5432/data/supplement/*.csv') FORMAT 'CSV' (HEADER);

Example 7—Readable Web External Table with Script

Creates a readable web external table that executes a script once per segment host:

=# CREATE EXTERNAL WEB TABLE log_output (linenum int, message text)

EXECUTE '/var/load_scripts/get_log_data.sh' ON HOST FORMAT 'TEXT' (DELIMITER '|');

Example 8—Writable External Table with gpfdist

Creates a writable external table, sales_out, that uses gpfdist to write output data to the file sales.out. The column delimiter is a pipe ( | ) and NULL (‘ ‘) is a space. The file will be created in the directory specified when you started the gpfdist file server.

=# CREATE WRITABLE EXTERNAL TABLE sales_out (LIKE sales) LOCATION ('gpfdist://etl1:8081/sales.out')

FORMAT 'TEXT' ( DELIMITER '|' NULL ' ') DISTRIBUTED BY (txn_id);

Example 9—Writable External Web Table with Script

Creates a writable external web table, campaign_out, that pipes output data received by the segments to an executable script, to_adreport_etl.sh:

=# CREATE WRITABLE EXTERNAL WEB TABLE campaign_out (LIKE campaign)

EXECUTE '/var/unload_scripts/to_adreport_etl.sh' FORMAT 'TEXT' (DELIMITER '|');

Example 10—Readable and Writable External Tables with XML Transformations

Greenplum Database can read and write XML data to and from external tables with gpfdist. For information about setting up an XML transform, see “Transforming XML Data” on page 104.

Handling Load Errors

Readable external tables are most commonly used to select data to load into regular database tables. You use the CREATE TABLE AS SELECT or INSERT INTO commands to query the external table data. By default, if the data contains an error, the entire command fails and the data is not loaded into the target database table.

The SEGMENT REJECT LIMIT clause allows you to isolate format errors in external table data and to continue loading correctly formatted rows. Use SEGMENT REJECT LIMIT to set an error threshold, specifying the reject limit count as number of ROWS

Using Hadoop Distributed File System (HDFS) Tables 96 The entire external table operation is aborted, and no rows are processed, if the number of error rows reaches the SEGMENT REJECT LIMIT. The limit of error rows is per-segment, not per entire operation. The operation processes all good rows, and it discards or logs any erroneous rows into an error table (if you specified an error table), if the number of error rows does not reach the SEGMENT REJECT LIMIT.

The LOG ERRORS INTO clause allows you to keep error rows for further examination. Use LOG ERRORS INTO to declare an error table in which to write error rows. When you set SEGMENT REJECT LIMIT, Greenplum scans the external data in single row error isolation mode. Single row error isolation mode applies to external data rows with format errors such as extra or missing attributes, attributes of a wrong data type, or invalid client encoding sequences. Greenplum does not check constraint errors, but you can filter constraint errors by limiting the SELECT from an external table at runtime. For example, to eliminate duplicate key errors:

=# INSERT INTO table_with_pkeys

SELECT DISTINCT * FROM external_table;

Define an External Table with Single Row Error Isolation

The following example creates an external table, ext_expenses, sets an error threshold of 10 errors, and writes error rows to the table err_expenses.

=# CREATE EXTERNAL TABLE ext_expenses ( name text,

date date, amount float4, category text, desc1 text ) LOCATION ('gpfdist://etlhost-1:8081/*',

'gpfdist://etlhost-2:8082/*') FORMAT 'TEXT' (DELIMITER '|')

LOG ERRORS INTO err_expenses SEGMENT REJECT LIMIT 10 ROWS;

Create an Error Table and Declare a Reject Limit

The following SQL fragment creates an error table, err_expenses, and declares a reject limit of 10 rows.

LOG ERRORS INTO err_expenses SEGMENT REJECT LIMIT 10 ROWS Viewing Bad Rows in the Error Table

If you use single row error isolation (see “Define an External Table with Single Row Error Isolation” on page 96 or “Running COPY in Single Row Error Isolation Mode” on page 99), any rows with formatting errors are logged into an error table. The error table has the following columns:

Table 7.3 Error Table Format

column type description

cmdtime timestampz Timestamp when the error occurred. relname text The name of the external table or the

target table of a COPY command. filename text The name of the load file that

You can use SQL commands to query the error table and view the rows that did not load. For example:

=# SELECT * from err_expenses;

Identifying Invalid CSV Files in Error Table Data

If a CSV file contains invalid formatting, the rawdata field in the error table can contain several combined rows. For example, if a closing quote for a specific field is missing, all the following newlines are treated as embedded newlines. When this happens, Greenplum stops parsing a row when it reaches 64K, puts that 64K of data into the error table as a single row, resets the quote flag, and continues. If this happens three times during load processing, the load file is considered invalid and the entire load fails with the message “rejected N or more rows”. See “Escaping in CSV

Formatted Files” on page 117 for more information on the correct use of quotes in CSV files.

Moving Data between Tables

You can use CREATE TABLE AS or INSERT...SELECT to load external and web external table data into another (non-external) database table, and the data will be loaded in parallel according to the external or web external table definition. If an external table file or web external table data source has an error, one of the following will happen, depending on the isolation mode used:

linenum int If COPY was used, the line number in the load file where the error occurred. For external tables using file:// protocol or gpfdist:// protocol and CSV format, the file name and line number is logged.

bytenum int For external tables with the gpfdist:// protocol and data in TEXT format: the byte offset in the load file where the error occurred. gpfdist parses TEXT files in blocks, so logging a line number is not possible.

CSV files are parsed a line at a time so line number tracking is possible for CSV files.

errmsg text The error message text.

rawdata text The raw data of the rejected row. rawbytes bytea In cases where there is a database

encoding error (the client encoding used cannot be converted to a server-side encoding), it is not possible to log the encoding error as

rawdata. Instead the raw bytes are stored and you will see the octal code for any non seven bit ASCII characters.

Table 7.3 Error Table Format

Using Hadoop Distributed File System (HDFS) Tables 98 Tables without error isolation mode: any operation that reads from that table

fails. Loading from external and web external tables without error isolation mode is an all or nothing operation.

Tables with error isolation mode: the entire file will be loaded, except for the problematic rows (subject to the configured REJECT_LIMIT)

Loading Data

The following methods load data from readable external tables. Use the gpload utility

Use the gphdfs protocol

Load with copy

Loading Data with gpload

The Greenplum gpload utility loads data using readable external tables and the

Greenplum parallel file server (gpfdist or gpfdists). It handles parallel file-based external table setup and allows users to configure their data format, external table definition, and gpfdist or gpfdists setup in a single configuration file.

To use gpload

1. Ensure that your environment is set up to run gpload. Some dependent files from

Documento similar