3. DISEÑO Y CONSTRUCCIÓN DE OVA
3.4 HERRAMIENTAS PARA EL DESARROLLO DE OVA
Using Hadoop Distributed File System (HDFS) Tables 94
date date, amount float4, category text, desc1 text ) LOCATION ('gpfdists://etlhost-1:8081/*.txt',
'gpfdists://etlhost-2:8082/*.txt') FORMAT 'TEXT' ( DELIMITER '|' NULL ' ') ; Example 4—Single gpfdist instance with error logging
Uses the gpfdist protocol to create a readable external table, ext_expenses, from all files with the txt extension. The column delimiter is a pipe ( | ) and NULL (‘ ‘) is a space.
Access to the external table is single row error isolation mode. Input data formatting errors are written to the error table, err_customer, with a description of the error. Query err_customer to see the errors, then fix the issues and reload the rejected data.
If the error count on a segment is greater than five (the SEGMENT REJECT LIMIT
value), the entire external table operation failes and no rows are processed.
=# CREATE EXTERNAL TABLE ext_expenses ( name text,
date date, amount float4, category text, desc1 text ) LOCATION ('gpfdist://etlhost-1:8081/*.txt',
'gpfdist://etlhost-2:8082/*.txt') FORMAT 'TEXT' ( DELIMITER '|' NULL ' ')
LOG ERRORS INTO err_customer SEGMENT REJECT LIMIT 5;
To create the readable ext_expenses table from CSV-formatted text files:
=# CREATE EXTERNAL TABLE ext_expenses ( name text,
date date, amount float4, category text, desc1 text ) LOCATION ('gpfdist://etlhost-1:8081/*.txt',
'gpfdist://etlhost-2:8082/*.txt') FORMAT 'CSV' ( DELIMITER ',' )
LOG ERRORS INTO err_customer SEGMENT REJECT LIMIT 5; Example 5—TEXT Format on a Hadoop Distributed File Server
Creates a readable external table, ext_expenses, using the gphdfs protocol. The column delimiter is a pipe ( | ).
=# CREATE EXTERNAL TABLE ext_expenses ( name text,
date date, amount float4, category text, desc1 text ) LOCATION ('gphdfs://hdfshost-1:8081/data/filename.txt') FORMAT 'TEXT' (DELIMITER '|');
Note: gphdfs requires only one data path.
For examples of reading and writing custom formatted data on a Hadoop Distributed File System, see “Reading and Writing Custom-Formatted HDFS Data” on page 84.
Example 6—Multiple files in CSV format with header rows
Creates a readable external table, ext_expenses, using the file protocol. The files are
CSV format and have a header row.
date date, amount float4, category text, desc1 text ) LOCATION ('file://filehost:5432/data/international/*', 'file://filehost:5432/data/regional/*'
'file://filehost:5432/data/supplement/*.csv') FORMAT 'CSV' (HEADER);
Example 7—Readable Web External Table with Script
Creates a readable web external table that executes a script once per segment host:
=# CREATE EXTERNAL WEB TABLE log_output (linenum int, message text)
EXECUTE '/var/load_scripts/get_log_data.sh' ON HOST FORMAT 'TEXT' (DELIMITER '|');
Example 8—Writable External Table with gpfdist
Creates a writable external table, sales_out, that uses gpfdist to write output data to the file sales.out. The column delimiter is a pipe ( | ) and NULL (‘ ‘) is a space. The file will be created in the directory specified when you started the gpfdist file server.
=# CREATE WRITABLE EXTERNAL TABLE sales_out (LIKE sales) LOCATION ('gpfdist://etl1:8081/sales.out')
FORMAT 'TEXT' ( DELIMITER '|' NULL ' ') DISTRIBUTED BY (txn_id);
Example 9—Writable External Web Table with Script
Creates a writable external web table, campaign_out, that pipes output data received by the segments to an executable script, to_adreport_etl.sh:
=# CREATE WRITABLE EXTERNAL WEB TABLE campaign_out (LIKE campaign)
EXECUTE '/var/unload_scripts/to_adreport_etl.sh' FORMAT 'TEXT' (DELIMITER '|');
Example 10—Readable and Writable External Tables with XML Transformations
Greenplum Database can read and write XML data to and from external tables with gpfdist. For information about setting up an XML transform, see “Transforming XML Data” on page 104.
Handling Load Errors
Readable external tables are most commonly used to select data to load into regular database tables. You use the CREATE TABLE AS SELECT or INSERT INTO commands to query the external table data. By default, if the data contains an error, the entire command fails and the data is not loaded into the target database table.
The SEGMENT REJECT LIMIT clause allows you to isolate format errors in external table data and to continue loading correctly formatted rows. Use SEGMENT REJECT LIMIT to set an error threshold, specifying the reject limit count as number of ROWS
Using Hadoop Distributed File System (HDFS) Tables 96 The entire external table operation is aborted, and no rows are processed, if the number of error rows reaches the SEGMENT REJECT LIMIT. The limit of error rows is per-segment, not per entire operation. The operation processes all good rows, and it discards or logs any erroneous rows into an error table (if you specified an error table), if the number of error rows does not reach the SEGMENT REJECT LIMIT.
The LOG ERRORS INTO clause allows you to keep error rows for further examination. Use LOG ERRORS INTO to declare an error table in which to write error rows. When you set SEGMENT REJECT LIMIT, Greenplum scans the external data in single row error isolation mode. Single row error isolation mode applies to external data rows with format errors such as extra or missing attributes, attributes of a wrong data type, or invalid client encoding sequences. Greenplum does not check constraint errors, but you can filter constraint errors by limiting the SELECT from an external table at runtime. For example, to eliminate duplicate key errors:
=# INSERT INTO table_with_pkeys
SELECT DISTINCT * FROM external_table;
Define an External Table with Single Row Error Isolation
The following example creates an external table, ext_expenses, sets an error threshold of 10 errors, and writes error rows to the table err_expenses.
=# CREATE EXTERNAL TABLE ext_expenses ( name text,
date date, amount float4, category text, desc1 text ) LOCATION ('gpfdist://etlhost-1:8081/*',
'gpfdist://etlhost-2:8082/*') FORMAT 'TEXT' (DELIMITER '|')
LOG ERRORS INTO err_expenses SEGMENT REJECT LIMIT 10 ROWS;
Create an Error Table and Declare a Reject Limit
The following SQL fragment creates an error table, err_expenses, and declares a reject limit of 10 rows.
LOG ERRORS INTO err_expenses SEGMENT REJECT LIMIT 10 ROWS Viewing Bad Rows in the Error Table
If you use single row error isolation (see “Define an External Table with Single Row Error Isolation” on page 96 or “Running COPY in Single Row Error Isolation Mode” on page 99), any rows with formatting errors are logged into an error table. The error table has the following columns:
Table 7.3 Error Table Format
column type description
cmdtime timestampz Timestamp when the error occurred. relname text The name of the external table or the
target table of a COPY command. filename text The name of the load file that
You can use SQL commands to query the error table and view the rows that did not load. For example:
=# SELECT * from err_expenses;
Identifying Invalid CSV Files in Error Table Data
If a CSV file contains invalid formatting, the rawdata field in the error table can contain several combined rows. For example, if a closing quote for a specific field is missing, all the following newlines are treated as embedded newlines. When this happens, Greenplum stops parsing a row when it reaches 64K, puts that 64K of data into the error table as a single row, resets the quote flag, and continues. If this happens three times during load processing, the load file is considered invalid and the entire load fails with the message “rejected N or more rows”. See “Escaping in CSV
Formatted Files” on page 117 for more information on the correct use of quotes in CSV files.
Moving Data between Tables
You can use CREATE TABLE AS or INSERT...SELECT to load external and web external table data into another (non-external) database table, and the data will be loaded in parallel according to the external or web external table definition. If an external table file or web external table data source has an error, one of the following will happen, depending on the isolation mode used:
linenum int If COPY was used, the line number in the load file where the error occurred. For external tables using file:// protocol or gpfdist:// protocol and CSV format, the file name and line number is logged.
bytenum int For external tables with the gpfdist:// protocol and data in TEXT format: the byte offset in the load file where the error occurred. gpfdist parses TEXT files in blocks, so logging a line number is not possible.
CSV files are parsed a line at a time so line number tracking is possible for CSV files.
errmsg text The error message text.
rawdata text The raw data of the rejected row. rawbytes bytea In cases where there is a database
encoding error (the client encoding used cannot be converted to a server-side encoding), it is not possible to log the encoding error as
rawdata. Instead the raw bytes are stored and you will see the octal code for any non seven bit ASCII characters.
Table 7.3 Error Table Format
Using Hadoop Distributed File System (HDFS) Tables 98 • Tables without error isolation mode: any operation that reads from that table
fails. Loading from external and web external tables without error isolation mode is an all or nothing operation.
• Tables with error isolation mode: the entire file will be loaded, except for the problematic rows (subject to the configured REJECT_LIMIT)
Loading Data
The following methods load data from readable external tables. • Use the gpload utility
• Use the gphdfs protocol
• Load with copy
Loading Data with gpload
The Greenplum gpload utility loads data using readable external tables and the
Greenplum parallel file server (gpfdist or gpfdists). It handles parallel file-based external table setup and allows users to configure their data format, external table definition, and gpfdist or gpfdists setup in a single configuration file.
To use gpload
1. Ensure that your environment is set up to run gpload. Some dependent files from