Pig can be run in batch or interactive mode. To run it in batch mode, save your Pig commands to a file and pass that file as an argument to the Pig executable. To run commands interactively, you can run the Pig executable from the command prompt.
Pig uses a language, Pig Latin, to define the data transformations that will be done. Pig Latin statements are operators that take a relation and produce another relation. A relation, in Pig Latin terms, is a collection of tuples, and a
172 Part IV ■ Working with Your Big Data
tuple is a collection of fields. One way to envision this is that a relation is like a table in a database. The table has a collection of rows, which are analogous to the tuples. The columns in each row are analogous to the fields. The primary difference between a relation and a database table is that relations do not require that all the tuples have the same number or type of fields.
An example Pig Latin statement follows. This statement loads information from Hadoop into a relation. Statements must be terminated with semicolons, and extra whitespace is ignored:
source = LOAD '/MsBigData/Customer/' USING PigStorage() AS (name, city, state,
postalcode, totalpurchaseamount);
In this case, the result of the LOAD operator is a relation that is assigned to the alias source. The alias allows the relation to be referred to in later statements. Also, while this example declares the fields that will be retrieved, it is not required to define them. In fact, you may have noticed that there is no type definition. Pig can reference fields by ordinal position or by name, if provided, and data values will be implicitly converted as needed.
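As a hedged illustration of the optional schema, the following sketch repeats the LOAD with declared types and then references two fields by position. The types chosen here (chararray for text, double for the purchase amount) and the aliases typed and name_and_amount are assumptions made for illustration; the example above leaves the fields untyped.

```pig
-- Sketch only: same path and field names as the LOAD above, with assumed types.
typed = LOAD '/MsBigData/Customer/' USING PigStorage()
        AS (name:chararray, city:chararray, state:chararray,
            postalcode:chararray, totalpurchaseamount:double);

-- Positions are zero-based: $0 is name and $4 is totalpurchaseamount.
name_and_amount = FOREACH typed GENERATE $0, $4;
```

Declaring types up front avoids relying on implicit conversion later, at the cost of a slightly longer LOAD statement.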
The LOAD operator is using the PigStorage() function. This is the default storage function; it reads and writes delimited text files in Hadoop, using a tab as the field delimiter unless told otherwise. Other built-in storage functions handle other formats, and additional storage functions can be developed to allow Pig to communicate with other data stores.
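PigStorage() also accepts a field delimiter as an argument. A minimal sketch, assuming the customer files were comma-delimited rather than tab-delimited; the alias source_csv is hypothetical:

```pig
-- Assumption: the input files use commas between fields.
-- With no argument, PigStorage() expects tab-delimited text.
source_csv = LOAD '/MsBigData/Customer/' USING PigStorage(',')
             AS (name, city, state, postalcode, totalpurchaseamount);
```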
To reduce the number of tuples (rows) in the relation, you can apply a filter to it using the FILTER operator. In this case, FILTER is applied to the source alias created by the LOAD statement. Note that Pig Latin tests for equality with ==, not a single equal sign:
filtered = FILTER source BY state == 'FL';
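Filter conditions can be combined with AND, OR, and NOT. The following sketch assumes you also want to keep only larger purchases; the 5000 threshold and the alias filtered_large are illustrative, and the explicit cast is there because the fields were loaded without declared types:

```pig
-- Keep Florida customers whose purchase total exceeds an assumed threshold.
filtered_large = FILTER source BY (state == 'FL')
                 AND ((double)totalpurchaseamount > 5000.0);
```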
The relation produced by the FILTER statement is assigned to the alias filtered. You can also group the data using the GROUP operator. The following statement results in a new relation that contains a tuple for each distinct city, with one field (named group) containing the city value and another field (named filtered, after the input relation) containing a collection of tuples for the rows that are part of the group:
grouped = GROUP filtered BY city;
You can look at this as producing a new table, with any rows belonging to the same grouped value being associated with that row:
grouped      | filtered
Jacksonville | (John Smith, FL, 32079, 10000), (Sandra James, FL, 32079, 8000)
Tampa        | (Robert Betts, FL, 32045, 6000), (Tim Kerr, FL, 32045, 1000)
Miami        | (Gina Jones, FL, 32013, 7000)
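Because the second field of each grouped tuple is itself a collection (a bag, in Pig terms), it can be passed to aggregate functions. A minimal sketch that counts the customers in each city; the alias counts and the field names city and customers are assumptions for illustration:

```pig
-- COUNT takes the bag of tuples that GROUP collected for each city.
counts = FOREACH grouped GENERATE group AS city, COUNT(filtered) AS customers;
```

For the sample rows shown above, this would produce one tuple per city, with counts of 2 for Jacksonville, 2 for Tampa, and 1 for Miami.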
When you need to operate on columns, you can use the FOREACH operator. It is used when working with data like that shown here, because it runs the associated function for each value in the specified column. If you want to produce an average totalpurchaseamount for each city, you can use the following statement:
averaged = FOREACH grouped GENERATE group, AVG(filtered.totalpurchaseamount);
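FOREACH can generate several expressions at once, and each can be given a name with AS. A sketch, assuming you want a few summary values per city; the alias summary and the field names are hypothetical:

```pig
-- Several aggregates over the same bag, each named with AS.
summary = FOREACH grouped GENERATE
              group AS city,
              AVG(filtered.totalpurchaseamount) AS avgpurchase,
              SUM(filtered.totalpurchaseamount) AS totalpurchase,
              COUNT(filtered) AS customers;
```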
To order the results, you can use the ORDER operator. In this case, $1 indicates that the statement is using the ordinal column position rather than addressing the column by name. Positions are zero-based, so in the averaged relation $0 is the group value and $1 is the average:
ordered = ORDER averaged BY $1 DESC;
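If the generated fields are named, ORDER can refer to them by name instead of by position, which tends to survive later changes to the field list. A sketch; named_avg, ordered_by_name, and top3 are hypothetical aliases:

```pig
-- Name the generated fields so they can be referenced later.
named_avg = FOREACH grouped GENERATE
                group AS city,
                AVG(filtered.totalpurchaseamount) AS avgpurchase;

-- Sort by the named field, highest average first.
ordered_by_name = ORDER named_avg BY avgpurchase DESC;

-- LIMIT keeps only the first rows, here the three highest averages.
top3 = LIMIT ordered_by_name 3;
```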
To store the results, you can call the STORE function. This lets you write the values back to Hadoop using the PigStorage() functionality:
STORE ordered INTO 'c:\SampleData\PigOutput.txt' USING PigStorage();
If you take this entire set of statements together, you can see that Pig Latin is relatively easy to read and understand. These statements could be saved to a file as a Pig script and then executed as a batch file:
source = LOAD '/MsBigData/Customer/' USING PigStorage() AS (name, city, state,
postalcode, totalpurchaseamount);
filtered = FILTER source BY state == 'FL';
grouped = GROUP filtered BY city;
averaged = FOREACH grouped GENERATE group, AVG(filtered.totalpurchaseamount);
ordered = ORDER averaged BY $1 DESC;
STORE ordered INTO 'c:\SampleData\PigOutput.txt' USING PigStorage();
NOTE Pig scripts are generally saved with a .PIG extension. This is a convention, but it is not required. However, it does make it easier for people to find and use your scripts.
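A saved script can also be executed from inside the interactive Grunt shell. In this sketch, CityAverages.pig is a hypothetical file name for the script above:

```pig
-- exec runs the script in batch mode; aliases defined in the script
-- are not visible in the shell afterwards.
exec CityAverages.pig

-- run executes the script in the current shell context, so aliases
-- such as ordered remain available for further statements.
run CityAverages.pig
```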
Another key aspect of Pig Latin is that the statements are declarative rather than imperative. That is, they tell Pig what you intend to do, but the Pig engine can determine the best way to accomplish the operation. It may rearrange or combine certain operations to produce a more efficient plan for accomplishing the work. This is similar to the way SQL Server's query optimizer may rewrite your SQL queries to get the results in the fastest way possible.
Several commands facilitate debugging Pig Latin. A useful one is DUMP, which outputs the contents of the specified relation to the screen. If the relation contains a large amount of data, though, this can be time-prohibitive to execute.
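A minimal sketch of DUMP, assuming the ordered relation from the script above. Applying LIMIT first is one way to keep the output manageable; the alias preview and the row count of 10 are illustrative:

```pig
-- Take a small slice of the relation before printing it.
preview = LIMIT ordered 10;
DUMP preview;
```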
DESCRIBE outputs the schema of a relation to the console. This can help you understand what the relation looks like after various transformations have been applied:
DESCRIBE grouped;
EXPLAIN shows the planned execution model for producing the specified relation. This outputs the logical, physical, and MapReduce plans to the console:
EXPLAIN filtered;
ILLUSTRATE shows the data steps that produce a given relation. This is different from the plan, in that it actually displays the data in each relation at each step:
ILLUSTRATE grouped;
A large number of other functions are available for Pig. Unfortunately, space constraints do not allow a full listing here. You can find complete documentation on Pig at http://pig.apache.org.