Pig has a very lax attitude when it comes to schemas. This is a consequence of Pig’s philosophy of eating anything. If a schema for the data is available, Pig will make use of it, both for up-front error checking and for optimization. But if no schema is available, Pig will still process the data, making the best guesses it can based on how the script treats the data. First, we will look at ways that you can communicate the schema to Pig; then, we will examine how Pig handles the case where you do not provide it with the schema.
The easiest way to communicate the schema of your data to Pig is to explicitly tell Pig what it is when you load the data:
dividends = load 'NYSE_dividends' as
(exchange:chararray, symbol:chararray, date:chararray, dividend:float);
Pig now expects your data to have four fields. If it has more, it will truncate the extra ones. If it has fewer, it will pad the end of the record with nulls.
It is also possible to specify the schema without giving explicit data types. In this case, the data type is assumed to be bytearray:
dividends = load 'NYSE_dividends' as (exchange, symbol, date, dividend);
You would expect that this also would force your data into a tuple with four fields, regardless of the number of actual input fields, just like when you specify both names and types for the fields. And in Pig 0.9 this is what happens. But in 0.8 and earlier versions it does not; no truncation or null padding is done in the case where you do not provide explicit types for the fields.
Also, when you declare a schema, you do not have to declare the schema of complex types, but you can if you want to. For example, if your data has a tuple in it, you can declare that field to be a tuple without specifying the fields it contains. You can also declare that field to be a tuple that has three columns, all of which are integers. Table 4-1 gives the details of how to specify each data type inside a schema declaration.
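As a minimal sketch of those two options, the following script declares one tuple field without its contents and another as a tuple of three integers. The file name complex_data and all of the field names are hypothetical, chosen only for illustration:

```pig
-- Hypothetical input file and field names, for illustration only.
-- loose is declared as a tuple with no inner schema;
-- strict is declared as a tuple with three integer columns.
points = load 'complex_data' as (id:chararray, loose:tuple(),
    strict:tuple(x:int, y:int, z:int));
-- Inner fields that were declared can be referenced by name;
-- undeclared ones can still be referenced by position.
firsts = foreach points generate strict.x, loose.$0;
```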
Table 4-1. Schema syntax
The runtime declaration of schemas is very nice. It makes it easy for users to operate on data without having to first load it into a metadata system. It also means that if you are interested in only the first few fields, you only have to declare those fields.
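For instance, a script that needs only the exchange and the ticker symbol might declare just the first two fields of NYSE_daily. This is a sketch that assumes those are the first two fields of the file, as in the schema used for NYSE_daily elsewhere in this chapter:

```pig
-- Only the first two fields are declared; with typed fields,
-- Pig truncates the remaining fields of each record.
daily = load 'NYSE_daily' as (exchange:chararray, symbol:chararray);
symbols = foreach daily generate symbol;
```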
But for production systems that run over the same data every hour or every day, it has a couple of significant drawbacks. One, whenever your data changes, you have to change your Pig Latin. Two, although this works fine on data with 5 columns, it is painful when your data has 100 columns. To address these issues, there is another way to load schemas in Pig.
If the load function you are using already knows
the schema of the data, the function can communicate that to Pig. (Load
functions are how Pig reads data; see Load for
details.) Load functions might already know the schema because it is
stored in a metadata repository such as HCatalog, or it might be stored in
the data itself (if, for example, the data is stored in JSON format). In this case, you do not have to declare the
schema as part of the load statement. And you can still refer
to fields by name because Pig will fetch the schema from the load function
before doing error checking on your script:
mdata = load 'mydata' using HCatLoader();
cleansed = filter mdata by name is not null;
...
But what happens when you cross the streams? What if you specify a schema and the loader returns one? If they are identical, all is well. If they are not identical, Pig will determine whether it can adapt the one returned by the loader to match the one you gave. For example, if you specified a field as a long and the loader said it was an int, Pig can and will do that cast. However, if it cannot determine a way to make the loader’s schema fit the one you gave, it will give an error. See Casts for a list of casts Pig can and cannot insert to make the schemas work together.
Now let’s look at the case where neither you nor
the load function tells Pig what the data’s schema is. In addition to being
referenced by name, fields can be referenced by position, starting from
zero. The syntax is a dollar sign, then the position: $0 refers to the first field. So it is easy to
tell Pig which field you want to work with. But how does Pig know the data
type? It does not, so it starts by assuming everything is a bytearray.
Then it looks at how you use those fields in your script, drawing
conclusions about what you think those fields are and how you want to use
them. Consider the following:
--no_schema.pig
daily = load 'NYSE_daily';
calcs = foreach daily generate $7 / 1000, $3 * 100.0, SUBSTRING($0, 0, 1), $6 - $3;
In the expression $7 / 1000, 1000 is an integer, so it is a safe guess that the eighth field of NYSE_daily is an integer or something that can be cast to an integer. In the same way, $3 * 100.0 indicates $3 is a double, and the use of $0 in a function that takes a chararray as an argument indicates the type of $0. But what about the last expression, $6 - $3? The - operator is used only with numeric types in Pig, so Pig can safely guess that $3 and $6 are numeric. But should it treat them as integers or floating-point numbers? Here Pig plays it safe and guesses that they are floating-point numbers, casting them to doubles. This is the safer bet because if they actually are integers, those can be represented as floating-point numbers, but the reverse is not true. However, because floating-point arithmetic is much slower and subject to loss of precision, if these values really are integers, you should cast them so that Pig uses integer types in this case.
There are also cases where Pig cannot make any intelligent guess:
--no_schema_filter
daily = load 'NYSE_daily';
fltrd = filter daily by $6 > $3;
> is a valid operator on numeric,
chararray, and bytearray types in Pig Latin. So, Pig has no way to make a
guess. In this case, it treats these fields as if they were bytearrays,
which means it will do a byte-to-byte comparison of the data in these
fields.
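If a numeric comparison is what you want, one way to get it is to cast both fields yourself, putting the type name in parentheses before each value. The sketch below assumes that float is the right type for these two fields, which matches how the seventh and fourth fields of NYSE_daily (close and open) are declared elsewhere in this chapter:

```pig
-- Assumption: $6 and $3 hold floating-point prices.
-- The casts make Pig compare them as numbers rather than byte by byte.
daily = load 'NYSE_daily';
fltrd = filter daily by (float)$6 > (float)$3;
```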
Pig also has to handle the case where it guesses wrong and must adapt on the fly. Consider the following:
--unintended_walks.pig
player = load 'baseball' as (name:chararray, team:chararray,
pos:bag{t:(p:chararray)}, bat:map[]);
unintended = foreach player generate bat#'base_on_balls' - bat#'ibbs';
Because the values in maps can be of any type, Pig
has no idea what type bat#'base_on_balls' and bat#'ibbs' are. By the rules laid out
previously, Pig will assume they are doubles. But let’s say they actually
turn out to be represented internally as integers.[5] In that case, Pig will need to adapt at runtime and convert
what it thought was a cast from bytearray to double into a cast from int
to double. Note that it will still produce a double output and not an int
output. This might seem nonintuitive; see How Strongly Typed Is Pig?
for details on why this is so. Note that in Pig 0.8 and earlier, much of this runtime adaptation code was
shaky and often failed. In 0.9, much of this has been fixed. But if you
are using an older version of Pig, you might need to cast the data
explicitly to get the right results.
Finally, Pig’s knowledge of the schema can change
at different points in the Pig Latin script. In all of the previous
examples where we loaded data without a schema and then passed it to a
foreach statement, the data started out without a schema. But
after the foreach, the schema is known. Similarly, Pig can
start out knowing the schema, but if the data is mingled with other data
without a schema, the schema can be lost. That is, lack of schema is
contagious:
--no_schema_join.pig
divs = load 'NYSE_dividends' as (exchange, stock_symbol, date, dividends);
daily = load 'NYSE_daily';
jnd = join divs by stock_symbol, daily by $1;
In this example, because Pig does not know the
schema of daily, it cannot know the
schema of the join of divs and daily.
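One way to keep the schema from being lost is to declare field names for both inputs. The following sketch borrows the field names used for NYSE_daily elsewhere in this chapter; the describe statement prints whatever schema Pig has worked out for jnd:

```pig
divs = load 'NYSE_dividends' as (exchange, stock_symbol, date, dividends);
-- Field names assumed to match the NYSE_daily layout used in this chapter.
daily = load 'NYSE_daily' as (exchange, symbol, date, open, high, low,
    close, volume, adj_close);
jnd = join divs by stock_symbol, daily by symbol;
-- Both inputs now carry a schema, so the join result has one as well.
describe jnd;
```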
The previous sections have referenced casts in Pig without bothering to define how casts work. The syntax for casts in Pig is the same as in Java—the type name in parentheses before the value:
--unintended_walks_cast.pig
player = load 'baseball' as (name:chararray, team:chararray,
pos:bag{t:(p:chararray)}, bat:map[]);
unintended = foreach player generate (int)bat#'base_on_balls' - (int)bat#'ibbs';
The syntax for specifying types in casts is exactly the same as specifying them in schemas, as shown previously in Table 4-1.
Not all conceivable casts are allowed in Pig. Table 4-2 describes which casts are allowed between scalar types. Casts to bytearrays are never allowed because Pig does not know how to represent the various data types in binary format. Casts from bytearrays to any type are allowed. Casts to and from complex types currently are not allowed, except from bytearray, although conceptually in some cases they could be.
Table 4-2. Supported casts
| | To int | To long | To float | To double | To chararray |
|---|---|---|---|---|---|
| From int | | Yes. | Yes. | Yes. | Yes. |
| From long | Yes. Any values greater than 2^31 or less than –2^31 will be truncated. | | Yes. | Yes. | Yes. |
| From float | Yes. Values will be truncated to int values. | Yes. Values will be truncated to long values. | | Yes. | Yes. |
| From double | Yes. Values will be truncated to int values. | Yes. Values will be truncated to long values. | Yes. Values with precision beyond what float can represent will be truncated. | | Yes. |
| From chararray | Yes. Chararrays with nonnumeric characters result in null. | Yes. Chararrays with nonnumeric characters result in null. | Yes. Chararrays with nonnumeric characters result in null. | Yes. Chararrays with nonnumeric characters result in null. | |
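To make a few rows of the table concrete, here is a short sketch that applies several of these casts to the NYSE_daily fields. The choice of fields is only illustrative:

```pig
daily = load 'NYSE_daily' as (exchange:chararray, symbol:chararray,
    date:chararray, open:float, high:float, low:float, close:float,
    volume:int, adj_close:float);
-- In order: int to long; float to int (truncated to an int value);
-- int to chararray; chararray to int (nonnumeric text results in null).
casts = foreach daily generate (long)volume, (int)close,
    (chararray)volume, (int)symbol;
```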
One type of casting that requires special
treatment is casting from bytearray to other types. Because bytearray
indicates a string of bytes, Pig does not know how to convert its
contents to any other type. Continuing the previous example, both
bat#'base_on_balls' and bat#'ibbs' were loaded as bytearrays. The
casts in the script indicate that you want them treated as ints.
Pig does not know whether integer values in baseball are stored as ASCII strings, Java serialized values, binary-coded decimal, or some other format. So it asks the load function, because it is that function’s responsibility to cast bytearrays to other types. In general this works nicely, but it does lead to a few corner cases where Pig does not know how to cast a bytearray. In particular, if a UDF returns a bytearray, Pig will not know how to perform casts on it because that bytearray is not generated by a load function.
Before leaving the topic of casts, we need to consider cases where Pig inserts casts for the user. These casts are implicit, compared to explicit casts where the user indicates the cast. Consider the following:
--total_trade_estimate.pig
daily = load 'NYSE_daily' as (exchange:chararray, symbol:chararray,
date:chararray, open:float, high:float, low:float, close:float,
volume:int, adj_close:float);
rough = foreach daily generate volume * close;
In this case, Pig will change the second line to
(float)volume * close to do the
operation without losing precision. In general, Pig will always widen
types to fit when it needs to insert these implicit casts. So, int and
long together will result in a long; int or long and float will result
in a float; and int, long, or float and double will result in a double.
There are no implicit casts between numeric types and chararrays or
other types.
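The widening rules can be sketched with the same data. In the script below, the first two foreach statements should be equivalent, because the second simply spells out the cast that Pig inserts for the first. The aliases spelled_out and widened and the constants 1L and 2.0 are illustrative choices, not part of the original example:

```pig
daily = load 'NYSE_daily' as (exchange:chararray, symbol:chararray,
    date:chararray, open:float, high:float, low:float, close:float,
    volume:int, adj_close:float);
-- int * float: Pig inserts (float) on volume.
rough = foreach daily generate volume * close;
spelled_out = foreach daily generate (float)volume * close;
-- int + long widens to long; float * double widens to double.
widened = foreach daily generate volume + 1L, close * 2.0;
```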
[5] That is not the case in the example data. For that to be the
case, you would need to use a loader that did load the bat map with these values as
integers.