Many developers go down the path of writing their raw data in a schemaless format like JSON. This is appealing because of how easy it is to get started, but this approach quickly leads to problems. Whether due to bugs or misunderstandings between different developers, data corruption inevitably occurs. It's our experience that data corruption errors are some of the most time-consuming to debug.
Data corruption issues are hard to debug because you have very little context on how the corruption occurred. Typically you'll only notice there's a problem when there's an error downstream in the processing, long after the corrupt data was written. For example, you might get a null pointer exception due to a mandatory field being missing. You'll quickly realize that the problem is a missing field, but you'll have absolutely no information about how that data got there in the first place.
When you create an enforceable schema, you get errors at the time of writing the data, which gives you full context as to how and why the data became invalid (like a stack trace). In addition, the error prevents the program from corrupting the master dataset by writing that data.
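To make the difference concrete, here is a minimal sketch of write-time validation in plain Python. It is not Thrift and not generated code; the names `REQUIRED_FIELDS`, `SchemaError`, `validate`, and `write_record` are hypothetical and exist only for this illustration.

```python
# Illustrative only: a hand-rolled check, not a serialization framework.
REQUIRED_FIELDS = {"person", "page"}  # hypothetical mandatory fields


class SchemaError(ValueError):
    """Raised when a record does not match the schema."""


def validate(record):
    """Fail fast if a mandatory field is missing or null."""
    missing = sorted(f for f in REQUIRED_FIELDS if record.get(f) is None)
    if missing:
        raise SchemaError(f"missing required fields: {missing}")


def write_record(dataset, record):
    """Append a record to the dataset only after it passes validation."""
    validate(record)
    dataset.append(record)


master_dataset = []
write_record(master_dataset, {"person": "cookie-abc", "page": "http://example.com/"})

try:
    # The error surfaces here, at the call site that produced the bad record,
    # so the stack trace points at the code responsible for it.
    write_record(master_dataset, {"person": "cookie-abc"})
except SchemaError as err:
    rejected = str(err)
```

The invalid record never reaches `master_dataset`, and the exception is raised where the bad data was produced rather than in some later processing step.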
Serialization frameworks are an easy approach to making an enforceable schema. If you’ve ever used an object-oriented, statically typed language, using a serialization framework will be immediately familiar. Serialization frameworks generate code for whatever languages you wish to use for reading, writing, and validating objects that match your schema.
However, serialization frameworks are limited when it comes to achieving a fully rigorous schema. After discussing how to apply a serialization framework to the SuperWebAnalytics.com data model, we'll discuss these limitations and how to work around them.
3.2 Apache Thrift
Apache Thrift (http://thrift.apache.org/) is a tool that can be used to define statically typed, enforceable schemas. It provides an interface definition language to describe the schema in terms of generic data types, and this description can later be used to automatically generate the actual implementation in multiple programming languages.
OUR USE OF APACHE THRIFT
Thrift was initially developed at Facebook for building cross-language services. It can be used for many purposes, but we'll limit our discussion to its usage as a serialization framework.
Other serialization frameworks
There are other tools similar to Apache Thrift, such as Protocol Buffers and Avro. Remember, the purpose of this book is not to provide a survey of all possible tools for every situation, but to use an appropriate tool to illustrate the fundamental concepts. As a serialization framework, Thrift is practical, thoroughly tested, and widely used.
The workhorses of Thrift are the struct and union type definitions. They’re composed of other fields, such as
■ Primitive data types (strings, integers, longs, and doubles)
■ Collections of other types (lists, maps, and sets)
■ Other structs and unions
In general, unions are useful for representing nodes, structs are natural representations of edges, and properties use a combination of both. This will become evident from the type definitions needed to represent the SuperWebAnalytics.com schema components.
3.2.1 Nodes
For our SuperWebAnalytics.com user nodes, an individual is identified either by a user ID or a browser cookie, but not both. This pattern is common for nodes, and it matches exactly with a union data type—a single value that may have any of several representations.
In Thrift, unions are defined by listing all possible representations. The following code defines the SuperWebAnalytics.com nodes using Thrift unions:
union PersonID {
  1: string cookie;
  2: i64 user_id;
}

union PageID {
  1: string url;
}
Note that unions can also be used for nodes with a single representation. Unions allow the schema to evolve as the data evolves—we’ll discuss this further later in this section.
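The "one of several representations" idea can be sketched in plain Python. The class below is a hypothetical stand-in written for this illustration; it is not the code Thrift generates for the PersonID union, and it only mimics the rule that a single representation is held at a time.

```python
class PersonID:
    """Hypothetical stand-in for a union: exactly one representation is set."""

    # Field names mirror the PersonID union; Python int stands in for i64.
    FIELDS = {"cookie": str, "user_id": int}

    def __init__(self, **kwargs):
        if len(kwargs) != 1:
            raise ValueError("a union must have exactly one field set")
        (name, value), = kwargs.items()
        if name not in self.FIELDS:
            raise ValueError(f"unknown field: {name}")
        if not isinstance(value, self.FIELDS[name]):
            raise TypeError(f"{name} must be {self.FIELDS[name].__name__}")
        self.field = name
        self.value = value


by_cookie = PersonID(cookie="ABCDE")
by_user = PersonID(user_id=12345)

try:
    PersonID(cookie="ABCDE", user_id=12345)  # both set: rejected
    both_allowed = True
except ValueError:
    both_allowed = False
```

Adding a new representation later would only mean adding another entry to `FIELDS`, which is the sense in which a union leaves room for the schema to grow.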
3.2.2 Edges
Each edge can be represented as a struct containing two nodes. The name of an edge struct indicates the relationship it represents, and the fields in the edge struct contain the entities involved in the relationship.
The schema definition is very simple:
struct EquivEdge {
  1: required PersonID id1;
  2: required PersonID id2;
}

struct PageViewEdge {
  1: required PersonID person;
  2: required PageID page;
  3: required i64 nonce;
}
The fields of a Thrift struct can be denoted as required or optional. If a field is defined as required, then a value for that field must be provided, or else Thrift will give an error upon serialization or deserialization. Because each edge in a graph schema must have two nodes, they are required fields in this example.
3.2.3 Properties
Last, let’s define the properties. A property contains a node and a value for the property. The value can be one of many types, so it’s best represented using a union structure.
Let’s start by defining the schema for page properties. There’s only one property for pages, so it’s really simple:
union PagePropertyValue {
  1: i32 page_views;
}

struct PageProperty {
  1: required PageID id;
  2: required PagePropertyValue property;
}
Next let’s define the properties for people. As you can see, the location property is more complex and requires another struct to be defined:
struct Location {
  1: optional string city;
  2: optional string state;
  3: optional string country;
}

enum GenderType {
  MALE = 1,
  FEMALE = 2
}

union PersonPropertyValue {
  1: string full_name;
  2: GenderType gender;
  3: Location location;
}

struct PersonProperty {
  1: required PersonID id;
  2: required PersonPropertyValue property;
}
The location struct is interesting because the city, state, and country fields could have been stored as separate pieces of data. In this case, they’re so closely related it makes sense to put them all into one struct as optional fields. When consuming location information, you’ll almost always want all of those fields.
3.2.4 Tying everything together into data objects
At this point, the edges and properties are defined as separate types. Ideally you'd want to store all of the data together to provide a single interface to access your information. Furthermore, it also makes your data easier to manage if it's stored in a single dataset. This is accomplished by wrapping every property and edge type into a DataUnit union, as in the following code listing.
union DataUnit {
  1: PersonProperty person_property;
  2: PageProperty page_property;
  3: EquivEdge equiv;
  4: PageViewEdge page_view;
}

struct Pedigree {
  1: required i32 true_as_of_secs;
}

struct Data {
  1: required Pedigree pedigree;
  2: required DataUnit dataunit;
}
Each DataUnit is paired with its metadata, which is kept in a Pedigree struct. The pedigree contains the timestamp for the information, but could also potentially contain debugging information or the source of the data. The final Data struct corresponds to a fact from the fact-based model.
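As a rough picture of how a single fact is assembled, the sketch below mirrors the Pedigree, DataUnit, and Data definitions with Python dataclasses. This is an assumption-laden illustration, not Thrift-generated code: the nested node and edge types are simplified to plain dictionaries, and the `kind`/`value` pair is a hypothetical way of recording which union field is set.

```python
from dataclasses import dataclass

# Names of the fields in the DataUnit union from the listing above.
DATAUNIT_KINDS = frozenset(
    {"person_property", "page_property", "equiv", "page_view"}
)


@dataclass(frozen=True)
class Pedigree:
    true_as_of_secs: int  # timestamp of the fact, in seconds


@dataclass(frozen=True)
class DataUnit:
    kind: str      # hypothetical tag: which union field is set
    value: object  # simplified payload; the real schema uses typed structs

    def __post_init__(self):
        if self.kind not in DATAUNIT_KINDS:
            raise ValueError(f"unknown DataUnit field: {self.kind}")


@dataclass(frozen=True)
class Data:
    pedigree: Pedigree
    dataunit: DataUnit

    def __post_init__(self):
        # Both fields are marked required in the schema.
        if self.pedigree is None or self.dataunit is None:
            raise ValueError("pedigree and dataunit are both required")


fact = Data(
    pedigree=Pedigree(true_as_of_secs=1_400_000_000),
    dataunit=DataUnit(
        "page_view",
        {"person": {"cookie": "ABCDE"}, "page": {"url": "http://example.com/"}, "nonce": 1},
    ),
)
```

Every record in the dataset then has the same outer shape, a pedigree plus one data unit, regardless of whether it carries an edge or a property.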
3.2.5 Evolving your schema
Thrift is designed so that schemas can evolve over time. This is a crucial property, because as your business requirements change you’ll need to add new kinds of data, and you’ll want to do so as effortlessly as possible.
The key to evolving Thrift schemas is the numeric identifiers associated with each field. Those IDs are used to identify fields in their serialized form. When you want to change the schema but still be backward compatible with existing data, you must obey the following rules:
■ Fields may be renamed. This is because the serialized form of an object uses the field IDs, not the names, to identify fields.
■ A field may be removed, but you must never reuse that field ID. When deserializing existing data, Thrift will ignore all fields with field IDs not included in the schema. If you were to reuse a previously removed field ID, Thrift would try to deserialize that old data into the new field, which will lead to either invalid or incorrect data.
■ Only optional fields can be added to existing structs. You can't add required fields because existing data won't have those fields and thus won't be deserializable. (Note that this doesn't apply to unions, because unions have no notion of required and optional fields.)
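The rules above follow from the fact that serialized data is keyed by field ID rather than by name. The toy model below shows this in plain Python. It is not Thrift's wire format: a serialized object is assumed to be a simple mapping from field ID to value, and `deserialize` and the `zip_code` field are hypothetical names used only for illustration.

```python
def deserialize(serialized, schema):
    """Map field IDs to names using the schema; skip IDs the schema lacks."""
    return {
        schema[field_id]: value
        for field_id, value in serialized.items()
        if field_id in schema
    }


# A Location written under the original schema, keyed by field ID.
old_data = {1: "Chicago", 2: "IL", 3: "USA"}

schema_renamed = {1: "city", 2: "region", 3: "country"}    # field 2 renamed
schema_removed = {1: "city", 3: "country"}                 # field 2 removed
schema_reused = {1: "city", 2: "zip_code", 3: "country"}   # field 2 reused (unsafe)

renamed = deserialize(old_data, schema_renamed)
removed = deserialize(old_data, schema_removed)
reused = deserialize(old_data, schema_reused)
```

Here `renamed` still carries the old value under the new name, `removed` silently drops field 2, and `reused` ends up with the old state value `"IL"` sitting in `zip_code`, which is the kind of incorrect data the second rule warns about.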
As an example, should you want to change the SuperWebAnalytics.com schema to store a person’s age and the links between web pages, you’d make the following changes to your Thrift definition file (changes in bold font).
union PersonPropertyValue {
  1: string full_name;
  2: GenderType gender;
  3: Location location;
  4: i16 age;
}

struct LinkedEdge {
  1: required PageID source;