Several NoSQL systems are in use. For clarity, we have divided them by the typical usage scenarios we often deal with, for example, Online Transaction Processing (OLTP) or Online Analytical Processing (OLAP).

No current NoSQL system fully supports the needs of OLTP; each lacks a couple of important capabilities. This section covers the following four categories of NoSQL systems used with OLTP:

Key-value store databases

Columnar, or column-oriented, or column-store databases

Document-oriented databases

Graph databases

GO TO For more information on OLTP support, refer to the “Limitations of NoSQL Systems” section, later in this hour.

Key-Value Store Databases

Key-value store databases store data as a collection of key-value pairs in which each possible key appears at most once. This is similar to the hash tables of the programming world, with a unique key and a pointer to a particular item of data. The database stores only pairs of keys and values, and it retrieves a value when its key is known. These mappings are usually accompanied by cache mechanisms to maximize performance. Key-value stores are probably the simplest type of NoSQL database and do not fit every Big Data problem. They are ideal for storing web user profiles, session information, and shopping carts. They are not ideal if relationships between data items are critical or a transaction spans keys.

A file system can be considered a key-value store, with the file path/name as the key and the actual file content as the value. Figure 1.12 shows an example of a key-value store.

FIGURE 1.12 Key-value store database storage structure.

In another example, with phone-related data, "Phone Number" is considered the key, with associated values such as "(123) 111-12345".
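The hash-table analogy above can be sketched in a few lines of Python. This is only an illustration: the class name `KeyValueStore`, its methods, and the sample keys are assumptions made for this sketch, not the API of any product named in this hour.

```python
class KeyValueStore:
    """Toy in-memory key-value store: each key appears at most once."""

    def __init__(self):
        self._data = {}  # a Python dict is itself a hash table

    def put(self, key, value):
        # Writing to an existing key replaces its value, so keys stay unique.
        self._data[key] = value

    def get(self, key, default=None):
        # Lookup needs the key; there is no way to query by value.
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)


store = KeyValueStore()
store.put("session:42", {"user": "alice", "cart": ["book", "pen"]})
store.put("Phone Number", "(123) 111-12345")
store.put("session:42", {"user": "alice", "cart": ["book"]})  # overwrites

print(store.get("session:42"))   # the latest value for that key
print(store.get("missing-key"))  # None: unknown keys return the default
```

Note that the store treats each value as opaque. Updating two keys together, or finding every session that contains a given item, would need logic outside the store, which is why relationships and multi-key transactions are a poor fit.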

Dozens of key-value store databases are in use, including Amazon Dynamo, Microsoft Azure Table storage, Riak, Redis, and Memcached.

Amazon Dynamo

Amazon Dynamo was developed as an internal technology at Amazon for its e-commerce businesses, to address the need for an incrementally scalable, highly available key-value storage system. It is one of the most prominent key-value store NoSQL databases.

Amazon has described Dynamo as the storage layer behind core services of its e-commerce platform, such as the shopping cart. The technology has been designed to enable users to trade off cost, consistency, durability, and performance while maintaining high availability.

Microsoft Azure Table Storage

Microsoft Azure Table storage is another example of a key-value store that allows for rapid development and fast access to large quantities of data. It offers highly available, massively scalable key-value–based storage so that an application can automatically scale to meet user demand. In Microsoft Azure Table, key-value pairs are called Properties and are useful in filtering and specifying selection criteria; they belong to Entities, which, in turn, are organized into Tables. Microsoft Azure Table features optimistic concurrency and, as with other NoSQL databases, is schema-less. The properties of each entity in a specific table can differ, meaning that two entities in the same table can contain different collections of properties, and those properties can be of different types.
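The two ideas in this description, schema-less entities and optimistic concurrency, can be illustrated with a small Python sketch. It does not use or imitate the Azure SDK: `EntityTable`, `ConflictError`, and the integer version counter are hypothetical names and mechanisms chosen for this example only.

```python
class ConflictError(Exception):
    """Raised when an entity changed since the caller last read it."""


class EntityTable:
    """Toy table of entities; each entity is a free-form set of properties."""

    def __init__(self):
        self._entities = {}  # entity key -> (version, properties dict)

    def insert(self, key, properties):
        if key in self._entities:
            raise KeyError(f"entity {key!r} already exists")
        self._entities[key] = (1, dict(properties))

    def read(self, key):
        version, properties = self._entities[key]
        return version, dict(properties)

    def update(self, key, properties, expected_version):
        # Optimistic concurrency: no lock is held between read and update;
        # the write is accepted only if nobody else changed the entity.
        current_version, _ = self._entities[key]
        if current_version != expected_version:
            raise ConflictError(f"{key!r} is now at version {current_version}")
        self._entities[key] = (current_version + 1, dict(properties))


table = EntityTable()
# Two entities in the same table with different properties and types.
table.insert("user-1", {"name": "Ana", "age": 31})
table.insert("user-2", {"name": "Ben", "signup": "2015-03-01", "vip": True})

version, props = table.read("user-1")
props["age"] = 32
table.update("user-1", props, expected_version=version)  # accepted

try:
    # A second writer still holding the old version is rejected.
    table.update("user-1", {"name": "Ana"}, expected_version=version)
except ConflictError as exc:
    print("rejected:", exc)
```

The rejected writer would normally re-read the entity and retry, which is cheaper than locking when conflicts are rare.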

Columnar or Column-Oriented or Column-Store Databases

Unlike a row-store database system, which stores the data from all the columns of a row together, a column-oriented database stores the data from a single column together. You might be wondering how a different physical layout of the same data (storing it in a columnar format instead of the traditional row format) can improve flexibility and performance.

In a column-oriented database, the flexibility comes from the fact that adding a column is both easy and inexpensive, with columns applied on a row-by-row basis. Each row can have a different set of columns, making the table sparse. In addition, because the data from a single column is stored together, adjacent values tend to be similar or repeated, so the database achieves a greater degree of compression, improving the overall performance.
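A minimal sketch of why the columnar layout compresses well follows. The sample rows and column names are made up for illustration, and run-length encoding is used here only as one simple compression scheme; the text does not say which scheme any particular database uses.

```python
from itertools import groupby

# Row layout: all the columns of a row are kept together.
rows = [
    {"id": 1, "country": "US", "status": "active"},
    {"id": 2, "country": "US", "status": "active"},
    {"id": 3, "country": "US", "status": "inactive"},
    {"id": 4, "country": "DE", "status": "inactive"},
]

# Column layout: all the values of a column are kept together.
# A row that lacks a column contributes None, which keeps the table sparse.
column_names = ["id", "country", "status"]
columns = {name: [row.get(name) for row in rows] for name in column_names}


def run_length_encode(values):
    """Collapse runs of equal adjacent values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


print(columns["country"])                     # ['US', 'US', 'US', 'DE']
print(run_length_encode(columns["country"]))  # [('US', 3), ('DE', 1)]
print(run_length_encode(columns["status"]))   # [('active', 2), ('inactive', 2)]
```

In the row layout the repeated values are interleaved with unrelated ones, so the same encoding would find almost no runs to collapse.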

Column-oriented or column-store databases are ideal for site searches, blogs, content management systems, and counter analytics. Figure 1.13 shows the difference between a row store and a column store.

FIGURE 1.13 Row-store versus column-store storage structure.

Some RDBMSs, such as SQL Server 2012 and later, have begun to support storing data in a column-oriented structure. The following NoSQL databases also support column-oriented storage and, unlike RDBMSs, in which the schema is fixed, allow a different schema for each row.

Apache Cassandra

Apache Cassandra was originally developed at Facebook to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. It is well suited to mission-critical data and is a good choice when you need scalability and high availability without compromising on performance. Cassandra’s support for replicating across multiple data centers provides lower latency for users and the peace of mind that you can survive even regional outages.

Apache HBase

Apache HBase is a distributed, versioned, column-oriented database management system that runs on top of HDFS. An HBase system comprises a set of tables. Each table contains rows and columns, much like in a traditional relational table. Each table must have an element defined as a primary key, and all access attempts to HBase tables must use this primary key. An HBase column represents an attribute of an object. HBase enables many columns to be grouped together into column families, so the elements of a column family are all stored together. This differs from a relational table, which stores together all the columns of a given row. HBase mandates that you predefine the table schema and specify the column families. However, it also enables you to add columns to column families at any time. The schema is thus flexible and can adapt to changing application requirements. Apache HBase is a good choice when you need random, real-time read/write access to your sparse data sets, which are common in many Big Data use cases. HBase supports writing applications in Avro, REST, and Thrift.
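The table model described above (a row key, predefined column families, columns added freely within a family, and versioned values) can be pictured as nested dictionaries. The following Python sketch is a mental model only, not the HBase client API; `SketchTable`, its methods, and the family and column names are hypothetical.

```python
class SketchTable:
    """Toy model: row key -> column family -> column -> versioned values."""

    def __init__(self, column_families):
        # Column families are fixed when the table is defined.
        self.column_families = set(column_families)
        self._rows = {}

    def put(self, row_key, family, column, value, timestamp):
        if family not in self.column_families:
            raise KeyError(f"unknown column family {family!r}")
        # Columns inside a family need no prior declaration.
        cells = (self._rows.setdefault(row_key, {})
                 .setdefault(family, {})
                 .setdefault(column, []))
        cells.append((timestamp, value))
        cells.sort(key=lambda cell: cell[0], reverse=True)  # newest first

    def get(self, row_key, family, column):
        """Return the newest value for the cell, or None if it is absent."""
        cells = self._rows.get(row_key, {}).get(family, {}).get(column)
        return cells[0][1] if cells else None


users = SketchTable(column_families=["info", "contact"])
users.put("row-001", "info", "name", "Ana", timestamp=1)
users.put("row-001", "contact", "email", "ana@example.com", timestamp=1)
users.put("row-001", "contact", "email", "ana@example.org", timestamp=2)
users.put("row-002", "info", "nickname", "Benny", timestamp=1)  # new column

print(users.get("row-001", "contact", "email"))  # newest version wins
print(users.get("row-002", "contact", "email"))  # None: sparse row
```

Every read and write starts from the row key, which mirrors the rule that all access to a table goes through its primary key.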

Document-Oriented Databases

Document-oriented databases, like other NoSQL systems, are designed for horizontal scalability or scale-out needs. (Scaling out refers to spreading the load over multiple hosts.) As your data grows, you can simply add more commodity hardware to scale out and distribute the load. These systems are designed around a central concept of a document. Each document-oriented database implementation differs in the details of this definition, but they all generally assume that documents encapsulate and encode data in some standard formats or encodings, such as XML, YAML, and JSON, as well as binary forms such as Binary JSON (BSON) and PDF. In the case of a relational table, every record in a table has the same sequence of fields (they contain NULL or are empty if they are not being used). This means they have a rigid schema. In contrast, a document-oriented database contains collections, analogous to relational tables. The documents in a collection might have completely different fields: the fields and their value data types can vary from one document to another; furthermore, documents can even be nested. Figure 1.14 shows the document-oriented database storage structure.

FIGURE 1.14 Document-oriented database storage structure.
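As a rough illustration of the document model, the Python sketch below keeps one collection as a list of dictionaries and encodes a document as JSON. The field names, the sample data, and the `find` helper are assumptions made for this example and do not belong to any specific product.

```python
import json

# One collection; the two documents do not share the same fields.
articles = [
    {"id": 1, "title": "Intro to NoSQL", "tags": ["nosql", "overview"]},
    {
        "id": 2,
        "title": "Scaling out",
        # A nested document inside a document.
        "author": {"name": "Ana", "contact": {"email": "ana@example.com"}},
    },
]


def find(collection, **criteria):
    """Return documents whose top-level fields equal all given criteria."""
    return [
        doc for doc in collection
        if all(doc.get(field) == value for field, value in criteria.items())
    ]


# Documents are encoded in a standard format such as JSON.
encoded = json.dumps(articles[1])
decoded = json.loads(encoded)

print(decoded["author"]["contact"]["email"])
print(find(articles, title="Scaling out"))
print(find(articles, author="nobody"))  # a missing field simply never matches
```

A relational table would need NULL-filled columns or extra joined tables to hold the same two records.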

Document-oriented databases are ideal for storing application logs, articles, blogs, and e-commerce applications. They are also suitable when aggregation is needed. These databases are not ideal when transactions or queries span aggregations.

Several implementations of document-oriented databases exist, including MongoDB, CouchDB, Couchbase, and Microsoft DocumentDB.

MongoDB

MongoDB derives its name from the word humongous and stores data as JSON-like documents with dynamic schemas, in a binary format called BSON. MongoDB has databases, collections (like a table in the relational world), documents (like a record in the relational world), and indexes, much like a traditional relational database system. In MongoDB, you don’t need to define fields in advance. No schema exists for fields within a document: the fields and their value data types can vary from one document to another. In practice, you typically store documents of the same structure within a collection. In fact, a collection itself does not need to be defined. The database creates a collection on the first insert statement.
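The "collection is created on the first insert" behavior can be mimicked in plain Python. This sketch deliberately avoids the real MongoDB driver so that it runs without a server; `SketchDatabase` and its methods are hypothetical stand-ins, not MongoDB calls.

```python
from collections import defaultdict


class SketchDatabase:
    """Toy database whose collections appear on first insert."""

    def __init__(self):
        self._collections = defaultdict(list)

    def collection_names(self):
        return sorted(self._collections)

    def insert(self, collection, document):
        # Indexing the defaultdict creates the collection if it is new.
        self._collections[collection].append(dict(document))

    def find(self, collection):
        # .get() does not trigger creation, so reads leave no empty collection.
        return list(self._collections.get(collection, []))


db = SketchDatabase()
print(db.collection_names())  # [] : nothing has been defined

db.insert("users", {"name": "Ana", "age": 31})
db.insert("users", {"name": "Ben", "languages": ["en", "de"]})  # other fields

print(db.collection_names())  # ['users']
print(db.find("orders"))      # [] : reading does not create "orders"
```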

CouchDB

Similar to MongoDB, CouchDB is a document-oriented database that stores data in JSON document format. CouchDB has a fault-tolerant storage engine that puts the safety of the data first. Each CouchDB database is a collection of independent documents; each document maintains its own data and self-contained schema. You can use JavaScript as the CouchDB query language for MapReduce programming (for more on this, see Hour 4, “The MapReduce Job Framework and Job Execution Pipeline”). Its API is based on HTTP, which makes it particularly suited for interactive web applications.
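The MapReduce style of querying can be sketched as follows. CouchDB itself expects the map and reduce functions in JavaScript; this Python version only mirrors the idea, and the document fields and the `run_view` helper are assumptions made for illustration.

```python
from collections import defaultdict

documents = [
    {"type": "post", "author": "ana", "words": 300},
    {"type": "post", "author": "ben", "words": 120},
    {"type": "comment", "author": "ana", "words": 15},
    {"type": "post", "author": "ana", "words": 80},
]


def map_posts_by_author(doc):
    """Map step: emit (key, value) pairs for the documents of interest."""
    if doc.get("type") == "post":
        yield doc["author"], doc["words"]


def reduce_sum(values):
    """Reduce step: combine all the values emitted under one key."""
    return sum(values)


def run_view(docs, map_fn, reduce_fn):
    # Group the emitted values by key, then reduce each group.
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in map_fn(doc):
            grouped[key].append(value)
    return {key: reduce_fn(values) for key, values in grouped.items()}


word_counts = run_view(documents, map_posts_by_author, reduce_sum)
print(word_counts)  # {'ana': 380, 'ben': 120}
```

Because the map function inspects one document at a time, documents with different fields can sit side by side; those that do not match simply emit nothing.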

Graph Databases

Graph databases are built on the concept of nodes, with relationships and properties. Instead of relational tables of rows and columns and the rigid structure of an RDBMS, they use a flexible graph model that can scale across many machines. A graph database is designed for data that can be better represented as a graph (elements interconnected with an undetermined number of relationships between them). For example, this could be social relationships, public transport links, road maps, or network topologies. Examples of graph databases include Sones GraphDB and Neo4j. Figure 1.15 shows the structure of graph databases at a high level.

FIGURE 1.15 Graph database structure.
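To make nodes, relationships, and properties concrete, here is a small Python sketch of a social graph with a breadth-first traversal. The node names, the relationship types, and the helper functions are invented for this example and are not the query interface of Neo4j or any other product.

```python
from collections import deque

# Nodes carry properties; relationships are (source, type, target) triples.
nodes = {
    "ana": {"kind": "person", "city": "Lisbon"},
    "ben": {"kind": "person", "city": "Berlin"},
    "chen": {"kind": "person", "city": "Taipei"},
    "dara": {"kind": "person", "city": "Dublin"},
}
relationships = [
    ("ana", "FRIEND_OF", "ben"),
    ("ben", "FRIEND_OF", "chen"),
    ("chen", "FRIEND_OF", "dara"),
    ("ana", "WORKS_WITH", "dara"),
]


def neighbors(node, rel_type):
    """Follow relationships of one type, in either direction."""
    found = []
    for source, kind, target in relationships:
        if kind != rel_type:
            continue
        if source == node:
            found.append(target)
        elif target == node:
            found.append(source)
    return found


def hops_between(start, goal, rel_type):
    """Breadth-first search: fewest relationships linking start to goal."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, distance = queue.popleft()
        if node == goal:
            return distance
        for nxt in neighbors(node, rel_type):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, distance + 1))
    return None  # no path of that relationship type


print(hops_between("ana", "dara", "FRIEND_OF"))   # 3
print(hops_between("ana", "dara", "WORKS_WITH"))  # 1
print(nodes["dara"]["city"])                      # Dublin
```

Answering the same "how many hops apart" question in a relational schema would take one self-join per hop, which is the kind of query the graph model is meant to simplify.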