

An HDInsight cluster deployment consists of multiple nodes: two head nodes to provide high availability for master daemons, two secure gateway nodes to provide access to the HDInsight cluster, and one or more worker or data nodes that you select for computation. As you can see in Figure 6.2 (and discussed in Hour 3, “Hadoop Distributed File System Versions 1.0 and 2.0,” and Hour 5, “MapReduce—Advanced Concepts and YARN”), to maintain high availability of head nodes, the cluster leverages ZooKeeper services for fault detection and to elect a new active head node in case the currently active head node fails.

FIGURE 6.2 Physical architecture of the HDInsight cluster.
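
The failover behavior described above can be illustrated with a small toy model. This is only a sketch under stated assumptions: `HeadNode` and `ToyCoordinator` are hypothetical names invented for illustration, the "first healthy node wins" rule is a simplification, and nothing here uses or reproduces the real ZooKeeper API.

```python
from dataclasses import dataclass


@dataclass
class HeadNode:
    """Hypothetical record for one head node and its health status."""
    name: str
    healthy: bool = True


@dataclass
class ToyCoordinator:
    """Toy stand-in for the coordination role ZooKeeper plays (not its API)."""
    head_nodes: list
    active: str = ""

    def __post_init__(self):
        # Pick an initial active head node as soon as the coordinator exists.
        self.elect()

    def elect(self) -> str:
        """Keep the active node if it is healthy; otherwise pick the first healthy one."""
        by_name = {node.name: node for node in self.head_nodes}
        current = by_name.get(self.active)
        if current is not None and current.healthy:
            return self.active
        for node in self.head_nodes:
            if node.healthy:
                self.active = node.name
                return self.active
        raise RuntimeError("no healthy head node available")


nodes = [HeadNode("headnode0"), HeadNode("headnode1")]
coordinator = ToyCoordinator(nodes)
print(coordinator.active)    # headnode0

nodes[0].healthy = False     # simulate failure of the active head node
print(coordinator.elect())   # headnode1
```

In the real cluster, clients do not call anything like `elect()` themselves; the secure gateway nodes connect them to whichever head node is currently active.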

Head nodes—Head nodes run master daemons such as NameNode and Secondary NameNode for data storage and run the Resource Manager (or JobTracker) for processing, along with other data services daemons such as HiveServer, HiveServer2, Pig, and Sqoop. Head nodes also run operational services such as Oozie and Ambari.

Secure gateway nodes—Secure gateway nodes are proxies that serve as a gateway to your Azure HDInsight cluster. They perform authentication and authorization and expose endpoints for WebHcat, Ambari, HiveServer, HiveServer2, and Oozie on port 443. To authenticate to the HDInsight cluster, you use the username and password you specified at the time of HDInsight cluster provisioning. Secure gateway nodes are also responsible for connecting you to the current active head node (in case of head node failover).

Data or worker nodes—Data or worker nodes run all the slave daemons, including DataNode, TaskTracker, NodeManager, Pig, and Hive Client. In a typical scenario, you have multiple data or worker nodes for the distributed processing of your Big Data.

ZooKeeper nodes—ZooKeeper nodes support the high availability of the head nodes. The HDInsight cluster leverages them to detect a fault in the currently active head node and to elect a new active head node if the current one fails (this overcomes the limitation of the master node being a single point of failure in earlier Hadoop releases).

SQL database—When you want to use Hive or Oozie, you can use the Azure SQL database to store metadata related to Hive or Oozie.

Azure Storage Blob—By default, HDInsight uses Azure Storage Blob (also called WASB, short for Windows Azure Storage Blob) for data storage. It has been implemented so that it appears as a full-featured Hadoop Distributed File System (HDFS) to end users. The best part of WASB is that Hadoop users still use the HDFS commands as usual to interact with it; at the same time, it can be accessed using Azure Storage Blob REST APIs, through other applications, or through one of the many popular Azure Storage Explorer tools.

Although you can change the configuration to use the local drives of the HDInsight cluster (more specifically, of the data or worker nodes) as the HDFS layer for data storage, we recommend that you use the default Azure Storage Blob because it is optimized for the storage of Big Data and computations on that data.
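
As a rough illustration of how WASB looks to a Hadoop user, the sketch below builds a WASB URI and the matching `hadoop fs` command line. The container name, storage account name, and path are hypothetical placeholders, and the helper `wasb_uri` is invented for this example. The command is only assembled, not executed.

```python
def wasb_uri(container: str, account: str, path: str = "", secure: bool = False) -> str:
    """Build a URI of the form wasb://<container>@<account>.blob.core.windows.net/<path>."""
    scheme = "wasbs" if secure else "wasb"  # wasbs is the TLS-encrypted variant
    return f"{scheme}://{container}@{account}.blob.core.windows.net/{path.lstrip('/')}"


# Hypothetical container, storage account, and path, for illustration only.
uri = wasb_uri("mycontainer", "mystorageaccount", "/example/data/sample.log")
print(uri)

# The usual HDFS shell commands accept the URI like any other file system path.
list_command = ["hadoop", "fs", "-ls", uri]
print(" ".join(list_command))
```

Because the data is addressed by storage account and container rather than by cluster, the same URI remains valid after the cluster that wrote the data has been deleted.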

Note: WASB Versus Local Disk Storage for HDFS

Microsoft has implemented a mesh grid network called Azure Flat Network Storage (also known as Quantum 10 or Q10 network) to offer high-bandwidth, high-speed connectivity between WASB and the worker or data nodes of your HDInsight cluster.

To minimize network data transfer, the implementation uses WASB during the initial and final streaming phases. Most of the other tasks are performed intra-node (the map, reduce, and sort tasks typically are performed on the local disk residing with the worker or data nodes themselves).

GO TO Hour 8, “Storing Data in Microsoft Azure Storage Blob,” looks at why HDInsight chooses WASB as the default storage option and how it makes sense both technically and from a business perspective.

Note: HDInsight Service and the Azure Storage Service

Azure Storage Blob is a high-capacity, highly scalable, highly available storage option that costs significantly less and can be shared by other applications that run outside your HDInsight cluster. Storing data in an Azure Storage Blob enables the HDInsight clusters used for computation to be safely released without losing data: the data is stored in Azure Storage Blob and not on the local drives in the HDInsight cluster.

In other words, you pay for the HDInsight cluster just for the time you are using it for computation. Data is stored in the Azure Storage Blob, which is already a low-cost storage option and is decoupled from the HDInsight cluster. You also don’t need an operational HDInsight cluster for uploading or downloading data from the Azure Storage Blob; you can do it anytime, anywhere with Azure Storage Blob REST APIs, using PowerShell, or through one of the many popular Azure Storage Explorer tools (see http://azure.microsoft.com/en-us/documentation/articles/hdinsight-upload-data/).

When you need to do computation again, you can simply provision another HDInsight cluster based on your new requirement (either smaller than, bigger than, or the same size as the earlier cluster), point it back to the same Azure Storage Blob, and use the data again for further computation.

The two head nodes and the data or worker nodes (as many as you have requested) are billed on an hourly basis, prorated to the nearest minute that the cluster exists. The secure gateway nodes and the ZooKeeper nodes are free as of this writing. Charges start when cluster creation completes and stop when you request that the cluster be deleted.
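
The billing rule can be worked through with a short calculation. This is a minimal sketch, not an official pricing formula: the hourly rates below are made-up numbers, the function names are invented for illustration, and the rounding uses Python's built-in `round`. Always check the Microsoft Azure pricing site for actual rates.

```python
from datetime import datetime, timedelta


def billable_minutes(created: datetime, deleted: datetime) -> int:
    """Minutes between creation completing and the delete request, to the nearest minute."""
    return round((deleted - created).total_seconds() / 60)


def cluster_cost(head_rate: float, worker_rate: float, worker_count: int,
                 minutes: int, head_count: int = 2) -> float:
    """Hourly rate of all billed nodes, prorated by the minutes the cluster existed.

    Secure gateway and ZooKeeper nodes are left out because the text
    describes them as free as of its writing.
    """
    hourly = head_count * head_rate + worker_count * worker_rate
    return hourly * minutes / 60


# Hypothetical rates (currency units per node per hour), for illustration only.
created = datetime(2015, 1, 1, 9, 0, 0)
deleted = created + timedelta(minutes=90, seconds=20)

minutes = billable_minutes(created, deleted)
cost = cluster_cost(head_rate=0.50, worker_rate=0.25, worker_count=4, minutes=minutes)
print(minutes, cost)   # 90 3.0
```

With these assumed rates, two head nodes and four worker nodes cost 2.00 per hour in total, so a cluster that exists for 90 minutes costs 3.00.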

Unlike a virtual machine in Microsoft Azure, which can be shut down or deallocated when not in use to save on cost, an HDInsight cluster cannot be deallocated or put on hold. As mentioned earlier, you can delete your cluster safely at any time (by default, your data is stored in Azure Storage Blob, and uploading data to or downloading it from Azure Storage Blob does not require an operational HDInsight cluster, so you can use different methods for uploading or downloading). Then you can create another HDInsight cluster with the same specification or a different one, based on your new needs. Afterward, you can start processing again without losing your data. By doing this, you save on cost by paying only when you are using the HDInsight cluster.

See Figure 6.3 for Hadoop costs and Figure 6.4 for Hadoop + HBase costs as of this writing. You have many more options for VM size than those shown in Figures 6.3 and 6.4, and you can choose among them based on your requirements, such as a memory-intensive option, a compute-intensive option, or faster VMs with solid-state drives (SSDs). As advised earlier, always refer to the Microsoft Azure pricing site for the latest VM size options and pricing models.

FIGURE 6.3 HDInsight costs for Hadoop only.

Note

Other Microsoft Azure services associated with HDInsight, such as Storage and Data Transfers, are billed separately using the standard rates for Storage and Data Transfers. Standard data transfer charges also are applied when transferring data from one region to another.