An important novel (and complementary) technique for storing data is column orientation. It has made its way into the de facto file format implementation of the data lake, as well as into other interactive (and web-scale) ad-hoc query systems like Dremel (Melnik et al., 2010) and its open source counterpart Apache Drill¹³. The idea, which originates from the relational DBMS space, has now been incorporated into Big Data ecosystems like Hadoop. The two common implementations fully integrated with HDFS are Apache Parquet¹⁴ and Apache Optimized Row Columnar (ORC)¹⁵. Figure 3.3 shows an example of the ORC file format to illustrate how each column is stored contiguously. The main features of column orientation are enumerated below:

• Data is stored contiguously on disk by column rather than by row. Each HDFS block then contains a range of rows of the data set, together with metadata that can be used to seek to the start or the end of any column's data. If we read just two columns, we do not have to scan the whole block, only the data of those two columns. This way, it is not necessary to create many separate files, which could incur extra overhead for the HDFS name node.

11 https://www.gluster.org/
12 http://www.alluxio.org/
13 https://drill.apache.org/
14 https://parquet.apache.org/
15 https://orc.apache.org/

• The compression ratio for each column's data will be higher than for its row-oriented counterpart, because values in the same column are usually more similar to each other. This holds especially for scientific data sets involving time series, where new values representing a given phenomenon are likely to be similar to those just measured. These implementations go beyond simple compression of the column data by allowing different compression algorithms for different types of data, or even by choosing one on the fly as the data set is created, trying several alternatives. For instance, for a column representing a measure (a floating point number) from a particular sensor or instrument, it would be reasonable to use a delta compression algorithm that stores the differences between consecutive values, which will probably require fewer bits to represent. It is important to note that I/O typically takes longer than the CPU time needed to (de)compress the same data. For columns with enumerated values, the distinct values are stored once in the metadata section, each mapped to a short (minimal) bit code, and the column data stores only those codes. This of course requires (de)serializing the whole block (range of rows) to rebuild the original values of each column, but the technique usually performs well and takes less time overall. Examples of compression techniques are null suppression, delta encoding, dictionary encoding and run-length encoding, among others (Abadi et al., 2006).

• Predicate push-down. Not only can we specify the set of columns to be read for a given workflow, but for those columns that are queried or filtered under some constraints, we can also push the predicates down so that unneeded data is not even deserialized in the worker nodes, or not even read from disk (only the available metadata is accessed). This is achieved through the use of Bloom filters (Bloom, 1970), a compact data structure used to test whether an element is a member of a set. False positives are possible (a trade-off that allows a much smaller size), but false negatives are not, which makes them well suited to columnar stores.

• Complex nested data structures can be represented in the format (not just a simple flattening of nested namespaces, as in the naive implementation used in Section 4.3). The data types range from simple integers, floating point numbers and strings to structs, lists, maps, unions and arbitrary binary data. One technique for implementing this feature is presented in more detail in Melnik et al. (2010).

• This data format representation is agnostic to the data processing engine, the data model and even the programming language, with Parquet being the best suited for this interoperability across systems and languages.

• Further improvements of the column-based approach include the ability to split files without scanning for markers, some kind of indexing per chunk or stripe, and the ability to update or delete rows, which makes it ACID compliant. However, this is obviously not intended to meet OLTP requirements. It can support millions of rows updated per transaction, but it cannot support millions of transactions per hour.
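The per-column encodings named in the list above (delta, dictionary and run-length encoding) can be illustrated with a minimal Python sketch. It is not the Parquet or ORC implementation: the function names are hypothetical, and the sensor readings are assumed to be fixed-point integers rather than floating point numbers so that the deltas stay exact.

```python
from itertools import groupby


def delta_encode(values):
    """Keep the first value, then store only the difference to the previous one."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """Rebuild the original values with a running sum."""
    values, total = [], 0
    for d in deltas:
        total += d
        values.append(total)
    return values


def dictionary_encode(values):
    """Store each distinct value once and replace the column by small integer codes."""
    dictionary = list(dict.fromkeys(values))  # distinct values, first-seen order
    index = {v: i for i, v in enumerate(dictionary)}
    return dictionary, [index[v] for v in values]


def dictionary_decode(dictionary, codes):
    return [dictionary[c] for c in codes]


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(v, len(list(group))) for v, group in groupby(values)]


def run_length_decode(pairs):
    return [v for v, count in pairs for _ in range(count)]


# Hypothetical column contents, chosen only for illustration.
readings = [1000, 1002, 1003, 1003, 1007]   # slowly varying measure
spectral = ["G", "K", "G", "M", "K"]        # enumerated values
flags = ["a", "a", "a", "b", "b"]           # long runs of repeats

print(delta_encode(readings))       # [1000, 2, 1, 0, 4]
print(dictionary_encode(spectral))  # (['G', 'K', 'M'], [0, 1, 0, 2, 1])
print(run_length_encode(flags))     # [('a', 3), ('b', 2)]
```

After the first value, the deltas are small numbers that need fewer bits than the raw readings, which is the effect the delta compression bullet describes. Which encoding pays off depends on the column, which is why a columnar format can pick one per column.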

52 Chapter 3. Enabling Large Scale Data Science and Data Products

Figure 3.3: ORC file layout. Stripes are independent of one another and contain only entire rows, so rows never straddle stripe boundaries. The data for each column is stored separately, and there are statistics about each column at both file and stripe level. The stripe footer contains, among other things, the encoding of each column.
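To make the predicate push-down idea concrete, the following Python sketch keeps one Bloom filter per stripe and consults it before touching the column data. It is a toy model under stated assumptions, not the ORC reader: the class, the function names, the filter size (1024 bits) and the number of hash functions (3) are all hypothetical choices made for illustration.

```python
import hashlib


class BloomFilter:
    """Compact membership test: false positives are possible, false negatives are not."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item):
        # Derive num_hashes bit positions by salting a SHA-256 digest with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def build_stripe(values):
    """Store the column values of one stripe together with their Bloom filter."""
    bloom = BloomFilter()
    for v in values:
        bloom.add(v)
    return {"values": values, "bloom": bloom}


def find_equal(stripes, target):
    """Evaluate `column == target`, scanning only stripes that may contain it."""
    hits, scanned = [], 0
    for stripe in stripes:
        if not stripe["bloom"].might_contain(target):
            continue  # definitely absent: the column data is never read
        scanned += 1
        hits.extend(v for v in stripe["values"] if v == target)
    return hits, scanned


# Hypothetical identifiers spread over three stripes.
stripes = [
    build_stripe([101, 102, 103]),
    build_stripe([201, 202, 203]),
    build_stripe([301, 302, 303]),
]
hits, scanned = find_equal(stripes, 202)
print(hits, f"scanned {scanned} of {len(stripes)} stripes")
```

A false positive only costs an unnecessary scan of one stripe, while the absence of false negatives guarantees that no matching row is ever skipped.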

These storage formats are crucial in any architecture meant to support scientific workflows, because typical scientific data sets are multidimensional, especially those found in astronomical surveys, where a set of instruments observes different phenomena in the same area of the sky (astrometry, photometry, etc.). Workflows accessing them do not normally need all features, only those relevant to the particular use case being studied. With the data organized in columns, scanning time is greatly reduced because only the required dimensions are fetched from disk. Moreover, less memory is consumed, resulting in a more efficient use of a scarce resource, which will eventually let machine learning algorithms or models (normally iterative) run faster (on data that stays in memory). Figure 3.4 shows an example of a real astronomical data set (GUMS version 10) being compressed in a popular MPP database (Greenplum). We observe that the columnar format compresses the data set even further. Using different compression techniques for each column would certainly compress it even more.
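The saving obtained by fetching only the required dimensions can be sketched in a few lines of Python. This is an in-memory illustration rather than a file format: the column names and values are invented, every row is assumed to carry the same fields, and counting values read is used as a rough stand-in for I/O cost.

```python
def to_columns(rows):
    """Pivot a list of row dicts into one contiguous list per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def project(columns, wanted):
    """Fetch only the requested columns; the others are never touched."""
    return {name: columns[name] for name in wanted}


def values_read_row_store(rows):
    # A row store scans every field of every row, whatever the projection is.
    return sum(len(row) for row in rows)


def values_read_column_store(columns, wanted):
    # A column store reads only the values of the projected columns.
    return sum(len(columns[name]) for name in wanted)


# Hypothetical survey rows: five features per source, of which two are needed.
rows = [
    {"source_id": 1, "ra": 10.1, "dec": -5.0, "mag_g": 14.2, "parallax": 1.3},
    {"source_id": 2, "ra": 10.2, "dec": -5.1, "mag_g": 15.0, "parallax": 0.8},
    {"source_id": 3, "ra": 10.4, "dec": -5.3, "mag_g": 13.7, "parallax": 2.1},
    {"source_id": 4, "ra": 10.7, "dec": -5.4, "mag_g": 16.1, "parallax": 0.4},
]
wanted = ["ra", "dec"]

columns = to_columns(rows)
positions = project(columns, wanted)

print(values_read_row_store(rows))               # 20 values scanned
print(values_read_column_store(columns, wanted)) # 8 values scanned
```

The gap widens with the number of features per row, which is why wide, multidimensional data sets benefit most from a columnar layout.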

This innovation is the best match for information that is going to be stored in the data lake. It facilitates exploration of large data sets as the end user decides what to read from disk. The compatibility of the two major implementations with the main programming languages and data processing engines makes it suitable for scientific data processing and archiving.

Figure 3.4: Disk size of the GUMS version 10 data set for different storage models selected in Greenplum DBMS.
