5. MARCO TEÓRICO Y CONCEPTUAL
5.6 ENSAYOS DE LABORATORIO
The database system is a very important tool for data management, because it offers the ability to manage, access and query data in an efficient manner. Many existing studies have discussed the use of relational databases for astronomical data management. Gray et al. [277] presented the design of a database system called SkyServer, which allows public astronomers and scientists to access data released by the Sloan Digital Sky Survey (SDSS) [187] through a browser. They use a Microsoft SQL server [278] for the database implementation for archiving SDSS data. In this paper, they describe their database design in detail, including the relational schema design, the tables designed for different observation data, and the design of the data access features (views, indices, and user defined access functions). In addition, they also list 20 different SQL query examples for lookup of different types of observational data from Skyserver database system. They conclude that using views and many indices give convenient access to the conventional data, and that the performance of data lookups can be improved through defined store procedures and indices.
Tankar et al. [279] proposed a Catalog Archive Server (CAS) model for accessing catalog data produced by SDSS. This model contains a set of RDBMS user defined functions and it is implemented on top of the RDBMS system. As they conclude, the use of these user defined functions could enhance the performance of accessing particular pieces of catalog data from the relational database.
Power [280] conducted a performance study between MySQL [281] and Oracle [282] databases, using SQL queries for lookup of particular star data (i.e: Searches of white dwarf stars, lookup of objects that are close to each other, and searches of objects with rmag values less than 21.75, among others) from astronomy catalogues that contain approximately a billion objects. Based on his results, most of the queries can be
165
performed very quickly, and the query response time of the Oracle database outperforms that of the MySQL database. However, he also points out that the query response time of searching neighbor stars for a catalogue of a billion objects and cross matching two catalogues consisting of a billion objects is very slow, and that these two types of searches are not well supported for large catalogues in a database when only using SQL.
Many similar studies have also been carried out in the astronomical domain to investigate the feasibility of using relational database systems for astronomical data management [283] [284] [285]. However, the studies into using relational databases for data management do not only exist in terms of the astronomical domain; there are also many studies about relational database systems that have been conducted in other domains for discussing the schema design of relational database and performance evaluation, among others [286][287][288].
However, as data grows very quickly, the relational database faces a scalability issue; because most of the current systems will store and process vast amounts of data, processing these vast amounts of data requires speed, flexible schema and high scalability, which have pushed relational databases to their limits [289]. Mohamed et al. [275] state that the main scalability limitation of relational databases is that they depend on vertical scalability, which means that the only way to improve the capacity of a relational database system is by adding more powerful hardware resources on the machine that hosts the relational database system. However, each machine has a hardware limit and, once that hardware limit is reached, there is no way to improve the capacity of the relational database system. They also discuss the fact that horizontal scalability is an alternative solution to overcome this issue with vertical scalability, but that relational databases do not provide good supports for horizontal scalability.
Since there are some limitations of relational databases in terms of storing and processing vast amounts of data, another type of database has become popular; namely the NOSQL database. The NOSQL database is typically designed for dealing with big data situations and uses the horizontal scaling approach to boost database capacity [275].
166
The origin of the NOSQL database can be related to the Bigtable [290], which was developed by Google. The Bigtable is a distributed data storage system and Google uses it to manage their projects’ data, including web indexing, Google Earth and Google Finance. The current user groups of NOSQL databases are increasing rapidly, because NOSQL databases offer high scalability and have the ability to efficiently mine data from vast amounts of datasets. Therefore, many studies have been conducted to investigate this type of database.
Couchbase [291] conducted a survey to investigate the reasons that drive people to migrate from relational databases to NOSQL databases; 49% of the survey respondents cited the rigid schema of RDBMS as the primary reason they decided to migrate from RDBMS to NOSQL, but the inability to scale out data (35%) and the high latency/low performance of RDBMS are also high ranking reasons.
Li and Monoharan [292] carry out a performance comparison study to compare MS- SQL with several NOSQL database products, including RavenDB, MongoDB, CouchDB, Cassandra, Hypertable and Couchbase. They examine the performance of read, write and delete operations between these different database products through using 6 different testing samples (10, 50, 100, 1000, 10000, and 100000). According to their findings, not all the NOSQL databases perform better than the MS-SQL database; the CouchDB and RavenDB show bad performance on all three testing operations. Cassandra shows good performance on the write and delete operations, but does not perform well on the read operation. The Couchbase and MongoDB are the fastest two database products for the read, write and delete operations in all comparison products, and these two databases show better performance than MS-SQL. However, what type of data and queries they used in their performance measurement has not been reported in the paper.
Nyati et al. [293] also conducted a study to compare read and write performance between RDBMS and NOSQL databases. In this study, they use MySQL as the RDBMS implementation and MongoDB as the NOSQL implementation. Based on their evaluation results, the MongoDB outperforms the MySQL database in both the write and read
167
operations. Similarly, they have not described the database design, how they set up the indices on MySQL and what query they used for lookup of the data in their paper.
Many other similar studies [294] [295] [296] have also been conducted in order to compare the performance or basis attributes between RDBMS and NOSQL databases. However, the RDBMS and NOSQL databases are completely different types of database systems, so it is not equitable to compare these two different types of databases since RDBMS and NOSQL focus on different aspects; hardware, software, and data. As Nance et al. [205] point out, the RDBMS will not be replaced by NOSQL, because it is typically suited for those in line of business applications that are supporting business operations. On the other hand, the NOSQL database is better for serving large, public and content centric applications.
The main differences between our study and previous studies are:
1. As mentioned above, many studies of NOSQL have been conducted, but none in astronomy domains with the purpose of investigating the feasibility of using NOSQL databases to store and manage astronomical data. Therefore, this provided the opportunity to perform our study. In this chapter, we will not compare the RDBMS with NOSQL databases for astronomical data management; the main focus in this chapter is discussing the design for mapping the astronomical data into NOSQL databases, presenting some scenarios and examples for mining particular astronomical data from NOSQL, and evaluating the performance of inserting, querying and deleting the astronomical data from the NOSQL database.
2. Most of the existing NOSQL studies has not presented full design details of NOSQL databases; hence, we will use MOA as our case study and present full implementation details of NOSQL databases in this chapter.
168
In addition, all data used in this chapter are based on real astronomical observation data, which are provided by MOA.
6.2 Data Management Approaches: File System vs. Relational Database