
The economic consequences of applying data compression may vary depending on circumstances. To estimate them for XML databases across a possibly large number of cases, the following procedure was established.

First, a number of experimental results were gathered, including compression ratios and query execution times for a number of XML compression methods. Instead of measuring them in a dedicated test environment, results published earlier by this author [8] were used.

Second, the following economic values were calculated: the gain due to reduced media usage (for different types of storage media), the gain due to reduced query execution time (for different numbers of queries), and the overall economic effect of applying data compression (the sum of the two previous values for a chosen set of parameters).

Table 1 contains compression results measured on the Lineitem test file [7], which contains business order line items from the 10 MB version of the TPC-H database benchmark, for two general-purpose compression methods (Deflate as implemented in gzip and LZMA as implemented in 7-zip) and seven XML-conscious compressors described in section 2. The “C.ratio” row shows compression ratios in output bits per input character, the “D.time” row shows decompression times of the entire XML document in seconds, and the “Q.time” row shows average query execution times in seconds. The three queries over which the average was calculated were (in XPath notation): “/table/T/L_TAX”, “/table/T/L_COMMENT”, “/table/T/[L_COMMENT= "slowly"]”. For the compressors without query support (all but XBzip Index and QXT), the query execution times were obtained by summing the decompression time and the query time on decompressed data using the sgrep utility.

Table 1. XML compression test results.

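| Measure | Deflate | LZMA | XMill | XML PPM | SCM PPM | XBzip | XBzip Index | Deflate QXT | LZMA QXT |
|---|---|---|---|---|---|---|---|---|---|
| C.ratio (bits/char) | 0.721 | 0.461 | 0.380 | 0.273 | 0.244 | 0.248 | 0.332 | 0.285 | 0.245 |
| D.time (s) | 0.203 | 0.829 | 0.219 | 3.188 | 13.157 | 4.512 | 6.259 | 1.120 | 1.340 |
| Q.time (s) | 0.870 | 1.496 | 0.886 | 3.855 | 13.824 | 5.179 | 0.055 | 0.097 | 0.099 |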

Average query time on uncompressed data was 0.667 s. Source: [8].
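The Q.time values reported for the compressors without query support can be re-derived from Table 1: each should equal the decompression time plus the 0.667 average query time on uncompressed data. The following Python check is a minimal sketch with the values copied from the table; the variable and function names are illustrative only.

```python
# Values copied from Table 1 (seconds); all names here are illustrative.
UNCOMPRESSED_QUERY_TIME = 0.667

# Decompression times of the compressors without query support.
D_TIME = {
    "Deflate": 0.203,
    "LZMA": 0.829,
    "XMill": 0.219,
    "XML PPM": 3.188,
    "SCM PPM": 13.157,
    "XBzip": 4.512,
}

# Query times reported in Table 1 for the same compressors.
Q_TIME_REPORTED = {
    "Deflate": 0.870,
    "LZMA": 1.496,
    "XMill": 0.886,
    "XML PPM": 3.855,
    "SCM PPM": 13.824,
    "XBzip": 5.179,
}


def derived_query_time(d_time: float) -> float:
    """Decompress the whole document, then query the uncompressed data."""
    return round(d_time + UNCOMPRESSED_QUERY_TIME, 3)


for name, d_time in D_TIME.items():
    derived = derived_query_time(d_time)
    print(f"{name:8s} derived={derived:.3f} reported={Q_TIME_REPORTED[name]:.3f}")
```

For XBzip Index and both QXT modes the reported query time is far below the decompression time, which reflects that these compressors answer queries without decompressing the entire document.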

The gain due to reduced media usage was calculated using formula (1) for the following types of storage: Solid State Drive (MLC Flash), Hard Disk Drive (7200 rpm), and Hard Disk Drive (15000 rpm). For each of these device types, the lowest cost per megabyte found using an Internet price comparison tool on 2010-09-22 was assumed as the purchase cost, and the cost of ownership during the device lifetime was estimated at four times the purchase cost. Figure 1 shows the gain in Polish zlotys achieved per gigabyte of stored XML data.

[Figure 1 plot. Storage types: SSD, HDD 7.2, HDD 15. Axis ticks: 0 to 35. Series: Deflate, LZMA, XMill, XML PPM, SCM PPM, XBzip, XBzip Index, QXT Deflate, QXT LZMA.]

Figure 1. Gain due to reduced storage media usage after compression.

Although there are significant differences in the compression ratios achieved by different methods, the economic outcome of applying them, in terms of storage space savings, varies by less than 10%. As could be expected, there are huge differences depending on the type of storage used. Applying compression to data stored on cheap 7200 rpm SATA disks seems to be the least justified when only the storage space savings are taken into consideration.
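The narrow spread in storage savings can be reproduced from Table 1 if one assumes that each input character occupies 8 bits, so that a ratio of r output bits per character leaves r/8 of the original size. That assumption and the helper names below are not taken from the source; the sketch only illustrates why very different ratios yield similar savings.

```python
# Compression ratios from Table 1, in output bits per input character.
C_RATIO = {
    "Deflate": 0.721,
    "LZMA": 0.461,
    "XMill": 0.380,
    "XML PPM": 0.273,
    "SCM PPM": 0.244,
    "XBzip": 0.248,
    "XBzip Index": 0.332,
    "Deflate QXT": 0.285,
    "LZMA QXT": 0.245,
}

BITS_PER_INPUT_CHAR = 8  # assumption: one byte per input character


def saved_fraction(ratio: float) -> float:
    """Fraction of the original storage space freed by compression."""
    return 1 - ratio / BITS_PER_INPUT_CHAR


savings = {name: saved_fraction(ratio) for name, ratio in C_RATIO.items()}
best = max(savings, key=savings.get)
worst = min(savings, key=savings.get)

# Relative difference between the largest and the smallest saving.
relative_spread = (savings[best] - savings[worst]) / savings[worst]

print(f"best:  {best} saves {savings[best]:.1%}")
print(f"worst: {worst} saves {savings[worst]:.1%}")
print(f"relative spread: {relative_spread:.1%}")
```

Under this assumption every method frees more than nine tenths of the space, so the monetary gain is driven mainly by the price of the storage medium rather than by the choice of compressor.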

The gain due to reduced query execution time (per single query) was calculated using formula (2) only for the Hard Disk Drive (7200 rpm), as only such measurements were available from the quoted source. Figure 2 shows the relation between the average user time value and the achieved gain (or loss).

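[Figure 2 plot. Horizontal axis ticks (average user time value): 0.01, 0.02, 0.04, 0.05, 0.1, 0.2, 0.4, 0.5, 1, 2, 4. Vertical axis ticks (gain): -15 to 5. Series: Deflate, LZMA, XMill, XML PPM, SCM PPM, XBzip, XBzip Index, QXT.]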

Figure 2. Gain due to reduced data access time usage after compression.

As one can see, only the query-supporting XML compressors are economically justified if just the value of data access time is considered. Both versions of QXT (shown as a single line due to minuscule differences between the Deflate and LZMA modes) and XBzip Index provide very similar savings. Applying the PPM-based methods without query support may result in high costs due to a significant increase in data access delays.
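Formula (2) is not reproduced in this section, so the following is only a sketch under the assumption that the gain per query is proportional to the time saved relative to the 0.667 s query on uncompressed data. The function names and the `time_value` parameter (a hypothetical monetary value of one second of user time) are illustrative and not taken from the source.

```python
UNCOMPRESSED_QUERY_TIME = 0.667  # seconds, from the note under Table 1

# Average query times from Table 1, in seconds.
Q_TIME = {
    "Deflate": 0.870,
    "LZMA": 1.496,
    "XMill": 0.886,
    "XML PPM": 3.855,
    "SCM PPM": 13.824,
    "XBzip": 5.179,
    "XBzip Index": 0.055,
    "Deflate QXT": 0.097,
    "LZMA QXT": 0.099,
}


def time_saved_per_query(q_time: float) -> float:
    """Seconds saved (positive) or lost (negative) per query."""
    return UNCOMPRESSED_QUERY_TIME - q_time


def gain_per_query(q_time: float, time_value: float) -> float:
    """Assumed gain: time saved multiplied by a hypothetical value of one second."""
    return time_saved_per_query(q_time) * time_value


# Compressors that make queries faster than on uncompressed data.
faster = sorted(name for name, t in Q_TIME.items() if time_saved_per_query(t) > 0)
print(faster)

for name, t in Q_TIME.items():
    print(f"{name:12s} {time_saved_per_query(t):+8.3f} s per query")
```

Whatever positive time value is chosen, the sign of the gain depends only on the time difference, which is why only XBzip Index and the two QXT modes end up on the gain side.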

The overall economic effect of applying data compression was also calculated only for the Hard Disk Drive (7200 rpm), assuming a database size of 100 GB. Figure 3 shows the relation between the total user time value (for all the data access operations during the entire device lifetime) and the overall gain (or loss), in Polish zlotys, achieved by applying compression to the XML database.

[Figure 3 plot. Horizontal axis ticks (total user time value): 1000, 2000, 5000, 10000, 20000, 40000, 60000, 80000. Vertical axis ticks (overall gain): -100000 to 60000. Series: Deflate, LZMA, XMill, XML PPM, SCM PPM, XBzip, XBzip Index, QXT.]

Figure 3. Overall gain due to applying compression.

In the case of cheap SATA drives, the savings due to reduced database size dominate over the effect of compression on access time only for very rarely accessed databases (or when a very low value of time is assumed). For the values shown in the figure, the storage savings play an insignificant role and, as a result, the two query-supporting compressors (XBzip Index and QXT) once again seem to be the best choice.
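The trade-off between the two components can be sketched as their sum. Formulas (1) and (2) and the actual prices are not reproduced in this section, so the cost model below is an assumption made for illustration: 8 bits per input character, a lifetime cost equal to the purchase cost plus an ownership cost of four times the purchase cost, and per-query times taken unchanged from Table 1. The price, time value, and query count in the example are hypothetical.

```python
BITS_PER_INPUT_CHAR = 8          # assumption: one byte per input character
OWNERSHIP_FACTOR = 4             # ownership cost as a multiple of purchase cost
UNCOMPRESSED_QUERY_TIME = 0.667  # seconds, from the note under Table 1


def storage_gain(size_gb: float, c_ratio: float, price_per_gb: float) -> float:
    """Assumed lifetime cost of the storage space freed by compression."""
    saved_gb = size_gb * (1 - c_ratio / BITS_PER_INPUT_CHAR)
    lifetime_cost_per_gb = price_per_gb * (1 + OWNERSHIP_FACTOR)
    return saved_gb * lifetime_cost_per_gb


def access_gain(q_time: float, n_queries: int, time_value: float) -> float:
    """Assumed value of the time saved (or lost) over all queries."""
    return (UNCOMPRESSED_QUERY_TIME - q_time) * n_queries * time_value


def overall_gain(size_gb, c_ratio, q_time, price_per_gb, n_queries, time_value):
    """Sum of the storage component and the access-time component."""
    return (storage_gain(size_gb, c_ratio, price_per_gb)
            + access_gain(q_time, n_queries, time_value))


# Hypothetical parameters: 100 GB database, price 0.30 per GB,
# one million queries, time valued at 0.01 per second.
params = dict(size_gb=100, price_per_gb=0.30, n_queries=1_000_000, time_value=0.01)

xmill = overall_gain(c_ratio=0.380, q_time=0.886, **params)
qxt_lzma = overall_gain(c_ratio=0.245, q_time=0.099, **params)

print(f"XMill:    {xmill:10.2f}")
print(f"LZMA QXT: {qxt_lzma:10.2f}")
```

With these illustrative numbers the storage component is small compared with the access-time component, so a compressor that slows queries ends with an overall loss while a query-supporting one ends with a gain. Setting `n_queries` to zero leaves only the storage gain, which corresponds to the very rarely accessed database mentioned above.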

5. CONCLUSIONS

Applying data compression to XML databases results in a measurable economic effect, coming from reduced storage capacity requirements and modified data access time (and the resulting work delays). The effect of the former is always positive (provided the compression is effective), yet its value is very small unless very expensive storage media are used or the compressed database is huge. The effect of the latter can be positive only if queries over compressed data can be executed without decompressing the entire XML file. Applying compression methods that offer this functionality, such as QXT or XBzip Index, may produce significant savings.


REFERENCES

[1] Adiego J., de la Fuente P., Navarro G. (2007) Using Structural Contexts to Compress Semistructured Text Collections, Information Processing and Management, 3 (May), 769-790.

[2] Cheney J. (2001) Compressing XML with multiplexed hierarchical PPM models, Proceedings of the IEEE Data Compression Conference, Snowbird, UT, USA, 163-172.

[3] Ferragina P., Luccio F., Manzini G., Muthukrishnan S. (2006) Compressing and Searching XML Data Via Two Zips, Proceedings of the International World Wide Web Conference, Edinburgh, Scotland, 751-760.

[4] Liefke H., Suciu D. (2000) XMill: an efficient compressor for XML data, Proceedings of the 19th ACM SIGMOD International Conference on Management of Data, Dallas, TX, USA, 153-164.

[5] Sakr S. (2009) XML Compression Techniques: A Survey and Comparison, Journal of Computer and System Sciences 75(5), 303-322.

[6] Sassone P. G. (1988) Cost benefit analysis of information systems: a survey of methodologies, Proceedings of the ACM SIGOIS and IEEECS TC-OA 1988 conference on office information systems, ACM, Palo Alto, CA, USA, 126-133.

[7] Skibiński P., Grabowski S., Swacha J. (2008) Effective asymmetric XML compression, Software – Practice and Experience 38(10), 1027-1047.

[8] Skibiński P., Swacha J. (2007) Combining Efficient XML Compression with Query Processing, Lecture Notes in Computer Science 4690, 330-342.

[9] Skibiński P., Swacha J., Grabowski S. (2008) A Highly Efficient XML Compression Scheme for the Web, SOFSEM 2008: Theory and Practice of Computer Science, Lecture Notes in Computer Science, 4910, Springer, Berlin-Heidelberg, Germany, 766-777.

[10] Solomon D. (2007) Data compression. The Complete Reference, Springer-Verlag, London, England.

[11] Swacha J. (2006) Usprawnienie systemów informatycznych poprzez użycie kompresji danych, Kisielnicki J. [ed.]: Informatyka w globalnym świecie, Wydawnictwo PJWSTK, Warszawa, Poland, 364-370.

[12] Swacha J. (2007) Estimating the cost of computer system working time, Proceedings of the IXth International Conference CADSM 2007, Publishing House of Lviv Politechnic National University, Lviv, Ukraine, 204-205.

[13] Swacha J. (2010) Miary efektywności ekonomicznej przechowywania danych, Współczesne aspekty informacji, t. II, Monografie i Opracowania 570, Oficyna Wydawnicza SGH, Warszawa, Poland, 421-430.

[14] XML-Based Standards: Document Object Model, Soap, Sharable Content Object Ref-
