We now evaluate the resilience of Recast against an active attacker, i.e., an attacker that knows the metadata describing the system, has access to all its nodes and proactively exploits these abilities. We simulate such a powerful attacker via greedy heuristics which, as mentioned, are suboptimal. We improve them by exploring many partial solutions in the intermediate steps of the leaping attack. In practice, we perform much costlier computations which better approximate the unknown optimal solution. Figure 7.10 shows how the set of blocks to be deleted gets smaller. For the first document archived, increasing from 1 to 500 partial solutions at each step leads to an insignificant improvement (from 4704 to 4662) in the number of documents to be destroyed. Small improvements also apply to the tail of the archive. In contrast, the number of documents to be tampered with in order to censor d2000 drops from 1359 to 601. Nevertheless, it should be noted that to halve the number of documents to be erased we multiplied by 500 the number of solutions that we expand and explore at every step. We also note that for the tail of the archive it is sufficient to explore 10 solutions at each step of the greedy algorithm, while for the rest of the archive, in our simulation, a buffer size of 100 guarantees that the order of magnitude of the collateral damage is stable.
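Keeping a buffer of partial solutions at each step of a greedy heuristic is, in effect, a beam search. The sketch below illustrates that idea on a toy hitting-set problem; it is not the leaping attack itself. The function name `beam_hitting_set`, the `constraints` input and the example instance are hypothetical, chosen only to show how a wider buffer can find a smaller solution than plain greedy (buffer size 1).

```python
from itertools import chain

def beam_hitting_set(constraints, width):
    """Return a small set of items that intersects every constraint.

    Toy stand-in for the attacker's search (an assumption, not Recast's
    algorithm): width=1 is the plain greedy heuristic, larger widths keep
    more partial solutions alive at each step at a higher cost.
    """
    items = sorted(set(chain.from_iterable(constraints)))

    def unhit(solution):
        # Number of constraints the partial solution does not intersect yet.
        return sum(1 for c in constraints if not solution & c)

    beam = [frozenset()]
    while True:
        # All solutions in the beam have the same size, so the first
        # complete one found is the best this search can return.
        done = [s for s in beam if unhit(s) == 0]
        if done:
            return min(done, key=sorted)
        # Expand every partial solution by one more item.
        candidates = {s | {i} for s in beam for i in items if i not in s}
        if not candidates:
            raise ValueError("some constraint cannot be hit")
        # Keep only the `width` most promising partial solutions.
        beam = sorted(candidates, key=lambda s: (unhit(s), sorted(s)))[:width]

# Hypothetical instance on which greedy is suboptimal.
constraints = [{"g", "x"}, {"g", "x"}, {"x"}, {"g", "y"}, {"g", "y"}, {"y"}]
greedy = beam_hitting_set(constraints, width=1)
wider = beam_hitting_set(constraints, width=2)
print(len(greedy), len(wider))
```

On this instance the greedy run first commits to the item that hits the most constraints and ends with three items, whereas a buffer of two partial solutions finds a solution with two. Widening the buffer multiplies the number of candidates expanded at every step, which is the cost trade-off noted above.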
7.7 Summary
Archival systems are designed to support long-term storage of documents. As we turn into a “digital society”, these systems become increasingly important. Archival storage can leverage the same mechanisms that classical data stores use, notably cryptography and redundancy, to protect against common types of attacks. It should however be resilient against more subtle attacks that would threaten the long-term integrity of the archived data, in particular offering strong protection against attackers that covertly tamper with the data or delete specific documents (i.e., censors).
In this chapter, we present Recast, a novel anti-censorship archival system based on random data entanglement [71]. It exploits erasure codes to generate redundant blocks combining content from multiple documents, old and new, in order to protect them from both failures and malicious attacks. As opposed to prior work, Recast allows efficient recursive repair while requiring censors to do an increasingly large amount of work over a large number of storage nodes as the archive scales, which is a highly desirable property.
Recast uses a hybrid strategy for data entanglement designed to offer fast short-term and strong long-term protection to all the documents in the archive. This means that entanglement is performed in such a way that documents become more resilient as they stay longer in the system, and the level of interdependencies makes it quickly impossible to delete or tamper with a single document without causing collateral damage to a large number of other documents.
Chapter 8
Convolutional LDPC codes for distributed storage systems
8.1 Introduction
The requirements of large-scale distributed storage systems are often ill-suited to coding techniques developed in other settings. Erasure codes must have small length to store data objects upon arrival, high code rate to mitigate the storage overhead and low repair locality to cope with a small number of unavailable nodes efficiently.
Our contributions. We study high-rate convolutional LDPC codes of period 1 [120] for immutable distributed storage systems. Convolutional LDPC codes allow efficient local encoding and decoding of data objects using small constituent codes, while improving the reliability of these constituent codes by allowing message passing algorithms when local decoding is insufficient. Our construction sequentially archives data objects, each consisting of a single codeword. Each data object has s data blocks and is entangled with t ≜ s + p blocks already archived, together with which the code generates p parity blocks. Aided by mathematical analysis to avoid bad structures, we perform an exhaustive search and provide the best constructions for 1 ≤ s ≤ 5 and p = 2. As an example, the optimal bSTEP(5, 2) is more reliable than an rs(20,8) code, with a complexity comparable to that of the rs(10,4) codes used by Facebook [121].
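To put the comparison above on a common footing, the following sketch computes the code rate and storage overhead of the three codes mentioned. It assumes that rs(k, m) denotes a Reed-Solomon code with k data blocks and m parity blocks, and that a bSTEP(s, p) data object stores s data blocks plus p parity blocks; both readings are assumptions, and the helper names are illustrative.

```python
from fractions import Fraction

def code_rate(data_blocks, parity_blocks):
    """Fraction of stored blocks that carry data (assumed: data / (data + parity))."""
    return Fraction(data_blocks, data_blocks + parity_blocks)

def storage_overhead(data_blocks, parity_blocks):
    """Stored blocks per data block, i.e. the inverse of the code rate."""
    return 1 / code_rate(data_blocks, parity_blocks)

# (data blocks, parity blocks) under the assumed reading of the notation.
codes = {"bSTEP(5, 2)": (5, 2), "rs(10,4)": (10, 4), "rs(20,8)": (20, 8)}
for name, (k, m) in codes.items():
    print(name, code_rate(k, m), float(storage_overhead(k, m)))
```

Under these assumptions all three codes have rate 5/7, i.e. a storage overhead of 1.4, so the comparison is between codes with the same storage cost.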
State of the art. The performance of LDPC codes in the small length regime has not been ascertained. The finite-length analysis of LDPC codes on the binary erasure channel (BEC), presented in Section 4.5.3, provides ensemble results for the error probability together with a combinatorial characterization of decoding failures via message passing. Although the behaviour of individual codes is likely to concentrate around the ensemble average, the performance evaluation of code instances remains open. Moreover, degree distributions tailored to achieve capacity asymptotically lead to a significant decoding overhead factor when used with small lengths, that is, the number of blocks required to reconstruct the data is greater than the code dimension, see Section 4.5.1. In addition, Reed-Solomon codes outperform LDPC codes in this regime when the download speed is slow. Small LDPC codes minimizing the decoding overhead are proposed in [122], but their performance in the presence of erasures is not analysed. The reliability of low repair bandwidth LDPC codes of length n ≥ 60 is studied in [22], see Section 4.5.2, and an archival storage system based on Tornado codes of length 96 and rate 1/2 is implemented in [70], see Section 4.5.5.
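Message passing on the BEC can be pictured as a peeling process: any parity check with exactly one erased symbol recovers that symbol, and decoding fails when no such check remains. The sketch below is a minimal illustration of this; the function name `peel` is hypothetical, and the (7,4) Hamming code is used only as a small, well-known example, not as one of the codes studied in this chapter.

```python
def peel(checks, erased):
    """Iteratively resolve erasures on the BEC; return those left unresolved.

    `checks` lists, for each parity check, the indices of the symbols it
    involves. A non-empty result is a set of erasures on which the peeling
    process is stuck.
    """
    erased = set(erased)
    progress = True
    while erased and progress:
        progress = False
        for check in checks:
            missing = [v for v in check if v in erased]
            if len(missing) == 1:
                # A check with a single erased symbol determines that symbol.
                erased.discard(missing[0])
                progress = True
    return erased

# Parity checks of the (7,4) Hamming code (0-indexed symbol positions).
HAMMING_CHECKS = [[0, 2, 4, 6], [1, 2, 5, 6], [3, 4, 5, 6]]

print(peel(HAMMING_CHECKS, {0, 1}))     # every erasure is resolved
print(peel(HAMMING_CHECKS, {2, 4, 6}))  # every check sees at least two erasures
```

In the second call every check involves two or three erased symbols, so no symbol can be recovered and the decoder stops with all three erasures unresolved. Patterns of this kind are what the combinatorial characterization of decoding failures describes.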
Ensemble-wise, coupling LDPC codes increases the decoding threshold to the maximum possible on the BEC [123]. On the AWGN channel, such a gain of convolutional LDPC codes over the underlying LDPC block code is verified via simulations for n ≥ 145 [124]. Implementation aspects of convolutional LDPC codes, e.g., systematic encoding and code termination, are discussed in [125], but besides an instance of length 128, simulations only cover codes with n ≥ 512. Short to moderate length convolutional LDPC codes based on quasi-cyclic LDPC codes are built in [126], but again only with three-digit code lengths.
Outline. The rest of the chapter is organised as follows. In Sections 8.2 and 8.3 we respectively introduce bSTEP(s, p) and define its structure. In Section 8.4 we compute the minimal erasures which we use for the reliability analysis of Section 8.5. We discuss extensions in Section 8.6 and conclude in Section 8.7.