Erasure Codes for Reading and Writing
(Technical Report tr-ri-06-274)

Christian Schindelhauer∗    Mario Vodisek†

July 4, 2006

Abstract. Erasure resilient coding has indisputable importance in many application areas. In particular, for storage area networks one of the merits of erasure codes is that not all disks have to be read. Using the best available disks improves the access speed beyond the effect of the parallel use of disks. Still, for writing, all disks need to be updated, an effect known as the write-penalty. In this paper, we present a novel approach which overcomes the write-penalty. Our Read-Write-Codes can decode the information of n symbols from any r symbols and write to any w symbols of the m encoded symbols. First, we show a lower bound of r + w ≥ n + m for all RW-codes. Then, we show that all so-called perfect RW-codes with n + m = r + w exist if the symbol alphabet is large enough. For this, we present an efficient deterministic method based on matrix operations over finite fields. This matrix-based RW-code features additional properties regarding safety and security. For safety, it can reconstruct m − r missing symbols, and it can detect and repair ℓ faulty code symbols if m!(r+ℓ)! / ((m−ℓ)! r!) < 1/2. For security, m − w symbols can be given to an all-powerful adversary without revealing any information about the encoded information vector. As a surplus, these matrix-based RW-codes can be implemented as so-called Chameleon-Codes. There, the RW-code parameters can be adapted during runtime, i.e. changed from (n, r, w, m) to (nearly) any choice of (n′, r′, w′, m′) with n′ + m′ = r′ + w′ by reading only r symbols and changing w′ symbols. At last, we concentrate on the Boolean alphabet. We show that the only non-trivial perfect Boolean RW-codes satisfy n + 1 = r = w = m − 1. However, if we consider more general RW-codes with r + w > n + m, then Boolean RW-codes are again available.

1 Introduction

Erasure (resilient) codes map a word of n symbols over an alphabet Σ into an encoded word y of a total of m > n symbols over the same alphabet Σ. In such a code, it is possible to reconstruct the original n-symbol word from any set of r symbols of the m-symbol code word, where n ≤ r < m. The ratio n/m is called the rate, and m/n is denoted the stretch factor of the code. These codes have been a major research topic in recent years, with important applications. As an example, we refer to distributed storage systems, like large-scale RAID arrays or storage area networks (SANs), in which high system reliability, i.e. a high ability to retrieve data blocks despite failed disk drives and hence lost data blocks, space efficiency, and modification handling are of paramount concern. System reliability is a crucial feature, since the probability of a failure occurring within a collection of disks increases with its size (see [22, 9]). Furthermore, the focus on space efficiency and modification handling results from the fact that such environments are highly read and write intensive, and the devices on which the data are stored are orders of magnitude slower than any memory access, leading to comparatively high I/O costs. Thus, the amount of data that is involved in an I/O operation and that needs to be rewritten to update the stored blocks according to the changed information becomes very important. An appropriate erasure-resilient encoding scheme should be able to cope with all these problems. Most of the recent research in this area focuses on either quick encoding and decoding or on optimizing the recovery feasibility. Until now, there is no encoding scheme known to us that solves all the mentioned problems appropriately.

∗ Computer Networks and Telematics, University of Freiburg, [email protected]
† Heinz Nixdorf Institute and Computer Science Department, University of Paderborn, Germany, [email protected]


Contents   Code          Line
x1 x2      y1 y2 y3 y4   v
0  0       0  0  0  0    0
0  0       1  1  1  1    1
0  1       0  1  0  1    0
0  1       1  0  1  0    1
1  0       0  0  1  1    0
1  0       1  1  0  0    1
1  1       0  1  1  0    0
1  1       1  0  0  1    1

Table 1: A (2, 3, 3, 4)2-Read-Write-Code for contents x1, x2 and code y1, y2, y3, y4. Every information vector has two possible code words. Even if only three of the four code symbols are available for reading and writing, the system can perform read and write operations.

The easiest example of an erasure code is known as RAID 4. Consider m = 4 hard disks storing a data file bit by bit, Σ = {0, 1}. We encode n = 3 bits x1, x2, x3 to y1 = x1, y2 = x2, y3 = x3, and y4 = x1 + x2 + x3, where addition denotes the XOR operation. This code is resilient against the erasure of one bit, since it is possible to recover the original three bits from any combination of r = 3 hard disks; e.g. given y2, y3, y4 we have x1 = y2 + y3 + y4, x2 = y2, and x3 = y3. RAID 4 allows reading the information from the three fastest disks. Furthermore, if one disk is temporarily not available, reading data is still possible. However, writing data is then not possible, since changing the original information involves changing the entire code. In this paper, we overcome this limitation by introducing general erasure codes for reading and writing.

The second example shows such an RW-code. Again, consider m = 4 hard disks with bits y1, y2, y3, y4. Here, n = 2 bits x1, x2 are encoded such that any r = 3 code bits yi, yj, yk can be used to read the original message, and only some w = 3 code bits yi′, yj′, yk′ need to be changed to encode new information. E.g., start with code (0, 1, 1, ?). According to Table 1, the information is (1, 1) and therefore the complete code is (0, 1, 1, 0). Now, we want to encode (0, 1) without changing the second entry. So, we choose line 0 for information (0, 1) and get code (0, 1, 0, 1). This example shows that such Read-Write-Codes exist.
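The example above can be replayed mechanically. The following sketch is our illustration, not part of the paper: it generates the rows of Table 1 from the relations y1 = v, y2 = x2 + v, y3 = x1 + v, y4 = x1 + x2 + v (mod 2), which reproduce the table, and re-runs the worked read and write steps (positions are 0-based here).

```python
# Sketch: the (2,3,3,4)_2 Read-Write-Code of Table 1, generated from
# y1 = v, y2 = x2+v, y3 = x1+v, y4 = x1+x2+v over GF(2).

def encode(x1, x2, v):
    """Code word (y1, y2, y3, y4) for contents (x1, x2) and line v."""
    return (v, x2 ^ v, x1 ^ v, x1 ^ x2 ^ v)

def read(known):
    """Recover (x1, x2) from any r = 3 known positions (dict pos -> bit)."""
    for x1 in (0, 1):
        for x2 in (0, 1):
            for v in (0, 1):
                y = encode(x1, x2, v)
                if all(y[i] == s for i, s in known.items()):
                    return (x1, x2)

def write(y, fixed_pos, new_x1, new_x2):
    """Re-encode new contents while leaving position fixed_pos untouched,
    i.e. changing only w = 3 of the four symbols."""
    for v in (0, 1):
        cand = encode(new_x1, new_x2, v)
        if cand[fixed_pos] == y[fixed_pos]:
            return cand

# Worked example from the text: code (0,1,1,0) holds contents (1,1);
# re-encoding (0,1) without touching y2 yields (0,1,0,1).
print(read({0: 0, 1: 1, 2: 1}))   # -> (1, 1)
print(write((0, 1, 1, 0), 1, 0, 1))   # -> (0, 1, 0, 1)
```

The line variable v plays the role of the slack variable introduced later in the matrix-based construction: it gives every information vector two code words, which is what makes a partial write possible.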
From now on, we call an (n, r, w, m)b-Read-Write-Code a coding system with an n-symbol message and an m-symbol code, where the message can be computed from any r code symbols, and any w code symbols suffice to be changed (written) to encode a new message, where each symbol is chosen from a b-symbol alphabet. This paper introduces such coding systems for the first time. It is structured as follows. In the next section, we give a brief review of erasure-resilient coding systems and codings used in distributed storage networks. Then, we state the problem formally and give general bounds for the parameters of Read-Write-Codes. We show when bits can be used as symbols, and present a general scheme to produce any (n, r, w, m)b-Read-Write-Code with n + m ≤ r + w for an appropriate choice of b. After that, we consider Chameleon-Codes, where even these parameters can be changed. Then, we investigate for which choices of (n, r, w, m) a coding system exists over the binary alphabet, i.e. b = 2. Furthermore, we discuss how Read-Write-Codes can be combined, and we conclude the paper with possible applications for these codes in the fields of peer-to-peer networks and storage area networks.

2 Related Work

Erasure coding is a technique which has widely been employed for achieving high availability and reliability in storage and communication systems. For instance, in network communication, for which it was originally developed, erasure coding is generally applied to provide reliability of data transmission for a variety of data delivery applications (see e.g. [1, 6, 8, 21, 29]). Conventional erasure codes have a fixed rate parameter r = n/m < 1 which specifies the fraction of the encoded output blocks required to reconstruct the original document. Furthermore, the storage consumption increases by a factor of 1/r.

Optimal erasure codes such as parity-based encoding schemes, like RAID [22], the EVENODD layout [4], or Reed-Solomon encoding [27], only require any n of the m encoded symbols to recover the original document. In such codes, the encoding normally consists of the n original and m − n (> 0) repair symbols, where the m encoded symbols are stored separately. For instance, Reed-Solomon coding has been utilized efficiently in distributed storage systems [28, 17] and disk arrays [5, 24], because such codes have optimal capability of recovery from data loss and hence offer optimal efficiency, such that any parity symbol can be substituted for any erased data symbol in the block. As a drawback, Reed-Solomon codes suffer from high computational costs, since the parity-check information is generated by arithmetic operations over finite fields [23], leading to computational costs depending on the size of the underlying finite field. Because of these costs, often near-optimal codes are used, which require (1 + ε)n encoding symbols to recover the original message, where ε > 0 is a parameter, and the cost of a smaller ε is increased computation. Some of those codes have already been utilized for distributed storage applications (see e.g. [9, 10, 13, 24, 25]). The most popular instances of linear-time, near-optimal codes are Tornado codes [19], which provide probabilistic erasure correction based on sparse bipartite graphs to speed up encoding and decoding. In recent years, linear-time, near-optimal erasure codes have been utilized in modern distributed systems, like peer-to-peer content distribution networks (P2P-CDNs), to increase system reliability as well as system performance, achieving high-bandwidth input and output [6, 7, 8, 14, 25, 30, 31]. Such systems suffer from high fluctuation of peers and therefore cannot guarantee long-term system stability, because they mostly keep large files whose transfer times often exceed the average uptime of a single source node. Hence, ensuring high data availability is the main concern [17, 22].

Since Tornado codes cannot extend a fixed encoding to a potentially unlimited number of encoding symbols on-the-fly if the demand arises, rateless codes [18, 32, 25] have been developed. Such codes adapt to any rate with optimal asymptotic performance, which makes them highly attractive in the field of communications and P2P-CDNs (see [16, 20]). Rateless codes are probabilistic codes that fail to recover the encoding with some certain probability; e.g., LT codes [18] fail with constant probability δ, which is a parameter. In contrast to P2P-CDNs, SANs or RAID arrays rather apply XOR-based schemes such as RAID or EVENODD, and (MDS) codes [12], like Reed-Solomon codes. XOR-based schemes are compact, simple, and efficient techniques that focus on both aggregating disk performance by employing parallel access and guaranteeing sufficient reliability to tolerate up to a limited number of disk failures, all at a low cost.

Distributed storage systems are highly read and write intensive. Hence, the update complexity, i.e. the number of parity symbols that must be rewritten when the information has changed, is of great concern. As stated in [2], in erasure code-based schemes used for distributed storage, the encoding is generated as a function of the original symbols, which implies that with a RAID or Reed-Solomon encoding, on every write to a data block all its associated parity blocks also have to be updated. This leads to inconvenient I/Os, because the devices on which the data are stored are orders of magnitude slower than any memory access. Furthermore, the write procedure is no longer atomic, since writing to multiple disks is implied. Potential concurrent writes to blocks that are associated with the same parity block may then induce an inconsistency. Inconsistencies can be reduced, e.g., when only a small fraction of all parity symbols must be rewritten. For instance, in [2] X-codes, which are (n, n − 2) MDS array codes [3] (n − 2 data symbols generate an encoding of n symbols, and any n − 2 encoded symbols are sufficient to recover the original message), are applied to model a fault-tolerant scheme, but one which is solely limited to recovering from any two simultaneous erasures.

3 Model

Read-Write-Codes encode information words into code words. The information is given by an n-tuple over a finite alphabet Σ. The code is an m-tuple over the same alphabet Σ. Let b = |Σ|. P(M) denotes the power set of the set M, let Pk(M) := {S ∈ P(M) | |S| = k}, and let [m] := {1, 2, . . . , m}. An (n, r, w, m)b-Read-Write-Coding-System (RWC) consists of the following elements.

1. Initial state X0 ∈ Σ^n, Y0 ∈ Σ^m. This is the initial state of the system with information X0 and code Y0.

2. Read function f : Pr([m]) × Σ^r → Σ^n. This function reconstructs the information by reading r symbols of the code word with known positions of the symbols. The first parameter gives the positions of the symbols in the code, the second parameter gives the corresponding code symbols. The outcome is the decoded information.

3a. Write function g : Pw([m]) × Σ^w × Σ^n × Σ^n → Σ^w. This function adapts the code word to a changed information. For this, it changes w symbols of the code word at given positions. The first parameter gives which code symbols may be changed, the second parameter describes the original symbols of the code word at these positions. Then, we have the original information and the new information as parameters. The outcome are the values of the new w code symbols encoding the information.

3b. Differential write function δ : Pw([m]) × Σ^n → Σ^w. This is a restricted alternative to the write function which has as parameters the positions S of symbols available for writing and the difference of the original and the new information. The result is the difference of the available old code word symbols and the new code word symbols. This means, for two functions ∆1 : Σ^n × Σ^n → Σ^n and ∆2 : Σ^w × Σ^w → Σ^w, that the write function g can be described by the differential write function via

Y′ = ∆2(Y, δ(S, ∆1(X, X′))) .

All RW-codes presented here have such differential write functions, where ∆1, ∆2 are the bit-wise XOR operations.

For a tuple X = x1, . . . , xn and a subset S of [n], let CHOOSE(S, X) be the tuple (xi1, xi2, . . . , xik) where i1, . . . , ik are the ordered elements of S. Furthermore, for an m-tuple X, a subset S of [m] of size k, and a k-tuple Y, let SUBST(S, X, Y) be the tuple where, according to S, each indexed element of X is replaced with the corresponding element of Y, such that CHOOSE(S, SUBST(S, X, Y)) = Y and all other elements of X remain unchanged in the outcome. Now, define the read operation

Read(S, Y) := f(S, CHOOSE(S, Y))

and the write operation

Write(S, S′, X, Y) := SUBST(S, Y, g(S, CHOOSE(S, Y), Read(S′, Y), X)) .

Define the set of possible codes C as the transitive closure of the function Y 7→ Write(S, S′, X, Y), starting with Y = Y0 and allowing all values of S, S′, X. An RWC is correct if the following statements are satisfied.
1. Correctness of the initial state: ∀S ∈ Pr([m]): Read(S, Y0) = X0.

2. Consistency of the read operation: ∀S, S′ ∈ Pr([m]), ∀Y ∈ C: Read(S, Y) = Read(S′, Y).

3. Correctness of the write operation: ∀S ∈ Pw([m]), ∀S′, S″ ∈ Pr([m]), ∀Y ∈ C, ∀X ∈ Σ^n: Read(S′, Write(S, S″, X, Y)) = X.
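The operations CHOOSE, SUBST, Read and Write can be transcribed directly into executable form. The following sketch is our illustration (0-based positions instead of the text's [m]; the concrete functions f and g stay abstract and are supplied by a particular code):

```python
# Sketch of the model's primitive operations; f and g are placeholders
# to be instantiated by a concrete (n, r, w, m)_b code.

def choose(S, X):
    """CHOOSE(S, X): the sub-tuple of X at the ordered positions in S."""
    return tuple(X[i] for i in sorted(S))

def subst(S, X, Y):
    """SUBST(S, X, Y): replace the entries of X indexed by S with Y,
    so that choose(S, subst(S, X, Y)) == Y."""
    X = list(X)
    for i, y in zip(sorted(S), Y):
        X[i] = y
    return tuple(X)

def make_read(f):
    # Read(S, Y) := f(S, CHOOSE(S, Y))
    return lambda S, Y: f(S, choose(S, Y))

def make_write(f, g):
    read = make_read(f)
    # Write(S, S', X, Y) := SUBST(S, Y, g(S, CHOOSE(S, Y), Read(S', Y), X))
    return lambda S, S2, X, Y: subst(S, Y, g(S, choose(S, Y), read(S2, Y), X))

# The defining invariant of SUBST:
print(choose({0, 2}, subst({0, 2}, (0, 0, 0), (7, 9))))   # -> (7, 9)
```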

4 Lower Bounds

The example of a (2, 3, 3, 4)2-RW-code stores two symbols of information in a four-symbol code. This storage overhead of a factor of two is unavoidable, as the following theorem shows that neither a (3, 3, 3, 4)b-RWC nor a (2, 3, 3, 3)b-RWC exists.

Theorem 1 For r + w < n + m, or r < n, or w < n, and any base b, there does not exist an (n, r, w, m)b-RWC.

Proof: Consider a write operation with a subsequent read operation where the index set W of the write operation (|W| = w) and the index set R of the read operation (|R| = r) have a minimum intersection: W ∩ R = S with |S| = r + w − m. There are b^n possible change vectors that need to be encoded by the write operation into these intersection symbols, since they are the only base of information for the read operation. The reason is that all R \ S code symbols

remain unchanged. Now, assume that |S| < n. Then at most b^(n−1) possible changes can be encoded, and therefore the read operation will produce faulty outputs for some write operations. Thus, r + w − m ≥ n, and the claim follows. If r < n, then only b^r different messages can be distinguished, while b^n different messages exist. From the pigeonhole principle it follows that such a code does not exist. Analogously, the case w < n can be excluded using the pigeonhole principle. □

So, in the best case, (n, r, w, m)b RW-codes have parameters r + w = n + m. We call such RWCs perfect. Such perfect RW-codes do not always exist.

Lemma 1 There is no (1, 2, 2, 3)2-RWC.

Proof: Consider the read operation on y1, y2 and the write operation on y2, y3. Then y2 is the only intersecting bit, and it has to be inverted if the information vector changes. The same holds for the read operation on y1, y3, the write operation on y2, y3, and bit y3. So, y2 and y3 have to be inverted whenever the information vector changes. Now, consider a sequence of three write operations on the bit positions (1, 2), (2, 3), (1, 3), each inverting the information bit x1. After these operations, each code bit has been inverted twice, bringing it back to its original state. The information bit has been inverted thrice and is thus inverted. So, all read operations lead to wrong results. □

Yet, if we allow a larger symbol alphabet, we can provide an RWC.

Lemma 2 There exists a (1, 2, 2, 3)3-RWC.

Contents   Code       Line
x          y1 y2 y3   v
0          0  0  0    0
0          1  1  1    1
0          2  2  2    2
1          0  1  2    0
1          1  2  0    1
1          2  0  1    2
2          0  2  1    0
2          1  0  2    1
2          2  1  0    2

Table 2: A (1, 2, 2, 3)3-Read-Write-Code for contents x and code y1, y2, y3. Every information vector has three possible code words. If only two of the three code symbols are available for reading and writing, the system can perform read and write operations.

Proof: See Table 2 for an example. The correctness is straightforward. □
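The correctness claim of Lemma 2 can also be confirmed mechanically. The following brute-force check is our addition: it rebuilds Table 2 from the relations y = (v, x + v, 2x + v) mod 3, which reproduce the table's rows, and verifies both RWC properties.

```python
# Brute-force check that the (1,2,2,3)_3 code of Table 2 is a correct RWC:
# any r = 2 symbols determine x, and any new x can be encoded by changing
# only w = 2 symbols.
from itertools import combinations

def encode(x, v):
    return (v % 3, (x + v) % 3, (2 * x + v) % 3)   # rows of Table 2

codebook = {(x, v): encode(x, v) for x in range(3) for v in range(3)}

# (a) Read consistency: two positions never map to two different contents.
for pos in combinations(range(3), 2):
    seen = {}
    for (x, v), y in codebook.items():
        key = tuple(y[i] for i in pos)
        assert seen.setdefault(key, x) == x

# (b) Write: for every state and every target content, some line v' leaves
# the single non-writable position untouched.
for (x, v), y in codebook.items():
    for fixed in range(3):
        for x_new in range(3):
            assert any(encode(x_new, v2)[fixed] == y[fixed] for v2 in range(3))

print("Table 2 defines a valid (1,2,2,3)_3 RWC")
```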

5 The Matrix-Approach

We show that all perfect RW-codes exist if the symbol alphabet is large enough. Such codes can be constructed by matrix operations over finite fields. Let x = x1, . . . , xn ∈ F[b] be the information vector and let y = y1, . . . , ym ∈ F[b] be the code vector. The matrix-based approach uses some internal slack variables v = v1, . . . , vk, for k = m − w = r − n, carrying no particular information, and an appropriate m × r generator matrix M with Mi,j ∈ F[b]. The sub-matrix (Mi,j) for i ∈ [m], j ∈ {n+1, . . . , r} is called the variable matrix. The code relies on the following equation:

( M1,1  M1,2  · · ·  M1,r )   ( x1 )     ( y1 )
( M2,1  M2,2  · · ·  M2,r )   ( ⋮  )     ( y2 )
(  ⋮     ⋮            ⋮   ) · ( xn )  =  ( ⋮  )        (1)
( Mm,1  Mm,2  · · ·  Mm,r )   ( v1 )     ( ym )
                              ( ⋮  )
                              ( vk )

Operations

• Initialization. Starting with any information vector x1, . . . , xn, the variables v1, . . . , vk can be set to arbitrary values. If one wants to benefit from the security features of this coding system, these variables must be chosen uniformly at random. Now, compute the code Y = (y1, . . . , ym) by using Equation (1).

• Read: Given r code entries of y, compute x. We rearrange the rows of M and the entries of y such that the first r entries of y are available for reading. Let y′ and M′ denote the rearranged vector and matrix. The first r rows of M′ form the r × r matrix M″. We now use the property that M″ is invertible, and the information vector x (together with the variable vector v) is given by

(x | v)^T = (M″)^(−1) y″ ,

where y″ denotes the first r entries of y′.

• Differential write: Given the change vector δ and w code entries of y, compute the difference for the w code entries. Recall that the new information vector x′ is given by x′i = xi + δi. This notation allows changing the vector x to x′ without reading its entries, according to the differential write function. Only the choices w < r make sense. Again, for a simpler description, we rearrange the rows of M and y such that the writable code symbols are y1, . . . , yw. Let M′ denote the rearranged matrix and y′ the rearranged code vector. Let k = r − n. Define the following sub-matrices of M′:

M←↑ = (M′i,j) for i ∈ [w], j ∈ {1, . . . , n} ,
M↑→ = (M′i,j) for i ∈ [w], j ∈ {n+1, . . . , r} ,
M←↓ = (M′i,j) for i ∈ {w+1, . . . , m}, j ∈ {1, . . . , n} ,
M↓→ = (M′i,j) for i ∈ {w+1, . . . , m}, j ∈ {n+1, . . . , r} .

A precondition of the write operation is the invertibility of M↓→. The code symbol vector is now updated by the vector γ with

γ = ((M←↑) − (M↑→)(M↓→)^(−1)(M←↓)) · δ ,

such that the new w code symbols Y′ are derived from the former code symbols at the writable positions by an addition:

Y′ = Y + γ .

In fact, the (2, 3, 3, 4)2-RWC in Table 1 can be generated by this matrix-based approach, as shown in Figure 1. Furthermore, the (1, 2, 2, 3)3-RWC in Figure 2 corresponds to Table 2.



( 0 0 1 )   ( x1 )   ( y1 )
( 0 1 1 ) · ( x2 ) = ( y2 )
( 1 0 1 )   ( v1 )   ( y3 )
( 1 1 1 )            ( y4 )

Readable code symbols   x1        x2
y1, y2, y3              y1 + y3   y1 + y2
y1, y2, y4              y2 + y4   y1 + y2
y1, y3, y4              y1 + y3   y3 + y4
y2, y3, y4              y2 + y4   y3 + y4

Writable code symbols   (x1′, x2′) = (x1 + δ1, x2 + δ2)
                        y1′ = y1 +   y2′ = y2 +   y3′ = y3 +   y4′ = y4 +
y1, y2, y3              δ1 + δ2      δ1           δ2           0
y1, y2, y4              δ1           δ1 + δ2      0            δ2
y1, y3, y4              δ2           0            δ1 + δ2      δ1
y2, y3, y4              0            δ2           δ1           δ1 + δ2

Figure 1: A (2, 3, 3, 4)2-Read-Write-Code over F[2] = {0, 1}, modulo 2.




( 0 1 )   ( x )   ( y1 )
( 1 1 ) · ( v ) = ( y2 )
( 2 1 )           ( y3 )

Readable code symbols   x
y1, y2                  2y1 + y2
y1, y3                  y1 + 2y3
y2, y3                  2y2 + y3

Writable code symbols   x′ = x + δ
                        y1′ =      y2′ =      y3′ =
y1, y2                  δ + y1     2δ + y2    y3
y1, y3                  2δ + y1    y2         δ + y3
y2, y3                  y1         δ + y2     2δ + y3

Figure 2: A (1, 2, 2, 3)3-Read-Write-Code over F[3] = {0, 1, 2}, modulo 3.
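The read and differential-write operations can be sketched concretely for the (1, 2, 2, 3)3 code of Figure 2. The code below is our illustration (helper names are our own): reading inverts an r × r submatrix of the generator matrix, and writing applies γ = (M←↑ − M↑→(M↓→)^(−1) M←↓) · δ, all with plain modular arithmetic over F[3].

```python
# Sketch: matrix-based read and differential write for the generator
# matrix of Figure 2 over F[3] (m = 3, r = 2, n = 1, k = 1).
P = 3
M = [[0, 1], [1, 1], [2, 1]]          # m rows, r columns

def inv2(A):
    """Inverse of a 2x2 matrix mod P (P prime, Fermat inverse of det)."""
    det = (A[0][0] * A[1][1] - A[0][1] * A[1][0]) % P
    di = pow(det, P - 2, P)
    return [[A[1][1] * di % P, -A[0][1] * di % P],
            [-A[1][0] * di % P, A[0][0] * di % P]]

def read(rows, y_vals):
    """Recover (x | v) from r = 2 code symbols with known positions."""
    Minv = inv2([M[i] for i in rows])
    return [(Minv[t][0] * y_vals[0] + Minv[t][1] * y_vals[1]) % P
            for t in (0, 1)]

def diff_write(wrows, delta):
    """gamma for the writable rows; the remaining row stays fixed."""
    (lo,) = [i for i in range(3) if i not in wrows]   # non-writable row
    m_lr_inv = pow(M[lo][1], P - 2, P)                # (M_lower_right)^-1
    return [(M[i][0] - M[i][1] * m_lr_inv * M[lo][0]) * delta % P
            for i in wrows]

y = [0, 1, 2]                     # encodes x = 1 with v = 0 (Table 2)
print(read([0, 1], [0, 1]))       # -> [1, 0], i.e. x = 1, v = 0
g = diff_write([1, 2], 1)         # change x by delta = 1, keep y1 fixed
y = [y[0], (y[1] + g[0]) % P, (y[2] + g[1]) % P]
print(y)                          # -> [0, 2, 1], the code word for x = 2
```

The resulting updates agree with the write table of Figure 2: for writable symbols y2, y3 we get y2′ = y2 + δ and y3′ = y3 + 2δ.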

Definition 1 An m × n matrix A over the base b with m ≥ n is row-wise invertible if each n × n matrix constructed by combining n distinct rows of A has full rank (and is therefore invertible).

Theorem 2 The matrix-based Read-Write-Coding system is correct and well-defined if the m × r generator matrix M is row-wise invertible and the m × (r − n) variable matrix is row-wise invertible.

Proof: This follows from the definition of row-wise invertibility and the description of the operations. To prove the correctness of the coding system, we prove that after each operation Equation (1) is valid. This is straightforward for the initialization and read operations. It remains to prove the correctness of the write operation. For this, we use the additional vector ρ = ρ1, . . . , ρk denoting the change of the slack variables, and the vector γ = γ1, . . . , γw:

x′ = x + δ ,
v′ = v + ρ ,
y′ = y + γ .

The correctness now follows by combining

M (x′ | v′)^T = M (x + δ | v + ρ)^T = M (x | v)^T + M (δ | ρ)^T

and

y′ = y + (γ1, . . . , γw, 0, . . . , 0)^T .

This equation is equivalent to the following:

(M↓→) ρ + (M←↓) δ = 0 ,
(M↑→) ρ + (M←↑) δ = γ .

Remember that δ is given, so one can compute ρ by the following equation, and then γ by the latter equation above:

ρ = (M↓→)^(−1) (−M←↓) · δ .

If ρ is known, then the product M′ · (δ | ρ)^T, reduced to the first w rows, gives the difference vector γ, which provides the new code entries of y′ by y′ = y + γ. □

Theorem 3 For any n ≤ r, w ≤ m with r + w = n + m, there exists an (n, r, w, m)b-Read-Write-Code for an appropriate base b. Furthermore, this coding system can be computed in polynomial time.

Proof: This follows from the following lemma.

Lemma 3 For each m ≥ n there exists a base b = 2^⌈log2(m+1)⌉ with a row-wise invertible m × n matrix over the finite field F[b]. Furthermore, each sub-matrix of it is also row-wise invertible.

Proof: Define an m × n Vandermonde-like matrix V for non-zero distinct elements c1, . . . , cm ∈ F[2^⌈log2(m+1)⌉]:

    ( c1^1  c1^2  · · ·  c1^n )
    ( c2^1  c2^2  · · ·  c2^n )
V = ( c3^1  c3^2  · · ·  c3^n )
    (  ⋮     ⋮            ⋮  )
    ( cm^1  cm^2  · · ·  cm^n )

Now erase any m − n rows, resulting in an n × n matrix V′. This sub-matrix is again of Vandermonde type. Since all such Vandermonde matrices are invertible, this proves the claim. □
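The construction of Lemma 3 can be illustrated with a small script. As a simplifying assumption of our own, we work over a prime field F_p with p > m instead of the binary extension field F[2^⌈log2(m+1)⌉] that the lemma uses; the Vandermonde argument is the same, only the field arithmetic changes.

```python
# Sketch: a Vandermonde-like matrix V[i][j] = c_i^(j+1) with distinct
# non-zero c_i is row-wise invertible. Checked by brute force over F_7.
from itertools import combinations

P = 7                                    # prime > m (our simplification)
m, n = 5, 3
c = list(range(1, m + 1))                # distinct non-zero field elements
V = [[pow(ci, j + 1, P) for j in range(n)] for ci in c]

def rank_mod_p(rows, p):
    """Rank of a matrix over F_p via Gaussian elimination."""
    A = [r[:] for r in rows]
    rank, col = 0, 0
    while rank < len(A) and col < len(A[0]):
        piv = next((r for r in range(rank, len(A)) if A[r][col] % p), None)
        if piv is None:
            col += 1
            continue
        A[rank], A[piv] = A[piv], A[rank]
        inv = pow(A[rank][col], p - 2, p)        # Fermat inverse
        A[rank] = [a * inv % p for a in A[rank]]
        for r in range(len(A)):
            if r != rank:
                A[r] = [(a - A[r][col] * b) % p for a, b in zip(A[r], A[rank])]
        rank, col = rank + 1, col + 1
    return rank

# Every choice of n distinct rows yields an invertible n x n matrix.
assert all(rank_mod_p([V[i] for i in S], P) == n
           for S in combinations(range(m), n))
print("every", n, "x", n, "row selection of V is invertible")
```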

6 Security and Redundancy

A very extreme scenario is the combination of the hard disks of m portable (laptop) computers into a storage network. Using an (n, r, w, m)b RW-code for at most m laptops, it is sufficient if at least max{r, w} computers are at the office to access and change data. If only r computers are connected, then at least read operations can still be performed. Now, what happens if computer hard disks are broken or hard disks are exchanged? The inherent redundancy of any (n, r, w, m)b-Read-Write-Coding system allows one to detect a number of wrong data symbols and to repair them (to some extent). A different problem occurs if computers are stolen in order to obtain knowledge about company data. The good news is that in every matrix-based Read-Write-Coding system one can give away any m − w hard disks without revealing any information to an adversary. The attacker will receive hard disks with perfectly random sequences, absolutely useless without the other hard disks.
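The security claim can be illustrated for the (2, 3, 3, 4)2 code of Table 1; the encoding relations below are those of Figure 1, while the check itself is our addition. With the slack variable v chosen uniformly at random, any single code symbol (here m − w = 1) is uniformly distributed, whatever the stored contents are:

```python
# Sketch: any one leaked code symbol of the (2,3,3,4)_2 code reveals
# nothing about (x1, x2) when the slack variable v is uniform.
from collections import Counter

def encode(x1, x2, v):
    return (v, x2 ^ v, x1 ^ v, x1 ^ x2 ^ v)   # Figure 1 relations

for x1 in (0, 1):
    for x2 in (0, 1):
        for pos in range(4):
            # Distribution of one code symbol over the random choice of v:
            dist = Counter(encode(x1, x2, v)[pos] for v in (0, 1))
            assert dist[0] == dist[1] == 1    # uniform for every contents

print("one leaked symbol carries no information about (x1, x2)")
```

Intuitively, every code symbol is the XOR of v with some function of the contents, so marginally it is a one-time pad.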

Redundancy

Theorem 4 Every (n, r, w, m)b-RWC system can reconstruct m − r missing code symbols, and it can detect and repair ℓ faulty code symbols if

m! (r + ℓ)! / ((m − ℓ)! r!) < 1/2 .


m, then we compute the corresponding variables zi from x and v. If m0 < m, we rename m − m0 code variables to z-variables and thus, reduce the code size. If r0 > r, then the content/slack-variable vector (x|v)T is enhanced by (r0 − r) 0-entries. We can assume that new contents needs to be written during the switch-operation (especially if n 6= n0 ). So, let 0 v1 , . . . , vr0 0 −n0 be the new set of slack variables. Furthermore, w0 code symbols are only available for writing. First, we erase in the Vandermonde matrix and the code vectors the rows m0 + 1, . . . , M , since they are of no interest for this switch operation. Like in the matrix based approach, we rearrange the residual matrix and the residual code vector such that the writable variables are on the first w rows. Furthermore, we rearrange the columns of the Vandermonde matrix and the contents/slack vector such that the new slack variables are on the rightmost columns, ˜. respectively lowermost lines. This results in the matrix M Let x the be original rearranged vector up to the lowest r0 − n0 entries (possibly containing a mixture of old contents, old slack variables, and 0-entries). Let x0 be the new (adequately rearranged) vector containing the new 9

contents, and let v 0 be the new slack variable vector with r0 − n0 entries. If r0 ≥ r, then x has n0 entries, if r0 < r, then x (x0 ) has r − r0 additional entries resulting from former slack or content variables need to be set to 0. First, we deal with the case r0 ≥ r. So, the number of entries in x is n0 . Then, we can perform a matrix based RWC write operation changing w0 code symbols. Let k 0 = r0 − n0 = m0 − w0 . For this let M ←↑ be the upper left w0 × n0 -sub-matrix of M 0 , let M ↑→ be the upper right w0 × k 0 -sub-matrix of M 0 . Let M ←↓ be the left-lower k 0 × n0 -sub-matrix of M 0 . Let M ↓→ be the right-lower k 0 × k 0 -sub-matrix of M 0 . As in the matrix based, the new (rearranged) writable symbol vector y 0 can be obtained by y 0 = y + ((M ←↑ ) − (M ↑→ )(M ↓→ )−1 (M ←↓ )) · (x0 − x) by using the old (rearranged) writable symbol vector y. The proof of correctness is completely analogous to the correctness proof of the matrix based approach. Now, we consider the case r0 < r. Then, the number of entries in x and x0 is n ˜ = n0 + r − r0 . Again, let x0 be 0 the new (adequately rearranged) vector containing the n ˜ new symbols, and let v be the new slack variable vector with r0 − n0 entries. Note that x0 − x can be computed at this stage. We now perform a slightly adapted matrix based RWC write operation changing w0 code symbols. Clearly, the matrix has more r − r0 more columns as in the previous case. In fact, this is no problem, we have only to adapt the sub-matrices. Let k 0 = r0 − n0 = m0 − w0 and let w ˜ = w + r − r0 . ←↑ Now, let M be the upper left w ˜×n ˜ -sub-matrix of M 0 , let M ↑→ be the upper right w ˜ × k 0 -sub-matrix of M 0 . ←↓ 0 0 ↓→ 0 0 Let M be the left-lower k × n ˜ -sub-matrix of M . Let M be the right-lower k × k -sub-matrix of M 0 . 0 Again, let y be the old writable symbols and y be the new ones. The new resulting vector y 0 can be obtained by y 0 = y + ((M ←↑ ) − (M ↑→ )(M ↓→ )−1 (M ←↓ )) · (x0 − x) . 
Again, the proof is analogous to the proof for the matrix based approached.

8 Boolean Read-Write-Codes

The most interesting case for the choice of the alphabet is the binary case Σ = {0, 1}. We have already seen that there is no Boolean (1, 2, 2, 3)-RW-code. Also for the matrix-based RW-codes, the Boolean base poses a severe restriction. The reason is that Boolean matrices can be row-wise invertible only for certain dimensions.

Lemma 4 For each m there is exactly one m × 1 row-wise invertible Boolean matrix. For n ≥ 2, there exist m × n row-wise invertible Boolean matrices if and only if m = n or m = n + 1.

Proof: The first claim is trivial. For the second, note that there is an (n + 1) × n row-wise invertible Boolean matrix, e.g.

( 1 0 · · · 0 )
( 0 1       ⋮ )
( ⋮    ⋱   0 )
( 0 · · · 0 1 )
( 1 · · · 1 1 )

i.e. the n × n identity matrix with an additional all-ones row. Now, we prove that there is no (n + 2) × n row-wise invertible matrix. For this, we prove that, given an n × n full-rank Boolean matrix M, there is always exactly one vector that can be added, leading to an (n + 1) × n row-wise invertible Boolean matrix. Consider M and remove row i. Then, there are 2^(n−1) possibilities to add a row and receive a matrix of full rank: such a row is described by the vector set M(x1, . . . , xn)^T where xi = 1 and the rest is chosen arbitrarily. So, the only vector that can be added for every choice of i is the vector M(1, . . . , 1)^T. After this vector has been added, no further vector can be added that is linearly independent in combination with the other rows. □

This lemma has severe implications for the matrix-based method over the Boolean base.

Theorem 7 For r + w = n + m, there are Boolean m × n generator matrices V for an (n, r, w, m)2-RW matrix coding only in these cases:

1. (1, 1, m, m)2, (1, m, 1, m)2, (n, n, n, n)2

2. For n ≥ 2: (n, n + 1, n + 1, n + 2)2

3. For n ≥ 1: (n, n, n + 1, n + 1)2

4. For n ≥ 1: (n, n + 1, n, n + 1)2

Proof: Follows by combining the prerequisites of the matrix-based RWC method with Lemma 4. □

If only one bit of information needs to be encoded, it turns out that every perfect Boolean RW-code is a matrix-based RW-code.

Lemma 5 Every (1, r, w, r + w − 1)2-RW-code is a matrix-based RW-code.

Proof: Consider a write operation on a set W of cardinality w and a read operation on an index set R ⊆ [m] of cardinality r such that |R ∩ W| = 1. Let i be the index of the element in the intersection. Now, if the data entry x ∈ {0, 1} changes, then this bit must change as well; otherwise, the read operation would reproduce the same result as before, which would be wrong. So, the operation on this code bit yi can be described by yi′ ≡ yi + x′ + x (mod 2), where x is the former value, x′ is the new value, and yi′ is the new value of the code bit. Note that for any index i ∈ W there is a set R with |R| = r and R ∩ W = {i}. So, the above equivalence is valid for all bits, and this leads to the matrix representation of RW-codes. □

One might hope that other techniques produce perfect Boolean RW-codes beyond the matrix-based RW-codes. One can prove that this is not the case.

Lemma 6 There is no (2, r, w, m)2-RW-code for r + w = 2 + m and w ≥ 4.

Proof: Consider any 4 bits of the code word y1, . . . , ym; their index set is denoted by F. Now, we partition the residual m − 4 code bits into two disjoint index sets R of cardinality r − 2 and W of cardinality w − 4. Without loss of generality, consider the set F = {1, 2, 3, 4}, describing the code bits y1, y2, y3, y4. Now, we consider write operations on the index set W ∪ F and read operations on the index sets Ri,j = R ∪ {i, j} for all distinct i, j ∈ F.
We are interested in the bits that need to be changed if the information x_1, x_2 changes to x_1', x_2' (for which there are three possibilities). Let p_i be a predicate which is true if and only if the code bit y_i has to be changed, i.e. inverted. If we consider the read operation on R_{i,j}, then all bits in R remain the same. So, the only way the write operation can induce a change is that at least one of the code bits y_i, y_j is changed. Considering all such read operations, the following term has to be fulfilled:

$$\bigwedge_{i,j \in [4],\, i \neq j} (p_i \vee p_j) \;=\; (p_1 \wedge p_2 \wedge p_3) \vee (p_1 \wedge p_2 \wedge p_4) \vee (p_1 \wedge p_3 \wedge p_4) \vee (p_2 \wedge p_3 \wedge p_4).$$

So, all but at most one of these four bits have to be inverted. Hence, there are five possibilities to encode all three possible changes of the information vector x_1, x_2. Now, each read operation can observe only two of these four bits. We start with R_{1,2}:

1. A: y_1 and y_2 are inverted.

2. B: Only y_1 is inverted.

3. C: Only y_2 is inverted.

So A, B, and C can be mapped to the three possible changes of x_1, x_2 to x_1', x_2'. Yet, there is a conflict with the read operation on R_{3,4} in the situations B and C: these situations are indistinguishable for this read operation, because the bits y_3 and y_4 must be inverted in both situations. Then, the read operation on R_{3,4} cannot distinguish the different changes of the information vector. □

This lemma can be generalized to the following theorem.

Theorem 8 There is no (n, r, w, m)_2-RW-Code for r + w = n + m and w ≥ n + 2.


Proof: Consider any n + 2 bits of the code word y_1, . . . , y_m; the corresponding index set is denoted by F. Now, we partition the residual m − n − 2 code bits into two disjoint index sets R of cardinality r − n and W of cardinality w − n − 2. Without loss of generality, consider the set F = {1, 2, . . . , n + 2} describing the code bits y_1, y_2, . . . , y_{n+2}. Now, we consider Write-operations on the index set W ∪ F (of cardinality w) and Read-operations on the index sets R_S = R ∪ S for all S ⊆ F with |S| = n.

Again, we are interested in the bits that need to be changed if the information x_1, . . . , x_n changes to x_1', . . . , x_n' (for which there are 2^n − 1 possibilities). Let p_i be a predicate which is true if and only if the code bit y_i has to be changed, i.e. inverted. If we consider the read operation on R ∪ {i_1, . . . , i_n}, then all bits in R remain the same. So, the only way the write operation can induce a change is that at least one of the code bits y_{i_1}, . . . , y_{i_n} is changed. Considering all such read operations, the following term has to be fulfilled:

$$\bigwedge_{S \subseteq [n+2]:\, |S| = n} \; \bigvee_{i \in S} p_i \;=\; \bigvee_{S \subseteq [n+2]:\, |S| \geq 3} \; \bigwedge_{i \in S} p_i \;.$$

So, at least 3 of these n + 2 bits have to be inverted in every write operation. Hence, there are $2^{n+2} - \sum_{i=0}^{2} \binom{n+2}{i}$ possibilities to encode all 2^n − 1 possible changes of the information vector x_1, . . . , x_n. However, each read operation accesses only n bits in the intersection with F, and it needs 2^n − 1 possible outcomes. If we combine two of the read operations, this implies for the whole set that they need at least 2^{n+2} − 7 possible outcomes: consider two read operations that have only n − 2 positions of F in common. If these common bits are the same, then each of the residual pairs of two bits must differ, leaving 9 possibilities. In the other case, there are 2^{n−2} − 1 possibilities for the common bit vector and 16 possibilities for the 'private' bit pairs. So, there are $\sum_{i=0}^{2} \binom{n+2}{i} - 7$ codes missing, which is larger than 1 for all n ≥ 2. □

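Both Boolean results of this section, Lemma 4's restriction on row-wise invertible matrices and the "at least three bits must be inverted" identity used in the proofs of Lemma 6 and Theorem 8, are small enough to verify exhaustively. The following sketch does so over GF(2); the helper names (`gf2_rank`, `row_wise_invertible_exists`) are ours, not from the paper:

```python
from itertools import combinations, product

def gf2_rank(rows):
    """Rank over GF(2); each row is an integer bitmask."""
    pivots = {}                      # leading bit -> reduced row
    for row in rows:
        cur = row
        while cur:
            lead = cur.bit_length() - 1
            if lead in pivots:
                cur ^= pivots[lead]  # cancel the leading bit
            else:
                pivots[lead] = cur
                break
    return len(pivots)

def row_wise_invertible_exists(m, n):
    """Is there an m x n Boolean matrix in which every choice of n rows
    is invertible?  For n >= 2 any two rows must already be linearly
    independent, hence distinct and nonzero, so searching over subsets
    of nonzero rows suffices."""
    for rows in combinations(range(1, 1 << n), m):
        if all(gf2_rank(sub) == n for sub in combinations(rows, n)):
            return True
    return False

# Lemma 4: an (n+1) x n matrix exists, an (n+2) x n matrix does not (n = 3).
assert row_wise_invertible_exists(4, 3)
assert not row_wise_invertible_exists(5, 3)

# Proof identity: every n of the n+2 predicate bits contain an inverted
# one  <=>  at least three bits are inverted (checked for n = 2 and 3).
for n in (2, 3):
    for p in product((0, 1), repeat=n + 2):
        covers = all(any(p[i] for i in S)
                     for S in combinations(range(n + 2), n))
        assert covers == (sum(p) >= 3)
```

The search space is tiny (for n = 3 there are only 7 nonzero rows), so the exhaustive check runs instantly.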
9 General RW-Codes

Since only few perfect Boolean RW-Codes exist, we consider more general RW-Codes where r + w > m + n. For this, we use a modified matrix based approach and choose all matrix entries randomly. Then, we prove that with positive probability the read and write operations work. We call the following approach a random matrix (n, k, m)_b-RW-Code, where the information vector consists of n symbols, k slack variables are used, and an m-symbol code is generated. Again, the alphabet Σ has size b. Let M be a (uniformly) random m × (n + k)-matrix over Σ. We use the matrix based representation M (x | v)^T = y. The read and write operations can be derived from the matrix based read and write operations as follows.

Read Given the r ≥ k + n readable positions in the code vector, we derive the matrix M′ by choosing all rows of M corresponding to the positions available for reading. In this matrix M′, we choose k + n linearly independent row vectors, yielding the matrix M′′. If that many linearly independent rows do not exist, then the read operation fails. Corresponding to these rows, we reduce the code vector, yielding y′, and compute the content vector and the slack variable vector as (M′′)^{−1} y′.

Write Consider the variable sub-matrix M→ consisting of the k rightmost columns of M. Now, there are m − w non-writable positions, corresponding to rows of the code vector y and of the variable sub-matrix M→. If these m − w rows of M→ are not linearly independent, then the write operation fails. Otherwise, we add some other k − m + w linearly independent rows of M→ (again, if these do not exist, the operation fails) and treat them as additional non-writable positions. Using the complement of these rows as writable positions, we can directly apply the write operation of the matrix based RW-Code.

We now investigate the success probability of Read and Write.

Lemma 7 An (n + k) × n random matrix over a base b ≥ 2 has an n × n sub-matrix (obtained by choosing n of the rows) which is invertible with probability larger than 1 − b^{−k+2}.
Proof: We choose the rows as follows: add a row if it is linearly independent of the already chosen ones. If ℓ rows have been chosen, then the next row is linearly independent of them with probability 1 − b^{−n+ℓ}.


Hence, an n × n random matrix over a base b is invertible with probability

$$(1-b^{-n})(1-b^{-n+1})\cdots(1-b^{-2})(1-b^{-1})
= (1-b^{-1})\prod_{i=2}^{n}(1-b^{-i})
\geq (1-b^{-1})\prod_{i=2}^{n} 4^{-b^{-i}}
\geq (1-b^{-1})\prod_{i=2}^{\infty} 4^{-b^{-i}}
\geq (1-b^{-1})\, 4^{-\sum_{i=2}^{\infty} b^{-i}}
\geq (1-b^{-1})\, 4^{-\frac{1}{b(b-1)}}
\geq \tfrac{1}{2}\,(1-b^{-1}).$$

Let P(ℓ, k) denote the probability that a random ℓ × n matrix has rank k. Then, we have the following recursive equations:

$$\begin{aligned}
P(0, 0) &= 1, \\
P(0, k) &= 0 && \text{for } k \geq 1, \\
P(\ell+1, 0) &= b^{-n}\, P(\ell, 0) && \text{for } \ell \geq 0, \\
P(\ell, k) &= P(\ell-1, k-1)(1 - b^{-n+k-1}) + P(\ell-1, k)\, b^{-n+k} && \text{for } 1 \leq k \leq n, \ell.
\end{aligned}$$

Lemma 8 It holds for k ≤ ℓ and k ≤ n:

(2) (1 − 2b^{ℓ−n−2})(1 − b^{ℓ−n−1}) ≤ P(ℓ, ℓ) ≤ 1 − b^{ℓ−n−1}

(3) P(ℓ−1, k−1) ≤ P(ℓ, k) for ℓ > k

(4) P(ℓ, k) ≤ P(ℓ−1, k) (b/(b−1)) b^{−n+k}

Proof: (2) follows from P(ℓ, ℓ) = P(ℓ−1, ℓ−1)(1 − b^{−n+ℓ−1}) for ℓ ≥ 1 and P(0, 0) = 1.

(3): First, note that

$$\begin{aligned}
P(\ell, \ell-1) &= P(\ell-1, \ell-1)\, b^{-n+\ell-1} + (1 - b^{\ell-n-2})\, P(\ell-1, \ell-2) \\
&\leq (1 - b^{\ell-n-2})\, \big( b^{-n+\ell-1} + P(\ell-1, \ell-2) \big)
\end{aligned}$$

and

$$\begin{aligned}
P(\ell+1, \ell) &= P(\ell, \ell)\, b^{-n+\ell} + (1 - b^{\ell-n-1})\, P(\ell, \ell-1) \\
&\geq (1 - b^{\ell-n-1}) \big( (1 - b^{\ell-n-2})(1 - 2b^{\ell-n-3})\, b^{-n+\ell} + P(\ell, \ell-1) \big).
\end{aligned}$$


Now, P(ℓ+1, ℓ) ≥ P(ℓ, ℓ−1), since

$$(1 - b^{\ell-n-1})\, b\, (1 - b^{\ell-n-2})(1 - 2b^{\ell-n-3}) \;\geq\; 1 - b^{\ell-n-2}$$

because b − 1 ≥ b^{ℓ−n} + b^{ℓ−n−1} + b^{ℓ−n−2} = (1 + 1/b + 1/b²) b^{ℓ−n}, which holds for b ≥ 3 or (b = 2 and ℓ ≤ n − 1). For the case ℓ = n and b = 2, it turns out that also P(n+1, n) ≥ P(n, n−1) holds.

(4): For ℓ = 0, the claim is true by definition. For ℓ ≥ 1, note that

$$\begin{aligned}
P(\ell, k) &= P(\ell-1, k-1)(1 - b^{-n+k-1}) + P(\ell-1, k)\, b^{-n+k} \\
&\leq P(\ell-2, k-1)\, b^{-n+k-1}\, (1 - b^{-n+k-1}) + P(\ell-1, k)\, b^{-n+k} \\
&\leq P(\ell-1, k)\, b^{-n+k-1}\, (1 - b^{-n+k-1}) + P(\ell-1, k)\, b^{-n+k} \\
&\leq P(\ell-1, k)\, \big( b^{-n+k-1}\, (1 - b^{-n+k-1}) + b^{-n+k} \big) \\
&\leq P(\ell-1, k)\, b^{-n+k}\, \Big(1 + \frac{1}{b}\Big). \qquad\square
\end{aligned}$$

Now, (4) implies

$$P(i+k, k) \;\leq\; \Big(\frac{b}{b-1}\Big)^{i}\, b^{(-n+k)i}.$$

Hence,

$$\begin{aligned}
\sum_{k=0}^{n-2} P(\ell, k)
&\leq \sum_{k=0}^{n-2} \Big(\frac{b}{b-1}\Big)^{\ell-k} b^{(-n+k)(\ell-k)}
\leq \sum_{k=0}^{n-2} b^{\ell-k}\, b^{(-n+k)(\ell-k)} \\
&= \sum_{k=0}^{n-2} b^{\ell-k+(-n+k)(\ell-k)}
\leq \sum_{k=0}^{n-2} b^{\ell-k+(-n+k)(\ell-n+2)} \\
&\leq b^{\ell-n(\ell-n+2)} \sum_{k=0}^{n-2} b^{-k(\ell-n-1)}
= b^{\ell-n(\ell-n+2)}\, \frac{1 - b^{-(\ell-n-1)(n-1)}}{1 - b^{-(\ell-n-1)}}
\leq \tfrac{1}{2}\, b^{\ell-n\ell-n^2+2}.
\end{aligned}$$

So, the weight is concentrated on P(ℓ, n) and P(ℓ, n−1) for ℓ ≥ n + 2. Applying the recurrence equalities, it follows that P(n+i, n) > 1 − b^{−i+2}. □

Hence, if only few combinations of read and write index sets occur, then the overhead for general codes is small.

Theorem 9 Consider a general Read-Write-Coding system with read index sets R = {R_1, . . . , R_{|R|}}, where |R_i| ≥ r, and write index sets W = {W_1, . . . , W_{|W|}}, where |W_i| ≥ w. Then, there is such a restricted (n, r, w, m)_b Read-Write-Coding system for any r > n + log_b |R| + 2 and w + r > m + n + log_b |W| + 2.

Proof: The probability of having a valid code for reading using r rows is at least 1 − b^{−(r−n)+2}. If we sum up the error probabilities of all possible read operations, we end up with an error probability less than |R| b^{−(r−n)+2}, which is less than 1 if r > n + log_b |R| + 2. For the write operation, we consider a k × (m − w) sub-matrix for k = r − n. Here, we need a positive probability to find m − w ≤ k independent rows; the other failure case of the write operation can be neglected. Hence, we succeed with probability higher than 1 − b^{m−w−k+2}. Thus, for m − w − r + n + 2 + log_b |W| < 0, i.e. w + r > m + n + log_b |W| + 2, the summed error probability is smaller than 1. Combining both error probabilities gives an overall error probability of less than 1. So, there exists a matrix allowing these restricted read and write index sets. □

Yet, the overall number of possible read and write index sets is quite high. The following theorem shows that random matrices perform quite well for most of the read and write index sets.

Theorem 10 For any base b, a random matrix based (n, k, m)_b-RW-Code successfully performs a read operation on a random choice of n + k + ℓ code symbols with probability 1 − b^{−ℓ+2}, and successfully performs a write operation on a random choice of m − k + ℓ writable code symbols with probability 1 − b^{−ℓ+2}.

Proof: Follows directly from Lemma 7 and from the definition of the Read and Write operations of random matrix based (n, k, m)_b-RW-Codes. □
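For b = 2, both the Read operation of the random matrix (n, k, m)_2-RW-Code and Theorem 10's read bound can be probed directly: draw a random M, encode, expose a random choice of n + k + ℓ code symbols, and attempt Gaussian elimination over GF(2). A Monte Carlo sketch (rows as integer bitmasks; the function names and toy parameters are ours):

```python
import random

def read_op(M, y, readable, width):
    """Read operation over GF(2).  M: row bitmasks over width = n + k
    columns, y: code bits, readable: indices of readable code symbols.
    Returns the recovered (x | v) bitmask, or None if the readable rows
    contain fewer than width linearly independent ones (read fails)."""
    pivots = {}                           # pivot column -> (row, rhs)
    for i in readable:
        row, rhs = M[i], y[i]
        for col in sorted(pivots, reverse=True):
            if (row >> col) & 1:          # eliminate known pivot columns
                pr, pb = pivots[col]
                row ^= pr
                rhs ^= pb
        if row:
            pivots[row.bit_length() - 1] = (row, rhs)
    if len(pivots) < width:
        return None
    z = 0                                 # back-substitute, low columns first
    for col in sorted(pivots):
        row, rhs = pivots[col]
        below = row ^ (1 << col)          # strictly lower, already solved
        z |= (rhs ^ (bin(z & below).count("1") & 1)) << col
    return z

def encode(M, z):
    """Code vector y = M (x | v)^T over GF(2)."""
    return [bin(row & z).count("1") & 1 for row in M]

def read_success_rate(n, k, m, ell, trials=1000, seed=1):
    """Fraction of trials in which reading a random choice of
    n + k + ell code symbols recovers the encoded vector."""
    rng = random.Random(seed)
    width, hits = n + k, 0
    for _ in range(trials):
        M = [rng.randrange(1 << width) for _ in range(m)]
        z = rng.randrange(1 << width)
        y = encode(M, z)
        readable = rng.sample(range(m), width + ell)
        hits += read_op(M, y, readable, width) == z
    return hits / trials

# Theorem 10 (b = 2): success probability at least 1 - 2^(-ell + 2).
rate = read_success_rate(n=4, k=4, m=16, ell=3)
assert rate >= 1 - 2 ** (-3 + 2)
```

With n + k = 8 and ℓ = 3, the theorem guarantees success probability at least 1/2; the observed rate is considerably higher, matching the intuition that 11 random rows of GF(2)^8 rarely fail to contain 8 independent ones.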

10 Conclusions

In [11], new methods are introduced for a wide-area distributed hash table (DHT) that provides high-throughput and low-latency network storage. The authors suggest a system called DHash++, which is based on the peer-to-peer network Chord [33]. One feature of DHash++ is the use of an erasure-resilient code for storing data: each 8192-byte block is stored as 14 erasure-coded fragments of 1171 bytes, any seven of which are sufficient to reconstruct the block, using the IDA (Information Dispersal Algorithm) coding algorithm of [26]. The benefit of this code is low latency, since the fastest peers can be used to reconstruct the stored data. On the other hand, this system suffers from the write-penalty problem. This is a promising application where Read-Write-Codes might improve peer-to-peer networks.

However, the RW-Codes presented here are primarily designed for distributed storage area networks (SANs). System reliability is a crucial feature there: since storage is distributed among n devices, fault-tolerant data distribution schemes are required to prevent data loss, and as a rule of thumb, the mean time before failure decreases by a factor of 1/n. In addition, reducing system latency is of great concern, since access to the devices on which the data is stored is orders of magnitude slower than any memory access, which leads to comparatively high I/O costs. Finally, SANs rely on a well-designed layout that keeps the entire system complexity at a level which can efficiently be administrated and maintained, since reduced complexity translates directly into monetary savings. Employed RAID schemes (RAID 4/5) and Reed-Solomon encoding are simple and efficient techniques that, on the one hand, aggregate disk performance by employing parallel access and, on the other hand, guarantee sufficient reliability to tolerate multiple disk failures, all at a low cost. However, all of these codes have some significant drawbacks in common.
All of them require m − n checksum disks when encoding an n block document into an m block encoding, which leads to the following problems.

1. Changing all n blocks of the original document implies additional, and thus expensive, accesses to each of the m − n checksum disks; this write penalty leads to a performance bottleneck.

2. Changing all n blocks of the original document while at least one of the m disks is missing results in inconsistencies.

3. Once the stripe size of an encoding is chosen, it cannot be changed any more and therefore cannot be adjusted to potentially changing demands (e.g. a changing degree of redundancy, a higher number of tolerable failures, a changing stripe size, or a changing document size).

There are methods to avoid scheme inconsistencies, e.g. parity declustering. Here, an advanced parity encoding scheme based on complete, or at least balanced incomplete, block designs [15] is employed, in which not all disks of the block design have to be rewritten in case of a complete information update. Such block designs often induce significant memory overhead, or it is not clear whether one exists for some special layouts. Additionally, once a block design is applied, the encoding scheme is fixed as well and cannot be adjusted any further.

RW-Codes can significantly reduce, or even eliminate, these problems. In a perfect (n, r, w, m)-RW-Code (with r + w = n + m), any r disks suffice for reading and any max{r, w} disks suffice for writing. The overhead is described by m/n, and any combination (n, r, w, m) with n ≤ r, w ≤ m can be chosen. In the case of Chameleon-Codes, these parameters can even be adjusted during runtime without reading and rewriting all disks. Furthermore, the information of m − r missing disks can be recovered, and the system can continue working if not more than m − max{r, w} disks fail. There is also some redundancy against faulty hard disks; however, fault detection is better handled by reducing it, via disk-wise checksums, to the problem of failed disks. Interestingly, there is also a security feature: up to m − w disks can be stolen without revealing any information. The main advantage of RW-Codes in a SAN is that the currently r (respectively w) fastest disks can be used for each read or write operation. This leads to more efficient storage area networks.

References

[1] M. Adler, Y. Bartal, J. W. Byers, M. Luby, and D. Raz. A modular analysis of network transmission protocols. In Israel Symposium on Theory of Computing Systems, pages 54–62, 1997.

[2] M. K. Aguilera, R. Janakiraman, and L. Xu. Reliable and secure distributed storage using erasure codes.

[3] M. Blaum, J. Brady, F. Bruck, and H. van Tilborg. Array codes. In V.S. Pless and W.C. Huffman, editors, Handbook of Coding Theory, volume 2, chapter 22. 1999.

[4] M. Blaum, J. Brady, J. Bruck, J. Menon, and A. Vardy. The EVENODD code and its generalization: An efficient scheme for tolerating multiple disk failures in RAID architectures. In H. Jin, T. Cortes, and R. Buyya, editors, High Performance Mass Storage and Parallel I/O: Technologies and Applications, chapter 14, pages 187–208. IEEE Computer Society Press and Wiley, New York, NY, 2001.

[5] W. A. Burkhard and J. Menon. Disk array storage system reliability. In Symposium on Fault-Tolerant Computing, pages 432–441, 1993.

[6] J. Byers, M. Luby, and M. Mitzenmacher. A digital fountain approach to asynchronous reliable multicast. IEEE Journal on Selected Areas in Communications, 20(8), October 2002.

[7] J. W. Byers, M. Luby, and M. Mitzenmacher. Accessing multiple mirror sites in parallel: Using Tornado codes to speed up downloads. In INFOCOM (1), pages 275–283, 1999.

[8] J. W. Byers, M. Luby, M. Mitzenmacher, and A. Rege. A digital fountain approach to reliable distribution of bulk data. In SIGCOMM '98, pages 56–67, September 1998.

[9] Y. M. Chee, C. J. Colbourn, and A. C. H. Ling. Asymptotically optimal erasure-resilient codes for large disk arrays. Discrete Applied Mathematics, 102(1-2):3–36, 2000.

[10] J. A. Cooley, J. L. Mineweaser, L. D. Servi, and E. T. Tsung. Software-based erasure codes for scalable distributed storage. In IEEE Symposium on Mass Storage Systems, pages 157–164, 2003.

[11] F. Dabek, J. Li, E. Sit, J. Robertson, M. F. Kaashoek, and R. Morris. Designing a DHT for low latency and high throughput.
In NSDI, pages 85–98, 2004.

[12] F. J. MacWilliams and N. J. A. Sloane. The Theory of Error-Correcting Codes. North-Holland Mathematical Library, 1977.

[13] H. Xia and A. A. Chien. RobuSTore: Robust performance for distributed storage systems. Technical Report CS2005-0838, University of California, San Diego, October 2005.

[14] C. Harrelson, L. Ip, and W. Wang. Limited randomness LT codes.

[15] M. Holland and G. Gibson. Parity declustering for continuous operation in redundant disk arrays. In Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 23–35, 1992.

[16] M. Krohn, M. Freedman, and D. Mazières. On-the-fly verification of rateless erasure codes for efficient content distribution. In IEEE Symposium on Security and Privacy, Oakland, CA, May 2004.

[17] J. Kubiatowicz, D. Bindel, Y. Chen, P. Eaton, D. Geels, R. Gummadi, S. Rhea, H. Weatherspoon, W. Weimer, C. Wells, and B. Zhao. OceanStore: An architecture for global-scale persistent storage. In Proceedings of ACM ASPLOS. ACM, November 2000.

[18] M. Luby. LT codes. In FOCS '02: Proceedings of the 43rd Symposium on Foundations of Computer Science, page 271, Washington, DC, USA, 2002. IEEE Computer Society.

[19] M. G. Luby, M. Mitzenmacher, M. A. Shokrollahi, D. A. Spielman, and V. Stemann. Practical loss-resilient codes. In Proceedings of the 29th Annual ACM Symposium on Theory of Computing, pages 150–159, 1997.

[20] P. Maymounkov and D. Mazières. Rateless codes and big downloads. In Proceedings of the 2nd International Workshop on Peer-to-Peer Systems, February 2003.

[21] J. Nonnenmacher and E. Biersack. Asynchronous multicast push: AMP. In Proceedings of ICCC '97, pages 419–430, Cannes, France, November 1997.

[22] D. A. Patterson, G. Gibson, and R. H. Katz. A Case for Redundant Arrays of Inexpensive Disks (RAID). In Proceedings of the 1988 ACM Conference on Management of Data (SIGMOD), pages 109–116, June 1988.

[23] W. Peterson. Error Correcting Codes. The MIT Press, Wiley and Sons, 1961.

[24] J. S. Plank. A tutorial on Reed-Solomon coding for fault-tolerance in RAID-like systems. Software, Practice and Experience, 27(9):995–1012, 1997.

[25] J. S. Plank and M. G. Thomason. On the practical use of LDPC erasure codes for distributed storage applications. Technical Report CS-03-510, University of Tennessee, September 2003.

[26] M. O. Rabin. Efficient dispersal of information for security, load balancing and fault tolerance. Journal of the ACM, 36(2):335–348, 1989.

[27] I. S. Reed and G. Solomon. Polynomial codes over certain finite fields. J. SIAM, 8:300–304, 1960.

[28] S. Rhea, C. Wells, P. Eaton, D. Geels, B. Zhao, H. Weatherspoon, and J. Kubiatowicz. Maintenance-free global data storage, 2001.

[29] L. Rizzo.
Effective erasure codes for reliable computer communication protocols. ACM Computer Communication Review, 27(2):24–36, April 1997.

[30] A. Roumy, S. Guemghar, G. Caire, and S. Verdú. Design methods for irregular repeat-accumulate codes. IEEE Transactions on Information Theory, 50(8):1711–1727, 2004.

[31] S. B. Wicker and S. Kim. Fundamentals of Codes, Graphs, and Iterative Decoding. Kluwer Academic Publishers, Norwell, MA, 2003.

[32] A. Shokrollahi. Raptor codes. Technical report, Laboratoire d'algorithmique, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland, 2003. Available from http://algo.epfl.ch/.

[33] I. Stoica, R. Morris, D. R. Karger, M. F. Kaashoek, and H. Balakrishnan. Chord: A scalable peer-to-peer lookup service for internet applications. In SIGCOMM, pages 149–160, 2001.
