source: trunk/docs/proposed/lossmodel.lyx

Last change on this file was 3782c27, checked in by Shawn Willden <shawn-tahoe@…>, at 2009-07-28T22:44:30Z

Lossmodel updates

Various improvements to the lossmodel, plus addition of README.lossmodel
that provides a link to the PDF.

  • Property mode set to 100644
File size: 55.8 KB
Line 
1#LyX 1.6.2 created this file. For more info see http://www.lyx.org/
2\lyxformat 345
3\begin_document
4\begin_header
5\textclass amsart
6\use_default_options true
7\begin_modules
8theorems-ams
9theorems-ams-extended
10\end_modules
11\language english
12\inputencoding auto
13\font_roman default
14\font_sans default
15\font_typewriter default
16\font_default_family default
17\font_sc false
18\font_osf false
19\font_sf_scale 100
20\font_tt_scale 100
21
22\graphics default
23\float_placement h
24\paperfontsize default
25\spacing single
26\use_hyperref false
27\papersize default
28\use_geometry false
29\use_amsmath 1
30\use_esint 1
31\cite_engine basic
32\use_bibtopic false
33\paperorientation portrait
34\secnumdepth 3
35\tocdepth 3
36\paragraph_separation indent
37\defskip medskip
38\quotes_language english
39\papercolumns 1
40\papersides 1
41\paperpagestyle default
42\tracking_changes false
43\output_changes false
44\author ""
45\author ""
46\end_header
47
48\begin_body
49
50\begin_layout Title
51Tahoe Distributed Filesharing System Loss Model
52\end_layout
53
54\begin_layout Author
55Shawn Willden
56\end_layout
57
58\begin_layout Date
5907/22/2009
60\end_layout
61
62\begin_layout Address
63South Weber, Utah
64\end_layout
65
66\begin_layout Email
67shawn@willden.org
68\end_layout
69
70\begin_layout Abstract
71The abstract goes here
72\end_layout
73
74\begin_layout Section
75Problem Statement
76\end_layout
77
78\begin_layout Standard
79The allmydata Tahoe distributed file system uses Reed-Solomon erasure coding
80 to split files into
81\begin_inset Formula $N$
82\end_inset
83
84 shares which are delivered to randomly-selected peers in a distributed
85 network.
86 The file can later be reassembled from any
87\begin_inset Formula $k\leq N$
88\end_inset
89
90 of the shares, if they are available.
91\end_layout
92
93\begin_layout Standard
94Over time shares are lost for a variety of reasons.
95 Storage servers may crash, be destroyed or simply be removed from the network.
96 To mitigate such losses, Tahoe network clients employ a repair agent which
97 scans the peers once per time period
98\begin_inset Formula $A$
99\end_inset
100
101 and determines how many of the shares remain.
102 If less than
103\begin_inset Formula $L$
104\end_inset
105
106 (
107\begin_inset Formula $k\leq L\leq N$
108\end_inset
109
110) shares remain, then the repairer reconstructs the file shares and redistribute
111s the missing ones, bringing the availability back up to full.
112\end_layout
113
114\begin_layout Standard
115The question we're trying to answer is "What is the probability that we'll
116 be able to reassemble the file at some later time
117\begin_inset Formula $T$
118\end_inset
119
120?".
121 We'd also like to be able to determine what values we should choose for
122 
123\begin_inset Formula $k$
124\end_inset
125
126,
127\begin_inset Formula $N$
128\end_inset
129
130,
131\begin_inset Formula $A$
132\end_inset
133
134, and
135\begin_inset Formula $L$
136\end_inset
137
138 in order to ensure
139\begin_inset Formula $Pr[loss]\leq r$
140\end_inset
141
142 for some threshold probability
143\begin_inset Formula $r$
144\end_inset
145
146.
147 This is an optimization problem because although we could obtain very low
148 
149\begin_inset Formula $Pr[loss]$
150\end_inset
151
152 by selecting conservative parameters, these choices have costs.
153 The peer storage and bandwidth consumed by the share distribution process
154 are approximately
155\begin_inset Formula $\nicefrac{N}{k}$
156\end_inset
157
158 times the size of the original file, so we would like to minimize
159\begin_inset Formula $\nicefrac{N}{k}$
160\end_inset
161
162, consistent with
163\begin_inset Formula $Pr[loss]\leq r$
164\end_inset
165
166.
167 Likewise, a frequent and aggressive repair process keeps the number of
168 shares available close to
169\begin_inset Formula $N,$
170\end_inset
171
172 but at a cost in bandwidth and processing time as the repair agent downloads
173 
174\begin_inset Formula $k$
175\end_inset
176
177 shares, reconstructs the file and uploads new shares to replace those that
178 are lost.
179\end_layout
180
181\begin_layout Section
182Reliability
183\end_layout
184
185\begin_layout Standard
186The probability that the file becomes unrecoverable is dependent upon the
187 probability that the peers to whom we send shares are able to return those
188 copies on demand.
189 Shares that are corrupted are detected and discarded, so there is no need
190 to distinguish between corruption and loss.
191\end_layout
192
193\begin_layout Standard
194Many factors affect share availability.
195 Availability can be temporarily interrupted by peer unavailability due
196 to network outages, power failures or administrative shutdown, among other
197 reasons.
198 Availability can be permanently lost due to failure or corruption of storage
199 media, catastrophic damage to the peer system, administrative error, withdrawal
200 from the network, malicious corruption, etc.
201\end_layout
202
203\begin_layout Standard
204The existence of intermittent failure modes motivates the introduction of
205 a distinction between
206\noun on
207availability
208\noun default
209 and
210\noun on
211reliability
212\noun default
213.
214 Reliability is the probability that a share is retrievable assuming intermitten
215t failures can be waited out, so reliability considers only permanent failures.
216 Availability considers all failures, and is focused on the probability
217 of retrieval within some defined time frame.
218\end_layout
219
220\begin_layout Standard
221Another consideration is that some failures affect multiple shares.
222 If multiple shares of a file are stored on a single hard drive, for example,
223 failure of that drive may lose them all.
224 Catastrophic damage to a data center may destroy all shares on all peers
225 in that data center.
226\end_layout
227
228\begin_layout Standard
229While the types of failures that may occur are quite consistent across peers,
230 their probabilities differ dramatically.
231 A professionally-administered server with redundant storage, power and
232 Internet located in a carefully-monitored data center with automatic fire
233 suppression systems is much less likely to become either temporarily or
234 permanently unavailable than the typical virus and malware-ridden home
235 computer on a single cable modem connection.
236 A variety of situations in between exist as well, such as the case of the
237 author's home file server, which is administered by an IT professional
238 and uses RAID level 6 redundant storage, but runs on old, cobbled-together
239 equipment, and has a consumer-grade Internet connection.
240\end_layout
241
242\begin_layout Standard
243To begin with, let's use a simple definition of reliability:
244\end_layout
245
246\begin_layout Definition
247
248\noun on
249Reliability
250\noun default
251 is the probability
252\begin_inset Formula $p_{i}$
253\end_inset
254
255 that a share
256\begin_inset Formula $s_{i}$
257\end_inset
258
259 will survive to (be retrievable at) time
260\begin_inset Formula $T=A$
261\end_inset
262
263, ignoring intermittent failures.
264 That is, the probability that the share will be retrievable at the end
265 of the current repair cycle, and therefore usable by the repairer to regenerate
266 any lost shares.
267\end_layout
268
269\begin_layout Standard
270Reliability
271\begin_inset Formula $p_{i}$
272\end_inset
273
274 is clearly dependent on
275\begin_inset Formula $A$
276\end_inset
277
278.
279 Short repair cycles offer less time for shares to
280\begin_inset Quotes eld
281\end_inset
282
283decay
284\begin_inset Quotes erd
285\end_inset
286
287 into unavailability.
288\end_layout
289
290\begin_layout Subsection
291Peer Reliability
292\end_layout
293
294\begin_layout Standard
295Since peer reliability is the basis for any computations we may do on share
296 and file reliability, we must have a way to estimate it.
297 Reliability modeling of hardware, software and human performance are each
298 complex topics, the subject of much ongoing research.
299 In particular, the reliability of one of the key components of any peer
300 from our perspective -- the hard drive where file shares are stored --
301 is the subject of much current debate.
302\end_layout
303
304\begin_layout Standard
305A common assumption about hardware failure is that it follows the
306\begin_inset Quotes eld
307\end_inset
308
309bathtub curve
310\begin_inset Quotes erd
311\end_inset
312
313, with frequent failures during the first few months, a constant failure
314 rate for a few years and then a rising failure rate as the hardware wears
315 out.
316 This curve is often flattened by burn-in stress testing, and by periodic
317 replacement that assures that in-service components never reach
318\begin_inset Quotes eld
319\end_inset
320
321old age
322\begin_inset Quotes erd
323\end_inset
324
325.
326\end_layout
327
328\begin_layout Standard
329In any case, we're generally going to ignore all of that complexity and
330 focus on the bottom of the bathtub, assuming constant failure rates.
331 This is a particularly reasonable assumption as long as we're focused on
332 failures during a particular, relatively short interval
333\begin_inset Formula $A$
334\end_inset
335
336.
337 Towards the end of this paper, as we examine failures over many repair
338 intervals, the assumption becomes more tenuous, and we note some of the
339 issues.
340\end_layout
341
342\begin_layout Subsubsection
343Estimate Adaptation
344\end_layout
345
346\begin_layout Standard
347Even assuming constant failure rates, however, it will be rare that the
348 duration of
349\begin_inset Formula $A$
350\end_inset
351
352 coincides with the available failure rate data, particularly since we want
353 to view
354\begin_inset Formula $A$
355\end_inset
356
357 as a tunable parameter.
358 It's necessary to be able adapt failure rates baselined against any given
359 duration to the selected value of
360\begin_inset Formula $A$
361\end_inset
362
363.
364\end_layout
365
366\begin_layout Standard
367Another issue is that failure rates of hardware, etc., are necessarily continuous
368 in nature, while the per-interval failure/survival rates that are of interest
369 for file reliability calculations are discrete -- a peer either survives
370 or fails during the interval.
371 The continuous nature of failure rates means that the common and obvious
372 methods for estimating failure rates result in values that follow continuous,
373 not discrete distributions.
374 The difference is minor for small failure probabilities, and converges
375 to zero as the number of intervals goes to infinity, but is important enough
376 in some cases to be worth correcting for.
377\end_layout
378
379\begin_layout Standard
380Continuous failure rates are described in terms of mean time to failure,
381 and under the assumption that failure rates are constant, are exponentially
382 distributed.
383 Under these assumptions, the probability that a machine fails at time
384\begin_inset Formula $t$
385\end_inset
386
387, is
388\begin_inset Formula \[
389f\left(t\right)=\lambda e^{-\lambda t}\]
390
391\end_inset
392
393where
394\begin_inset Formula $\lambda$
395\end_inset
396
397 represents the per unit-time failure rate.
398 The probability that a machine fails at or before time
399\begin_inset Formula $A$
400\end_inset
401
402 is therefore
403\begin_inset Formula \begin{align}
404F\left(t\right) & =\int_{0}^{A}f\left(x\right)dx\nonumber \\
405 & =\int_{0}^{A}\lambda e^{-\lambda x}dx\nonumber \\
406 & =1-e^{-\lambda A}\label{eq:failure-time}\end{align}
407
408\end_inset
409
410
411\end_layout
412
413\begin_layout Standard
414Note that
415\begin_inset Formula $A$
416\end_inset
417
418 and
419\begin_inset Formula $\lambda$
420\end_inset
421
422 in
423\begin_inset CommandInset ref
424LatexCommand ref
425reference "eq:failure-time"
426
427\end_inset
428
429 must be expressed in consistent time units.
430 If they're different, unit conversions should be applied in the normal
431 way.
432 For example, if the estimate for
433\begin_inset Formula $\lambda$
434\end_inset
435
436 is 750 failures per million hours, and
437\begin_inset Formula $A$
438\end_inset
439
440 is one month, then either
441\begin_inset Formula $A$
442\end_inset
443
444 should be represented as
445\begin_inset Formula $30\cdot24/1000000=.00072$
446\end_inset
447
448, or
449\begin_inset Formula $\lambda$
450\end_inset
451
452 should be converted to failures per month.
453 Or both may be converted to hours.
454\end_layout
455
456\begin_layout Subsubsection
457Acquiring Peer Reliability Estimates
458\end_layout
459
460\begin_layout Standard
461Need to write this.
462\end_layout
463
464\begin_layout Subsection
465Uniform Reliability
466\begin_inset CommandInset label
467LatexCommand label
468name "sub:Fixed-Reliability"
469
470\end_inset
471
472
473\end_layout
474
475\begin_layout Standard
476In the simplest case, the peers holding the file shares all have the same
477 reliability
478\begin_inset Formula $p$
479\end_inset
480
481, and are all independent from one another.
482 Let
483\begin_inset Formula $K$
484\end_inset
485
486 be a random variable that represents the number of shares that survive
487 
488\begin_inset Formula $A$
489\end_inset
490
491.
492 Each share's survival can be viewed as an independent Bernoulli trial with
493 a success probability of
494\begin_inset Formula $p$
495\end_inset
496
497, which means that
498\begin_inset Formula $K$
499\end_inset
500
501 follows the binomial distribution with parameters
502\begin_inset Formula $N$
503\end_inset
504
505 and
506\begin_inset Formula $p$
507\end_inset
508
509.
510 That is,
511\begin_inset Formula $K\sim B(N,p)$
512\end_inset
513
514.
515\end_layout
516
517\begin_layout Theorem
518Binomial Distribution Theorem
519\end_layout
520
521\begin_layout Theorem
522Consider
523\begin_inset Formula $n$
524\end_inset
525
526 independent Bernoulli trials
527\begin_inset Foot
528status collapsed
529
530\begin_layout Plain Layout
531A Bernoulli trial is simply a test of some sort that results in one of two
532 outcomes, one of which is designated success and the other failure.
533 The classic example of a Bernoulli trial is a coin toss.
534\end_layout
535
536\end_inset
537
538 that succeed with probability
539\begin_inset Formula $p$
540\end_inset
541
542, and let
543\begin_inset Formula $K$
544\end_inset
545
546 be a random variable that represents the number,
547\begin_inset Formula $m$
548\end_inset
549
550, of successes,
551\begin_inset Formula $0\le m\le n$
552\end_inset
553
554.
555 We say that
556\begin_inset Formula $K$
557\end_inset
558
559 follows the Binomial Distribution with parameters n and p, denoted
560\begin_inset Formula $K\sim B(n,p)$
561\end_inset
562
563.
564 The probability mass function (PMF) of K is a function that gives the probabili
565ty that
566\begin_inset Formula $K$
567\end_inset
568
569 takes a particular value
570\begin_inset Formula $m$
571\end_inset
572
573 (the probability that there are exactly
574\begin_inset Formula $m$
575\end_inset
576
577 successful trials, and therefore
578\begin_inset Formula $n-m$
579\end_inset
580
581 failures).
582 The PMF of K is
583\begin_inset Formula \begin{equation}
584Pr[K=m]=f(m;n,p)=\binom{n}{m}p^{m}(1-p)^{n-m}\label{eq:binomial-pmf}\end{equation}
585
586\end_inset
587
588
589\end_layout
590
591\begin_layout Proof
592Consider the specific case of exactly
593\begin_inset Formula $m$
594\end_inset
595
596 successes followed by
597\begin_inset Formula $n-m$
598\end_inset
599
600 failures, because each success has probability
601\begin_inset Formula $p$
602\end_inset
603
604, each failure has probability
605\begin_inset Formula $1-p$
606\end_inset
607
608, and the trials are independent, the probability of this exact case occurring
609 is
610\begin_inset Formula $p^{m}\left(1-p\right)^{\left(n-m\right)}$
611\end_inset
612
613, the product of the probabilities of the outcome of each trial.
614\end_layout
615
616\begin_layout Proof
617Now consider any reordering of these
618\begin_inset Formula $m$
619\end_inset
620
621 successes and
622\begin_inset Formula $n$
623\end_inset
624
625 failures.
626 Any such reordering occurs with the same probability
627\begin_inset Formula $p^{m}\left(1-p\right)^{\left(n-m\right)}$
628\end_inset
629
630, but with the terms of the product reordered.
631 Since multiplication is commutative, each such reordering has the same
632 probability.
633 There are n-choose-m such orderings, and each ordering is an independent
634 event, meaning we can sum the probabilities of the individual orderings,
635 so the probability that any ordering of
636\begin_inset Formula $m$
637\end_inset
638
639 successes and
640\begin_inset Formula $n-m$
641\end_inset
642
643 failures occurs is given by
644\begin_inset Formula \[
645\binom{n}{m}p^{m}\left(1-p\right)^{\left(n-m\right)}\]
646
647\end_inset
648
649which is the right-hand-side of equation
650\begin_inset CommandInset ref
651LatexCommand ref
652reference "eq:binomial-pmf"
653
654\end_inset
655
656.
657\end_layout
658
659\begin_layout Standard
660A file survives if at least
661\begin_inset Formula $k$
662\end_inset
663
664 of the
665\begin_inset Formula $N$
666\end_inset
667
668 shares survive.
669 Equation
670\begin_inset CommandInset ref
671LatexCommand ref
672reference "eq:binomial-pmf"
673
674\end_inset
675
676 gives the probability that exactly
677\begin_inset Formula $i$
678\end_inset
679
680 shares survive, for any
681\begin_inset Formula $1\leq i\leq n$
682\end_inset
683
684, so the probability that fewer than
685\begin_inset Formula $k$
686\end_inset
687
688 survive is the sum of the probabilities that
689\begin_inset Formula $0,1,2,\ldots,k-1$
690\end_inset
691
692 shares survive.
693 That is:
694\end_layout
695
696\begin_layout Standard
697\begin_inset Formula \begin{equation}
698Pr[file\, lost]=\sum_{i=0}^{k-1}\binom{n}{i}p^{i}(1-p)^{n-i}\label{eq:simple-failure}\end{equation}
699
700\end_inset
701
702
703\end_layout
704
705\begin_layout Subsection
706Independent Reliability
707\begin_inset CommandInset label
708LatexCommand label
709name "sub:Independent-Reliability"
710
711\end_inset
712
713
714\end_layout
715
716\begin_layout Standard
717Equation
718\begin_inset CommandInset ref
719LatexCommand ref
720reference "eq:simple-failure"
721
722\end_inset
723
724 assumes that all shares have the same probability of survival, but as explained
725 above, this is not necessarily true.
726 A more accurate model allows each share
727\begin_inset Formula $s_{i}$
728\end_inset
729
730 an independent probability of survival
731\begin_inset Formula $p_{i}$
732\end_inset
733
734.
735 Each share's survival can still be treated as an independent Bernoulli
736 trial, but with success probability
737\begin_inset Formula $p_{i}$
738\end_inset
739
740.
741 Under this assumption,
742\begin_inset Formula $K$
743\end_inset
744
745 follows a generalized binomial distribution with parameters
746\begin_inset Formula $N$
747\end_inset
748
749 and
750\begin_inset Formula $p_{1},p_{2},\dots,p_{N}$
751\end_inset
752
753.
754\end_layout
755
756\begin_layout Standard
757The PMF for this generalized
758\begin_inset Formula $K$
759\end_inset
760
761 does not have a simple closed-form representation.
762 However, the PMFs for random variables representing individual share survival
763 do.
764 Let
765\begin_inset Formula $K_{i}$
766\end_inset
767
768 be a random variable such that:
769\end_layout
770
771\begin_layout Standard
772\begin_inset Formula \[
773K_{i}=\begin{cases}
7741 & \textnormal{if }s_{i}\textnormal{ survives}\\
7750 & \textnormal{if }s_{i}\textnormal{ fails}\end{cases}\]
776
777\end_inset
778
779
780\end_layout
781
782\begin_layout Standard
783The PMF for
784\begin_inset Formula $K_{i}$
785\end_inset
786
787 is very simple:
788\begin_inset Formula \[
789Pr[K_{i}=j]=\begin{cases}
790p_{i} & j=1\\
7911-p_{i} & j=0\end{cases}\]
792
793\end_inset
794
795 which can also be expressed as
796\begin_inset Formula \[
797Pr[K_{i}=j]=f\left(j\right)=\left(1-p_{i}\right)\left(1-j\right)+p_{i}\left(j\right)\]
798
799\end_inset
800
801
802\end_layout
803
804\begin_layout Standard
805Note that since each
806\begin_inset Formula $K_{i}$
807\end_inset
808
809 represents the count of shares
810\begin_inset Formula $s_{i}$
811\end_inset
812
813 that survives (either 0 or 1), if we add up all of the individual survivor
814 counts, we get the group survivor count.
815 That is:
816\begin_inset Formula \[
817\sum_{i=1}^{N}K_{i}=K\]
818
819\end_inset
820
821Effectively, we have separated
822\begin_inset Formula $K$
823\end_inset
824
825 into the series of Bernoulli trials that make it up.
826\end_layout
827
828\begin_layout Theorem
829Discrete Convolution Theorem
830\end_layout
831
832\begin_layout Theorem
833Let
834\begin_inset Formula $X$
835\end_inset
836
837 and
838\begin_inset Formula $Y$
839\end_inset
840
841 be discrete random variables with probability mass functions given by
842\begin_inset Formula $Pr\left[X=x\right]=f(x)$
843\end_inset
844
845 and
846\begin_inset Formula $Pr\left[Y=y\right]=g(y).$
847\end_inset
848
849 Let
850\begin_inset Formula $Z$
851\end_inset
852
853 be the discrete random random variable obtained by summing
854\begin_inset Formula $X$
855\end_inset
856
857 and
858\begin_inset Formula $Y$
859\end_inset
860
861.
862\end_layout
863
864\begin_layout Theorem
865The probability mass function of
866\begin_inset Formula $Z$
867\end_inset
868
869 is given by
870\begin_inset Formula \[
871Pr[Z=z]=h(z)=\left(f\star g\right)(z)\]
872
873\end_inset
874
875where
876\begin_inset Formula $\star$
877\end_inset
878
879 denotes the discrete convolution operation:
880\begin_inset Formula \[
881\left(f\star g\right)\left(n\right)=\sum_{m=-\infty}^{\infty}f\left(m\right)g\left(m-n\right)\]
882
883\end_inset
884
885
886\end_layout
887
888\begin_layout Proof
889The proof is beyond the scope of this paper.
890\end_layout
891
892\begin_layout Standard
893If we denote the PMF of
894\begin_inset Formula $K$
895\end_inset
896
897 with
898\begin_inset Formula $f$
899\end_inset
900
901 and the PMF of
902\begin_inset Formula $K_{i}$
903\end_inset
904
905 with
906\begin_inset Formula $g_{i}$
907\end_inset
908
909 (more formally,
910\begin_inset Formula $Pr[K=x]=f(x)$
911\end_inset
912
913 and
914\begin_inset Formula $Pr[K_{i}=x]=g_{i}(x)$
915\end_inset
916
917) then since
918\begin_inset Formula $K=\sum_{i=1}^{N}K_{i}$
919\end_inset
920
921, according to the discrete convolution theorem
922\begin_inset Formula $f=g_{1}\star g_{2}\star g_{3}\star\ldots\star g_{N}$
923\end_inset
924
925.
926 Since convolution is associative, this can also be written as
927\begin_inset Formula $ $
928\end_inset
929
930
931\begin_inset Formula \begin{equation}
932f=(\ldots((g_{1}\star g_{2})\star g_{3})\star\ldots)\star g_{N})\label{eq:convolution}\end{equation}
933
934\end_inset
935
936Therefore,
937\begin_inset Formula $f$
938\end_inset
939
940 can be computed as a sequence of convolution operations on the simple PMFs
941 of the random variables
942\begin_inset Formula $K_{i}$
943\end_inset
944
945.
946 In fact, for large
947\begin_inset Formula $N$
948\end_inset
949
950, equation
951\begin_inset CommandInset ref
952LatexCommand ref
953reference "eq:convolution"
954
955\end_inset
956
957 turns out to be a more effective means of computing the PMF of
958\begin_inset Formula $K$
959\end_inset
960
961 than the binomial theorem.
962 even in the case of shares with identical survival probability.
963 The reason it's better is because the calculation of
964\begin_inset Formula $\binom{n}{m}$
965\end_inset
966
967 in equation
968\begin_inset CommandInset ref
969LatexCommand ref
970reference "eq:binomial-pmf"
971
972\end_inset
973
974 produces very large values that overflow unless arbitrary precision numeric
975 representations are used.
976\end_layout
977
978\begin_layout Standard
979Note also that it is not necessary to have very simple PMFs like those of
980 the
981\begin_inset Formula $K_{i}$
982\end_inset
983
984.
985 Any share or set of shares that has a known PMF can be combined with any
986 other set with a known PMF by convolution, as long as the two share sets
987 are independent.
988 The reverse holds as well; given a group with an empirically-derived PMF,
989 in it's theoretically possible to solve for an individual PMF, and thereby
990 determine
991\begin_inset Formula $p_{i}$
992\end_inset
993
994 even when per-share data is unavailable.
995\end_layout
996
997\begin_layout Subsection
998Multiple Failure Modes
999\begin_inset CommandInset label
1000LatexCommand label
1001name "sub:Multiple-Failure-Modes"
1002
1003\end_inset
1004
1005
1006\end_layout
1007
1008\begin_layout Standard
1009In modeling share survival probabilities, it's useful to be able to analyze
1010 separately each of the various failure modes.
1011 For example, if reliable statistics for disk failure can be obtained, then
1012 a probability mass function for that form of failure can be generated.
1013 Similarly, statistics on other hardware failures, administrative errors,
1014 network losses, etc., can all be estimated independently.
1015 If those estimates can then be combined into a single PMF for a share,
1016 then we can use it to predict failures for that share.
1017\end_layout
1018
1019\begin_layout Standard
1020Combining independent failure modes for a single share is straightforward.
1021 If
1022\begin_inset Formula $p_{i,j}$
1023\end_inset
1024
1025 is the probability of survival of the
1026\begin_inset Formula $j$
1027\end_inset
1028
1029th failure mode of share
1030\begin_inset Formula $i$
1031\end_inset
1032
1033,
1034\begin_inset Formula $1\leq j\leq m$
1035\end_inset
1036
1037, then
1038\begin_inset Formula \[
1039Pr[K_{i}=k]=f_{i}(k)=\begin{cases}
1040\prod_{j=1}^{m}p_{i,j} & k=1\\
10411-\prod_{j=1}^{m}p_{i,j} & k=0\end{cases}\]
1042
1043\end_inset
1044
1045is the survival PMF.
1046\end_layout
1047
1048\begin_layout Subsection
1049Multi-share failures
1050\begin_inset CommandInset label
1051LatexCommand label
1052name "sub:Multi-share-failures"
1053
1054\end_inset
1055
1056
1057\end_layout
1058
1059\begin_layout Standard
1060If there are failure modes that affect multiple computers, we can also construct
1061 the PMF that predicts their survival.
1062 The key observation is that the PMF has non-zero probabilities only for
1063 
1064\begin_inset Formula $0$
1065\end_inset
1066
1067 survivors and
1068\begin_inset Formula $n$
1069\end_inset
1070
1071 survivors, where
1072\begin_inset Formula $n$
1073\end_inset
1074
1075 is the number of shares in the set.
1076 If
1077\begin_inset Formula $p$
1078\end_inset
1079
1080 is the probability of survival, the PMF of
1081\begin_inset Formula $K$
1082\end_inset
1083
1084, a random variable representing the number of survivors is
1085\begin_inset Formula \[
1086Pr[K=k]=f(k)=\begin{cases}
1087p & k=n\\
10880 & 0<i<n\\
10891-p & k=0\end{cases}\]
1090
1091\end_inset
1092
1093
1094\end_layout
1095
1096\begin_layout Standard
1097Group failures due to multiple independent causes can be combined as in
1098 section
1099\begin_inset CommandInset ref
1100LatexCommand ref
1101reference "sub:Multiple-Failure-Modes"
1102
1103\end_inset
1104
1105, as long as they apply to the whole group.
1106\end_layout
1107
1108\begin_layout Example
1109Putting the Pieces Together
1110\end_layout
1111
1112\begin_layout Standard
1113Sections
1114\begin_inset CommandInset ref
1115LatexCommand ref
1116reference "sub:Fixed-Reliability"
1117
1118\end_inset
1119
1120 through
1121\begin_inset CommandInset ref
1122LatexCommand ref
1123reference "sub:Multi-share-failures"
1124
1125\end_inset
1126
1127 provide ways of calculating the survival probability mass functions for
1128 a variety of share failure structures and modes.
1129 As an example of how these pieces can be used, consider a network with
1130 the following peers:
1131\end_layout
1132
1133\begin_layout Itemize
1134Four servers located in a data center in Nebraska.
1135 The machines have multiply-redundant Internet connections, with a failure
1136 probability of 0.0001.
1137 They store their shares on RAID arrays with failure probability of 0.0002.
1138 The administrative staff makes data-destroying errors with probability
1139 0.003.
1140\end_layout
1141
1142\begin_layout Itemize
1143Four servers located in a data center on the island of Hawaii.
1144 These servers have identical failure probabilities as the servers in Nebraska,
1145 except that the data center is near the edge of the crater on Mount Kilauea
1146 (nobody said examples had to be realistic).
1147 There is a 0.04 chance that the volcano will erupt and bury the data center
1148 in molten lava, destroying it entirely.
1149\end_layout
1150
1151\begin_layout Itemize
1152Four PCs located in random homes, connected to the Internet via assorted
1153 cable modems and DSL.
1154 Their network connections fail with probability 0.009.
1155 Their disks fail with probability 0.001.
1156 Their users destroy data with probability 0.05.
1157\end_layout
1158
1159\begin_layout Standard
1160If one share is placed on each of these 12 computers, what's the probability
1161 mass function of share survival? To more compactly describe PMFs, we'll
1162 denote them as probability vectors of the form
1163\begin_inset Formula $\left[\alpha_{o},\alpha_{1},\alpha_{2},\ldots\alpha_{n}\right]$
1164\end_inset
1165
1166 where
1167\begin_inset Formula $\alpha_{i}$
1168\end_inset
1169
1170 is the probability that exactly
1171\begin_inset Formula $i$
1172\end_inset
1173
1174 shares survive.
1175\end_layout
1176
1177\begin_layout Standard
1178The servers in the two data centers have individual failure probabilities
1179 of RAID failure (.0002) and administrative error (.003) giving an individual
1180 survival probability of
1181\begin_inset Formula \[
1182(1-.0002)\cdot(1-.003)=.9998\cdot.997=.9968\]
1183
1184\end_inset
1185
1186
1187\end_layout
1188
1189\begin_layout Standard
1190Using
1191\begin_inset Formula $p=.9968,n=4$
1192\end_inset
1193
1194 in equation
1195\begin_inset CommandInset ref
1196LatexCommand ref
1197reference "eq:binomial-pmf"
1198
1199\end_inset
1200
1201 gives the survival PMF
1202\begin_inset Formula \[
1203\left[1.049\times10^{-10},1.307\times10^{-7},6.105\times10^{-5},0.01271,0.9872\right]\]
1204
1205\end_inset
1206
1207which applies to each group of four servers.
1208 However, each data center also has a .0001 chance of data connection loss,
1209 which affects all four servers at once, and Hawaii has the additional .04
1210 probability of severe lava burn.
1211 If the network fails at a location, all the machines go offline together.
1212 The probability that 0 machines survive is the probability that they all
1213 fail for individual reasons (
1214\begin_inset Formula $1.049\cdot10^{-10}$
1215\end_inset
1216
1217) plus the probability they all fail because of a network outage (
1218\begin_inset Formula $.0001$
1219\end_inset
1220
1221) less the probability they fail for both reasons:
1222\begin_inset Formula \[
1223\left(1.049\times10^{-10}\right)+\left(0.0001\right)-\left[\left(1.049\times10^{-10}\right)\cdot\left(0.0001\right)\right]\approxeq0.0001\]
1224
1225\end_inset
1226
1227
1228\end_layout
1229
1230\begin_layout Standard
1231That's the
1232\begin_inset Formula $i=0$
1233\end_inset
1234
1235 element of the combined PMF.
1236 The combined probability of survival of
1237\begin_inset Formula $0<i\leq4$
1238\end_inset
1239
1240 servers is simpler: it's the probability they survive individual failure,
1241 from the individual failure PMF above, times the probability they survive
1242 network failure (.9999).
1243 So the combined survival PMF, which we'll denote as
1244\begin_inset Formula $n(i)$
1245\end_inset
1246
1247 of the Nebraska servers is
1248\begin_inset Formula \[
1249n(i)=\left[0.0001,1.306\times10^{-7},6.104\times10^{-5},0.01268,0.9872\right]\]
1250
1251\end_inset
1252
1253which has the interesting property that complete failure is 1000 times more
1254 likely than survival of one server.
1255 This is because the probability of a network outage is so much greater
1256 than simultaneous
1257\begin_inset Foot
1258status collapsed
1259
1260\begin_layout Plain Layout
1261Of course, the failures need not be truly simultaneous, they just have happen
1262 in the same interval between repair runs.
1263\end_layout
1264
1265\end_inset
1266
1267 independent failure of three servers.
1268\end_layout
1269
1270\begin_layout Standard
1271We apply the same process for the Hawaii servers, but with group survival
1272 probability of
1273\begin_inset Formula $(1-.0001)(1-.04)=.9799$
1274\end_inset
1275
1276 gives the survival PMF
1277\begin_inset Formula \[
1278h(i)=\left[0.0201,1.280\times10^{-7},5.982\times10^{-5},0.01242,0.9674\right]\]
1279
1280\end_inset
1281
1282
1283\end_layout
1284
1285\begin_layout Standard
1286Applying the convolution operator to
1287\begin_inset Formula $n(i)$
1288\end_inset
1289
1290 and
1291\begin_inset Formula $h(i)$
1292\end_inset
1293
1294, the survival PMF of all eight servers is:
1295\end_layout
1296
1297\begin_layout Standard
1298\begin_inset Formula \[
1299\left(n\star h\right)\left(i\right)=\begin{cases}
13002.010\times10^{-6} & i=0\\
13012.639\times10^{-9} & i=1\\
13021.233\times10^{-6} & i=2\\
13032.560\times10^{-4} & i=3\\
13040.01994 & i=4\\
13051.769\times10^{-6} & i=5\\
13062.756\times10^{-4} & i=6\\
13070.02452 & i=7\\
13080.9559 & i=8\end{cases}\]
1309
1310\end_inset
1311
1312
1313\end_layout
1314
1315\begin_layout Standard
1316\begin_inset VSpace defskip
1317\end_inset
1318
1319
1320\end_layout
1321
1322\begin_layout Standard
1323Note that losing four shares (
1324\begin_inset Formula $i=4$
1325\end_inset
1326
1327) is 10,000 times more likely than losing three (
1328\begin_inset Formula $i=5$
1329\end_inset
1330
1331).
1332 This is because both data centers have a whole-center failure mode, and
1333 the Hawaii center's lava burn probability is so high.
1334 Similarly, the probability of losing all of them is 1000 times higher than
1335 the probability of losing all but one.
1336\end_layout
1337
1338\begin_layout Standard
1339For the home PCs, their individual probability of survival is
1340\begin_inset Formula \[
1341(1-.009)\cdot(1-.001)\cdot(1-.05)=.991\cdot.999\cdot.95=.9405\]
1342
1343\end_inset
1344
1345
1346\end_layout
1347
1348\begin_layout Standard
1349We can then apply equation
1350\begin_inset CommandInset ref
1351LatexCommand ref
1352reference "eq:binomial-pmf"
1353
1354\end_inset
1355
1356 with
1357\begin_inset Formula $N=4$
1358\end_inset
1359
1360 and
1361\begin_inset Formula $p=.9405$
1362\end_inset
1363
1364 to compute the PMF
1365\begin_inset Formula $g(i),0\leq i\leq4$
1366\end_inset
1367
1368 for the PCs and finally compute
1369\begin_inset Formula $f(i)=\left(g\star\left(n\star h\right)\right)\left(i\right)$
1370\end_inset
1371
1372, the PMF of the whole share set.
1373 Summing the values of
1374\begin_inset Formula $f(i)$
1375\end_inset
1376
1377 for
1378\begin_inset Formula $0\leq i\leq k-1$
1379\end_inset
1380
1381 gives the probability that less than
1382\begin_inset Formula $k$
1383\end_inset
1384
1385 shares survive and the file is unrecoverable.
1386 For this example, those sums are shown in table
1387\begin_inset CommandInset ref
1388LatexCommand vref
1389reference "tab:Example-PMF"
1390
1391\end_inset
1392
1393.
1394\begin_inset Float table
1395wide false
1396sideways false
1397status collapsed
1398
1399\begin_layout Plain Layout
1400\align center
1401\begin_inset Tabular
1402<lyxtabular version="3" rows="13" columns="4">
1403<features>
1404<column alignment="center" valignment="top" width="0">
1405<column alignment="center" valignment="top" width="0">
1406<column alignment="center" valignment="top" width="0">
1407<column alignment="center" valignment="top" width="0">
1408<row>
1409<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
1410\begin_inset Text
1411
1412\begin_layout Plain Layout
1413\begin_inset Formula $k$
1414\end_inset
1415
1416
1417\end_layout
1418
1419\end_inset
1420</cell>
1421<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
1422\begin_inset Text
1423
1424\begin_layout Plain Layout
1425\begin_inset Formula $Pr[K=k]$
1426\end_inset
1427
1428
1429\end_layout
1430
1431\end_inset
1432</cell>
1433<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
1434\begin_inset Text
1435
1436\begin_layout Plain Layout
1437\begin_inset Formula $Pr[file\, loss]=Pr[K<k]$
1438\end_inset
1439
1440
1441\end_layout
1442
1443\end_inset
1444</cell>
1445<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
1446\begin_inset Text
1447
1448\begin_layout Plain Layout
1449\begin_inset Formula $N/k$
1450\end_inset
1451
1452
1453\end_layout
1454
1455\end_inset
1456</cell>
1457</row>
1458<row>
1459<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
1460\begin_inset Text
1461
1462\begin_layout Plain Layout
14631
1464\end_layout
1465
1466\end_inset
1467</cell>
1468<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
1469\begin_inset Text
1470
1471\begin_layout Plain Layout
1472\begin_inset Formula $1.60\times10^{-9}$
1473\end_inset
1474
1475
1476\end_layout
1477
1478\end_inset
1479</cell>
1480<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
1481\begin_inset Text
1482
1483\begin_layout Plain Layout
1484\begin_inset Formula $2.53\times10^{-11}$
1485\end_inset
1486
1487
1488\end_layout
1489
1490\end_inset
1491</cell>
1492<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
1493\begin_inset Text
1494
1495\begin_layout Plain Layout
149612
1497\end_layout
1498
1499\end_inset
1500</cell>
1501</row>
1502<row>
1503<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
1504\begin_inset Text
1505
1506\begin_layout Plain Layout
15072
1508\end_layout
1509
1510\end_inset
1511</cell>
1512<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
1513\begin_inset Text
1514
1515\begin_layout Plain Layout
1516\begin_inset Formula $3.80\times10^{-8}$
1517\end_inset
1518
1519
1520\end_layout
1521
1522\end_inset
1523</cell>
1524<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
1525\begin_inset Text
1526
1527\begin_layout Plain Layout
1528\begin_inset Formula $1.63\times10^{-9}$
1529\end_inset
1530
1531
1532\end_layout
1533
1534\end_inset
1535</cell>
1536<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
1537\begin_inset Text
1538
1539\begin_layout Plain Layout
15406
1541\end_layout
1542
1543\end_inset
1544</cell>
1545</row>
1546<row>
1547<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
1548\begin_inset Text
1549
1550\begin_layout Plain Layout
15513
1552\end_layout
1553
1554\end_inset
1555</cell>
1556<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
1557\begin_inset Text
1558
1559\begin_layout Plain Layout
1560\begin_inset Formula $4.04\times10^{-7}$
1561\end_inset
1562
1563
1564\end_layout
1565
1566\end_inset
1567</cell>
1568<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
1569\begin_inset Text
1570
1571\begin_layout Plain Layout
1572\begin_inset Formula $3.70\times10^{-8}$
1573\end_inset
1574
1575
1576\end_layout
1577
1578\end_inset
1579</cell>
1580<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
1581\begin_inset Text
1582
1583\begin_layout Plain Layout
15844
1585\end_layout
1586
1587\end_inset
1588</cell>
1589</row>
1590<row>
1591<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
1592\begin_inset Text
1593
1594\begin_layout Plain Layout
15954
1596\end_layout
1597
1598\end_inset
1599</cell>
1600<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
1601\begin_inset Text
1602
1603\begin_layout Plain Layout
1604\begin_inset Formula $2.06\times10^{-6}$
1605\end_inset
1606
1607
1608\end_layout
1609
1610\end_inset
1611</cell>
1612<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
1613\begin_inset Text
1614
1615\begin_layout Plain Layout
1616\begin_inset Formula $4.44\times10^{-7}$
1617\end_inset
1618
1619
1620\end_layout
1621
1622\end_inset
1623</cell>
1624<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
1625\begin_inset Text
1626
1627\begin_layout Plain Layout
16283
1629\end_layout
1630
1631\end_inset
1632</cell>
1633</row>
1634<row>
1635<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
1636\begin_inset Text
1637
1638\begin_layout Plain Layout
16395
1640\end_layout
1641
1642\end_inset
1643</cell>
1644<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
1645\begin_inset Text
1646
1647\begin_layout Plain Layout
1648\begin_inset Formula $2.10\times10^{-5}$
1649\end_inset
1650
1651
1652\end_layout
1653
1654\end_inset
1655</cell>
1656<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
1657\begin_inset Text
1658
1659\begin_layout Plain Layout
1660\begin_inset Formula $2.50\times10^{-6}$
1661\end_inset
1662
1663
1664\end_layout
1665
1666\end_inset
1667</cell>
1668<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
1669\begin_inset Text
1670
1671\begin_layout Plain Layout
16722.4
1673\end_layout
1674
1675\end_inset
1676</cell>
1677</row>
1678<row>
1679<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
1680\begin_inset Text
1681
1682\begin_layout Plain Layout
16836
1684\end_layout
1685
1686\end_inset
1687</cell>
1688<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
1689\begin_inset Text
1690
1691\begin_layout Plain Layout
1692\begin_inset Formula $0.000428$
1693\end_inset
1694
1695
1696\end_layout
1697
1698\end_inset
1699</cell>
1700<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
1701\begin_inset Text
1702
1703\begin_layout Plain Layout
1704\begin_inset Formula $2.35\times10^{-5}$
1705\end_inset
1706
1707
1708\end_layout
1709
1710\end_inset
1711</cell>
1712<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
1713\begin_inset Text
1714
1715\begin_layout Plain Layout
17162
1717\end_layout
1718
1719\end_inset
1720</cell>
1721</row>
1722<row>
1723<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
1724\begin_inset Text
1725
1726\begin_layout Plain Layout
17277
1728\end_layout
1729
1730\end_inset
1731</cell>
1732<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
1733\begin_inset Text
1734
1735\begin_layout Plain Layout
1736\begin_inset Formula $0.00417$
1737\end_inset
1738
1739
1740\end_layout
1741
1742\end_inset
1743</cell>
1744<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
1745\begin_inset Text
1746
1747\begin_layout Plain Layout
1748\begin_inset Formula $0.000452$
1749\end_inset
1750
1751
1752\end_layout
1753
1754\end_inset
1755</cell>
1756<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
1757\begin_inset Text
1758
1759\begin_layout Plain Layout
17601.7
1761\end_layout
1762
1763\end_inset
1764</cell>
1765</row>
1766<row>
1767<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
1768\begin_inset Text
1769
1770\begin_layout Plain Layout
17718
1772\end_layout
1773
1774\end_inset
1775</cell>
1776<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
1777\begin_inset Text
1778
1779\begin_layout Plain Layout
1780\begin_inset Formula $0.0157$
1781\end_inset
1782
1783
1784\end_layout
1785
1786\end_inset
1787</cell>
1788<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
1789\begin_inset Text
1790
1791\begin_layout Plain Layout
1792\begin_inset Formula $0.00462$
1793\end_inset
1794
1795
1796\end_layout
1797
1798\end_inset
1799</cell>
1800<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
1801\begin_inset Text
1802
1803\begin_layout Plain Layout
18041.5
1805\end_layout
1806
1807\end_inset
1808</cell>
1809</row>
1810<row>
1811<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
1812\begin_inset Text
1813
1814\begin_layout Plain Layout
18159
1816\end_layout
1817
1818\end_inset
1819</cell>
1820<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
1821\begin_inset Text
1822
1823\begin_layout Plain Layout
1824\begin_inset Formula $0.00127$
1825\end_inset
1826
1827
1828\end_layout
1829
1830\end_inset
1831</cell>
1832<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
1833\begin_inset Text
1834
1835\begin_layout Plain Layout
1836\begin_inset Formula $0.0203$
1837\end_inset
1838
1839
1840\end_layout
1841
1842\end_inset
1843</cell>
1844<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
1845\begin_inset Text
1846
1847\begin_layout Plain Layout
18481.3
1849\end_layout
1850
1851\end_inset
1852</cell>
1853</row>
1854<row>
1855<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
1856\begin_inset Text
1857
1858\begin_layout Plain Layout
185910
1860\end_layout
1861
1862\end_inset
1863</cell>
1864<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
1865\begin_inset Text
1866
1867\begin_layout Plain Layout
1868\begin_inset Formula $0.0230$
1869\end_inset
1870
1871
1872\end_layout
1873
1874\end_inset
1875</cell>
1876<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
1877\begin_inset Text
1878
1879\begin_layout Plain Layout
1880\begin_inset Formula $0.0216$
1881\end_inset
1882
1883
1884\end_layout
1885
1886\end_inset
1887</cell>
1888<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
1889\begin_inset Text
1890
1891\begin_layout Plain Layout
18921.2
1893\end_layout
1894
1895\end_inset
1896</cell>
1897</row>
1898<row>
1899<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
1900\begin_inset Text
1901
1902\begin_layout Plain Layout
190311
1904\end_layout
1905
1906\end_inset
1907</cell>
1908<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
1909\begin_inset Text
1910
1911\begin_layout Plain Layout
1912\begin_inset Formula $0.208$
1913\end_inset
1914
1915
1916\end_layout
1917
1918\end_inset
1919</cell>
1920<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
1921\begin_inset Text
1922
1923\begin_layout Plain Layout
1924\begin_inset Formula $0.0446$
1925\end_inset
1926
1927
1928\end_layout
1929
1930\end_inset
1931</cell>
1932<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
1933\begin_inset Text
1934
1935\begin_layout Plain Layout
19361.1
1937\end_layout
1938
1939\end_inset
1940</cell>
1941</row>
1942<row>
1943<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
1944\begin_inset Text
1945
1946\begin_layout Plain Layout
194712
1948\end_layout
1949
1950\end_inset
1951</cell>
1952<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
1953\begin_inset Text
1954
1955\begin_layout Plain Layout
1956\begin_inset Formula $0.747$
1957\end_inset
1958
1959
1960\end_layout
1961
1962\end_inset
1963</cell>
1964<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
1965\begin_inset Text
1966
1967\begin_layout Plain Layout
1968\begin_inset Formula $0.253$
1969\end_inset
1970
1971
1972\end_layout
1973
1974\end_inset
1975</cell>
1976<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
1977\begin_inset Text
1978
1979\begin_layout Plain Layout
19801
1981\end_layout
1982
1983\end_inset
1984</cell>
1985</row>
1986</lyxtabular>
1987
1988\end_inset
1989
1990
1991\end_layout
1992
1993\begin_layout Plain Layout
1994\begin_inset Caption
1995
1996\begin_layout Plain Layout
1997\align left
1998\begin_inset CommandInset label
1999LatexCommand label
2000name "tab:Example-PMF"
2001
2002\end_inset
2003
2004Example PMF
2005\end_layout
2006
2007\end_inset
2008
2009
2010\end_layout
2011
2012\begin_layout Plain Layout
2013
2014\end_layout
2015
2016\end_inset
2017
2018
2019\end_layout
2020
2021\begin_layout Standard
2022The table demonstrates the importance of the selection of
2023\begin_inset Formula $k$
2024\end_inset
2025
2026, and the tradeoff against file size expansion.
2027 Note that the survival of exactly 9 servers is significantly less likely
2028 than the survival of 8 or 10 servers.
2029 This is, again, an artifact of the group failure modes.
2030 Because of this, there is no reason to choose
2031\begin_inset Formula $k=9$
2032\end_inset
2033
2034 over
2035\begin_inset Formula $k=10$
2036\end_inset
2037
2038.
2039 Normally, reducing the number of shares needed for reassembly improve the
2040 file's chances of survival, but in this case it provides a minuscule gain
2041 in reliability at the cost of a 10% increase in bandwidth and storage consumed.
2042\end_layout
2043
2044\begin_layout Subsection
2045Share Duplication
2046\end_layout
2047
2048\begin_layout Standard
2049Before moving on to consider issues other than single-interval file loss,
2050 let's analyze one more possibility, that of
2051\begin_inset Quotes eld
2052\end_inset
2053
2054cheap
2055\begin_inset Quotes erd
2056\end_inset
2057
2058 file repair via share duplication.
2059\end_layout
2060
2061\begin_layout Standard
2062Initially, files are split using erasure coding, which creates
2063\begin_inset Formula $N$
2064\end_inset
2065
2066 unique shares, any
2067\begin_inset Formula $k$
2068\end_inset
2069
2070 of which can be used to to reconstruct the file.
2071 When shares are lost, proper repair downloads some
2072\begin_inset Formula $k$
2073\end_inset
2074
2075 shares, reconstructs the original file and then uses the erasure coding
2076 algorithm to reconstruct the lost shares, then redeploys them to peers
2077 in the network.
2078 This is a somewhat expensive process.
2079\end_layout
2080
2081\begin_layout Standard
2082A cheaper repair option is simply to direct some peer that has share
2083\begin_inset Formula $s_{i}$
2084\end_inset
2085
2086 to send a copy to another peer, thus increasing by one the number of shares
2087 in the network.
2088 This is not as good as actually replacing the lost share, though.
2089 Suppose that more shares were lost, leaving only
2090\begin_inset Formula $k$
2091\end_inset
2092
2093 shares remaining.
2094 If two of those shares are identical, because one was duplicated in this
2095 fashion, then only
2096\begin_inset Formula $k-1$
2097\end_inset
2098
2099 shares truly remain, and the file can no longer be reconstructed.
2100\end_layout
2101
2102\begin_layout Standard
2103However, such cheap repair is not completely pointless; it does increase
2104 file survivability.
2105 But by how much?
2106\end_layout
2107
2108\begin_layout Standard
2109Effectively, share duplication simply increases the probability that
2110\begin_inset Formula $s_{i}$
2111\end_inset
2112
2113 will survive, by providing two locations from which to retrieve it.
2114 We can view the two copies of the single share as one, but with a higher
2115 probability of survival than would be provided by either of the two peers.
2116 In particular, if
2117\begin_inset Formula $p_{1}$
2118\end_inset
2119
2120 and
2121\begin_inset Formula $p_{2}$
2122\end_inset
2123
2124 are the probabilities that the two peers will survive, respectively, then
2125\begin_inset Formula \[
2126Pr[s_{i}\, survives]=p_{1}+p_{2}-p_{1}p_{2}\]
2127
2128\end_inset
2129
2130
2131\end_layout
2132
2133\begin_layout Standard
2134More generally, if a single share is deployed on
2135\begin_inset Formula $n$
2136\end_inset
2137
2138 peers, each with a PMF
2139\begin_inset Formula $f_{i}(j),0\leq j\leq1,1\leq i\leq n$
2140\end_inset
2141
2142, the share survival count is a random variable
2143\begin_inset Formula $K$
2144\end_inset
2145
2146 and the probability of share loss is
2147\begin_inset Formula \[
2148Pr[K=0]=(f_{1}\star f_{2}\star\ldots\star f_{n})(0)\]
2149
2150\end_inset
2151
2152
2153\end_layout
2154
2155\begin_layout Standard
2156From that, we can construct a share PMF in the obvious way, which can then
2157 be convolved with the other share PMFs to produce the share set PMF.
2158\end_layout
2159
2160\begin_layout Example
2161Suppose a file has
2162\begin_inset Formula $N=10,k=3$
2163\end_inset
2164
2165 and that all servers have survival probability
2166\begin_inset Formula $p=.9$
2167\end_inset
2168
2169.
2170 Given a full complement of shares,
2171\begin_inset Formula $Pr[\textrm{file\, loss}]=3.74\times10^{-7}$
2172\end_inset
2173
2174.
2175 Suppose that four shares are lost, which increases
2176\begin_inset Formula $Pr[\textrm{file\, loss}]$
2177\end_inset
2178
2179 to
2180\begin_inset Formula $.00127$
2181\end_inset
2182
2183, a value
2184\begin_inset Formula $3400$
2185\end_inset
2186
2187 times greater.
2188 Rather than doing a proper reconstruction, we could direct four peers still
2189 holding shares to send a copy of their share to new peer, which changes
2190 the composition of the shares from one of six, unique
2191\begin_inset Quotes eld
2192\end_inset
2193
2194standard
2195\begin_inset Quotes erd
2196\end_inset
2197
2198 shares, to one of two standard shares, each with survival probability
2199\begin_inset Formula $.9$
2200\end_inset
2201
2202 and four
2203\begin_inset Quotes eld
2204\end_inset
2205
2206doubled
2207\begin_inset Quotes erd
2208\end_inset
2209
2210 shares, each with survival probability
2211\begin_inset Formula $2p-p^{2}\approxeq.99$
2212\end_inset
2213
2214.
2215\end_layout
2216
2217\begin_layout Example
2218Combining the two single-peer share PMFs with the four double-share PMFs
2219 gives a new file survival probability of
2220\begin_inset Formula $6.64\times10^{-6}$
2221\end_inset
2222
2223.
2224 Not as good as a full repair, but still quite respectable.
2225 Also, if storage were not a concern, all six shares could be duplicated,
2226 for a
2227\begin_inset Formula $Pr[file\, loss]=1.48\times10^{-7}$
2228\end_inset
2229
2230, which is actually three time better than the nominal case.
2231\end_layout
2232
2233\begin_layout Example
2234The reason such cheap repairs may be attractive in many cases is that distribute
2235d bandwidth is cheaper than bandwidth through a single peer.
2236 This is particularly true if that single peer has a very slow connection,
2237 which is common for home computers -- especially in the outbound direction.
2238\end_layout
2239
2240\begin_layout Section
2241Long-Term Reliability
2242\end_layout
2243
2244\begin_layout Standard
2245Thus far, we've focused entirely on the probability that a file survives
2246 the interval
2247\begin_inset Formula $A$
2248\end_inset
2249
2250 between repair times.
2251 The probability that a file survives long-term, though, is also important.
2252 As long as the probability of failure during a repair period is non-zero,
2253 a given file will eventually be lost.
2254 We want to know the probability of surviving for time
2255\begin_inset Formula $T$
2256\end_inset
2257
2258, and how the parameters
2259\begin_inset Formula $A$
2260\end_inset
2261
2262 (time between repairs) and
2263\begin_inset Formula $L$
2264\end_inset
2265
2266 (allowed share low watermark) affect survival time.
2267\end_layout
2268
2269\begin_layout Standard
2270To model file survival time, let
2271\begin_inset Formula $T$
2272\end_inset
2273
2274 be a random variable denoting the time at which a given file becomes unrecovera
2275ble, and
2276\begin_inset Formula $R(t)=Pr[T>t]$
2277\end_inset
2278
2279 be a function that gives the probability that the file survives to time
2280 
2281\begin_inset Formula $t$
2282\end_inset
2283
2284.
2285 
2286\begin_inset Formula $R(t)$
2287\end_inset
2288
2289 is the cumulative distribution function of
2290\begin_inset Formula $T$
2291\end_inset
2292
2293.
2294\end_layout
2295
2296\begin_layout Standard
2297Most survival functions are continuous, but
2298\begin_inset Formula $R(t)$
2299\end_inset
2300
2301 is inherently discrete and stochastic.
2302 The time steps are the repair intervals, each of length
2303\begin_inset Formula $A$
2304\end_inset
2305
2306, so
2307\begin_inset Formula $T$
2308\end_inset
2309
2310-values are multiples of
2311\begin_inset Formula $A$
2312\end_inset
2313
2314.
2315 During each interval, the file's shares degrade according to the probability
2316 mass function of
2317\begin_inset Formula $K$
2318\end_inset
2319
2320.
2321\end_layout
2322
2323\begin_layout Subsection
2324Aggressive Repair
2325\end_layout
2326
2327\begin_layout Standard
2328Let's first consider the case of an aggressive repairer.
2329 Every interval, this repairer checks the file for share losses and restores
2330 them.
2331 Thus, at the beginning of each interval, the file always has
2332\begin_inset Formula $N$
2333\end_inset
2334
2335 shares, distributed on servers with various individual and group failure
2336 probabilities, which will survive or fail per the output of random variable
2337 
2338\begin_inset Formula $K$
2339\end_inset
2340
2341.
2342\end_layout
2343
2344\begin_layout Standard
2345For any interval, then, the probability that the file will survive is
2346\begin_inset Formula $f\left(k\right)=Pr[K\geq k]$
2347\end_inset
2348
2349.
2350 Since each interval success or failure is independent, and assuming the
2351 share reliabilities remain constant over time,
2352\begin_inset Formula \begin{equation}
2353R\left(t\right)=f(k)^{t}\end{equation}
2354
2355\end_inset
2356
2357
2358\end_layout
2359
2360\begin_layout Standard
2361This simple survival function makes it simple to select parameters
2362\begin_inset Formula $N$
2363\end_inset
2364
2365 and
2366\begin_inset Formula $K$
2367\end_inset
2368
2369 such that
2370\begin_inset Formula $R(t)\geq r$
2371\end_inset
2372
2373, where
2374\begin_inset Formula $r$
2375\end_inset
2376
2377 is a user-specified parameter indicating the desired probability of survival
2378 to time
2379\begin_inset Formula $t$
2380\end_inset
2381
2382.
2383 Specifically, we can solve for
2384\begin_inset Formula $f\left(k\right)$
2385\end_inset
2386
2387 in
2388\begin_inset Formula $r\leq f\left(k\right)^{t}$
2389\end_inset
2390
2391, giving:
2392\begin_inset Formula \begin{equation}
2393f\left(k\right)\geq\sqrt[t]{r}\end{equation}
2394
2395\end_inset
2396
2397
2398\end_layout
2399
2400\begin_layout Standard
2401So, given a PMF
2402\begin_inset Formula $f\left(k\right)$
2403\end_inset
2404
2405, to assure the survival of a file to time
2406\begin_inset Formula $t$
2407\end_inset
2408
2409 with probability at least
2410\begin_inset Formula $r$
2411\end_inset
2412
2413, choose
2414\begin_inset Formula $k$
2415\end_inset
2416
2417 such that
2418\begin_inset Formula $f\left(k\right)\geq\sqrt[t]{r}$
2419\end_inset
2420
2421.
2422 For example, if
2423\begin_inset Formula $A$
2424\end_inset
2425
2426 is one month, and
2427\begin_inset Formula $r=1-\nicefrac{1}{10^{6}}$
2428\end_inset
2429
2430 and
2431\begin_inset Formula $t=120$
2432\end_inset
2433
2434, or 10 years, we calculate
2435\begin_inset Formula $f\left(k\right)\geq\sqrt[120]{.999999}\approx0.999999992$
2436\end_inset
2437
2438.
2439 Per the PMF of table
2440\begin_inset CommandInset ref
2441LatexCommand ref
2442reference "tab:Example-PMF"
2443
2444\end_inset
2445
2446, this means
2447\begin_inset Formula $k=2$
2448\end_inset
2449
2450, achieves the goal, at the cost of a six-fold expansion in stored file
2451 size.
2452 If the lesser goal of no more than
2453\begin_inset Formula $\nicefrac{1}{1000}$
2454\end_inset
2455
2456 probability of loss is taken, then since
2457\begin_inset Formula $\sqrt[120]{.9999}=.999992$
2458\end_inset
2459
2460,
2461\begin_inset Formula $k=5$
2462\end_inset
2463
2464 achieves the goal with an expansion factor of
2465\begin_inset Formula $2.4$
2466\end_inset
2467
2468.
2469\end_layout
2470
2471\begin_layout Subsection
2472Repair Cost
2473\end_layout
2474
2475\begin_layout Standard
2476The simplicity and predictability of aggressive repair is attractive, but
2477 there is a downside: Repairs cost processing power and bandwidth.
2478 The processing power is proportional to the size of the file, since the
2479 whole file must be reconstructed and then re-processed using the Reed-Solomon
2480 algorithm, while the bandwidth cost is proportional to the number of missing
2481 shares that must be replaced,
2482\begin_inset Formula $N-K$
2483\end_inset
2484
2485.
2486\end_layout
2487
2488\begin_layout Standard
2489Let
2490\begin_inset Formula $c\left(s,d,k\right)$
2491\end_inset
2492
2493 be a cost function that combines the processing cost of regenerating a
2494 file of size
2495\begin_inset Formula $s$
2496\end_inset
2497
2498 and the bandwidth cost of downloading a file of size
2499\begin_inset Formula $s$
2500\end_inset
2501
2502 and uploading
2503\begin_inset Formula $d$
2504\end_inset
2505
2506 shares each of size
2507\begin_inset Formula $\nicefrac{s}{k}$
2508\end_inset
2509
2510.
2511 Also, let
2512\begin_inset Formula $D$
2513\end_inset
2514
2515 denote the random variable
2516\begin_inset Formula $N-K$
2517\end_inset
2518
2519, which is the number of shares that must be redistributed to bring the
2520 file share set back up to
2521\begin_inset Formula $N$
2522\end_inset
2523
2524 after degrading during an interval.
2525 The probability mass function of
2526\begin_inset Formula $D$
2527\end_inset
2528
2529 is
2530\begin_inset Formula \[
2531Pr[D=d]=f(d)=\begin{cases}
2532Pr\left[K=N\right]+Pr[K<k] & d=0\\
2533Pr\left[K=N-d\right] & 0<d\leq N-k\\
25340 & N-k<d\leq N\end{cases}\]
2535
2536\end_inset
2537
2538
2539\end_layout
2540
2541\begin_layout Standard
2542The expected cost of repairs in a given interval, then, is simply
2543\begin_inset Formula $c\left(s,E\left[D\right],k\right)$
2544\end_inset
2545
2546 where E is the expected value function -- in this case:
2547\begin_inset Formula \begin{align*}
2548E[D] & =\sum_{d=0}^{N}d\cdot Pr\left[D=d\right]\\
2549 & =0\cdot Pr\left[D=0\right]+\sum_{d=1}^{N-k}\left\{ d\cdot Pr\left[K=N-d\right]\right\} +\sum_{d=N-k+1}^{N}\left\{ d\cdot0\right\} \\
2550 & =\sum_{d=1}^{N-k}d\cdot Pr\left[K=N-d\right]\end{align*}
2551
2552\end_inset
2553
2554
2555\end_layout
2556
2557\begin_layout Standard
2558Since each interval starts with a full complement of shares, the expected
2559 repair cost for each interval is the same, and the cost for file that survives
2560 for
2561\begin_inset Formula $t$
2562\end_inset
2563
2564 intervals is
2565\begin_inset Formula $t\cdot c\left(s,E\left[D\right]\right)$
2566\end_inset
2567
2568.
2569 To calculate the lifetime repair cost, we just take the limit over all
2570 intervals as
2571\begin_inset Formula $t\rightarrow\infty$
2572\end_inset
2573
2574, discounting each cost by the probability that the file has already failed.
2575 So, the lifetime expected repair cost is
2576\begin_inset Formula \begin{align*}
2577\sum_{t=1}^{\infty}R\left(t-1\right)c\left(s,E\left[D\right],k\right) & =c\left(s,E\left[D\right],k\right)\sum_{t=1}^{\infty}R\left(t-1\right)\\
2578 & =c\left(s,E\left[D\right],k\right)\sum_{t=1}^{\infty}f\left(k\right)^{t-1}\\
2579 & =c\left(s,E\left[D\right],k\right)\cdot\frac{1}{1-f\left(k\right)}\\
2580 & =\frac{c\left(s,E\left[D\right],k\right)}{1-f\left(k\right)}\end{align*}
2581
2582\end_inset
2583
2584
2585\end_layout
2586
2587\begin_layout Standard
2588It is also necessary to discount future cost, since CPU and bandwidth are
2589 both going to get cheaper over time.
2590 To accommodate this, we throw in an addition per-period discount rate
2591\begin_inset Formula $r$
2592\end_inset
2593
2594.
2595 In accordance with common discount rate usage, the discount multiplier
2596 at time
2597\begin_inset Formula $t$
2598\end_inset
2599
2600 is
2601\begin_inset Formula $\left(1-r\right)^{t}$
2602\end_inset
2603
2604.
2605 This gives:
2606\begin_inset Formula \begin{align*}
2607\sum_{t=1}^{\infty}\left(1-r\right){}^{t}R\left(t-1\right)c\left(s,E\left[D\right],k\right) & =c\left(s,E\left[D\right],k\right)\sum_{t=1}^{\infty}\left(1-r\right)^{t}f\left(k\right)^{t-1}\\
2608 & =c\left(s,E\left[D\right],k\right)\sum_{t=1}^{\infty}\left(1-r\right)^{t}f\left(k\right)^{t-1}\\
2609 & =c\left(s,E\left[D\right],k\right)\left(1-r\right)\sum_{t=1}^{\infty}\left(1-r\right)^{t-1}f\left(k\right)^{t-1}\\
2610 & =\frac{c\left(s,E\left[D\right],k\right)\left(1-r\right)}{1-\left(1-r\right)f\left(k\right)}\end{align*}
2611
2612\end_inset
2613
2614If
2615\begin_inset Formula $r=0$
2616\end_inset
2617
2618 this collapses to the previous result, as one would expect.
2619\end_layout
2620
2621\begin_layout Subsection
2622Non-aggressive Repair
2623\end_layout
2624
2625\begin_layout Standard
2626Need to write this.
2627\end_layout
2628
2629\begin_layout Section
2630Time-Sensitive Retrieval
2631\end_layout
2632
2633\begin_layout Standard
2634The above work has almost entirely ignored the distinction between availability
2635 and reliability.
2636\end_layout
2637
2638\begin_layout Standard
2639Need to write this.
2640\end_layout
2641
2642\end_body
2643\end_document
Note: See TracBrowser for help on using the repository browser.