DAS: Fraud Detection
Authors: Robert Raynor @EigenDA and Lulu Zhou @EigenDA
Special thanks to Joachim Neu @a16z for insightful feedback, valuable discussion, and for advocating the strong adversarial model for DAS.
In this post, we explore data availability sampling (DAS) under the strongest adversarial model considered so far—where the adversary sees all sampling requests before choosing which chunks to reveal—and show that even in this worst-case scenario, it’s still very unlikely they can fool even a small fraction of a large enough group of observers.
This is the first blog post in our series on the DAS protocol. Stay tuned for the second installment, DAS: Data Recovery, coming soon.
We assume the reader has a basic understanding of the data availability sampling (DAS) protocol, which is well explained here and there.
Section 1: Definition and Notations
In this section, we introduce the definitions and notation used to analyze the reliability of the DAS protocol.
1.1 DAS protocol overview
In the DAS protocol, the data that is supposed to be available is called a blob. To enable statistical estimation without requiring full observation of the blob, DAS protocols employ Reed-Solomon encoding. Given a blob of \ell chunks, this encoding extends it to a total of \ell / \gamma chunks, where \gamma is the coding rate. A key property of this encoding is that any subset of at least \ell encoded chunks suffices to fully reconstruct the original blob.
There is a group of observers, each of whom randomly samples k chunks and requests them from the data provider to check the availability of the blob. An observer considers the blob available if it receives all the samples it requested.
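For concreteness, here is a minimal Python sketch of this sampling step, using the parameter values introduced later in this post; the variable and function names are ours and not part of any particular DAS implementation.

```python
import random

C = 8192        # total number of encoded chunks (value used later in this post)
GAMMA = 1 / 8   # coding rate; GAMMA * C chunks suffice to decode the blob
K = 6           # number of chunks sampled per observer

def sample_chunks(k: int = K, c: int = C) -> list[int]:
    """Each observer draws k chunk indices independently and uniformly at random."""
    return [random.randrange(c) for _ in range(k)]

def observer_accepts(requested: list[int], served: set[int]) -> bool:
    """An observer considers the blob available iff every requested chunk was served."""
    return all(a in served for a in requested)
```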
1.2 Property of the protocol
We define availability as follows:
Definition (Availability) A data blob is available if the honest observers collectively have a set of chunks from which the blob can be reconstructed (through Reed-Solomon decoding).
In this blog post, we aim to prove that if a certain fraction of the honest observers receive all the samples they request (and thus consider the blob available), then the blob is actually available (in the sense of the above definition).
1.3 Statistical Formulation
The set of honest observers is denoted by \mathcal{I}. We define:
- \mathcal{C} as the set of encoded chunks corresponding to the blob.
- \mathcal{C}_i \subseteq \mathcal{C} as the set of chunks sampled by observer i \in \mathcal{I}
The sample space of the observers’ observations is then given by
\Omega = \{0,1\}^{\mathcal{I} \times \mathcal{C}}.
For an element \omega \in \Omega, each entry \omega_{i,a} \in \{0,1\} represents the presence (1) or absence (0) of a chunk a in the responses to the samples of observer i.
As mentioned, a blob is available if the observers collectively have enough chunks to reconstruct the blob, i.e., if the condition
\left| \bigcup_{i \in \mathcal{I}} \{a \in \mathcal{C}_i : \omega_{i,a} = 1\} \right| \ge \gamma c
is satisfied. Corresponding to this condition, we define the event R that the blob is available:
R = \left\{ \omega \in \Omega : \left| \bigcup_{i \in \mathcal{I}} \{a \in \mathcal{C}_i : \omega_{i,a} = 1\} \right| \ge \gamma c \right\}.
Each observer makes use of its own local detector to detect the unavailability of a blob. We will consider the detector as defined by the event T_i that all samples requested by observer i are received:
T_i = \{ \omega \in \Omega : \omega_{i,a} = 1 \ \text{for all } a \in \mathcal{C}_i \}.
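As a small illustration (our own data layout, not prescribed by the protocol), the events T_i and R can be evaluated from an outcome \omega as follows, where samples[i] plays the role of \mathcal{C}_i and omega[(i, a)] is the entry \omega_{i,a}:

```python
def event_T(i, samples: dict, omega: dict) -> bool:
    """T_i: observer i received every chunk it requested."""
    return all(omega[(i, a)] == 1 for a in samples[i])

def event_R(samples: dict, omega: dict, ell: int) -> bool:
    """R: the honest observers collectively received at least ell = gamma * c
    distinct encoded chunks, so the blob can be Reed-Solomon decoded."""
    received = {a for i, chunks in samples.items() for a in chunks if omega[(i, a)] == 1}
    return len(received) >= ell
```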
Section 2: Analysis of Adversary’s Game
In this section, we analyze a strong adversary and show that the probability of a certain portion of honest observers being wrongly convinced that the data is available—while the blob is not actually available, i.e., cannot be decoded from the chunks collectively sampled by honest observers—is low.
2.1: Parameters
We provide a summary of the key parameters:
| Term | Symbol | Description |
|---|---|---|
| Total Chunks | c | The total number of chunks after encoding (currently c = 8192) |
| Coding Rate | \gamma | The total number of chunks before encoding / total number of chunks after encoding, currently \gamma = 1/8. \ell = \gamma c is the minimum number of chunks required to recover the blob. |
| Observer set | \mathcal{I} | The set of all the honest observers |
| Total Honest observers | m | The total number of honest observers. m = |\mathcal{I}| |
| Chunk set of observer i | \mathcal{C}_i | The set of chunks sampled by observer i |
| Number of chunks sampled | k | Number of chunks sampled per blob by each observer. k =|\mathcal{C}_i| |
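A quick sketch of the derived quantities implied by these parameters (our variable names; p is the per-observer probability derived in Section 2.3.1 below):

```python
c = 8192              # total encoded chunks
gamma = 1 / 8         # coding rate
k = 6                 # samples per observer (one of the values analyzed below)

ell = int(gamma * c)  # minimum chunks needed to recover the blob
p = gamma ** k        # probability that all k samples of one observer land in a
                      # fixed set of gamma * c chunks
print(ell, p)         # 1024 3.814697265625e-06
```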
2.2: Comparison with Existing Analyses
A commonly used assumption in DAS is that each chunk request is unlinkable to its originating observer. Under this assumption, the probability of successfully deceiving any individual observer without revealing the blob decays exponentially with the number of requests. This amounts to a weak adversary model (called the “bulletin board model” here), because the assumption heavily restricts the adversary. In practice, however, unlinkability often cannot be guaranteed, and the weak model is therefore considered unrealistic.
An alternative and considerably stronger adversarial model assumes that the adversary can link chunk requests to the originating observers. There is existing analysis that derives the minimal number of observers required to make it unlikely for an adversary to trick all of them into believing the data is available when it is not.
In this blog post, we consider an even stronger adversary—one that can first observe all sampling requests from all observers, and only then selectively choose a subset of chunks to release in response. In the following section, we will prove that even under this stronger adversarial model, it is highly unlikely for the adversary to successfully trick even a certain small fraction of a sufficiently large group of observers into believing the data is available when it is not.
2.3 Analysis for the Strong Adversary
We consider a powerful adversarial model for the malicious data provider: it first collects the requested samples from each observer—knowing exactly which chunks each node requested—and then selects a minimal set of chunks that can satisfy the requests of as many observers as possible.
We formalize the challenge faced by the strong adversary as a game, defined as follows:
(Sample Coverage Game) The adversary is given the sampling requests from m observers, where each observer samples k random chunks. The adversary wins if it can fulfill the sampling requests of at least \alpha_1 m observers (tricking them into believing the blob to be available) while revealing fewer than \gamma c chunks (so that the blob remains unavailable).
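To make the game concrete, here is a hedged simulation sketch with a simple heuristic adversary of our own choosing: it reveals the \gamma c - 1 most frequently requested chunks, and we count how many observers it fools. Finding the truly optimal set is a maximum-coverage-style problem, so this only illustrates the setup rather than the adversary's best strategy.

```python
import random
from collections import Counter

def simulate_coverage_game(m: int, k: int, c: int, gamma: float) -> float:
    """Fraction of observers fully satisfied when a greedy adversary reveals
    the gamma*c - 1 most requested chunks (so the blob stays unrecoverable)."""
    requests = [[random.randrange(c) for _ in range(k)] for _ in range(m)]
    counts = Counter(a for req in requests for a in req)
    budget = int(gamma * c) - 1
    revealed = {a for a, _ in counts.most_common(budget)}
    fooled = sum(all(a in revealed for a in req) for req in requests)
    return fooled / m

# Example: with m = 2686 observers and k = 6 samples each, the fooled fraction
# stays far below alpha_1 = 1/8 in typical runs.
print(simulate_coverage_game(m=2686, k=6, c=8192, gamma=1 / 8))
```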
In the remainder of this subsection, we will prove that the probability of the adversary winning this game is negligible as m becomes sufficiently large.
2.3.1 Concentration results for sampler sets
(Theorem: Sample Concentration) Suppose there is a set of honest observers of size m. We want to ensure that the probability that a fraction of the honest observers, denoted \alpha_1 m, can be convinced that the data is available while the blob is not collectively available to them is negligible. In other words, we want the probability that there exists a set of at least \alpha_1 m honest observers whose samples are all contained within a set of size \gamma c - 1 to be smaller than a negligible number \epsilon = 2^{-128}. That is:
P(\exists \mathcal{N} \subset \mathcal{I} : |\mathcal{N}| \ge \alpha_1 m \space \land \space |\cup_{i\in \mathcal{N}} \mathcal{C}_i| \le \gamma c - 1) \le \epsilon.
Proof:
Since |\cup_{i\in \mathcal{N}} \mathcal{C}_i| \le \gamma c - 1 implies |\cup_{i\in \mathcal{N}} \mathcal{C}_i| \le \gamma c, it suffices (and simplifies notation) to prove the following bound instead: P(\exists \mathcal{N} \subset \mathcal{I} : |\mathcal{N}| \ge \alpha_1 m \space \land \space |\cup_{i\in \mathcal{N}} \mathcal{C}_i| \le \gamma c) \le \epsilon.
We will proceed in three steps. First, we consider an upper bound on the number of possible ways the adversary can choose to release \gamma c chunks. Second, for each of these combinations, we show that the probability of tricking a significant number of observers into believing the data is available is extremely low. Finally, we combine these results to argue that—regardless of the strategy the adversary employs—it is overall highly unlikely that many observers can be tricked into believing the blob is available when it is not.
Step 1:
Let C_0 be a set of \gamma c chunks. There are \binom{c}{\gamma c} different ways of choosing C_0. Using the information-theoretic bound on the binomial coefficient, we have
\binom{c}{\gamma c} \le e^{c H(\gamma)},
where H(\gamma) = -[\gamma \ln(\gamma) + (1-\gamma)\ln(1-\gamma)] is the binary entropy function (in nats).
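This bound is easy to check numerically for the concrete parameters used here; a quick sketch assuming c = 8192 and \gamma = 1/8:

```python
import math

c, gamma = 8192, 1 / 8

def H(g: float) -> float:
    """Binary entropy in nats."""
    return -(g * math.log(g) + (1 - g) * math.log(1 - g))

# ln C(c, gamma*c) via log-gamma, compared against the entropy bound c * H(gamma)
gc = int(gamma * c)
log_binom = math.lgamma(c + 1) - math.lgamma(gc + 1) - math.lgamma(c - gc + 1)
print(log_binom, c * H(gamma))   # ≈ 3082.2 vs ≈ 3086.5, confirming the bound
```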
Step 2:
Let \mathcal{X} be the set of observers whose k samples all fall completely within the set C_0. Note that |\mathcal{X}| \ge \alpha_1 m \Leftrightarrow \exists \mathcal{N} \subset \mathcal{I} : |\mathcal{N}| \ge \alpha_1 m \space \land \space \cup_{i\in \mathcal{N}} \mathcal{C}_i \subseteq C_0. We aim to derive the probability that |\mathcal{X}| \ge \alpha_1 m. Since each observer samples k chunks uniformly at random, the probability that any one of its samples falls into C_0 is \frac{\gamma c}{c} = \gamma. Since each sample is chosen independently, the probability that all k chunks sampled by a given observer are fully contained within C_0 is p = \gamma^k. Furthermore, these events are independent across observers. Therefore, |\mathcal{X}|, the number of observers whose samples fall completely within C_0, is a binomial random variable (i.e., the sum of m independent Bernoulli random variables with parameter p).
We can apply the Chernoff-Hoeffding bound to |\mathcal{X}|. Substituting \epsilon with \alpha_1 - p in the additive form of the Chernoff-Hoeffding theorem, we get
P(|\mathcal{X}| \ge \alpha_1 m) \le e^{-m D(\alpha_1 || p)},
where D(\alpha_1||p) is the KL-divergence between the Bernoulli distributions with parameters \alpha_1 and p, given by D(\alpha_1||p) = \alpha_1 \ln(\frac{\alpha_1}{p})+(1-\alpha_1)\ln(\frac{1-\alpha_1}{1-p}).
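The exponent in this bound is easy to evaluate; here is a small sketch (helper names are ours) computing D(\alpha_1 || p) and the log of the resulting bound for k = 6, \alpha_1 = 1/8, and m = 2686:

```python
import math

def kl_bernoulli(a: float, p: float) -> float:
    """KL divergence D(a || p) between Bernoulli(a) and Bernoulli(p), in nats."""
    return a * math.log(a / p) + (1 - a) * math.log((1 - a) / (1 - p))

gamma, k, alpha1, m = 1 / 8, 6, 1 / 8, 2686
p = gamma ** k
log_bound = -m * kl_bernoulli(alpha1, p)   # natural log of the Chernoff bound
print(kl_bernoulli(alpha1, p), log_bound)  # ≈ 1.18 and ≈ -3177 (for one fixed C_0)
```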
Recall that p = \gamma^k, and |\mathcal{X}| \ge \alpha_1 m \Leftrightarrow \exists \mathcal{N} \subset \mathcal{I} : |\mathcal{N}| \ge \alpha_1 m \space \land \space \cup_{i\in \mathcal{N}} \mathcal{C}_i \subseteq C_0, which gives:
P(\exists \mathcal{N} \subset \mathcal{I} : |\mathcal{N}| \ge \alpha_1 m \space \land \space \cup_{i\in \mathcal{N}} \mathcal{C}_i \subseteq C_0) \le e^{-m D(\alpha_1 || \gamma^k)}.
For simplicity of notation in what follows, we introduce \beta_1 = m/c \Leftrightarrow m = \beta_1 c. Therefore, we have:
P(\exists \mathcal{N} \subset \mathcal{I} : |\mathcal{N}| \ge \alpha_1 m \space \land \space \cup_{i\in \mathcal{N}} \mathcal{C}_i \subseteq C_0) \le e^{-\beta_1 c D(\alpha_1 || \gamma^k)}.
Step 3:
Thus, by the union bound over all \binom{c}{\gamma c} possible choices of C_0, the total probability that there exists a collection of \gamma c chunks that can convince at least \alpha_1 m honest observers is bounded by:
P(\exists \mathcal{N} \subset \mathcal{I} : |\mathcal{N}| \ge \alpha_1 m \space \land \space |\cup_{i\in \mathcal{N}} \mathcal{C}_i| \le \gamma c) \le \binom{c}{\gamma c} \, e^{-\beta_1 c D(\alpha_1 || \gamma^k)} \le e^{c [H(\gamma) - \beta_1 D(\alpha_1 || \gamma^k)]}.
Assuming c = 8192, to make the probability above negligible (\leq \exp(-8192 \times 0.011) \le \exp(-90) \le 2^{-128}), the following inequality should be satisfied:
H(\gamma) - \beta_1 D(\alpha_1 || \gamma^k) \le -0.011, \quad \text{i.e.,} \quad \beta_1 \ge \frac{H(\gamma) + 0.011}{D(\alpha_1 || \gamma^k)}.
We derived the resulting bound on \beta_1 for \gamma = 1/8 and the values of k and \alpha_1 specified in the table below:
| k | \alpha_1 | \beta_1 | m =\beta_1 c |
|---|---|---|---|
| 4 | 1/8 | 0.585 | 4791 |
| 5 | 1/8 | 0.421 | 3442 |
| 6 | 1/8 | 0.328 | 2686 |
| 7 | 1/8 | 0.269 | 2202 |
| 8 | 1/8 | 0.228 | 1866 |
| 12 | 1/8 | 0.142 | 1159 |
| 16 | 1/8 | 0.103 | 840 |
| 20 | 1/8 | 0.081 | 659 |
| 24 | 1/8 | 0.067 | 542 |
The code for calculating the results shown in the table above can be found here.
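For completeness, here is a hedged reimplementation (not the linked code) that reproduces the table from the bound derived above, rounding \beta_1 up to three decimals and taking m = \lceil \beta_1 c \rceil:

```python
import math

def H(g: float) -> float:
    """Binary entropy in nats."""
    return -(g * math.log(g) + (1 - g) * math.log(1 - g))

def kl_bernoulli(a: float, p: float) -> float:
    """KL divergence D(a || p) between Bernoulli(a) and Bernoulli(p), in nats."""
    return a * math.log(a / p) + (1 - a) * math.log((1 - a) / (1 - p))

c, gamma, alpha1, margin = 8192, 1 / 8, 1 / 8, 0.011

for k in (4, 5, 6, 7, 8, 12, 16, 20, 24):
    beta1 = (H(gamma) + margin) / kl_bernoulli(alpha1, gamma ** k)
    print(k, math.ceil(beta1 * 1000) / 1000, math.ceil(beta1 * c))
```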
To illustrate the meaning of the table, here is an interpretation of its third row (k = 6). Suppose we have 2686 honest observers, each of whom requests k = 6 random chunks of a blob. There are two possible outcomes of the sampling:
- Fewer than 1/8 of them receive all the chunks they requested. In this case, more than 7/8 of them will consider the data unavailable. As a result, the social consensus will decide that the data is unavailable.
- At least 1/8 of them receive all the chunks they requested. In this case, these honest observers (at least \lceil 2686 \times 1/8 \rceil = 336 of them) collectively hold more than \gamma c = 1024 unique chunks with high probability (> 1 - 2^{-128}), which means they are able to recover the blob.
In other words, the probability for the adversary to win the sample coverage game is negligible for k = 6, \alpha_1 = 1/8 and m = 2686.
Section 3: Conclusion
In this blog post, we model and formally prove key properties of the DAS protocol. Under a strong adversarial model, in which the malicious data provider can attribute sample requests to the originating observers and choose which requests to serve only after receiving all of them, we show that it is overwhelmingly unlikely for the adversary to make even a small fraction of observers wrongly believe the data is available without actually revealing the blob, provided the number of observers is sufficiently large. This result lays a solid foundation for enabling the community to collectively make correct decisions about data availability, thereby making social consensus feasible.