Epsilon-Biased Spaces

1. Introduction

Consider a set of ${n}$ ${0-1}$ random variables ${X =X_1, X_2,...,X_n}$ with the support ${\Omega \subset \{0,1\}^n}$ such that ${(X_1, X_2,...,X_n)}$ are uniformly distributed (${\mathcal{U}_n}$) over ${\{0,1\}^n}$ . What happens if we relax the notion of uniformity a little bit? How large does ${\Omega}$ have to be if we need ${(X_1, X_2,...,X_n)}$ to be only approximately uniformly distributed?

There are quite a few well studied notions of what is meant to approximate some distribution. One such idea is of the statistical distance:

Definition 1 (Statistical Distance) If ${P(x)}$ and ${Q(x)}$ are two probability distributions on some (finite) space ${\mathcal{X}}$, then the statistical distance between ${P}$ and ${Q}$ is defined as

$\displaystyle \Delta(P,Q) = max_{S: S \subset \mathcal{X}} |P(x \in S) - Q(x \in S)| \ \ \ \ \ (1)$

Suppose, given an ${S \in }$, one could efficiently determine what ${P(S)}$ and ${Q(S)}$ were. Then this definition inspires a statistical test to identify if ${P}$ and ${Q}$ are far apart: Find out ${P(S)}$ and ${Q(S)}$ for a fixed ${S}$ and see if they differ by a lot. If we were trying to prove that ${P}$ and ${Q}$ are ${\epsilon}$-close, clearly, the worst input we need to consider to confirm this is the set ${S}$ that satisfies equation 1 with equality. In other words, low-statistical distance guarantees that ${P}$ and ${Q}$ can pass such a closeness test.

Suppose a BPP algorithm say ${\mathcal{A}}$ uses uniformly distributed random variables ${X}$ to return the right answer with error probability at most ${\delta}$ and we want to replace them by some approximately uniform random variables to reduce the amount of randomness required. ${\mathcal{A}}$ can be regarded as a function of the input and the random variables ${X}$. Suppose ${\mathcal{A}}$ functions correctly when ${X \in S}$ for some ${S \subset \{0,1\}^n}$. Then

$\displaystyle Prob_{X \sim \mathcal{U}_n}[ X \in \{0,1\}^n \setminus S] \leq \delta.$

If ${\mathcal{D}_{n}}$ is some approximately uniform distribution with ${\Delta(\mathcal{D}_n,\mathcal{U}_n) \leq \epsilon}$ then,

$\displaystyle Prob_{X \in \mathcal{D}_n} [ X \in \{0,1\}^n \setminus S] \leq \delta + \epsilon.$

Thus, a distribution that is close to the uniform distribution in statistical distance can be used to replace purely uniform bits. However, statistical distance may be too strong for us and may not allow us enough freedom to significantly decrease the amount of randomness required. Suppose that the set of elements of ${\{0,1\}^n}$ that make ${\mathcal{A}}$ work correctly on input ${x}$, say ${K_x \subset \{0,1\}^n}$. We can then replace the Uniform distribution to generate the random bits by any distribution ${\mathcal{D}}$ such that ${|Prob_{x \sim \mathcal{D}}[ x \in K] - Prob_{x \sim \mathcal{U}_n} [x \in K]| \leq \epsilon}$ for all ${K \in \cup_{x} K_x}$. This may not need, depending on ${\mathcal{A}}$, that ${\mathcal{D}}$ be within ${\epsilon}$ statistical distance of ${\mathcal{U}_n}$. In this case, it makes sense to have a definition that considers errors on only some subsets of ${\{0,1\}^n}$ ( ${\cup_{x}K_x}$) and not all of them.

In discrepancy theory, a notion of ${\epsilon}$-approximation is studied which turns out to be interesting in this context.

Definition 2 (${\epsilon}$-approximation) Let ${(\mathcal{X},\mathcal {S})}$ be a set system where ${\mathcal {S} \subset 2^{X}}$. A subset ${Y \subset X}$ is said to be an ${\epsilon}$-approximation of ${X}$ with respect to ${\mathcal {S}}$, if for every ${S \in \mathcal {S}}$,

$\displaystyle | \frac{|S \cap X|}{ |X|} - \frac{|S \cap Y|}{|Y|} | \leq \epsilon$

Equivalently, if ${D_{A}}$ denotes the uniform distribution over elements of the set ${A}$, then

$\displaystyle | Prob_{x \sim D_{X}}[ x \in S] - Prob_{x \sim D_{Y}} [x \in Y] | \leq \epsilon$

An ${\epsilon}$-approximation measures tries to approximate the uniform distribution upto an ${\epsilon}$ error with respect to all the sets in ${\mathcal {S}}$. If ${\mathcal {S}}$ were ${2^{\mathcal{V}}}$ then we recover the definition of the statistical distance.

Let ${\mathcal{X}}$ be ${\{0,1\}^n}$ and ${\mathcal {S}}$ be the set of all parity functions. Then an ${\epsilon}$-approximation for the set system ${(\mathcal{X},\mathcal {S})}$ is called as an ${\epsilon}$-biased sample space. In other words, an ${\epsilon}$-biased sample space on ${\{0,1\}^n}$ fools all parity functions of ${X}$. The challenge is to construct small ${\epsilon}$-biased spaces. As we will see, such a relaxation dramatically reduces the sample space size from ${2^n}$ to ${O(\frac{n^2}{\epsilon^2})}$.

Definition 3 (${\epsilon}$-Biased Random Variables) ${X = (X_1, X_2,...,X_n)}$ over ${\Omega \subset \{0,1\}^n}$ is said to be ${\epsilon}$-biased collection of random variables and ${\Omega}$ an epsilon biased space, if for every ${S \subset [n]}$

$\displaystyle Prob_{X \sim \mathcal{D}_{\Omega}}[ \sum_{i \in S} X_i = 0] - Prob_{X \sim \mathcal{D}_{\Omega}}[\sum_{i \in S} X_i = 1] \leq \epsilon$

2. Construction of ${\epsilon}$-biased spaces

First of all, what does a ${0}$-biased space correspond to? As you might imagine, there is only once such space and that is the uniform distribution on ${\{0,1\}^n}$ itself. How do we prove this? Consider any probability function that assigns a probability to every element of ${\{0,1\}^n}$. Now, ${0}$-bias condition induces several linear constraints (in fact equality constraints) on such a probability function. In fact, they are exactly ${2^n -1}$ in number. One can also quite easily see that they are all independent. Add to it the constraint that the probabilities over all elements in ${\{0,1\}^n}$ must sum up to ${1}$. Then we have ${2^n}$ independent linear equality constraints and ${2^n}$ variables. It is easy to see that ${U_n}$ satisfies all these constraints. Also such a system cannot have more than one solution.

Let’s now discuss a simple but cool construction of ${\epsilon}$-biased space. Let ${m = \lceil{\log_2{\frac{n}{\epsilon}}}\rceil}$. Choose ${\mathcal{F} = \mathcal{F}_{m}}$. We show how to construct an ${\epsilon}$-biased space of size ${O(\frac{n^2}{\epsilon^2})}$.

Let ${y}$ and ${z}$ be drawn uniformly randomly from ${\mathcal{F}}$. Let ${X_i = \langle y^{i},z\rangle}$. Here the inner product is the standard product where we see the elements of ${\mathcal{F}}$ as vectors in ${\{0,1\}^d}$. We claim that ${X_i}$ for ${1 \leq i\leq n}$ are ${\epsilon}$-biased. The proof is based on two very simple ideas:

1. A ${d}$ degree univariate polynomial over ${\mathcal{F}}$ has at most ${d}$ zeros in ${\mathcal{F}}$.
2. The elements of ${\mathcal{F}_{m}}$ = ${\mathcal{F}_{2^s}}$ can be seen as vectors of length ${s}$ on ${F_2}$ for the purpose of addition. That is ${x + y}$ in ${\mathcal{F}}$ is same as ${x +y mod (2) }$ when ${x}$ and ${y}$ are seen as elements in ${\{0,1\}^d}$.

Proposition 4 Let ${y}$ and ${z}$ be drawn uniformly at random from ${\mathcal{F} = \mathcal{F}_m}$ for ${m = \lceil{\log_2{\frac{n}{\epsilon}}}\rceil}$ and let ${X_i =\langle y^{i}, z \rangle}$ for ${1\leq i\leq n}$. Then ${X_i}$s are ${\epsilon}$-biased.

Proof: Let ${S = \{ i_1, i_2, ...,i_k\}}$.

$\displaystyle X_{i_1} + X_{i_2}+...+X_{i_k} = \langle (y^{i_1} + ...+y^{i_k}), z \rangle$

Let ${P_S(y) = (y^{i_1} + ...+y^{i_k})}$ Then,

$\displaystyle Prob [\sum_{j \in S} X_j = 0] = Prob[P_S(y) = 0] + (1 - Prob[P_S(y) =0])(Prob[ \langle P_S(y) , z \rangle = 0])$

Now, ${P_S(y)}$ is a polynomial of degree at most ${n}$. Thus it can have at most ${n}$ roots. ${P_S(y) = 0}$ if ${y}$ is a zero of the polynomial ${y_S}$. Now, ${Prob[ y \sim \mathcal{D}_{\mathcal{F}}] [P_S(y) =0] \leq \frac{n}{2^m} \leq \epsilon}$. Also since ${z}$ is uniformly random vector in ${\{0,1\}^d}$, for a non sero ${P_S(y)}$, ${Prob[\langle P_S(y), z \rangle = 0] = \frac{1}{2}}$.

Thus, we have:

$\displaystyle Prob [\sum_{j \in S} X_j = 0] \leq \epsilon + (1 - \epsilon)\frac{1}{2} \leq \frac{1}{2}+ \epsilon$

This completes the proof. $\Box$

VC-Dimension

Consider a set of vectors ${\mathcal{D}}$ in ${\{0,1\}^n}$. Suppose you and I both are aware of the collection ${\mathcal{D}}$ and you want to know which vector ${d \in \mathcal{D}}$ am I looking at currently. Let’s say you are charged 1 unit for every bit you ask me the value of. How many bit-values do you need to know before you can identify the vector ${d}$ correctly for any ${d \in \mathcal{D}}$?

For any ${S \in [n]}$ and a vector ${d \in \{0,1\}^n}$ let ${d_S}$ denote the vector of bit values of ${d}$ on the positions in ${S}$. If ${S}$ is the set of bit positions you should inquire about the values of, clearly ${S}$ should be such that for every ${Y \in \{0,1\}^S}$ then there must exist at most one vector ${d \in \mathcal{D}}$ such that ${d_S = Y}$.

Now suppose you cannot ask me for specific bits. Instead, I reveal to you the bit in each position of the vector I hold, one by one. How many bits do you need to see, in the worse case to correctly identify the vector ${d}$? Since you can only control the number of bits revealed, say ${k}$, then any set ${S}$ of ${k}$ positions must be such that for any ${Y \in \{0,1\}^S}$, and any vector ${v, v' \in V}$, ${v'_S = v_S \Rightarrow v' = v}$. ( For the case where you can indeed ask me for specific bits, there is another well developed theory of witness sets. (See Chapter 12 of Stasys Jukna’s excellent book Extremal Combinatorics)

Now suppose the underlying set of points is infinite instead of ${\{0,1\}^n}$ as in the example above. The idea of distinguishing a set from a collection economically still makes intuitive sense. But the formulation above doesn’t. How do we translate this intuition into a good definition?

One of the ways we can differentiate two distinct sets ${S}$ and ${T}$ is by producing an element ${x}$ that is ${S \Delta T}$, i.e. which belongs to only one of the sets. Suppose you have an access to an oracle that produces an arbitrary point in the underlying ground set ${V}$ and tells you if it belongs to the set in question or not, how many queries to the oracle are required to pin down whether you have ${S}$ or ${T}$?

Note that if you had the freedom to decide which point the oracle tells you the membership on, then indeed you need to query on exactly one point, i.e., some ${x \in S \Delta T}$ where ${\Delta}$ denotes the symmetric difference. However, in the setting above, you have no control over which point the oracle chooses to disclose the membership.

I’ll now write down a couple of definitions and show how one can give a lower bound on the number of oracle queries required to pin-point any set in a collection. Later I’ll show how one can give an upper bound on the number of queries too.

Let ${C = (V, \mathcal{D})}$ be a set system where ${V}$ be the underlying (possibly infinite) set and let ${\mathcal{D}}$ now be a collection of subsets of ${V}$ (again, possibly an infinite number of subsets). . Let ${S}$ be any subset of ${V}$ . ${C}$ is said to shatter ${S}$ if for every ${T \subset S}$, ${\exists}$ a ${D \in \mathcal{D}}$ such that ${D \cap S = T}$. The VC-dimension of ${C}$ is defined as the cardinality of the largest set ${S}$ that ${C}$ shatters. If ${C}$ shatters arbitrarily large sets, the VC-dimension of ${C}$ is said to be ${\infty}$.

If ${C}$ shatters a set of size at least ${d}$ then note that there exists a set of points ${S \in V, |S| \leq d-1}$ such that when the oracle reveals the membership information for every point in ${S}$, you can’t still be sure which set in ${\mathcal{D}}$ is being referred to. Here’s a simple proof of this fact:

Let ${X}$ be the set that you must identify and let ${S}$ be a set of size ${d}$ that ${C}$ shatters. Suppose the oracle reveals the membership information for all points in ${T \subset S}$ but ${T \neq S}$. Let ${x \in S \setminus T}$. Let ${Y \in \{0,1\}^T}$ denote the membership information revealed by the oracle ( if ${T = [|T|] ), Y_i = 1 \leftrightarrow i \in X \forall 1 \leq i \leq |T|}$). Now since ${C}$ shatters ${S}$, there must exist sets ${A \neq B \in \mathcal{D}}$ such that if ${Y(A)}$ and ${Y(B)}$ are the corresponding membership vectors of elements in ${T}$ for ${A}$ and ${B}$ then, ${Y(A) = Y(B) = Y}$ and ${x \in A}$ but ${x \notin B}$.

Since ${A}$ and ${B}$ have matching membership information for elements in ${T}$, one cannot decide if ${X =A}$ or ${X = B}$ (or neither!) by the membership information on ${T}$.

I’ll continue writing about how one can obtain upper bounds on the oracle queries using the VC-dimension idea, and show how one can compute VC-dimension bounds for interesting geometric sets and show a connection to ${\epsilon}$-approximations.