VC-Dimension

Consider a set of vectors {\mathcal{D}} in {\{0,1\}^n}. Suppose you and I both are aware of the collection {\mathcal{D}} and you want to know which vector {d \in \mathcal{D}} am I looking at currently. Let’s say you are charged 1 unit for every bit you ask me the value of. How many bit-values do you need to know before you can identify the vector {d} correctly for any {d \in \mathcal{D}}?

For any {S \in [n]} and a vector {d \in \{0,1\}^n} let {d_S} denote the vector of bit values of {d} on the positions in {S}. If {S} is the set of bit positions you should inquire about the values of, clearly {S} should be such that for every {Y \in \{0,1\}^S} then there must exist at most one vector {d \in \mathcal{D}} such that {d_S = Y}.

Now suppose you cannot ask me for specific bits. Instead, I reveal to you the bit in each position of the vector I hold, one by one. How many bits do you need to see, in the worse case to correctly identify the vector {d}? Since you can only control the number of bits revealed, say {k}, then any set {S} of {k} positions must be such that for any {Y \in \{0,1\}^S}, and any vector {v, v' \in V}, {v'_S = v_S \Rightarrow v' = v}. ( For the case where you can indeed ask me for specific bits, there is another well developed theory of witness sets. (See Chapter 12 of Stasys Jukna’s excellent book Extremal Combinatorics)

Now suppose the underlying set of points is infinite instead of {\{0,1\}^n} as in the example above. The idea of distinguishing a set from a collection economically still makes intuitive sense. But the formulation above doesn’t. How do we translate this intuition into a good definition?

One of the ways we can differentiate two distinct sets {S} and {T} is by producing an element {x} that is {S \Delta T}, i.e. which belongs to only one of the sets. Suppose you have an access to an oracle that produces an arbitrary point in the underlying ground set {V} and tells you if it belongs to the set in question or not, how many queries to the oracle are required to pin down whether you have {S} or {T}?

Note that if you had the freedom to decide which point the oracle tells you the membership on, then indeed you need to query on exactly one point, i.e., some {x \in S \Delta T} where {\Delta} denotes the symmetric difference. However, in the setting above, you have no control over which point the oracle chooses to disclose the membership.

I’ll now write down a couple of definitions and show how one can give a lower bound on the number of oracle queries required to pin-point any set in a collection. Later I’ll show how one can give an upper bound on the number of queries too.

Let {C = (V, \mathcal{D})} be a set system where {V} be the underlying (possibly infinite) set and let {\mathcal{D}} now be a collection of subsets of {V} (again, possibly an infinite number of subsets). . Let {S} be any subset of {V} . {C} is said to shatter {S} if for every {T \subset S}, {\exists} a {D \in \mathcal{D}} such that {D \cap S = T}. The VC-dimension of {C} is defined as the cardinality of the largest set {S} that {C} shatters. If {C} shatters arbitrarily large sets, the VC-dimension of {C} is said to be {\infty}.

If {C} shatters a set of size at least {d} then note that there exists a set of points {S \in V, |S| \leq d-1} such that when the oracle reveals the membership information for every point in {S}, you can’t still be sure which set in {\mathcal{D}} is being referred to. Here’s a simple proof of this fact:

Let {X} be the set that you must identify and let {S} be a set of size {d} that {C} shatters. Suppose the oracle reveals the membership information for all points in {T \subset S} but {T \neq S}. Let {x \in S \setminus T}. Let {Y \in \{0,1\}^T} denote the membership information revealed by the oracle ( if {T = [|T|] ), Y_i = 1 \leftrightarrow i \in X \forall 1 \leq i \leq |T|}). Now since {C} shatters {S}, there must exist sets {A \neq B \in \mathcal{D}} such that if {Y(A)} and {Y(B)} are the corresponding membership vectors of elements in {T} for {A} and {B} then, {Y(A) = Y(B) = Y} and {x \in A} but {x \notin B}.

Since {A} and {B} have matching membership information for elements in {T}, one cannot decide if {X =A} or {X = B} (or neither!) by the membership information on {T}.

I’ll continue writing about how one can obtain upper bounds on the oracle queries using the VC-dimension idea, and show how one can compute VC-dimension bounds for interesting geometric sets and show a connection to {\epsilon}-approximations.

 

Leave a comment