Consider a set of vectors in . Suppose you and I both are aware of the collection and you want to know which vector am I looking at currently. Let’s say you are charged 1 unit for every bit you ask me the value of. How many bit-values do you need to know before you can identify the vector correctly for any ?
For any and a vector let denote the vector of bit values of on the positions in . If is the set of bit positions you should inquire about the values of, clearly should be such that for every then there must exist at most one vector such that .
Now suppose you cannot ask me for specific bits. Instead, I reveal to you the bit in each position of the vector I hold, one by one. How many bits do you need to see, in the worse case to correctly identify the vector ? Since you can only control the number of bits revealed, say , then any set of positions must be such that for any , and any vector , . ( For the case where you can indeed ask me for specific bits, there is another well developed theory of witness sets. (See Chapter 12 of Stasys Jukna’s excellent book Extremal Combinatorics)
Now suppose the underlying set of points is infinite instead of as in the example above. The idea of distinguishing a set from a collection economically still makes intuitive sense. But the formulation above doesn’t. How do we translate this intuition into a good definition?
One of the ways we can differentiate two distinct sets and is by producing an element that is , i.e. which belongs to only one of the sets. Suppose you have an access to an oracle that produces an arbitrary point in the underlying ground set and tells you if it belongs to the set in question or not, how many queries to the oracle are required to pin down whether you have or ?
Note that if you had the freedom to decide which point the oracle tells you the membership on, then indeed you need to query on exactly one point, i.e., some where denotes the symmetric difference. However, in the setting above, you have no control over which point the oracle chooses to disclose the membership.
I’ll now write down a couple of definitions and show how one can give a lower bound on the number of oracle queries required to pin-point any set in a collection. Later I’ll show how one can give an upper bound on the number of queries too.
Let be a set system where be the underlying (possibly infinite) set and let now be a collection of subsets of (again, possibly an infinite number of subsets). . Let be any subset of . is said to shatter if for every , a such that . The VC-dimension of is defined as the cardinality of the largest set that shatters. If shatters arbitrarily large sets, the VC-dimension of is said to be .
If shatters a set of size at least then note that there exists a set of points such that when the oracle reveals the membership information for every point in , you can’t still be sure which set in is being referred to. Here’s a simple proof of this fact:
Let be the set that you must identify and let be a set of size that shatters. Suppose the oracle reveals the membership information for all points in but . Let . Let denote the membership information revealed by the oracle ( if ). Now since shatters , there must exist sets such that if and are the corresponding membership vectors of elements in for and then, and but .
Since and have matching membership information for elements in , one cannot decide if or (or neither!) by the membership information on .
I’ll continue writing about how one can obtain upper bounds on the oracle queries using the VC-dimension idea, and show how one can compute VC-dimension bounds for interesting geometric sets and show a connection to -approximations.