## “No Estimation Without Representation!”

For so-called Statistically Valid Random Sampling (SVRS), it is necessary that the sample be “representative.” I have written other articles about the appeals of Medicare extrapolation cases, and the claim that a sample is not representative is another of those objections sometimes raised in the appeals process. Representativeness is a general principle in statistics which should be understood by everyone involved in sampling. Issues raised in appeals sometimes reveal a misunderstanding about what “representative” really means. Or, perhaps, the consultant is counting on the adjudicator buying into such a misunderstanding.

There is an intuitive version of representation that says a representative sample will “look like” the population; that is, the two will have very similar descriptive statistics, such as mean, variance, and skewness. A histogram of the sample should look like a histogram of the population, one thinks. Indeed, this is the ideal: it is the result one hopes for if sampling is to be worthwhile. And it is, in fact, the theoretical asymptotic result. However, it might not be the case for any given representative sample.

It is a general principle of sampling that random sampling ensures representativeness. Misunderstandings arise with the failure to distinguish between an intuitively “representative sample” and a “representative sampling procedure.” A sample which is chosen through a “representative sampling procedure” is, by definition, a “representative sample.” This is true even if it fails to resemble the population.

We can use a small illustration to show how this occurs. Suppose we have a population consisting of {1, 2, 3, 10} and we intend to select a sample of size three. Now, a basic principle of random sampling is that every possible sample has an equal (technically, “known”) probability of being selected. In this illustration, we can enumerate all the possible samples and compare the mean and variance to the population. I leave this as an exercise for the reader. Even without doing it, you might quickly see that the sample {1, 2, 3}, which is perfectly legitimate, has very different characteristics as compared to the population.

We can go further than this, because we have not stated whether sampling should be with, or without, replacement. In sampling with replacement, {10, 10, 10} is one of the possible samples, and it is pathologically unlike the population! Yet, these samples are representative, if selected randomly from the population of samples. Larger populations with larger samples will not display such extreme behavior, yet the principle seen in the illustration can still be observed.
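For readers who would rather check the illustration by machine than by hand, here is a minimal Python sketch that enumerates every possible sample of size three from {1, 2, 3, 10}. The helper names (`exact_mean`, `sample_variance`) are my own, and the choice of the n - 1 divisor for the sample variance is an assumption; the illustration itself does not specify one.

```python
from fractions import Fraction
from itertools import combinations, product

population = [1, 2, 3, 10]
sample_size = 3


def exact_mean(values):
    """Arithmetic mean as an exact fraction (avoids rounding noise)."""
    return Fraction(sum(values), len(values))


def sample_variance(values):
    """Sample variance with the n - 1 divisor (an assumed convention)."""
    m = exact_mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


population_mean = exact_mean(population)

# Without replacement: every unordered subset of size three.
samples_without = list(combinations(population, sample_size))

# With replacement: every ordered draw of three items, repeats allowed.
samples_with = list(product(population, repeat=sample_size))

print("population mean:", float(population_mean))
for s in samples_without:
    print(s, "mean:", float(exact_mean(s)), "variance:", float(sample_variance(s)))

# Average of the sample means over all equally likely samples.
avg_without = exact_mean([exact_mean(s) for s in samples_without])
avg_with = exact_mean([exact_mean(s) for s in samples_with])
print("average sample mean, without replacement:", float(avg_without))
print("average sample mean, with replacement:", float(avg_with))
```

No individual sample has a mean equal to the population mean of 4, and {1, 2, 3} misses it by a wide margin, yet the sample means average to exactly 4 under either scheme. That is the sense in which the procedure, rather than any single sample, is representative.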

When one begins to evaluate a sample to see if it is representative, the question ultimately arises, “What will you do if you decide it is not representative?” The appeals consultant would have you throw it out and start over, but what does that do to SVRS? Under random sampling, every sample should have a known probability of selection. But if you’re going to decide, after the fact, that some samples are not eligible, you can no longer know the ultimate probability of selection (after rejection). I suppose, if one could quantify the conditions under which a sample will be rejected (in advance), there might be a way to do this. I’ve never heard of anyone trying, and I suspect there are good reasons why not! In any case, the call to discard a sample after sampling is performed turns the sample into a judgment sample rather than a probability sample, and thus not SVRS. It’s just the same as if a scientist were to keep collecting samples until he got one he liked, and discarded the others!
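To see what rejection does to the selection probabilities, consider a minimal sketch using the toy population {1, 2, 3, 10} with samples of size three drawn without replacement. The rejection rule here (`looks_representative`, which discards any sample that lacks the largest item) is purely hypothetical and chosen only for illustration; real objections are rarely stated this precisely.

```python
from fractions import Fraction
from itertools import combinations

population = [1, 2, 3, 10]
population_mean = Fraction(sum(population), len(population))


def sample_mean(sample):
    return Fraction(sum(sample), len(sample))


def inclusion_probability(item, samples):
    """Share of equally likely samples that contain the item."""
    return Fraction(sum(item in s for s in samples), len(samples))


def looks_representative(sample):
    """Hypothetical after-the-fact rule: keep only samples with the largest item."""
    return max(population) in sample


all_samples = list(combinations(population, 3))
kept = [s for s in all_samples if looks_representative(s)]

# Expected value of the sample mean before and after rejection.
expected_all = sum(sample_mean(s) for s in all_samples) / len(all_samples)
expected_kept = sum(sample_mean(s) for s in kept) / len(kept)

# Inclusion probability of each item before and after rejection.
inclusion_all = {x: inclusion_probability(x, all_samples) for x in population}
inclusion_kept = {x: inclusion_probability(x, kept) for x in population}

print("population mean:", population_mean)
print("expected sample mean, all samples:", expected_all)
print("expected sample mean, after rejection:", expected_kept)
print("inclusion probabilities, all samples:", inclusion_all)
print("inclusion probabilities, after rejection:", inclusion_kept)
```

Under the original design every item has an inclusion probability of 3/4 and the sample mean is right on average. Once the hypothetical rule is applied, the item 10 is included with certainty, the others with probability 2/3, and the expected sample mean rises from 4 to 14/3. The probabilities could be worked out here only because the rule was written down in advance and the population is tiny enough to enumerate.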

The next question is, what is it you actually want to be representative? This ought to be a real “aha” moment. You see, sampling is performed to collect “observations” from a limited number of “subjects.” The target values of the whole population are not observed (this applies to most cases; we omit the exceptions). But, it is the target values in the population that one wants the observed values from the sample to represent. So, it is actually not possible to compare the distribution of the population values (we don’t have them) and the sample values. Evaluation of “representativeness,” in this sense, cannot be done!

So what is it that our appeals consultants are doing? They’re comparing other (known) values from the population to their corresponding values from the sample. In the Medicare extrapolation scenario, this usually means comparing the distributions of the paid amounts as a proxy for the distributions of overpayments. Of course, one hardly need mention that the distribution of the paid amounts may not be anything like the distribution of the overpayments! So even if one were to entertain the validity of rejecting a sample for not being representative, we still don’t know the all-important piece of information: whether the overpayments are representative or not! (Putting aside all the problems with defining “representative,” as described above.)

Perhaps the reader will be unable to avoid the gut feeling that, somehow, it’s just fair and right that the distribution of paid amounts for the sample should be similar to the distribution of paid amounts of the population. Well, there are some ways to deal with this that do not involve invalid sampling procedures. One helpful strategy is to deal with outliers first. Outliers can either be eliminated, or they can be put into a separate “certainty stratum.” Both approaches are done prior to sampling and do not break the rule about “known probabilities.” And speaking of “stratum,” stratified sampling plans can also be used to mitigate this problem. By controlling the stratum sample sizes, one can ensure a more even “intuitive representation” without jeopardizing validity. Of course, when stratification is done, the consultant will certainly make some claim about how the strata are not correctly defined, or how the sample (as a whole, without regard to strata) is unlike the population. The possibilities are probably endless, but one thing is clear: An SVRS, which is always conducted with known probabilities of selection, is always a representative sample.
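As a rough sketch of how a certainty stratum and fixed stratum sample sizes keep every selection probability known, consider the following Python example. Everything in it is an assumption made for illustration: the function name `stratified_sample`, the paid amounts, the stratum boundaries, the sample sizes, and the seed. It is not a description of any particular auditor's procedure.

```python
import random
from bisect import bisect_right


def stratified_sample(paid, upper_bounds, sizes, seed=2024):
    """Draw a stratified sample with a certainty stratum.

    paid: dict of claim id -> paid amount (hypothetical data).
    upper_bounds: ascending cut points; amounts at or above the last
        one go to the certainty stratum and are all reviewed.
    sizes: number of claims to draw from each sampled stratum.
    Returns one dict per stratum with the sampled ids and the
    inclusion probability, which is fixed before any draw is made.
    """
    rng = random.Random(seed)
    members = [[] for _ in range(len(upper_bounds) + 1)]
    for claim_id, amount in sorted(paid.items()):
        # bisect_right gives the index of the stratum for this amount.
        members[bisect_right(upper_bounds, amount)].append(claim_id)

    plan = []
    for ids, n in zip(members[:-1], sizes):
        n = min(n, len(ids))
        chosen = rng.sample(ids, n)  # simple random sample within stratum
        prob = n / len(ids) if ids else 0.0
        plan.append({"sampled": sorted(chosen), "inclusion_prob": prob})

    # Certainty stratum: every outlier is reviewed, probability 1.
    plan.append({"sampled": members[-1], "inclusion_prob": 1.0})
    return plan


# Hypothetical paid amounts with two obvious outliers.
amounts = [12, 25, 40, 55, 70, 85, 120, 250, 400, 640, 900, 5200, 18000]
paid = {f"claim{i:02d}": amt for i, amt in enumerate(amounts)}

plan = stratified_sample(paid, upper_bounds=[100, 1000], sizes=[3, 2])
for stratum in plan:
    print(stratum)
```

The stratum definitions and sample sizes are all settled before a single claim is drawn, so the inclusion probabilities (here 3 of 6, 2 of 5, and 1 for the certainty stratum) are known in advance no matter which claims the random draw happens to pick.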