What is the meaning of the normal distribution?

(This was originally an answer to a Quora question.)

Ah, the question of meaning. I’m going to assume you are not asking about the mathematical-theoretical derivation of the Gaussian distribution, which is often called the “normal distribution.” This is something you can read about in books of Mathematical Statistics, if you wish. Rather, you want to know what it means, practically, for something to be “normal.” I’m going to assume you’ve already looked at graphs of a normal distribution, and have a general idea what it is.

And so, I think I should start by saying that “almost nothing is actually normal.” One goal of Statistics is to make models of reality that are useful for making predictions. Predictions are (mostly) results of mathematical calculations, so, the easier the mathematics, the easier it is to develop predictions, especially predictions that are reproducible, and understandable, so they can be evaluated and trusted by others. So what is a model? You can think of it as a simplified, or smoothed-over picture of reality that hides some of the bumps and bruises so we can see what is really important. For example, an architect might make a scale model of a building he is designing. The scale model does not have all of the features of the final product. But it can highlight the most obvious features, and it also makes the image small enough so that a person trying to evaluate it can see everything at once. A mathematical or statistical model does something similar for numbers.

So, what statisticians like to do, is take a collection of numbers (there could be millions or billions) and try to boil them down to a simple model. This model would be an equation, or a small group of numbers, that the statistician can look at and “see everything at once,” or at least what is important for the problem at hand. The model will not fit perfectly; that is, it doesn’t have all the details, just like the scale model of the building doesn’t have light switches, water lines, or wireless routers.

When we look at data, the basic form of description is the “distribution.” Imagine one of those coin-sorting machines where you can drop a handful of coins in the top and they fall into vertical tubes with each tube having all the coins of one denomination. When you drop your coins, some of the tubes might get fuller than others, and (ignoring thickness), you can see which coins you have more of. If you count the coins in each tube, that is your “distribution.” If you calculate the proportion of coins in each tube, you have a “relative frequency distribution.” And, if you select a coin at random from the original handful, the relative frequency corresponds to the probability of getting one of those coins. So distributions really describe our data in terms of how many of each kind we have, relative to others. But, due to the realities of measurement, and also some mathematical reasons, having discrete numbers (separate, distinct tubes) is not always the best way to describe the numbers. So, for this we develop the idea of a continuous distribution, where the numbers can have any decimal value. The easiest way to visualize what this is, is to think of the bars of a relative frequency distribution, and connect the tops with a smooth curve, in effect, “smoothing out” the differences between the distinct values.
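The coin-sorting picture translates directly into a few lines of Python (the handful of coins here is made up for illustration, recorded by value in cents):

```python
from collections import Counter

# A hypothetical handful of coins, recorded by value in cents.
coins = [25, 10, 25, 1, 5, 25, 10, 1, 1, 25]

counts = Counter(coins)                               # the "tubes": a frequency distribution
total = sum(counts.values())
rel_freq = {c: n / total for c, n in counts.items()}  # relative frequency distribution

print(counts[25])     # how many quarters landed in that tube
print(rel_freq[25])   # also the probability of drawing a quarter at random
```

The relative frequencies necessarily sum to one, which is what lets them double as probabilities for a random draw.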

For certain kinds of data, the “normal distribution” is one such continuous distribution that provides a smooth approximation to the relative frequencies of the real data. But, it is a very complicated formula. It is nothing that would be guessed by trial and error. It’s not the kind of thing you can understand if you have only gone as far as College Algebra in your mathematical pursuits. If we are to avoid the theoretical complexities, the best thing we can say about this formula is that it has a bunch of really useful mathematical properties which work very well in statistical theorems about estimation and hypothesis testing. If we can apply this model to the data, we are able to draw conclusions that have mathematical validity. Today, there are other methods (non-parametric, etc.) that use a lot of computing power to draw similar conclusions without the constraints of using an approximating model that discards some of the information. Yet, these methods may also have drawbacks, particularly in the areas mentioned above, namely, whether they are “reproducible, and understandable, so they can be evaluated and trusted by others.”
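For readers who want to see it anyway, the formula is the density function f(x) = exp(-(x - mu)^2 / (2 sigma^2)) / (sigma sqrt(2 pi)). A minimal sketch in Python (the function name is mine, not from any library):

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of the normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

print(normal_pdf(0.0))            # peak height of the standard normal, about 0.3989
print(normal_pdf(100, 100, 15))   # the peak is lower when the data spread out more
```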

Remembering that the normal distribution is a model, so it is not expected to describe the data perfectly, we can understand the common procedure where we produce a histogram (relative frequency distribution graph) and superimpose a normal curve on it to see “how well it fits.” There are tests to determine if the fit is “good enough” for the normal distribution to be used. But in general, the distribution should be symmetric and bell-shaped. In theory, normal distributions go on infinitely in positive and negative tails, but it is not necessary (and not feasible) for this to happen with the real values.
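As a sketch of one such “good enough” check, here is the Shapiro-Wilk test from SciPy (assuming NumPy and SciPy are installed; the simulated data are my own invention):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
data = rng.normal(loc=100, scale=15, size=500)   # simulated, roughly bell-shaped data

w, p = stats.shapiro(data)    # Shapiro-Wilk test of normality
print(f"W = {w:.4f}, p = {p:.4f}")
# A small p-value (say, under 0.05) is evidence against using the normal model.
```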

How do you “fit” a normal distribution? Only two parameters are needed, the mean, and the standard deviation. You can calculate these from the data you have. The standard deviation describes how much the data “spreads out.” If this explanation is not satisfying, try this visualization: Suppose you see a cloud of fruit flies buzzing around a rotten apple. Take a photo, and measure all the distances of the flies from the center of the cloud. You could calculate the average distance, which is called the “mean absolute deviation,” and which would be one way of describing how much the flies spread out. The standard deviation is like that, but it is the square root of the average of the squared distances, which gives more influence to the farther-out values. Now imagine a flock of crows circling around a tasty cornfield. You are at a distance that makes the cloud of birds look very much like the fruit flies you saw before. Suppose you photograph this scene, and by some wild coincidence, all the birds line up exactly with the fruit flies in your previous photo! But the birds are much bigger, and much farther apart, in real life. So, the standard deviation will be much larger. So what this tells you is that the standard deviation is a kind of scaling factor. The two photos can be “standardized” so that they look the same, by changing the measurement units. And although the photos were taken miles apart, you can slide them together on the table, so that the centers are right on top of each other, and you can’t really tell the difference between them. This is exactly what the “standard normal” distribution does. We slide the number line under the data until the mean is zero, and then stretch or squeeze it until the standard deviation is one. In this sense, all normally distributed data looks the same, and since we know the changes we made (subtract the mean and divide by the standard deviation) we can convert any conclusions we made back to the original location and scale.
This explanation of “standard normal” was a bit of a tangent. The real point is that normally distributed data are completely described by the mean and standard deviation. By modeling with the normal distribution, we have essentially collapsed all the data down to two numbers. How’s that for simplification?
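The “slide and stretch” operation is just subtracting the mean and dividing by the standard deviation. A quick sketch with made-up numbers:

```python
import statistics

data = [85, 100, 115, 130, 70, 100, 145, 100]   # hypothetical scores
mu = statistics.mean(data)
sigma = statistics.pstdev(data)                 # population standard deviation

z = [(x - mu) / sigma for x in data]            # standardized ("z") scores

print(round(statistics.mean(z), 10))    # slid so the mean is 0
print(round(statistics.pstdev(z), 10))  # stretched so the standard deviation is 1
```

Any conclusion reached on the z scale converts back by reversing the steps: multiply by sigma, then add mu.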

The fruit flies, or the birds, exist in three dimensions in real life, and in two dimensions in the photo. So our visualization can be misleading if we take it too far—because what we really need to think about is only one dimension. The flies represent data, but to make this work, we really have to think of them as representing a number to the right or left of the mean. We should then drop them down to the x-axis, so that they pile up like a histogram. Now if you have a good imagination, you may realize that if there are not too many flies, they will all fall down flat on the x-axis and not pile up at all! Not much of a bell curve! But this is where the mathematical magic of continuous distributions comes in. It is a fact that the probability of any specific number (infinite decimals) is zero in a continuous distribution. But using the techniques of calculus, ranges of numbers have positive probabilities. So even if the flies all fell flat, those closer to the mean are more dense, and thus for any range, the probability is higher than the same range farther away from the mean.

Now let’s get back to that question about the meaning of the normal distribution. Remember that the standard deviation is a kind of distance from the mean. With normally distributed data, approximately 68% of observations are within one standard deviation of the mean, and approximately 95% are within two standard deviations. Beyond that, there are very few—99.7% are within three standard deviations, and the probability beyond four is so small that for all practical purposes we can say that all the data are within four standard deviations. You’ve probably heard all this already, and there’s probably a big old “so what” sitting on your shoulder, which might be the reason you asked this question in the first place. And the truth is, the “meaning” you may be looking for, often has a great deal to do with what exactly the data is about. The normal distribution is telling you things like this:

  1. Our data exhibit variation that is centered around a mean.
  2. There are more data close to the mean and increasingly fewer as the distance from the mean increases.
  3. Data more than two standard deviations away are rare, and data more than three standard deviations away are almost non-existent.
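Those percentages are not folklore; they fall out of the normal model directly. The probability of landing within k standard deviations of the mean is erf(k/sqrt(2)), which needs only the standard library:

```python
import math

def within(k):
    """Probability a normal observation lands within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2.0))

for k in (1, 2, 3, 4):
    print(f"within {k} sd: {within(k):.4%}")
# within 1 sd: about 68.27%; within 2: about 95.45%; within 3: about 99.73%
```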

Interpretations in real life go something like this: The average human IQ is supposed to be 100, and the standard deviation is 15. This means most people (68%) have IQs between 85 and 115, which is roughly considered “normal.” For simplicity, let’s just look at the top half of the distribution. From 100 to 115, one standard deviation, you would expect to find 34% of the people. From 115 to 130, the second standard deviation, only 13.5%. From 130 to 145, the third standard deviation, only 2.5%. And over 145? Well, that’s just a fraction of a percent (0.15%). If the model holds true, and there were 2000 students in your school, just three of them would be in this group. Now you can understand, if you have heard talk about someone (Einstein) having an IQ over 160, how rare this is. 160 is the fourth standard deviation from the mean. We expect to find almost no data over that level.
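The school arithmetic above can be checked with the same normal model (mean 100, standard deviation 15, as in the text); the upper-tail probability comes from the complementary error function:

```python
import math

def prob_above(x, mu=100.0, sigma=15.0):
    """P(X > x) under the normal IQ model from the text."""
    return 0.5 * math.erfc((x - mu) / (sigma * math.sqrt(2.0)))

students = 2000
print(round(students * prob_above(145)))   # about 3 students above 145
print(prob_above(160))                     # the sliver beyond four standard deviations
```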

In summary, what is the meaning of the normal distribution? It is a way of modeling the probabilities for certain kinds of data that allows us to describe, in a simplified way, how the data are distributed, using a very simple description (2 parameters). Many real data fit the distribution quite well, so it is a widely used method of summarizing data for analysis.


“No Estimation Without Representation!”

For so-called Statistically Valid Random Sampling (SVRS), it is necessary that the sample be “representative.” I have written other articles about the appeals of Medicare extrapolation cases, and this is another of those objections that is sometimes raised in the appeals process. This is a general principle in statistics which should be understood by everyone involved in sampling. Issues raised in appeals sometimes reveal a misunderstanding about what “representative” really means. Or, perhaps, the consultant is counting on the adjudicator buying into such a misunderstanding.

There is an intuitive version of representation that says a representative sample will “look like” the population, that is, they will have very similar descriptive statistics, such as mean, variance, and skewness. A histogram of the sample should look like a histogram of the population, one thinks. Indeed, this is the ideal, it is the result one hopes for if sampling is to be worthwhile. And it is, in fact, the theoretical asymptotic result. However, it might not be the case for any given representative sample.

It is a general principle of sampling that random sampling ensures representativeness. Misunderstandings arise with the failure to distinguish between an intuitively “representative sample” and a “representative sampling procedure.” A sample which is chosen through a “representative sampling procedure” is, by definition, a “representative sample.” This is true even if it fails to resemble the population. We can use a small illustration to show how this occurs. Suppose we have a population consisting of {1, 2, 3, 10} and we intend to select a sample of size three. Now, a basic principle of random sampling is that every possible sample has an equal (technically, “known”) probability of being selected. In this illustration, we can enumerate all the possible samples and compare the mean and variance to the population. I leave this as an exercise for the reader. Even without doing it, you might quickly see that the sample {1, 2, 3}, which is perfectly legitimate, has very different characteristics as compared to the population. We can go further than this, because we have not stated whether sampling should be with, or without, replacement. In sampling with replacement, {10, 10, 10} is one of the possible samples, and it is pathologically unlike the population! Yet, these samples are representative, if selected randomly from the population of samples. Larger populations with larger samples will not display such extreme behavior, yet the principle seen in the illustration can still be observed.
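The exercise is short enough to do in code. Enumerating all without-replacement samples of size three from {1, 2, 3, 10}:

```python
from itertools import combinations
import statistics

population = [1, 2, 3, 10]

samples = list(combinations(population, 3))   # all equally likely under SRS
for s in samples:
    print(s, "mean:", round(statistics.mean(s), 2),
          "variance:", round(statistics.pvariance(s), 2))

# Individual samples can look nothing like the population...
means = [statistics.mean(s) for s in samples]
print("population mean:", statistics.mean(population))
# ...yet averaged over all possible samples, SRS is unbiased:
print("average of all sample means:", statistics.mean(means))
```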

When one begins to evaluate a sample to see if it is representative, the question ultimately arises, “What will you do if you decide it is not representative?” The appeals consultant would have you throw it out and start over, but, what does that do to SVRS? According to random sampling, every sample should have a known probability of selection. But if you’re going to decide, after the fact, that some samples are not eligible, you can no longer know the ultimate probability of selection (after rejection). I suppose, if one could quantify the conditions under which a sample will be rejected (in advance), there might be a way to do this. I’ve never heard of anyone trying, and I suspect there are good reasons why not! In any case, the call to discard a sample after sampling is performed turns the sample into a judgement sample rather than a probability sample, and thus not SVRS. It’s just the same as if a scientist were to keep collecting samples until he got one he liked, discarding the others!
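A small simulation (with an entirely made-up skewed population) shows what after-the-fact rejection does. Here the “consultant rule” throws away any sample containing a large value, on the grounds that it “doesn’t look like” the population:

```python
import random
import statistics

random.seed(1)
population = list(range(1, 101)) + [500, 600, 700]   # skewed: a few large values
pop_mean = statistics.mean(population)

def srs_mean(n=10):
    """Plain simple random sample: keep whatever is drawn."""
    return statistics.mean(random.sample(population, n))

def judged_mean(n=10):
    """Re-draw until no 'outlier' appears -- the rejection the text warns against."""
    while True:
        s = random.sample(population, n)
        if max(s) <= 400:
            return statistics.mean(s)

plain = statistics.mean(srs_mean() for _ in range(2000))     # hovers near pop_mean
judged = statistics.mean(judged_mean() for _ in range(2000)) # systematically too low
print(round(pop_mean, 1), round(plain, 1), round(judged, 1))
```

The rejected samples carried real information about the large values; discarding them makes the estimator blind to that part of the population.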

The next question is, what is it you actually want to be representative? This ought to be a real “aha” moment. You see, sampling is performed to collect “observations” from a limited number of “subjects.” The target values of the whole population are not observed (this applies to most cases; we omit the exceptions). But, it is the target values in the population that one wants the observed values from the sample to represent. So, it is actually not possible to compare the distribution of the population values (we don’t have them) and the sample values. Evaluation of “representativeness,” in this sense, cannot be done!

So what is it that our appeals consultants are doing? They’re comparing other (known) values from the population to their corresponding values from the sample. In the Medicare extrapolation scenario, this usually means comparing the distributions of the paid amounts as a proxy for the distributions of overpayments. Of course, one hardly need mention that the distribution of the paid amounts may not be anything like the distribution of the overpayments! So even if one were to entertain the validity of rejecting a sample for not being representative, we still don’t know the all-important piece of information–if the overpayments are representative or not! (Putting aside all the problems with defining “representative,” as described above.)

Perhaps the reader will be unable to avoid the gut feeling that, somehow, it’s just fair and right that the distribution of paid amounts for the sample should be similar to the distribution of paid amounts of the population. Well, there are some ways to deal with this that do not involve invalid sampling procedures. One helpful strategy is to deal with outliers first. Outliers can either be eliminated, or they can be put into a separate “certainty stratum.” Both approaches are done prior to sampling and do not break the rule about “known probabilities.” And speaking of “stratum,” stratified sampling plans can also be used to mitigate this problem. By controlling the stratum sample sizes, one can ensure a more even “intuitive representation” without jeopardizing validity. Of course, when stratification is done, the consultant will certainly make some claim about how the strata are not correctly defined, or how the sample (as a whole, without regard to strata) is unlike the population. The possibilities are probably endless, but one thing is clear: An SVRS, which is always conducted with known probabilities of selection, is always a representative sample.
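One version of the certainty-stratum idea, sketched with invented payment amounts (the cutoff and sample sizes here are arbitrary choices, made before any sampling occurs):

```python
import random
import statistics

random.seed(7)
# Hypothetical paid amounts: mostly modest claims plus three very large ones.
payments = [round(random.uniform(50, 200), 2) for _ in range(500)] + [5000.0, 8000.0, 12000.0]

CUTOFF = 1000.0
certainty = [p for p in payments if p >= CUTOFF]   # outliers: reviewed in full, no sampling
remainder = [p for p in payments if p < CUTOFF]    # the stratum we actually sample

sample = random.sample(remainder, 30)
estimated_total = sum(certainty) + statistics.mean(sample) * len(remainder)
print(round(estimated_total, 2))
# The big payments can no longer distort (or be missed by) the random sample.
```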

A flashback of thoughts on education

Today I’m posting something that isn’t (necessarily) about statistics.

For as long as I’ve been alive, and based on my parents’ comments, apparently for some time before that, education has been “going downhill.”

We still hear anecdotes about how bad education is, we still hear about American test scores not being competitive in the world, and we still hear from college professors that freshmen are increasingly unprepared for college.

And now we have some new things.  It’s the “millennials” we hear, their faces buried in their phones, their 50-millisecond attention span, their inability to reason in depth, and so forth.  Well maybe it’s not as bad as all that.  But I was going through some old posts I wrote on another blog, more than 10 years ago, and I found this gem.  It seems even more relevant now than it was then.  Enjoy.

Tuesday, October 18, 2005
Education vs Instant Gratification

I think I have a new take on the “schools were better in the past” or “students were better in the past” argument. It’s this: We live in a culture of instant gratification, and education doesn’t fit.

Consider being a student a couple of hundred years ago. Suppose you were hungry. What would you do? Well, you would probably have to think of it in advance, and get prepared. Maybe you’d have to butcher a chicken, which you would have had to raise up from a chick. Maybe you’d have to go hunting for something, then butcher it, then cook it, and so, in a couple of hours, you’d have something to eat. If you wanted some vegetables, you’d better have planned ahead months in advance–planted a garden, tended it, put up and preserved the goodies. Then, when you’re hungry, you could take it out and prepare it (which might involve building a fire, etc). It would take lots of effort and advance planning. Of course, as a student, you might not have done all that yourself. But, chances are, you’d have been part of the process, helping your parents do exactly those things. So you would get the idea that if you wanted to eat, you’d better be prepared to put some work into it.

Today, you run to McDonalds or throw a frozen dinner in the microwave, and in a few minutes, you can eat. It’s pretty easy and doesn’t take much planning or work.

Suppose you were a student a couple of hundred years ago, and you were cold. What would you do? Throw another log on the fire–but first, you’d have to chop the wood, stack and dry it. Or maybe you use coal–dig it out of the ground and haul it home. Or maybe you gather buffalo chips in the fall. Or, you’d put on more clothes. But where do you get them? Long ago, you would have gathered straw, spun thread, woven the cloth, and finally sewn the garment. More recently, you’d still have to buy the cloth and make your clothes. It was a long process that involved planning and work, to make sure you’d have something warm to put on. You probably participated with your parents in all these activities. You’d get the idea that if you wanted to be warm, you’d better be prepared to put some work into it.

Today, you turn up the thermostat or run to Walmart and buy a sweater. In a few minutes, you’re warm. It’s pretty easy and doesn’t take much planning or work.

Suppose you are a student a couple of hundred years ago, and you went to school. You’d know that everything important in life requires hard work and advance preparation. You’d take it for granted that nothing important comes easy. You’d automatically be prepared to work hard at school, just like everything else.

Today, every other experience of your life tells you that the things you want can be quickly and easily obtained. There is practically no chance that you would ever have to worry about not having your basic needs fulfilled, even if you do absolutely nothing. You see advertising that tells you how all the hardest jobs can be done without breaking a sweat, leaving you plenty of time to play and enjoy yourself. Unfortunately, there haven’t been any major advances in education in the last 200 years. Learning proceeds pretty much just as it always has, with lots of hard work and advance planning. But you have no analogue for this. Nothing in your life has given you a context for it. So, you scoff at your teacher’s admonition that you put hard work and effort into your learning. Life just doesn’t work that way, in your experience. Certainly, there must be a way that you can flip a switch, or run to the store, or pop something into an appliance, so that your educational needs are quickly fulfilled, and you can get back to playing and entertaining yourself.

Is there really any possible way that today’s students could be as good as yesteryear’s?

The Independence Fallacy

First, some context:

This article applies to SSOE (Statistical Sampling and Overpayment Extrapolation) as practiced in the Medicare program.  The principle is more generally applicable, but this is the specific use case.

The “Independence Fallacy” arises when a sampling plan is challenged (generally during an overpayment appeal) on the grounds that sample units are not “independent,” and thus the estimation methods employed are claimed to be invalid.

Second, what is the importance of independent sample units?

Most estimation procedures rely on certain theorems in statistics that require normal distributions and “independent, identically distributed” (iid) sample units.  In particular, the challenge will usually relate to the use of the “Central Limit Theorem” (but there are actually several “central limit theorems”) which says that under certain conditions, the estimate will follow a normal distribution (approximately).  Independence is supposed to be one of the conditions required.

It is worthwhile to note that this condition of independence contrasts with what happens in something called a “Markov Chain,” in which each sample unit is a statistical function of the preceding one.  More precisely, the probability distribution of each sample point depends on the value observed immediately before it.  Suppose you measure the speed of a bus at certain points between two stops.  Since the bus certainly goes through a pattern of acceleration and deceleration (though affected by traffic), the probability distribution of the speed at the next measurement point depends on the speed at the last measurement point.  In a scenario with independence, the observed value (or distribution) of any sample point has no relationship with the previous sample unit.  All sample units have the same distribution.
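The contrast can be sketched with a toy simulation (the bus numbers are invented): in the Markov version each reading drifts from the previous one; in the independent version each reading is a fresh draw from one fixed distribution, with no memory.

```python
import random

random.seed(3)

def bus_speeds(n=10):
    """Markov-like: each speed reading depends on the one before it."""
    speed, readings = 20.0, []
    for _ in range(n):
        speed = max(0.0, speed + random.gauss(0.0, 3.0))  # drift from the last value
        readings.append(speed)
    return readings

def iid_speeds(n=10):
    """Independent: every reading is a fresh draw from the same distribution."""
    return [max(0.0, random.gauss(20.0, 3.0)) for _ in range(n)]

print([round(s, 1) for s in bus_speeds()])
print([round(s, 1) for s in iid_speeds()])
```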

The claim that is made in overpayment cases is that sample points (which are individual overpayments) are not independent because some come from the same patient, or the same doctor, or the same day, etc.  In fact, such observations may be correlated within the population.  This correlation can occur if, for some reason, all of one patient’s overpayments, on average, are greater than another patient’s overpayments, due to some special feature of the patient.  In this case the probability distribution of the two patients’ overpayments, prior to payment or even billing, is different.  However, correlation among the population values is not the same as dependence between the sample draws!

And now, things are going to get really technical.  We need to understand what the random variable actually is in this scenario.  The overpayments in the universe are not random variables.  In fact, they are fixed, though unknown, values.*  This is the reason for the phrase “prior to payment or even billing” in the previous paragraph.  Once payment is made, the overpayments become fixed, not random.  They do not have probability distributions.  (There is a distribution of values, but that is not the same thing.) The probability distribution only comes into play through sampling.

“So,” you may be asking, “how do we get a probability distribution out of fixed values?”  Why, the same way that we get it from tossing a die.  The six numbers on a die are fixed.  What has probability is not the actual number itself (or face of the die) but the outcome of a random experiment–tossing the die.  And in the case of tossing a fair die, each number or face has a 1/6 probability of being “chosen.”  If you toss the die six times, there is no dependence from one toss to the next.  Now suppose you toss six separate fair dice, one at a time.  Again, there is no dependence between tosses.  The second example does not have greater independence because it involves six different dice.

And so, the overpayments in the population are like the faces of the die.  They are fixed values with no probability distribution.  If all were reviewed, the result would be an exact number with no probability or statistics involved.  However, we do not review them all.  Instead, we create a random process (toss the die) in which several sample units are selected in sequence.  The values of the sample units are not known in advance, because the identity of the units (in the population) is not determined until the sampling procedure is carried out.  The order of the selected values is also not pre-determined.  This is why the sample units have an identical distribution, even though there can be widely varying values in the population.  Any of the population elements can end up in any of the sample unit positions.  This means the probability distribution of each sample unit is the same as the fixed relative distribution (unknown) of the population, and it is the same for every sample unit.  Therefore, no sample unit’s distribution depends on the previous sample unit, and they are independent.  Any sharing of characteristics relating to the origin of the sample unit is completely irrelevant.
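The “any population element can land in any sample position” claim is easy to check by simulation (the overpayment values below are hypothetical). Note that this checks the identical-distribution part of the argument: every draw position reproduces the population’s relative distribution.

```python
import random
from collections import Counter

random.seed(11)
population = [0, 0, 0, 50, 50, 120]   # fixed, hypothetical overpayment amounts

first, third = Counter(), Counter()
trials = 60000
for _ in range(trials):
    s = random.sample(population, 3)  # a sequential random sample of size 3
    first[s[0]] += 1                  # value landing in the first position
    third[s[2]] += 1                  # value landing in the third position

# Population relative distribution: 0 -> 1/2, 50 -> 1/3, 120 -> 1/6.
print({v: round(first[v] / trials, 3) for v in sorted(first)})
print({v: round(third[v] / trials, 3) for v in sorted(third)})
```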

In conclusion, the idea that sample units in overpayment extrapolations might not be independent is a fallacy, with no possible basis in statistical theory.

*An objection may be raised here, that different reviewers might determine different overpayments.  This is a separate issue, involving “measurement error” and “bias.”  These issues are addressed in the appeals process by re-reviewing claims in question, and do not affect the statistical theory.  Regardless of what different people may decide, there is a “truth” about each overpayment which the review process is intended to uncover.

Medicare Overpayment Extrapolation Consulting

I’ve been reading various websites that purport to give advice about defending against overpayment extrapolations done by various CMS contractors or the OIG. It doesn’t seem like many of them have any actual insider knowledge about how extrapolations are done or what might be helpful during the audit or during the appeals process.

In fifteen years of working with Medicare, I have never had the impression that any CMS contractor is “out to get” providers. On the contrary, I have seen repeated efforts to “bend over backwards” to help providers get into compliance with the Medicare program, sometimes working for years with extremely recalcitrant providers trying to get them to bill properly and stop abusing the Medicare payment system. Thus my first piece of Honest Advice is this: When Medicare tells you that you are doing something wrong, or may be doing something wrong and should internally investigate, LISTEN! This is your first opportunity to head off any possible extrapolation! If you have questions, or are unsure what the issue might be, work with the contractor (or your MAC) to understand the issue. Get medical review experts to advise you on proper billing and documentation.

I know documentation requirements are onerous. This is the government we’re talking about. I know medical professionals are chafing under the documentation load already. I know documentation requirements are raising the cost of medical care and not necessarily improving the quality. These are my opinions, which I think are widely shared in the industry. HOWEVER, you cannot escape the fact that Medicare payment is contingent on proper documentation. If you don’t have the documentation that is required, it is essentially illegal for you to accept payment, and it is an error for the MAC to give you payment (even when no documentation was requested). Time and time again I have seen the argument made that the government did not prove the service was not rendered, nor that it was not necessary. But this is irrelevant–if you did not prove the service was rendered, medically necessary, and covered by Medicare in your documentation, you are not entitled to payment from Medicare. It’s really that simple. In addition, if you knowingly accepted payment in the absence of proper documentation, you are getting into the realm of actual fraud, regardless of whether the actual services were proper and payable in every other way.

Many providers attempt to appeal an extrapolation by challenging the statistics. There are several (questionable) experts out there who make it their business to try to get extrapolations overturned with a variety of spurious arguments. This used to work sometimes, because ALJs were not properly educated about the statistical aspects of these appeals, and, particularly if the contractor did not attend the hearing, they were duped into believing these claims. Things have gotten much tighter now. There are a number of Departmental Appeals Board decisions and Circuit Court decisions that back up and provide legal precedent for extrapolation methods. It is true that mistakes occasionally occur in the extrapolation process, and providers should not neglect to examine the methodology closely. However, they should not rely on this, and should be prepared to accept the decision of a consultant (one with actual Medicare extrapolation experience) that the extrapolation was conducted properly.