A distribution can be thought of as the collection of all the chances of finding a particular configuration of objects.
In the simplest possible distribution, say a jar containing only red balls, the chance of drawing a red ball will always be one. If you add a blue ball to four red balls, the distribution becomes a breakdown by category: 80% red, 20% blue.
Note that in this particular case the order of the categories is arbitrary, and the categories don't have to be numeric in nature. Indeed, thinking about distributions as being related to categories often makes much more sense than thinking about them as being related to numbers, even when the categories are numbers.
So what exactly does such a distribution mean? It means that if I repeat the experiment a hundred times, around 80% of the time I'll pick a red ball, and around 20% of the time I'll pick a blue ball. Note that this tells me nothing about the ball I'll pick on any given draw, and it's possible, with a small enough sample, that I could actually get a blue ball five times in a row. In practice, one way of thinking about a distribution is that if I drew a ball from the sample set one million times, the likelihood is high that I'll tally something like 800,032 red balls and 199,968 blue balls - i.e., around 80% of the tallies will be in the red ball column, and 20% in the blue ball column.
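This is easy to see in a quick simulation. The sketch below (plain Python, seeded for repeatability) draws a million balls from the 80/20 distribution and tallies the results; the counts land near, but almost never exactly on, 800,000 and 200,000.

```python
import random

def simulate_draws(n_draws, p_red=0.8, seed=42):
    """Draw from the 80/20 red-blue distribution n_draws times and tally."""
    rng = random.Random(seed)
    tally = {"red": 0, "blue": 0}
    for _ in range(n_draws):
        color = "red" if rng.random() < p_red else "blue"
        tally[color] += 1
    return tally

tally = simulate_draws(1_000_000)
print(tally)  # roughly 800,000 red and 200,000 blue, but not exactly
```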
When a meteorologist says that there's a 45% chance of rain, what they generally mean is that they (or rather the weather service) ran enough simulations that for every hundred, approximately 45 showed rain while approximately 55 did not. It doesn't mean that only 45% of the ground will get wet during any given day, only that if you run the experiment given the known parameters, 45% of the time you'll need your umbrella.
One additional note: suppose that it's mid-October (northern hemisphere) and you want to know whether you'll have rain or snow. There are actually two different questions here - will precipitation occur, and will the temperature be above or below 32 degrees Fahrenheit (0 Celsius)? These two conditions occur (to a first order approximation) independently of one another.
If you have a 45% chance of precipitation, and the temperature is above freezing about 60% of the time, then the chance of rain is (chance of precipitation) * (chance of above-freezing temperatures) => 45% * 60% => 27%, while the chance of snow is (chance of precipitation) * (100% - chance of above-freezing temperatures) => 45% * (100 - 60)% => 45% * 40% => 18%. Why (100% - chance of above-freezing temperatures)? Because you're asking for the probability that something isn't going to occur, such that if the two are added together, you get a total probability of 100%. You can confirm this by noting that 27% + 18% = 45%, which is the probability that you have some kind of precipitation.
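The arithmetic is compact enough to express in a few lines of Python, using the same 45% and 60% figures:

```python
# Combining independent events by multiplying their probabilities.
p_precipitation = 0.45
p_above_freezing = 0.60

p_rain = p_precipitation * p_above_freezing        # 0.45 * 0.60 = 0.27
p_snow = p_precipitation * (1 - p_above_freezing)  # 0.45 * 0.40 = 0.18

# The two outcomes partition "precipitation", so they add back to 45%.
print(f"rain: {p_rain:.0%}, snow: {p_snow:.0%}")  # rain: 27%, snow: 18%
```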
Seeking out these (mostly) independent variables is a key part of what data scientists do, but they may be dealing with hundreds or even thousands of such variables. When the variables are all independent, probabilities are usually pretty easy to determine. Things get a little more complex when you start dealing with dependent variables. For instance, if I roll three six-sided dice (not getting into D&D games here) where each die is a different color, then the likelihood of getting any given combination (such as 2, 5, 3) is the same - one in 6 * 6 * 6, or 6 to the third power (216).
On the other hand, suppose I asked you to roll three dice, sum them up, and then tell me the likelihood of getting that sum (say, 10). This gets a little trickier. It turns out that there are six permutations of (2,5,3), such as (3,5,2), (2,3,5), (5,3,2), and so forth, but you could also have (1,4,5) and its permutations, along with (2,4,4), (2,2,6), (3,3,4), (6,3,1) and all of their permutations. Altogether this comes out to 27 different configurations that describe the same result (or are in the same category) - they all add up to 10. By contrast, you have only one configuration for a category of three dice adding up to 3 (1,1,1), three configurations for adding up to 4 - (1,1,2), (1,2,1), (2,1,1) - and so forth. Past the middle of the range, the likelihood drops again symmetrically, ending up with only one combination resulting in a value of 18 (6,6,6).
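Rather than counting permutations by hand, a brute-force check over all 216 ordered rolls confirms these counts:

```python
from itertools import product
from collections import Counter

# Tally how many of the 6*6*6 = 216 ordered rolls produce each sum.
counts = Counter(sum(roll) for roll in product(range(1, 7), repeat=3))

print(counts[3], counts[4], counts[10], counts[18])  # 1 3 27 1
```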
This is an example of one of the earliest distributions to be investigated mathematically, closely related to the Binomial Distribution studied by the Swiss mathematician Jacob Bernoulli. The binomial distribution gets its name from the fact that its values appear as the coefficients when you expand the algebraic expression
(a + b) * (a + b) * .... n times total
(i.e., (a + b) raised to the nth power). The dice counts come from a very similar expansion: multiply out (x + x^2 + x^3 + x^4 + x^5 + x^6) three times over, and the coefficient of each power of x is exactly the number of configurations for the corresponding sum, across the sixteen categories from 3 to 18.
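One way to check the dice counts is to expand the polynomial (x + x^2 + ... + x^6)^3 directly. Multiplying coefficient lists is just convolution, which a short helper (plain Python, no libraries) can do:

```python
# Multiply two polynomials represented as coefficient lists (convolution).
def poly_mul(a, b):
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

die = [1] * 6  # coefficients of x + x^2 + ... + x^6 (one per face)
three = poly_mul(poly_mul(die, die), die)

# three[k] is the number of three-dice configurations summing to k + 3.
print(three)  # [1, 3, 6, 10, 15, 21, 25, 27, 27, 25, 21, 15, 10, 6, 3, 1]
```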
What's even more intriguing about this distribution, which looks vaguely bell-like, is that if you use four dice, you get much the same shape of distribution, just for the numbers between 4 and 24. As you ramp up the number of dice, the shape becomes ever more refined, to the extent that with a large enough number of dice, the (suitably scaled) distribution converges toward a continuous curve. This also holds if the dice have eight faces or twenty faces instead of six. This limiting curve is known as the Gaussian Distribution (after Carl Friedrich Gauss, who worked out much of its mathematical underpinnings and implications), though it is also often referred to as the Normal Distribution.
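This convergence can be seen numerically. A rough sketch: compute the exact distribution of the sum of n dice by repeated convolution, then compare it with the Gaussian density that has the matching mean (3.5 per die) and variance (35/12 per die). By 30 dice, the two agree to within a fraction of a percent.

```python
from math import exp, pi, sqrt

def dice_sum_probs(n):
    """Exact probability of each sum of n six-sided dice, via convolution."""
    probs = [1.0]
    for _ in range(n):
        nxt = [0.0] * (len(probs) + 5)
        for i, p in enumerate(probs):
            for face in range(6):
                nxt[i + face] += p / 6.0
        probs = nxt
    return probs  # probs[k] = P(sum == k + n)

n = 30
probs = dice_sum_probs(n)
mean, var = 3.5 * n, 35.0 * n / 12.0  # mean and variance of n dice

# Largest gap between the exact probability and the Gaussian density.
worst = max(
    abs(p - exp(-((k + n - mean) ** 2) / (2 * var)) / sqrt(2 * pi * var))
    for k, p in enumerate(probs)
)
print(worst)  # well under 0.001 at n = 30
```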
The Gaussian Distribution is frequently used to describe populations in which you have multiple independent variables that each contribute to an overall aggregate, such as a person's height. There are genetic factors that may indicate whether a given person is short or tall, and there are also environmental factors. Collectively, the sum of these factors will make most people neither very short nor very tall, but rather somewhere in the middle.
The likelihood that a person is at most a given height (assuming nothing else is known about that person) then follows a Gaussian distribution: the area under the curve up to that height, divided by the total area, is the probability that a person is that height or shorter. The peak of the curve represents the mean (that is, the average) value, which for a symmetric curve is also where the population splits in half. Thus, if the mean is at 5'11" for men, then half of the population will be shorter than this and half will be taller.
Statisticians also introduce the concepts of variance and standard deviation. The variance determines the spread of the distribution, with the standard deviation being the square root of this value. In the normal distribution, one standard deviation encompasses about 34% of the total area from the mean. Put another way, if the mean is considered to be 5'11" and one standard deviation is six inches, then one standard deviation above the mean is 6'5", which implies that about 84% of all men are 6'5" or shorter. Anyone 6'11" would be two standard deviations above the mean (about 97.7% of the population at or below that height), and someone 7'5" would be three standard deviations (about 99.9%). Put another way, only about one tenth of one percent of the population will be over 7'5". Similarly, anyone who's 5'5" would be one standard deviation in the other direction, with 4'11" and 4'5" rounding out the distribution.
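These percentages come straight from the normal cumulative distribution function, which Python's standard library exposes via the error function. A quick sketch using the article's example figures (mean 71 inches, standard deviation 6 inches):

```python
from math import erf, sqrt

def fraction_below(z):
    """Fraction of a normal population at or below z standard deviations."""
    return 0.5 * (1 + erf(z / sqrt(2)))

mean_in, sd_in = 71, 6  # 5'11" mean, 6" standard deviation (as in the text)
for height, label in [(77, "6'5\""), (83, "6'11\""), (89, "7'5\"")]:
    z = (height - mean_in) / sd_in
    print(f"{label}: {fraction_below(z):.1%} at or below")
# 6'5": 84.1%, 6'11": 97.7%, 7'5": 99.9%
```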
Normal distributions occur all over the place. They occur frequently in population dynamics where aggregate characteristics create variability in a population. This same characteristic makes them very useful for getting an idea about how accurately a given survey represents the total population. To understand this, think about political positions. In general, for a large population, determining the average intent of that population is easy in principle and very difficult in practice. This is the case, for example, in determining which among the candidates for a political office most closely represents the intent of the people. This is typically done by holding an election, with the candidate receiving the most votes winning.
However, elections are expensive, complex, and usually scheduled at infrequent intervals. Because those same candidates (and their supporters) want to better understand the electorate (to appeal to the broadest numbers during the election), they or others conduct polls or surveys involving far fewer people. This smaller group, called a sample, can be polled, with the likelihood that any given sample is representative of the overall population also following a Gaussian curve. The resulting uncertainty is typically called the margin of error. It doesn't mean that the survey itself was wrong. Rather, it indicates the uncertainty that the given sample is truly representative, which is typically a function of sample size, but occasionally is due to bias in choosing the sample.
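For a simple random sample, the standard 95% margin of error on an estimated proportion can be sketched in one line. (This assumes unbiased sampling; real polls also weight and adjust, which this ignores.)

```python
from math import sqrt

def margin_of_error(p, n, z=1.96):
    """95% margin of error for an estimated proportion p from n respondents."""
    return z * sqrt(p * (1 - p) / n)

# A 1,000-person poll on a roughly 50/50 question: about +/- 3 points.
print(f"{margin_of_error(0.5, 1000):.1%}")  # 3.1%
```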
Again going back to the weather example, most experiments (runs of given models) will tend to cluster around the mean, but if there's a lot of uncertainty in the data, the variability between the extreme cases in those experiments can mean that confidence in the results is low. A pollster would say that the results are within the margin of error, with that margin usually also following a Gaussian distribution. The broader election will then likely fall somewhere within the margin of error (or, put another way, within say two standard deviations either way).
The 2016 US presidential election gives a good indication of what margin of error means in the real world. The two candidates, Hillary Clinton and Donald Trump, showed up in most polls as being roughly 2 percentage points apart. Given the partisan nature of the race, a 2-point lead is actually pretty good, but the margin of error was surprisingly large - about 3.5%. What that meant in practice was that there was about a 30% chance that the poll numbers were uncertain enough that Trump could eke out a win. Now, a 30% chance is not great odds, but it means that if you were to simulate the race 100 times, Trump would have won in thirty of them. It turned out that the actual outcome was one of those races.
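A back-of-the-envelope version of that calculation: assume the 3.5% margin of error applies to each candidate's share, so the gap between the two candidates carries roughly twice that uncertainty, and treat the margin of error as a 95% (1.96 standard deviation) interval. These are simplifying assumptions, not the actual aggregation models pollsters use, but they land in the same ballpark.

```python
from math import erf, sqrt

lead = 2.0       # Clinton's polling lead, in percentage points
moe_share = 3.5  # margin of error on each candidate's share

# The gap between two shares is roughly twice as uncertain as either share;
# a 95% margin of error corresponds to 1.96 standard deviations.
sigma_gap = 2 * moe_share / 1.96

# Probability that the true gap was actually below zero (a Trump win),
# using the normal CDF.
p_upset = 0.5 * (1 + erf((0 - lead) / (sigma_gap * sqrt(2))))
print(f"{p_upset:.0%}")  # about 29%, in line with the ~30% figure above
```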
In the aftermath, there was a fair amount of argument about whether the polls themselves were reliable. As it turned out, many of the better polls acknowledged the same thing - the margin of error was high in that race, given how tight it was. Moreover, the polls were accurate about the fact that Clinton was the popular vote winner (by about two percentage points), but because the races in three battleground states were so close, and due to the complications introduced by the nature of the Electoral College, she lost in terms of electoral votes.
While the Normal Distribution is symmetric, there is a class of distributions that look similar but are skewed in one fashion or another. These usually arise in cases where the median value (the point where 50% of the curve lies to either side) does not correspond to the mean of the curve (the variance is asymmetric). Intelligence test scores occasionally follow this kind of distribution: it's actually pretty rare to find people who would score below about 70 on a Stanford–Binet type IQ test, but it's not at all unusual to have people score significantly above 130.
There are a number of other types of distributions, including logistic and power-law distributions, but I'm going to save these for separate articles, as they tend to have interesting characteristics that are worth exploring in depth.
Kurt is the founder and CEO of Semantical, LLC, a consulting company focusing on enterprise data hubs, metadata management, semantics, and NoSQL systems. He has developed large scale information and data governance strategies for Fortune 500 companies in the health care/insurance sector, media and entertainment, publishing, financial services and logistics arenas, as well as for government agencies in the defense and insurance sector (including the Affordable Care Act). Kurt holds a Bachelor of Science in Physics from the University of Illinois at Urbana–Champaign.