### Binary compression of continuous data

“Which am I supposed to use: mean or median?”

Early on in my stats classes, we talk about how to describe the distribution of a piece of numerical information — things like height, weight, age, income, and so on that can vary continuously. A good description should cover three aspects of the distribution:

• A measure of center — this is a single number that somehow is supposed to be in the middle of the range of values. Measures of center include means (averages), medians (halfway points), and modes (most common values).
• A measure of spread — this is a number that represents how widely the values range from your measure of center: a distribution with a small spread will be tightly clustered around a single value, while a distribution with a large spread will have a wide range of values. If you’re measuring the center with the mean, you’d often use standard deviation to measure the spread, while if you’re measuring the center with the median, you might use something like interquartile range instead.
• A description of the shape of the distribution: is it symmetrical? bell-shaped? skewed to the right or left? Does it have one peak, multiple peaks, or no clear peak?

My students often want to know which is the best measure of center — mean, median, mode, or something else? I tell them, different measures of center have different strengths and weaknesses, but there’s no way to perfectly capture all of the rich information in the whole dataset with just one number. That’s why we look for multifaceted descriptions that also describe the spread and shape!

But the question got me thinking: asking for a measure of center is a little like asking for the best way of collapsing the whole distribution into one single representative value. That definitely loses a lot of information, but what if we try to lose just a little less — say, by collapsing the distribution into two representative values? The result would be a “binary” distribution: one that only takes on two distinct values with some proportion each.

For example, the people in my daughter’s daycare building have an age distribution that looks something like this:

There are a lot of young kids under 10 years old, and then a few adult staff members of various ages. I’d be hard pressed to use a single number to capture the “center” of this distribution — the mean might be around 15 (not a typical age for either kids or adults), the median would be about 5 (the ages of the oldest kids), and the mode would be 3 or so (completely ignoring the adult staff). But it’s not so hard to compress the distribution into just two ages:

It’s not such a stretch to say something like “Of the people at the daycare, 75% are around 3 years old, and the rest are around 35 years old.”

This makes especially good sense for the daycare, where there are already two groups and we can just look at the averages within each group. But it turns out there’s a precise way to approximate any* continuous distribution with a binary version! Here’s how:

To specify a binary distribution, you need three pieces of information: the two values it takes, and what fraction take each value. (The two fractions have to add up to 100%, so they count as just one piece of information.) We can adjust those three pieces of information so that the binary distribution matches the original in terms of its center, spread, and shape — more specifically, there’s a unique binary distribution that has…

• the same mean (average value)
• the same standard deviation (square root of average squared distance from the mean)
• and the same skewness** (average cubed number of standard deviations above the mean)

as the original distribution. “Skewness” here is a measure of shape: a positive skewness means that the distribution has more unusually large values than unusually small ones, and vice versa for negative skewness:

Here’s how to calculate the matching binary approximation, given the continuous distribution’s mean $\mu$, standard deviation $\sigma$, and skewness $\gamma$.

• First, we’ll calculate what proportions belong to each of the two groups. In terms of the skewness $\gamma$, the proportion belonging to the first group is

$p = \dfrac12 + \dfrac{\gamma}{2\sqrt{4+\gamma^2}}$

and the proportion belonging to the second group is

$q = \dfrac12 - \dfrac{\gamma}{2\sqrt{4+\gamma^2}}$

These add up to $1$, like you’d hope.

• Next, we can use $p$ and $q$ to calculate what value each group has, in terms of the desired mean and standard deviation $\mu$ and $\sigma$. The first group has value $\mu - \sigma \sqrt{q/p}$, and the second group has value $\mu + \sigma \sqrt{p/q}$.
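The two steps above can be sketched in Python. This is a minimal sketch, not the spreadsheet mentioned below: the function names `moments` and `binary_approximation` are made up for illustration, the sample ages are invented, and the moments use the population (divide-by-$n$) formulas, which is an assumption about how the mean, standard deviation, and skewness are computed from raw data.

```python
import math


def moments(data):
    """Return (mean, standard deviation, skewness) of a list of numbers.

    Assumes population (divide-by-n) formulas and a nonzero spread.
    """
    n = len(data)
    mu = sum(data) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in data) / n)
    # Skewness: average cubed number of standard deviations above the mean.
    gamma = sum(((x - mu) / sigma) ** 3 for x in data) / n
    return mu, sigma, gamma


def binary_approximation(mu, sigma, gamma):
    """Return ((p, low), (q, high)) for the matching two-value distribution."""
    shift = gamma / (2 * math.sqrt(4 + gamma ** 2))
    p = 0.5 + shift  # proportion in the first (lower-valued) group
    q = 0.5 - shift  # proportion in the second (higher-valued) group
    low = mu - sigma * math.sqrt(q / p)
    high = mu + sigma * math.sqrt(p / q)
    return (p, low), (q, high)


# Invented ages, loosely in the spirit of the daycare example (not real data).
ages = [1, 2, 2, 3, 3, 3, 4, 4, 5, 28, 35, 41]
mu, sigma, gamma = moments(ages)
(p, low), (q, high) = binary_approximation(mu, sigma, gamma)
print(f"{p:.0%} around {low:.1f}, {q:.0%} around {high:.1f}")
```

Note that a positive skewness makes $p$ larger than $q$, so the bigger group is the one with the lower value, and the reverse happens for negative skewness. With zero skewness the result is an even split at $\mu - \sigma$ and $\mu + \sigma$.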

This isn’t hard to implement in a spreadsheet, so I tried it out on two real-life datasets. Here’s a rough graph of the distribution of how long people live in the US:

The average life expectancy is about $78$ years, but that’s a kind of compromise between the most common lifetime in the high $80$s and the long tail of people who die much earlier. (Notice also the tragically high rate of death in the first year of life; I’m glad my daughter has made it past that now.) If we use the binary approximation, then we get two groups: $82\%$ who survive to about $86$ years old, and $18\%$ who die around $43$ years old:

I find this less misleading to think about than a single average, but more manageable to process than the entire distribution!
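As a rough sanity check, the rounded figures above can be plugged back in to see what mean, standard deviation, and skewness they imply. These numbers are back-solved from the rounded percentages and ages in this post, not taken from the underlying data, so treat them as approximate. Writing $a = 43$ and $b = 86$ for the two ages, with proportions $p = 0.18$ and $q = 0.82$:

$\mu = pa + qb = 0.18 \cdot 43 + 0.82 \cdot 86 \approx 78.3$

$\sigma = (b - a)\sqrt{pq} = 43\sqrt{0.18 \cdot 0.82} \approx 16.5$

$\gamma = \dfrac{p - q}{\sqrt{pq}} = \dfrac{0.18 - 0.82}{\sqrt{0.18 \cdot 0.82}} \approx -1.67$

The mean agrees with the life expectancy of about $78$ years, and the negative skewness matches the long tail of early deaths. The expressions for $\sigma$ and $\gamma$ come from rearranging the formulas for the two groups' proportions and values.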

Here’s another example: US household incomes. I had a tougher time getting good data about what fraction of US households have each size income, but here’s a rough mockup from an income percentile calculator:

It’s a little jagged because of rounding error, but the peak around $30K, and the long tail of a few people making more than ten times that, both show up clearly. When we apply the binary approximation, we get a group of $88\%$ making around $56K, and a second group of around $12\%$ making around $347K:

I find it much easier to grasp a society with most people making $50K a year and a small group making $350K than to hold the entire income distribution in my head. You still lose some information this way (you miss the substantial fraction of households living below the poverty line, as well as the few super-rich who would be far off the right-hand side of the plot), but you don’t lose nearly as much as you do when you just say the average US household income is around $96K.

I have a few lingering questions that maybe you can help me with:

• What are some more examples of real-life distributions and their binary approximations? Does it ever go wrong and give results that don’t seem particularly meaningful?
• Is there a nice class of distributions for which it always works out that the larger group falls close to the mode, and the smaller group is partway along the tail, like in these two examples?
• In the daycare example, there were two real groups of people that underlie the binary approximation. If you just start with continuous data, is there always some way of getting this binary approximation by sorting the individuals into two groups and then taking the average of each group?
• Does this method of binary approximation appear anywhere else? I’ve looked, but as far as I can tell it’s original to me.

Thanks for reading! If you have thoughts or questions, I’d love to hear them in the comments.

*A continuous distribution can, in theory, fail to have a mean, standard deviation, or skewness if the defining integrals don’t all converge. For example, the Cauchy distribution is symmetric with a median at $0$, but it doesn’t have a mean. This is almost never an issue for real-life datasets.

**There are several definitions of skewness; this is Pearson’s moment coefficient of skewness, also called the third standardized moment.