Distributions and Summary Statistics

Note
There are no specific requirements for this tutorial. You will benefit from an understanding of the programming language R, but you can also pick up the basics right here!

Overview Link to heading

Our understanding of reality in the modern world is largely mediated through the analysis and interpretation of data. COVID-19, climate change, artificial intelligence, mass claims — many complex phenomena cannot be properly understood or managed without assistance from massive data sets. At this point the purely literary methods of traditional doctrinal analysis reach their limit.

Beyond a certain size, data sets are no longer intelligible to humans.1 This threshold is rather low: even a few dozen data points can be too many to grasp at once. This is why data sets are approached with statistical methods, in order to summarize and reduce them to a form that is understandable and usable by humans.

At first glance, summary statistics appear to offer a persuasive and objective mathematical clarity. The arithmetic mean (also ‘average’ or just ‘mean’) is one of the most popular summary statistics. The mean is easy to understand, easy to apply and widely used. Regrettably, the mindless application of the mean often leads to biased results, ignorance of diversity and an exclusive focus on unrealistic ideals (recall the ‘average citizen’ rhetoric in political discourse).
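To see how fragile the mean can be, consider a minimal sketch with made-up numbers: a single extreme value drags the mean far away from where most of the data lives, while the median barely moves.

# Hypothetical incomes in thousands; the last value is an extreme outlier
incomes <- c(30, 32, 35, 38, 40, 5000)

mean(incomes)    # 862.5, heavily distorted by the single outlier
median(incomes)  # 36.5, still describes a typical value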

In this tutorial we will discuss several methods of summarizing data. First with synthetic data, then with a ‘real’ legal dataset:

  • Summary statistics (mean, median, quartiles)2
  • Distributions (histogram, density diagram, box plot)

Distributions are better suited to summarizing data fairly because less information is lost compared to summary statistics. Additionally, in data sets with a large number of data points there is often not just the one characteristic summary statistic, but multiple characteristic ranges and tendencies.

Variability and Uncertainty Link to heading

Distributions model two concepts much better than summary statistics: variability and uncertainty.

Variability is everywhere in the world. Even if average male height in the United States is 1.75 m, the actual range and distribution of heights in the country is very diverse. Many males may be close to 1.75 m, but some will be taller or shorter, and fewer still will be much taller or much shorter. The difference in mean height between men and women is also relevant and determines the position of the entire distribution.

Uncertainty is present when our knowledge of reality is limited, irrespective of its variability. For example, a bank robber may measure exactly 1.90 m in height, but witnesses may only be able to describe him as ‘between 1.80 m and 2.00 m’ because of the stressful situation and bias in human perception. Maybe they would simply describe him as ‘very tall’.

Variability is a property of reality (ontological), uncertainty is a property of knowledge (epistemic). Both can occur at the same time. The calculation and visualization of distributions allows us to confront variability and uncertainty in a rational manner.

Note
Qualitative judgments can be rational, too. However, this tutorial primarily deals with quantitative methods. We will also focus on numeric data. I will create a separate tutorial for categorical data at some point.

Preparation Link to heading

We will use the programming language R to generate, analyze and visualize data. All of the code is presented so that you can enter it line-by-line into your R console and execute it yourself.

Give it a try! Do not just read the tutorial, but try to execute the code as well! You will be able to verify the results, better understand the details and experiment for yourself a little. Programming is only enjoyable when you do it yourself. Reading by itself is never enough.

Tip

I recommend using WebR for this tutorial.

WebR is a web application that runs directly in your browser and executes R code locally in a sandbox on your machine, not in the cloud. You do not have to install anything.

Of course you can also execute the code from the tutorial in a local integrated development environment (IDE) such as RStudio or in the cloud, for example with Posit Cloud.

Freeze Random Numbers Link to heading

set.seed(999)
Info

Random Number Generator

The function set.seed(999) freezes the initialization value for the random number generator. If you use the same initialization, you should generate the same random numbers that I did. You can also ignore this line. Your results and diagrams will look slightly different, but this will not affect your learning experience.
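A minimal sketch of what ‘freezing’ means in practice: after setting the same seed, repeated runs produce identical ‘random’ numbers.

set.seed(999)
rnorm(3)  # three random numbers

set.seed(999)
rnorm(3)  # the exact same three numbers again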

Generate Data Link to heading

The lines of code below generate 1,000 data points each for three different classic stochastic distributions. These data points could have come from an empirical study or an opinion poll. I chose the number 1,000 because many representative opinion polls aim for a sample size of approximately 1,000 respondents and this will help you develop a feeling for statistical practice.

Here we practice with three idealized distributions to simplify and exemplify methodology:

  • Simplification, because we can avoid all the code for downloading and ingesting data
  • Exemplification, because each of these distributions possesses certain characteristic properties that exemplify typical problems with analysis and interpretation

We will analyze these distributions soon, I promise!

Normal Distribution Link to heading

normal <- rnorm(1000, mean = 100, sd = 15)

Log-Normal Distribution Link to heading

lognormal <- rlnorm(1000, meanlog = 4.47, sdlog = 0.5)

Beta Distribution Link to heading

beta <- rbeta(1000, 0.2, 0.2) * 190

Show Values of Distributions Link to heading

Let us attempt to display the values for each distribution. Displaying raw values is a good idea for small data sets (and sometimes for larger ones as well) because you may be able to immediately recognize certain patterns and develop hypotheses or ideas for further exploration of the data. If there are serious data errors you may be able to recognize them at this stage as well.

The bracket suffix [1:50] limits the display to the first 50 data points. I chose this limitation to avoid making you scroll through an ocean of data to reach the next section. With larger data sets, displaying all data points will flood the R console and the interface may freeze or crash. Fortunately, this is improbable with only 1,000 data points. Unless you have a very, very old computer.
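As an aside, base R also offers head() and tail() for the same purpose; a small sketch:

head(normal, 10)  # the first 10 data points
tail(normal, 10)  # the last 10 data points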

Show Normal Distribution Values Link to heading

print(normal[1:50])
##  [1]  95.77390  80.31161 111.92776 104.05106  95.84040  91.50964  71.82013
##  [8]  80.99813  85.48375  83.18486 119.88196 102.00966 114.08124 102.58807
## [15] 114.36476  79.55971 101.02503 101.50986 113.52017  68.88464  81.57155
## [22] 109.64566  94.60356 104.41053  83.12097 109.63398  83.39894  86.72739
## [29]  76.68857  98.09982 135.73996 109.01914 102.69042 116.20797  96.29782
## [36]  68.29395  94.44209 107.84302 107.76708  78.96234  92.71545 100.12747
## [43]  80.76830  83.32632 104.50998 104.14718  69.23684 100.21285 108.73400
## [50]  99.47910
Tip

Try displaying the data without the limit! Instead of print(normal[1:50]) simply enter print(normal), or choose limits other than 1:50!

Only once you have accidentally flooded your console will you realize how futile the manual inspection of some data sets can be. At least your system won’t crash from displaying a mere 1,000 data points.

Show Log-Normal Distribution Values Link to heading

print(lognormal[1:50])
##  [1]  85.12648 257.71340  48.41812 140.10341 136.86963  34.38457 151.97229
##  [8] 169.33980  84.81856  50.08362 138.04280 169.14846 186.62926 127.13527
## [15]  44.42033  51.78645 107.27701 221.26954  66.90303  42.03478  96.06316
## [22] 107.17145  51.65440  41.74021  65.47462 211.78821  41.83182  97.50702
## [29]  60.25510 107.93089  67.23228 202.07011 160.05965  74.40073 111.34302
## [36] 131.56600  70.06619  72.66396 115.78188  82.68581  75.36887 118.73016
## [43]  65.71656  99.97671  60.43055 176.53575  61.78021  72.46802  98.20388
## [50]  49.97018

Show Beta Distribution Values Link to heading

print(beta[1:50])
##  [1] 1.820029e+02 1.180641e+01 1.899884e+02 1.363032e+02 1.367371e+01
##  [6] 1.092780e+02 2.324482e+01 1.790829e+02 4.682861e+01 1.899997e+02
## [11] 1.029877e+02 2.068828e+00 1.758332e+02 1.136942e+01 1.895999e+02
## [16] 1.461639e+01 1.801766e-02 4.144685e-05 1.899997e+02 1.847211e+02
## [21] 4.433401e+01 1.027253e+02 7.646949e+01 1.664601e+02 1.843718e+01
## [26] 1.885473e+02 1.644399e+02 2.383747e-01 2.190459e-05 2.669806e+01
## [31] 1.788748e+02 7.946888e+00 1.198917e+01 1.900000e+02 1.840867e+02
## [36] 1.718406e+02 1.323951e+02 9.033565e+01 1.833539e+02 1.900000e+02
## [41] 1.899991e+02 1.898603e+02 8.795029e+01 1.154381e+01 7.248238e-02
## [46] 1.022053e+01 6.038260e+01 9.061349e-03 1.872870e+01 1.076143e+02

Result Link to heading

A real mess. No insights, right? As a matter of fact, there are patterns in the data; we just have no hope of recognizing them in this jumbled confusion of raw values.

Histograms Link to heading

First order of business for any new data set: visualize it. Histograms are one of the most important visual tools for summarizing quantitative data.

Histograms divide the full range of a quantitative variable into evenly sized bins and count the number of data points in these bins. Bins could also be described as ‘intervals’ or ‘subdivisions’.

Histograms are a great way to get a first impression of the distribution of a quantitative variable.
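To make the binning concrete, here is a minimal sketch of what a histogram computes under the hood, using the base R functions cut() and table(). Note that hist() treats breaks = 20 only as a suggestion and may pick a slightly different number of bins.

# Divide the range of 'normal' into 20 evenly sized bins
bins <- cut(normal, breaks = 20)

# Count the data points per bin: these counts are the bar heights
table(bins)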

Histogram: Normal Distribution Link to heading

hist(normal, breaks = 20)
Tip
The parameter breaks = 20 sets the number of bins to be displayed. Try parameter values that are different from 20 and observe how the histogram changes!

We immediately see the typical signs of a normal distribution:

  • Symmetry around the mean
  • A great many data points in the center
  • Progressively fewer data points as we move farther away from the center

A normal distribution has only one peak: we call it unimodal.

Due to its bell shape the normal distribution is also known as the ‘bell curve’ or the ‘Gauss distribution’ (named after Carl Friedrich Gauss). It is often (but not always!) observed in nature. For example, human height is approximately normally distributed.

The Intelligence Quotient (IQ) is modeled as a normal distribution with a mean of 100 and a standard deviation of 15. In the case of IQ the normal distribution is due to convention and continuous standardization.3 If you look closely at the parameters above you will notice that I chose this normal distribution with mean 100 and standard deviation 15 as our study example.

Info

Standard Deviation

The standard deviation measures the typical distance of data points from the overall mean. In other words: each data point lies at a certain distance from the mean, and the standard deviation summarizes these distances (strictly speaking, as the square root of their mean square rather than their plain mean). We will not spend much time on it here, but the standard deviation is an important statistic and is affected by similar problems as the regular mean.
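A short sketch of that definition in code. Note that R’s sd() uses the sample version, dividing by n − 1 rather than n:

# Sample standard deviation computed from its definition (n - 1 in the denominator)
sqrt(sum((normal - mean(normal))^2) / (length(normal) - 1))

# The built-in function yields the identical result
sd(normal)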

Histogram: Log-Normal Distribution Link to heading

hist(lognormal, breaks = 20)

The log-normal distribution with our chosen parameters is visibly different from the normal distribution. Specifically: it is skewed. When the bulk of the data sits on the left and the long tail stretches out to the right, as it does here, we call the distribution ‘right-skewed’. Odd, but the skew is named after the tail, not the peak.

Note
The skew of a log-normal distribution depends on its parameterization. I selected very specific parameters to construct this specific type of skew.
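Incidentally, the name is no accident: a log-normal distribution is one whose logarithm is normally distributed. You can check this on our data with one line:

hist(log(lognormal), breaks = 20)  # the familiar bell shape reappears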

Histogram: Beta Distribution Link to heading

hist(beta, breaks = 20)

The shape of the beta distribution in our example is wholly different from that of the normal and log-normal distributions. In particular: there are two peaks, so we call the distribution bimodal.

Beta distributions can take rather exotic forms. Take a look at the different parameterizations on Wikipedia.
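If you would like to see some of this variety without leaving the R console, here is a small sketch; the parameter values are merely illustrative choices:

hist(rbeta(1000, 0.2, 0.2), breaks = 20)  # U-shaped and bimodal, as in our example
hist(rbeta(1000, 5, 5), breaks = 20)      # symmetric and unimodal
hist(rbeta(1000, 2, 8), breaks = 20)      # right-skewed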

Recap: Histograms Link to heading

Three different distributions, three markedly different shapes.

The mean is a reasonably good description of the central tendency of a normal distribution, a not-so-good description of this particular log-normal distribution and a terrible fit for this particular beta distribution.

Many failures to interpret the mean appropriately rest on the implicit assumption that it represents the center of a normal distribution and is reasonably representative of the data (the famed ‘average citizen’). Even where the underlying distribution really is at least approximately normal, over-reliance on the mean may cause significant day-to-day problems for non-average people (consider tall people and the ever-shrinking seats in economy class on airplanes).

However, there often is no underlying normal distribution. Perhaps the data follows a beta distribution, or a log-normal distribution, or a gamma distribution or something even more exotic. Unless you know for certain, be cautious of assuming a normal distribution when interpreting the mean.

Info

Families of Distributions

Not all normal, log-normal and beta distributions look as I have shown them here. In fact, these are families of distributions, and an individual distribution must be specified by choosing parameters such as the mean or the standard deviation. For example, a log-normal distribution with a small sdlog parameter is only mildly skewed, while one with a large sdlog is extremely skewed.

Density Diagrams Link to heading

Density diagrams are refinements of histograms. You could call them fluid histograms, where bins continuously flow into each other. I usually prefer density diagrams over histograms in my own work, because then I don’t have to worry about the number of bins and the results are quickly interpretable.

Density Diagram: Normal Distribution Link to heading

plot(density(normal))
Tip
You can mostly ignore the y-axis when interpreting a density diagram. It is scaled such that the total area under the curve equals 1, meaning the area contains 100% of the data.
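As a sanity check, here is a rough sketch that numerically approximates the area under the estimated density curve; the result should be very close to 1:

d <- density(normal)

# density() evaluates the curve at equally spaced points,
# so a simple rectangle sum approximates the integral
sum(d$y) * diff(d$x[1:2])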

Density Diagram: Log-Normal Distribution Link to heading

plot(density(lognormal))

Density Diagram: Beta Distribution Link to heading

plot(density(beta))

Recap: Density Diagrams Link to heading

As with the histograms, we see a symmetric distribution with a single peak for the normal distribution, then an asymmetric right-skewed log-normal distribution and finally a symmetric but bimodal beta distribution.

Summary Statistics Link to heading

You can quickly calculate six key statistics for a data set with the convenient function summary(). It provides the following output:

  • Minimum: The smallest value in the data set.
  • 1st Quartile: 1/4 of values are smaller than this statistic, 3/4 are larger.
  • Median: Half of all data points are smaller than the median, half are larger.
  • Mean: The sum of all data points, divided by the number of data points.
  • 3rd Quartile: 3/4 of values are smaller than this statistic, 1/4 are larger.
  • Maximum: The largest value in the data set.
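Each of these statistics can also be computed individually; a short sketch using the base R functions quantile() and mean():

quantile(normal, probs = c(0, 0.25, 0.5, 0.75, 1))  # minimum, quartiles, median, maximum
mean(normal)                                        # the mean, which quantile() does not cover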
Info

Five Number Summary

summary() shows six statistics. Sometimes data sets are described with a ‘five number summary’ instead, which can be calculated with fivenum() in R. We rely on summary() in this tutorial because it includes the mean. Note that fivenum() computes Tukey’s hinges, which can differ slightly from the quartiles reported by summary().
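A quick sketch to compare the two functions on our data:

summary(normal)  # six statistics, including the mean
fivenum(normal)  # minimum, lower hinge, median, upper hinge, maximum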

Summary Statistics: Normal Distribution Link to heading

summary(normal)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   53.90   89.42   99.86   99.51  109.55  145.92

The mean and median for a normal distribution are almost identical. Either represents a reasonable description of its center. This is the ideal that people tend to have in mind when interpreting the mean or median.

Summary Statistics: Log-Normal Distribution Link to heading

summary(lognormal)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   20.93   63.22   88.01  101.34  128.54  433.77

Mean and median for the log-normal distribution differ notably. A review of the histogram and density diagram shows that the median is a better fit for the center of the distribution. The mean is biased upwards because of outliers to the right.
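One common, if blunt, countermeasure is the trimmed mean, which discards a fraction of the most extreme values at both ends before averaging. Base R supports this via the trim parameter of mean():

mean(lognormal)              # ordinary mean, pulled upwards by the right tail
mean(lognormal, trim = 0.1)  # mean after removing 10% of values at each end
median(lognormal)            # for comparison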

Summary Statistics: Beta Distribution Link to heading

summary(beta)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   6.764  98.912  96.506 184.584 190.000

For this particular beta distribution neither the mean nor the median is an adequate measure to describe the key characteristics of the distribution. A bimodal distribution with two peaks at either end cannot be properly summarized with a single measure.

Recap: Summary Statistics Link to heading

We have created visuals and calculated summary statistics for three different distributions. Based on a comparison of diagrams and summary measures we now understand how the quality of measures of central tendency is dependent on the shape of the distribution.

Box Plots Link to heading

Box plots (or ‘box-and-whiskers plots’) are compact diagrams based on many of the summary statistics we just calculated with summary(). Since they are not entirely intuitive, the list below explains their key features:

  • 1st Quartile: Left border of the box
  • Median: Bold line in the middle of the box
  • 3rd Quartile: Right border of the box
  • $1.5 \times IQR$: Defines the reach of the ‘whiskers’
  • Outliers: Individual data points beyond the whiskers
Info
IQR is the inter-quartile range, i.e. the range between the first and third quartiles. This range is usually multiplied by 1.5 to define the reach of the whiskers. Note that this multiplier can usually be set to an arbitrary value in software; in R’s boxplot() it is controlled by the range parameter.
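Here is a small sketch deriving the whisker boundaries by hand; the whiskers themselves extend only to the most extreme data points that still lie inside these boundaries:

q <- quantile(normal, probs = c(0.25, 0.75))
iqr <- IQR(normal)  # the inter-quartile range, identical to q[2] - q[1]

q[1] - 1.5 * iqr  # lower whisker boundary
q[2] + 1.5 * iqr  # upper whisker boundary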

Box Plot: Normal Distribution Link to heading

boxplot(normal, horizontal = TRUE, boxwex = 0.3)

For the normal distribution a box plot offers good results. The symmetric shape and the median as a description of the center create an appropriate visual.

Note
boxwex = 0.3 simply defines the size of the displayed boxes. It has no further substantive meaning.

Box Plot: Log-Normal Distribution Link to heading

boxplot(lognormal, horizontal = TRUE, boxwex = 0.3)

The box plot is also a good visual for the log-normal distribution. This is primarily due to the center of the box being defined by the median, not by the mean.

Box Plot: Beta Distribution Link to heading

boxplot(beta, horizontal = TRUE, boxwex = 0.3)

The beta distribution is again the distribution that breaks the plot. This is not surprising, as the individual summary statistics produced by summary() aren’t very useful either. A bimodal distribution simply cannot be captured well by a diagram based on these statistics.

Recap: Box Plots Link to heading

Box plots produce good results for the normal and log-normal distributions. The beta distribution diagram is mostly useless. Unsurprising, but an important lesson: always try different types of diagrams.

Practice: US Judge Ratings Link to heading

About the Data Set Link to heading

The data set USJudgeRatings comes pre-installed with R and contains various ratings for 43 US-American Superior Court judges, as assigned by attorneys (Hartigan 1977).

With ?USJudgeRatings you can learn more about the data set.

The exact source of the data is difficult to trace, so you should not over-interpret the results. That being said, the data set is quite useful for practice purposes and can give you a taste of what ‘real’ data is like.
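For a quick first orientation before the exercises, a minimal sketch:

dim(USJudgeRatings)   # 43 rows (judges) and 12 columns (rating variables)
head(USJudgeRatings)  # the first few rows of the data set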

Description of Variables Link to heading

  • CONT: Number of contacts of lawyer with judge
  • INTG: Judicial integrity
  • DMNR: Demeanor
  • DILG: Diligence
  • CFMG: Case flow managing
  • DECI: Prompt decisions
  • PREP: Preparation for trial
  • FAMI: Familiarity with law
  • ORAL: Sound oral rulings
  • WRIT: Sound written rulings
  • PHYS: Physical ability
  • RTEN: Worthy of retention

Practice Link to heading

  1. Display the whole data set with print(USJudgeRatings)!
  2. Display all variables with names(USJudgeRatings)!
  3. Apply summary() to individual variables of the dataset! Example: summary(USJudgeRatings$CONT).
  4. Visualize individual variables with hist(), plot(density()) and boxplot()!
  5. You can apply summary() and boxplot() to entire data sets to summarize them quickly. Try boxplot(USJudgeRatings, horizontal = TRUE, las = 1)!

Replication Details Link to heading

sessionInfo()
## R version 4.0.4 (2021-02-15)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Debian GNU/Linux 11 (bullseye)
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.13.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] knitr_1.39
## 
## loaded via a namespace (and not attached):
##  [1] bookdown_0.27   digest_0.6.29   R6_2.5.1        lifecycle_1.0.3
##  [5] jsonlite_1.8.0  magrittr_2.0.3  evaluate_0.15   highr_0.9      
##  [9] blogdown_1.10   stringi_1.7.8   cachem_1.0.6    rlang_1.1.1    
## [13] cli_3.6.1       jquerylib_0.1.4 bslib_0.4.0     vctrs_0.6.2    
## [17] rmarkdown_2.14  tools_4.0.4     stringr_1.5.0   glue_1.6.2     
## [21] xfun_0.31       yaml_2.3.5      fastmap_1.1.0   compiler_4.0.4 
## [25] htmltools_0.5.3 sass_0.4.2

  1. Statistics and computers come into play as soon as the scale of data cannot be processed by humans unaided. Whenever the scale of data cannot be easily processed by standard computers, we call it ‘big data’. ↩︎

  2. I have excluded the mode from this tutorial to focus on the more practically important measures of central tendency. ↩︎

  3. The Flynn Effect regularly moves the center of the distribution, which is why tests need to be recalibrated every so often. I believe this affects the position of the mean and not the shape of the distribution. ↩︎