This workshop covers the basics of generating and visualizing statistical distributions in R using the tidyverse
package. We will create three different distributions (normal, uniform, and exponential) and visualize them using kernel density plots.
Quick Overview
Loading Package: Let’s load the tidyverse package, a collection of R packages for data science, including data manipulation and visualization tools.
# Check if tidyverse is installed and install it if not
if (!requireNamespace("tidyverse", quietly = TRUE)) {
install.packages("tidyverse")
}
# Load the tidyverse package
library(tidyverse)
Warning message:
“package ‘dplyr’ was built under R version 4.3.2”
Warning message:
“package ‘stringr’ was built under R version 4.3.2”
── [1mAttaching core tidyverse packages[22m ──────────────────────── tidyverse 2.0.0 ──
[32m✔[39m [34mdplyr [39m 1.1.4 [32m✔[39m [34mreadr [39m 2.1.4
[32m✔[39m [34mforcats [39m 1.0.0 [32m✔[39m [34mstringr [39m 1.5.1
[32m✔[39m [34mggplot2 [39m 3.4.4 [32m✔[39m [34mtibble [39m 3.2.1
[32m✔[39m [34mlubridate[39m 1.9.3 [32m✔[39m [34mtidyr [39m 1.3.0
[32m✔[39m [34mpurrr [39m 1.0.2
── [1mConflicts[22m ────────────────────────────────────────── tidyverse_conflicts() ──
[31m✖[39m [34mdplyr[39m::[32mfilter()[39m masks [34mstats[39m::filter()
[31m✖[39m [34mdplyr[39m::[32mlag()[39m masks [34mstats[39m::lag()
[36mℹ[39m Use the conflicted package ([3m[34m<http://conflicted.r-lib.org/>[39m[23m) to force all conflicts to become errors
Setting a Seed: Ensures reproducibility of results. The set.seed(123) function sets the random number generator to a fixed sequence.
# Set seed for reproducibility
set.seed(123)
Defining Population Size: Sets the number of data points for each distribution (population_size = 100,000).
# Define the population size
population_size <- 100000
Generating Populations: Creates three different distributions: normal, uniform, and exponential, each with 100,000 data points.
# Generate populations
populations <- tibble(
normal = rnorm(population_size, mean = 50, sd = 10),
uniform = runif(population_size, min = 0, max = 100),
exponential = rexp(population_size, rate = 0.02)
)
Now let’s look at the first 5 rows of the tibble.
head(populations)
normal | uniform | exponential |
---|---|---|
<dbl> | <dbl> | <dbl> |
44.39524 | 60.44927 | 29.292887 |
47.69823 | 51.97372 | 74.372971 |
65.58708 | 96.64309 | 1.807305 |
50.70508 | 80.39748 | 10.901159 |
51.29288 | 47.63254 | 1.336612 |
67.15065 | 89.03851 | 47.032041 |
Reshaping Data: Transforms the data into a long format, with a new column for distribution types and their values, suitable for ggplot.
# Reshape for plotting
populations_long <- populations %>%
pivot_longer(cols = everything(), names_to = "distribution", values_to = "value")
head(populations_long)
distribution | value |
---|---|
<chr> | <dbl> |
normal | 44.39524 |
uniform | 60.44927 |
exponential | 29.29289 |
normal | 47.69823 |
uniform | 51.97372 |
exponential | 74.37297 |
Plotting Distributions: Uses ggplot to create kernel density plots, with different facets for each distribution.
# Plot the populations using kernel density plots
ggplot(populations_long, aes(x = value, fill = distribution)) +
geom_density() +
facet_wrap(~ distribution, scales = "free_x") +
labs(title = "Population Distributions", x = "Value", y = "Density") +
theme_minimal()
Defining a Custom Function: sample_means is a user-defined function that takes two arguments: a population (a vector of data) and a sample_size (an integer).
# Function to take samples and compute means
sample_means <- function(population, sample_size) {
replicate(1000, mean(sample(population, sample_size)))
}
Setting up sample sizes : This following creates a vector sample_sizes containing different sizes. These sizes represent the number of data points in each sample that will be drawn from the populations.
# Sample sizes for more intervals
sample_sizes <- c(5, 10, 20, 30, 40, 50, 100)
Replication for Robustness: Within the function, replicate(1000, …) is used to repeat a process 1000 times, enhancing the statistical robustness of the results. Sampling and Mean Calculation: In each repetition, the function randomly samples sample_size elements from population and calculates their mean. The result is an array of 1000 sample means.
# Generate sample means data
sample_means_data <- map_df(sample_sizes, function(size) {
map_df(populations, sample_means, sample_size = size)
}) %>%
pivot_longer(cols = everything(), names_to = "distribution", values_to = "value") %>%
mutate(size = factor(rep(sample_sizes, each = 3000), levels = sample_sizes))
map_df
is used to apply the sample_means
function across all values in sample_sizes
.sample_means
is further applied to each distribution in the populations
dataframe.pivot_longer
function converts the data into a long format, which is more suitable for visualization.mutate
function is used to add a new column named size
. This column categorizes each mean by its corresponding sample size.sample_means_data
.Plotting the Data:
# Plotting the distributions of sample means using kernel density plots
# Faceted by both distribution type and sample size
ggplot(sample_means_data, aes(x = value, fill = distribution)) +
geom_density(alpha = 0.6) +
facet_grid(distribution ~ size) +
labs(title = "Distribution of Sample Means by Distribution Type and Sample Size",
x = "Sample Mean",
y = "Density") +
theme_minimal() +
guides(fill = guide_legend(title = "Distribution Type"))
Tidyverse Package: Wickham, H. et al. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686
The R Project: R Core Team. The R Project for Statistical Computing. https://www.r-project.org/
R Manuals: R Core Team. An Introduction to R. https://cran.r-project.org/manuals.html
Understanding the Central Limit Theorem: Rice, J. A. (2007). Mathematical Statistics and Data Analysis (3rd ed.). Duxbury Press.