Sample Data Generation

Creating Sample Datasets

This guide explains how to create sample datasets in R and Python. Use these methods to generate a smaller version of your original dataset for data consultations, so that analysis can be demonstrated efficiently on a manageable subset of your data. We assume you already know how to read in your data; if you need step-by-step instructions, they are available further down the page for both R and Python.

R

Prerequisites

You will need dplyr installed. You can double check that you have it installed:

find.package("dplyr") # returns an error if the package is not installed, else returns the path to the package
## [1] "/Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library/dplyr"

And install it if necessary:

install.packages("dplyr") # REQUIRED

Sample at least 10 rows per variable, that is, 10 times the number of columns in your data. If you are concerned that this will be insufficient for demonstration purposes, sample up to 40% of your observations instead. Store the output as an R object.

10 samples per variable

  • Replace your_data_frame in the second line with the name you assigned to your data on import.
  • Replace "path/to/your/file.RData" in the last line with the path and file name to save your sampled data to.
library(dplyr)
df_to_sample <- your_data_frame
n_samples <- ncol(df_to_sample) * 10 # calculate the number of variables in your data frame and multiply by 10
sampled_data <- df_to_sample %>% slice_sample(n = n_samples) # take the sample; n must be passed by name
save(sampled_data, file = "path/to/your/file.RData") # choose a location to save your RData file with the .RData extension

Bring the resulting .RData file with you to your consultation.

40% of your observations

  • Replace your_data_frame in the second line with the name you assigned to your data on import.
  • Replace "path/to/your/file.RData" in the last line with the path and file name to save your sampled data to.
library(dplyr)
df_to_sample <- your_data_frame
n_samples <- round(nrow(df_to_sample) * 0.4) # calculate the number of observations in your data frame and multiply by 0.4
sampled_data <- df_to_sample %>% slice_sample(n = n_samples) # take the sample; n must be passed by name
save(sampled_data, file = "path/to/your/file.RData") # choose a location to save your RData file with the .RData extension

Bring the resulting .RData file with you to your consultation.

Python

Sample at least 10 rows per variable, that is, 10 times the number of columns in your data. If you are concerned that this will be insufficient for demonstration purposes, sample up to 40% of your observations instead. Store the output as a CSV file.
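
The sizing rule above can be sketched as a small function. This is only an illustration: `choose_sample_size` and its parameters are hypothetical names that are not part of pandas or of this guide, and the toy data frame stands in for your own data.

```python
import pandas as pd


def choose_sample_size(df, per_variable=10, max_fraction=0.4, use_fraction=False):
    """Return how many rows to sample (hypothetical helper, not a pandas function)."""
    n_rows, n_cols = df.shape
    if use_fraction:
        # 40% of the observations
        n = round(n_rows * max_fraction)
    else:
        # 10 rows per variable (column)
        n = n_cols * per_variable
    # never ask for more rows than the data frame holds
    return min(n, n_rows)


# Toy data frame: 100 rows, 3 variables
toy = pd.DataFrame({"a": range(100), "b": range(100), "c": range(100)})
n_per_variable = choose_sample_size(toy)                  # 3 variables * 10 rows
n_fraction = choose_sample_size(toy, use_fraction=True)   # 40% of 100 rows
```

Capping the result at the number of rows is a design choice made here so that a very wide, short data frame does not request more rows than exist.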

10 samples per variable

  • Replace your_data_frame in the second line with the name you assigned to your data on import.
  • Replace "path/to/your/file.csv" in the last line with the path and file name to save your sampled data to.
import pandas as pd
df_to_sample = your_data_frame
n_samples = min(len(df_to_sample.columns) * 10, len(df_to_sample)) # 10 rows per variable, capped at the number of rows, since sample() raises an error if n exceeds it
sampled_data = df_to_sample.sample(n = n_samples) # take the sample
sampled_data.to_csv("path/to/your/file.csv") # choose a location to save your csv file with a .csv extension

Bring the resulting .csv file with you to your consultation.
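
Before the consultation, it may be worth checking that the saved file reads back as expected. The sketch below assumes a toy data frame and a temporary directory in place of your own data and path; the file name `sampled_data.csv` is hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for your data: 50 rows, 2 variables
df_to_sample = pd.DataFrame({"x": range(50), "y": [v * 2 for v in range(50)]})
n_samples = min(len(df_to_sample.columns) * 10, len(df_to_sample))
sampled_data = df_to_sample.sample(n=n_samples)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "sampled_data.csv"  # hypothetical file name
    sampled_data.to_csv(csv_path)
    # to_csv writes the row index as the first column by default,
    # so read that column back as the index
    reloaded = pd.read_csv(csv_path, index_col=0)

same_shape = reloaded.shape == sampled_data.shape
print(same_shape)
```

Because the index is written to the file, the original row labels of the sampled rows survive the round trip, which can help trace a sampled row back to the full dataset.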

40% of your observations

  • Replace your_data_frame in the second line with the name you assigned to your data on import.
  • Replace "path/to/your/file.csv" in the last line with the path and file name to save your sampled data to.
import pandas as pd
df_to_sample = your_data_frame
n_samples = round(df_to_sample.shape[0] * 0.4) # calculate the number of observations in your data frame and multiply by 0.4
sampled_data = df_to_sample.sample(n = n_samples) # take the sample
sampled_data.to_csv("path/to/your/file.csv") # choose a location to save your csv file with a .csv extension

Bring the resulting .csv file with you to your consultation.
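
As an alternative sketch, pandas can take the fraction directly through the `frac` argument of `sample`, and `random_state` makes the draw repeatable. The toy data frame and the seed value 42 are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for your data: 25 rows, 2 variables
df_to_sample = pd.DataFrame({"id": range(25), "score": range(25, 50)})

# frac=0.4 samples 40% of the rows; a fixed random_state gives the same rows each run
sampled_data = df_to_sample.sample(frac=0.4, random_state=42)
repeat = df_to_sample.sample(frac=0.4, random_state=42)

print(len(sampled_data))            # number of sampled rows
print(sampled_data.equals(repeat))  # whether the two draws match
```

Fixing the seed is optional; it is useful when you want to regenerate the same sample later.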

Importing Data

Importing Data into R

Prerequisites

Make sure the package for your file type is installed: readr for CSV, readxl for Excel, or jsonlite for JSON. If it is not, you can install it using:

install.packages("readr") # if reading in csv or other delimited rectangular data
install.packages("readxl") # if reading in Excel files
install.packages("jsonlite") # if reading in JSON files
  • Import CSV file:
library(readr)
df_to_sample <- read_csv('path/to/your/file.csv')
  • Import Excel file:
library(readxl)
df_to_sample <- read_excel('path/to/your/file.xlsx')
  • Import JSON file:
library(jsonlite)
df_to_sample <- fromJSON('path/to/your/file.json')

Importing Data in Python

  • Import CSV file:
import pandas as pd
df_to_sample = pd.read_csv('path/to/your/file.csv')
  • Import Excel file:
import pandas as pd
df_to_sample = pd.read_excel('path/to/your/file.xlsx')
  • Import JSON file:
import pandas as pd
df_to_sample = pd.read_json('path/to/your/file.json')
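
The three import calls above could be wrapped in one function that picks the reader from the file extension. This is a minimal sketch: `load_table` is a hypothetical helper, not a pandas function, and the temporary CSV file exists only to make the example runnable.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_table(path):
    """Read a CSV, Excel, or JSON file based on its extension (hypothetical helper)."""
    suffix = Path(path).suffix.lower()
    if suffix == ".csv":
        return pd.read_csv(path)
    if suffix in {".xlsx", ".xls"}:
        # read_excel needs an Excel engine such as openpyxl to be installed
        return pd.read_excel(path)
    if suffix == ".json":
        return pd.read_json(path)
    raise ValueError(f"Unsupported file type: {suffix}")


# Write a tiny CSV to a temporary directory and load it
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "example.csv"  # hypothetical file name
    csv_path.write_text("a,b\n1,2\n3,4\n")
    df_to_sample = load_table(csv_path)

print(df_to_sample.shape)
```
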
This post is licensed under CC BY 4.0 by the author.