Neat and tidy data exploration with list-columns and purrr
Peter Kamerman
19 June 2019I often find myself having to generate a multitude of exploratory plots and tables when I first lay my hands on a new dataset. An example would be having to explore the data of every indiviudual or animal, looking for any ‘funny numbers’, before proceeding with the anaylsis. This post describes one of the processes I sometimes go through to explore data using the purrr package to work with list-columns within nested dataframes.
Get some data
To start we need some data. I will be using carrots data from the carrots data package (devtools::install_github("kamermanpr/carrots")
). The carrots package contains data on, you guessed it, carrots. The dataset, which is a pared-down version of a larger carrots dataset from the lmerTest package, is briefly described in the figure and printouts below.
# Install package if not already installed
if('carrots' %in% installed.packages()[, "Package"] == FALSE){
::install_github("kamermanpr/carrots")
devtools
}
# Load the package
library(carrots)
# Load the data
data(carrots)
# Quick peek at the data
## Dimensions
dim(carrots)
## [1] 618 11
## Structure
str(carrots)
## tibble [618 × 11] (S3: tbl_df/tbl/data.frame)
## $ Product : Factor w/ 6 levels "Bolero","Major",..: 1 2 3 4 5 6 1 2 3 4 ...
## $ Consumer : Factor w/ 103 levels "168","169","171",..: 1 1 1 1 1 1 2 2 2 2 ...
## $ Frequency : Ord.factor w/ 5 levels "Once a week or more"<..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Sex : Factor w/ 2 levels "Male","Female": 2 2 2 2 2 2 1 1 1 1 ...
## $ Age_group : Ord.factor w/ 4 levels "< 25 years"<"26-40 years"<..: 4 4 4 4 4 4 4 4 4 4 ...
## $ Employment_status: Ord.factor w/ 7 levels "Unskilled labourer"<..: 7 7 7 7 7 7 7 7 7 7 ...
## $ Income_group : Ord.factor w/ 4 levels "Low-income"<"Lower middle-income"<..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Preference : int [1:618] 4 4 5 5 5 4 6 5 5 5 ...
## $ Sweetness : int [1:618] 2 2 4 3 4 2 5 2 5 2 ...
## $ Bitterness : int [1:618] 4 2 2 2 1 3 1 2 2 2 ...
## $ Crispness : int [1:618] 2 3 4 5 5 3 5 3 2 3 ...
## Summary
::skim(carrots) skimr
Name | carrots |
Number of rows | 618 |
Number of columns | 11 |
_______________________ | |
Column type frequency: | |
factor | 7 |
numeric | 4 |
________________________ | |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
Product | 0 | 1.00 | FALSE | 6 | Bol: 103, Maj: 103, Nav: 103, Nel: 103 |
Consumer | 0 | 1.00 | FALSE | 103 | 168: 6, 169: 6, 171: 6, 172: 6 |
Frequency | 0 | 1.00 | TRUE | 5 | Onc: 432, Onc: 102, Onc: 54, Les: 18 |
Sex | 0 | 1.00 | FALSE | 2 | Fem: 426, Mal: 192 |
Age_group | 0 | 1.00 | TRUE | 4 | 41-: 300, > 6: 156, 26-: 96, < 2: 66 |
Employment_status | 0 | 1.00 | TRUE | 7 | Ret: 162, Off: 150, Ski: 90, Uns: 66 |
Income_group | 36 | 0.94 | TRUE | 4 | Upp: 240, Low: 222, Low: 84, Upp: 36 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
Preference | 2 | 1 | 4.58 | 1.16 | 1 | 4 | 4 | 5 | 7 | ▁▂▇▆▅ |
Sweetness | 2 | 1 | 3.03 | 1.30 | 1 | 2 | 3 | 4 | 7 | ▇▅▃▂▁ |
Bitterness | 0 | 1 | 2.28 | 1.20 | 1 | 1 | 2 | 3 | 7 | ▇▂▁▁▁ |
Crispness | 3 | 1 | 3.15 | 1.33 | 1 | 2 | 3 | 4 | 7 | ▇▆▃▃▁ |
For this demonstration I only need some of the data, so I am going to select the columns that I want (Product, Age_group, Preference, Crispness, Sweetness, Bitterness), and filter the carrot types (Product) so that I only have Bolero and Yukon carrots.
# Load required package
library(dplyr)
# Select columns and filter data
<- carrots %>%
clean_carrots select(Product, Age_group, Sex,
Preference, Crispness, %>%
Sweetness, Bitterness) filter(Product %in% c('Bolero', 'Yukon'))
Crispness, sweetness, and bitterness are all attributes of the carrots and to my mind they should fall under a single attribute column. I can fix this problem using the pivot_longer
function from the tidyr package.
# Load required packages
library(tidyr)
# Convert clean_carrots to long format
<- clean_carrots %>%
long_carrots pivot_longer(cols = c(Crispness, Sweetness, Bitterness),
names_to = 'Attributes',
values_to = 'Ratings')
Take a look at the new dataset.
print(long_carrots)
## # A tibble: 618 x 6
## Product Age_group Sex Preference Attributes Ratings
## <fct> <ord> <fct> <int> <chr> <int>
## 1 Bolero > 61 years Female 4 Crispness 2
## 2 Bolero > 61 years Female 4 Sweetness 2
## 3 Bolero > 61 years Female 4 Bitterness 4
## 4 Yukon > 61 years Female 4 Crispness 3
## 5 Yukon > 61 years Female 4 Sweetness 2
## 6 Yukon > 61 years Female 4 Bitterness 3
## 7 Bolero > 61 years Male 6 Crispness 5
## 8 Bolero > 61 years Male 6 Sweetness 5
## 9 Bolero > 61 years Male 6 Bitterness 1
## 10 Yukon > 61 years Male 6 Crispness 3
## # … with 608 more rows
Working with list-columns
I think list-columns are great for rapidly analysing data across one or more grouping variables, while keeping intermediate and final outputs from an analysis nicely contained within a single dataframe.
For those of you unfamiliar with list-columns, they are columns in a dataframe that contain more than one piece of data per row within a column. Instead, each row of the column contain a list of two or more pieces of data. The figure below will hopefully give you some insight into what I mean by this description. Briefly, in the figure we have a standard dataframe with columns for brand of tomato sauce, country, and cost in each country. If we use brand of tomato sauce as a grouping variable, we can cram all the related data into one cell per row (note: it doesn’t compress the data), replacing the previous two columns with a new list-column. That is, a column consisting a series of lists.
The ultimate goal of this exploratory analysis is to look at the relationship between consumer carrot preference and carrot attributes (crispness, sweetness, and bitterness) for each carrot type according to age group. So, the first thing I need to do is to create a list-column of all the pieces of data with carrot type as a grouping variable.
Creating list-columns is simple with the help of the group_by
function from the dpylr package, and the nest
function from the tidyr package.
# Create a dataframe with a list-column called 'data'
<- long_carrots %>%
diced_carrots # First group the data by carrot type
group_by(Product) %>%
# Then 'nest' the remaining data into a list-column
nest()
We now have a new dataframe with two columns, ‘Product’ and ‘data’. ‘Product’ contains the grouping variable (carrot type), while ‘data’ contain the rest of the data as a list column.
If we compare the original dataframe with the nested dataframe they look like this:
Original dataframe
print(long_carrots)
## # A tibble: 618 x 6
## Product Age_group Sex Preference Attributes Ratings
## <fct> <ord> <fct> <int> <chr> <int>
## 1 Bolero > 61 years Female 4 Crispness 2
## 2 Bolero > 61 years Female 4 Sweetness 2
## 3 Bolero > 61 years Female 4 Bitterness 4
## 4 Yukon > 61 years Female 4 Crispness 3
## 5 Yukon > 61 years Female 4 Sweetness 2
## 6 Yukon > 61 years Female 4 Bitterness 3
## 7 Bolero > 61 years Male 6 Crispness 5
## 8 Bolero > 61 years Male 6 Sweetness 5
## 9 Bolero > 61 years Male 6 Bitterness 1
## 10 Yukon > 61 years Male 6 Crispness 3
## # … with 608 more rows
Nested dataframe
print(diced_carrots)
## # A tibble: 2 x 2
## # Groups: Product [2]
## Product data
## <fct> <list>
## 1 Bolero <tibble [309 × 5]>
## 2 Yukon <tibble [309 × 5]>
If we look inside the two rows of the list-column, we find the data associated with each of the carrot types.
print(diced_carrots$data)
## [[1]]
## # A tibble: 309 x 5
## Age_group Sex Preference Attributes Ratings
## <ord> <fct> <int> <chr> <int>
## 1 > 61 years Female 4 Crispness 2
## 2 > 61 years Female 4 Sweetness 2
## 3 > 61 years Female 4 Bitterness 4
## 4 > 61 years Male 6 Crispness 5
## 5 > 61 years Male 6 Sweetness 5
## 6 > 61 years Male 6 Bitterness 1
## 7 26-40 years Male 4 Crispness 6
## 8 26-40 years Male 4 Sweetness 3
## 9 26-40 years Male 4 Bitterness 3
## 10 41-60 years Male 5 Crispness 4
## # … with 299 more rows
##
## [[2]]
## # A tibble: 309 x 5
## Age_group Sex Preference Attributes Ratings
## <ord> <fct> <int> <chr> <int>
## 1 > 61 years Female 4 Crispness 3
## 2 > 61 years Female 4 Sweetness 2
## 3 > 61 years Female 4 Bitterness 3
## 4 > 61 years Male 6 Crispness 3
## 5 > 61 years Male 6 Sweetness 5
## 6 > 61 years Male 6 Bitterness 1
## 7 26-40 years Male 5 Crispness 5
## 8 26-40 years Male 5 Sweetness 4
## 9 26-40 years Male 5 Bitterness 3
## 10 41-60 years Male 4 Crispness 3
## # … with 299 more rows
Working with list-columns
Because the ‘data’ column is a list of lists, you can iterate functions over the elements of the column using the base R lapply
function. However, I find the best way to work with dataframes that contain list-columns is to use the various map
functions "(map
, map2
, pmap
being the most basic varieties) from the purrr package to iterate over the list-column rows. When combining one of the various map
functions with the mutate
function from the dplyr package, you can take the output produced by applying a function over list-column(s) and place it into a new column in the same dataframe (see the figure below). This keeps all you analysis inputs and outputs together in one place.
First I will demonstrate a basic example of generating plots using map
.
The map
function takes one list (in this case a list-column from the dataframe) as an input and iterates a function over the data. As I’ve already said, when used within a dplyr mutate
function, the output from the map
function is placed in a new column in the dataframe (as shown in the figure below).
# Load required packages
library(purrr)
library(ggplot2)
# Make a basic plot
<- diced_carrots %>%
diced_carrots # Add a new column containing ggplot objects
mutate(basic_plot = map(.x = data, # Tell map to use the data found within the 'data' column
~ ggplot(data = .x) + # Create a user-defined plotting function using the tilde
aes(x = Ratings,
y = Preference) +
geom_point(position = position_jitter(width = 0.2,
height = 0.2)) +
facet_wrap(~Attributes, nrow = 3) +
theme_bw(base_size = 14)))
If you look at the resulting dataframe, it now has three columns, the grouping column (Product), the list-column (data), and a new list-column (basic_plot). This new column contains ggplot plotting objects, one for each carrot type.
# Print the updated dataframe
print(diced_carrots)
## # A tibble: 2 x 3
## # Groups: Product [2]
## Product data basic_plot
## <fct> <list> <list>
## 1 Bolero <tibble [309 × 5]> <gg>
## 2 Yukon <tibble [309 × 5]> <gg>
Having the ggplot objects in the dataframe is a bit useless unless theere is a way to print the plots. There are three functions that you can use to extract the plots from the list-column, the lapply
function from base R, the map
function from purrr, or the walk
function from purrr. My personal choice is the walk
function, which works exactly the same way map
does, but it returns the input (.x) invisibly, thus making a cleaner output for RMarkdown documents.
You will notice in the code below that for all three functions, the input is the column you want to apply the print
function to, namely, diced_carrots$basic_plot.
#-- Using lapply --#
# lapply(X = diced_carrots$basic_plot, FUN = function(x){print(x)})
#-- Using map --#
# map(.x = diced_carrots$basic_plot, ~ print(.x))
#-- Using walk --#
walk(.x = diced_carrots$basic_plot, ~ print(.x))
The problem with these figures is that you have no way of identifying which figure belongs to the the Bolero carrot, and which to the Yukon carrot. This is where we can use the map2
function from purrr. The map2
function works the same way as map
, but can take two data inputs. And, like the map
function, if we wrap the function with a mutate
function, we can produce a new column in the dataframe (see the figure below).
Our goal from with map2
is to recapitulate the plots we had before, but with the carrot type as a plot title. We can do that by using the data from the ‘data’ list-column to draw the plot (as we did before), but then also using the data in the ‘Product’ column to label the plots.
# Make a labeled plot
<- diced_carrots %>%
diced_carrots # Add a new column containing ggplot objects
mutate(labeled_plot = map2(.x = data, # Tell map2 where to find the plotting data
.y = Product, # Tell map2 where to find the label data
~ ggplot(data = .x) + # Tell ggplot to use .x to draw the plots
aes(x = Ratings,
y = Preference) +
geom_point(position = position_jitter(width = 0.2,
height = 0.2)) +
labs(title = .y) + # Tell ggplot to use .y to add labels
facet_wrap(~Attributes, nrow = 3) +
theme_bw(base_size = 14)))
# Look at the resulting dataframe
print(diced_carrots)
## # A tibble: 2 x 4
## # Groups: Product [2]
## Product data basic_plot labeled_plot
## <fct> <list> <list> <list>
## 1 Bolero <tibble [309 × 5]> <gg> <gg>
## 2 Yukon <tibble [309 × 5]> <gg> <gg>
The new column ‘labeled_plot’ is a list-column of ggplot objects, which we can then print using the walk
function.
# Print the new figures
walk(.x = diced_carrots$labeled_plot, ~print(.x))
We now have labeled figures, which makes identification easy.
To use more than two sources of data, you use pmap
(see figure below).
There is a slight difference between map/map2
, andpmap
in terms of identifying the sources of data. Instead of using .x and .y to identify the data sources, you list the data sources using .l. For example:
# A list
pmap(.l = list(data_1, data_2, data_3, ...), ...)
# Or a named list
pmap(.l = list(foo = data_1, bar = data_2, baz = data_3, ...), ...)
When calling the data within your function, you can either use the positions in the list using ..1, ..2, ..3, etc for the first, second, third, etc items in the list. Or, if you have a named list, you can use call the data using the names. In the example above, that would be, foo, bar, baz, etc.
I will use pmap
to add a line to my plots showing what the median preference rating is, but I want to do this procedure for a slightly more complex plot that the ones I have used before, which were only faceted by carrot taste attribute. Now I want to faceting the figure by taste attribute and age group to see whether there is a change in the relationship between carrot taste attributes and carrot preference as one gets older.
I would usually use the stat_summary
function in ggplot to add a summary measure to a plot, but it is my experience in ggplot that geom_hline
does not play nicely within stat_summary
, and so the first step in adding the median rating to each plot facet is to create a new dataframe with median preference value for each carrot type according to age group and taste attribute. This can be done using the map
function, along with the group_by
and summarise
functions from dplyr.
# Calculate the median preference rating for each carrot type according
# to and age group and taste attribute
<- diced_carrots %>%
diced_carrots mutate(median_pref = map(.x = data, # Tell map where to find the data to summarise
~ .x %>%
group_by(Age_group, Attributes) %>%
summarise(med_ian = round(median(Preference,
na.rm = TRUE)))))
# Print the updated dataframe
print(diced_carrots)
## # A tibble: 2 x 5
## # Groups: Product [2]
## Product data basic_plot labeled_plot median_pref
## <fct> <list> <list> <list> <list>
## 1 Bolero <tibble [309 × 5]> <gg> <gg> <grouped_df [12 × 3]>
## 2 Yukon <tibble [309 × 5]> <gg> <gg> <grouped_df [12 × 3]>
The new column is a list-column of dataframes (tibbles), and we can look at the contents of the new column using walk
or walk2
. walk2
behaves the same way as walk
but takes two inputs (like its sibling map2
), and allows one to produce pretty, labeled tables using the kable
function from the knitr package (if you are knitting an Rmarkdown document, remember to put results = 'asis'
in the chunk options when printing kable
tables).
# Print the dataframes
walk2(.x = diced_carrots$median_pref, # Tell walk2 where to find the dataframes to print as tables
.y = diced_carrots$Product, # Tell walk2 where to find the labels for table captions
~ print(knitr::kable(x = .x,
caption = .y)))
Age_group | Attributes | med_ian |
---|---|---|
< 25 years | Bitterness | 5 |
< 25 years | Crispness | 5 |
< 25 years | Sweetness | 5 |
26-40 years | Bitterness | 4 |
26-40 years | Crispness | 4 |
26-40 years | Sweetness | 4 |
41-60 years | Bitterness | 5 |
41-60 years | Crispness | 5 |
41-60 years | Sweetness | 5 |
> 61 years | Bitterness | 5 |
> 61 years | Crispness | 5 |
> 61 years | Sweetness | 5 |
Age_group | Attributes | med_ian |
---|---|---|
< 25 years | Bitterness | 4 |
< 25 years | Crispness | 4 |
< 25 years | Sweetness | 4 |
26-40 years | Bitterness | 4 |
26-40 years | Crispness | 4 |
26-40 years | Sweetness | 4 |
41-60 years | Bitterness | 4 |
41-60 years | Crispness | 4 |
41-60 years | Sweetness | 4 |
> 61 years | Bitterness | 5 |
> 61 years | Crispness | 5 |
> 61 years | Sweetness | 5 |
Now that I have the summarised data, I can create the new plots using pmap
to pull in data from three (or more if required) sources. One source for the basic plots, one source for adding the median preference data, and one source to label the plots.
<- diced_carrots %>%
diced_carrots # Add a new column containing ggplot objects
mutate(labeled_plot2 = pmap(.l = list(data, median_pref, Product), # Tell pmap where to find the data
~ ggplot(data = ..1) + # Tell ggplot to use ..1 to draw the basic plots
aes(x = Ratings,
y = Preference) +
geom_point(position = position_jitter(width = 0.2,
height = 0.2)) +
geom_hline(data = ..2, # tell ggplot to use ..2 to draw the median values
aes(yintercept = med_ian),
colour = 'red') +
labs(title = ..3) + # Tell ggplot to use ..3 to add labels
facet_grid(Attributes ~ Age_group) +
theme_bw(base_size = 14)))
# Look at the resulting dataframe
print(diced_carrots)
## # A tibble: 2 x 6
## # Groups: Product [2]
## Product data basic_plot labeled_plot median_pref labeled_plot2
## <fct> <list> <list> <list> <list> <list>
## 1 Bolero <tibble [309 … <gg> <gg> <grouped_df [12 … <gg>
## 2 Yukon <tibble [309 … <gg> <gg> <grouped_df [12 … <gg>
# Print the figures
walk(.x = diced_carrots$labeled_plot2, ~print(.x))
And there you have it, a neat and tidy way of generating exploratory plots and tables. I’ve used this process on some fairly large datasets and it seems to scale nicely.
For more cool R content, visit: R Weekly
Session information
## R version 4.0.4 (2021-02-15)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.7
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
##
## Random number generation:
## RNG: Mersenne-Twister
## Normal: Inversion
## Sample: Rounding
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] grid stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] carrots_0.1.0 geofacet_0.2.0 survey_4.0
## [4] survival_3.2-9 Matrix_1.3-2 sf_0.9-7
## [7] fiftystater_1.0.1 skimr_2.1.3 magrittr_2.0.1
## [10] geojsonio_0.9.4 lubridate_1.7.10 xml2_1.3.2
## [13] sp_1.4-5 leaflet_2.0.4.1 highcharter_0.8.2
## [16] flexdashboard_0.5.2 gganimate_1.0.7 gdtools_0.2.3
## [19] ggiraph_0.7.8 svglite_2.0.0 pander_0.6.3
## [22] knitr_1.31 forcats_0.5.1 stringr_1.4.0
## [25] dplyr_1.0.5 purrr_0.3.4 readr_1.4.0
## [28] tidyr_1.1.3 tibble_3.1.0 ggplot2_3.3.3
## [31] tidyverse_1.3.0
##
## loaded via a namespace (and not attached):
## [1] colorspace_2.0-0 ellipsis_0.3.1 class_7.3-18
## [4] rgdal_1.5-23 base64enc_0.1-3 fs_1.5.0
## [7] rstudioapi_0.13 httpcode_0.3.0 farver_2.1.0
## [10] ggrepel_0.9.1 fansi_0.4.2 splines_4.0.4
## [13] rlist_0.4.6.1 jsonlite_1.7.2 broom_0.7.5
## [16] dbplyr_2.1.0 png_0.1-7 rgeos_0.5-5
## [19] compiler_4.0.4 httr_1.4.2 backports_1.2.1
## [22] assertthat_0.2.1 lazyeval_0.2.2 cli_2.3.1
## [25] tweenr_1.0.1 htmltools_0.5.1.1 prettyunits_1.1.1
## [28] tools_4.0.4 igraph_1.2.6 gtable_0.3.0
## [31] glue_1.4.2 geojson_0.3.4 V8_3.4.0
## [34] Rcpp_1.0.6 imguR_1.0.3 cellranger_1.1.0
## [37] jquerylib_0.1.3 vctrs_0.3.6 crul_1.1.0
## [40] crosstalk_1.1.1 xfun_0.22 rvest_1.0.0
## [43] lifecycle_1.0.0 jqr_1.2.0 zoo_1.8-9
## [46] scales_1.1.1 hms_1.0.0 yaml_2.2.1
## [49] quantmod_0.4.18 curl_4.3 gridExtra_2.3
## [52] sass_0.3.1 stringi_1.5.3 highr_0.8
## [55] maptools_1.0-2 e1071_1.7-4 TTR_0.24.2
## [58] boot_1.3-27 repr_1.1.3 rlang_0.4.10
## [61] pkgconfig_2.0.3 systemfonts_1.0.1 geogrid_0.1.1
## [64] evaluate_0.14 lattice_0.20-41 htmlwidgets_1.5.3
## [67] labeling_0.4.2 tidyselect_1.1.0 geojsonsf_2.0.1
## [70] R6_2.5.0 magick_2.7.0 generics_0.1.0
## [73] DBI_1.1.1 pillar_1.5.1 haven_2.3.1
## [76] foreign_0.8-81 withr_2.4.1 units_0.7-0
## [79] xts_0.12.1 modelr_0.1.8 crayon_1.4.1
## [82] uuid_0.1-4 KernSmooth_2.23-18 utf8_1.2.1
## [85] rmarkdown_2.7 jpeg_0.1-8.1 progress_1.2.2
## [88] rnaturalearth_0.1.0 readxl_1.3.1 data.table_1.14.0
## [91] reprex_1.0.0 digest_0.6.27 classInt_0.4-3
## [94] webshot_0.5.2 munsell_0.5.0 viridisLite_0.3.0
## [97] kableExtra_1.3.4 bslib_0.2.4 mitools_2.4