This week I’m not doing a data analysis project so much as a data cleaning project. One of the most common problems I come across in data cleaning is how to get summary statistics for various groups in the data. And it’s also one of the most annoying problems, because I invariably forget how to do it, and end up having to go back to old code to copy and paste.

## Data.table, base R, dplyr

At various times I have used the data.table package, base R “aggregate”, or the dplyr package and let me voice my favor for dplyr at the beginning. It’s very intuitive and works just as well as the other methods. Like the Dutch cleaning product brand HG, dplyr “doet wat het belooft” (Does what it promises). In this post I’m going to outline and evaluate each of the methods.

## The data

For this project I’m going to use the npk data from the R Datasets package. It’s a dataframe of 24 observations and 5 variables measuring the results of an experiment of how nitrogen (N), phosphate (P), and potassium (K) affect the crop yield of peas. This dataset is useful for today’s project because data is grouped into block (1-6), or we could aggregate by whether crops were exposed to nitrogen, phosphate and/or potassium. The yield of peas (outcome variable) is in units of pounds per plot, so more is better.

npk <- as.data.frame(npk)
str(npk)
## 'data.frame':    24 obs. of  5 variables:
##  $block: Factor w/ 6 levels "1","2","3","4",..: 1 1 1 1 2 2 2 2 3 3 ... ##$ N    : Factor w/ 2 levels "0","1": 1 2 1 2 2 2 1 1 1 2 ...
##  $P : Factor w/ 2 levels "0","1": 2 2 1 1 1 2 1 2 2 2 ... ##$ K    : Factor w/ 2 levels "0","1": 2 1 1 2 1 2 2 1 1 2 ...
##  $yield: num 49.5 62.8 46.8 57 59.8 58.5 55.5 56 62.8 55.8 ... I will now proceed to take the average yield per block and add that to the dataset as a column. ### “mean” First things first, if all we want to know if the mean yield for all pea crops, we can do this very easily with “mean” Mean <- mean(npk$yield, na.rm=T)
Mean
## [1] 54.875

But usually, this is not very useful because what you actually want to know is the crop yield per block.

### “data.table”

The very first way I learned to do this is somewhat cumbersome. It involved using the data.table package to create a data table of aggregated statistics, and then merging it (or not) with the original data.

# load the package
library(data.table)
##
## This data.table install has not detected OpenMP support. It will work but slower in single threaded mode.
# use data.table to transform the data frame to a data table. Careful, this command with rename your variables to V1, V2, ... Vn
npk_datatable <- data.table(npk$block, npk$yield)
# rename variables
npk_datatable$block <- npk_datatable$V1
npk_datatable$yield <- npk_datatable$V2
npk_datatable <- npk_datatable[,c("block","yield")]

# create an aggregated data set of the mean of "yield" per "block"
mean_datatable <- npk_datatable[, mean(yield, na.rm=T), by=block]
mean_datatable
##    block     V1
## 1:     1 54.025
## 2:     2 57.450
## 3:     3 60.775
## 4:     4 50.125
## 5:     5 50.525
## 6:     6 56.350

The variable is renamed from yield to V1. Aggravating, but there’s nothing to do but change the name:

mean_datatable$mean_yield <- mean_datatable$V1
(mean_datatable <- mean_datatable[, c("block","mean_yield")])
##    block mean_yield
## 1:     1     54.025
## 2:     2     57.450
## 3:     3     60.775
## 4:     4     50.125
## 5:     5     50.525
## 6:     6     56.350

### “aggregate”: Base R

You can do the exact same thing in base R using the aggregate command:

npk_aggregate <- aggregate(npk[, 5], list(npk$block), mean, na.rm=T) npk_aggregate ## Group.1 x ## 1 1 54.025 ## 2 2 57.450 ## 3 3 60.775 ## 4 4 50.125 ## 5 5 50.525 ## 6 6 56.350 Arg! I can feel my blood pressure rising every time R changes the names of my variables. Block is now Group.1 and mean_yield is x. Change names again: npk_aggregate$mean_yield <- npk_aggregate$x npk_aggregate$block <- npk_aggregate\$Group.1
(npk_aggregate <- npk_aggregate[, c("block","mean_yield")])
##   block mean_yield
## 1     1     54.025
## 2     2     57.450
## 3     3     60.775
## 4     4     50.125
## 5     5     50.525
## 6     6     56.350

These methods both work, but trying to remember them is about as Aggravating as trying to teach my grandma how to double click. Which is to say: very. Enter: dplyr and tidyverse to save the day!

### “dplyr”

I started out writing this post not entirely convinced that dplyr is actually better than aggregate. But in the process of writing, I’m convinced. The syntax is just so readable and seems to make intuitive sense. Reproducing it for this post was super easy.

library(dplyr)
npk_dplyr <- npk %>%
group_by(block) %>%
summarize(mean(yield, na.rm=T))
npk_dplyr
## # A tibble: 6 x 2
##    block mean(yield, na.rm = T)
##   <fctr>                    <dbl>
## 1      1                   54.025
## 2      2                   57.450
## 3      3                   60.775
## 4      4                   50.125
## 5      5                   50.525
## 6      6                   56.350

### “merge”

With all of these methods, if we want to append the aggregated information to the original data, we can do so using “merge”. For example:

npk_datatable <- merge(npk, mean_datatable, intersect(names(npk), names(mean_datatable)))
head(npk_datatable)
##   block N P K yield mean_yield
## 1     1 0 1 1  49.5     54.025
## 2     1 1 1 0  62.8     54.025
## 3     1 0 0 0  46.8     54.025
## 4     1 1 0 1  57.0     54.025
## 5     2 1 0 0  59.8     57.450
## 6     2 1 1 1  58.5     57.450

And that’s it! A brief illustration of the differences between data.table, base R, and dplyr for summarizing aggregated data.

This blog post can be found on GitHub.