How to use the group_by function with your ecological data

How to use group_by() with other dplyr functions for ecological data wrangling like a pro

Last updated on Feb 23, 2022 9 min read R

In scientific data and experiments, we often have groups of subjects between which we want to compare an observed response. For example, we might want to compare the growth rates of plants under different light treatments. Or maybe we want to compare CO₂ emissions of different countries over time. Each of these scenarios requires you to group your data by a certain variable before you can compare a statistic such as the mean, minimum, or maximum.

In this tutorial, I’m going to discuss how to use a handy function called group_by(), which allows you to do what I just described.

[Image: pine trees of different ages, with an arrow showing that group_by() grouped them by age]

group_by() is part of the dplyr package, so we’ll load that up first. Remember that if you haven’t used or installed the package before, you need to run install.packages("dplyr") before loading it in your script. Let’s also load up a data set that comes with R, called Loblolly.

# Load package
library(dplyr)

# Load data
data(Loblolly)

# View data
head(Loblolly)

##    height age Seed
## 1    4.51   3  301
## 15  10.89   5  301
## 29  28.72  10  301
## 43  41.74  15  301
## 57  52.70  20  301
## 71  60.92  25  301

Loblolly describes the height of Loblolly pine trees at different ages. The “height” column is given in feet, “age” is given in years, and “Seed” (note the capital S) records the seed source of each tree, which in this data set works as a unique identifier for each tree.

How to use group_by() and summarise()

Let’s say we want to see the average height of loblolly pine trees within each of the age groups. To do that, we need to group our data by the variable “age”. We use the group_by() function like this: group_by(data, column).

# Group the Loblolly data by tree age
group_by(Loblolly, age)

## # A tibble: 84 × 3
## # Groups:   age [6]
##    height   age Seed 
##     <dbl> <dbl> <ord>
##  1   4.51     3 301  
##  2  10.9      5 301  
##  3  28.7     10 301  
##  4  41.7     15 301  
##  5  52.7     20 301  
##  6  60.9     25 301  
##  7   4.55     3 303  
##  8  10.9      5 303  
##  9  29.1     10 303  
## 10  42.8     15 303  
## # … with 74 more rows

When we do this, our data look almost the same: the only visible difference is the Groups: age [6] label at the top of the table. But behind the scenes, R makes note of how we want to group our data and returns a table that is grouped accordingly. Now that the data are grouped, we can calculate summary statistics within each group using the function summarise(), also spelled summarize() (the two do exactly the same thing; summarise() is the British spelling and summarize() is the American one).
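If you'd rather not rely on the printed label, dplyr has a few helper functions for checking the grouping directly. Here is a minimal sketch; the object name grouped is just something I made up for this example.

```r
library(dplyr)

# Store the grouped table so we can inspect it
grouped <- group_by(Loblolly, age)

is_grouped_df(grouped)  # is the table grouped at all?
group_vars(grouped)     # which column(s) is it grouped by?
n_groups(grouped)       # how many groups are there?
```

With the unmodified Loblolly data, I'd expect these to report a grouped table, grouped by age, with 6 groups, matching the Groups: age [6] label.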

summarise() can be used like so: summarise(data, new_column_name = function(column_to_evaluate)).

So if we wanted to summarize mean heights of trees, it would look like summarise(Loblolly, avgheight = mean(height)).

# Group the Loblolly data by tree age and then summarize the mean, min, and max heights in each group
group_by(Loblolly, age) %>%
  summarise(avgheight = mean(height), 
            minheight = min(height),
            maxheight = max(height))

## # A tibble: 6 × 4
##     age avgheight minheight maxheight
##   <dbl>     <dbl>     <dbl>     <dbl>
## 1     3      4.24      3.46      4.81
## 2     5     10.2       9.03     11.4 
## 3    10     27.4      25.4      30.2 
## 4    15     40.5      37.8      44.4 
## 5    20     51.5      48.3      55.8 
## 6    25     60.3      56.4      64.1

In essence, summarise() produces a new table that contains a column for your group, and then new columns of summary statistics that you define. In the code above, I asked summarise() to create new columns called “avgheight” for the mean height of trees in each age group, “minheight” for the minimum, and “maxheight” for the maximum. After we summarize our data, dplyr also drops the grouping we used: summarise() removes the last grouping variable, so with a single grouping variable like age, the output comes back ungrouped.
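To make the effect of grouping on summarise() more concrete, here is a small sketch comparing an ungrouped and a grouped summary. The object names overall and by_age, and the extra column n_trees, are my own additions for illustration.

```r
library(dplyr)

# Without grouping, summarise() collapses the whole data set into a single row
overall <- summarise(Loblolly, avgheight = mean(height))

# With grouping, we get one row per age class; n() counts the rows in each group
by_age <- group_by(Loblolly, age) %>%
  summarise(avgheight = mean(height),
            n_trees = n())

nrow(overall)       # number of rows in the ungrouped summary
nrow(by_age)        # number of rows in the grouped summary
group_vars(by_age)  # grouping left on the result (empty if none)
```

With the unmodified data, the ungrouped summary should have 1 row and the grouped summary 6 rows (one per age class), with no grouping variables left on the result.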

You might be wondering about this guy %>% in the code above. This operator is called a pipe. It comes from the magrittr package and is made available when you load dplyr. Importantly, this pipe doesn’t come with base R (R 4.1 and later has its own native pipe, |>, but that is a separate operator). For now, what you need to know about pipes is that they feed the output of one statement into the input of another. In the code above, the grouped table that came out of group_by() was passed in as the first (data) argument of summarise(), so there was no need for me to name the data again inside summarise(). In plain English, I asked the code to “group the Loblolly data by tree age, and then (pipe!) summarize those groups using their mean, min, and max”.

Pipes can make your code a lot cleaner, especially if you’re performing several operations on one data frame. Don’t worry, we have a more comprehensive tutorial post on pipes coming up soon.
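As a quick illustration of what the pipe is doing, here is a sketch that writes the same summary twice, once as a nested call and once with pipes. The object names nested and piped are made up for this example.

```r
library(dplyr)

# Nested version: you have to read it from the inside out
nested <- summarise(group_by(Loblolly, age), avgheight = mean(height))

# Piped version: reads from top to bottom, one step per line
piped <- Loblolly %>%
  group_by(age) %>%
  summarise(avgheight = mean(height))

# Both versions should produce the same average heights
identical(nested$avgheight, piped$avgheight)
```

The piped version is longer here, but it tends to stay readable as you add more steps, whereas the nested version gets harder to follow with every extra function.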

group_by() and other dplyr functions

We just went over the summarise() function, which is one of the most common dplyr functions to use with group_by(). But you could also use other dplyr functions such as mutate() and filter().

mutate()

For example, we could once again group our data by age, and then we could use mutate() to create a new column for mean height.

# Group the Loblolly data by age and create a new column for average height by age group
group_by(Loblolly, age) %>%
  mutate(age_avgheight = mean(height))

## # A tibble: 84 × 4
## # Groups:   age [6]
##    height   age Seed  age_avgheight
##     <dbl> <dbl> <ord>         <dbl>
##  1   4.51     3 301            4.24
##  2  10.9      5 301           10.2 
##  3  28.7     10 301           27.4 
##  4  41.7     15 301           40.5 
##  5  52.7     20 301           51.5 
##  6  60.9     25 301           60.3 
##  7   4.55     3 303            4.24
##  8  10.9      5 303           10.2 
##  9  29.1     10 303           27.4 
## 10  42.8     15 303           40.5 
## # … with 74 more rows

This calculated the same group means as summarise(), but instead of collapsing the data into one row per age group, mutate() kept all 84 rows and added the “age_avgheight” column alongside the original columns. You can see that for trees of the same age, the “age_avgheight” value is the same. This makes sense, since we grouped the data by age before taking the mean, and there should only be one mean height for each age group.
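One reason you might want the group mean repeated on every row is to compare each tree against its own age class. Here is a minimal sketch of that idea; the object name tree_diffs and the column height_diff are hypothetical names I chose for this example.

```r
library(dplyr)

# For each tree, calculate how far its height is from the mean of its age class
tree_diffs <- group_by(Loblolly, age) %>%
  mutate(age_avgheight = mean(height),
         height_diff = height - age_avgheight)

# Positive values are taller than average for that age, negative values are shorter
head(tree_diffs)

# Within each age class, the differences should average out to roughly zero
tapply(tree_diffs$height_diff, tree_diffs$age, mean)
```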

For functions like mutate() and filter() where we might want to keep working on the same data set afterwards, we need to ungroup() the data after grouping it so that the grouping doesn’t affect other functions down the line. I’ll demonstrate quickly:

# Demonstrating ungrouping data and mutating a new column for average height
group_by(Loblolly, age) %>%
  mutate(age_avgheight = mean(height)) %>%
  ungroup() %>%
  mutate(all_avgheight = mean(height))

## # A tibble: 84 × 5
##    height   age Seed  age_avgheight all_avgheight
##     <dbl> <dbl> <ord>         <dbl>         <dbl>
##  1   4.51     3 301            4.24          32.4
##  2  10.9      5 301           10.2           32.4
##  3  28.7     10 301           27.4           32.4
##  4  41.7     15 301           40.5           32.4
##  5  52.7     20 301           51.5           32.4
##  6  60.9     25 301           60.3           32.4
##  7   4.55     3 303            4.24          32.4
##  8  10.9      5 303           10.2           32.4
##  9  29.1     10 303           27.4           32.4
## 10  42.8     15 303           40.5           32.4
## # … with 74 more rows

After I ungrouped the data, I used mutate() to create a new column for average height again. But this time, because the data are ungrouped, the “all_avgheight” column just contains the average height of all trees in the data set rather than the average for each age group.
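To see why ungroup() matters, here is a sketch of what I'd expect to happen if we skipped it. The object name no_ungroup is made up for this example, and it assumes you are still working with the unmodified Loblolly data.

```r
library(dplyr)

# Same steps as before, but without ungroup() ahead of the second mutate()
no_ungroup <- group_by(Loblolly, age) %>%
  mutate(age_avgheight = mean(height)) %>%
  mutate(all_avgheight = mean(height))

# The table is still grouped by age...
group_vars(no_ungroup)

# ...so the "overall" average is really just the age-group average again
all(no_ungroup$all_avgheight == no_ungroup$age_avgheight)
```

Because the grouping is still in place, the second mean() is calculated within each age class, so the two columns should match row for row. No error or warning tells you this happened, which is why it is easy to miss.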

filter()

For the filter() example, I’m going to remove a few rows of data from the Loblolly data set so that we can more clearly see the effect of the filter. If you want to follow along, you can copy and paste the following code into your script:

# Remove some rows at random (sort of)
Loblolly <- Loblolly[-c(1, 2, 3, 4, 9, 10, 11, 17, 18, 22, 29, 30, 34, 35, 47, 55, 56, 70, 82, 83), ]

Now let’s see how to use filter() with group_by(). In the original data set, every tree was measured at 6 age classes: 3, 5, 10, 15, 20, and 25. But because I removed several rows of data, we are now missing some age classes for some trees (e.g., for trees 301 and 303).

# Look at age classes
sort(unique(Loblolly$age))

## [1]  3  5 10 15 20 25

# View modified data
head(Loblolly, 10)

##    height age Seed
## 57  52.70  20  301
## 71  60.92  25  301
## 2    4.55   3  303
## 16  10.92   5  303
## 72  63.39  25  303
## 3    4.79   3  305
## 17  11.37   5  305
## 31  30.21  10  305
## 45  44.40  15  305
## 4    3.91   3  307

Let’s say our data analysis requires that we have at least 5 age classes for each tree. In that case, we’ll have to eliminate all trees for which there are fewer than 5 ages. We can use group_by() to group by Seed (the individual tree), then use filter() to only include data that are in a group of at least 5. The function n() will help us count the number of rows in each group.

# Filtering to include groups of at least 5
group_by(Loblolly, Seed) %>%
  filter(n() >= 5) %>%
  ungroup()

## # A tibble: 39 × 3
##    height   age Seed 
##     <dbl> <dbl> <ord>
##  1   3.91     3 307  
##  2   9.48     5 307  
##  3  25.7     10 307  
##  4  50.8     20 307  
##  5  59.1     25 307  
##  6   4.32     3 315  
##  7  10.4      5 315  
##  8  27.2     10 315  
##  9  40.8     15 315  
## 10  51.3     20 315  
## # … with 29 more rows

We see that the data set is greatly reduced, and trees like 301 and 303 have been removed because they have fewer than 5 age classes. We can also run the opposite filter and only include data that are in a group of fewer than 5.

# Filtering to include groups of fewer than 5
group_by(Loblolly, Seed) %>%
  filter(n() < 5) %>%
  ungroup()

## # A tibble: 25 × 3
##    height   age Seed 
##     <dbl> <dbl> <ord>
##  1  52.7     20 301  
##  2  60.9     25 301  
##  3   4.55     3 303  
##  4  10.9      5 303  
##  5  63.4     25 303  
##  6   4.79     3 305  
##  7  11.4      5 305  
##  8  30.2     10 305  
##  9  44.4     15 305  
## 10   4.81     3 309  
## # … with 15 more rows
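If you want to check which trees will pass each filter before you run it, you can count the rows per tree first. Here is a minimal sketch; it rebuilds the reduced data under the hypothetical name lob_reduced so that it runs on its own, and tree_counts and n_ages are also names I made up.

```r
library(dplyr)

# Recreate the reduced data from the original data set (same rows removed as above)
lob_reduced <- datasets::Loblolly[-c(1, 2, 3, 4, 9, 10, 11, 17, 18, 22, 29, 30,
                                     34, 35, 47, 55, 56, 70, 82, 83), ]

# Count how many age classes (rows) are left for each tree
tree_counts <- group_by(lob_reduced, Seed) %>%
  summarise(n_ages = n())

tree_counts

# Trees with n_ages >= 5 pass filter(n() >= 5); the rest pass filter(n() < 5)
table(tree_counts$n_ages >= 5)
```

Adding up n_ages for the trees on each side of the cutoff should give the row counts of the two filtered tables above (39 and 25).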

Great! Now you’ve learned how to use the group_by() function along with several of the main dplyr functions: summarise(), mutate(), and filter(). I covered just a few ways you might use these functions; it’s up to you to play around with them and learn even more. And don’t forget to use ungroup()!

If you want to learn more about data wrangling with dplyr functions, you can check out our full course on the complete basics of R for ecology here: