Using tidyr::pivot_longer() and regex for data wrangling

Author

Cath Blatter

Published

March 16, 2020

TL;DR

This code shows you how you can effectively wrangle your dataframe from wide to long using tidyr::pivot_longer() combined with regular expressions for properly naming the variables.

Your data is… not in the right format!

You have a wide dataset of patient data, each patient identified through ID. They have been hospitalised multiple times, each place of hospitalisation and outcome listed in its own variable. Place and Outcome of a hospitalization are linked with the same suffix number, e.g. _2_2:

# load packages
suppressPackageStartupMessages(library(tidyverse))

# create sample data
patient_data <- 
tibble::tribble(~ID, ~Hosp_Place_1, ~Hosp_Outcome_1, ~Hosp_Place_2_2, 
                ~Hosp_Outcome_2_2, ~Hosp_Place_2_3, ~Hosp_Outcome_2_3, 
                1, "London", "Alive", "Paris", "alive", "Rome", "dead",
                2, "Paris", "alive", "Rome", "alive", "London", "alive",
                3, "Berne", "dead", NA_character_, NA_character_, NA_character_, NA_character_) 

# look at data
patient_data
# A tibble: 3 × 7
     ID Hosp_Place_1 Hosp_Outcome_1 Hosp_Place_2_2 Hosp_Outcom…¹ Hosp_…² Hosp_…³
  <dbl> <chr>        <chr>          <chr>          <chr>         <chr>   <chr>  
1     1 London       Alive          Paris          alive         Rome    dead   
2     2 Paris        alive          Rome           alive         London  alive  
3     3 Berne        dead           <NA>           <NA>          <NA>    <NA>   
# … with abbreviated variable names ¹​Hosp_Outcome_2_2, ²​Hosp_Place_2_3,
#   ³​Hosp_Outcome_2_3

Patient No. 3 died in his first hospitatisation, so the following variables are listed NA.

For some analysis, this data structure would work ok, but for your specific questions you need to switch the unit of analysis to “hospitalisation” instead of “patient”. In other words, you’d like to switch the data from wide to long.

Thanks to the tidyverse’s initiative of making clear function names, it might be obvious, that tidyr::pivot_longer() should do what you want.

Familiarize yourself with the function - What to do with tidyr::pivot_longer()?

The first step is to initially think about, what the outcome should look like and what input-arguments the function takes, so let’s do this:

# tidyr::pivot_long() and its arguments
tidyr::pivot_longer(data, 
                    cols, 
                    names_to = "name", 
                    names_prefix = NULL, 
                    names_sep = NULL, 
                    names_pattern = NULL,
                    names_ptypes = list(),
                    names_repair = "check_unique",
                    values_to = "value",
                    values_drop_na = FALSE,
                    values_ptypes = list())

As you can see, only data and cols are effectively needed, so let’s try that. As I want to exclude the Patient-ID from pivoting, I remove this line from pivoting:

# as with all the tidyverse functions you can easily pipe-in the data as 
# the first argument

# '-ID' means, that all variables are used other than ID
patient_data %>% 
  pivot_longer(cols = -ID)
# A tibble: 18 × 3
      ID name             value 
   <dbl> <chr>            <chr> 
 1     1 Hosp_Place_1     London
 2     1 Hosp_Outcome_1   Alive 
 3     1 Hosp_Place_2_2   Paris 
 4     1 Hosp_Outcome_2_2 alive 
 5     1 Hosp_Place_2_3   Rome  
 6     1 Hosp_Outcome_2_3 dead  
 7     2 Hosp_Place_1     Paris 
 8     2 Hosp_Outcome_1   alive 
 9     2 Hosp_Place_2_2   Rome  
10     2 Hosp_Outcome_2_2 alive 
11     2 Hosp_Place_2_3   London
12     2 Hosp_Outcome_2_3 alive 
13     3 Hosp_Place_1     Berne 
14     3 Hosp_Outcome_1   dead  
15     3 Hosp_Place_2_2   <NA>  
16     3 Hosp_Outcome_2_2 <NA>  
17     3 Hosp_Place_2_3   <NA>  
18     3 Hosp_Outcome_2_3 <NA>  
# I could have put the following instead, meaning selecting the variables I want
# but it was generally shorter to drop just the ID
# patient_data %>%
#   pivot_longer(cols = Hosp_Place_1:Hosp_Outcome_2_3)

Something clearly happened, but name and value are not exactly what we want here.
The colnames “name” and “value” are actually coming from the default arguments.
What is now the next step?

Mental image of desired outcome - How should my dataframe look like?

My desired output is a dataframe with the colum names ID, Hosp_Place and Hosp_Outcome. Additionally, I want a variable - lets call it hosp_sequence - that captures the number of the hospitalisation (you remember the suffix of the original variable names).

tidyr::pivot_longer()’s names_to =-arguments states in the help-page:
Can be a character vector, creating multiple columns, if names_sep or names_pattern is provided.

If you can detect any patterns in the column names, its possible to use them for the column naming. If we look at Hosp_Place_1 and Hosp_Outcome_1 we can clearly see a pattern: The information I want as name is Hosp_Place and Hosp_Outcome and the number followed should be put in variable called hosp_sequence.

This translates to something like (Hosp_Place)_(1) where the parts in brackets correspond to the inputs given in names_to =. With the .value-argument, I can easily take over the string as it is.

I actually found this very confusing (honestly - still do…) and I had great help for defining the regular expression from R4DS1.

Finally, this is the code we need:

# this code should do the trick
patient_data %>% 
  pivot_longer(cols = -ID, 
               names_to = c(".value", "hosp_sequence"),
               names_pattern = '(^[A-z]+_[A-z]+)_([0-9].*)')
# A tibble: 9 × 4
     ID hosp_sequence Hosp_Place Hosp_Outcome
  <dbl> <chr>         <chr>      <chr>       
1     1 1             London     Alive       
2     1 2_2           Paris      alive       
3     1 2_3           Rome       dead        
4     2 1             Paris      alive       
5     2 2_2           Rome       alive       
6     2 2_3           London     alive       
7     3 1             Berne      dead        
8     3 2_2           <NA>       <NA>        
9     3 2_3           <NA>       <NA>        

Wait, what is happening with rows 8 and 9? This is my deceased Patient No. 3 an those are not hospitalisations anymore - how do I drop those rows?

# you can either use dplyr::drop_na() or specify the built-in argument to TRUE
patient_data %>% 
  pivot_longer(cols = -ID, 
               names_to = c(".value", "hosp_sequence"),
               names_pattern = '(^[A-z]+_[A-z]+)_([0-9].*)', 
               values_drop_na = TRUE)
# A tibble: 7 × 4
     ID hosp_sequence Hosp_Place Hosp_Outcome
  <dbl> <chr>         <chr>      <chr>       
1     1 1             London     Alive       
2     1 2_2           Paris      alive       
3     1 2_3           Rome       dead        
4     2 1             Paris      alive       
5     2 2_2           Rome       alive       
6     2 2_3           London     alive       
7     3 1             Berne      dead        

Comment

I wrote this blogpost after after solving exactly this issue with a real dataset for a colleague.
My work as a research programmer allows me to dive into data wrangling problems on a regular basis. As I learned most of my R skills from other blogposts from the fantastic R community, I started to write up some of the problems I encountered for others. I also use my previous blogposts sometimes, when I have to dig up old code…

Any comments from your side? Let me know!

Footnotes

  1. https://rfordatascience.slack.com/archives/C8K09CDNZ/p1584129595187200 If you are not already on this slack - sign up for it! Its just so great, low key help and great learning opportunities to just dive through the topics. ↩︎