Using tidyr::pivot_longer() and regex for data wrangling
Author
Cath Blatter
Published
March 16, 2020
TL;DR
This code shows you how you can effectively wrangle your dataframe from wide to long using tidyr::pivot_longer() combined with regular expressions for properly naming the variables.
Your data is… not in the right format!
You have a wide dataset of patient data, each patient identified through ID. They have been hospitalised multiple times, each place of hospitalisation and outcome listed in its own variable. Place and Outcome of a hospitalization are linked with the same suffix number, e.g. _2_2:
# A tibble: 3 × 7
ID Hosp_Place_1 Hosp_Outcome_1 Hosp_Place_2_2 Hosp_Outcom…¹ Hosp_…² Hosp_…³
<dbl> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 London Alive Paris alive Rome dead
2 2 Paris alive Rome alive London alive
3 3 Berne dead <NA> <NA> <NA> <NA>
# … with abbreviated variable names ¹Hosp_Outcome_2_2, ²Hosp_Place_2_3,
# ³Hosp_Outcome_2_3
Patient No. 3 died in his first hospitatisation, so the following variables are listed NA.
For some analysis, this data structure would work ok, but for your specific questions you need to switch the unit of analysis to “hospitalisation” instead of “patient”. In other words, you’d like to switch the data from wide to long.
Thanks to the tidyverse’s initiative of making clear function names, it might be obvious, that tidyr::pivot_longer() should do what you want.
Familiarize yourself with the function - What to do with tidyr::pivot_longer()?
The first step is to initially think about, what the outcome should look like and what input-arguments the function takes, so let’s do this:
# tidyr::pivot_long() and its argumentstidyr::pivot_longer(data, cols, names_to ="name", names_prefix =NULL, names_sep =NULL, names_pattern =NULL,names_ptypes =list(),names_repair ="check_unique",values_to ="value",values_drop_na =FALSE,values_ptypes =list())
As you can see, only data and cols are effectively needed, so let’s try that. As I want to exclude the Patient-ID from pivoting, I remove this line from pivoting:
# as with all the tidyverse functions you can easily pipe-in the data as # the first argument# '-ID' means, that all variables are used other than IDpatient_data %>%pivot_longer(cols =-ID)
# A tibble: 18 × 3
ID name value
<dbl> <chr> <chr>
1 1 Hosp_Place_1 London
2 1 Hosp_Outcome_1 Alive
3 1 Hosp_Place_2_2 Paris
4 1 Hosp_Outcome_2_2 alive
5 1 Hosp_Place_2_3 Rome
6 1 Hosp_Outcome_2_3 dead
7 2 Hosp_Place_1 Paris
8 2 Hosp_Outcome_1 alive
9 2 Hosp_Place_2_2 Rome
10 2 Hosp_Outcome_2_2 alive
11 2 Hosp_Place_2_3 London
12 2 Hosp_Outcome_2_3 alive
13 3 Hosp_Place_1 Berne
14 3 Hosp_Outcome_1 dead
15 3 Hosp_Place_2_2 <NA>
16 3 Hosp_Outcome_2_2 <NA>
17 3 Hosp_Place_2_3 <NA>
18 3 Hosp_Outcome_2_3 <NA>
# I could have put the following instead, meaning selecting the variables I want# but it was generally shorter to drop just the ID# patient_data %>%# pivot_longer(cols = Hosp_Place_1:Hosp_Outcome_2_3)
Something clearly happened, but name and value are not exactly what we want here.
The colnames “name” and “value” are actually coming from the default arguments.
What is now the next step?
Mental image of desired outcome - How should my dataframe look like?
My desired output is a dataframe with the colum names ID, Hosp_Place and Hosp_Outcome. Additionally, I want a variable - lets call it hosp_sequence - that captures the number of the hospitalisation (you remember the suffix of the original variable names).
tidyr::pivot_longer()’s names_to =-arguments states in the help-page: Can be a character vector, creating multiple columns, if names_sep or names_pattern is provided.
If you can detect any patterns in the column names, its possible to use them for the column naming. If we look at Hosp_Place_1 and Hosp_Outcome_1 we can clearly see a pattern: The information I want as name is Hosp_Place and Hosp_Outcome and the number followed should be put in variable called hosp_sequence.
This translates to something like (Hosp_Place)_(1) where the parts in brackets correspond to the inputs given in names_to =. With the .value-argument, I can easily take over the string as it is.
I actually found this very confusing (honestly - still do…) and I had great help for defining the regular expression from R4DS1.
Finally, this is the code we need:
# this code should do the trickpatient_data %>%pivot_longer(cols =-ID, names_to =c(".value", "hosp_sequence"),names_pattern ='(^[A-z]+_[A-z]+)_([0-9].*)')
# A tibble: 9 × 4
ID hosp_sequence Hosp_Place Hosp_Outcome
<dbl> <chr> <chr> <chr>
1 1 1 London Alive
2 1 2_2 Paris alive
3 1 2_3 Rome dead
4 2 1 Paris alive
5 2 2_2 Rome alive
6 2 2_3 London alive
7 3 1 Berne dead
8 3 2_2 <NA> <NA>
9 3 2_3 <NA> <NA>
Wait, what is happening with rows 8 and 9? This is my deceased Patient No. 3 an those are not hospitalisations anymore - how do I drop those rows?
# you can either use dplyr::drop_na() or specify the built-in argument to TRUEpatient_data %>%pivot_longer(cols =-ID, names_to =c(".value", "hosp_sequence"),names_pattern ='(^[A-z]+_[A-z]+)_([0-9].*)', values_drop_na =TRUE)
# A tibble: 7 × 4
ID hosp_sequence Hosp_Place Hosp_Outcome
<dbl> <chr> <chr> <chr>
1 1 1 London Alive
2 1 2_2 Paris alive
3 1 2_3 Rome dead
4 2 1 Paris alive
5 2 2_2 Rome alive
6 2 2_3 London alive
7 3 1 Berne dead
Comment
I wrote this blogpost after after solving exactly this issue with a real dataset for a colleague.
My work as a research programmer allows me to dive into data wrangling problems on a regular basis. As I learned most of my R skills from other blogposts from the fantastic R community, I started to write up some of the problems I encountered for others. I also use my previous blogposts sometimes, when I have to dig up old code…
Any comments from your side? Let me know!
Footnotes
https://rfordatascience.slack.com/archives/C8K09CDNZ/p1584129595187200 If you are not already on this slack - sign up for it! Its just so great, low key help and great learning opportunities to just dive through the topics. ↩︎