Different approaches to rename variables

Author

Cath Blatter

Published

February 24, 2022

Renaming individual columns

This was usually a rather small post with snippets but turned out to be much more interesting…

Loading the necessary package

library(tidyverse)

Sample dataframe

Creating a sample dataframe: the function tribble() lets you construct rowwise-tibbles (thus the R in tribble()). This is closer to the idea of data in a spreadsheet than the mirrored version with tibble() or data.frame().

df <- tibble::tribble(
  ~id, ~pesnwi_leadership, ~pesnwi_sra, ~bernca_adl, ~bernca_docu,
  1, 2.3, 2.1, 1.8, 1.4,
  2, 2.0, 3, 2.1, 0.9,
  3, 3.1, 3.5, 3.9, 3
)

This is the equivalent code for the ‘mirrored’ version:

df <- 
  data.frame(
                   id = c(1, 2, 3),
    pesnwi_leadership = c(2.3, 2, 3.1),
           pesnwi_sra = c(2.1, 3, 3.5),
           bernca_adl = c(1.8, 2.1, 3.9),
          bernca_docu = c(1.4, 0.9, 3)
  )

Finally, this is how the dataframe looks like:

df
# A tibble: 3 × 5
     id pesnwi_leadership pesnwi_sra bernca_adl bernca_docu
  <dbl>             <dbl>      <dbl>      <dbl>       <dbl>
1     1               2.3        2.1        1.8         1.4
2     2               2          3          2.1         0.9
3     3               3.1        3.5        3.9         3  

Manual renaming with dplyr::rename()

Manual renaming following the idea of newname = oldname. This is the approach you need to take for manual replacement of column-names (see an equivalent base-R code below). You can rename more than three variables like this, just add further combinations of newname = oldname…

df %>%
  rename(
    pesnwi_ls = pesnwi_leadership,
    ration_adl = bernca_adl,
    ration_docu = bernca_docu
  )
# A tibble: 3 × 5
     id pesnwi_ls pesnwi_sra ration_adl ration_docu
  <dbl>     <dbl>      <dbl>      <dbl>       <dbl>
1     1       2.3        2.1        1.8         1.4
2     2       2          3          2.1         0.9
3     3       3.1        3.5        3.9         3  

Manual renaming with base-R

Of course its also possible to rename the columns with other approaches, e.g. :

names(df)[names(df)=="pesnwi_leadership"] <- "pesnwi_ls"

Renaming multiple columns at once

Renaming based on text-patterns with dplyr::rename_with()

It’s possible to rename multiple column-names based on matching text-patterns (regular expression). This is great if you need to rename multiple variables with the same pattern, but it only works if you can find such a pattern. If you’re in doubt which option to choose its probably safer to do the manual method above…

For the example dataframe, say we want to replace the prefix “bernca” (the name of an instrument to measure rationed nursing care 1) by the more general term “ration”. This is possible with the following code:

df %>%
  rename_with(
    .fn = ~ stringr::str_replace(., pattern = "bernca", replacement = "ration"),
    .cols = everything()
  )
# A tibble: 3 × 5
     id pesnwi_leadership pesnwi_sra ration_adl ration_docu
  <dbl>             <dbl>      <dbl>      <dbl>       <dbl>
1     1               2.3        2.1        1.8         1.4
2     2               2          3          2.1         0.9
3     3               3.1        3.5        3.9         3  

By default, all columns get considered, but we could limit this or manually define the columns with c(bernca_adl, bernca_docu). Note that I used the {stringr} package for replacing the pattern ‘bernca’ with ‘ration’.

Renaming based on names defined in an external document (e.g. xlsx)

Imagine the following situation: a survey study with three timepoints was conducted over a timespan of 2 years: T0 as baseline, T1 = T0+12months and T2 = T0+24months.
Some of the questions were only asked at baseline, some across all timepoints and selected variables only in T1 and T2. For several reasons (a.k.a. real-life data collection) this led to situations, where a variable named X12 in T0, did in fact not correspond to the variable named X12 in T1 and so on.

In total there were > 250 variables to rename and the person defining the new variables did not work with R. Thus, the solution was to store the information in an external spreadsheet and then write code to import this spreadsheet and rename based on it. But how?

A spreadsheet was created in excel that essentially held information on the variable names of a specific questions across all time points. Additionally, the first column ‘global’ defines the new variable name that will be used across all datasets once the renaming took place. Below is a small example of how the excel-file looked like:

global T0 T1 T2
ID ID ID ID
X1 X1 X3 X2
X2 X4 X10 X11
X3 X2 X12

Creating sample datasets

Below I created some sample datasets to mimic T0, T1 and T2 and the information from the spreadsheet:

For the data from the surveys T0, T1, T2

orig_t0 <- tibble::tribble(
  ~ID, ~X1, ~X4,
  "A", "m", 1,
  "B", "x", 2
)

orig_t1 <-  tibble::tribble(
  ~ID, ~X3, ~X10, ~X2,
  "AAA", "f", 1, 44,
  "BBB", "x", 2, 56,
)

orig_t2 <-  tibble::tribble(
  ~ID, ~X2, ~X11, ~X12,
  "FFF", "f", 2, 53,
  "BGG", "f", 1, 23,
)

Information from the spreadsheet

name_tbl <- tibble::tribble(
  ~global, ~T0, ~T1, ~T2,
  "ID", "ID", "ID", "ID",
  "X1", "X1", "X3", "X2",
  "X2", "X4", "X10", "X11",
  "X3", NA_character_, "X2", "X12"
)

Using a named vector to rename dataframes

check also this tweet by @PipingHotData: https://twitter.com/PipingHotData/status/1497014703473704965?s=20&t=TA5bW8K-wxczoaW2Q6UWCQ

As noted above rename() takes its inputs following the usual tidyverse-style, meaning rename(newname = oldname). This can be translated to the situation at hand by using the corresponding columns from the spreadsheet to create a named vector that has the information newname as ‘name’ and oldname as value. tibble::deframe() is suitable for this situation as described in the help-page: > deframe() converts two-column data frames to a named vector or list, using the first column as name and the second column as value.

For T0

# create a named vector
helper_rename_t0 <- name_tbl %>%
  select(global, T0) %>% # take the two variables needed
  drop_na() %>% # drop_na() is important here 
  deframe() # deframe creates the named vector

# rename with !!! 
(global_t0 <- 
  orig_t0 %>%
    rename(!!!helper_rename_t0))
# A tibble: 2 × 3
  ID    X1       X2
  <chr> <chr> <dbl>
1 A     m         1
2 B     x         2

For T1

# create a named vector
helper_rename_t1 <- name_tbl %>%
  select(global, T1) %>% # take the two variables needed
  drop_na() %>% # drop_na() is important here 
  deframe() # deframe creates the named vector

# rename with !!! 
(global_t1 <- 
  orig_t1 %>%
    rename(!!!helper_rename_t1))
# A tibble: 2 × 4
  ID    X1       X2    X3
  <chr> <chr> <dbl> <dbl>
1 AAA   f         1    44
2 BBB   x         2    56

For T2

# create a named vector
helper_rename_t2 <- name_tbl %>%
  select(global, T2) %>% # take the two variables needed
  drop_na() %>% # drop_na() is important here 
  deframe() # deframe creates the named vector

# rename with !!! 
(global_t2 <- 
  orig_t2 %>%
    rename(!!!helper_rename_t2))
# A tibble: 2 × 4
  ID    X1       X2    X3
  <chr> <chr> <dbl> <dbl>
1 FFF   f         2    53
2 BGG   f         1    23

Finally, using bind_rows() to combine all datasets

(version1 <- bind_rows(global_t0, global_t1, global_t2))
# A tibble: 6 × 4
  ID    X1       X2    X3
  <chr> <chr> <dbl> <dbl>
1 A     m         1    NA
2 B     x         2    NA
3 AAA   f         1    44
4 BBB   x         2    56
5 FFF   f         2    53
6 BGG   f         1    23

This approach worked very well, however some elements are repeated for every dataset (creating the helper vector, renaming), thus maybe a more functional programming approach might be suitable.

Using functional programming to rename all dataframes at once

The approach above works very well, yet sometimes its nice to have the output checked to the original content as well (see if the renaming was really correct).

The code presented here war written with help from the R4DS-online community, specifically Tyler Smith for the first option and @jonthegeek for the 2nd one.

First option

# create a long df with matching variable names
var_names <- 
  name_tbl %>% 
  rename(to = global) %>% 
  pivot_longer(-to, names_to = "source", values_to = "from") %>% 
  select(source, from, to) %>% 
  drop_na()


# helper function to rename the cols
rename_cols <- function(.data, from, to) {
  dplyr::rename(.data, 
                !!!rlang::syms(
                  purrr::discard(
                    rlang::set_names(from, to), is.na(from))))
}


# helper function to table the cols
# to compare their content
table_cols <- function(.data, cols, ...) {
  purrr::map(cols, ~ base::table(.data[[.x]]), ...)
}


# creating a list of all dataframes that need to be renamed
data_list <- list(
  T0 = orig_t0,
  T1 = orig_t1,
  T2 = orig_t2
)

# Create a nested tibble, define the variables to 
# be rename for each row, apply the renaming, finbally validate the
# data_table (original dataframe) versus the output_table (renamed dataframe)

mapped_object <- 
  enframe(data_list, name = "source", value = "data") %>%
  mutate(mapping = map(source, ~ filter(var_names, source == .))) %>%
  hoist(mapping, from = "from", to = "to") %>%
  mutate(
    output = pmap(list(data, from, to), rename_cols), # renaming
    data_table = map2(data, from, table_cols, useNA = "always"), # validation
    output_table = map2(output, to, table_cols, useNA = "always"), # validation
    validate = map2_lgl(data_table, output_table, identical) # validation
  )


# from the mapped_pbject, 
# unnest the output - don't forget to store it into its own object
mapped_object %>% 
  select(output) %>% 
  unnest(cols = output)
# A tibble: 6 × 4
  ID    X1       X2    X3
  <chr> <chr> <dbl> <dbl>
1 A     m         1    NA
2 B     x         2    NA
3 AAA   f         1    44
4 BBB   x         2    56
5 FFF   f         2    53
6 BGG   f         1    23

Second option

The approach is very similar: use purrr::map2_dfr() to indicate the use of two lists (.x and .y) whereas .x is the list of the dataframes and .y is the list of the vectors with the names. The suffix _dfr() does bind_rows() to combine the results into one dataframe

# rename 
map2_dfr(
  list(orig_t0, orig_t1, orig_t2),
  list(name_tbl$T0, name_tbl$T1, name_tbl$T2), function(tdf, names) {
    real_names <- name_tbl$global[!is.na(names)]
    names <- keep(names, function(v) !is.na(v))
    tdf %>%
      rename(!!!rlang::syms(set_names(names, real_names)))
  }
)
# A tibble: 6 × 4
  ID    X1       X2    X3
  <chr> <chr> <dbl> <dbl>
1 A     m         1    NA
2 B     x         2    NA
3 AAA   f         1    44
4 BBB   x         2    56
5 FFF   f         2    53
6 BGG   f         1    23

TL;DR

  • There are many different options on to how to rename a variable

  • This post should give an overview of some of the possibilities I have worked with

  • IMHO the best solution depends on the situation at hand - choose your pick!

Footnotes

  1. Schubert et al., 2007, https://doi.org/10.1097/01.nnr.0000299853.52429.62↩︎