How to Modify Variables the Right Way in R
A quick guide to modifying many columns at once like a pro.
In data analysis and data science, it’s common to work with large datasets that need some manipulation before they become useful. In this short article, we’ll explore how to create and modify columns in a dataframe using modern R tools from the tidyverse. There are several ways to do that, so we will go from the basic to the advanced level.
Let’s use the starwars dataset for that purpose:
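A minimal sketch of how such a subset could be built, assuming the first four rows and the columns from name to sex of the starwars data that ships with dplyr. The object name df is a placeholder chosen here for illustration:

```r
library(dplyr)

# Assumption: keep only the first four characters and the first
# eight columns (name through sex) so the output stays readable
df <- starwars %>%
  head(4) %>%
  select(name:sex)

df
```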
# # A tibble: 4 × 8
#   name           height  mass hair_color skin_color  eye_color birth_year sex
#   <chr>           <int> <dbl> <chr>      <chr>       <chr>          <dbl> <chr>
# 1 Luke Skywalker    172    77 blond      fair        blue            19   male
# 2 C-3PO             167    75 NA         gold        yellow         112   none
# 3 R2-D2              96    32 NA         white, blue red             33   none
# 4 Darth Vader       202   136 none       white       yellow          41.9 male
The most basic example is using mutate() to create and modify variables.
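A minimal sketch of that idea, assuming the four-row subset of starwars shown earlier. The names df, height_m and dataset are hypothetical, picked only for this illustration:

```r
library(dplyr)

# Assumed sample: first four rows, columns name through sex
df <- starwars %>% head(4) %>% select(name:sex)

res <- df %>%
  mutate(
    height_m = height / 100,  # new column computed row by row (cm to m)
    dataset  = "starwars"     # a length-1 value, repeated for every row
  )

res %>% select(name, height, height_m, dataset)
```

Reusing an existing name on the left-hand side (for example height = height / 100) would overwrite that column instead of adding a new one.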
Note how the second variable we created is recycled to match the length of the dataset. But you already knew that, right?
A common trick is making use of if_else() to conditionally modify some variables. I use this structure on a daily basis.
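As an illustration, here is a small sketch that replaces missing hair colors with a label. The replacement text "no hair" and the object name df are assumptions made for this example:

```r
library(dplyr)

# Assumed sample: first four rows, columns name through sex
df <- starwars %>% head(4) %>% select(name:sex)

res <- df %>%
  mutate(
    # if_else() is type-strict: both branches must be character here
    hair_color = if_else(is.na(hair_color), "no hair", hair_color)
  )

res %>% select(name, hair_color)
```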
Another common use case is to rely on the case_when() function to modify the variable based on several conditions:
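A sketch of what this can look like, assuming we want to abbreviate the sex column. The abbreviations "M" and "F" are hypothetical labels used only for this example:

```r
library(dplyr)

# Assumed sample: first four rows, columns name through sex
df <- starwars %>% head(4) %>% select(name:sex)

res <- df %>%
  mutate(
    sex = case_when(
      sex == "male"   ~ "M",
      sex == "female" ~ "F",
      TRUE            ~ sex  # fallback: keep the original value
    )
  )

res %>% select(name, sex)
```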
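Note that we should end the case_when() with an option that always yields TRUE. The conditions are evaluated in order and the first one that matches wins, so the final TRUE catches every row that met none of the previous conditions. Here we use it to leave the column as is for those rows; without it, they would become NA.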
All these are fairly basic examples. Let’s move on to the advanced dplyr way of creating and modifying variables.
The Advanced Way: Using across()
In modern R, we can modify several columns at once using across() inside mutate(). We also need to pass the transformation we want to apply to those variables. For that, we are using a lambda function, which basically means that we create the function on the fly without storing it under a name.
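A minimal sketch, assuming we want to divide two columns by 10. The transformation itself is arbitrary and only there to show the mechanics. The lambda is written in the formula style, where .x stands for the current column:

```r
library(dplyr)

# Assumed sample: first four rows, columns name through sex
df <- starwars %>% head(4) %>% select(name:sex)

res <- df %>%
  mutate(across(c(height, mass), ~ .x / 10))  # overwrites both columns

res %>% select(name, height, mass)
```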
That’s quite nice, but sometimes you don’t want to modify the existing columns; you want to create new ones instead.
This is an important use case: batch-creating several columns at once based on the existing ones. I already discussed how to do it in How to create multiple lags like a Pro. We can use the .names argument to dynamically specify the new column names, like this:
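A sketch under the same assumptions as before. The suffix _scaled is a hypothetical naming choice; {.col} is the placeholder that across() replaces with each original column name:

```r
library(dplyr)

# Assumed sample: first four rows, columns name through sex
df <- starwars %>% head(4) %>% select(name:sex)

res <- df %>%
  mutate(
    across(
      c(height, mass),
      ~ .x / 10,
      .names = "{.col}_scaled"  # creates height_scaled and mass_scaled
    )
  )

res %>% select(name, height, height_scaled, mass, mass_scaled)
```

The original height and mass columns stay untouched, and the new ones are appended to the dataframe.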
Awesome, right? However, I still had to type them all manually. There is a better way.
The Pro Way: Using across() + tidyselectors
What if we want to modify a lot of columns? There must be a better way to avoid having to type them all…
Sure there is!
tidyselectors to the rescue! Those are a family of functions that allow us to dynamically select several columns based on a condition. Let’s see that with an example.
Let’s say we want to modify only the numerical variables. We can do that easily with the help of the where() function. The neat part is that this family of functions works with several verbs of the Tidyverse, including mutate()! So by combining across() with where(), we can apply the function only to the desired columns (without having to type them!).
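A short sketch of both uses, assuming the same four-row sample. Using select() as the first verb and dividing by 10 are choices made for this illustration only:

```r
library(dplyr)

# Assumed sample: first four rows, columns name through sex
df <- starwars %>% head(4) %>% select(name:sex)

# where() inside select(): keep only the numeric columns
nums <- df %>% select(where(is.numeric))

# where() inside across(): transform only the numeric columns
res <- df %>%
  mutate(across(where(is.numeric), ~ .x / 10))

res
```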
Note how the name feature hasn’t been modified, as it is not a numeric variable. This is a really handy trick, especially when you are working with big datasets and need to perform an operation on many columns at once.
Also, it is worth noting that we can pass any function to across() to modify the selected columns. We don’t have to define the operation with a lambda function; any existing function can be used.
Here is another powerful example working with character columns. We can apply an existing function to make all of them uppercase:
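One way to write this, sketched with base R’s toupper() passed by name. Keeping only the character columns of the first four rows is an assumption made so that the result stays small:

```r
library(dplyr)

res <- starwars %>%
  head(4) %>%
  select(where(is.character)) %>%               # keep the text columns only
  mutate(across(where(is.character), toupper))  # no lambda needed

res
```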
# # A tibble: 4 × 8
#   name           hair_color skin_color  eye_color sex   gender    homeworld species
#   <chr>          <chr>      <chr>       <chr>     <chr> <chr>     <chr>     <chr>
# 1 LUKE SKYWALKER BLOND      FAIR        BLUE      MALE  MASCULINE TATOOINE  HUMAN
# 2 C-3PO          NA         GOLD        YELLOW    NONE  MASCULINE TATOOINE  DROID
# 3 R2-D2          NA         WHITE, BLUE RED       NONE  MASCULINE NABOO     DROID
# 4 DARTH VADER    NONE       WHITE       YELLOW    MALE  MASCULINE TATOOINE  HUMAN
Also, you don’t have to rely only on the where() tidyselector; you can use many others. Here’s another example that targets only the color columns:
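A possible version is sketched below. The choice of contains("color") is an assumption: it is one tidyselect helper that matches hair_color, skin_color and eye_color, and ends_with("color") would pick the same columns here:

```r
library(dplyr)

res <- starwars %>%
  head(4) %>%
  select(name, contains("color")) %>%
  # paste() turns a missing value into the text "NA"
  mutate(across(contains("color"), ~ paste("the color is", .x)))

res
```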
# # A tibble: 4 × 4
#   name           hair_color         skin_color               eye_color
#   <chr>          <chr>              <chr>                    <chr>
# 1 Luke Skywalker the color is blond the color is fair        the color is blue
# 2 C-3PO          the color is NA    the color is gold        the color is yellow
# 3 R2-D2          the color is NA    the color is white, blue the color is red
# 4 Darth Vader    the color is none  the color is white       the color is yellow
Handy stuff, right? There are so many more possibilities to discover. You can read more about it in the across() reference.