Tidylog

Do you find Tidyverse pipelines useful? But do you need some kind of logging inside the fancy pipes? Here’s Tidylog, logging your pipelines

Pablo Cánovas

2020-01-21 R, Tips

Some time ago I made one of the best discoveries I have ever made in the Tidyverse: a tool called tidylog. This package is built on top of dplyr and tidyr and provides us with feedback on the results of the operations. Actually, this is a feature that already appeared in the Stata software.

When performing one operation at a time, it is easy to track the changes made on a table. However things get increasingly obscure when chaining multiple functions or dealing with big data frames.

We all love piping operations. I often ‘play’ to perform the whole transformation without leaving the pipeflow. But the counterpart is missing the intermediate states: you can make some big mistakes and be unaware of them until it’s too late and maybe you have to undone some work or rethink your analysis.

In this context, some additional info is always welcome. I think this feature is specially convenient for beginners, but not only! I have myself wasted several hours debugging long pipelines and trying to understand where the problems came from.

Let’s see a tiny bit of its behaviour with a simple example:

library(nycflights13)
library(tidyverse)
library(tidylog)

flights %>% 
  select(year:day, hour, origin, dest, tailnum, carrier) %>% 
  mutate(month = if_else(nchar(month) == 1, paste0("0",month), as.character(month)),
         day = if_else(nchar(day) == 1, paste0("0",day), as.character(day))) %>% 
  unite("date", year:day, sep = "/", remove = T) %>% 
  mutate(date = lubridate::ymd(date)) %>% 
  filter(hour >= 8) %>% 
  anti_join(planes, by = "tailnum") %>% 
  count(tailnum, sort = TRUE) 

# select: dropped 11 variables (dep_time, sched_dep_time, dep_delay, arr_time, sched_arr_time, …)
# mutate: converted 'month' from integer to character (0 new NA)
#         converted 'day' from integer to character (0 new NA)
# mutate: converted 'date' from character to double (0 new NA)
# filter: removed 50,726 rows (15%), 286,050 rows remaining
# anti_join: added no columns
#            > rows only in x    45,008
#            > rows only in y  (     39)
#            > matched rows    (241,042)
#            >                 =========
#            > rows total        45,008
# count: now 716 rows and 2 columns, ungrouped

Pretty neat! It is specially useful with joins, as it provides plenty of details and they can be a source of duplicated or missing rows.

I decided to write this little post now to celebrate that tidylog v1.0.0 has recently been released! Check the official repo out to see more examples or show some love to @elbersb on Twitter!

All in all, I think this package was a missing piece in the Tidyverse ecosystem: It is incredibly useful, whereas making advantage of it is as simple as writing library(tidylog). Integrating this package into our daily R work is a no-brainer!

Tidyverse

Tidylog

Pablo Cánovas

Senior Data Scientist at Spotahome

Related