Counting NAs by column in R
Benchmarking several functions. How much does staying in the pipe cost?
Are you starting your data exploration? Do you want an easy overview of the NA percentage of each variable?
We create a function to benchmark different ways of achieving it:
library(microbenchmark)
library(tidyverse)
benchmark_count_na_by_column <- function(dataset){
  microbenchmark(
    # Summary table output
    dataset %>% summary(),
    # Numeric output
    colSums(is.na(dataset)),
    sapply(dataset, function(x) sum(is.na(x))),
    # List output
    dataset %>% map(~sum(is.na(.))),
    lapply(dataset, function(x) sum(is.na(x))),
    # Df output
    dataset %>%
      select(everything()) %>%
      summarise_all(funs(sum(is.na(.)))),
    dataset %>%
      summarise_each(funs(sum(is.na(.)))),
    # Tibble output
    dataset %>% map_df(~sum(is.na(.)))
  )
}
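In current dplyr releases, `funs()` and `summarise_each()` are deprecated. A minimal sketch of the same per-column NA count written with `across()` (available from dplyr 1.0.0) might look like the following. The data frame `toy` is a hypothetical example, and this variant is not among the benchmarked expressions, so the timings in this post say nothing about it.

```r
library(dplyr)

# Hypothetical toy data, only for illustration (not the benchmarked dataset).
toy <- data.frame(
  x = c(1, NA, 3),
  y = c(NA, NA, "a"),
  z = c(TRUE, FALSE, TRUE)
)

# One row, one column per variable, each holding that variable's NA count.
na_counts <- toy %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

na_counts
```

The result is a one-row data frame, the same shape the `summarise_all()` call above returns.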
Here is how they perform on a small dataset:
## Unit: microseconds
## funct               min      lq     mean  median      uq    max neval class
## summary()        1480.5 1582.60 1979.676 1897.30 2100.45 6403.2   100 table
## colSums()          24.4   38.45   47.854   44.70   53.90  152.4   100 integer
## sapply()           23.2   35.05   67.891   39.65   50.30 2494.8   100 integer
## map()             140.2  182.60  214.092  200.75  238.50  549.6   100 list
## lapply()           11.2   15.65   27.093   18.85   22.45  750.1   100 list
## summarise_all()  1996.9 2147.80 2650.223 2382.90 2798.55 8133.7   100 data.frame
## summarise_each() 2277.9 2497.05 2951.477 2898.40 3080.65 7977.2   100 data.frame
## map_df()          190.0  249.00  331.368  275.40  326.05    383   100 tbl_df
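The post does not say which datasets produced these timings. To rerun the benchmark at different sizes, one option is to generate synthetic data with a known share of missing values. The sketch below is only an assumption about how that could be done: `make_na_dataset()` and its parameters are hypothetical names, not part of the original benchmark.

```r
# Hypothetical helper (not from the original post): builds a data frame of
# random numeric columns and sets roughly `na_rate` of each column to NA.
# Note that set.seed() resets the global RNG state on every call.
make_na_dataset <- function(n_rows, n_cols = 10, na_rate = 0.1, seed = 42) {
  set.seed(seed)
  cols <- lapply(seq_len(n_cols), function(i) {
    x <- rnorm(n_rows)
    x[runif(n_rows) < na_rate] <- NA
    x
  })
  names(cols) <- paste0("v", seq_len(n_cols))
  as.data.frame(cols)
}

small <- make_na_dataset(100)
large <- make_na_dataset(100000)

dim(large)
```

Either data frame could then be passed to `benchmark_count_na_by_column()` to compare how the expressions behave as the row count grows.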
Let's see how well they scale to a dataset with 100,000 rows:
## Unit: milliseconds
## funct                 min        lq       mean    median        uq      max neval class
## summary()        113.7535 129.35070 138.716624 133.14050 143.45920 252.0149   100 table
## colSums()          4.4280   5.31080  12.502741   5.65005  18.77570 124.8206   100 integer
## sapply()           2.2452   3.03095   6.788395   3.15310  15.04010  18.6061   100 integer
## map()              2.5950   3.28390   5.760602   3.38020   3.69445  19.4527   100 list
## lapply()           2.2018   2.95700   6.219106   3.03605   3.62860  19.5514   100 list
## summarise_all()    5.0982   5.85135  10.093431   6.05940   6.87070 127.5107   100 data.frame
## summarise_each()   5.7251   6.16980  10.191426   6.33065   6.72210 125.2943   100 data.frame
## map_df()           2.6913   3.42045   7.694863   3.56720   3.89715 122.2030   100 tbl_df
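The opening question was about NA percentage rather than raw counts. Since `is.na()` on a data frame yields a logical matrix, `colMeans()` gives the share of missing values per column in the same way `colSums()` gives the count. A minimal base R sketch, using a small hypothetical data frame `toy`:

```r
# Hypothetical toy data, only for illustration.
toy <- data.frame(
  x = c(1, NA, 3, 4),
  y = c(NA, NA, "a", "b"),
  z = c(TRUE, FALSE, TRUE, FALSE)
)

na_count <- colSums(is.na(toy))        # number of NAs per column
na_pct   <- 100 * colMeans(is.na(toy)) # percentage of NAs per column

# Combine both into one overview table, one row per variable.
data.frame(
  variable = names(toy),
  na_count = na_count,
  na_pct = na_pct,
  row.names = NULL
)
```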