Polars: Filter rows and columns based on percentage of NAs / nulls

Optimizing Data Cleanup: Filtering Rows by Null Percentage in Python Polars. Manage NAs with precision.

Polars is well-known for working with LazyFrames and avoiding materializing data till necessary. That’s the main reason Polars Selectors work with schema available. But filtering rows and columns based on NAs is a first basic step needed for many data analysis.

So let’s see the most idiomatic way we have to do so, as the trade-off of syntactic sugar vs high performance of query planning and lazy evaluation.

import polars as pl

df = pl.DataFrame({
    "column_1": [1, 2, 3],
    "column_2": [4, None, 5],
    "column_3": [None, 6, None],
    "column_4": [7, 8, 9]
}).lazy()


Filter rows based on percentage of NAs

How to filter rows based on null percentage in Python Polars? As you may notice, we could include it in a Polars Lazy pipeline.

df.filter(pl.sum_horizontal(pl.all().is_null() / pl.all().count()) < 0.7).collect()
shape: (3, 4)
column_1column_2column_3column_4
i64i64i64i64
14null7
2null68
35null9


Filter and drop columns based on percentage of NAs

Do you want to select columns that are populated higher than a given percentage? Here what it takes to do so. Of course, the main computation needs to materialize. But you know, you can’t make an omelette without breaking some eggs.

Has we posted about dropping columns based on NAs percentage without leaving the pipe flow, we are not doing it on Python Polars.

df.select(
    [
        column for i, column in
        enumerate(df.columns) if df.select(pl.all().is_null().sum() / pl.all().count() < 0.6).collect().to_numpy()[0][i]
]
).collect()
shape: (3, 3)
column_1column_2column_4
i64i64i64
147
2null8
359

Or even better, as suggested here:

df_mat = df.collect()
df.select(col.name for col in df_mat.null_count() / df_mat.height if col.item() <= 0.2).collect()
shape: (3, 2)
column_1column_4
i64i64
17
28
39


Carlos Vecina
Carlos Vecina
Senior Data Scientist at Jobandtalent

Senior Data Scientist at Jobandtalent | AI & Data Science for Business

Related