Polars: Filter rows and columns based on percentage of NAs / nulls
Optimizing Data Cleanup: Filtering Rows and Columns by Null Percentage in Python Polars. Manage NAs with precision.
Polars is well known for working with LazyFrames and avoiding materializing data until necessary; that is also why Polars selectors operate on the schema alone. Yet filtering rows and columns based on their share of NAs is a basic first step in many data analyses.
So let’s look at the most idiomatic ways to do it, and at the trade-off between syntactic sugar and the performance benefits of query planning and lazy evaluation.
import polars as pl

df = pl.DataFrame({
    "column_1": [1, 2, 3],
    "column_2": [4, None, 5],
    "column_3": [None, 6, None],
    "column_4": [7, 8, 9],
}).lazy()
Filter rows based on percentage of NAs
How do you filter rows based on their null percentage in Python Polars? Conveniently, this step can stay entirely inside a Polars lazy pipeline.
| column_1 | column_2 | column_3 | column_4 |
|---|---|---|---|
| i64 | i64 | i64 | i64 |
| 1 | 4 | null | 7 |
| 2 | null | 6 | 8 |
| 3 | 5 | null | 9 |
Filter and drop columns based on percentage of NAs
Do you want to keep only the columns that are populated above a given percentage? Here’s what it takes. Of course, the main computation needs to materialize the data — but you can’t make an omelette without breaking some eggs.
While we have previously posted about dropping columns based on NA percentage without leaving the pipe flow, in Python Polars we do break out of the pipeline here.
# Per column, check whether the null fraction is below 60%.
# Note: Expr.count() ignores nulls, so divide by pl.len() (total rows) instead.
keep = df.select(pl.all().null_count() / pl.len() < 0.6).collect().row(0)

df.select(
    [column for column, keep_col in zip(df.columns, keep) if keep_col]
).collect()
| column_1 | column_2 | column_4 |
|---|---|---|
| i64 | i64 | i64 |
| 1 | 4 | 7 |
| 2 | null | 8 |
| 3 | 5 | 9 |
Or even better, as suggested here:
df_mat = df.collect()

# null_count() returns a one-row DataFrame; dividing by the height gives
# the null fraction per column. Keep the names whose share is at most 20%.
null_fractions = df_mat.null_count() / df_mat.height
df.select(col.name for col in null_fractions if col.item() <= 0.2).collect()
| column_1 | column_4 |
|---|---|
| i64 | i64 |
| 1 | 7 |
| 2 | 8 |
| 3 | 9 |