Efficient Column Selection in Polars: Utilizing Polars Selectors for Python DataFrame Manipulation

Mastering Column Selection in Python: Polars Selectors for Efficient DataFrame Handling


Embarking on the transition from the trusty Pandas Python library to the exhilarating realm of Polars for data manipulation is like setting off on a thrilling adventure!

Are you tired of the endless code scrolling just to pick the right columns for your Python DataFrames? Look no further! This post could be your hidden gem for precise and efficient column selection. In this guide, we’ll take you on a journey through the intricacies of Polars Selectors, helping you simplify your data analysis tasks and supercharge your Python projects.

Whether you’re a data scientist, analyst, or developer, mastering this essential skill will save you time and effort and spare you extra Stack Overflow searches. Let’s dive in and change the way you work with DataFrames in Python with Polars!


Polars: Choosing Columns with Square Brackets

Selecting columns with square brackets comes with certain limitations and is best suited to interactive, exploratory coding.

Let’s explore some examples after creating the base DataFrame. Note that it is an eager DataFrame this time, not a LazyFrame, which is what you would usually use when working with large amounts of data:

import polars as pl

# Create a Polars DataFrame with base columns and an explicit schema
df = pl.DataFrame(
    {
        "name": ["Alice", "Bob", "Charlie"],
        "offensive_skill": [5, 30, 85],
        "defensive_skill": [92, 30, 10],
    },
    schema={
        "name": pl.Utf8,
        "offensive_skill": pl.Int32,
        "defensive_skill": pl.Int32,
    },
)

We can select a column using square brackets with a string inside []. Note that the output is a Polars Series.

df["name"].head(3)
shape: (3,)
name
str
"Alice"
"Bob"
"Charlie"

We can choose several columns with a list of strings inside []. This time the output is a Polars DataFrame:

df[["name","defensive_skill"]].head(3)
shape: (3, 2)
name       defensive_skill
str        i32
"Alice"    92
"Bob"      30
"Charlie"  10

The major caveat of square-bracket column selection is that it can only be used in eager mode. So let’s dig deeper into the native Polars select expression.


Polars Select

One of the key benefits of the Polars select method is that it also works in lazy mode, which lets Polars optimize the query and execute it in parallel.

Note that this method always returns a DataFrame (or a LazyFrame in lazy mode), never a Series.

To choose specific columns, simply pass their string names as a list to the Polars select method.

(
    df
    .select(
        ["name", "defensive_skill"]
    )
    .head(3)
)
shape: (3, 2)
name       defensive_skill
str        i32
"Alice"    92
"Bob"      30
"Charlie"  10

In the next sections we will see that this is just the simplest way to do it; you will usually use pl.col() inside the select() function.


Polars selecting columns by regex

df.select(
    "^.*skill$"
)
shape: (3, 2)
offensive_skill  defensive_skill
i32              i32
5                92
30               30
85               10


Selecting columns with an expression with Polars and aliasing

As mentioned above, we can build powerful column expressions and transformations while selecting by using pl.col():

(
    df
    .select(
        pl.col("defensive_skill").mean().alias("defensive_skill_mean"),
        pl.col("defensive_skill").std().alias("defensive_skill_std"),
        # .name.suffix() is a shorter way to append a suffix to a column name
        # (older Polars versions spelled this .suffix(), which has since been deprecated)
        pl.col("offensive_skill").mean().name.suffix("_mean"),
        pl.col("offensive_skill").std().name.suffix("_std"),
    )
    .head(3)
)
shape: (1, 4)
defensive_skill_mean  defensive_skill_std  offensive_skill_mean  offensive_skill_std
f64                   f64                  f64                   f64
44.0                  42.755117            40.0                  40.926764


Polars selecting all columns or excluding some

Polars offers the flexibility to select all columns or exclude specific ones, providing you with powerful control over your data manipulation tasks.

df.select(
    pl.all()
)
shape: (3, 3)
name       offensive_skill  defensive_skill
str        i32              i32
"Alice"    5                92
"Bob"      30               30
"Charlie"  85               10

df.select(
    pl.exclude("defensive_skill")
    #pl.exclude(["name", "defensive_skill"]) # pass a list to exclude several columns
)
shape: (3, 2)
name       offensive_skill
str        i32
"Alice"    5
"Bob"      30
"Charlie"  85


Polars selecting columns based on type

Also, Polars’ capability to select columns based on their data type will simplify your data analysis.

df.select(
    pl.col(pl.Utf8)
    #pl.col([pl.Utf8, pl.Int32]) # several types as a list
    #pl.col(pl.NUMERIC_DTYPES) # all numeric types; newer Polars versions favor cs.numeric() from the selectors API
)
shape: (3, 1)
name
str
"Alice"
"Bob"
"Charlie"


Polars selectors API

Enhance your data analysis efficiency with Selectors! Selectors provide a convenient way to select columns from DataFrame or LazyFrame objects based on their name, data type, or other attributes. They streamline and extend the functionality offered by the col() expression, while also enabling the easy application of expressions to the selected columns. Say goodbye to tedious column selection and hello to the simplicity of Selectors!

import polars.selectors as cs

You can use this API to select columns by type, as in the example above, but in a simpler and more readable way:

df.select(
    cs.string()
)
shape: (3, 1)
name
str
"Alice"
"Bob"
"Charlie"

For datetime and time zone selection and manipulation with Polars, see our post about selecting datetimes and time zones with Polars selectors. Since our example DataFrame has no datetime columns, cs.datetime() selects nothing here:

df.select(
    cs.datetime()
)
shape: (0, 0)

A summary of all selector functions:

Function                                  Description
📊 all()                                  Select all columns.
📈 by_dtype(*dtypes)                      Select columns matching the given data types.
🏷️ by_name(*names)                        Select columns matching the given names.
🧩 categorical()                          Select all categorical columns.
🔍 contains(substring)                    Select columns containing the given substring(s).
📅 date()                                 Select all date columns.
⏳ datetime([time_unit, time_zone])       Select datetime columns, optionally filter by unit/zone.
⏱️ duration([time_unit])                  Select duration columns, optionally filter by unit.
🏁 ends_with(*suffix)                     Select columns ending with the given substring(s).
📦 expand_selector(target, selector)      Expand a selector to column names with respect to a specific target.
⏮️ first()                                Select the first column in the current scope.
📈 float()                                Select all float columns.
🔢 integer()                              Select all integer columns.
🧐 is_selector(obj)                       Check if the object/expression is a selector.
⏭️ last()                                 Select the last column in the current scope.
🔍 matches(pattern)                       Select columns matching the given regex pattern.
📊 numeric()                              Select all numeric columns.
🚀 starts_with(*prefix)                   Select columns starting with the given substring(s).
📝 string(*[, include_categorical])       Select Utf8 (and optionally Categorical) string columns.
📆 temporal()                             Select all temporal columns.
🕰️ time()                                 Select all time columns.

All of them can be found in the Polars selectors documentation.


Carlos Vecina

Senior Data Scientist at Jobandtalent | AI & Data Science for Business
