Sort in Python Polars. Arrange your DataFrames and Series

Effortless Data Arrangement: Mastering Sorting DataFrames and Series with Python Polars

2023-10-08 Python, Tips


How to sort data in Python Polars

Sorting data is a fundamental operation in data analysis and manipulation, and Polars lets you do it efficiently in Python. Whether you’re a data scientist, analyst, or simply a Python enthusiast working with structured data, understanding how to sort with Polars can significantly improve your data processing. In this blog post, we will explore the ins and outs of sorting DataFrames and Series in Polars.

Polars sort a dataframe

The Polars DataFrame sort method has the following signature:

DataFrame.sort(
    by: IntoExpr | Iterable[IntoExpr],
    descending: bool | Sequence[bool] = False,
    nulls_last: bool = False,
) → DataFrame

Both the descending and nulls_last arguments default to False.

Let’s dive in: we start by creating a simple Polars dataframe and checking the sorting basics. You can pass sort() either a string with the column name or a pl.col() expression.

import polars as pl

df = pl.DataFrame(
    {
        "Company": ["Tesla", "Tesla_old", "Apple", "Microsoft"],
        "Market_Cap": [0.798, None, 2.78, 2.44],
        "Diluted_EPS": [0.79, None, 1.26, 2.69],
    }
)

df.sort("Market_Cap")

shape: (4, 3)

Company	Market_Cap	Diluted_EPS
str	f64	f64
"Tesla_old"	null	null
"Tesla"	0.798	0.79
"Microsoft"	2.44	2.69
"Apple"	2.78	1.26

# Compare both results to check that they are equal
# (newer Polars releases rename frame_equal to equals)
df.sort("Market_Cap").frame_equal(df.sort(pl.col("Market_Cap")))

## True

As expected, the default behaviour places nulls at the beginning and sorts the values in ascending order.

Sort Polars Dataframe by several columns

Now, let’s sort it in descending order by two columns and move null values to the end of the Polars dataframe. You can experiment with removing nulls_last, whose default is False. In the Polars version used for this post, sorting with descending=True already places null values at the end even without nulls_last=True, but this behaviour may differ between Polars versions, so passing it explicitly is the safer choice.

df.sort(
    ["Market_Cap", "Diluted_EPS"],
    descending=True,
    nulls_last=True,  # explicit for safety; the version used here already puts nulls last when descending
)

shape: (4, 3)

Company	Market_Cap	Diluted_EPS
str	f64	f64
"Apple"	2.78	1.26
"Microsoft"	2.44	2.69
"Tesla"	0.798	0.79
"Tesla_old"	null	null

Sorting by an expression is also possible, and becomes especially handy when you want to sort by a value computed from several columns with arithmetic operations, such as:

df.sort(
    pl.col("Market_Cap") / pl.col("Diluted_EPS"), 
    descending=True, 
)

shape: (4, 3)

Company	Market_Cap	Diluted_EPS
str	f64	f64
"Apple"	2.78	1.26
"Tesla"	0.798	0.79
"Microsoft"	2.44	2.69
"Tesla_old"	null	null

Polars Sort expression. Be careful with Polars expression expansion while sorting

As discussed in this GH issue, sorting inside a select statement can behave in a way that is not intuitive for someone who has recently started using Polars. When you use expression expansion, pl.col([...]).sort() expands into one expression per column, and each expression sorts its own column independently.

This breaks the row-wise relationship between the columns and can produce unexpected results: in the output below, each company ends up paired with an EPS value that belongs to a different row. Keep this in mind so that your sorting operations do what you intend and do not compromise the coherence of your data.

df.select(pl.col(["Company","Diluted_EPS"]).sort())

# You can do something similar by packing them inside a Struct
# df.select(pl.struct(["Company", "Diluted_EPS"]).sort())

shape: (4, 2)

Company	Diluted_EPS
str	f64
"Apple"	null
"Microsoft"	0.79
"Tesla"	1.26
"Tesla_old"	2.69

Polars sorted flag

In Polars, the “sorted” flag is useful when you want to explicitly mark a column as sorted, for example when the data was generated over a range of dates. The flag is set automatically when you use sort(). It serves as a performance hint: subsequent operations can run more efficiently, and functions that require sorted input can rely on it. Polars does not verify the flag when you set it manually, so only use set_sorted() when the data really is sorted.

Let’s take a look at an illustrative example:

df["Diluted_EPS"].is_sorted()

## False

df_not_sorted_but_flagged_as_sorted = df.with_columns(pl.col("Diluted_EPS").set_sorted())
df_not_sorted_but_flagged_as_sorted["Diluted_EPS"].is_sorted()

## True

Remember to reassign the result, as Polars does not operate in place.

df_sorted = df.sort("Diluted_EPS", descending=True)
print(df_sorted["Diluted_EPS"].is_sorted())

## False

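What’s happening here? The column is sorted in descending order, while is_sorted() checks for ascending order by default. We can inspect the flags of a Polars column with: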


df_sorted["Diluted_EPS"].flags

## {'SORTED_ASC': False, 'SORTED_DESC': True}

The proper way to verify this is to pass the sort direction:

print(df_sorted["Diluted_EPS"].is_sorted(descending=True))

## True

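# Or check the flag values directly if you don't know the order.
# Note the .values(): any() over the dict itself iterates the keys and is always True.
any(df_sorted["Diluted_EPS"].flags.values())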

## True

Stay updated on Polars and Python tips

Hopefully, this post has helped you become familiar with Polars sort usage and allowed you to enjoy a showcase of some of its features.



Carlos Vecina

Senior Data Scientist at Jobandtalent
