Sort in Python Polars. Arrange your DataFrames and Series

Effortless Data Arrangement: Mastering Sorting DataFrames and Series with Python Polars


How to sort data in Python Polars

Sorting data is a fundamental operation in data analysis and manipulation, and Polars lets you do it efficiently and effectively in Python. Whether you’re a data scientist, analyst, or simply a Python enthusiast looking to work with structured data, understanding how to sort data using Polars can significantly enhance your data processing capabilities. In this blog post, we will explore the ins and outs of sorting data in Polars: sorting DataFrames by one or several columns, sorting by expressions, a pitfall with expression expansion, and the sorted flag.


Polars sort a dataframe

The Polars DataFrame sort method has the following signature:

DataFrame.sort(
    by: IntoExpr | Iterable[IntoExpr],
    descending: bool | Sequence[bool] = False,
    nulls_last: bool = False,
) → DataFrame

Both descending and nulls_last default to False.

Let’s dive in: start by creating a simple Polars dataframe and check the sorting basics. You can pass sort() a string with the column name, or a pl.col() expression.

import polars as pl

df = pl.DataFrame(
    {
        "Company": ["Tesla", "Tesla_old", "Apple", "Microsoft"],
        "Market_Cap": [0.798, None, 2.78, 2.44],
        "Diluted_EPS": [0.79, None, 1.26, 2.69],
    }
)
df.sort("Market_Cap")
shape: (4, 3)
Company      Market_Cap  Diluted_EPS
str          f64         f64
"Tesla_old"  null        null
"Tesla"      0.798       0.79
"Microsoft"  2.44        2.69
"Apple"      2.78        1.26
# Compare both results to check that they are equal
# (newer Polars releases rename frame_equal to equals)
df.sort("Market_Cap").frame_equal(df.sort(pl.col("Market_Cap")))
## True

As expected, the default behaviour sorts values in ascending order and places nulls at the beginning.


Sort Polars Dataframe by several columns

Now, let’s sort by two columns in descending order and move null values to the end of the Polars dataframe. nulls_last defaults to False. In the Polars version used for this post, setting descending to True also pushed nulls to the end, so the argument was redundant here. Null placement in descending sorts may differ in other Polars releases, so passing nulls_last=True explicitly is the safer choice.

df.sort(
    ["Market_Cap", "Diluted_EPS"],
    descending=True,
    nulls_last=True,  # set explicitly so nulls always end up last
)
shape: (4, 3)
Company      Market_Cap  Diluted_EPS
str          f64         f64
"Apple"      2.78        1.26
"Microsoft"  2.44        2.69
"Tesla"      0.798       0.79
"Tesla_old"  null        null

Doing it by expression is also possible and becomes especially handy when you want to sort by columns while applying arithmetic operations, such as:

df.sort(
    pl.col("Market_Cap") / pl.col("Diluted_EPS"), 
    descending=True, 
)
shape: (4, 3)
Company      Market_Cap  Diluted_EPS
str          f64         f64
"Apple"      2.78        1.26
"Tesla"      0.798       0.79
"Microsoft"  2.44        2.69
"Tesla_old"  null        null


Polars Sort expression. Be careful with Polars expression expansion while sorting

As discussed in this GH issue, sorting inside a select statement may not behave the way someone new to Polars expects. When you use expression expansion, you end up with two separate expressions (one per column), each of which sorts its own column independently.

This breaks the row-wise relationship between the columns: values that belonged to the same row before sorting may no longer sit next to each other. Be aware of this so that your sorting operations match your intent and don’t silently compromise the coherence of your data.

df.select(pl.col(["Company","Diluted_EPS"]).sort())

# You can do something similar while keeping rows together by packing the columns inside a struct
# df.select(pl.struct(["Company", "Diluted_EPS"]).sort())
shape: (4, 2)
Company      Diluted_EPS
str          f64
"Apple"      null
"Microsoft"  0.79
"Tesla"      1.26
"Tesla_old"  2.69


Polars sorted flag

In Polars, a “sorted” flag lets you explicitly mark a column as sorted, which is handy when you know the data was generated in order, for example over a range of dates. The flag is set automatically when you use sort(). It acts as a performance optimisation: subsequent operations that need sorted data can rely on it and run more efficiently. Note that setting the flag yourself neither sorts nor verifies the data.

Let’s take a look at an illustrative example:

df["Diluted_EPS"].is_sorted()
## False
df_not_sorted_but_flagged_as_sorted = df.with_columns(pl.col("Diluted_EPS").set_sorted())
df_not_sorted_but_flagged_as_sorted["Diluted_EPS"].is_sorted()
## True

Remember to reassign the result, as Polars does not perform in-place operations.

df_sorted = df.sort("Diluted_EPS", descending=True)
print(df_sorted["Diluted_EPS"].is_sorted())
## False

What’s happening here? is_sorted() checks for ascending order by default, but we sorted in descending order. We can inspect a Polars column’s flags with:


df_sorted["Diluted_EPS"].flags
## {'SORTED_ASC': False, 'SORTED_DESC': True}

The proper way to verify this is to:

print(df_sorted["Diluted_EPS"].is_sorted(descending=True))
## True
any(df_sorted["Diluted_EPS"].flags.values())  # Or check both flags if you don't know the order
## True


Stay updated on Polars and Python tips

Hopefully, this post has helped you become familiar with Polars sort usage and allowed you to enjoy a showcase of some of its features.


Carlos Vecina
Senior Data Scientist at Jobandtalent
