Sort in Python Polars. Arrange your DataFrames and Series
Effortless Data Arrangement: Mastering Sorting DataFrames and Series with Python Polars
How to sort data in Python Polars
Sorting data is a fundamental operation in data analysis and manipulation, and Polars lets you do it efficiently in Python. Whether you’re a data scientist, analyst, or simply a Python enthusiast working with structured data, understanding how to sort data using Polars can noticeably improve your data processing workflow. In this blog post, we will explore the ins and outs of sorting data in Polars: the DataFrame sort method, sort expressions, and the sorted flag.
Polars sort a dataframe
Polars DataFrame sort method has the following typing:
DataFrame.sort(
by: IntoExpr | Iterable[IntoExpr],
descending: bool | Sequence[bool] = False,
nulls_last: bool = False,
) → DataFrame
The descending and nulls_last arguments both default to False.
Let’s dive in and start by creating a simple Polars dataframe to check the sorting basics. You can pass sort a string with a column name, or an expression built with pl.col().
import polars as pl
df = pl.DataFrame(
{
"Company": ["Tesla", "Tesla_old", "Apple", "Microsoft"],
"Market_Cap": [0.798, None, 2.78, 2.44],
"Diluted_EPS": [0.79, None, 1.26, 2.69],
}
)
Company | Market_Cap | Diluted_EPS |
---|---|---|
str | f64 | f64 |
"Tesla_old" | null | null |
"Tesla" | 0.798 | 0.79 |
"Microsoft" | 2.44 | 2.69 |
"Apple" | 2.78 | 1.26 |
## True
As expected, the default behaviour is to sort values in ascending order and place nulls at the beginning.
Sort Polars Dataframe by several columns
Now, let’s sort it in descending order and move null values to the end of the Polars dataframe. The default for nulls_last is False. In the Polars version used for this post, setting descending to True already places null values at the end, so you can experiment with removing nulls_last. This behaviour may differ between Polars releases, so passing nulls_last=True explicitly is the safer choice.
df.sort(
["Market_Cap", "Diluted_EPS"],
descending=True,
nulls_last=True,  # redundant in the version used here; explicit is safer across releases
)
Company | Market_Cap | Diluted_EPS |
---|---|---|
str | f64 | f64 |
"Apple" | 2.78 | 1.26 |
"Microsoft" | 2.44 | 2.69 |
"Tesla" | 0.798 | 0.79 |
"Tesla_old" | null | null |
Doing it by expression is also possible and becomes especially handy when you want to sort by columns while applying arithmetic operations, such as:
Company | Market_Cap | Diluted_EPS |
---|---|---|
str | f64 | f64 |
"Apple" | 2.78 | 1.26 |
"Tesla" | 0.798 | 0.79 |
"Microsoft" | 2.44 | 2.69 |
"Tesla_old" | null | null |
Polars Sort expression. Be careful with Polars expression expansion while sorting
As discussed in this GH issue, sorting inside a select statement can behave in a way that is not intuitive for someone who has recently started using Polars. When you rely on expression expansion, pl.col() over two columns expands into two separate expressions, and each one sorts its own column independently.
This breaks the relative order of your data: values that started in the same row no longer sit next to each other, which is rarely what you intended. Keep this in mind so that a sort inside select does not silently compromise the coherence of your rows.
df.select(pl.col(["Company","Diluted_EPS"]).sort())
# You can do something similar, while keeping rows together, by packing them inside a Struct
# df.select(pl.struct(["Company", "Diluted_EPS"]).sort())
Company | Diluted_EPS |
---|---|
str | f64 |
"Apple" | null |
"Microsoft" | 0.79 |
"Tesla" | 1.26 |
"Tesla_old" | 2.69 |
Polars sorted flag
In Polars, the “sorted” flag comes in handy when you want to explicitly indicate that a column is already sorted, for example when the data was generated over a range of dates. The flag is set automatically when you use the sort() expression. It works as a performance hint: operations that benefit from sorted input can run more efficiently when they know the data is already in order.
Let’s take a look at an illustrative example:
## False
df_not_sorted_but_flagged_as_sorted = df.with_columns(pl.col("Diluted_EPS").set_sorted())
df_not_sorted_but_flagged_as_sorted["Diluted_EPS"].is_sorted()
## True
Remember the reassignment, as Polars does not work with in-place operations.
## False
What’s happening here? We can access a Polars column’s flags by doing:
## {'SORTED_ASC': False, 'SORTED_DESC': True}
The proper way to verify this is to:
## True
## True
Hopefully, this post has helped you become familiar with Polars sort usage and allowed you to enjoy a showcase of some of its features.