New Polars feature: suggesting a more efficient Polars method for apply with lambda functions
Using apply with lambda functions is less performant than the native Polars API functions. Now, you are warned about it and are presented with a more efficient alternative.
The last few days have been thrilling in the Polars repo. One of the most celebrated improvements, merged yesterday, concerns the use of the apply function to map UDFs over columns. While apply opens the door to complex logic that the common API does not cover, in all other cases there is an optimized Polars function that can run orders of magnitude faster.
The check inspects the lambda's bytecode, including its binary operations, in order to detect an unnecessary apply.
So for this code:
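import dis

# Note: is_unnecessary_apply and generate_warning are not defined in this snippet.
for fn in (
    lambda x: 100,
    lambda x: x,
    lambda x: x + 1 - (2 / 3),
    lambda x: x // 1 % 1,
    lambda x: x & True,
    lambda x: x | False,
    lambda x: x == "three",
    lambda x: x != 3,
    lambda x: x is None,
):
    # Disassemble the lambda into (opname, argrepr) pairs,
    # dropping the first and last instructions.
    insts = dis.get_instructions(fn)
    bytecode_ops = [(inst.opname, inst.argrepr) for inst in insts][1:-1]
    if is_unnecessary_apply(ops=bytecode_ops):
        generate_warning(ops=bytecode_ops)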
We get:
Apply returns constant: use '100' directly
Apply returns constant: use 'x' directly
Unnecessary apply: use pl.col("x") + 1 - 0.6666666666666666 instead
Unnecessary apply: use pl.col("x") // 1 % 1 instead
Unnecessary apply: use pl.col("x") & True instead
Unnecessary apply: use pl.col("x") | False instead
Unnecessary apply: use pl.col("x") == 'three' instead
Unnecessary apply: use pl.col("x") != 3 instead
Unnecessary apply: use pl.col("x") is None instead
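To make the idea more concrete, here is a minimal sketch of how a lambda's bytecode can be scanned for operators with the standard dis module. It is not the code that was merged into Polars: operator_instructions and describe are hypothetical helper names, and the matching is deliberately loose because opcode names differ between Python versions.

```python
import dis

# Opcode names vary across Python versions (e.g. BINARY_ADD before 3.11,
# BINARY_OP from 3.11 on), so this sketch matches on a prefix and a small set.
# Note: the "BINARY_" prefix also catches subscripting such as x[0] on
# versions that have a BINARY_SUBSCR opcode.
COMPARISON_OPS = {"COMPARE_OP", "IS_OP", "CONTAINS_OP"}


def operator_instructions(fn):
    """Return the instructions of `fn` that are binary operators or comparisons."""
    return [
        inst
        for inst in dis.get_instructions(fn)
        if inst.opname.startswith("BINARY_") or inst.opname in COMPARISON_OPS
    ]


def describe(fn):
    """Summarise which operator instructions were found (hypothetical helper)."""
    ops = operator_instructions(fn)
    if not ops:
        return "no operator: the lambda returns its input or a constant"
    return f"{len(ops)} operator instruction(s): " + ", ".join(op.opname for op in ops)


print(describe(lambda x: x))
print(describe(lambda x: x + 1))
print(describe(lambda x: x == "three"))
```

A lambda with no operator instructions corresponds to the "returns constant" messages above, while one made only of operators applied to its argument is a candidate for a native expression.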
Let's see it with an example:
import polars as pl

df = pl.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'offensive_skill': [5, 30, 85],
    'defensive_skill': [92, 30, 10]
})
df.with_columns(
    pl.col("defensive_skill").apply(lambda x: x/3)
)
PolarsInefficientApplyWarning:
Expr.apply is significantly slower than the native expressions API.
Only use if you absolutely CANNOT implement your logic otherwise.
In this case, you can replace your `apply` with an expression:
- pl.col("defensive_skill").apply(lambda x: ...)
+ pl.col("defensive_skill") / 3
So the recommended option is to drop apply and use the native expression pl.col("defensive_skill") / 3 directly.
For now, the check only handles lambdas that take a single argument (e.g., lambda x: but not lambda x, y:) and return a single binary operation or comparison (e.g., lambda x: x+1 or lambda x: x==1).
Additionally, the lambda function can only use its own variable (e.g., lambda a: a+1 is acceptable, but not lambda a: b+1).
You can follow the discussion thread here!