New Polars feature: suggesting a more efficient native expression when apply is used with lambda functions

Using apply with lambda functions is less performant than the native Polars API functions. Now, you are warned about it and are presented with a more efficient alternative.

The last few days have been thrilling in the Polars repo. One of the most celebrated improvements, merged yesterday, concerns the use of the apply function to map a UDF over a column. While apply is a door to writing complex logic that the common API is not prepared for, in all other cases there is an optimized Polars function that can run orders of magnitude faster.

It inspects the lambda's bytecode and looks at its binary operations in order to detect an unnecessary apply.

So for this code

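import dis

# Note: is_unnecessary_apply and generate_warning are helpers that are
# not shown in this snippet.
for fn in (
    lambda x: 100,
    lambda x: x,
    lambda x: x + 1 - (2 / 3),
    lambda x: x // 1 % 1,
    lambda x: x & True,
    lambda x: x | False,
    lambda x: x == "three",
    lambda x: x != 3,
    lambda x: x is None,
):
    # Disassemble the lambda into its bytecode instructions
    insts = dis.get_instructions(fn)
    # Keep (operation name, argument) pairs, dropping the first and last instruction
    bytecode_ops = [(inst.opname, inst.argrepr) for inst in insts][1:-1]

    if is_unnecessary_apply(ops=bytecode_ops):
        generate_warning(ops=bytecode_ops)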

We get:

Apply returns constant: use '100' directly
Apply returns constant: use 'x' directly
Unnecessary apply: use pl.col("x") + 1 - 0.6666666666666666 instead
Unnecessary apply: use pl.col("x") // 1 % 1 instead
Unnecessary apply: use pl.col("x") & True instead
Unnecessary apply: use pl.col("x") | False instead
Unnecessary apply: use pl.col("x") == 'three' instead
Unnecessary apply: use pl.col("x") != 3 instead
Unnecessary apply: use pl.col("x") is None instead
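
To make the bytecode step above more concrete, here is a minimal sketch that uses only the standard library dis module. The helper names bytecode_ops and has_binary_or_compare_op are hypothetical and are not part of Polars. Opcode names also vary between Python versions (for example BINARY_ADD in older versions versus BINARY_OP in newer ones), so the check matches on a prefix rather than on an exact name.

```python
import dis


def bytecode_ops(fn):
    """Return (opname, argrepr) pairs for every instruction in fn."""
    return [(inst.opname, inst.argrepr) for inst in dis.get_instructions(fn)]


def has_binary_or_compare_op(fn):
    """Rough check for a binary operation or comparison in fn's bytecode.

    Hypothetical helper, not the logic Polars uses. On some Python versions
    the BINARY_ prefix also matches subscripting (BINARY_SUBSCR).
    """
    return any(
        opname.startswith("BINARY_") or opname == "COMPARE_OP"
        for opname, _ in bytecode_ops(fn)
    )


if __name__ == "__main__":
    # Print the verdict and the raw instructions for a few sample lambdas
    for fn in (lambda x: x + 1, lambda x: x == "three", lambda x: 100):
        print(has_binary_or_compare_op(fn), bytecode_ops(fn))
```

The exact instruction list printed depends on the Python version, which is one reason a real implementation needs more care than this sketch.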

Let's see it with an example:


import polars as pl

df = pl.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'], 
    'offensive_skill': [5, 30, 85], 
    'defensive_skill': [92, 30, 10]
    })

df.with_columns(
    pl.col("defensive_skill").apply(lambda x: x/3)
)
PolarsInefficientApplyWarning: 

    Expr.apply is significantly slower than the native expressions API.
    Only use if you absolutely CANNOT implement your logic otherwise.
    In this case, you can replace your `apply` with an expression:
    -  pl.col("defensive_skill").apply(lambda x: ...)
    +  (pl.col("defensive_skill") / 3)

So the recommended option is:

df.with_columns(
    pl.col("defensive_skill") / 3
)

For now, the lambda can only accept a single argument (e.g., lambda x: but not lambda x, y:), and it should return a single binary operation or comparison (e.g., lambda x: x+1 or lambda x: x==1).

Additionally, the lambda function can only use its own variable (e.g., lambda a: a+1 is acceptable, but not lambda a: b+1).
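
The restrictions on the argument count and on variable usage can be approximated by looking at the function's code object. This is only a minimal sketch under that assumption: is_candidate_lambda is a hypothetical helper, not the check Polars actually runs, and it does not verify that the body is a single binary operation or comparison.

```python
def is_candidate_lambda(fn):
    """Hypothetical pre-check mirroring the restrictions described above."""
    code = fn.__code__
    # Exactly one positional argument and no keyword-only arguments
    single_arg = code.co_argcount == 1 and code.co_kwonlyargcount == 0
    # co_names lists global and attribute names the body refers to;
    # co_freevars lists variables captured from an enclosing scope
    only_own_variable = not code.co_names and not code.co_freevars
    return single_arg and only_own_variable


print(is_candidate_lambda(lambda a: a + 1))     # True
print(is_candidate_lambda(lambda x, y: x + y))  # False
print(is_candidate_lambda(lambda a: b + 1))     # False
```

The last lambda is never called, so the undefined name b does not raise an error; it only shows up in co_names, which is what makes the check reject it.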

You can follow the discussion thread here!


Carlos Vecina
Senior Data Scientist at Jobandtalent