ENH: Custom parser and engine in pd.eval, pd.query and pandas.core.computation.expr.Expr #45444

@erezinman

Is your feature request related to a problem?

I don't understand the context. If by "problem" you mean "a bug", then no. If you mean "a problem that I currently have", then see the Additional context section.

Describe the solution you'd like

I would like to be able to pass a factory function or a matching instance as the engine and parser arguments of pd.eval and pd.DataFrame.query. For example, I would like to be able to run:

df.query(..., parser=MyParser)

where MyParser is a custom parser type (or a factory function) that accepts the same parameters as any class derived from BaseExprVisitor (i.e. env, engine, parser, preparser).

API breaking implications

  1. Expr.parser and Expr.engine should be instantiated outside the Expr class's initialization. Note that this fits better with the current (newer) multi-expression implementation of the function.
  2. Maybe consider moving pandas.core.computation.expr._parsers and pandas.core.computation.engines._engine to pandas.core.computation.eval or similar, and instantiating them in pd.core.computation.eval.eval.
  3. Also consider "breaking" pd.core.computation.eval.eval into the regular pd.core.computation.eval.eval and a new pd.core.computation.eval.eval_single_expression that evaluates a single Expr class instance (basically line 353 and below), to allow more customizable evaluation behavior.

From what I can see, these changes shouldn't be a big deal at all, but I'm no expert.

Describe alternatives you've considered

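As an alternative, what I currently do is:

from pandas.core.computation.expr import PARSERS, PandasExprVisitor

# Custom parser; inherits all behavior from the default 'pandas' parser
class MyParser(PandasExprVisitor):
    pass

# Register it in the private parser registry under a new name
PARSERS['my_parser'] = MyParser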

which is, of course, hacky and undocumented.
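A minimal sketch of how the workaround above ends up being used, assuming PARSERS is still importable from pandas.core.computation.expr in the installed pandas version (it is a private, undocumented registry, so this may break between releases). The `visited` counter and the sample DataFrame are illustrative only; they are there to show that the custom class is the one actually doing the parsing.

```python
import pandas as pd
from pandas.core.computation.expr import PARSERS, PandasExprVisitor


class MyParser(PandasExprVisitor):
    """Same behavior as the default 'pandas' parser, plus a visit counter."""

    visited = 0  # illustrative: counts how many times visit() is called

    def visit(self, node, **kwargs):
        type(self).visited += 1
        return super().visit(node, **kwargs)


# Private registry: relying on it is an assumption, not a supported API.
PARSERS["my_parser"] = MyParser

df = pd.DataFrame({"a": [1, 2, 3], "b": [2, 2, 5]})

# The parser is selected by its registered name, not by passing the class.
result = df.query("(a > 1) & (b == 2)", parser="my_parser")
print(result)
print("nodes visited by MyParser:", MyParser.visited)
```

The point of the request is to replace the string lookup with `parser=MyParser`, so that no mutation of a module-level dictionary is needed.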

Additional context

I have a dataclass that contains pd.Series objects, and I would like to implement a query method for it like the one the DataFrame has. The idea I came up with is as follows:

Suppose I get the call:

my_cls_instance.query('((a * 2) == 1) & (b == 2)')

where a and b are series contained in my class.
I figure that in order to evaluate this using pandas (with as few interruptions as possible), I need to:

  1. separate the "unary" expressions (here they are ((a * 2) == 1) and (b == 2)) from the "binary" expressions (here - the &),
  2. evaluate each "unary" expression individually using pd.eval or similar,
  3. perform one of the following (I'm unsure of the best course of action):
    3.1. replace the evaluated results in the strings (e.g. '__processed_1__ & __processed_2__') and rerun pd.eval again; or
    3.2. change the contents of the Expr and evaluate it.

So, looking at the code, I found that what I want is possible only if I can use my own parser and/or engine.
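For illustration, steps 1, 2 and 3.1 can be prototyped outside pandas with Python's standard ast module. This is only a sketch under stated assumptions, not a replacement for a custom parser: the function names `split_on_and` and `evaluate_conjunction` are hypothetical, the evaluator is passed in as a callable so that pd.eval or anything else can be plugged in, and ast.unparse requires Python 3.9 or newer.

```python
import ast
from typing import Any, Callable, Dict, List


def split_on_and(expr: str) -> List[str]:
    """Step 1: return the operands of the top-level `&` chain, left to right."""
    tree = ast.parse(expr, mode="eval")

    def collect(node: ast.expr) -> List[ast.expr]:
        # `x & y & z` parses as ((x & y) & z), so recurse into both sides.
        if isinstance(node, ast.BinOp) and isinstance(node.op, ast.BitAnd):
            return collect(node.left) + collect(node.right)
        return [node]

    # ast.unparse drops redundant parentheses but keeps the meaning.
    return [ast.unparse(node) for node in collect(tree.body)]


def evaluate_conjunction(
    expr: str,
    namespace: Dict[str, Any],
    evaluate: Callable[[str, Dict[str, Any]], Any],
) -> Any:
    """Steps 2 and 3.1: evaluate each operand, then the rewritten expression."""
    parts = split_on_and(expr)

    # Step 2: evaluate every sub-expression on its own.
    processed = {
        f"__processed_{i}__": evaluate(part, namespace)
        for i, part in enumerate(parts, start=1)
    }

    # Step 3.1: substitute placeholder names and evaluate once more.
    combined = " & ".join(processed)
    return evaluate(combined, processed)


if __name__ == "__main__":
    print(split_on_and("((a * 2) == 1) & (b == 2)"))
```

With pandas, the evaluator could be something like `lambda e, ns: pd.eval(e, local_dict=ns)`, where `ns` maps the names a and b to the series held by the dataclass. Only the top-level `&` is split here; `|`, `~` and nested groupings would need additional handling.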
