Regex Enhancements #3

@wrmthorne

On a side note: I'll probably also add fancy_regex at some point to support e.g. look-around. It should perform similarly for the current regular expressions, but I want to test that first.
Originally posted by @uniQIndividual in #2 (comment)

I thought I would open an issue, as I have some thoughts to contribute if you're open to discussion.

One quick feature proposal I have been debating implementing is another config option that toggles the regex filter between include and exclude mode (a single pattern). Alternatively, the pattern could be split into separate include and exclude patterns (set one or both), which would be flexible and intuitive. I'm unsure of the performance implications compared with a single, more complex regex.
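To make the second variant concrete, here is a minimal Python sketch of the semantics I have in mind. It is not how zstd-jsonl-filter is implemented; the function name `filter_lines` and the `include`/`exclude` parameters are hypothetical, and the sample lines are made up.

```python
import re
from typing import Iterable, Iterator, Optional


def filter_lines(
    lines: Iterable[str],
    include: Optional[str] = None,  # hypothetical option: keep only lines matching this
    exclude: Optional[str] = None,  # hypothetical option: drop lines matching this
) -> Iterator[str]:
    """Yield lines that match `include` (if set) and do not match `exclude` (if set)."""
    inc = re.compile(include) if include else None
    exc = re.compile(exclude) if exclude else None
    for line in lines:
        if inc is not None and not inc.search(line):
            continue  # include pattern is set and the line does not match it
        if exc is not None and exc.search(line):
            continue  # exclude pattern is set and the line matches it
        yield line


# Made-up sample lines for illustration only.
sample = [
    '{"subreddit":"rust","title":"a"}',
    '{"subreddit":"python","title":"b"}',
    '{"subreddit":"rust","title":"spam"}',
]
kept = list(filter_lines(sample, include=r'"subreddit":"rust"', exclude=r"spam"))
```

Setting only `exclude` gives the "toggle" behaviour from the first variant, so the two-pattern form covers both proposals.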

To provide my own example of a need for look-arounds:
As it stands, I can't find a reliable way to filter specific subreddits out of the Pushshift archive with just zstd-jsonl-filter. Crosspost information is stored by nesting the whole parent submission object inside a list under crosspost_parent_list. Without look-arounds, I can't reliably tell whether I am fetching e.g. a submission from r/rust or a crosspost whose source was in r/rust. Thankfully, this only leads to false positives, which are easily excluded in polars. There are, however, lots of them.
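A small Python sketch of the problem, using two made-up records that only mimic the shape described above (they are not real archive data). The regex is an assumption about what a typical filter pattern looks like, and `top_level_match` stands in for the kind of post-filtering I currently do afterwards.

```python
import json
import re

# Assumed filter pattern: matches the key/value pair anywhere in the line.
PATTERN = re.compile(r'"subreddit":\s*"rust"')

# Made-up records: one native r/rust post, one crosspost of it elsewhere.
direct = json.dumps({"id": "a1", "subreddit": "rust", "title": "native post"})
crosspost = json.dumps({
    "id": "b2",
    "subreddit": "programming",
    "title": "crosspost",
    "crosspost_parent_list": [{"id": "a1", "subreddit": "rust"}],
})


def regex_match(line: str) -> bool:
    """Line-level regex check, blind to JSON nesting."""
    return PATTERN.search(line) is not None


def top_level_match(line: str, subreddit: str = "rust") -> bool:
    """Parse the line and compare only the top-level subreddit field."""
    return json.loads(line).get("subreddit") == subreddit


for name, line in [("direct", direct), ("crosspost", crosspost)]:
    print(name, regex_match(line), top_level_match(line))
```

The regex accepts both records because the nested parent object carries its own subreddit field, while the parsed check accepts only the native post. Full parsing is much slower than a regex pass, which is why I would rather express this in the pattern itself.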

I also have some other, unrelated feature ideas and improvements. I can open a new issue for those if you'd prefer.
