
[Bug]: numpy 2.0 breaks OpenData store (probable pandas bug) #995

@rkingsbury

Description


Since updating dependencies to allow numpy 2.0 (#986), we have a test failure in the OpenData store that is triggered inside pandas. See, for example, this failed test run.

The exception raised is:

   def resolve(self, key: str, is_local: bool):
        """
        Resolve a variable name in a possibly local context.
    
        Parameters
        ----------
        key : str
            A variable name
        is_local : bool
            Flag indicating whether the variable is local or not (prefixed with
            the '@' symbol)
    
        Returns
        -------
        value : object
            The value of a particular variable
        """
        try:
            # only look for locals in outer scope
            if is_local:
                return self.scope[key]
    
            # not a local variable so check in resolvers if we have them
            if self.has_resolvers:
                return self.resolvers[key]
    
            # if we're here that means that we have no locals and we also have
            # no resolvers
            assert not is_local and not self.has_resolvers
            return self.scope[key]
        except KeyError:
            try:
                # last ditch effort we look in temporaries
                # these are created when parsing indexing expressions
                # e.g., df[df > 0]
                return self.temps[key]
            except KeyError as err:
>               raise UndefinedVariableError(key, is_local) from err
E               pandas.errors.UndefinedVariableError: name 'np' is not defined

/opt/hostedtoolcache/Python/3.9.19/x64/lib/python3.9/site-packages/pandas/core/computation/scope.py:244: UndefinedVariableError

The test triggering the failure is test_update:

def test_update(s3store):
    assert len(s3store.index_data) == 2
    s3store.update(
        pd.DataFrame(
            [
                {
                    "task_id": "mp-199999",
                    "data": "asd",
                    "group": {"level_two": 4},
                    s3store.last_updated_field: datetime.utcnow(),
                }
            ]
        )
    )

I did some debugging on the resolve function in pandas (see source code here) and determined that the number in {"level_two": 4} is being converted to an np.int64, and that the key and is_local arguments passed to resolve are np and False, respectively.

Somewhere, pandas is getting confused and treating np as a variable name. I'm not sure how or why this happens, but I suspect it is a pandas bug. The following may be relevant:

pandas-dev/pandas#54252

https://numpy.org/devdocs/numpy_2_0_migration_guide.html#windows-default-integer
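
One possible mechanism for the np lookup, offered as an assumption rather than a confirmed diagnosis: numpy 2.0 changed the repr of scalars, so repr(np.int64(4)) is now 'np.int64(4)' instead of '4'. If any code path formats such a value into a query string, an expression evaluator would be asked to resolve a name called np. The sketch below uses only the standard library to show which names such a string contains. FakeInt64, build_query, and names_in_expression are hypothetical helpers written for illustration; they are not part of pandas, numpy, or this project, and the sketch does not establish where (or whether) such a string is built.

```python
import ast


class FakeInt64(int):
    """Hypothetical stand-in that mimics the numpy 2.0 scalar repr."""

    def __repr__(self) -> str:
        return f"np.int64({int(self)})"


def build_query(field: str, value: object) -> str:
    """Hypothetical helper: format a value into a query string via repr."""
    return f"{field} == {value!r}"


def names_in_expression(expr: str) -> set:
    """Return every bare variable name an evaluator would have to resolve."""
    tree = ast.parse(expr, mode="eval")
    return {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}


# A plain Python int formats as a literal, so only the field name appears.
plain_query = build_query("level_two", 4)
plain_names = names_in_expression(plain_query)

# A scalar with the numpy 2.0 style repr introduces the extra name "np".
numpy_style_query = build_query("level_two", FakeInt64(4))
numpy_style_names = names_in_expression(numpy_style_query)

print(plain_query, sorted(plain_names))
print(numpy_style_query, sorted(numpy_style_names))
```

If this is indeed the cause, converting the value to a built-in type before formatting (for example with int(value), or the .item() method that numpy scalars provide) would keep np out of the expression.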

Version

latest

Which OS?

  • MacOS
  • Windows
  • Linux
