Merged
10 changes: 6 additions & 4 deletions pipelines/utils.py
@@ -93,12 +93,14 @@ def prepare_lookup(data: str | list[str] | pd.Series | pd.DataFrame) -> pd.Serie
     elif isinstance(data, (list, pd.Series)):
         result = pd.DataFrame(data)
     else:
-        result = data
+        # Handle unexpected types (like int, float)
Contributor

I am not sure that this will work. The current else covers the case where data is a pd.DataFrame, and we want to end up with a regular DataFrame to work on. What does pd.DataFrame(pd.DataFrame(data)) actually do?

Also, if the int label is passed as part of a list or series then it will still be part of the resulting dataframe.

Can't we solve this by just reversing the order of the if clauses and leaving everything else untouched:

    if isinstance(data, pd.DataFrame):
        result = data
    elif isinstance(data, (list, pd.Series)):
        result = pd.DataFrame(data)
    else:
        # Handle other types (like str, int, float)
        result = pd.DataFrame([data])

Then the result.map(str) will apply to a dataframe containing a column that contains an int, which will convert it to a str.
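
To make the two questions above concrete, here is a minimal sketch of how the reordered branches would behave. `to_frame` and `cells_to_str` are hypothetical helper names used only for illustration; they are not part of pipelines/utils.py. The sketch assumes `DataFrame.map` is available (pandas 2.1 or later) and falls back to `applymap` otherwise.

```python
import pandas as pd


def to_frame(data):
    """Hypothetical helper mirroring the reordered branches suggested above."""
    if isinstance(data, pd.DataFrame):
        return data
    elif isinstance(data, (list, pd.Series)):
        return pd.DataFrame(data)
    else:
        # Other scalar types (str, int, float) are wrapped in a one-cell frame.
        return pd.DataFrame([data])


def cells_to_str(frame):
    """Apply str to every cell of the frame."""
    # DataFrame.map was added in pandas 2.1; older releases call it applymap.
    elementwise = frame.map if hasattr(frame, "map") else frame.applymap
    return elementwise(str)


df = pd.DataFrame({"label": ["A", "B"]})

# Wrapping an existing DataFrame in pd.DataFrame gives back an equal DataFrame.
rewrapped = pd.DataFrame(df)

# An int inside a list survives construction and is converted by the str mapping.
mixed = cells_to_str(to_frame(["a", 7]))

# A bare int becomes a one-cell frame whose single value is then a str.
scalar = cells_to_str(to_frame(7))
```

With this ordering the else branch only ever sees non-container values, so no explicit `str(data)` call is needed before building the frame.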

Contributor

And similarly reverse the if statements for extracting the result from the dataframe to return.
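
A minimal sketch of what the reversed extraction could look like. `extract_result` is a hypothetical helper name, not code from the PR. Note that in the diff as shown, the condition `not isinstance(data, (list, pd.Series))` also appears to be true for a DataFrame input, which is the case this ordering avoids.

```python
import pandas as pd


def extract_result(data, result):
    """Hypothetical helper: return `result` in the shape that matches `data`."""
    if isinstance(data, pd.DataFrame):
        # DataFrame in, DataFrame out.
        return result
    elif isinstance(data, (list, pd.Series)):
        # list or Series in, first column out as a Series.
        return result.iloc[:, 0]
    else:
        # Scalar (str, int, float) in, single cell out.
        return result.iloc[0, 0]


frame = pd.DataFrame({"a": ["x", "y"], "b": ["z", "w"]})
from_frame = extract_result(frame, frame)
from_list = extract_result(["x", "y"], pd.DataFrame(["x", "y"]))
from_scalar = extract_result("x", pd.DataFrame(["x"]))
```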

Contributor

@girum-air you might also want to add a pipelines_tests.test_utils.PrepareLookupTestCase that tests each of these input types (str, int, list, Series, DataFrame).
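
A rough sketch of such a test case. The `prepare_lookup` defined here is a stand-in that follows the reordered branches suggested in this review, not the merged function; a real test would import it from pipelines.utils instead. The `normalize` helper is an assumption: a condensed form intended to match the map/replace chain in the diff. `run_tests` is likewise a hypothetical convenience wrapper.

```python
import unittest

import pandas as pd


def normalize(value):
    # Assumed equivalent of: str, strip, lower, collapse whitespace runs.
    return " ".join(str(value).split()).lower()


def prepare_lookup(data):
    """Stand-in for pipelines.utils.prepare_lookup (reordered branches)."""
    if isinstance(data, pd.DataFrame):
        result = data
    elif isinstance(data, (list, pd.Series)):
        result = pd.DataFrame(data)
    else:
        result = pd.DataFrame([data])

    # DataFrame.map needs pandas 2.1+; fall back to applymap on older versions.
    elementwise = result.map if hasattr(result, "map") else result.applymap
    result = elementwise(normalize)

    if isinstance(data, pd.DataFrame):
        return result
    elif isinstance(data, (list, pd.Series)):
        return result.iloc[:, 0]
    return result.iloc[0, 0]


class PrepareLookupTestCase(unittest.TestCase):
    def test_str(self):
        self.assertEqual(prepare_lookup("  Foo   Bar "), "foo bar")

    def test_int(self):
        self.assertEqual(prepare_lookup(42), "42")

    def test_list(self):
        self.assertEqual(prepare_lookup([" A", 7]).tolist(), ["a", "7"])

    def test_series(self):
        self.assertEqual(prepare_lookup(pd.Series(["X  Y"])).tolist(), ["x y"])

    def test_dataframe(self):
        out = prepare_lookup(pd.DataFrame({"col": [" Q "]}))
        self.assertIsInstance(out, pd.DataFrame)
        self.assertEqual(out.iloc[0, 0], "q")


def run_tests():
    """Run the test case and return the unittest result object."""
    loader = unittest.defaultTestLoader
    suite = loader.loadTestsFromTestCase(PrepareLookupTestCase)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```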

+        result = pd.DataFrame([str(data)])

     result = result.map(str).map(str.strip).map(str.lower).replace(r"\s+", " ", regex=True)
-    if isinstance(data, str):
-        result = result.iloc[0, 0]
+    if isinstance(data, str) or (not isinstance(data, (list, pd.Series))):
+        return result.iloc[0, 0]
     elif isinstance(data, (list, pd.Series)):
-        result = result.iloc[:, 0]
+        return result.iloc[:, 0]
     return result

