Conversation
e78e897 to
fd5a1ad
Compare
process_report/invoices/invoice.py
Outdated
| self.data[column_name] = value | ||
| self.data[column_name] = self.data[column_name].astype(column_type) |
There was a problem hiding this comment.
I'm a little puzzled about what's happening here: we're setting self.data[column_name], and then immediately setting it to another value. I guess this is to have access to the astype method, but is there way of accomplishing this that is mroe intuitive?
Looking through the docs a bit, I think we could write this as:
self.data[column_name] = pandas.Series(value, dtype=column_type)
My simple test:
>>> BALANCE_FIELD_TYPE = pandas.ArrowDtype(pyarrow.decimal128(21, 2))
>>> df = pandas.DataFrame()
>>> df['foo'] = pandas.Series([1,2,3], dtype=BALANCE_FIELD_TYPE)
>>> df
foo
0 1.00
1 2.00
2 3.00
| ### Invoice column types | ||
| BOOL_FIELD_TYPE = pandas.BooleanDtype() | ||
| STRING_FIELD_TYPE = pandas.StringDtype() | ||
| BALANCE_FIELD_TYPE = pandas.ArrowDtype(pyarrow.decimal128(21, 2)) | ||
| ### |
There was a problem hiding this comment.
Shouldn't we be importing these from process_report.invoices.invoice rather than re-defining them here?
There was a problem hiding this comment.
My idea was that the tests should check if the invoicing code can handle input dataframes with specific datatypes. If I imported these types from process_report.invoices.invoice, then I wouldn't be testing the type compatibility. I imagined this would be useful in a case where if a future change accidentally or intentionally changed the typing of STRING_FIELD_TYPE or BALANCE_FIELD_TYPE, I would like my tests to show me there's typing incompatibility.
Previously, there was no consistent method to create new invoice columns with an explicit type. This has been the source of some confusion when debugging and writing test cases. As such, several changes have been added to reduce confusion over the typing of invoice columns: - A function (`_create_column`) has been added to `invoice.py` to provide a consistent way of creating new columns with an explicit type. - Many fields will now use Pandas' built-in extension types [1], such as `BooleanDtype`, which allows them to be nullable. This also enforces typing, as assigning incompatible values to these columns will raise a type error - Decimal precision has been increased to 21. This is to allow `int64` values to be converted to `pyarrow.decimal128(21, 2)`. This is useful for writing test cases (no longer have to write `[1.0, 2.0]`), and in case our invoices have balance fields that resemble integers - A new test case base class is made with a function to create a dataframe with standardized column types. This allows creating invoice dataframes that more closely resemble what our pipeline will see during execution - Removed redundant code in `_get_test_invoice()` of many test classes - `PIInvoice` will cast the dataframe to `StringDtype` to prevent TypeError with the new Pandas extension types [1] https://pandas.pydata.org/docs/user_guide/basics.html#dtypes
Closes #191. More details in the commit message.
Currently, the test cases are failing because the PI-specific invoice injects an incompatible value in the invoice dataframe. After some thinking, I thought the PI invoice could do some refactoring in that spot as well. I'll put this as a draft for now while I work on refactor the PI invoice.