Conversation
Pull request overview
This PR updates emod-api to v3.1.0 and removes pandas, shapely, and scikit-learn from the runtime dependency set by rewriting affected implementation and test code to use the Python stdlib (notably `csv`) and NumPy.
Changes:
- Bump package version to `3.1.0` and drop `pandas`, `shapely`, and `scikit-learn` from `pyproject.toml` dependencies.
- Replace pandas-based CSV parsing/writing in demographics, gridding, and channel report code paths with stdlib `csv`.
- Remove deprecated mortality-generation APIs and remove legacy `dtk_tools` demographics leftovers and pandas-based channel report CSV utilities/tests.
Reviewed changes
Copilot reviewed 18 out of 20 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| `pyproject.toml` | Version bump to 3.1.0 and removes pandas/shapely/scikit-learn deps. |
| `emod_api/__init__.py` | Version bump to 3.1.0. |
| `emod_api/demographics/demographics.py` | Replaces pandas CSV ingestion with `csv` reader; affects node creation from CSV. |
| `emod_api/demographics/service/service.py` | Rewrites `_create_grid_files` to avoid pandas, uses `csv` + NumPy aggregation. |
| `emod_api/demographics/service/grid_construction.py` | Removes shapely/pandas usage and returns dict-based grid representation. |
| `emod_api/channelreports/channels.py` | Removes pandas dependency; rewrites `to_csv` using `csv.writer`. |
| `emod_api/channelreports/icj_to_csv.py` | Deletes pandas-based inset chart JSON→CSV helper. |
| `emod_api/demographics/demographics_base.py` | Removes deprecated `infer_natural_mortality` implementation and sklearn imports. |
| `emod_api/demographics/calculators.py` | Removes deprecated mortality-over-time generator that relied on pandas. |
| `tests/test_demographics.py` | Replaces pandas-based CSV assertions with stdlib `csv`; removes deprecated mortality test. |
| `tests/test_migration.py` | Replaces pandas CSV read with stdlib `csv` validation. |
| `tests/test_demog_from_pop.py` | Reworks test to avoid pandas and align with new grid CSV outputs. |
| `tests/test_channel_reports.py` | Removes pandas-dependent channel dataframe and inset-chart CSV conversion tests. |
| `tests/data/demographics/demog_in.csv` | Minor data edit (string field content change). |
| `tests/data/demographics/No_Site_grid.csv` | Adds reference grid output used by updated population-grid tests. |
| `emod_api/.dtk_tools/...` | Deletes legacy demographics tooling code pulled from dtk-tools. |
```python
print("Creating grid...")
```
construct() uses print() for status output. This is library code and will emit noise for all callers (including tests/CLI usage) and is hard to control. Prefer the project’s logging approach (e.g., logging.getLogger(__name__) at debug/info) or remove these statements.
```diff
@@ -152,50 +152,92 @@ def get_value(row, headers):
     return None
```
from_csv() defines a nested get_value() helper that is no longer used after the switch away from pandas. This is dead code now and makes the method harder to follow; it should be removed (or re-used if intended).
```python
csv_obj.writerow(channel_names)
for row_idx in range(self._numTimeSteps):
    csv_obj.writerow([self[cname][row_idx] for cname in channel_names])
```
In the transpose branch, range(self._numTimeSteps) will raise AttributeError because _numTimeSteps is not a ChannelReport attribute (time steps live on self._header.num_time_steps / self.num_time_steps). Also, the previous pandas implementation wrote an explicit timestep index column; this implementation drops it, which is a breaking change for consumers reading the transposed CSV. Use self.num_time_steps and consider writing a leading "timestep" column to preserve the prior format.
Suggested change:

```diff
-csv_obj.writerow(channel_names)
-for row_idx in range(self._numTimeSteps):
-    csv_obj.writerow([self[cname][row_idx] for cname in channel_names])
+# Include an explicit timestep column to match the previous pandas-based format
+csv_obj.writerow(["timestep"] + list(channel_names))
+for row_idx in range(self.num_time_steps):
+    csv_obj.writerow([row_idx] + [self[cname][row_idx] for cname in channel_names])
```
```diff
+for csv_row in csv_obj:
+    lat_dict[float(csv_row[ni_idx])] = float(csv_row[lat_idx])
+    lon_dict[float(csv_row[ni_idx])] = float(csv_row[lon_idx])
+    pop_dict[float(csv_row[ni_idx])] = float(csv_row[pop_idx])

 # checking if we have the same number of nodes and the number of rows in csv file
-self.assertEqual(len(csv_df), len(demog.nodes))
+self.assertEqual(len(pop_dict), len(demog.nodes))

-for index, row in csv_df.iterrows():
-    pop = int(row['pop'])
-    lat = float(row['lat'])
-    lon = float(row['lon'])
-    node_id = int(row['node_id'])
-    self.assertEqual(pop, demog.nodes[index].node_attributes.initial_population)
-    self.assertEqual(lat, demog.nodes[index].node_attributes.latitude)
-    self.assertEqual(lon, demog.nodes[index].node_attributes.longitude)
-    self.assertEqual(node_id, demog.nodes[index].forced_id)
+for node_id in pop_dict:
+    self.assertEqual(pop_dict[node_id], demog.get_node_by_id(node_id).node_attributes.initial_population)
+    self.assertEqual(lat_dict[node_id], demog.get_node_by_id(node_id).node_attributes.latitude)
+    self.assertEqual(lon_dict[node_id], demog.get_node_by_id(node_id).node_attributes.longitude)
+    self.assertEqual(node_id, demog.get_node_by_id(node_id).forced_id)
```
CSV parsing here stores node_id keys as float(...), but Demographics.get_node_by_id() expects IDs that match the internal int node IDs exactly; using floats will cause lookups to fail and raise UnknownNodeException. Parse node_id as int(...) and keep dict keys as ints (and pass node_id= explicitly for clarity).
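A sketch of the suggested fix, with a small in-memory CSV standing in for the real grid file (the column layout is hypothetical): parsing `node_id` as `int` keeps the dict keys matching the internal integer node IDs.

```python
import csv
import io

# Hypothetical CSV content mimicking the grid output layout.
csv_text = "node_id,lat,lon,pop\n1,12.5,41.9,100\n2,12.6,42.0,250\n"

pop_dict, lat_dict, lon_dict = {}, {}, {}
csv_obj = csv.reader(io.StringIO(csv_text))
headers = next(csv_obj)
ni_idx = headers.index('node_id')
lat_idx = headers.index('lat')
lon_idx = headers.index('lon')
pop_idx = headers.index('pop')
for csv_row in csv_obj:
    # Parse node_id as int so keys compare equal to the internal node IDs.
    node_id = int(csv_row[ni_idx])
    lat_dict[node_id] = float(csv_row[lat_idx])
    lon_dict[node_id] = float(csv_row[lon_idx])
    pop_dict[node_id] = float(csv_row[pop_idx])
```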
```python
for row_val in csv_obj:
    lat.append(float(row_val[lat_idx]))
    lon.append(float(row_val[lon_idx]))
    if (pop_idx):
```
pop_idx is a column index (0-based). The condition if (pop_idx): will be false when the population column is the first column (index 0), causing the code to ignore the provided population values and fall back to 5.5. Use an explicit pop_idx is not None check instead (and keep the index value even when it is 0).
Suggested change:

```diff
-    if (pop_idx):
+    if pop_idx is not None:
```
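A tiny illustration of the pitfall (header layout is hypothetical): a column index of 0 is falsy, so a truthiness test and an explicit `None` check disagree exactly when the column comes first.

```python
headers = ["pop", "lat", "lon"]      # population happens to be the first column
pop_idx = headers.index("pop")       # index 0

uses_pop_truthy = bool(pop_idx)      # False: 0 is falsy, column wrongly skipped
uses_pop_none = pop_idx is not None  # True: 0 is still a valid column index
```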
```diff
@@ -85,59 +87,13 @@ def construct(x_min, y_min, x_max, y_max):
         gcids.append(cell_id)
         idy += 1
```
The grid cell IDs stored in grid_dict['gcid'] are off by one compared to the IDs stored in grid_id_2_cell_id: you set grid_id_2_cell_id[grid_id] = cell_id and then increment cell_id before appending to gcids. This makes the gcid list inconsistent with the mapping used by point_2_grid_cell_id_lookup, and breaks downstream joins/aggregation that assume the IDs match.
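A minimal sketch of the consistent ordering the comment calls for: record the current `cell_id` in both containers before incrementing (the grid IDs here are hypothetical; the real loop iterates over grid coordinates).

```python
grid_id_2_cell_id = {}
gcids = []
cell_id = 0
for grid_id in ["a1", "a2", "b1"]:       # hypothetical grid cell keys
    grid_id_2_cell_id[grid_id] = cell_id
    gcids.append(cell_id)                # append before incrementing...
    cell_id += 1                         # ...so mapping and list agree
```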
```python
node = Node(lat_val, lon_val, pop_val, node_attributes=node_attributes, forced_id=forced_id, meta=dict())
out_nodes.append(node)

print(out_nodes)
```
print(out_nodes) will dump the entire node list to stdout every time Demographics.from_csv() is called, which can be extremely large and will pollute logs/console output. This should be removed or replaced with a bounded debug log message (e.g., count of nodes).
Suggested change:

```diff
-print(out_nodes)
+print(f"Created {len(out_nodes)} nodes from {input_file}.")
```
```python
with open(input_file) as csv_file:
    csv_obj = csv.reader(csv_file, dialect='unix')
    headers = next(csv_obj, None)
    ni_idx = headers.index('node_id')
    br_idx = headers.index('birth_rate')
    for csv_row in csv_obj:
        br_dict[float(csv_row[ni_idx])] = float(csv_row[br_idx])

# Compare csv birth rates with demographics file
for node_id in br_dict:
    birth_rate = br_dict[node_id]
    self.assertAlmostEqual(demog.get_node_by_id(node_id=node_id).birth_rate, birth_rate)
```
Same issue as above: br_dict uses float(csv_row[ni_idx]) for node IDs, but get_node_by_id() requires the exact int ID. Use int(...) for node_id keys to avoid UnknownNodeException during the assertions.
In `tests/test_migration.py` (comment marked outdated):
```python
num_row = 0
for csv_row in csv_obj:
    for row_val in csv_row:
        assert(len(row_val) > 0)
```
This test uses a bare assert(...) inside a unittest.TestCase method. Prefer self.assertTrue(...) / self.assertGreater(...) so the assertion is never skipped under python -O and failure messages are consistent with the rest of the suite.
Suggested change:

```diff
-        assert(len(row_val) > 0)
+        self.assertGreater(len(row_val), 0)
```
Bumps version to v3.1.0.
Removes pandas, shapely, and scikit-learn as dependencies.
Removes deprecated `infer_natural_mortality`, `generate_mortality_over_time_from_data`, and demographics leftovers from dtk_tools.
Retains `_create_grid_files` even though it's definitely incorrect: see EMOD-Hub/issues-and-discussions#82