Skip to content

Comments

Remove pandas#77

Merged
kfrey-idm merged 11 commits intoEMOD-Hub:mainfrom
kfrey-idm:kfrey_pandas
Feb 16, 2026
Merged

Remove pandas#77
kfrey-idm merged 11 commits intoEMOD-Hub:mainfrom
kfrey-idm:kfrey_pandas

Conversation

@kfrey-idm
Copy link
Member

@kfrey-idm kfrey-idm commented Feb 16, 2026

Bumps version to v3.1

Removes pandas, shapely, and scikit-learn as dependencies.

Removes deprecated infer_natural_mortality, generate_mortality_over_time_from_data, and demographics leftovers from dtk_tools.

Retains _create_grid_files even though it's definitely incorrect: see EMOD-Hub/issues-and-discussions#82

@kfrey-idm kfrey-idm self-assigned this Feb 16, 2026
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates emod-api to v3.1.0 and removes pandas, shapely, and scikit-learn from the runtime dependency set by rewriting affected implementation and test code to use the Python stdlib (notably csv) and NumPy.

Changes:

  • Bump package version to 3.1.0 and drop pandas, shapely, and scikit-learn from pyproject.toml dependencies.
  • Replace pandas-based CSV parsing/writing in demographics, gridding, and channel report code paths with stdlib csv.
  • Remove deprecated mortality-generation APIs and remove legacy dtk_tools demographics leftovers and pandas-based channel report CSV utilities/tests.

Reviewed changes

Copilot reviewed 18 out of 20 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
pyproject.toml Version bump to 3.1.0 and removes pandas/shapely/scikit-learn deps.
emod_api/__init__.py Version bump to 3.1.0.
emod_api/demographics/demographics.py Replaces pandas CSV ingestion with csv reader; affects node creation from CSV.
emod_api/demographics/service/service.py Rewrites _create_grid_files to avoid pandas, uses csv + NumPy aggregation.
emod_api/demographics/service/grid_construction.py Removes shapely/pandas usage and returns dict-based grid representation.
emod_api/channelreports/channels.py Removes pandas dependency; rewrites to_csv using csv.writer.
emod_api/channelreports/icj_to_csv.py Deletes pandas-based inset chart JSON→CSV helper.
emod_api/demographics/demographics_base.py Removes deprecated infer_natural_mortality implementation and sklearn imports.
emod_api/demographics/calculators.py Removes deprecated mortality-over-time generator that relied on pandas.
tests/test_demographics.py Replaces pandas-based CSV assertions with stdlib csv; removes deprecated mortality test.
tests/test_migration.py Replaces pandas CSV read with stdlib csv validation.
tests/test_demog_from_pop.py Reworks test to avoid pandas and align with new grid CSV outputs.
tests/test_channel_reports.py Removes pandas-dependent channel dataframe and inset-chart CSV conversion tests.
tests/data/demographics/demog_in.csv Minor data edit (string field content change).
tests/data/demographics/No_Site_grid.csv Adds reference grid output used by updated population-grid tests.
emod_api/.dtk_tools/... Deletes legacy demographics tooling code pulled from dtk-tools.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +48 to 49
print("Creating grid...")

Copy link

Copilot AI Feb 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

construct() uses print() for status output. This is library code and will emit noise for all callers (including tests/CLI usage) and is hard to control. Prefer the project’s logging approach (e.g., logging.getLogger(__name__) at debug/info) or remove these statements.

Copilot uses AI. Check for mistakes.
Comment on lines 148 to 152
@@ -152,50 +152,92 @@ def get_value(row, headers):
return None
Copy link

Copilot AI Feb 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from_csv() defines a nested get_value() helper that is no longer used after the switch away from pandas. This is dead code now and makes the method harder to follow; it should be removed (or re-used if intended).

Copilot uses AI. Check for mistakes.
Comment on lines 428 to 430
csv_obj.writerow(channel_names)
for row_idx in range(self._numTimeSteps):
csv_obj.writerow([self[cname][row_idx] for cname in channel_names])
Copy link

Copilot AI Feb 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the transpose branch, range(self._numTimeSteps) will raise AttributeError because _numTimeSteps is not a ChannelReport attribute (time steps live on self._header.num_time_steps / self.num_time_steps). Also, the previous pandas implementation wrote an explicit timestep index column; this implementation drops it, which is a breaking change for consumers reading the transposed CSV. Use self.num_time_steps and consider writing a leading "timestep" column to preserve the prior format.

Suggested change
csv_obj.writerow(channel_names)
for row_idx in range(self._numTimeSteps):
csv_obj.writerow([self[cname][row_idx] for cname in channel_names])
# Include an explicit timestep column to match the previous pandas-based format
csv_obj.writerow(["timestep"] + list(channel_names))
for row_idx in range(self.num_time_steps):
csv_obj.writerow([row_idx] + [self[cname][row_idx] for cname in channel_names])

Copilot uses AI. Check for mistakes.
Comment on lines 495 to 507
for csv_row in csv_obj:
lat_dict[float(csv_row[ni_idx])] = float(csv_row[lat_idx])
lon_dict[float(csv_row[ni_idx])] = float(csv_row[lon_idx])
pop_dict[float(csv_row[ni_idx])] = float(csv_row[pop_idx])

# checking if we have the same number of nodes and the number of rows in csv file
self.assertEqual(len(csv_df), len(demog.nodes))

for index, row in csv_df.iterrows():
pop = int(row['pop'])
lat = float(row['lat'])
lon = float(row['lon'])
node_id = int(row['node_id'])
self.assertEqual(pop, demog.nodes[index].node_attributes.initial_population)
self.assertEqual(lat, demog.nodes[index].node_attributes.latitude)
self.assertEqual(lon, demog.nodes[index].node_attributes.longitude)
self.assertEqual(node_id, demog.nodes[index].forced_id)
self.assertEqual(len(pop_dict), len(demog.nodes))

for node_id in pop_dict:
self.assertEqual(pop_dict[node_id], demog.get_node_by_id(node_id).node_attributes.initial_population)
self.assertEqual(lat_dict[node_id], demog.get_node_by_id(node_id).node_attributes.latitude)
self.assertEqual(lon_dict[node_id], demog.get_node_by_id(node_id).node_attributes.longitude)
self.assertEqual(node_id, demog.get_node_by_id(node_id).forced_id)
Copy link

Copilot AI Feb 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CSV parsing here stores node_id keys as float(...), but Demographics.get_node_by_id() expects IDs that match the internal int node IDs exactly; using floats will cause lookups to fail and raise UnknownNodeException. Parse node_id as int(...) and keep dict keys as ints (and pass node_id= explicitly for clarity).

Copilot uses AI. Check for mistakes.
for row_val in csv_obj:
lat.append(float(row_val[lat_idx]))
lon.append(float(row_val[lon_idx]))
if (pop_idx):
Copy link

Copilot AI Feb 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pop_idx is a column index (0-based). The condition if (pop_idx): will be false when the population column is the first column (index 0), causing the code to ignore the provided population values and fall back to 5.5. Use an explicit pop_idx is not None check instead (and keep the index value even when it is 0).

Suggested change
if (pop_idx):
if pop_idx is not None:

Copilot uses AI. Check for mistakes.
Comment on lines 83 to 88
@@ -85,59 +87,13 @@ def construct(x_min, y_min, x_max, y_max):
gcids.append(cell_id)
idy += 1
Copy link

Copilot AI Feb 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The grid cell IDs stored in grid_dict['gcid'] are off by one compared to the IDs stored in grid_id_2_cell_id: you set grid_id_2_cell_id[grid_id] = cell_id and then increment cell_id before appending to gcids. This makes the gcid list inconsistent with the mapping used by point_2_grid_cell_id_lookup, and breaks downstream joins/aggregation that assume the IDs match.

Copilot uses AI. Check for mistakes.
node = Node(lat_val, lon_val, pop_val, node_attributes=node_attributes, forced_id=forced_id, meta=dict())
out_nodes.append(node)

print(out_nodes)
Copy link

Copilot AI Feb 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

print(out_nodes) will dump the entire node list to stdout every time Demographics.from_csv() is called, which can be extremely large and will pollute logs/console output. This should be removed or replaced with a bounded debug log message (e.g., count of nodes).

Suggested change
print(out_nodes)
print(f"Created {len(out_nodes)} nodes from {input_file}.")

Copilot uses AI. Check for mistakes.
Comment on lines +545 to 556
with open(input_file) as csv_file:
csv_obj = csv.reader(csv_file, dialect='unix')
headers = next(csv_obj, None)
ni_idx = headers.index('node_id')
br_idx = headers.index('birth_rate')
for csv_row in csv_obj:
br_dict[float(csv_row[ni_idx])] = float(csv_row[br_idx])

# Compare csv birth rates with demographics file
for node_id in br_dict:
birth_rate = br_dict[node_id]
self.assertAlmostEqual(demog.get_node_by_id(node_id=node_id).birth_rate, birth_rate)
Copy link

Copilot AI Feb 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same issue as above: br_dict uses float(csv_row[ni_idx]) for node IDs, but get_node_by_id() requires the exact int ID. Use int(...) for node_id keys to avoid UnknownNodeException during the assertions.

Copilot uses AI. Check for mistakes.
num_row = 0
for csv_row in csv_obj:
for row_val in csv_row:
assert(len(row_val) > 0)
Copy link

Copilot AI Feb 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test uses a bare assert(...) inside a unittest.TestCase method. Prefer self.assertTrue(...) / self.assertGreater(...) so the assertion is never skipped under python -O and failure messages are consistent with the rest of the suite.

Suggested change
assert(len(row_val) > 0)
self.assertGreater(len(row_val), 0)

Copilot uses AI. Check for mistakes.
@kfrey-idm kfrey-idm merged commit 877e51a into EMOD-Hub:main Feb 16, 2026
12 checks passed
@kfrey-idm kfrey-idm deleted the kfrey_pandas branch February 16, 2026 18:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant