MBW pipeline migration: v2 job handler, CCHQ CLI OAuth, parity tests #24
Conversation
Add cchq_fetcher.py that normalizes CCHQ form dicts to Connect visit dict shape so FieldComputation path extraction works identically. Integrate the CCHQ data source into all 3 fetch paths in pipeline.py (cache miss with filters, force refresh with filters, and normal unfiltered path). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
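The normalization described here can be sketched as follows. This is a hypothetical illustration, not the actual cchq_fetcher.py code: the function name `normalize_cchq_form` and the exact key layout are assumptions about what "Connect visit dict shape" means.

```python
def normalize_cchq_form(form: dict) -> dict:
    """Reshape a CCHQ form-API dict into a Connect-style visit dict
    (sketch only; the real field mapping lives in cchq_fetcher.py)."""
    props = form.get("form", {})
    meta = props.get("meta", {})
    return {
        "id": form.get("id"),
        # CCHQ meta timestamp stands in for the Connect visit_date
        "visit_date": meta.get("timeEnd"),
        # lowercase so mixed-case usernames map to a single FLW
        "username": (meta.get("username") or "").lower(),
        # keep the raw form body so FieldComputation path extraction
        # (e.g. "form.baby_dob") resolves identically to Connect visits
        "form_json": props,
    }
```

The key idea is that downstream path extraction never needs to know which source produced the dict.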
Create job handler that orchestrates MBW dashboard computations using existing computation modules. The handler receives pipeline data (visits, registrations, gs_forms) and runs GPS analysis, follow-up rate computation, quality metrics, FLW performance, and overview summary. Includes a PipelineRowAdapter to bridge serialized pipeline row dicts to the VisitRow-like interface expected by followup_analysis functions. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
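The adapter idea can be illustrated with a minimal sketch (assumed shape; the real PipelineRowAdapter in the job handler may expose more than plain dict keys):

```python
class PipelineRowAdapter:
    """Wrap a serialized pipeline row dict so attribute access
    (row.visit_date) behaves like the VisitRow objects that the
    followup_analysis functions expect."""

    def __init__(self, row: dict):
        self._row = row

    def __getattr__(self, name):
        # Only invoked when normal attribute lookup fails, so _row
        # itself is found via __dict__ and never recurses here.
        try:
            return self._row[name]
        except KeyError:
            raise AttributeError(name) from None
```

Because `__getattr__` is a fallback hook, dict keys become attributes without shadowing any real methods the class defines.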
Replace the SSE streaming data loading layer with a pipeline + job handler approach. The v2 render code:
- Reads from the `pipelines` prop (visits, registrations, gs_forms), which are auto-loaded by the workflow runner infrastructure
- Shows pipeline loading status with row counts for each data source
- Presents a "Run Analysis" button when all 3 pipelines are loaded
- Triggers actions.startJob() to run the mbw_monitoring job handler
- Streams job progress via actions.streamJobProgress()
- Assembles dashData from job results in the shape the tabs expect

All 4 tab renderings (Overview, GPS, Follow-Up, FLW Performance) and all worker result management code remain identical to v1.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add 21 parity tests comparing v1 (MBWMonitoringStreamView) and v2 (job handler) data transformation paths. Tests cover:
- GPS dict-building and full GPS analysis
- Per-mother field extraction (parity, ANC/PNC dates, baby DOB)
- EBF % computation
- PipelineRowAdapter attribute access fidelity
- Follow-up analysis (build_followup, aggregate, status distribution)
- End-to-end job handler output comparison

The tests caught a real bug: _build_gps_visit_dicts was not lowercasing usernames, causing mixed-case usernames to be treated as separate FLWs (v1 normalizes to lowercase at line 458 of views.py).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
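The lowercasing bug the parity tests caught can be reproduced with a toy version of the dict builder (hypothetical `build_gps_visit_dicts`, not the real `_build_gps_visit_dicts`):

```python
def build_gps_visit_dicts(rows):
    # The v2 fix: lowercase usernames so "Asha" and "asha" count as one
    # FLW, matching v1's normalization in views.py.
    return [
        {"username": r["username"].lower(), "lat": r["lat"], "lon": r["lon"]}
        for r in rows
    ]

rows = [
    {"username": "Asha", "lat": 1.0, "lon": 2.0},
    {"username": "asha", "lat": 1.1, "lon": 2.1},
]
flws = {d["username"] for d in build_gps_visit_dicts(rows)}
assert flws == {"asha"}  # one FLW, not two
```

Without the `.lower()`, the set would contain both spellings and any per-FLW aggregation would split one worker's visits in two.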
- Add CCHQ OAuth to CLI: `create_cli_request()` now auto-loads the CommCare HQ token from `~/.commcare-connect/commcare_token.json` into the mock session so `CommCareDataAccess` works transparently in CLI scripts.
- Fix CommCare HQ OAuth endpoints: `get_oauth_token()` now accepts `authorize_path` and `token_path` params (CCHQ uses `/oauth/` not `/o/`). `get_commcare_token` command passes the correct paths.
- Fix GPS parity: add `visit_date` to `_build_gps_visit_dicts()` output in the v2 job handler; without it, `analyze_gps_metrics` couldn't compute trailing_7_days or avg_daily_travel_km.
- Fix pipeline import: use lazy `get_backend()` instead of direct `SQLBackend()` to avoid AppRegistryNotReady errors.
- Add `COMMCARE_OAUTH_CLI_CLIENT_ID` to local settings.
- Extract v1 transforms to shared `data_transforms.py` module; update parity tests to use them instead of duplicated inline code.
- Add `test_mbw_parity` management command for live end-to-end payload comparison with `--gs-app-id` flag.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
📝 Walkthrough

This PR removes the Python/Redis labs analysis backend and Protocol, consolidates analysis on an SQL/PostgreSQL backend with SQL-backed cache tables, adds CCHQ Forms as a data source, refactors visit-level computations, and introduces pipeline/job-driven MBW Monitoring V2 with tests and parity tooling.
Sequence Diagram(s) (mermaid excerpt): `UI->>Runner: start MBW job (pipeline config)`

Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 5
🧹 Nitpick comments (9)
commcare_connect/custom_analysis/chc_nutrition/management/commands/test_chc_nutrition.py (2)
245-255: The `use_cache` parameter is now unused.

The `run_full_analysis` method accepts `use_cache=False` (line 245), but the new `AnalysisPipeline.stream_analysis_ignore_events()` call (line 255) doesn't use it. The `--use-cache` CLI argument (lines 61-64) has no effect. Consider removing the dead parameter or implementing cache control in the SQL backend.
♻️ Suggested cleanup

```diff
-    def run_full_analysis(self, request, opportunity_id, use_cache=False):
+    def run_full_analysis(self, request, opportunity_id):
         """Run the full analysis and show results."""
```

And remove the `--use-cache` argument from `add_arguments`, or keep it with a deprecation warning until SQL cache control is implemented.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@commcare_connect/custom_analysis/chc_nutrition/management/commands/test_chc_nutrition.py` around lines 245 - 255, The run_full_analysis method currently accepts a use_cache parameter but never passes it to AnalysisPipeline.stream_analysis_ignore_events, so the CLI --use-cache flag has no effect; either remove the dead parameter and the --use-cache argument in add_arguments (or emit a deprecation warning from add_arguments) or wire cache control through the pipeline by updating run_full_analysis to pass use_cache into AnalysisPipeline (e.g., call stream_analysis_ignore_events(CHC_NUTRITION_CONFIG, use_cache=use_cache)) and implement cache handling in AnalysisPipeline.stream_analysis_ignore_events/SQL backend accordingly.
234-243: Cache test stub is acceptable, but consider removing the CLI argument.

The TODO comment is helpful for tracking the incomplete migration. However, the `--test-cache` argument still exists in `add_arguments` (lines 66-69), which may confuse users. Consider either:
- Removing the argument entirely until SQL cache testing is implemented
- Adding a deprecation warning when the argument is provided
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@commcare_connect/custom_analysis/chc_nutrition/management/commands/test_chc_nutrition.py` around lines 234 - 243, Remove the obsolete CLI flag by deleting the "--test-cache" argument from the command's add_arguments method and any related handling, and remove or simplify test_cache_functionality's warning block (or keep the stub but no longer reference the flag), so users aren't offered a non-functional option; locate add_arguments and test_cache_functionality in the management command and update help text/docstring to reflect that SQL-backend cache testing is not implemented (alternatively, if you prefer a softer change, add a deprecation warning inside add_arguments that logs a clear message when "--test-cache" is present and points to the TODO in test_cache_functionality).

commcare_connect/labs/tests/test_cchq_pipeline.py (2)
28-28: Remove unused imports.

`MagicMock` and `patch` are imported but not used in the current tests.

♻️ Cleanup unused imports

```diff
-from unittest.mock import MagicMock, patch  # noqa: E402
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@commcare_connect/labs/tests/test_cchq_pipeline.py` at line 28, The test file imports unused symbols MagicMock and patch; remove them from the import statement in commcare_connect/labs/tests/test_cchq_pipeline.py (the line containing "from unittest.mock import MagicMock, patch") so the module only imports what it actually uses; if other tests later need mocking, add the specific imports at that time.
8-25: Consider using pytest-django instead of manual configuration.

The manual `settings.configure()` block is fragile:
- It may conflict if tests run after Django is already configured
- It sets up a minimal in-memory SQLite that may not match production schema

If pytest-django is available, use `@pytest.mark.django_db` and `conftest.py` fixtures instead. If this file must run standalone (e.g., during CI without full Django setup), add a comment explaining why.
Verify each finding against the current code and only fix it if needed. In `@commcare_connect/labs/tests/test_cchq_pipeline.py` around lines 8 - 25, The manual Django bootstrap (settings.configure(...) and django.setup() in this test module) is fragile; replace it by relying on pytest-django fixtures or make the intent explicit: remove the inline settings/configure and annotate tests with `@pytest.mark.django_db` and add any required DB fixtures in conftest.py, or if this module must run standalone, add a brief comment above the settings.configure/django.setup explaining why standalone configuration is required and guard it with a check to avoid reconfiguring Django (e.g., only configure when not settings.configured), keeping the existing settings keys moved to a project conftest.py fixture named e.g. django_db_settings to centralize configuration.

commcare_connect/workflow/templates/__init__.py (1)
148-190: Consider making pipeline_schema and pipeline_schemas mutually exclusive.

If a template provides both `pipeline_schema` (singular) and `pipeline_schemas` (plural), both blocks will execute and append to `pipeline_sources`. This could lead to unexpected behavior.

Consider adding a guard or documenting the expected usage:

♻️ Proposed fix to make schemas mutually exclusive

```diff
+    # Handle pipeline creation - singular OR plural, not both
+    pipeline_schemas = template.get("pipeline_schemas", [])
+    if pipeline_schema and pipeline_schemas:
+        raise ValueError(
+            f"Template {template_key} has both pipeline_schema and pipeline_schemas; use only one"
+        )
+
     # Create pipeline if template has one (singular schema)
     if pipeline_schema and request:
```

Alternatively, if both are intentionally supported (singular is "main" + plural are "additional"), document this behavior clearly.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@commcare_connect/workflow/templates/__init__.py` around lines 148 - 190, The code currently processes both pipeline_schema and pipeline_schemas and appends into pipeline_sources, causing unexpected duplicates; update the logic in __init__.py around pipeline_schema, pipeline_schemas, pipeline_sources so the two modes are mutually exclusive (e.g., if pipeline_schemas is present prefer/only process it, else process pipeline_schema) or add an explicit guard that raises/logs when both are provided; ensure the branch that computes pipeline_alias using template_key/alias_map remains applied only for the singular pipeline_schema path and that pipeline_data_access usage and pipeline_sources population occur in the chosen exclusive branch.

commcare_connect/labs/analysis/computations.py (1)
103-116: Broad exception handling is acceptable here, but consider being more specific.

The `except Exception` blocks (flagged by Ruff BLE001) are reasonable for batch data processing where one malformed record shouldn't crash the entire pipeline. However, for transforms specifically, you could narrow to common expected exceptions:

♻️ Optional: More specific exception handling for transforms

```diff
         if value is not None and field_comp.transform and not field_comp.uses_extractor:
             try:
                 value = field_comp.transform(value)
-            except Exception:
+            except (TypeError, ValueError, AttributeError, KeyError):
                 value = None
```

This makes the intent clearer while still catching the common failure modes. The outer `except Exception` on line 114 can remain broad since extraction failures are more varied.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@commcare_connect/labs/analysis/computations.py` around lines 103 - 116, The inner transform error handling currently uses a bare except in the field processing loop; update the try/except around field_comp.transform(value) to catch specific expected errors (for example ValueError, TypeError, KeyError) instead of Exception so malformed values are handled explicitly, while leaving the outer except in the surrounding block (the handler that logs "Failed to compute field {field_comp.name} for visit {visit.id}: {e}") as a broad except to continue protecting the overall extraction; reference field_comp.transform and the outer except that assigns visit_result[field_comp.name] = field_comp.default when implementing this change.

commcare_connect/labs/analysis/pipeline.py (1)
340-386: Extract duplicated raw-fetch streaming logic into one helper.

These two branches duplicate the same event loop/translation behavior. A single internal helper would reduce drift risk and keep future fixes consistent.
♻️ Refactor sketch
```diff
+    def _stream_visit_fetch(self, opp_id: int, force_refresh: bool):
+        visit_dicts = None
+        for event in self.backend.stream_raw_visits(
+            opportunity_id=opp_id,
+            access_token=self.access_token,
+            expected_visit_count=self.visit_count,
+            force_refresh=force_refresh,
+        ):
+            # existing cached/progress/parsing/complete mapping
+            ...
+        return visit_dicts
```

Also applies to: 455-484
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@commcare_connect/labs/analysis/pipeline.py` around lines 340 - 386, The two branches that produce and translate streaming events duplicate the same loop; extract that logic into a single helper (e.g., a private method _stream_and_collect_raw_visits(opportunity_id, access_token, force_refresh, expected_visit_count) on the Pipeline class) and call it from both the cchq_forms branch (after fetch_cchq_forms_as_visit_dicts produces its visit_dicts) and the self.backend branch; the helper should iterate the same event tuples from self.backend.stream_raw_visits (or accept an iterator of events), map "cached"/"progress"/"parsing"/"complete" to the same yields (EVENT_STATUS and EVENT_DOWNLOAD), set/return visit_dicts when "cached" or "complete" occurs, and preserve logging using self.backend_name and logger so the behavior in fetch_cchq_forms_as_visit_dicts, visit_dicts assignment, EVENT_STATUS/EVENT_DOWNLOAD yields, and messages remain identical; apply the same refactor for the other duplicated region around the second occurrence.commcare_connect/workflow/job_handlers/mbw_monitoring.py (2)
186-222: The `access_token` parameter is unused but required by the handler interface.

The static analysis correctly flags that `access_token` is not used in this handler. However, the `@register_job_handler` decorator requires all handlers to have the signature `(job_config, access_token, progress_callback)`. Consider adding an underscore prefix (`_access_token`) to signal intentional non-use and silence the linter, or add a brief comment explaining it's unused for this handler.

💡 Optional: Signal intentional non-use

```diff
 @register_job_handler("mbw_monitoring")
-def handle_mbw_monitoring_job(job_config: dict, access_token: str, progress_callback) -> dict:
+def handle_mbw_monitoring_job(job_config: dict, _access_token: str, progress_callback) -> dict:
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@commcare_connect/workflow/job_handlers/mbw_monitoring.py` around lines 186 - 222, The access_token parameter of handle_mbw_monitoring_job is intentionally unused but flagged by linters; update the function signature to rename access_token to _access_token (or prefix it with an underscore) and/or add a brief inline comment like "# intentionally unused — required by register_job_handler interface" inside handle_mbw_monitoring_job to explicitly signal non-use and silence static analysis warnings while keeping the required (job_config, access_token, progress_callback) shape for `@register_job_handler`.
357-370: Non-idiomatic variable existence check using `dir()`.

Line 370 uses `"adapted_visit_rows" not in dir()` to check if a local variable exists, which is unusual and confusing. This check is needed because `adapted_visit_rows` is defined inside the try block (line 296) and may not exist if an exception occurs before that line.

A cleaner approach is to initialize the variable before the try block:
♻️ Suggested refactor: Initialize variable before try block
Move the initialization before Step 2's try block:
```diff
+    # Initialize before try block so it exists even if step 2 fails early
+    adapted_visit_rows: list[_PipelineRowAdapter] = []
+
     # =========================================================================
     # Step 2: Follow-up Rate Analysis
     # =========================================================================
     try:
         progress_callback("Computing follow-up rates...", processed=1, total=5)

         # Adapt visit rows for followup_analysis functions
         adapted_visit_rows = _adapt_rows(visit_rows)
```

Then simplify the exception handler:

```diff
-        adapted_visit_rows = _adapt_rows(visit_rows) if "adapted_visit_rows" not in dir() else adapted_visit_rows
+        if not adapted_visit_rows:
+            adapted_visit_rows = _adapt_rows(visit_rows)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@commcare_connect/workflow/job_handlers/mbw_monitoring.py` around lines 357 - 370, The code uses "adapted_visit_rows" not in dir() to detect whether adapted_visit_rows was created inside the try; instead initialize adapted_visit_rows (e.g., set to None or an empty list) immediately before the try block that defines it (the try beginning around Step 2 where adapted_visit_rows is assigned), and in the except handler replace the dir() check with a direct None (or empty) check (e.g., if adapted_visit_rows is None) and then call _adapt_rows(visit_rows) as needed; update references to adapted_visit_rows in the except block accordingly so the variable always exists without using dir().
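Generically, the initialize-before-try pattern recommended for the handler looks like this (toy `compute`/fallback names, not the handler's real code):

```python
def compute(rows, fail_early=False):
    adapted = []  # bound before the try, so the name always exists
    try:
        if fail_early:
            raise RuntimeError("upstream failure before adaptation")
        adapted = [r.upper() for r in rows]
    except RuntimeError:
        if not adapted:  # direct value check replaces the dir() lookup
            adapted = list(rows)  # fall back to unadapted rows
    return adapted
```

Because the name is always bound, the except branch can test the value itself instead of probing the local namespace.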
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@commcare_connect/labs/analysis/backends/sql/cchq_fetcher.py`:
- Around line 103-126: The fetch currently always passes app_id to
client.fetch_forms even when xmlns was resolved from data_source.gs_app_id or
via client.discover_form_xmlns, which can incorrectly filter out matches; modify
the logic to track where xmlns was found (e.g., a flag like xmlns_source or
xmlns_from_app) when calling client.get_form_xmlns(data_source.gs_app_id, ...)
or client.get_form_xmlns(app_id, ...) or client.discover_form_xmlns(...), and
only pass app_id into client.fetch_forms when xmlns was specifically discovered
via the main app_id (i.e., xmlns_from_app is True); otherwise call
client.fetch_forms(xmlns=xmlns, app_id=None) so forms from other apps are not
excluded.
In `@commcare_connect/labs/integrations/connect/cli/client.py`:
- Line 153: The current construction of auth_url via string concatenation
(auth_url = f"{production_url}{authorize_path}?{urlencode(auth_params)}") is
fragile when production_url may end with a slash or authorize_path may/may not
start with one; replace this with proper URL joining using urllib.parse.urljoin
to combine production_url and authorize_path, then append the query string built
with urlencode; do the same normalization for the token endpoint (the similar
construction around token_url or wherever production_url + token_path is used)
so both auth and token URLs are robust to trailing/leading slashes.
In `@commcare_connect/workflow/data_access.py`:
- Around line 1745-1752: Guard against schema["data_source"] being null/non-dict
before calling .get on it: replace the current data_source_dict =
schema.get("data_source", {}) with a type check such as data_source_raw =
schema.get("data_source"); if not isinstance(data_source_raw, dict):
data_source_dict = {} else: data_source_dict = data_source_raw, then proceed to
construct DataSourceConfig(...) using data_source_dict; this prevents
AttributeError when schema contains "data_source": null while keeping the same
defaults for missing keys.
In `@commcare_connect/workflow/management/commands/test_mbw_parity.py`:
- Around line 151-153: The except block catching "except Exception as e" in the
command (inside the Command.handle method) currently only logs via
self.stdout.write(self.style.WARNING(...)) and then allows the parity comparison
to continue; change it to abort immediately after logging by raising a
CommandError (or returning from handle) with the error message so the command
stops and does not run with incomplete inputs, and ensure CommandError is
imported (or use return) to prevent further parity checks after the CCHQ fetch
failure.
- Around line 590-601: The command prints "FAILED" when all_diffs is non-empty
but still exits 0; update the failure branch in test_mbw_parity.py (inside the
management command's handle method where all_diffs is checked) to terminate with
a non-zero exit code—either call sys.exit(1) (and add import sys) or raise
django.core.management.CommandError with a clear message—immediately after
writing the failure output so CI sees a failure when all_diffs is present.
---
Nitpick comments:
In
`@commcare_connect/custom_analysis/chc_nutrition/management/commands/test_chc_nutrition.py`:
- Around line 245-255: The run_full_analysis method currently accepts a
use_cache parameter but never passes it to
AnalysisPipeline.stream_analysis_ignore_events, so the CLI --use-cache flag has
no effect; either remove the dead parameter and the --use-cache argument in
add_arguments (or emit a deprecation warning from add_arguments) or wire cache
control through the pipeline by updating run_full_analysis to pass use_cache
into AnalysisPipeline (e.g., call
stream_analysis_ignore_events(CHC_NUTRITION_CONFIG, use_cache=use_cache)) and
implement cache handling in AnalysisPipeline.stream_analysis_ignore_events/SQL
backend accordingly.
- Around line 234-243: Remove the obsolete CLI flag by deleting the
"--test-cache" argument from the command's add_arguments method and any related
handling, and remove or simplify test_cache_functionality's warning block (or
keep the stub but no longer reference the flag), so users aren't offered a
non-functional option; locate add_arguments and test_cache_functionality in the
management command and update help text/docstring to reflect that SQL-backend
cache testing is not implemented (alternatively, if you prefer a softer change,
add a deprecation warning inside add_arguments that logs a clear message when
"--test-cache" is present and points to the TODO in test_cache_functionality).
In `@commcare_connect/labs/analysis/computations.py`:
- Around line 103-116: The inner transform error handling currently uses a bare
except in the field processing loop; update the try/except around
field_comp.transform(value) to catch specific expected errors (for example
ValueError, TypeError, KeyError) instead of Exception so malformed values are
handled explicitly, while leaving the outer except in the surrounding block (the
handler that logs "Failed to compute field {field_comp.name} for visit
{visit.id}: {e}") as a broad except to continue protecting the overall
extraction; reference field_comp.transform and the outer except that assigns
visit_result[field_comp.name] = field_comp.default when implementing this
change.
In `@commcare_connect/labs/analysis/pipeline.py`:
- Around line 340-386: The two branches that produce and translate streaming
events duplicate the same loop; extract that logic into a single helper (e.g., a
private method _stream_and_collect_raw_visits(opportunity_id, access_token,
force_refresh, expected_visit_count) on the Pipeline class) and call it from
both the cchq_forms branch (after fetch_cchq_forms_as_visit_dicts produces its
visit_dicts) and the self.backend branch; the helper should iterate the same
event tuples from self.backend.stream_raw_visits (or accept an iterator of
events), map "cached"/"progress"/"parsing"/"complete" to the same yields
(EVENT_STATUS and EVENT_DOWNLOAD), set/return visit_dicts when "cached" or
"complete" occurs, and preserve logging using self.backend_name and logger so
the behavior in fetch_cchq_forms_as_visit_dicts, visit_dicts assignment,
EVENT_STATUS/EVENT_DOWNLOAD yields, and messages remain identical; apply the
same refactor for the other duplicated region around the second occurrence.
In `@commcare_connect/labs/tests/test_cchq_pipeline.py`:
- Line 28: The test file imports unused symbols MagicMock and patch; remove them
from the import statement in commcare_connect/labs/tests/test_cchq_pipeline.py
(the line containing "from unittest.mock import MagicMock, patch") so the module
only imports what it actually uses; if other tests later need mocking, add the
specific imports at that time.
- Around line 8-25: The manual Django bootstrap (settings.configure(...) and
django.setup() in this test module) is fragile; replace it by relying on
pytest-django fixtures or make the intent explicit: remove the inline
settings/configure and annotate tests with `@pytest.mark.django_db` and add any
required DB fixtures in conftest.py, or if this module must run standalone, add
a brief comment above the settings.configure/django.setup explaining why
standalone configuration is required and guard it with a check to avoid
reconfiguring Django (e.g., only configure when not settings.configured),
keeping the existing settings keys moved to a project conftest.py fixture named
e.g. django_db_settings to centralize configuration.
In `@commcare_connect/workflow/job_handlers/mbw_monitoring.py`:
- Around line 186-222: The access_token parameter of handle_mbw_monitoring_job
is intentionally unused but flagged by linters; update the function signature to
rename access_token to _access_token (or prefix it with an underscore) and/or
add a brief inline comment like "# intentionally unused — required by
register_job_handler interface" inside handle_mbw_monitoring_job to explicitly
signal non-use and silence static analysis warnings while keeping the required
(job_config, access_token, progress_callback) shape for `@register_job_handler`.
- Around line 357-370: The code uses "adapted_visit_rows" not in dir() to detect
whether adapted_visit_rows was created inside the try; instead initialize
adapted_visit_rows (e.g., set to None or an empty list) immediately before the
try block that defines it (the try beginning around Step 2 where
adapted_visit_rows is assigned), and in the except handler replace the dir()
check with a direct None (or empty) check (e.g., if adapted_visit_rows is None)
and then call _adapt_rows(visit_rows) as needed; update references to
adapted_visit_rows in the except block accordingly so the variable always exists
without using dir().
In `@commcare_connect/workflow/templates/__init__.py`:
- Around line 148-190: The code currently processes both pipeline_schema and
pipeline_schemas and appends into pipeline_sources, causing unexpected
duplicates; update the logic in __init__.py around pipeline_schema,
pipeline_schemas, pipeline_sources so the two modes are mutually exclusive
(e.g., if pipeline_schemas is present prefer/only process it, else process
pipeline_schema) or add an explicit guard that raises/logs when both are
provided; ensure the branch that computes pipeline_alias using
template_key/alias_map remains applied only for the singular pipeline_schema
path and that pipeline_data_access usage and pipeline_sources population occur
in the chosen exclusive branch.
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (35)
- .claude/AGENTS.md
- .gitignore
- commcare_connect/audit/data_access.py
- commcare_connect/custom_analysis/chc_nutrition/management/commands/test_chc_nutrition.py
- commcare_connect/labs/analysis/__init__.py
- commcare_connect/labs/analysis/backends/__init__.py
- commcare_connect/labs/analysis/backends/protocol.py
- commcare_connect/labs/analysis/backends/python_redis/__init__.py
- commcare_connect/labs/analysis/backends/python_redis/backend.py
- commcare_connect/labs/analysis/backends/python_redis/cache.py
- commcare_connect/labs/analysis/backends/python_redis/computations.py
- commcare_connect/labs/analysis/backends/python_redis/flw_analyzer.py
- commcare_connect/labs/analysis/backends/python_redis/visit_analyzer.py
- commcare_connect/labs/analysis/backends/sql/cchq_fetcher.py
- commcare_connect/labs/analysis/computations.py
- commcare_connect/labs/analysis/config.py
- commcare_connect/labs/analysis/pipeline.py
- commcare_connect/labs/integrations/connect/cli/client.py
- commcare_connect/labs/management/commands/get_commcare_token.py
- commcare_connect/labs/management/commands/test_backend_parity.py
- commcare_connect/labs/tests/test_cchq_pipeline.py
- commcare_connect/labs/tests/test_data_source_config.py
- commcare_connect/workflow/data_access.py
- commcare_connect/workflow/job_handlers/__init__.py
- commcare_connect/workflow/job_handlers/mbw_monitoring.py
- commcare_connect/workflow/management/commands/test_mbw_parity.py
- commcare_connect/workflow/tasks.py
- commcare_connect/workflow/templates/__init__.py
- commcare_connect/workflow/templates/mbw_monitoring/data_transforms.py
- commcare_connect/workflow/templates/mbw_monitoring/views.py
- commcare_connect/workflow/templates/mbw_monitoring_v2.py
- commcare_connect/workflow/tests/test_mbw_v1_v2_parity.py
- config/settings/labs.py
- config/settings/local.py
- docs/LABS_ARCHITECTURE.md
💤 Files with no reviewable changes (9)
- commcare_connect/labs/analysis/backends/python_redis/computations.py
- config/settings/labs.py
- commcare_connect/labs/management/commands/test_backend_parity.py
- commcare_connect/labs/analysis/backends/python_redis/visit_analyzer.py
- commcare_connect/labs/analysis/backends/python_redis/flw_analyzer.py
- commcare_connect/labs/analysis/backends/python_redis/cache.py
- commcare_connect/labs/analysis/backends/python_redis/backend.py
- commcare_connect/labs/analysis/backends/python_redis/__init__.py
- commcare_connect/labs/analysis/backends/protocol.py
```python
        # Strategy 1: Try gs_app_id first (for GS forms in separate supervisor app)
        if data_source.gs_app_id:
            xmlns = client.get_form_xmlns(data_source.gs_app_id, form_name)
            if xmlns:
                logger.info(f"[CCHQ Fetcher] Found xmlns via gs_app_id: {xmlns}")

        # Strategy 2: Try the main app_id
        if not xmlns and app_id:
            xmlns = client.get_form_xmlns(app_id, form_name)
            if xmlns:
                logger.info(f"[CCHQ Fetcher] Found xmlns via app_id: {xmlns}")

        # Strategy 3: Search all apps
        if not xmlns:
            xmlns = client.discover_form_xmlns(form_name)
            if xmlns:
                logger.info(f"[CCHQ Fetcher] Discovered xmlns via search: {xmlns}")

        if not xmlns:
            logger.warning(f"[CCHQ Fetcher] Could not discover xmlns for '{form_name}', returning empty")
            return []

        forms = client.fetch_forms(xmlns=xmlns, app_id=app_id if app_id else None)
        logger.info(f"[CCHQ Fetcher] Fetched {len(forms)} '{form_name}' forms from {cc_domain}")
```
app_id filter is applied even when xmlns came from a different app.
When xmlns is resolved via gs_app_id (or global discovery), forcing fetch_forms(..., app_id=<main app>) can filter out matching forms and return empty results.
💡 Suggested fix

```diff
-    # Strategy 1: Try gs_app_id first (for GS forms in separate supervisor app)
+    fetch_app_id = None
+
+    # Strategy 1: Try gs_app_id first (for GS forms in separate supervisor app)
     if data_source.gs_app_id:
         xmlns = client.get_form_xmlns(data_source.gs_app_id, form_name)
         if xmlns:
+            fetch_app_id = data_source.gs_app_id
             logger.info(f"[CCHQ Fetcher] Found xmlns via gs_app_id: {xmlns}")

     # Strategy 2: Try the main app_id
     if not xmlns and app_id:
         xmlns = client.get_form_xmlns(app_id, form_name)
         if xmlns:
+            fetch_app_id = app_id
             logger.info(f"[CCHQ Fetcher] Found xmlns via app_id: {xmlns}")

     # Strategy 3: Search all apps
     if not xmlns:
         xmlns = client.discover_form_xmlns(form_name)
         if xmlns:
+            fetch_app_id = None  # app unknown from discovery; don't over-filter
             logger.info(f"[CCHQ Fetcher] Discovered xmlns via search: {xmlns}")

-    forms = client.fetch_forms(xmlns=xmlns, app_id=app_id if app_id else None)
+    forms = client.fetch_forms(xmlns=xmlns, app_id=fetch_app_id)
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
fetch_app_id = None

# Strategy 1: Try gs_app_id first (for GS forms in separate supervisor app)
if data_source.gs_app_id:
    xmlns = client.get_form_xmlns(data_source.gs_app_id, form_name)
    if xmlns:
        fetch_app_id = data_source.gs_app_id
        logger.info(f"[CCHQ Fetcher] Found xmlns via gs_app_id: {xmlns}")

# Strategy 2: Try the main app_id
if not xmlns and app_id:
    xmlns = client.get_form_xmlns(app_id, form_name)
    if xmlns:
        fetch_app_id = app_id
        logger.info(f"[CCHQ Fetcher] Found xmlns via app_id: {xmlns}")

# Strategy 3: Search all apps
if not xmlns:
    xmlns = client.discover_form_xmlns(form_name)
    if xmlns:
        fetch_app_id = None  # app unknown from discovery; don't over-filter
        logger.info(f"[CCHQ Fetcher] Discovered xmlns via search: {xmlns}")

if not xmlns:
    logger.warning(f"[CCHQ Fetcher] Could not discover xmlns for '{form_name}', returning empty")
    return []

forms = client.fetch_forms(xmlns=xmlns, app_id=fetch_app_id)
logger.info(f"[CCHQ Fetcher] Fetched {len(forms)} '{form_name}' forms from {cc_domain}")
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@commcare_connect/labs/analysis/backends/sql/cchq_fetcher.py` around lines 103
- 126, The fetch currently always passes app_id to client.fetch_forms even when
xmlns was resolved from data_source.gs_app_id or via client.discover_form_xmlns,
which can incorrectly filter out matches; modify the logic to track where xmlns
was found (e.g., a flag like xmlns_source or xmlns_from_app) when calling
client.get_form_xmlns(data_source.gs_app_id, ...) or
client.get_form_xmlns(app_id, ...) or client.discover_form_xmlns(...), and only
pass app_id into client.fetch_forms when xmlns was specifically discovered via
the main app_id (i.e., xmlns_from_app is True); otherwise call
client.fetch_forms(xmlns=xmlns, app_id=None) so forms from other apps are not
excluded.
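The three-strategy resolution described above can be factored into a helper that returns both the xmlns and the app it was resolved from, so the fetch filter always matches the resolving app. This is an illustrative sketch, not the PR's code: `resolve_xmlns` and `_StubClient` are invented names, and the real client methods may have different signatures.

```python
def resolve_xmlns(client, form_name, app_id=None, gs_app_id=None):
    """Return (xmlns, fetch_app_id). fetch_app_id is None when the
    resolving app is unknown and filtering by app would be unsafe."""
    # Strategy 1: supervisor app (GS forms may live in a separate app)
    if gs_app_id:
        xmlns = client.get_form_xmlns(gs_app_id, form_name)
        if xmlns:
            return xmlns, gs_app_id
    # Strategy 2: main app
    if app_id:
        xmlns = client.get_form_xmlns(app_id, form_name)
        if xmlns:
            return xmlns, app_id
    # Strategy 3: global discovery; owning app unknown, so don't filter
    return client.discover_form_xmlns(form_name), None


class _StubClient:
    """Minimal stand-in for the CCHQ client, for illustration only."""

    def __init__(self, known):
        self.known = known  # {(app_id, form_name): xmlns}

    def get_form_xmlns(self, app_id, form_name):
        return self.known.get((app_id, form_name))

    def discover_form_xmlns(self, form_name):
        return next((x for (_, f), x in self.known.items() if f == form_name), None)


client = _StubClient({("gs-app", "gs_visit"): "http://x/gs"})
# GS form found via the supervisor app: fetch filters on that app, not the main one
xmlns, fetch_app_id = resolve_xmlns(client, "gs_visit", app_id="main-app", gs_app_id="gs-app")
```

With this shape, `fetch_forms(xmlns=xmlns, app_id=fetch_app_id)` cannot accidentally filter against an app that never produced the xmlns.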
| "code_challenge_method": "S256", | ||
| } | ||
| auth_url = f"{production_url}/o/authorize/?{urlencode(auth_params)}" | ||
| auth_url = f"{production_url}{authorize_path}?{urlencode(auth_params)}" |
Normalize URL joining for OAuth endpoints to avoid malformed URLs.
Current concatenation is fragile if production_url has a trailing / or endpoint paths are passed without a leading /.
Suggested hardening
```diff
 def get_oauth_token(
@@
 ) -> dict | None:
@@
-    # Build authorization URL
+    # Build authorization/token URLs safely
+    base_url = production_url.rstrip("/")
+    authorize_endpoint = authorize_path if authorize_path.startswith("/") else f"/{authorize_path}"
+    token_endpoint = token_path if token_path.startswith("/") else f"/{token_path}"
+
+    # Build authorization URL
     auth_params = {
@@
-    auth_url = f"{production_url}{authorize_path}?{urlencode(auth_params)}"
+    auth_url = f"{base_url}{authorize_endpoint}?{urlencode(auth_params)}"
@@
     response = httpx.post(
-        f"{production_url}{token_path}",
+        f"{base_url}{token_endpoint}",
         data=token_data,
         timeout=10,
     )
```

Also applies to: 201-201
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@commcare_connect/labs/integrations/connect/cli/client.py` at line 153, The
current construction of auth_url via string concatenation (auth_url =
f"{production_url}{authorize_path}?{urlencode(auth_params)}") is fragile when
production_url may end with a slash or authorize_path may/may not start with
one; replace this with proper URL joining using urllib.parse.urljoin to combine
production_url and authorize_path, then append the query string built with
urlencode; do the same normalization for the token endpoint (the similar
construction around token_url or wherever production_url + token_path is used)
so both auth and token URLs are robust to trailing/leading slashes.
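The normalization the bot suggests can also be done with the stdlib. A generic sketch under the assumption that the path is always relative to the host root (`build_auth_url` is an illustrative name, not a function in the PR, and the maintainer ultimately dismissed this hardening as unnecessary for controlled inputs):

```python
from urllib.parse import urlencode, urljoin


def build_auth_url(production_url, authorize_path, params):
    # urljoin tolerates a trailing slash on the base and a leading
    # slash on the path, so both input shapes produce the same URL
    base = production_url if production_url.endswith("/") else production_url + "/"
    return urljoin(base, authorize_path.lstrip("/")) + "?" + urlencode(params)


url = build_auth_url("https://www.commcarehq.org/", "/oauth/authorize/", {"client_id": "abc"})
```

Both `("https://host/", "/path")` and `("https://host", "path")` yield the same result, which is the robustness the review was asking for.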
Dismissed — both callers pass well-formed URLs with explicit leading slashes (defaults are /o/authorize/ and /oauth/authorize/). Adding urljoin would over-engineer for an internal CLI tool with two known call sites.
@jjackson, understood—acknowledging that the inputs are controlled and the defaults already enforce the leading-slash convention makes the suggestion unnecessary for this internal CLI tool. I'll note this context for future reference.
✏️ Learnings added
Learnt from: jjackson
Repo: jjackson/commcare-connect PR: 24
File: commcare_connect/labs/integrations/connect/cli/client.py:153-153
Timestamp: 2026-02-26T13:45:55.693Z
Learning: In commcare_connect/labs/integrations/connect/cli/client.py, the get_oauth_token function is used by an internal CLI tool with only two known call sites. The authorize_path and token_path parameters always have explicit leading slashes by convention (defaults are "/o/authorize/" and "/o/token/"), so URL normalization via urljoin is considered over-engineering given the controlled inputs.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
🧠 Learnings used
Learnt from: akkaouim
Repo: jjackson/commcare-connect PR: 22
File: commcare_connect/workflow/data_access.py:572-572
Timestamp: 2026-02-18T16:40:33.334Z
Learning: In Python code, prefer timezone-aware UTC datetimes for timestamp fields by using datetime.now(timezone.utc) instead of naive datetime.now(). This avoids timezone ambiguity and ensures consistent UTC timestamps. In commcare_connect/workflow/data_access.py, apply this pattern for all timestamp fields, ensuring you import timezone from datetime.
@jjackson, understood. For an internal tool with controlled call sites and explicit leading-slash defaults, the simple concatenation is appropriate. Thanks for clarifying the context.
```python
data_source_dict = schema.get("data_source", {})
data_source = DataSourceConfig(
    type=data_source_dict.get("type", "connect_csv"),
    form_name=data_source_dict.get("form_name", ""),
    app_id=data_source_dict.get("app_id", ""),
    app_id_source=data_source_dict.get("app_id_source", ""),
    gs_app_id=data_source_dict.get("gs_app_id", ""),
)
```
Guard schema["data_source"] against null/non-object values.
If persisted schema contains "data_source": null, this block crashes on data_source_dict.get(...) before config creation.
💡 Suggested fix
- data_source_dict = schema.get("data_source", {})
+ data_source_dict = schema.get("data_source") or {}
+ if not isinstance(data_source_dict, dict):
+ raise ValueError("schema.data_source must be an object")
data_source = DataSourceConfig(
type=data_source_dict.get("type", "connect_csv"),
form_name=data_source_dict.get("form_name", ""),
app_id=data_source_dict.get("app_id", ""),
app_id_source=data_source_dict.get("app_id_source", ""),
gs_app_id=data_source_dict.get("gs_app_id", ""),
)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@commcare_connect/workflow/data_access.py` around lines 1745 - 1752, Guard
against schema["data_source"] being null/non-dict before calling .get on it:
replace the current data_source_dict = schema.get("data_source", {}) with a type
check such as data_source_raw = schema.get("data_source"); if not
isinstance(data_source_raw, dict): data_source_dict = {} else: data_source_dict
= data_source_raw, then proceed to construct DataSourceConfig(...) using
data_source_dict; this prevents AttributeError when schema contains
"data_source": null while keeping the same defaults for missing keys.
```python
except Exception as e:
    self.stdout.write(self.style.WARNING(f" -> CCHQ fetch failed: {e}"))
```
Don’t continue parity comparison after CCHQ fetch failure.
Swallowing this error allows the command to run with incomplete inputs and can incorrectly report parity.
💡 Suggested fix
- except Exception as e:
- self.stdout.write(self.style.WARNING(f" -> CCHQ fetch failed: {e}"))
+ except Exception as e:
+ raise CommandError(f"CCHQ fetch failed: {e}") from e🧰 Tools
🪛 Ruff (0.15.2)
[warning] 151-151: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@commcare_connect/workflow/management/commands/test_mbw_parity.py` around
lines 151 - 153, The except block catching "except Exception as e" in the
command (inside the Command.handle method) currently only logs via
self.stdout.write(self.style.WARNING(...)) and then allows the parity comparison
to continue; change it to abort immediately after logging by raising a
CommandError (or returning from handle) with the error message so the command
stops and does not run with incomplete inputs, and ensure CommandError is
imported (or use return) to prevent further parity checks after the CCHQ fetch
failure.
By design — CCHQ is optional for the parity test. When the CommCare token isn't available, the command continues with 0 registration/GS forms and validates all other sections. The output clearly shows the counts so it's transparent.
@jjackson, thank you for clarifying! That makes sense for a diagnostic command where partial data can still provide value. The transparent output with counts is a good approach.
To help future reviewers (and developers) understand this is intentional, consider adding a brief inline comment explaining that CCHQ fetch failures are expected and handled gracefully:
```python
except Exception as e:
    # CCHQ data is optional; continue with 0 forms to validate other sections
    self.stdout.write(self.style.WARNING(f" -> CCHQ fetch failed: {e}"))
```

This will make the design intent explicit in the code.
✏️ Learnings added
Learnt from: jjackson
Repo: jjackson/commcare-connect PR: 24
File: commcare_connect/workflow/management/commands/test_mbw_parity.py:151-153
Timestamp: 2026-02-26T13:46:12.419Z
Learning: In the test_mbw_parity management command (commcare_connect/workflow/management/commands/test_mbw_parity.py), CCHQ form fetching (registrations and GS forms) is intentionally optional. The command catches and logs fetch exceptions but continues with 0 forms to allow parity validation of other sections when the CommCare token is unavailable. This is by design to support partial testing scenarios.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
🧠 Learnings used
Learnt from: akkaouim
Repo: jjackson/commcare-connect PR: 22
File: commcare_connect/workflow/data_access.py:572-572
Timestamp: 2026-02-18T16:40:33.334Z
Learning: In Python code, prefer timezone-aware UTC datetimes for timestamp fields by using datetime.now(timezone.utc) instead of naive datetime.now(). This avoids timezone ambiguity and ensures consistent UTC timestamps. In commcare_connect/workflow/data_access.py, apply this pattern for all timestamp fields, ensuring you import timezone from datetime.
…eanup
- cchq_fetcher: track which app resolved xmlns and only pass that app_id
to fetch_forms (fixes GS forms being filtered out by main app_id)
- data_access: use `or {}` instead of default={} for data_source to
handle null values in persisted schemas
- test_mbw_parity: raise CommandError on parity failure for non-zero exit
- mbw_monitoring: initialize adapted_visit_rows before try block instead
of using dir() check; rename unused access_token to _access_token
- test_cchq_pipeline: remove unused MagicMock/patch imports
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Addressed CodeRabbit review — implemented 6 fixes, dismissed 8 as out-of-scope or by-design. Implemented:
Dismissed (nitpicks):
♻️ Duplicate comments (1)
commcare_connect/workflow/data_access.py (1)
1745-1752: ⚠️ Potential issue | 🟠 Major

Guard `schema["data_source"]` against non-dict values before `.get(...)`. Line 1745 still allows truthy non-dict payloads (e.g., list/string), which will crash at Line 1747 with `AttributeError`.

💡 Suggested fix

```diff
-data_source_dict = schema.get("data_source") or {}
+data_source_raw = schema.get("data_source")
+data_source_dict = data_source_raw if isinstance(data_source_raw, dict) else {}
 data_source = DataSourceConfig(
     type=data_source_dict.get("type", "connect_csv"),
     form_name=data_source_dict.get("form_name", ""),
     app_id=data_source_dict.get("app_id", ""),
     app_id_source=data_source_dict.get("app_id_source", ""),
     gs_app_id=data_source_dict.get("gs_app_id", ""),
 )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@commcare_connect/workflow/data_access.py` around lines 1745 - 1752, Guard against non-dict values for schema["data_source"] before calling .get: replace the current unconditional assignment of data_source_dict = schema.get("data_source") or {} with a type check so that if schema.get("data_source") is not an instance of dict you fall back to an empty dict; then construct DataSourceConfig(...) from that safe data_source_dict. This change touches the data_source_dict variable used to build the DataSourceConfig in the data access code (the schema -> data_source_dict -> DataSourceConfig flow).
🧹 Nitpick comments (1)
commcare_connect/workflow/job_handlers/mbw_monitoring.py (1)
255-256: Consider using UTC for `current_date`.

`date.today()` returns a naive date based on the server's local timezone. If the server runs in a non-UTC timezone, this could return a different calendar day than expected, affecting date comparisons in follow-up analysis.

🔧 Suggested fix using UTC

```diff
+from datetime import date, timezone, datetime
 ...
-current_date = date.today()
+current_date = datetime.now(timezone.utc).date()
```

Based on learnings: "In Python code, prefer timezone-aware UTC datetimes for timestamp fields by using datetime.now(timezone.utc) instead of naive datetime.now(). This avoids timezone ambiguity and ensures consistent UTC timestamps."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@commcare_connect/workflow/job_handlers/mbw_monitoring.py` around lines 255 - 256, Replace the naive local date call current_date = date.today() with a UTC-derived date to avoid timezone drift: import datetime/timezone and set current_date = datetime.datetime.now(datetime.timezone.utc).date(), updating any references in this module (e.g., in functions within mbw_monitoring.py where current_date is used) so the date is consistently derived from UTC; ensure necessary imports (datetime, timezone) are added at the top of the file.
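A minimal standalone illustration of the UTC-derived date the nitpick asks for (generic Python, not project code — the helper name is invented):

```python
from datetime import datetime, timezone


def current_utc_date():
    # Derive the calendar day from an aware UTC timestamp rather than
    # the server's local clock, so results don't depend on server TZ.
    return datetime.now(timezone.utc).date()


today_utc = current_utc_date()
```

Near midnight in a non-UTC timezone, `date.today()` and `current_utc_date()` can disagree by a full day, which is exactly the drift the review flags for follow-up date comparisons.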
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@commcare_connect/workflow/data_access.py`:
- Around line 1745-1752: Guard against non-dict values for schema["data_source"]
before calling .get: replace the current unconditional assignment of
data_source_dict = schema.get("data_source") or {} with a type check so that if
schema.get("data_source") is not an instance of dict you fall back to an empty
dict; then construct DataSourceConfig(...) from that safe data_source_dict. This
change touches the data_source_dict variable used to build the DataSourceConfig
in the data access code (the schema -> data_source_dict -> DataSourceConfig
flow).
---
Nitpick comments:
In `@commcare_connect/workflow/job_handlers/mbw_monitoring.py`:
- Around line 255-256: Replace the naive local date call current_date =
date.today() with a UTC-derived date to avoid timezone drift: import
datetime/timezone and set current_date =
datetime.datetime.now(datetime.timezone.utc).date(), updating any references in
this module (e.g., in functions within mbw_monitoring.py where current_date is
used) so the date is consistently derived from UTC; ensure necessary imports
(datetime, timezone) are added at the top of the file.
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- commcare_connect/labs/analysis/backends/sql/cchq_fetcher.py
- commcare_connect/labs/tests/test_cchq_pipeline.py
- commcare_connect/workflow/data_access.py
- commcare_connect/workflow/job_handlers/mbw_monitoring.py
- commcare_connect/workflow/management/commands/test_mbw_parity.py
🚧 Files skipped from review as they are similar to previous changes (1)
- commcare_connect/labs/tests/test_cchq_pipeline.py
Remove --use-cache, --test-cache flags and test_cache_functionality stub that were left over from the removed python_redis backend. Also remove the stale TODO comment referencing that removal. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Production streaming export is silently truncated by gunicorn worker timeout (300s) due to N+1 BlobMeta query in UserVisitDataSerializer. Sentry confirms SystemExit:1 on every download of large opportunities.
- Add raw CSV line counting to detect server-side truncation vs parsing drops
- Add configurable cache tolerance (PIPELINE_CACHE_TOLERANCE_PCT) so pipeline can work with partial downloads while server-side fix is pending
- Add configurable cache TTL (PIPELINE_CACHE_TTL_HOURS) for local dev
- Pass tolerance through all cache validation layers (raw, computed, FLW)
- Skip visit count validation for CCHQ form sources (different row counts)
- Fix mbw_monitoring_v2 render code string concatenation
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
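The tolerance check described in that commit might look roughly like this — a sketch only: `within_tolerance` and the 5% default are illustrative, while `PIPELINE_CACHE_TOLERANCE_PCT` is the setting name the commit actually mentions:

```python
def within_tolerance(expected_rows, actual_rows, tolerance_pct=5.0):
    """Treat a cached dataset as valid when its row count is within
    tolerance_pct percent of the expected count, so a partially
    truncated download can still be used while the server fix lands."""
    if expected_rows == 0:
        return actual_rows == 0
    shortfall_pct = abs(expected_rows - actual_rows) / expected_rows * 100
    return shortfall_pct <= tolerance_pct


ok = within_tolerance(45_652, 45_000)   # ~1.4% short: acceptable
bad = within_tolerance(45_652, 30_000)  # ~34% short: reject cache
```

Threading one `tolerance_pct` value through all cache layers (raw, computed, FLW) keeps the accept/reject decision consistent end to end.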
🧹 Nitpick comments (3)
commcare_connect/workflow/templates/mbw_monitoring_v2.py (2)
955-978: Safety check for SSE references is a good defensive measure. The code logs warnings when SSE terms remain after transformation, which helps catch incomplete migrations. However, the import inside the loop is inefficient.
♻️ Move import outside the loop
```diff
+import logging
+_logger = logging.getLogger(__name__)
+
 for term in _sse_terms:
     if term in code:
-        import logging
-
-        logging.getLogger(__name__).warning(
+        _logger.warning(
             "MBW V2 render code still contains SSE reference: %s", term
         )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@commcare_connect/workflow/templates/mbw_monitoring_v2.py` around lines 955 - 978, The import logging is inside the for loop causing repeated imports; move the import and logger creation out of the loop (e.g., import logging once and assign logger = logging.getLogger(__name__) before iterating over _sse_terms) and then call logger.warning(...) inside the loop when term in code is true, keeping the rest of the logic (_sse_terms, term, and the warning message) unchanged.
146-162: String replacement helpers could silently fail if V1 code changes.

The `_replace_between` and `_replace_between_inclusive` functions raise `ValueError` when markers aren't found, which is good. However, the error messages could be more actionable by including more context.

Consider this a minor suggestion since the code does fail loudly on marker mismatch.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@commcare_connect/workflow/templates/mbw_monitoring_v2.py` around lines 146 - 162, The ValueError raised by _replace_between and _replace_between_inclusive is unhelpful; update the exception message to include the start and end marker strings, their computed indices (start and end), and a short preview of the code context (e.g., a 120-char slice around the computed start or the whole code if short) so callers can quickly see why the markers weren't found; modify both functions' raise ValueError lines to construct and raise this richer message referencing _replace_between and _replace_between_inclusive so debugging marker mismatches is easier.

commcare_connect/labs/analysis/pipeline.py (1)
361-408: CCHQ fetch logic is duplicated across three code paths. The CCHQ form fetching code appears in three places: cache miss with filters (lines 361-408), force refresh with filters (lines 466-512), and main CCHQ path (lines 548-572). This increases maintenance burden.
Per the comments summary, this was noted as a refactoring suggestion and dismissed as out-of-scope for this PR. Consider extracting a helper method in a follow-up.
Also applies to: 466-512, 548-572
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@commcare_connect/labs/analysis/pipeline.py` around lines 361 - 408, Duplicate CCHQ fetch logic should be extracted into a single helper to avoid three copies; create a private helper on the Pipeline class (e.g., _fetch_cchq_forms) that encapsulates the import and call to fetch_cchq_forms_as_visit_dicts(request=self.request, data_source=unfiltered_config.data_source, access_token=self.access_token, opportunity_id=opp_id) and yields or returns the visit_dicts consistently, then replace the three inlined blocks that check unfiltered_config.data_source.type == "cchq_forms" (the blocks around fetch_cchq_forms_as_visit_dicts, the force-refresh path, and the main CCHQ path) with a single call to that helper; ensure the helper preserves the same logging/yield behavior (EVENT_STATUS messages) and assigns visit_dicts used later.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@commcare_connect/labs/analysis/pipeline.py`:
- Around line 361-408: Duplicate CCHQ fetch logic should be extracted into a
single helper to avoid three copies; create a private helper on the Pipeline
class (e.g., _fetch_cchq_forms) that encapsulates the import and call to
fetch_cchq_forms_as_visit_dicts(request=self.request,
data_source=unfiltered_config.data_source, access_token=self.access_token,
opportunity_id=opp_id) and yields or returns the visit_dicts consistently, then
replace the three inlined blocks that check unfiltered_config.data_source.type
== "cchq_forms" (the blocks around fetch_cchq_forms_as_visit_dicts, the
force-refresh path, and the main CCHQ path) with a single call to that helper;
ensure the helper preserves the same logging/yield behavior (EVENT_STATUS
messages) and assigns visit_dicts used later.
In `@commcare_connect/workflow/templates/mbw_monitoring_v2.py`:
- Around line 955-978: The import logging is inside the for loop causing
repeated imports; move the import and logger creation out of the loop (e.g.,
import logging once and assign logger = logging.getLogger(__name__) before
iterating over _sse_terms) and then call logger.warning(...) inside the loop
when term in code is true, keeping the rest of the logic (_sse_terms, term, and
the warning message) unchanged.
- Around line 146-162: The ValueError raised by _replace_between and
_replace_between_inclusive is unhelpful; update the exception message to include
the start and end marker strings, their computed indices (start and end), and a
short preview of the code context (e.g., a 120-char slice around the computed
start or the whole code if short) so callers can quickly see why the markers
weren't found; modify both functions' raise ValueError lines to construct and
raise this richer message referencing _replace_between and
_replace_between_inclusive so debugging marker mismatches is easier.
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- .gitignore
- commcare_connect/labs/analysis/backends/sql/backend.py
- commcare_connect/labs/analysis/backends/sql/cache.py
- commcare_connect/labs/analysis/pipeline.py
- commcare_connect/workflow/templates/mbw_monitoring_v2.py
- config/settings/local.py
🚧 Files skipped from review as they are similar to previous changes (1)
- .gitignore
Summary
- `mbw_monitoring` job handler for GPS analysis, follow-up rates, quality metrics, and FLW performance.
- `create_cli_request()` now auto-loads the CommCare HQ token into the mock session. Fixed `get_commcare_token` to use correct CCHQ OAuth endpoints (`/oauth/` not `/o/`).
- Added `visit_date` to v2's `_build_gps_visit_dicts()` output so `analyze_gps_metrics` can compute trailing 7-day and daily travel metrics.
- `data_transforms.py` module.
- `test_mbw_parity` management command for live end-to-end payload comparison. Verified identical output on opp 765 (45,652 visits, 100 FLWs, 23,881 registration forms).

Test plan
- `pytest commcare_connect/workflow/tests/test_mbw_v1_v2_parity.py --no-migrations`
- `python manage.py get_commcare_token`

🤖 Generated with Claude Code
Note
High Risk
High risk because it removes the `python_redis` analysis backend and forces the pipeline onto the SQL backend while also adding a new CommCare HQ Form API data source and new MBW v2 computation path, which can change caching/data-fetch behavior and dashboard outputs.

Overview

Labs analysis is migrated to SQL-only. The PR removes the `python_redis` backend (and backend protocol/exports), updates imports (e.g. `compute_visit_fields`), and makes `AnalysisPipeline` always instantiate `SQLBackend`.

The pipeline gains an alternate raw-data source from CommCare HQ forms. `AnalysisPipelineConfig` now includes `data_source` (`connect_csv` vs `cchq_forms`), and the pipeline can fetch/normalize CCHQ Form API results into visit-shaped dicts via a new SQL backend `cchq_fetcher`.

MBW Monitoring v2 is added. A new `mbw_monitoring` workflow job handler consumes multi-pipeline outputs (Connect visits + CCHQ registrations + CCHQ GS forms) to compute GPS, follow-up, quality, performance, and overview sections; registration is wired via `workflow.job_handlers` import, and a new `test_mbw_parity` command plus unit tests validate v1/v2 parity.

CLI OAuth support is extended. `get_oauth_token()` now allows configurable authorize/token paths (used to fix CommCare HQ `/oauth/*` endpoints), and `create_cli_request()` can inject a saved CommCare token into the mock session.

Written by Cursor Bugbot for commit d0b56b4. This will update automatically on new commits.