
JIRA <-> IMPACT sync for Status and Priority#209

Merged
juanjogarciatorres43 merged 21 commits into main from feat/impact_jira_status_sync
Feb 5, 2026

Conversation

@juanjogarciatorres43
Collaborator

@juanjogarciatorres43 juanjogarciatorres43 commented Jan 23, 2026

@juanjogarciatorres43 juanjogarciatorres43 marked this pull request as ready for review January 23, 2026 08:11
@juanjogarciatorres43 juanjogarciatorres43 changed the title DRAFT: JIRA <-> IMPACT sync for Status and Priority JIRA <-> IMPACT sync for Status and Priority Jan 23, 2026
@codecov-commenter

codecov-commenter commented Jan 23, 2026

Codecov Report

❌ Patch coverage is 55.86207% with 128 lines in your changes missing coverage. Please review.
✅ Project coverage is 66.93%. Comparing base (00c08a8) to head (2cf8887).
⚠️ Report is 29 commits behind head on main.

Files with missing lines Patch % Lines
src/firefighter/raid/serializers.py 50.00% 55 Missing and 16 partials ⚠️
src/firefighter/raid/signals/incident_updated.py 58.97% 41 Missing and 7 partials ⚠️
src/firefighter/raid/utils.py 30.00% 5 Missing and 2 partials ⚠️
src/firefighter/raid/client.py 33.33% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #209      +/-   ##
==========================================
- Coverage   67.08%   66.93%   -0.16%     
==========================================
  Files         212      212              
  Lines       10029    10319     +290     
  Branches     1099     1158      +59     
==========================================
+ Hits         6728     6907     +179     
- Misses       3030     3100      +70     
- Partials      271      312      +41     
Flag Coverage Δ
unittests 66.93% <55.86%> (-0.16%) ⬇️


Collaborator

@NicolasLafitteMM NicolasLafitteMM left a comment


Code Review Summary

Thank you for this comprehensive PR! The bidirectional Jira sync is a great feature with excellent documentation and test coverage. However, I've identified a few critical issues that should be addressed before merging.

Critical Issues (Must Fix)

  1. Excessive WARNING logging - Normal operations logged as warnings will pollute production logs
  2. Missing null check - Potential AttributeError when accessing incident.id in loop prevention
  3. Error handling in MITIGATING transition - First transition failure prevents second attempt, leaving Jira in inconsistent state

Medium Priority Issues

  1. Code duplication - _normalize_cache_value() duplicated in two files
  2. Silent error handling - Returning True on failures masks problems
  3. N+1 query potential - JiraTicket lookup inside loop

I'll add inline comments on specific lines with detailed recommendations.

@NicolasLafitteMM
Collaborator

Code Review - Based on Actual Branch Code

I've now checked out the branch and reviewed the actual code. Here are my findings:

✅ What's Excellent

  1. Outstanding documentation - jira-integration.md is exceptionally detailed
  2. Good test coverage - New test file covers main scenarios
  3. Loop prevention mechanism - Cache-based approach is solid
  4. P4+ improvement - Relaxing key event requirements for lower priorities makes sense
  5. Proper null checks - Code handles null incidents correctly (e.g., serializers.py:556-558)
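For readers following along, the loop-prevention mechanism can be sketched as below. Function names and the cache-key format are assumptions reconstructed from snippets quoted later in this review; the real code uses Django's cache backend, not an in-process dict.

```python
# Simplified model of the cache-based loop prevention. Names and the key
# format are assumptions, not the exact implementation.
import time

_cache: dict[str, float] = {}  # key -> expiry timestamp (monotonic)

def _normalize_cache_value(value) -> str:
    if value is None:
        return ""
    if isinstance(value, str):
        return value.strip().lower()
    try:
        return str(int(value))
    except (TypeError, ValueError):
        return str(value).strip().lower()

def _key(incident_id, field, value) -> str:
    return f"sync:impact_to_jira:{incident_id}:{field}:{_normalize_cache_value(value)}"

def mark_impact_to_jira(incident_id, field, value, timeout: int = 30) -> None:
    # Called just before Impact pushes a change to Jira.
    _cache[_key(incident_id, field, value)] = time.monotonic() + timeout

def is_echo_from_impact(incident_id, field, value) -> bool:
    # Called by the Jira webhook handler: a matching fresh key means the
    # incoming change is an echo of our own push and must not sync back.
    return _cache.get(_key(incident_id, field, value), 0.0) > time.monotonic()

mark_impact_to_jira(42, "status", "Closed")
assert is_echo_from_impact(42, "status", " closed ")         # echo, skip
assert not is_echo_from_impact(42, "status", "in progress")  # real change
```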

🔴 Critical Issues (Must Fix)

I'll post these as separate comments with detailed recommendations.

@NicolasLafitteMM
Collaborator

🔴 Critical #1: Excessive WARNING logging

File: src/firefighter/raid/signals/incident_updated.py:66-72

Every normal incident update is logged as WARNING, which will severely pollute production logs:

logger.warning(
    "incident_updated handler invoked for incident #%s with status %s; updated_fields=%s event_type=%s",
    getattr(incident, "id", "unknown"),
    incident_update.status,
    updated_fields,
    incident_update.event_type,
)

Impact: This handler is called for EVERY incident status update. In production, this will generate thousands of WARNING logs daily for normal operations.

Fix:

logger.debug(  # or logger.info at most
    "incident_updated handler invoked for incident #%s with status %s; updated_fields=%s event_type=%s",
    getattr(incident, "id", "unknown"),
    incident_update.status,
    updated_fields,
    incident_update.event_type,
)

Principle: WARNING should be reserved for actual problems, not normal operations. This is especially important for monitoring and alerting systems.
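To illustrate the principle with the standard library alone (logger name and messages here are hypothetical): production handlers typically filter at WARNING, so per-update chatter logged at DEBUG disappears while real failures still surface.

```python
# Sketch: a handler filtering at WARNING drops routine debug chatter but
# keeps genuine problems. Logger name and messages are hypothetical.
import io
import logging

stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setLevel(logging.WARNING)  # typical production threshold

logger = logging.getLogger("firefighter.raid.demo")
logger.setLevel(logging.DEBUG)
logger.addHandler(handler)
logger.propagate = False

# Normal operation: belongs at DEBUG, filtered out by the handler.
logger.debug("incident_updated handler invoked for incident #%s", 42)
# Actual problem: warrants WARNING, reaches the handler.
logger.warning("Jira transition failed for incident #%s", 42)

handler.flush()
output = stream.getvalue()
assert "handler invoked" not in output
assert "transition failed" in output
```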

@NicolasLafitteMM
Collaborator

🔴 Critical #2: Error handling in MITIGATING double transition

File: src/firefighter/raid/signals/incident_updated.py:98-125

When transitioning to MITIGATING status, if the first transition ("Pending resolution") fails, the return on line 119 prevents attempting the second transition ("in progress"), leaving Jira in an inconsistent state:

if incident_update.status == IncidentStatus.MITIGATING:
    for step in ("Pending resolution", "in progress"):
        try:
            client.transition_issue_auto(
                incident.jira_ticket.id, step, RAID_JIRA_WORKFLOW_NAME
            )
        except Exception:
            logger.exception(...)
            return  # ❌ Gives up completely - second transition never attempted

Problem Scenarios:

  1. First transition fails → Jira stuck at "Pending resolution"
  2. First succeeds but second fails → Jira stuck at "Pending resolution" instead of "in progress"
  3. Both fail → Same result, but we can't distinguish the cases

Recommendation:
Don't return immediately. Either:

  • Option A: Continue to second transition and let it fail/succeed independently
  • Option B: Check current Jira status before each transition
  • Option C: Collect errors and handle appropriately
if incident_update.status == IncidentStatus.MITIGATING:
    errors = []
    for step in ("Pending resolution", "in progress"):
        try:
            logger.debug(...)
            client.transition_issue_auto(
                incident.jira_ticket.id, step, RAID_JIRA_WORKFLOW_NAME
            )
        except Exception as e:
            logger.exception(...)
            errors.append((step, e))
            # Don't return - continue to next step
    
    if not errors:
        logger.info("Successfully transitioned through both steps")
    elif len(errors) == 2:
        logger.error("Both transitions failed")
    else:
        logger.warning("Partial transition completed")
    return
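To make Option C concrete, here is a self-contained sketch with a stubbed client. The stub and the helper name are hypothetical; the step names match the snippet above.

```python
# Self-contained sketch of Option C: collect per-step errors instead of
# returning on the first failure. StubJiraClient and
# transition_to_mitigating are hypothetical illustration names.
class StubJiraClient:
    def __init__(self, failing_steps):
        self.failing = set(failing_steps)
        self.applied = []

    def transition_issue_auto(self, issue_id, step, workflow):
        if step in self.failing:
            raise RuntimeError(f"transition to {step!r} failed")
        self.applied.append(step)

def transition_to_mitigating(client, issue_id, workflow="RAID"):
    errors = []
    for step in ("Pending resolution", "in progress"):
        try:
            client.transition_issue_auto(issue_id, step, workflow)
        except Exception as exc:
            errors.append((step, exc))  # keep going: try the next step
    return errors

# First transition fails, but the second is still attempted:
client = StubJiraClient(failing_steps={"Pending resolution"})
errors = transition_to_mitigating(client, "JIRA-1")
assert [step for step, _ in errors] == ["Pending resolution"]
assert client.applied == ["in progress"]
```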

@NicolasLafitteMM
Collaborator

⚠️ Medium #3: Code duplication - _normalize_cache_value()

Files:

  • src/firefighter/raid/serializers.py:58-67
  • src/firefighter/raid/signals/incident_updated.py:29-39

The function is duplicated in both files with nearly identical implementation (minor comment difference).

Recommendation:
Extract to raid/utils.py or raid/cache_utils.py:

# raid/cache_utils.py
from typing import Any

def normalize_cache_value(value: Any) -> str:
    """Normalize cache values for loop-prevention keys.
    
    Ensures consistent cache key generation across Impact→Jira and Jira→Impact syncs.
    """
    if value is None:
        return ""
    if isinstance(value, str):
        return value.strip().lower()
    try:
        return str(int(value))
    except (TypeError, ValueError):
        return str(value).strip().lower()

Then import in both files:

from firefighter.raid.cache_utils import normalize_cache_value

Benefits:

  • DRY principle
  • Single source of truth for normalization logic
  • Easier to test and maintain
  • Ensures consistency if logic needs to change
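As a quick illustration of why one shared implementation matters: Jira webhooks report priority as a string while Impact stores an int, and status casing/whitespace differ between the two systems, yet both directions must derive the same loop-prevention key. The sample values below are hypothetical.

```python
# Copy of the proposed shared helper, exercised with hypothetical values
# from both sync directions.
from typing import Any

def normalize_cache_value(value: Any) -> str:
    if value is None:
        return ""
    if isinstance(value, str):
        return value.strip().lower()
    try:
        return str(int(value))
    except (TypeError, ValueError):
        return str(value).strip().lower()

# Jira sends priority as "4", Impact stores 4 -> same key material:
assert normalize_cache_value("4") == normalize_cache_value(4) == "4"
# Casing and whitespace differences also collapse to one key:
assert normalize_cache_value(" Closed ") == normalize_cache_value("closed")
assert normalize_cache_value(None) == ""
```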

@NicolasLafitteMM
Collaborator

⚠️ Medium #4: Silent error handling in postmortem validation

File: src/firefighter/raid/serializers.py:468-489

Returning True after exceptions suggests success, but the sync was actually skipped:

if incident.needs_postmortem and impact_status == IncidentStatus.CLOSED:
    if not hasattr(incident, "jira_postmortem_for"):
        logger.warning(...)
        return True  # ❌ Misleading - nothing was synced
    
    try:
        is_ready, current_status = jira_postmortem_service.is_postmortem_ready(...)
        if not is_ready:
            logger.warning(...)
            return True  # ❌ Misleading - sync was skipped
    except Exception:
        logger.exception(...)
        return True  # ❌ Misleading - error occurred

Problem:

  • Monitoring systems can't detect these failures
  • Debugging is harder when issues are masked as successes
  • Creates silent failures in production

Recommendations:

Option A (Preferred): Return False to indicate failure:

if not hasattr(incident, "jira_postmortem_for"):
    logger.warning(...)
    return False  # ✅ Clearly indicates sync didn't happen

try:
    is_ready, current_status = jira_postmortem_service.is_postmortem_ready(...)
    if not is_ready:
        logger.warning(...)
        return False  # ✅ Indicates sync was intentionally skipped
except Exception:
    logger.exception(...)
    return False  # ✅ Indicates failure

Option B: At minimum, add clear comments explaining why True is acceptable:

# Returning True because skipping sync is acceptable behavior,
# not an error condition - the webhook was processed successfully
return True

@NicolasLafitteMM
Collaborator

💡 Suggestion #5: Potential N+1 query in webhook processing

File: src/firefighter/raid/serializers.py:328-332, 449, 514

The _get_incident_from_jira_ticket() method is called inside the loop for each change item:

def create(self, validated_data: dict[str, Any]) -> bool:
    changes = validated_data.get("changelog", {}).get("items") or []
    
    for change_item in changes:  # ← Loop over changes
        self._sync_jira_fields_to_incident(...)  # ← Calls _get_incident_from_jira_ticket

def _handle_status_update(...):
    incident = self._get_incident_from_jira_ticket(jira_ticket_key)  # ← DB query

def _handle_priority_update(...):
    incident = self._get_incident_from_jira_ticket(jira_ticket_key)  # ← DB query

Impact:

  • If a webhook has multiple changes (e.g., status + priority), the same JiraTicket is fetched multiple times
  • Minor performance issue, but can accumulate at scale

Optimization:
Fetch once at the beginning of create():

def create(self, validated_data: dict[str, Any]) -> bool:
    changes = validated_data.get("changelog", {}).get("items") or []
    if not changes:
        return True
    
    jira_ticket_key = validated_data["issue"].get("key")
    # Fetch JiraTicket + Incident once, outside the loop
    incident = self._get_incident_from_jira_ticket(jira_ticket_key)
    if incident is None:
        return True  # No incident linked, skip all changes
    
    for change_item in changes:
        self._sync_jira_fields_to_incident(
            validated_data, incident, change_item  # Pass pre-fetched incident
        )
    return True

Note: This is a minor optimization. In practice, webhooks rarely have multiple relevant changes, so the impact is small. Consider this a "nice to have" rather than critical.

@NicolasLafitteMM
Collaborator

💡 Suggestion #6: Cache timeout configuration

File: src/firefighter/raid/signals/incident_updated.py:42-48

The 30-second cache timeout is hardcoded:

def _set_impact_to_jira_cache(
    incident_id: Any, field: str, value: Any, timeout: int = 30
) -> None:

Concern:
If Jira webhooks are delayed by more than 30 seconds (network issues, Jira load, etc.), the loop prevention cache will have expired, potentially causing loops.

Recommendation:
Make it configurable via Django settings:

from django.conf import settings

# Default to 60s for safer margin
JIRA_SYNC_CACHE_TIMEOUT = getattr(settings, 'JIRA_SYNC_CACHE_TIMEOUT', 60)

def _set_impact_to_jira_cache(
    incident_id: Any, 
    field: str, 
    value: Any, 
    timeout: int = JIRA_SYNC_CACHE_TIMEOUT
) -> None:
    cache_key = (
        f"sync:impact_to_jira:{incident_id}:{field}:{_normalize_cache_value(value)}"
    )
    cache.set(cache_key, value=True, timeout=timeout)

Benefits:

  • Ops team can adjust based on observed webhook latency
  • Easy to increase if loops are detected
  • No code changes needed to tune in production

Alternative: Just increase the default to 60 seconds if configurability isn't needed.

@NicolasLafitteMM
Collaborator

💡 Code Quality #7: Magic strings for Jira statuses

File: src/firefighter/raid/signals/incident_updated.py:99, 140, etc.

Jira status strings are hardcoded throughout:

for step in ("Pending resolution", "in progress"):  # Line 99
    ...

target_jira_status = "Closed"  # Line 140

Recommendation:
Extract as module-level constants near the status map:

# Module-level constants
JIRA_STATUS_INCOMING = "Incoming"
JIRA_STATUS_PENDING_RESOLUTION = "Pending resolution"
JIRA_STATUS_IN_PROGRESS = "in progress"
JIRA_STATUS_REPORTER_VALIDATION = "Reporter validation"
JIRA_STATUS_CLOSED = "Closed"

IMPACT_TO_JIRA_STATUS_MAP: dict[IncidentStatus, str] = {
    IncidentStatus.OPEN: JIRA_STATUS_INCOMING,
    IncidentStatus.INVESTIGATING: JIRA_STATUS_IN_PROGRESS,
    IncidentStatus.MITIGATING: JIRA_STATUS_IN_PROGRESS,
    IncidentStatus.MITIGATED: JIRA_STATUS_REPORTER_VALIDATION,
    IncidentStatus.POST_MORTEM: JIRA_STATUS_REPORTER_VALIDATION,
    IncidentStatus.CLOSED: JIRA_STATUS_CLOSED,
}

# Then use:
for step in (JIRA_STATUS_PENDING_RESOLUTION, JIRA_STATUS_IN_PROGRESS):
    ...

target_jira_status = JIRA_STATUS_CLOSED

Benefits:

  • Single source of truth
  • IDE autocomplete and type checking
  • Easier refactoring if Jira status names change
  • Reduces typo risks
  • More maintainable

@NicolasLafitteMM
Collaborator

❓ Questions for the Author

I have a few important questions:

1. Empty string in closure_reason constraint

File: src/firefighter/incidents/models/incident.py:291-292

check=models.Q(closure_reason__in=[*ClosureReason.values, ""])
    | models.Q(closure_reason__isnull=True),

Question: Why allow both empty string ("") AND NULL?

This creates ambiguity:

  • NULL = no closure reason
  • "" = empty closure reason (what does this mean?)
  • Valid enum value = actual reason

Is there a semantic difference between NULL and empty string? If not, consider removing the empty string option for a clearer data model.
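To make the ambiguity concrete, a plain-Python sketch of the three states the current constraint permits (the enum values here are hypothetical, not the project's actual ClosureReason values):

```python
# Hypothetical closure reasons, standing in for ClosureReason.values.
CLOSURE_REASONS = {"resolved", "duplicate", "false_positive"}

def describe(closure_reason):
    # Mirrors the constraint: NULL, "", and valid enum values all pass,
    # which leaves two distinct "no reason" representations.
    if closure_reason is None:
        return "no closure reason"
    if closure_reason == "":
        return "empty closure reason (meaning unclear)"
    if closure_reason in CLOSURE_REASONS:
        return f"closed as {closure_reason}"
    raise ValueError("violates closure_reason_valid constraint")

# Three distinct states the current constraint permits:
assert describe(None) == "no closure reason"
assert describe("") == "empty closure reason (meaning unclear)"
assert describe("resolved") == "closed as resolved"
```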


2. P4+ closure without key events

File: src/firefighter/incidents/models/incident.py:418-428

# Require key events only for P1-P3; P4+ can close without them.
require_milestones = not (self.priority and self.priority.value >= 4)

Questions:

  • Was this behavior change discussed with stakeholders/product?
  • Is it documented in user-facing documentation?
  • What's the rationale for allowing P4+ to skip key events?

This is a significant behavior change that affects how incidents are tracked.


3. Database Migrations

Are there Django migrations for:

  • The modified closure_reason_valid constraint?
  • Any other schema changes?

4. Deployment & Rollout Strategy

  • Feature flag: Can this bidirectional sync be enabled/disabled without deployment?
  • Gradual rollout: Will this be enabled in staging first, then production?
  • Rollback plan: What's the plan if issues arise in production?
  • Monitoring: Are there plans for metrics/alerts on:
    • Sync success/failure rates
    • Loop prevention cache hits
    • Sync latency
    • Jira API errors

5. Testing in Production

Have you considered:

  • Load testing with multiple concurrent updates?
  • Webhook delay testing (what if Jira is slow)?
  • Error scenarios (Jira API down, timeout, rate limits)?

Would appreciate your thoughts on these points! 🙏

@NicolasLafitteMM
Collaborator

🎯 Final Verdict & Summary

Great work on this PR! The bidirectional sync is a valuable feature with excellent documentation.


✅ Strengths

  1. Exceptional documentation - jira-integration.md is thorough and detailed
  2. Good test coverage - Main scenarios are tested
  3. Smart loop prevention - Cache-based mechanism is well thought out
  4. Proper null checks - Code handles edge cases correctly
  5. P4+ improvement - Relaxing key events makes sense for lower priorities
  6. CI passing - All checks green ✅

🔴 Blocking Issues (2)

Must be fixed before merge:

  1. Excessive WARNING logging (line 66) - Will pollute production logs
  2. Error handling in MITIGATING transition (line 119) - Can leave Jira in inconsistent state

⚠️ Non-Blocking Issues (5)

Can be addressed in follow-up PRs:

  1. Code duplication - _normalize_cache_value() in two files
  2. Silent error handling - Returning True on failures
  3. N+1 query potential - Minor performance issue
  4. Cache timeout - Consider making configurable
  5. Magic strings - Extract Jira status constants

❓ Questions (5)

Need clarification from author - see previous comment.


📊 Recommendation

Status: ⚠️ Request Changes

Next Steps:

  1. Fix the 2 blocking issues (#1 and #2)
  2. Answer the questions when you have time
  3. Consider the non-blocking issues for follow-up PRs
  4. Re-request review once fixes are pushed

Note: Issues #3-7 are NOT blocking - you can address them later if preferred. The important fixes are #1 and #2.


Let me know if you need clarification on any of these comments! Happy to discuss approaches. 👍

@juanjogarciatorres43
Collaborator Author

juanjogarciatorres43 commented Feb 4, 2026

Replying to ⚠️ Medium #4: Silent error handling in postmortem validation (src/firefighter/raid/serializers.py:468-489):

In this case, return True does not mean the sync happened; it means the signal was handled in the expected way. A clarifying comment has been added to the code.

@juanjogarciatorres43
Collaborator Author

Replying to 💡 Suggestion #6: Cache timeout configuration (src/firefighter/raid/signals/incident_updated.py:42-48):

Added config.

@juanjogarciatorres43 juanjogarciatorres43 merged commit 8ceafa1 into main Feb 5, 2026
7 checks passed
@juanjogarciatorres43 juanjogarciatorres43 deleted the feat/impact_jira_status_sync branch February 5, 2026 08:41