|
| 1 | +--- |
| 2 | +layout: post |
| 3 | +title: "From Celery/Redis to Temporal: A Journey Toward Idempotency and Reliable Workflows" |
| 4 | +date: 2026-01-08 |
| 5 | +category: Company Work |
| 6 | +tags: Temporal Celery Redis Idempotency Architecture Python Backend |
| 7 | +description: "Sharing our experience of migrating complex asynchronous KYC processing from Celery to Temporal to ensure idempotency and enhance system reliability." |
| 8 | +--- |
| 9 | + |
| 10 | +When handling asynchronous tasks in distributed systems, the combination of Celery and Redis is often the go-to choice. I also chose Celery for the initial design of our KYC (Know Your Customer) orchestrator due to its familiarity. However, as the service grew in complexity, we hit a massive wall: guaranteeing **idempotency** and managing **complex states**. |
| 11 | + |
| 12 | +In this post, I want to share our technical journey of why we moved away from Celery to Temporal and how we ensured idempotency during that process. |
| 13 | + |
| 14 | +--- |
| 15 | + |
| 16 | +## 1. Limitations of Celery/Redis: Why Change Was Necessary? |
| 17 | + |
| 18 | +### Difficulties in Idempotency Management |
| 19 | +While Celery is excellent for "Fire and Forget" tasks, there's a high risk of duplicate execution during retries caused by network failures or worker downs. Especially for face recognition tasks that consume significant GPU resources, duplicate execution was critical in terms of both cost and performance. |
| 20 | + |
| 21 | +### Fragmentation of State |
| 22 | +The KYC process follows this sequence: |
| 23 | +1. User uploads an ID card image. |
| 24 | +2. User uploads a selfie video. |
| 25 | +3. Compare face similarity once both files exist. |
| 26 | + |
| 27 | +In a Celery environment, since we don't know when images and videos will be uploaded, we needed complex logic to query the DB every time or store intermediate states in Redis. The logic to check "Are all files collected?" was scattered across multiple places, making maintenance difficult. |
| 28 | + |
| 29 | +--- |
| 30 | + |
| 31 | +## 2. Introducing Temporal: A Paradigm Shift in Orchestration |
| 32 | + |
| 33 | +Temporal is not just a message queue; it's a **Stateful Workflows** engine. |
| 34 | + |
| 35 | +### Workflow Logic Must Be "Deterministic" |
| 36 | +Since Temporal workflow code is based on the premise of "Replay," it must always produce the same sequence of workflow API calls for the same input and history. Therefore, you should not directly perform "external-world-dependent operations" like network I/O, file I/O, system time (e.g., `DateTime.now`), randomness, or threading within a workflow. These side effects should be pushed to activities, while the workflow focuses solely on orchestration. |
| 37 | + |
| 38 | +Official Docs: https://docs.temporal.io/develop/python/core-application#workflow-logic-requirements |
| 39 | + |
| 40 | +### Workflow-Centric Design |
| 41 | +The first thing that changed after introducing Temporal was the visibility of business logic. `FaceSimilarityWorkflow` now gracefully waits until files are ready. |
| 42 | + |
| 43 | +```python |
| 44 | +# Core logic of FaceSimilarityWorkflow |
| 45 | +@workflow.run |
| 46 | +async def run(self, data: SimilarityData) -> SimilarityResult: |
| 47 | + # Wait up to 1 hour until both image and video are collected |
| 48 | + await workflow.wait_condition( |
| 49 | + lambda: any(f["type"] == "image" for f in self._files) |
| 50 | + and any(f["type"] == "video" for f in self._files), |
| 51 | + timeout=timedelta(hours=1), |
| 52 | + ) |
| 53 | + |
| 54 | + # Execute GPU activity once all files are ready |
| 55 | + result = await workflow.execute_activity( |
| 56 | + check_face_similarity_activity, |
| 57 | + activity_data, |
| 58 | + retry_policy=RetryPolicy(maximum_attempts=3) |
| 59 | + ) |
| 60 | + return result |
| 61 | +``` |
| 62 | + |
| 63 | +This code uses `workflow.wait_condition` to suspend the workflow until the condition is met without blocking the event loop. In Celery, this would have required complex polling or webhook logic. |
| 64 | + |
| 65 | +--- |
| 66 | + |
| 67 | +## 3. Idempotency Strategy: Building a Double Defense |
| 68 | + |
| 69 | +Even with Temporal, idempotency at the activity level remains crucial. We established a double defense strategy as follows. |
| 70 | + |
| 71 | +### Step 1: Temporal's Basic Guarantee |
| 72 | +Temporal records the progress of a workflow as event history. Therefore, even if a worker restarts, it resumes exactly from the last successful point. |
| 73 | + |
| 74 | +### Step 2: Explicit Checks within Activities |
| 75 | +Since Temporal activities follow an "at-least-once" execution model, an activity might be retried if a worker crashes after successfully performing it but before notifying the server. Thus, official documentation strongly recommends making activities idempotent. |
| 76 | + |
| 77 | +Official Docs: https://docs.temporal.io/develop/python/error-handling#make-activities-idempotent |
| 78 | + |
| 79 | +In practice, we use the following two together: |
| 80 | +- For external system calls, pass an **idempotency key** combined from the workflow execution and activity identifiers. |
| 81 | +- Internally, use unique keys (or check for existing results) in the DB to prevent **duplicate storage/processing**. |
| 82 | + |
| 83 | +```python |
| 84 | +@activity.defn |
| 85 | +async def check_face_similarity_activity(data: SimilarityData) -> SimilarityResult: |
| 86 | + info = activity.info() |
| 87 | + idempotency_key = f"{info.workflow_run_id}-{info.activity_id}" |
| 88 | + session_id = data["session_id"] |
| 89 | + |
| 90 | + with get_db_context() as db: |
| 91 | + existing = ( |
| 92 | + db.query(FaceSimilarity) |
| 93 | + .filter(FaceSimilarity.idempotency_key == idempotency_key) |
| 94 | + .first() |
| 95 | + ) |
| 96 | + if existing: |
| 97 | + return SimilarityResult(success=True, message="Already processed.") |
| 98 | + |
| 99 | + # Perform actual GPU-intensive work... |
| 100 | +``` |
| 101 | + |
| 102 | +--- |
| 103 | + |
| 104 | +## 4. Results: What Has Changed? |
| 105 | + |
| 106 | +| Comparison Item | Celery/Redis Based | Temporal Based | |
| 107 | +| :--- | :--- | :--- | |
| 108 | +| **State Management** | Manual storage in DB/Redis | Automatically managed by engine | |
| 109 | +| **Retry Strategy** | Manual exponential backoff | Declarative Retry Policy | |
| 110 | +| **Visibility** | Must dig through logs | Check history in Temporal UI | |
| 111 | +| **Idempotency** | Very difficult to guarantee | Structurally achievable | |
| 112 | + |
| 113 | +## Conclusion |
| 114 | + |
| 115 | +The transition from Celery to Temporal was not just about changing tools; it was about changing **how we define business processes in code**. Especially in financial/authentication systems where idempotency is paramount, Temporal provided irreplaceable stability. |
| 116 | + |
| 117 | +If you are losing sleep over complex asynchronous logic and idempotency issues, I strongly recommend migrating to Temporal. |
0 commit comments