Initial State: The Airflow DAG container abruptly terminated during RandomizedSearchCV (Exit Code 137).
The Fixes:
- Smart Undersampling (Lines ~287-292): Deleted the blind
data.sample(n=150000)code. Wrote targeted separation logic:This single change preserved 100% of the minority incidents while preventing RAM exhaustion.fraud_cases = data[data['is_fraud'] == 1] legit_cases = data[data['is_fraud'] == 0].sample(n=150000, random_state=42) data = pd.concat([fraud_cases, legit_cases])
- XGBoost Threading (Line ~352): Redefined
n_jobs=1insideRandomizedSearchCVinstead of-1to prevent CPU parallel threads from violently multiplying maximum memory allocation boundaries footprint. - Optimization Balance (Line ~335):
n_iter=20andSMOTE(sampling_strategy=0.5)to violently bias the data for maximum precision.
Initial State: Airflow Model trained on Python 3.12 (Scikit 1.6+). Inference decoded on Python 3.10 (Scikit 1.5.0). The Fix:
- Modified
airflow/requirements.txt(Line 11) directly fromscikit-learntoscikit-learn==1.5.0. - Modified
inference/requirements.txt(Line 7) directly fromscikit-learntoscikit-learn==1.5.0, enforcing absolute matching class signatures and fixingimblearnvsimbalanced-learncasing errors.
Initial State: Exception thrown automatically: Invalid transaction: 'M' is not of type 'integer'.
The Fix:
- Navigated to
TRANSACTION_SCHEMAdictionary (Line ~60). - Changed
"gender": {"type": "integer"}directly to"gender": {"type": "string"}allowing"M"/"F"formats to cross the barrier safely.
Initial State: Confluent Cloud disables auto-topic generation. Both the producer and inference containers stalled due to interacting with null Kafka sockets. The Fix:
- Bypassed the Docker container to execute raw runtime injection. Ran the following Python script against the
src-producer-1container twice:admin = AdminClient(conf) admin.create_topics([NewTopic('transactions', num_partitions=1)]) admin.create_topics([NewTopic('fraud_predictions', num_partitions=1)])
Initial State: Inference threw SASL Auth failures. The Fix:
- Fixed line 343 in
docker-compose.ymlto securely mount the environment variable configuration:- Incorrect:
- ./env:/app/.env - Corrected:
- ./.env:/app/.env
- Incorrect:
Initial State: Inference failed to output valid predictions due to a brutally aggressive probability ceiling. The Fix:
- Navigated to Line 345 inside the PySpark Pandas prediction UDF.
- Changed
threshold = 0.60explicitly tothreshold = 0.48. This perfectly offset the model's conservative bias.
- CLI Wrapper: Created
src/dags/predict_manual.pythat loads/app/models/fraud_detection_model.pklto process a hardcoded Pandas Dictionary representing exactly 1 transaction in 15 seconds. - Custom Streamlit Web Dashboard: Built a completely new
ui/app.pyinterface.- Used
@st.cache_resourcefor O(1) model loading. Bound input sliders to an interactive Dataframe triggeringpredict_proba. - Included a
requirements.txtidentifying a fatal internal module cache failure: Addeddillexplicitly (Line 6) because the Airflow DAG useddillcontext wrapping to initially pickle the SMOTE pipeline variables. - Resolved the final fatal Docker error (EOF memory dump during the UI compilation) by explicitly executing
docker builder prune -fin the host CLI to reclaim block storage memory, securing the final container launch on port 8501.
- Used