Update ALIGNMENT_INTEGRATION_WORKFLOW.md

singjc · web-flow · commit 9d27a8171268 · 2025-11-26T21:40:14.000-05:00
diff --git a/ALIGNMENT_INTEGRATION_WORKFLOW.md b/ALIGNMENT_INTEGRATION_WORKFLOW.md
@@ -326,91 +326,3 @@ These allow users to:
 - Track which features are aligned together via `alignment_group_id`
 - Find the reference feature that was used for alignment
 - Filter or analyze separately if needed
-
-## Technical Implementation Details
-
-### Precision Preservation for Large Feature IDs
-
-Large integer feature IDs (e.g., `5,405,272,318,039,692,409`) require special handling to prevent precision loss during database operations and pandas DataFrame creation.
-
-#### The Problem
-- Feature IDs can exceed 2^53, the point beyond which float64 can no longer represent every integer exactly
-- When pandas reads INTEGER columns from databases without explicit typing, it may infer float64 dtype
-- This causes precision loss: `5,405,272,318,039,692,409` → `5,405,272,318,039,692,288`
-
-#### The Solution
-SQL queries use explicit CAST operations in SELECT clauses (but NOT in JOIN conditions):
-
-```sql
--- OSW (SQLite)
-SELECT CAST(FEATURE.ID AS INTEGER) AS id,
-       CAST(FEATURE_MS2_ALIGNMENT.REFERENCE_FEATURE_ID AS INTEGER) AS alignment_reference_feature_id
-FROM ...
-
--- Parquet (DuckDB)
-SELECT CAST(fa.REFERENCE_FEATURE_ID AS BIGINT) AS REFERENCE_FEATURE_ID
-FROM ...
-```
-
-**Key Design Principles:**
-1. **CAST in SELECT**: Ensures pandas reads columns as integers, preserving precision
-2. **No CAST in JOIN**: Lets the database use indexes for fast lookups (~16 seconds, versus ~50 minutes with CAST in JOIN)
-3. **Post-query conversion**: After reading, convert to pandas Int64 dtype for nullable integer support
-
-```python
-# After reading from database, convert ID columns to the nullable Int64 dtype
-if "alignment_reference_feature_id" in data.columns:
-    data["alignment_reference_feature_id"] = data["alignment_reference_feature_id"].astype("Int64")
-if "id" in data.columns:
-    data["id"] = data["id"].astype("Int64")
-```
-
-### Alignment Group ID Assignment
-
-The `alignment_group_id` is computed using `DENSE_RANK()` to assign a unique identifier to each alignment group:
-
-```sql
-SELECT DENSE_RANK() OVER (ORDER BY PRECURSOR_ID, ALIGNMENT_ID) AS alignment_group_id,
-       ALIGNED_FEATURE_ID AS id,
-       REFERENCE_FEATURE_ID AS alignment_reference_feature_id
-FROM FEATURE_MS2_ALIGNMENT
-```
-
-#### Assigning Group IDs to Reference Features
-
-Reference features (those that aligned features point to) also need to receive their `alignment_group_id`. This is handled in post-processing:
-
-```python
-# 1. Extract mapping: reference_feature_id -> alignment_group_id
-ref_mapping = data[
-    data["alignment_reference_feature_id"].notna()
-][["alignment_reference_feature_id", "alignment_group_id"]].drop_duplicates()
-
-# 2. Create reverse mapping: id -> alignment_group_id for references
-ref_group_mapping = ref_mapping.rename(
-    columns={"alignment_reference_feature_id": "id",
-             "alignment_group_id": "ref_alignment_group_id"}
-)
-
-# 3. Merge to assign group IDs to reference features
-data = pd.merge(data, ref_group_mapping, on="id", how="left")
-
-# 4. Fill in alignment_group_id where it's null but ref_alignment_group_id exists
-mask = data["alignment_group_id"].isna() & data["ref_alignment_group_id"].notna()
-data.loc[mask, "alignment_group_id"] = data.loc[mask, "ref_alignment_group_id"]
-```
-
-**Result:** All features in an alignment group (both aligned and reference features) share the same `alignment_group_id`, enabling:
-- Tracking which features are aligned together
-- Identifying the reference feature for each alignment group
-- Analyzing alignment quality across related features
-
-### Performance Considerations
-
-| Approach | Query Time | Precision | Index Usage |
-|----------|-----------|-----------|-------------|
-| No CAST | ~16 sec | ❌ Lost | ✅ Yes |
-| CAST in JOIN | ~50 min | ✅ Preserved | ❌ No |
-| CAST in SELECT | ~16 sec | ✅ Preserved | ✅ Yes |
-
-**Conclusion:** CAST in SELECT clause provides both precision preservation and optimal performance.
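
For anyone reviewing this removal, the precision loss described in the deleted section can be reproduced with a minimal standard-library sketch. It uses the example feature ID quoted in the section; the names `FEATURE_ID`, `float64_round_trip`, `lossy`, and `spacing` are illustrative assumptions and are not taken from the project's code.

```python
# Example feature ID quoted in the removed section
FEATURE_ID = 5405272318039692409


def float64_round_trip(value: int) -> int:
    """Convert an int to float64 and back, as happens when a column is inferred as float64."""
    return int(float(value))


# float64 has a 53-bit significand, so for an integer that needs more bits
# the gap between neighbouring representable values is 2 ** (bit_length - 53)
spacing = 2 ** (FEATURE_ID.bit_length() - 53)

lossy = float64_round_trip(FEATURE_ID)

print(f"original:     {FEATURE_ID}")
print(f"via float64:  {lossy}")
print(f"spacing near this magnitude: {spacing}")
print(f"precision lost: {lossy != FEATURE_ID}")
```

The round-tripped value lands on a multiple of the local spacing, which is why the trailing digits of the ID change once a column passes through float64.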