Commit 9d27a81

Update ALIGNMENT_INTEGRATION_WORKFLOW.md
1 parent 7127a56 commit 9d27a81

1 file changed: +0 -88 lines changed
ALIGNMENT_INTEGRATION_WORKFLOW.md

Lines changed: 0 additions & 88 deletions
@@ -326,91 +326,3 @@ These allow users to:
- Track which features are aligned together via `alignment_group_id`
- Find the reference feature that was used for alignment
- Filter or analyze separately if needed

## Technical Implementation Details

### Precision Preservation for Large Feature IDs

Large integer feature IDs (e.g., `5,405,272,318,039,692,409`) require special handling to prevent precision loss during database operations and pandas DataFrame creation.

#### The Problem
- Feature IDs can exceed 2^53, the limit beyond which float64 can no longer represent every integer exactly
- When pandas reads INTEGER columns from databases without explicit typing, it may infer float64 dtype
- This causes precision loss: `5,405,272,318,039,692,409` → `5,405,272,318,039,692,288`

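
The rounding described above can be reproduced without a database. The following is a minimal sketch using only built-in Python types; the helper name `roundtrip_through_float64` is made up for illustration. It relies on Python's `float` being an IEEE 754 float64, which is the same representation pandas uses for a `float64` column.

```python
FEATURE_ID = 5_405_272_318_039_692_409  # example ID quoted in the text


def roundtrip_through_float64(value: int) -> int:
    """Convert an int to a Python float (IEEE 754 float64) and back."""
    return int(float(value))


# Up to 2**53 every integer survives the round trip unchanged.
assert roundtrip_through_float64(2**53) == 2**53

# Just above 2**53 the representable values are 2 apart, so 2**53 + 1
# is rounded to a neighbour.
assert roundtrip_through_float64(2**53 + 1) == 2**53

# Near 5.4e18 the representable values are 1024 apart, so the low digits
# of the feature ID are lost.
damaged = roundtrip_through_float64(FEATURE_ID)
print(f"{FEATURE_ID:,} -> {damaged:,}")
print("difference:", FEATURE_ID - damaged)
```

The damaged value is still a valid-looking ID, which is why this kind of loss tends to surface only later as failed joins or lookups.
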
#### The Solution
SQL queries use explicit CAST operations in SELECT clauses (but NOT in JOIN conditions):

```sql
-- OSW (SQLite)
SELECT CAST(FEATURE.ID AS INTEGER) AS id,
       CAST(FEATURE_MS2_ALIGNMENT.REFERENCE_FEATURE_ID AS INTEGER) AS alignment_reference_feature_id
FROM ...

-- Parquet (DuckDB)
SELECT CAST(fa.REFERENCE_FEATURE_ID AS BIGINT) AS REFERENCE_FEATURE_ID
FROM ...
```

**Key Design Principles:**
1. **CAST in SELECT**: Ensures pandas reads columns as integers, preserving precision
2. **No CAST in JOIN**: Database can use indexes for fast lookups (~16 seconds vs ~50 minutes)
3. **Post-query conversion**: After reading, convert to pandas Int64 dtype for nullable integer support

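
As a rough illustration of the first two principles, the sketch below builds a tiny in-memory SQLite database and runs a query that casts only in the SELECT list while joining on the raw columns. The two-table schema and the index name `idx_aligned` are assumptions made for this example, a heavily reduced stand-in for the real OSW tables. The query plan is printed rather than asserted, because the planner's choices depend on the SQLite version and table statistics.

```python
import sqlite3

BIG_ID = 5_405_272_318_039_692_409  # example ID quoted in the text

conn = sqlite3.connect(":memory:")
# Hypothetical, reduced schema for illustration only.
conn.executescript(
    """
    CREATE TABLE FEATURE (ID INTEGER PRIMARY KEY, RUN_ID INTEGER);
    CREATE TABLE FEATURE_MS2_ALIGNMENT (
        ALIGNED_FEATURE_ID INTEGER,
        REFERENCE_FEATURE_ID INTEGER
    );
    CREATE INDEX idx_aligned ON FEATURE_MS2_ALIGNMENT (ALIGNED_FEATURE_ID);
    """
)
conn.execute("INSERT INTO FEATURE VALUES (?, ?)", (BIG_ID, 1))
conn.execute("INSERT INTO FEATURE VALUES (?, ?)", (BIG_ID - 1, 2))
conn.execute(
    "INSERT INTO FEATURE_MS2_ALIGNMENT VALUES (?, ?)", (BIG_ID, BIG_ID - 1)
)

# CAST appears only in the SELECT list; the JOIN compares the raw columns,
# which leaves the planner free to use an index.
QUERY = """
    SELECT CAST(FEATURE.ID AS INTEGER) AS id,
           CAST(FEATURE_MS2_ALIGNMENT.REFERENCE_FEATURE_ID AS INTEGER)
               AS alignment_reference_feature_id
    FROM FEATURE
    LEFT JOIN FEATURE_MS2_ALIGNMENT
           ON FEATURE_MS2_ALIGNMENT.ALIGNED_FEATURE_ID = FEATURE.ID
    ORDER BY FEATURE.ID
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)  # IDs come back as exact Python ints, NULLs as None

# Print the planner's description of each step (last column of each row).
for step in conn.execute("EXPLAIN QUERY PLAN " + QUERY):
    print(step[-1])
```

The feature without an alignment row comes back with `None` in the reference column, which is the case the nullable `Int64` dtype is meant to handle once the result reaches pandas.
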
```python
# After reading from the database, convert ID columns to pandas' nullable
# Int64 dtype so that missing values do not force a float64 column.
# `data` is the DataFrame returned by the query.
if "alignment_reference_feature_id" in data.columns:
    data["alignment_reference_feature_id"] = data["alignment_reference_feature_id"].astype("Int64")
if "id" in data.columns:
    data["id"] = data["id"].astype("Int64")
```

### Alignment Group ID Assignment

The `alignment_group_id` is computed with `DENSE_RANK()`, which assigns one consecutive integer to each distinct (`PRECURSOR_ID`, `ALIGNMENT_ID`) pair, so every alignment group receives a unique identifier:

```sql
SELECT DENSE_RANK() OVER (ORDER BY PRECURSOR_ID, ALIGNMENT_ID) AS alignment_group_id,
       ALIGNED_FEATURE_ID AS id,
       REFERENCE_FEATURE_ID AS alignment_reference_feature_id
FROM FEATURE_MS2_ALIGNMENT
```

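
To make the ranking concrete, here is a small sketch that runs the same `DENSE_RANK()` expression against an in-memory SQLite table. The four rows are invented for illustration, and the table contains only the columns the query touches. Window functions need SQLite 3.25 or newer, which is assumed here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE FEATURE_MS2_ALIGNMENT (
        PRECURSOR_ID INTEGER,
        ALIGNMENT_ID INTEGER,
        ALIGNED_FEATURE_ID INTEGER,
        REFERENCE_FEATURE_ID INTEGER
    )
    """
)
# Made-up rows: precursor 10 has two alignments, precursor 20 has one.
conn.executemany(
    "INSERT INTO FEATURE_MS2_ALIGNMENT VALUES (?, ?, ?, ?)",
    [
        (10, 1, 101, 100),
        (10, 1, 102, 100),
        (10, 2, 201, 200),
        (20, 1, 301, 300),
    ],
)

ranked = conn.execute(
    """
    SELECT DENSE_RANK() OVER (ORDER BY PRECURSOR_ID, ALIGNMENT_ID)
               AS alignment_group_id,
           ALIGNED_FEATURE_ID AS id,
           REFERENCE_FEATURE_ID AS alignment_reference_feature_id
    FROM FEATURE_MS2_ALIGNMENT
    ORDER BY ALIGNED_FEATURE_ID
    """
).fetchall()

for group_id, feature_id, reference_id in ranked:
    print(group_id, feature_id, reference_id)
```

Rows that share a (`PRECURSOR_ID`, `ALIGNMENT_ID`) pair receive the same rank, and ranks increase without gaps, which is the difference between `DENSE_RANK()` and `RANK()`.
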
#### Assigning Group IDs to Reference Features

Reference features (those that aligned features point to) also need to receive their `alignment_group_id`. This is handled in post-processing:

```python
import pandas as pd

# `data` is the DataFrame produced by the query above.

# 1. Extract mapping: reference_feature_id -> alignment_group_id
ref_mapping = data[
    data["alignment_reference_feature_id"].notna()
][["alignment_reference_feature_id", "alignment_group_id"]].drop_duplicates()

# 2. Create reverse mapping: id -> alignment_group_id for references
ref_group_mapping = ref_mapping.rename(
    columns={"alignment_reference_feature_id": "id",
             "alignment_group_id": "ref_alignment_group_id"}
)

# 3. Merge to assign group IDs to reference features
data = pd.merge(data, ref_group_mapping, on="id", how="left")

# 4. Fill in alignment_group_id where it's null but ref_alignment_group_id exists
mask = data["alignment_group_id"].isna() & data["ref_alignment_group_id"].notna()
data.loc[mask, "alignment_group_id"] = data.loc[mask, "ref_alignment_group_id"]
```

**Result:** All features in an alignment group (both aligned and reference features) share the same `alignment_group_id`, enabling:
- Tracking which features are aligned together
- Identifying the reference feature for each alignment group
- Analyzing alignment quality across related features

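
The effect of this post-processing can be traced on a handful of rows. The sketch below is a plain-Python analogue of the merge-and-fill steps, using a dictionary lookup instead of `pd.merge`. The feature IDs and group numbers are invented, and it assumes each reference feature belongs to exactly one alignment group.

```python
# Hypothetical rows: 101 and 102 were aligned to reference 100, 201 was
# aligned to reference 200, and 300 took no part in any alignment.
features = [
    {"id": 100, "alignment_reference_feature_id": None, "alignment_group_id": None},
    {"id": 101, "alignment_reference_feature_id": 100, "alignment_group_id": 1},
    {"id": 102, "alignment_reference_feature_id": 100, "alignment_group_id": 1},
    {"id": 200, "alignment_reference_feature_id": None, "alignment_group_id": None},
    {"id": 201, "alignment_reference_feature_id": 200, "alignment_group_id": 2},
    {"id": 300, "alignment_reference_feature_id": None, "alignment_group_id": None},
]

# Steps 1-2: map each reference feature ID to the group of its aligned features.
ref_to_group = {
    f["alignment_reference_feature_id"]: f["alignment_group_id"]
    for f in features
    if f["alignment_reference_feature_id"] is not None
}

# Steps 3-4: fill the group ID only where it is missing and a mapping exists.
for f in features:
    if f["alignment_group_id"] is None and f["id"] in ref_to_group:
        f["alignment_group_id"] = ref_to_group[f["id"]]

# Collect the members of each group to inspect the result.
groups = {}
for f in features:
    groups.setdefault(f["alignment_group_id"], []).append(f["id"])
print(groups)
```

After the fill, each reference feature sits in the same group as the features aligned to it, while the unaligned feature keeps a missing group ID.
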
### Performance Considerations

| Approach | Query Time | Precision | Index Usage |
|----------|-----------|-----------|-------------|
| No CAST | ~16 sec | ❌ Lost | ✅ Yes |
| CAST in JOIN | ~50 min | ✅ Preserved | ❌ No |
| CAST in SELECT | ~16 sec | ✅ Preserved | ✅ Yes |

**Conclusion:** CAST in SELECT clause provides both precision preservation and optimal performance.
