Add include_version parameter to import_gencode and remove_y_par to add_gencode_transcript_annotations #806

Merged
jkgoodrich merged 5 commits into main from jg/updates_to_gencode_for_constraint on Sep 24, 2025

jkgoodrich commented Sep 24, 2025

Summary

This PR updates the GENCODE import and transcript annotation functionality to better handle transcript ID versions and PAR region duplicates.

Changes

1. Update import_gencode to keep full transcript ID including version if requested

  • When include_version=True, the function preserves the full gene_id and transcript_id, including version numbers, in separate fields (gene_id_version and transcript_id_version)
  • This allows downstream functions to access both the versioned and non-versioned IDs as needed
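
To illustrate the field layout described above, here is a minimal plain-Python sketch (not the Hail code in this PR). It assumes the version is everything after the first "." in the ID and that gene_id and transcript_id hold the unversioned form. The helper names split_versioned_id and add_version_fields are hypothetical.

```python
from typing import Dict, Tuple


def split_versioned_id(full_id: str) -> Tuple[str, str]:
    """Return (ID without version, full ID) for an ID such as 'ENST00000456328.2'."""
    # Assumption: everything before the first "." is the unversioned ID.
    base_id = full_id.split(".", 1)[0]
    return base_id, full_id


def add_version_fields(record: Dict[str, str]) -> Dict[str, str]:
    """Hypothetical helper: keep both versioned and unversioned IDs on a record."""
    gene_id, gene_id_version = split_versioned_id(record["gene_id"])
    transcript_id, transcript_id_version = split_versioned_id(record["transcript_id"])
    return {
        **record,
        "gene_id": gene_id,
        "gene_id_version": gene_id_version,
        "transcript_id": transcript_id,
        "transcript_id_version": transcript_id_version,
    }


example = add_version_fields(
    {"gene_id": "ENSG00000223972.5", "transcript_id": "ENST00000456328.2"}
)
```

Keeping the full ID in its own field means a downstream step can join on the unversioned ID and still inspect any suffix that follows the version number.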

2. Fix add_gencode_transcript_annotations to include start and end position

  • Updated the default value of the annotations parameter in add_gencode_transcript_annotations to include start_position and end_position
  • The function now properly calculates and includes transcript start and end positions from the GENCODE interval data
  • Updated documentation to reflect that these annotations are included by default
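
As a rough sketch of what deriving the positions from interval data could look like, the plain-Python example below copies the start and end of a record's interval into start_position and end_position. The Interval dataclass and add_positions helper are hypothetical stand-ins for illustration, not the Hail types or code used in this PR.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class Interval:
    """Hypothetical stand-in for a genomic interval on one contig."""

    contig: str
    start: int
    end: int


def add_positions(record: Dict[str, Any]) -> Dict[str, Any]:
    """Copy the interval boundaries into start_position and end_position."""
    interval = record["interval"]
    return {
        **record,
        "start_position": interval.start,
        "end_position": interval.end,
    }


annotated = add_positions(
    {"transcript_id": "ENST00000456328", "interval": Interval("chr1", 11869, 14409)}
)
```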

3. Add remove_y_par parameter to handle PAR region transcript ID duplicates

  • Added remove_y_par parameter (default: True) to add_gencode_transcript_annotations
  • This parameter filters out Y chromosome PAR (pseudoautosomal region) features to prevent duplicate transcript IDs
  • PAR transcripts can have identical transcript IDs on both chrX and chrY, which can cause issues in constraint calculations
  • The filtering uses the transcript_id_version field to identify Y_PAR transcripts (those ending with "Y_PAR")
  • Updated function documentation to explain the PAR region handling and requirements
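
The filtering idea can be sketched in plain Python as follows. This is an illustration only, not the Hail implementation in this PR: the drop_y_par helper is hypothetical, the transcript IDs are made up, and the "Y_PAR" suffix is taken from the description above and exposed as a parameter in case the actual suffix differs.

```python
from typing import Dict, List


def drop_y_par(rows: List[Dict[str, str]], suffix: str = "Y_PAR") -> List[Dict[str, str]]:
    """Keep only rows whose transcript_id_version does not end with the suffix."""
    return [row for row in rows if not row["transcript_id_version"].endswith(suffix)]


# Made-up rows: the same unversioned transcript_id appears on chrX and chrY.
rows = [
    {"contig": "chrX", "transcript_id": "ENST00000000001",
     "transcript_id_version": "ENST00000000001.1"},
    {"contig": "chrY", "transcript_id": "ENST00000000001",
     "transcript_id_version": "ENST00000000001.1_Y_PAR"},
    {"contig": "chr1", "transcript_id": "ENST00000000002",
     "transcript_id_version": "ENST00000000002.3"},
]

filtered = drop_y_par(rows)
transcript_ids = [row["transcript_id"] for row in filtered]
```

After filtering, each unversioned transcript_id appears once, which is the property the constraint calculations rely on.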

Technical Details

  • The remove_y_par functionality requires that the input GENCODE table includes the transcript_id_version field, which is available when import_gencode is called with include_version=True
  • The PAR region filtering prevents duplicate transcript annotations that could skew constraint calculations
  • Start and end position annotations are computed from the GENCODE interval data and provide essential transcript boundary information
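
Because the filter depends on transcript_id_version, a caller may want to fail early when that field is absent. The sketch below shows one way to express that check in plain Python. The require_field helper is hypothetical and is not part of the library; whether the actual function raises in this case is not stated in this PR.

```python
from typing import Dict, List


def require_field(rows: List[Dict[str, str]], field: str = "transcript_id_version") -> None:
    """Raise ValueError if any row lacks the field needed for Y PAR filtering."""
    missing = [index for index, row in enumerate(rows) if field not in row]
    if missing:
        raise ValueError(
            f"Field '{field}' is missing from {len(missing)} row(s); "
            "it is expected when the table was imported with include_version=True."
        )


# Passes silently: the field is present.
require_field([{"transcript_id": "T1", "transcript_id_version": "T1.1"}])
```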

Impact

  • Improves data quality by preventing PAR region duplicates in constraint calculations
  • Provides more complete transcript annotation data including positional information
  • Maintains backward compatibility while adding new functionality
  • Enables better handling of GENCODE data with version information

[Screenshot 2025-09-24 at 1:34:56 PM]

…ion. Also include a remove_y_par parameter to fix cases where transcript_id is found twice on PAR regions (chrX and chrY).
jkgoodrich requested a review from a team as a code owner September 24, 2025 19:23

ch-kr left a comment

a couple documentation suggestions and one question

jkgoodrich merged commit 83d2603 into main Sep 24, 2025
6 checks passed
jkgoodrich deleted the jg/updates_to_gencode_for_constraint branch September 24, 2025 22:15