Code for working with AGP and TPF files as used within the Tree of Life project, where the combination of long read sequencing and HiC data is used to produce whole genome assemblies. It is therefore not intended to cover the full range of AGP and TPF syntax.
Installation with uv is recommended. First
install uv itself, following the instructions on
Astral's web page, which on macOS or Linux is to
run:
curl -LsSf https://astral.sh/uv/install.sh | sh
Then, to make the scripts available to your user account, install a recent Python followed by these utilities:
uv python install 3.13
uv tool install 'git+https://github.com/sanger-tol/agp-tpf-utils'
Installation into an isolated, reproducible environment can be achieved using
pixi. Follow the instructions on the pixi
installation page to install it.
To create a new environment in the current directory, using the conda-forge
channel:
pixi init -c conda-forge
Then install the package into its own environment:
# Add python from conda-forge channel to a feature (package collection)
pixi add --feature tol-curation-utils python=3.13
# Add feature to environment
pixi workspace environment add curation --feature tol-curation-utils
# Add this tool to the feature.
pixi add --feature tol-curation-utils --pypi "tola-agp-tpf-utils@git+https://github.com/sanger-tol/agp-tpf-utils"
Use pixi shell -e curation to enter the environment.
Run with --help for usage.
Parses and reformats AGP and TPF files, converting into either format.
Finds overlapping entries within AGP or TPF assembly files. Useful for debugging.
Takes the AGP file output by
PretextView
and the input assembly (usually FASTA), and produces an output assembly in
FASTA and AGP formats. The input and output file formats are determined from
the extensions of the files. FASTA input and output uses the .fai index
format, as produced by
faidx, and uses a streaming
strategy with a 250 kB buffer to keep memory usage low no matter how large
the chromosome.
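As a rough illustration of the streaming idea described above (not the code used by this tool), the sketch below reads a region of one sequence from a FASTA file using the byte offsets stored in its .fai index, and yields it in chunks no larger than a fixed buffer. The names `read_fai` and `stream_region` are hypothetical, and treating "250 kB" as 250,000 bases per chunk is an assumption made for this example.

```python
from pathlib import Path

BUFFER_SIZE = 250_000  # bases per chunk; an assumed reading of "250 kB"


def read_fai(fai_path):
    """Return {name: (length, offset, line_bases, line_width)} from a .fai file."""
    index = {}
    for line in Path(fai_path).read_text().splitlines():
        name, length, offset, line_bases, line_width = line.split("\t")[:5]
        index[name] = (int(length), int(offset), int(line_bases), int(line_width))
    return index


def stream_region(fasta_path, index, name, start, end, buffer_size=BUFFER_SIZE):
    """Yield bases start..end (1-based, inclusive) of `name` in bounded chunks."""
    length, offset, line_bases, line_width = index[name]
    if not 1 <= start <= end <= length:
        raise ValueError(f"{name}:{start}-{end} is outside 1-{length}")
    with open(fasta_path, "rb") as fasta:
        pos = start - 1  # 0-based position of the next base to read
        while pos < end:
            want = min(buffer_size, end - pos)
            # Byte offset of base `pos`, allowing for line terminator bytes
            fasta.seek(offset + (pos // line_bases) * line_width + pos % line_bases)
            # Upper bound on the number of line ends crossed by `want` bases
            n_lines = (pos % line_bases + want) // line_bases
            raw = fasta.read(want + n_lines * (line_width - line_bases))
            # Drop line terminators, then trim any bytes read past the chunk
            chunk = raw.replace(b"\n", b"").replace(b"\r", b"")[:want]
            yield chunk.decode("ascii")
            pos += want
```

Because only one chunk is held in memory at a time, memory use depends on the buffer size rather than on the length of the chromosome being written.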
Both TPF and AGP file formats described here contain the same information. AGP is the more appropriate format to use, since it was designed for sequence assembly coordinates, whereas TPF was for listing (cosmid, fosmid, YAC or BAC) clones and their accessions in the order that they were tiled to build a chromosome.
Each line in the AGP v2.1 specification contains 9 tab delimited columns. Of these columns:
- DNA Sequence
  - column 5 the "component_type" contains "W" in our assemblies, meaning a contig from Whole Genome Shotgun (WGS) sequencing.
  - columns 10 and greater are extra tag metadata columns not included in the AGP v2.1 specification. (See below for their possible values.)
- Gaps
  - column 5 the "component_type" contains "U" in our assemblies, for a gap of unknown length. (The other gap type "N" is for gaps of known length.)
  - column 6 The default length in the specification for "U" gaps is 100 base pairs, but we use 200 bp gaps, as produced by yahs
  - column 7 has "scaffold", signifying a gap between two contigs in a scaffold.
  - column 8 has "yes", signifying that there is evidence of linkage between the sequence data on either side of the gap.
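The column layout above can be sketched as a small parser. This is an illustrative example, not this project's parser: the class and function names are hypothetical, the example rows in the usage note are made up, and the meaning of columns 1 to 4 and 6 to 9 for sequence lines is taken from the AGP v2.1 specification, not from this document.

```python
from dataclasses import dataclass, field


@dataclass
class AgpFragment:
    scaffold: str
    start: int
    end: int
    component: str
    component_start: int
    component_end: int
    orientation: str
    tags: list = field(default_factory=list)


@dataclass
class AgpGap:
    scaffold: str
    start: int
    end: int
    length: int
    gap_type: str
    linkage: str


def parse_agp_line(line):
    """Parse one AGP data line into an AgpFragment or an AgpGap."""
    cols = line.rstrip("\r\n").split("\t")
    if len(cols) < 9:
        raise ValueError(f"expected at least 9 columns, got {len(cols)}")
    scaffold, start, end = cols[0], int(cols[1]), int(cols[2])
    component_type = cols[4]
    if component_type == "W":
        # Columns 10 onwards hold the extra tag metadata
        tags = [tag for tag in cols[9:] if tag]
        return AgpFragment(
            scaffold, start, end, cols[5], int(cols[6]), int(cols[7]), cols[8], tags
        )
    if component_type == "U":
        # Column 6 is the gap length, 7 the gap type, 8 the linkage flag
        return AgpGap(scaffold, start, end, int(cols[5]), cols[6], cols[7])
    raise ValueError(f"unsupported component_type: {component_type!r}")
```

For example, `parse_agp_line("scaffold_1\t1001\t1200\t2\tU\t200\tscaffold\tyes\tproximity_ligation")` returns an `AgpGap` whose `length` is 200. Other component types raise an error, in keeping with the statement that the full range of AGP syntax is not covered.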
Tags are single words appended in tab-delimited columns beyond column 9. They can contain:
- Contaminant or Target
- FalseDuplicate
- Haplotig for haplotype-specific contigs.
- Haplotypes:
  - Hap1, Hap2 …
- Painted where fragment has HiC contacts.
- Primary for tagging the only curated haplotype in a multi-haplotype PretextView map.
- Singleton to flag chromosomes which were not found in any other haplotypes.
- Unloc are fragments attached to chromosomes but unlocalised within them.
- Sex Chromosomes:
  - U
  - V
  - W or W1, W2 …
  - X or X1, X2 …
  - Y or Y1, Y2 …
  - Z or Z1, Z2 …
- B Chromosomes:
  - B1, B2, B3 …
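One way to recognise the tag values listed above is with a few regular expressions. The sketch below is an assumption about how such a check could look, not how this package classifies tags: the function name `classify_tag` and the category labels are hypothetical, and it assumes tags are case-sensitive and that numbered suffixes start at 1.

```python
import re

# Fixed words from the list of tags
KEYWORD_TAGS = {
    "Contaminant",
    "Target",
    "FalseDuplicate",
    "Haplotig",
    "Painted",
    "Primary",
    "Singleton",
    "Unloc",
}

# Tags that follow a pattern, not a fixed word
TAG_PATTERNS = [
    ("haplotype", re.compile(r"Hap[1-9]\d*")),
    ("sex_chromosome", re.compile(r"[UV]|[WXYZ](?:[1-9]\d*)?")),
    ("b_chromosome", re.compile(r"B[1-9]\d*")),
]


def classify_tag(tag):
    """Return a category label for a tag, or None if it is not recognised."""
    if tag in KEYWORD_TAGS:
        return "keyword"
    for category, pattern in TAG_PATTERNS:
        if pattern.fullmatch(tag):
            return category
    return None
```

Using `fullmatch` means a value such as `Hap1x` is rejected outright instead of being accepted on its prefix.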
Our TPF files are quite diverged from the original specification.
- We incorporate assembly coordinates, which was not the purpose of TPF files.
- We do not necessarily include any ## header lines, which were mandatory in the original specification.
- DNA Sequence
  - column 1 the "accession" is always "?" since the components of our assemblies are not accessioned.
  - column 2 the "clone name" does not contain a clone name, but contains the name of scaffold fragment or whole scaffold, with the format: <name>:<start>-<end> i.e. assembly coordinates.
  - column 3 the "local contig identifier" now contains the name of the scaffold each sequence fragment belongs to. Each TPF file used to contain a single chromosome, but we put a whole genome into a single file, and this column groups the fragments into chromosomes / scaffolds.
  - column 4 holds assembly strand information, either PLUS or MINUS.
- Gaps
  - column 2 is TYPE-2, which meant a gap between two clones
  - column 3 length, using our default of 200 bp.
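The TPF conventions above could be parsed along the following lines. This is a sketch under stated assumptions and not the parser in this repository: the names `TpfFragment`, `TpfGap` and `parse_tpf_line` are hypothetical, and identifying gap lines by the word GAP in column 1 follows the original TPF specification, which this document does not state explicitly.

```python
import re
from typing import NamedTuple

# <name>:<start>-<end>, where <name> may itself contain ":" characters
COORD_RE = re.compile(r"(?P<name>.+):(?P<start>\d+)-(?P<end>\d+)")


class TpfFragment(NamedTuple):
    name: str
    start: int
    end: int
    scaffold: str
    strand: int  # 1 for PLUS, -1 for MINUS


class TpfGap(NamedTuple):
    gap_type: str
    length: int


def parse_tpf_line(line):
    """Parse one TPF line; returns None for blank lines and ## headers."""
    line = line.rstrip("\r\n")
    if not line or line.startswith("#"):
        return None
    cols = line.split("\t")
    if cols[0] == "GAP":  # assumption: gap lines start with GAP
        return TpfGap(cols[1], int(cols[2]))
    if cols[0] != "?":
        raise ValueError(f"unexpected accession: {cols[0]!r}")
    match = COORD_RE.fullmatch(cols[1])
    if match is None:
        raise ValueError(f"cannot parse coordinates: {cols[1]!r}")
    strand = {"PLUS": 1, "MINUS": -1}[cols[3]]
    return TpfFragment(
        match["name"], int(match["start"]), int(match["end"]), cols[2], strand
    )
```

Since each fragment carries its scaffold name from column 3, consecutive fragments can be grouped into chromosomes or scaffolds while reading a whole-genome file line by line.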
In your cloned copy of the git repository:
uv sync
source .venv/bin/activate
An alias such as this:
alias atu="cd $HOME/git/agp-tpf-utils && source .venv/bin/activate"
in your shell's .*rc file (e.g. ~/.bashrc for bash or ~/.zshrc for
zsh) can be convenient.
Some changes, such as adding a new command line script to
pyproject.toml, require the development environment to be
reinstalled, in which case just re-run:
uv sync
Tests, located in the tests/ directory, are run with the pytest
command from the project root.