Utilities for Tree of Life AGP and TPF Assembly Files

Code for working with AGP and TPF files as used within the Tree of Life project, where long read sequencing is combined with HiC data to produce whole genome assemblies. It is therefore not intended to cover the full range of AGP and TPF syntax.

Installation

uv

Installation with uv is recommended. First install uv itself, following the instructions on Astral's web page. On macOS or Linux, run:

curl -LsSf https://astral.sh/uv/install.sh | sh

Then, to make the scripts available to your user account, install a recent Python followed by these utilities:

uv python install 3.13
uv tool install 'git+https://github.com/sanger-tol/agp-tpf-utils'

Pixi

Installation into an isolated, reproducible environment can be achieved using pixi. Follow the instructions on the installation page to install pixi.

To create a new environment in the current directory, using the conda-forge channel:

pixi init -c conda-forge

Then install the package into its own environment:

# Add python from conda-forge channel to a feature (package collection)
pixi add --feature tol-curation-utils python=3.13
# Add feature to environment
pixi workspace environment add curation --feature tol-curation-utils
# Add this tool to the feature.
pixi add --feature tol-curation-utils --pypi "tola-agp-tpf-utils@git+https://github.com/sanger-tol/agp-tpf-utils"

Use pixi shell -e curation to enter the environment.

Scripts

Run with --help for usage.

`asm-format`

Parses and reformats AGP and TPF files, converting into either format.

`find-overlaps`

Finds overlapping entries within AGP or TPF assembly files. Useful for debugging.
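The kind of check involved can be illustrated with a minimal Python sketch. This is not the script's implementation: the `Fragment` record and `find_overlaps` function are hypothetical names, and the sketch assumes each entry can be reduced to a source sequence name with 1-based, inclusive start and end coordinates.

```python
from collections import defaultdict
from typing import NamedTuple


class Fragment(NamedTuple):
    """Hypothetical record: a range on a source sequence, 1-based inclusive."""

    name: str
    start: int
    end: int


def find_overlaps(fragments):
    """Return (earlier, later) pairs where `later` starts inside the
    furthest-reaching fragment seen before it on the same sequence."""
    by_name = defaultdict(list)
    for frag in fragments:
        by_name[frag.name].append(frag)

    overlaps = []
    for frags in by_name.values():
        frags.sort(key=lambda f: (f.start, f.end))
        furthest = frags[0]
        for frag in frags[1:]:
            if frag.start <= furthest.end:
                overlaps.append((furthest, frag))
            if frag.end > furthest.end:
                furthest = frag
    return overlaps
```

Sorting by start coordinate means a single pass per sequence is enough to flag every fragment that begins before an earlier one has ended.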

`pretext-to-asm`

Takes the AGP file output by PretextView and the input assembly (usually FASTA), and produces an output assembly in FASTA and AGP formats. The input and output file formats are determined from the file extensions. FASTA input and output use the .fai index format, as produced by faidx, together with a streaming strategy with a 250 kB buffer, which keeps memory usage low no matter how large the chromosome.
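To show why an index makes bounded-memory streaming possible, here is a minimal Python sketch. It is not the code used by pretext-to-asm: `read_fai` and `stream_region` are hypothetical helpers, the sketch assumes the usual five-column .fai layout (name, length, byte offset, bases per line, bytes per line), and the default `chunk_size` is illustrative only.

```python
def read_fai(fai_path):
    """Parse a .fai index into {name: (length, offset, line_bases, line_width)}."""
    index = {}
    with open(fai_path) as fh:
        for line in fh:
            cols = line.rstrip("\n").split("\t")
            name, length, offset, line_bases, line_width = cols[:5]
            index[name] = (int(length), int(offset), int(line_bases), int(line_width))
    return index


def stream_region(fasta_path, index, name, start, end, chunk_size=250_000):
    """Yield bases start..end (1-based, inclusive) of `name` in bounded chunks."""
    length, offset, line_bases, line_width = index[name]
    if not 1 <= start <= end <= length:
        raise ValueError(f"{name}:{start}-{end} is outside 1-{length}")

    def byte_pos(base0):
        # Every full line of bases is followed by its line-ending bytes
        return offset + (base0 // line_bases) * line_width + base0 % line_bases

    first = byte_pos(start - 1)
    remaining = byte_pos(end - 1) + 1 - first
    with open(fasta_path, "rb") as fh:
        fh.seek(first)
        while remaining > 0:
            raw = fh.read(min(chunk_size, remaining))
            if not raw:
                break
            remaining -= len(raw)
            # Drop line endings so only sequence characters are yielded
            chunk = raw.replace(b"\n", b"").replace(b"\r", b"")
            if chunk:
                yield chunk.decode("ascii")
```

Because the byte position of any base can be computed from the index, the reader seeks straight to the region and never holds more than `chunk_size` bytes at once.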

File Formats

Both the TPF and AGP file formats described here contain the same information. AGP is the more appropriate format to use, since it was designed for sequence assembly coordinates, whereas TPF was designed for listing clones (cosmid, fosmid, YAC or BAC) and their accessions in the order in which they were tiled to build a chromosome.

AGP

Each line in the AGP v2.1 specification contains 9 tab-delimited columns. Of these columns:

DNA Sequence
- column 5, the "component_type", contains W in our assemblies, meaning a contig from Whole Genome Shotgun (WGS) sequencing.
- columns 10 and greater are extra tag metadata columns not included in the AGP v2.1 specification. (See below for their possible values.)
Gaps
- column 5, the "component_type", contains U in our assemblies, for a gap of unknown length. (The other gap type, N, is for gaps of known length.)
- column 6 holds the gap length. The default length in the specification for U gaps is 100 base pairs, but we use 200 bp gaps, as produced by yahs.
- column 7 has scaffold, signifying a gap between two contigs in a scaffold.
- column 8 has yes, signifying that there is evidence of linkage between the sequence data on either side of the gap.
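The column layout above can be illustrated with a minimal Python sketch. It is not the parser in this package: `parse_agp_line` is a hypothetical function, the dictionary keys follow the column names of the AGP v2.1 specification, and any columns beyond the ninth on sequence lines are simply collected into a `tags` list without interpreting them.

```python
def parse_agp_line(line):
    """Split one AGP line into a dict; return None for comments and blank lines."""
    line = line.rstrip("\r\n")
    if not line or line.startswith("#"):
        return None
    cols = line.split("\t")
    if len(cols) < 9:
        raise ValueError(f"Expected at least 9 columns, got {len(cols)}: {line!r}")

    row = {
        "object": cols[0],
        "object_beg": int(cols[1]),
        "object_end": int(cols[2]),
        "part_number": int(cols[3]),
        "component_type": cols[4],
    }
    if cols[4] in ("N", "U"):
        # Gap line: columns 6-9 describe the gap
        row.update(
            gap_length=int(cols[5]),
            gap_type=cols[6],
            linkage=cols[7],
            linkage_evidence=cols[8],
        )
    else:
        # Sequence line: columns 6-9 locate the component
        row.update(
            component_id=cols[5],
            component_beg=int(cols[6]),
            component_end=int(cols[7]),
            orientation=cols[8],
            tags=cols[9:],  # extra, non-standard columns
        )
    return row
```

Branching on column 5 is what lets the same nine positions carry either component coordinates or gap details.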

TPF

Our TPF files have diverged considerably from the original specification.

- We incorporate assembly coordinates, which was not the purpose of TPF files.
- We do not necessarily include any ## header lines, which were mandatory in the original specification.
DNA Sequence
- column 1, the "accession", is always ? since the components of our assemblies are not accessioned.
- column 2, the "clone name", does not contain a clone name, but contains the name of a scaffold fragment or whole scaffold, with the format <name>:<start>-<end>, i.e. assembly coordinates.
- column 3, the "local contig identifier", now contains the name of the scaffold each sequence fragment belongs to. Each TPF file used to contain a single chromosome, but we put a whole genome into a single file, and this column groups the fragments into chromosomes / scaffolds.
- column 4 holds assembly strand information, either PLUS or MINUS.
Gaps
- column 2 is TYPE-2, which in the original specification meant a gap between two clones.
- column 3 is the gap length, using our default of 200 bp.
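A minimal Python sketch of reading lines in this TPF variant follows. It is not the parser in this package: `parse_tpf_line` is a hypothetical function, and it assumes tab-delimited columns and that gap lines carry GAP in column 1, as in the original TPF specification.

```python
import re

# <name>:<start>-<end>; the greedy name allows colons inside the name itself
FRAGMENT_RE = re.compile(r"^(?P<name>.+):(?P<start>\d+)-(?P<end>\d+)$")
STRANDS = {"PLUS": 1, "MINUS": -1}


def parse_tpf_line(line):
    """Split one TPF line into a dict; return None for headers and blank lines."""
    line = line.rstrip("\r\n")
    if not line or line.startswith("#"):
        return None
    cols = line.split("\t")

    if cols[0] == "GAP":  # assumption: gap marker from the original TPF spec
        return {"kind": "gap", "gap_type": cols[1], "length": int(cols[2])}

    match = FRAGMENT_RE.match(cols[1])
    if match is None:
        raise ValueError(f"Column 2 is not <name>:<start>-<end>: {cols[1]!r}")
    if cols[3] not in STRANDS:
        raise ValueError(f"Column 4 must be PLUS or MINUS, got {cols[3]!r}")
    return {
        "kind": "sequence",
        "accession": cols[0],
        "name": match["name"],
        "start": int(match["start"]),
        "end": int(match["end"]),
        "scaffold": cols[2],
        "strand": STRANDS[cols[3]],
    }
```

Grouping the returned sequence records by their `scaffold` value would recover the per-chromosome lists that a single-chromosome TPF file used to represent.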

Development Setup

In your cloned copy of the git repository:

uv sync
source .venv/bin/activate

An alias such as this:

alias atu="cd $HOME/git/agp-tpf-utils && source .venv/bin/activate"

in your shell's .*rc file (e.g. ~/.bashrc for bash or ~/.zshrc for zsh) can be convenient.

Reinstalling Development Environment

Some changes, such as adding a new command line script to pyproject.toml, require the development environment to be reinstalled, in which case just re-run:

uv sync

Running Tests

Tests, located in the tests/ directory, are run with the pytest command from the project root.
