|
1 | 1 | # quickdna |
2 | 2 |
|
3 | | -[](https://pypi.org/project/quickdna/) |
4 | | - |
5 | 3 | Quickdna is a simple, fast library for working with DNA sequences. It is up to 100x faster than Biopython for some |
6 | 4 | translation tasks, in part because it uses a native Rust module (via PyO3) for the translation. However, it exposes |
7 | 5 | an easy-to-use, type-annotated API that should still feel familiar for Biopython users. |
8 | 6 |
|
9 | | -⚠ *Quickdna is "pre-1.0" software. Its API is still evolving. For now, if you're interested in using quickdna, we suggest you depend on an [exact version](https://python-poetry.org/docs/dependency-specification/#exact-requirements) or [git `rev`](https://python-poetry.org/docs/dependency-specification/#git-dependencies), so that new releases don't break your code.* |
10 | | - |
11 | | -```python |
12 | | -# These are the two main library types. Unlike Biopython, DnaSequence and |
13 | | -# ProteinSequence are distinct, though they share a common BaseSequence base class |
14 | | ->>> from quickdna import DnaSequence, ProteinSequence |
15 | | - |
16 | | -# Sequences can be constructed from strs or bytes, and are stored internally as |
17 | | -# ascii-encoded bytes. |
18 | | ->>> d = DnaSequence("taatcaagactattcaaccaa") |
19 | | - |
20 | | -# Sequences can be sliced just like regular strings, and return new sequence instances. |
21 | | ->>> d[3:9] |
22 | | -DnaSequence(seq='tcaaga') |
23 | | - |
24 | | -# Many other Python operations are supported on sequences as well: len, iter,
25 | | -# ==, hash, concatenation with +, * a constant, etc. These operations are typed
26 | | -# when appropriate and will not allow you to concatenate a ProteinSequence to a
27 | | -# DnaSequence, for example.
28 | | - |
29 | | -# DNA sequences can be easily translated to protein sequences with `translate()`. |
30 | | -# If no table=... argument is given, NCBI table 1 will be used by default...
31 | | ->>> d.translate() |
32 | | -ProteinSequence(seq='*SRLFNQ') |
33 | | - |
34 | | -# ...but any of the NCBI tables can be specified. A ValueError will be raised
35 | | -# for an invalid table. |
36 | | ->>> d.translate(table=22) |
37 | | -ProteinSequence(seq='**RLFNQ') |
38 | | - |
39 | | -# Reverse complementing is supported too. It's somewhat faster than Biopython,
40 | | -# but not as dramatically as `translate()`.
41 | | ->>> d[3:9].reverse_complement() |
42 | | -DnaSequence(seq='TCTTGA') |
43 | | - |
44 | | -# This method will return a tuple of all (up to 6) possible translated reading frames:
45 | | -# (seq[:], seq[1:], seq[2:], seq.reverse_complement()[:], ...) |
46 | | ->>> d.translate_all_frames() |
47 | | -(ProteinSequence(seq='*SRLFNQ'), ProteinSequence(seq='NQDYST'), |
48 | | -ProteinSequence(seq='IKTIQP'), ProteinSequence(seq='LVE*S*L'), |
49 | | -ProteinSequence(seq='WLNSLD'), ProteinSequence(seq='G*IVLI')) |
50 | | - |
51 | | -# translate_all_frames will return fewer than 6 frames for sequences of len < 5
52 | | ->>> len(DnaSequence("AAAA").translate_all_frames()) |
53 | | -4 |
54 | | ->>> len(DnaSequence("AA").translate_all_frames()) |
55 | | -0 |
56 | | - |
57 | | -# There is a similar method, `translate_self_frames`, that only returns the |
58 | | -# (up to 3) translated frames for this direction, without the reverse complement |
59 | | - |
60 | | -# The IUPAC ambiguity codes are supported as well. |
61 | | -# Codons with N will translate to a specific amino acid if it is unambiguous, |
62 | | -# such as GGN -> G, or the ambiguous amino acid code 'X' if there are multiple |
63 | | -# possible translations. |
64 | | ->>> DnaSequence("GGNATN").translate() |
65 | | -ProteinSequence(seq='GX') |
66 | | - |
67 | | -# The fine-grained ambiguity codes like "R = A or G" are accepted too, and |
68 | | -# translation results are the same as Biopython. In the output, amino acid |
69 | | -# ambiguity code 'B' means "either asparagine or aspartic acid" (N or D). |
70 | | ->>> DnaSequence("RAT").translate() |
71 | | -ProteinSequence(seq='B') |
72 | | - |
73 | | -# To disallow ambiguity codes in translation, try: `.translate(strict=True)` |
74 | | -``` |
75 | | - |
76 | | -## Benchmarks |
77 | | - |
78 | | -For regular DNA translation tasks, quickdna is faster than Biopython. (See `benchmarks/bench.py` for the source.)
79 | | -Machines and workloads vary, however -- always benchmark! |
80 | | - |
81 | | -task | time | comparison |
82 | | --------------------------------------------|------------------|----------- |
83 | | -translate_quickdna(small_genome) | 0.00306ms / iter | |
84 | | -translate_biopython(small_genome) | 0.05834ms / iter | 1908.90% |
85 | | -translate_quickdna(covid_genome) | 0.02959ms / iter | |
86 | | -translate_biopython(covid_genome) | 3.54413ms / iter | 11979.10% |
87 | | -reverse_complement_quickdna(small_genome) | 0.00238ms / iter | |
88 | | -reverse_complement_biopython(small_genome) | 0.00398ms / iter | 167.24% |
89 | | -reverse_complement_quickdna(covid_genome) | 0.02409ms / iter | |
90 | | -reverse_complement_biopython(covid_genome) | 0.02928ms / iter | 121.55% |
91 | | - |
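To check numbers like these on your own machine, a small `timeit` harness is enough. The sketch below is not the project's `benchmarks/bench.py`: `ms_per_iter`, `comparison_percent`, and the pure-Python stand-in workload are hypothetical names for illustration. It also assumes the comparison column is the slower time expressed as a percentage of the quickdna time, which is roughly consistent with the rows above but is not stated in the source.

```python
import timeit


def ms_per_iter(func, *args, number=1000, repeat=5):
    """Best-of-`repeat` average time per call, in milliseconds."""
    timer = timeit.Timer(lambda: func(*args))
    best_total_seconds = min(timer.repeat(repeat=repeat, number=number))
    return best_total_seconds / number * 1000.0


def comparison_percent(baseline_ms, other_ms):
    """Assumed meaning of the comparison column: other as a % of baseline."""
    return other_ms / baseline_ms * 100.0


# Stand-in workload so the sketch runs without quickdna or Biopython installed.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement_py(seq):
    return seq.translate(_COMPLEMENT)[::-1]


if __name__ == "__main__":
    genome = "ACGT" * 2500  # synthetic 10 kb input, not a real genome
    baseline = ms_per_iter(reverse_complement_py, genome)
    print(f"reverse_complement_py | {baseline:.5f}ms / iter")
```

To compare two libraries, time each candidate callable with `ms_per_iter` on the same input and pass the two results to `comparison_percent`.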
92 | | -## Should you use quickdna? |
93 | | - |
94 | | -* Quickdna pros:
95 | | - * It's quick! |
96 | | - * It's simple and small. |
97 | | - * It has type annotations, including a `py.typed` marker file for checkers like MyPy or VSCode's PyRight. |
98 | | - * It makes a type distinction between DNA and protein sequences, preventing confusion. |
99 | | -* Quickdna cons: |
100 | | - * It's newer and less battle-tested than Biopython. |
101 | | - * It's not yet 1.0 -- the API is liable to change in the future. |
102 | | - * It doesn't support reading FASTA files or many of the other tasks Biopython can do, |
103 | | - so you'll probably end up still using Biopython or something else to do those tasks. |
104 | | - |
105 | | -## Installation |
106 | | - |
107 | | -Quickdna has prebuilt wheels for Linux (manylinux2010), macOS, and Windows available [on PyPI](https://pypi.org/project/quickdna/).
108 | | - |
109 | 7 | ## Development |
110 | 8 |
|
111 | 9 | Quickdna uses `PyO3` and `maturin` to build and upload the wheels, and `poetry` for handling dependencies. This is handled via |
|