Commit 23fe3b0

Merge pull request #1 from austintwang/dev
Merge into main
2 parents daab5c9 + 0f59972 commit 23fe3b0

10 files changed: +139 -42 lines changed

.github/workflows/CI.yml

Lines changed: 3 additions & 2 deletions
@@ -158,6 +158,7 @@ jobs:
     runs-on: ubuntu-latest
     if: ${{ startsWith(github.ref, 'refs/tags/') || github.event_name == 'workflow_dispatch' }}
     needs: [linux, musllinux, windows, macos, sdist]
+    environment: release
     permissions:
       # Use to sign the release artifacts
       id-token: write
@@ -174,8 +175,8 @@ jobs:
       - name: Publish to PyPI
         if: ${{ startsWith(github.ref, 'refs/tags/') }}
         uses: PyO3/maturin-action@v1
-        env:
-          MATURIN_PYPI_TOKEN: ${{ secrets.PYPI_API_TOKEN }}
+        # env:
+        # MATURIN_PYPI_TOKEN: ${{ secrets.PYPI_API_TOKEN }}
         with:
           command: upload
           args: --non-interactive --skip-existing wheels-*/*

Cargo.lock

Lines changed: 1 addition & 1 deletion
Some generated files are not rendered by default.

Cargo.toml

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 [package]
 name = "dinuc_shuf"
-version = "0.1.0"
+version = "0.1.0-beta.1"
 edition = "2021"
 
 [lib]

LICENSE

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2025 Austin Wang
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

README.md

Lines changed: 44 additions & 2 deletions
@@ -1,5 +1,47 @@
 # dinuc_shuf
 
-This is a simple Python package with a single function `shuffle` that does a dinucleotide shuffle on input sequences.
+This Python package provides a minimal and efficient implementation for performing dinucleotide shuffles on one-hot-encoded sequences.
 
-A dinucleotide shuffle preserves the dinucleotide (doublet) frequencies of the input sequence while randomizing the order of the dinucleotides. This is useful for generating compositionally-matched random sequences.
+Dinucleotide shuffling preserves the dinucleotide (nucleotide pair) frequencies of the input sequence while randomizing the order of the pairs. This is particularly useful for generating random sequences that match the compositional properties of the original input.
+
+To ensure a uniform random sample from all possible shuffles, the algorithm leverages the rank-one-update Kirchhoff matrix method described by [Colburn et al.](https://doi.org/10.1006/jagm.1996.0014) for sampling random arborescences, combined with a random Eulerian walk on the dinucleotide transition graph. The core algorithm is implemented in Rust for performance, with Python bindings for easy integration.
+
+This package is lightweight, requiring only a single dependency on Numpy.
+
+## Installation
+
+To install the package from PyPI, run:
+
+```bash
+pip install dinuc-shuf
+```
+
+## Usage
+
+```python
+import numpy as np
+from dinuc_shuf import shuffle
+
+SEQ_ALPHABET = np.array(["A","C","G","T"], dtype="S1")
+
+def one_hot_encode(sequence, dtype=np.uint8):
+    sequence = sequence.upper()
+    seq_chararray = np.frombuffer(sequence.encode('UTF-8'), dtype='S1')
+    one_hot = (seq_chararray[:,None] == SEQ_ALPHABET[None,:]).astype(dtype)
+
+    return one_hot
+
+def one_hot_decode(one_hot):
+    return SEQ_ALPHABET[one_hot.argmax(axis=1)].tobytes().decode('UTF-8')
+
+sequence = "ACCCACGATGATG"
+one_hot_sequence = one_hot_encode(sequence)
+shuffled_one_hot = shuffle(one_hot_sequence[None,:,:])
+shuffled = one_hot_decode(shuffled_one_hot[0,:,:])
+
+print(shuffled) # Output: "ACATGATGACCCG"
+```
+
+## API Reference
+
+A full API reference is available [here](https://austintwang.github.io/dinuc_shuf/).
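
The README's mention of a Kirchhoff matrix can be made concrete with a small counting exercise. The sketch below is not the package's Rust implementation and does no sampling; it only counts how many distinct shuffles exist, using the matrix-tree theorem (arborescences from a Kirchhoff matrix minor) together with the BEST-theorem ordering argument for Eulerian walks. The names `determinant` and `count_dinuc_shuffles` are hypothetical, and the sketch assumes a shuffle keeps the first nucleotide fixed.

```python
from collections import Counter
from math import factorial, prod


def determinant(matrix):
    """Integer determinant by cofactor expansion (fine for tiny alphabets)."""
    if not matrix:
        return 1  # determinant of the empty matrix
    total = 0
    for col, entry in enumerate(matrix[0]):
        if entry == 0:
            continue
        minor = [row[:col] + row[col + 1:] for row in matrix[1:]]
        total += (-1) ** col * entry * determinant(minor)
    return total


def count_dinuc_shuffles(seq):
    """Count distinct sequences sharing seq's first letter and dinucleotide counts."""
    if len(seq) < 2:
        return 1
    edges = Counter(zip(seq, seq[1:]))  # multigraph: one edge per dinucleotide
    root = seq[-1]                      # every Eulerian walk must end here
    others = sorted(set(seq) - {root})

    out_degree = Counter()
    for (src, _dst), mult in edges.items():
        out_degree[src] += mult

    # Kirchhoff matrix (out-degree minus adjacency) without the root's row/column.
    kirchhoff = []
    for v in others:
        row = []
        for w in others:
            if v == w:
                # Out-degree of v, ignoring self-loops such as C->C.
                row.append(out_degree[v] - edges[(v, v)])
            else:
                row.append(-edges[(v, w)])
        kirchhoff.append(row)

    # Matrix-tree theorem: number of "last exit" trees directed toward the root.
    arborescences = determinant(kirchhoff)

    # For each tree, the remaining exits of every vertex can be ordered freely.
    orderings = factorial(out_degree[root]) * prod(
        factorial(out_degree[v] - 1) for v in others
    )

    # Parallel edges (repeated dinucleotides) are indistinguishable in a sequence.
    return arborescences * orderings // prod(factorial(m) for m in edges.values())


print(count_dinuc_shuffles("ACCCACGATGATA"))  # 72
print(count_dinuc_shuffles("ACCCACGATGATG"))  # 27
```

Under the stated assumption, the two results agree with the expected counts that `tests/test.py` passes to `test_factory` for the same sequences.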

docs/dinuc_shuf.html

Lines changed: 18 additions & 12 deletions
@@ -42,20 +42,26 @@ <h2>API Documentation</h2>
 <h1 class="modulename">
 dinuc_shuf </h1>
 
-
+<div class="docstring"><p>This module provides a method <code><a href="#shuffle">shuffle</a></code> to dinucleotide shuffle one-hot encoded sequences.</p>
+
+<p>For installation and usage instructions, check out the <a href="https://github.com/austintwang/dinuc_shuf">GitHub repository</a>.</p>
+</div>
+
 <input id="mod-dinuc_shuf-view-source" class="view-source-toggle-state" type="checkbox" aria-hidden="true" tabindex="-1">
 
 <label class="view-source-button" for="mod-dinuc_shuf-view-source"><span>View Source</span></label>
 
-<div class="pdoc-code codehilite"><pre><span></span><span id="L-1"><a href="#L-1"><span class="linenos">1</span></a><span class="kn">from</span><span class="w"> </span><span class="nn">.main</span><span class="w"> </span><span class="kn">import</span> <span class="n">shuffle</span>
-</span><span id="L-2"><a href="#L-2"><span class="linenos">2</span></a>
-</span><span id="L-3"><a href="#L-3"><span class="linenos">3</span></a><span class="sd">&quot;&quot;&quot;</span>
-</span><span id="L-4"><a href="#L-4"><span class="linenos">4</span></a><span class="sd">.. include:: ../../README.md</span>
-</span><span id="L-5"><a href="#L-5"><span class="linenos">5</span></a><span class="sd">&quot;&quot;&quot;</span>
-</span><span id="L-6"><a href="#L-6"><span class="linenos">6</span></a>
-</span><span id="L-7"><a href="#L-7"><span class="linenos">7</span></a><span class="n">__all__</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;shuffle&#39;</span><span class="p">]</span>
-</span><span id="L-8"><a href="#L-8"><span class="linenos">8</span></a>
-</span><span id="L-9"><a href="#L-9"><span class="linenos">9</span></a><span class="n">__docformat__</span> <span class="o">=</span> <span class="s1">&#39;numpy&#39;</span>
+<div class="pdoc-code codehilite"><pre><span></span><span id="L-1"><a href="#L-1"><span class="linenos"> 1</span></a><span class="sd">&quot;&quot;&quot;</span>
+</span><span id="L-2"><a href="#L-2"><span class="linenos"> 2</span></a><span class="sd">This module provides a method `shuffle` to dinucleotide shuffle one-hot encoded sequences.</span>
+</span><span id="L-3"><a href="#L-3"><span class="linenos"> 3</span></a>
+</span><span id="L-4"><a href="#L-4"><span class="linenos"> 4</span></a><span class="sd">For installation and usage instructions, check out the [GitHub repository](https://github.com/austintwang/dinuc_shuf).</span>
+</span><span id="L-5"><a href="#L-5"><span class="linenos"> 5</span></a><span class="sd">&quot;&quot;&quot;</span>
+</span><span id="L-6"><a href="#L-6"><span class="linenos"> 6</span></a>
+</span><span id="L-7"><a href="#L-7"><span class="linenos"> 7</span></a><span class="kn">from</span><span class="w"> </span><span class="nn">.main</span><span class="w"> </span><span class="kn">import</span> <span class="n">shuffle</span>
+</span><span id="L-8"><a href="#L-8"><span class="linenos"> 8</span></a>
+</span><span id="L-9"><a href="#L-9"><span class="linenos"> 9</span></a><span class="n">__all__</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;shuffle&#39;</span><span class="p">]</span>
+</span><span id="L-10"><a href="#L-10"><span class="linenos">10</span></a>
+</span><span id="L-11"><a href="#L-11"><span class="linenos">11</span></a><span class="n">__docformat__</span> <span class="o">=</span> <span class="s1">&#39;numpy&#39;</span>
 </span></pre></div>
 
 
@@ -78,7 +84,7 @@ <h1 class="modulename">
 </span><span id="shuffle-12"><a href="#shuffle-12"><span class="linenos">12</span></a><span class="sd"> Parameters</span>
 </span><span id="shuffle-13"><a href="#shuffle-13"><span class="linenos">13</span></a><span class="sd"> ----------</span>
 </span><span id="shuffle-14"><a href="#shuffle-14"><span class="linenos">14</span></a><span class="sd"> seqs : np.ndarray</span>
-</span><span id="shuffle-15"><a href="#shuffle-15"><span class="linenos">15</span></a><span class="sd"> A three-dimensional array of one-hot-encoded sequences with shape (num_seqs, seq_len, alphabet_size).</span>
+</span><span id="shuffle-15"><a href="#shuffle-15"><span class="linenos">15</span></a><span class="sd"> A three-dimensional array of one-hot-encoded sequences with shape (num_seqs, seq_len, alphabet_size). Will be cast to np.uint8 if not already so.</span>
 </span><span id="shuffle-16"><a href="#shuffle-16"><span class="linenos">16</span></a><span class="sd"> rng : Optional[np.random.Generator], optional</span>
 </span><span id="shuffle-17"><a href="#shuffle-17"><span class="linenos">17</span></a><span class="sd"> A NumPy random number generator instance. If None, a new default generator instance will be used.</span>
 </span><span id="shuffle-18"><a href="#shuffle-18"><span class="linenos">18</span></a><span class="sd"> verify : bool, optional</span>
@@ -128,7 +134,7 @@ <h6 id="parameters">Parameters</h6>
 
 <ul>
 <li><strong>seqs</strong> (np.ndarray):
-A three-dimensional array of one-hot-encoded sequences with shape (num_seqs, seq_len, alphabet_size).</li>
+A three-dimensional array of one-hot-encoded sequences with shape (num_seqs, seq_len, alphabet_size). Will be cast to np.uint8 if not already so.</li>
 <li><strong>rng</strong> (Optional[np.random.Generator], optional):
 A NumPy random number generator instance. If None, a new default generator instance will be used.</li>
 <li><strong>verify</strong> (bool, optional):

pyproject.toml

Lines changed: 11 additions & 1 deletion
@@ -4,18 +4,28 @@ build-backend = "maturin"
 
 [project]
 name = "dinuc_shuf"
+description = "A utility for shuffling biological sequences while preserving dinucleotide frequencies."
+authors = [{ name = "Austin Wang" }]
 requires-python = ">=3.8"
 classifiers = [
     "Programming Language :: Rust",
     "Programming Language :: Python :: Implementation :: CPython",
     "Programming Language :: Python :: Implementation :: PyPy",
+    "Development Status :: 4 - Beta",
 ]
 dynamic = ["version"]
 dependencies = [
     "numpy >= 1.16.0"
 ]
+readme = "README.md"
+license = "MIT"
 
 [tool.maturin]
 features = ["pyo3/extension-module"]
 python-source = "python"
-# module-name = "dinuc_shuf._internal"
+
+[project.urls]
+Homepage = "https://github.com/austintwang/dinuc_shuf"
+Documentation = "https://austintwang.github.io/dinuc_shuf"
+Repository = "https://github.com/austintwang/dinuc_shuf.git"
+"Bug Tracker" = "https://github.com/austintwang/dinuc_shuf/issues"

python/dinuc_shuf/__init__.py

Lines changed: 5 additions & 3 deletions
@@ -1,9 +1,11 @@
-from .main import shuffle
-
 """
-.. include:: ../../README.md
+This module provides a method `shuffle` to dinucleotide shuffle one-hot encoded sequences.
+
+For installation and usage instructions, check out the [GitHub repository](https://github.com/austintwang/dinuc_shuf).
 """
 
+from .main import shuffle
+
 __all__ = ['shuffle']
 
 __docformat__ = 'numpy'

python/dinuc_shuf/main.py

Lines changed: 1 addition & 1 deletion
@@ -11,7 +11,7 @@ def shuffle(seqs: np.ndarray, rng: Optional[np.random.Generator] = None, verify:
     Parameters
     ----------
     seqs : np.ndarray
-        A three-dimensional array of one-hot-encoded sequences with shape (num_seqs, seq_len, alphabet_size).
+        A three-dimensional array of one-hot-encoded sequences with shape (num_seqs, seq_len, alphabet_size). Will be cast to np.uint8 if not already so.
     rng : Optional[np.random.Generator], optional
         A NumPy random number generator instance. If None, a new default generator instance will be used.
     verify : bool, optional

tests/test.py

Lines changed: 34 additions & 19 deletions
@@ -20,7 +20,7 @@ def one_hot_decode(one_hot):
     return SEQ_ALPHABET[one_hot.argmax(axis=1)].tobytes().decode('UTF-8')
 
 
-def test_factory(seq_str, num_shuffles_true=None):
+def test_factory(seq_str, num_shuffles_true=None, n=100000):
     class TestDinucShuffle(unittest.TestCase):
         def setUp(self):
             seq = one_hot_encode(seq_str)
@@ -31,18 +31,19 @@ def setUp(self):
 
             self.num_shuffles_true = num_shuffles_true
 
+            self.n = n
+            seq_expanded = np.repeat(self.seq[None,:,:], self.n, axis=0)
+            self.shuffled = shuffle(seq_expanded, rng=self.rng)
+            self.decoded_shuffled = [one_hot_decode(i) for i in self.shuffled]
+
         def test_composition(self):
-            shuffled = shuffle(self.seq[None,:,:], rng=self.rng)
-            shuffled_adj = shuffled[0,:-1,:].T @ shuffled[0,1:,:]
-            np.testing.assert_array_equal(self.seq_adj, shuffled_adj)
+            for shuffled in self.shuffled:
+                shuffled_adj = shuffled[:-1,:].T @ shuffled[1:,:]
+                np.testing.assert_array_equal(self.seq_adj, shuffled_adj)
 
         def test_uniformity(self):
-            n = 100000
-            seq_expanded = np.repeat(self.seq[None,:,:], n, axis=0)
-            shuffled = shuffle(seq_expanded, rng=self.rng)
-            decoded_shuffled = [one_hot_decode(i) for i in shuffled]
             counts = {}
-            for i in decoded_shuffled:
+            for i in self.decoded_shuffled:
                 counts.setdefault(i, 0)
                 counts[i] += 1
 
@@ -63,12 +64,7 @@ def test_coverage(self):
             if self.num_shuffles_true is None:
                 self.skipTest("True number of unique sequences not provided")
 
-            n = 1000
-            seq_expanded = np.repeat(self.seq[None,:,:], n, axis=0)
-            shuffled = shuffle(seq_expanded, rng=self.rng)
-            decoded_shuffled = [one_hot_decode(i) for i in shuffled]
-
-            unique_seqs = set(decoded_shuffled)
+            unique_seqs = set(self.decoded_shuffled)
             unique_seqs.discard("")
             num_unique_seqs = len(unique_seqs)
 
@@ -77,10 +73,29 @@ def test_coverage(self):
     return TestDinucShuffle
 
 
-empty = test_factory("", 0)
-A = test_factory("A", 1)
-TT = test_factory("TT", 1)
-ACGT = test_factory("ACGT", 1)
+class TestEmptyInput(unittest.TestCase):
+    def test_empty_input(self):
+        seq = np.zeros((0, 1000, 4), dtype=np.uint8)
+        shuffled = shuffle(seq)
+
+        np.testing.assert_array_equal(seq, shuffled)
+
+
+class TestMalformedInput(unittest.TestCase):
+    def test_wrong_shape(self):
+        seq = one_hot_encode("ACCCACGATGATA")
+        with self.assertRaises(ValueError):
+            shuffle(seq)
+
+    def test_not_one_hot(self):
+        seq = np.zeros((1, 1000, 4), dtype=np.uint8)
+        with self.assertRaises(ValueError):
+            shuffle(seq)
+
+
+A = test_factory("A", 1, n=10)
+TT = test_factory("TT", 1, n=10)
+ACGT = test_factory("ACGT", 1, n=10)
 ACGCACGG = test_factory("ACGCACGG")
 ACCCACGATGATA = test_factory("ACCCACGATGATA", 72)
 ACCCACGATGATG = test_factory("ACCCACGATGATG", 27)
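
The composition and coverage checks in these tests can be mirrored on plain strings, which may help when reasoning about the expected counts. The sketch below is independent of the package: `dinuc_counts` and `enumerate_shuffles` are hypothetical helper names, and the enumeration assumes that a shuffle keeps the first nucleotide of the input fixed. It walks the dinucleotide multigraph by brute force and collects every sequence that uses each dinucleotide exactly as often as the input does.

```python
from collections import Counter


def dinuc_counts(seq):
    """Count each adjacent pair of letters in seq."""
    return Counter(zip(seq, seq[1:]))


def enumerate_shuffles(seq):
    """Return every distinct sequence that starts with seq[0] and has seq's dinucleotide counts."""
    if not seq:
        return set()
    remaining = dinuc_counts(seq)  # dinucleotides still available to place
    alphabet = sorted(set(seq))
    found = set()

    def extend(prefix):
        # A full-length prefix has consumed every dinucleotide exactly once.
        if len(prefix) == len(seq):
            found.add("".join(prefix))
            return
        last = prefix[-1]
        for nxt in alphabet:
            if remaining[(last, nxt)] > 0:
                remaining[(last, nxt)] -= 1
                prefix.append(nxt)
                extend(prefix)
                prefix.pop()
                remaining[(last, nxt)] += 1  # backtrack

    extend([seq[0]])
    return found


reference = "ACCCACGATGATG"
shuffles = enumerate_shuffles(reference)

# Composition: every enumerated sequence has the reference's dinucleotide counts.
assert all(dinuc_counts(s) == dinuc_counts(reference) for s in shuffles)

# Coverage: the number of distinct shuffles for the two 13-mers used in the tests.
print(len(shuffles))                             # 27
print(len(enumerate_shuffles("ACCCACGATGATA")))  # 72
```

The brute-force search is only practical for short sequences like these; it is meant as a cross-check on the constants, not as a substitute for sampling.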
