Commit b995f21

Implement all missing multibase encodings and API features (#21)
* feat: Implement all missing multibase encodings and API features

  - Add 10 missing encodings: base16upper, base32upper, base32pad, base32padupper, base32hexupper, base32hexpad, base32hexpadupper, base64pad, base64urlpad, base256emoji
  - Implement RFC 4648 padding support for base32 and base64 variants
  - Add structured exception classes (UnsupportedEncodingError, InvalidMultibaseStringError, DecodingError)
  - Add Encoder and Decoder classes for reusable encoding/decoding
  - Add decode(return_encoding=True) to return encoding type
  - Add encoding metadata functions (get_encoding_info, list_encodings, is_encoding_supported)
  - Add decoder composition support via Decoder.or_() method
  - Update tests for all new encodings and API features
  - Update documentation and create news fragment

  Closes #20

  Achieves 100% encoding coverage (24/24 encodings)

* Address PR review feedback: simplify TOXENV logic, improve newsfragment, add make targets docs

* Fix issues from PR review: simplify padding logic, add type hints, improve docs, add test

  - Simplify redundant padding logic in BaseByteStringConverter
  - Add return type hints to Base256EmojiConverter methods
  - Add clarifying comments to exception handling in decode()
  - Add test for ComposedDecoder error messages when all decoders fail
  - Improve Base256EmojiConverter documentation with better docstrings

* Fix base256emoji to use exact hardcoded alphabet from reference implementations

  - Replace dynamic emoji generation with hardcoded alphabet matching js-multiformats and go-multibase
  - Update decode method to iterate character-by-character (matching reference implementations)
  - Ensures full compatibility with js-multiformats and go-multibase
  - All test cases from js-multiformats spec tests now pass exactly

  This fixes the compatibility issue where our implementation was generating a different emoji set than the reference implementations.
1 parent 4529b71 commit b995f21

10 files changed: +621 -58 lines changed

.github/workflows/tox.yml

Lines changed: 5 additions & 7 deletions
@@ -28,12 +28,11 @@ jobs:
           python-version: ${{ matrix.python-version }}
       - name: Set TOXENV
         run: |
-          python_version="${{ matrix.python-version }}"
-          toxenv="${{ matrix.toxenv }}"
-          if [[ "$toxenv" == "docs" ]]; then
-            echo "TOXENV=docs" | tee -a "$GITHUB_ENV"
+          if [[ "${{ matrix.toxenv }}" == "docs" ]]; then
+            echo "TOXENV=docs" >> "$GITHUB_ENV"
           else
-            echo "TOXENV=py${python_version}-${toxenv}" | tr -d '.' | tee -a "$GITHUB_ENV"
+            python_version="${{ matrix.python-version }}"
+            echo "TOXENV=py${python_version//./}-${{ matrix.toxenv }}" >> "$GITHUB_ENV"
           fi
       - run: |
           python -m pip install --upgrade pip
@@ -58,8 +57,7 @@ jobs:
         shell: bash
         run: |
           python_version="${{ matrix.python-version }}"
-          toxenv="${{ matrix.toxenv }}"
-          echo "TOXENV=py${python_version}-${toxenv}" | tr -d '.' | tee -a "$GITHUB_ENV"
+          echo "TOXENV=py${python_version//./}-${{ matrix.toxenv }}" >> "$GITHUB_ENV"
       - name: Install dependencies
         run: |
           python -m pip install --upgrade pip
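
Note on the TOXENV change: the rule the workflow step applies can be sketched in Python as below. This is only an illustration of the naming rule visible in the diff; the function name `toxenv_for` and the sample matrix values are assumptions, not part of the repository. One visible difference from the old pipeline is that `tr -d '.'` removed dots from the whole output line, while `${python_version//./}` removes them from the Python version only.

```python
def toxenv_for(python_version: str, toxenv: str) -> str:
    """Mirror the workflow's TOXENV rule (hypothetical helper, for illustration)."""
    if toxenv == "docs":
        # The docs job uses a fixed environment name.
        return "docs"
    # Equivalent of the shell expansion ${python_version//./}:
    # dots are stripped from the version only, not from the toxenv suffix.
    return f"py{python_version.replace('.', '')}-{toxenv}"


# Sample values are assumptions; the real matrix entries are not shown in the diff.
print(toxenv_for("3.11", "core"))
print(toxenv_for("3.11", "docs"))
```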

CONTRIBUTING.rst

Lines changed: 27 additions & 0 deletions
@@ -96,6 +96,33 @@ Ready to contribute? Here's how to set up `multibase` for local development.
 
 If you installed pre-commit hooks (step 4), they will run automatically on commit.
 
+Development Workflow Commands
+-------------------------------
+
+The project provides several ``make`` targets to help with development:
+
+* ``make fix`` - Automatically fix formatting and linting issues using ruff.
+  Use this when you want to auto-fix code style issues.
+
+* ``make lint`` - Run all pre-commit hooks on all files to check for code quality
+  issues. This includes YAML/TOML validation, trailing whitespace checks, pyupgrade,
+  ruff linting and formatting, and mypy type checking.
+
+* ``make typecheck`` - Run mypy type checking only. Use this when you want to
+  quickly check for type errors without running all other checks.
+
+* ``make test`` - Run the test suite with pytest using the default Python version.
+  For testing across multiple Python versions, use ``tox`` instead.
+
+* ``make pr`` - Run a complete pre-PR check: clean build artifacts, fix formatting,
+  run linting, type checking, and tests. This is the recommended command to run
+  before submitting a pull request.
+
+* ``make coverage`` - Run tests with coverage reporting and open the HTML report
+  in your browser.
+
+For a full list of available commands, run ``make help``.
+
 7. Commit your changes and push your branch to GitHub::
 
     $ git add .

README.rst

Lines changed: 32 additions & 2 deletions
@@ -61,6 +61,28 @@ Sample Usage
     >>> decode(encode('base2', b'hello world'))
     b'hello world'
 
+    >>> # Using reusable Encoder/Decoder classes
+    >>> from multibase import Encoder, Decoder
+    >>> encoder = Encoder('base64')
+    >>> encoded1 = encoder.encode('data1')
+    >>> encoded2 = encoder.encode('data2')
+
+    >>> decoder = Decoder()
+    >>> decoded = decoder.decode(encoded1)
+
+    >>> # Getting encoding information
+    >>> from multibase import get_encoding_info, list_encodings, is_encoding_supported
+    >>> info = get_encoding_info('base64')
+    >>> print(info.encoding, info.code)
+    base64 b'm'
+    >>> all_encodings = list_encodings()
+    >>> is_encoding_supported('base64')
+    True
+
+    >>> # Decode with encoding return
+    >>> encoding, data = decode(encoded1, return_encoding=True)
+    >>> print(f'Encoded with {encoding}: {data}')
+
 
 Supported codecs
 ================
@@ -69,14 +91,22 @@ Supported codecs
 * base8
 * base10
 * base16
-* base16
-* base16
+* base16upper
 * base32hex
+* base32hexupper
+* base32hexpad
+* base32hexpadupper
 * base32
+* base32upper
+* base32pad
+* base32padupper
 * base32z
 * base36
 * base36upper
 * base58flickr
 * base58btc
 * base64
+* base64pad
 * base64url
+* base64urlpad
+* base256emoji
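
Note on the padded codecs: the difference between `base64` and `base64pad` in the list above can be sketched with the standard library alone. The snippet below is an illustration, not the library's implementation. It assumes the multibase prefix codes `m` (base64, shown in the README sample) and `M` (base64pad, taken from the multibase specification); the function names are hypothetical. The decoder returns an `(encoding, data)` pair, in the same spirit as `decode(..., return_encoding=True)`.

```python
import base64


def encode_base64_variant(data: bytes, pad: bool) -> bytes:
    """Prefix the payload with its multibase code; keep '=' only when padded."""
    body = base64.b64encode(data)
    if not pad:
        body = body.rstrip(b"=")
    return (b"M" if pad else b"m") + body


def decode_base64_variant(encoded: bytes):
    """Return (encoding name, decoded bytes) based on the one-byte prefix."""
    code, body = encoded[:1], encoded[1:]
    names = {b"m": "base64", b"M": "base64pad"}
    if code not in names:
        raise ValueError(f"Unsupported prefix: {code!r}")
    # Normalize: strip any padding, then restore it so the stdlib decoder accepts it.
    body = body.rstrip(b"=")
    body += b"=" * (-len(body) % 4)
    return names[code], base64.b64decode(body)


print(encode_base64_variant(b"hello", pad=False))
print(encode_base64_variant(b"hello", pad=True))
print(decode_base64_variant(b"maGVsbG8"))
```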

multibase/__init__.py

Lines changed: 20 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -4,4 +4,23 @@
44
__email__ = "[email protected]"
55
__version__ = "1.0.3"
66

7-
from .multibase import ENCODINGS, Encoding, decode, encode, get_codec, is_encoded # noqa: F401
7+
from .exceptions import ( # noqa: F401
8+
DecodingError,
9+
InvalidMultibaseStringError,
10+
MultibaseError,
11+
UnsupportedEncodingError,
12+
)
13+
from .multibase import ( # noqa: F401
14+
ENCODINGS,
15+
ComposedDecoder,
16+
Decoder,
17+
Encoder,
18+
Encoding,
19+
decode,
20+
encode,
21+
get_codec,
22+
get_encoding_info,
23+
is_encoded,
24+
is_encoding_supported,
25+
list_encodings,
26+
)

multibase/converters.py

Lines changed: 144 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -28,17 +28,38 @@ def decode(self, bytes):
2828

2929

3030
class Base16StringConverter(BaseStringConverter):
31+
def __init__(self, digits):
32+
super().__init__(digits)
33+
self.uppercase = digits.isupper()
34+
3135
def encode(self, bytes):
32-
return ensure_bytes("".join([f"{byte:02x}" for byte in bytes]))
36+
result = "".join([f"{byte:02x}" for byte in bytes])
37+
if self.uppercase:
38+
result = result.upper()
39+
return ensure_bytes(result)
40+
41+
def decode(self, data):
42+
# Base16 decode is case-insensitive, normalize to our digits case
43+
if isinstance(data, bytes):
44+
data_str = data.decode("utf-8")
45+
else:
46+
data_str = data
47+
# Convert to match our digits case
48+
if self.uppercase:
49+
data_str = data_str.upper()
50+
else:
51+
data_str = data_str.lower()
52+
return super().decode(data_str.encode("utf-8"))
3353

3454

3555
class BaseByteStringConverter:
3656
ENCODE_GROUP_BYTES = 1
3757
ENCODING_BITS = 1
3858
DECODING_BITS = 1
3959

40-
def __init__(self, digits):
60+
def __init__(self, digits, pad=False):
4161
self.digits = digits
62+
self.pad = pad
4263

4364
def _chunk_with_padding(self, iterable, n, fillvalue=None):
4465
"Collect data into fixed-length chunks or blocks"
@@ -49,9 +70,11 @@ def _chunk_with_padding(self, iterable, n, fillvalue=None):
4970
def _chunk_without_padding(self, iterable, n):
5071
return map("".join, zip(*[iter(iterable)] * n))
5172

52-
def _encode_bytes(self, bytes_, group_bytes, encoding_bits, decoding_bits):
73+
def _encode_bytes(self, bytes_, group_bytes, encoding_bits, decoding_bits, output_chars):
5374
buffer = BytesIO(bytes_)
5475
encoded_bytes = BytesIO()
76+
input_length = len(bytes_)
77+
5578
while True:
5679
byte_ = buffer.read(group_bytes)
5780
if not byte_:
@@ -67,9 +90,26 @@ def _encode_bytes(self, bytes_, group_bytes, encoding_bits, decoding_bits):
6790
# convert binary representation to an integer
6891
encoded_bytes.write(ensure_bytes(self.digits[digit]))
6992

70-
return encoded_bytes.getvalue()
93+
result = encoded_bytes.getvalue()
94+
95+
# Add padding if needed (RFC 4648)
96+
if self.pad:
97+
remainder = input_length % group_bytes
98+
if remainder > 0:
99+
# For partial groups, we need to pad the output
100+
# The padding makes the output length a multiple of output_chars
101+
chars_produced = len(result)
102+
# Calculate padding needed to reach next multiple of output_chars
103+
padding_needed = output_chars - (chars_produced % output_chars)
104+
result += ensure_bytes("=" * padding_needed)
105+
106+
return result
71107

72108
def _decode_bytes(self, bytes_, group_bytes, decoding_bits, encoding_bits):
109+
# Remove padding if present
110+
if self.pad:
111+
bytes_ = bytes_.rstrip(b"=")
112+
73113
buffer = BytesIO()
74114
decoded_bytes = BytesIO()
75115

@@ -104,20 +144,118 @@ def decode(self, bytes):
104144

105145
class Base64StringConverter(BaseByteStringConverter):
106146
def encode(self, bytes):
107-
return self._encode_bytes(ensure_bytes(bytes), 3, 8, 6)
147+
return self._encode_bytes(ensure_bytes(bytes), 3, 8, 6, 4)
108148

109149
def decode(self, bytes):
110150
return self._decode_bytes(ensure_bytes(bytes), 4, 6, 8)
111151

112152

113153
class Base32StringConverter(BaseByteStringConverter):
114154
def encode(self, bytes):
115-
return self._encode_bytes(ensure_bytes(bytes), 5, 8, 5)
155+
return self._encode_bytes(ensure_bytes(bytes), 5, 8, 5, 8)
116156

117157
def decode(self, bytes):
118158
return self._decode_bytes(ensure_bytes(bytes), 8, 5, 8)
119159

120160

161+
class Base256EmojiConverter:
162+
"""Base256 emoji encoding using 256 unique emoji characters.
163+
164+
This implementation uses the exact same hardcoded emoji alphabet as
165+
js-multiformats and go-multibase reference implementations to ensure
166+
full compatibility. The alphabet is curated from Unicode emoji frequency
167+
data, excluding modifier-based emojis (such as flags) that are bigger
168+
than one single code point.
169+
"""
170+
171+
# Hardcoded emoji alphabet matching js-multiformats and go-multibase
172+
# This is the exact same alphabet used in reference implementations
173+
# Source: js-multiformats/src/bases/base256emoji.ts and go-multibase/base256emoji.go
174+
_EMOJI_ALPHABET = (
175+
"🚀🪐☄🛰🌌" # Space
176+
"🌑🌒🌓🌔🌕🌖🌗🌘" # Moon
177+
"🌍🌏🌎" # Earth
178+
"🐉" # Dragon
179+
"☀" # Sun
180+
"💻🖥💾💿" # Computer
181+
# Rest from Unicode emoji frequency data (most used first)
182+
"😂❤😍🤣😊🙏💕😭😘👍"
183+
"😅👏😁🔥🥰💔💖💙😢🤔"
184+
"😆🙄💪😉☺👌🤗💜😔😎"
185+
"😇🌹🤦🎉💞✌✨🤷😱😌"
186+
"🌸🙌😋💗💚😏💛🙂💓🤩"
187+
"😄😀🖤😃💯🙈👇🎶😒🤭"
188+
"❣😜💋👀😪😑💥🙋😞😩"
189+
"😡🤪👊🥳😥🤤👉💃😳✋"
190+
"😚😝😴🌟😬🙃🍀🌷😻😓"
191+
"⭐✅🥺🌈😈🤘💦✔😣🏃"
192+
"💐☹🎊💘😠☝😕🌺🎂🌻"
193+
"😐🖕💝🙊😹🗣💫💀👑🎵"
194+
"🤞😛🔴😤🌼😫⚽🤙☕🏆"
195+
"🤫👈😮🙆🍻🍃🐶💁😲🌿"
196+
"🧡🎁⚡🌞🎈❌✊👋😰🤨"
197+
"😶🤝🚶💰🍓💢🤟🙁🚨💨"
198+
"🤬✈🎀🍺🤓😙💟🌱😖👶"
199+
"🥴▶➡❓💎💸⬇😨🌚🦋"
200+
"😷🕺⚠🙅😟😵👎🤲🤠🤧"
201+
"📌🔵💅🧐🐾🍒😗🤑🌊🤯"
202+
"🐷☎💧😯💆👆🎤🙇🍑❄"
203+
"🌴💣🐸💌📍🥀🤢👅💡💩"
204+
"👐📸👻🤐🤮🎼🥵🚩🍎🍊"
205+
"👼💍📣🥂"
206+
)
207+
208+
def __init__(self):
209+
# Verify alphabet length
210+
if len(self._EMOJI_ALPHABET) != 256:
211+
raise ValueError(f"EMOJI_ALPHABET must contain exactly 256 characters, got {len(self._EMOJI_ALPHABET)}")
212+
# Create mapping from byte value to emoji character
213+
self.byte_to_emoji = {i: self._EMOJI_ALPHABET[i] for i in range(256)}
214+
# Create reverse mapping from emoji character to byte value
215+
# This matches the approach in js-multiformats and go-multibase
216+
self.emoji_to_byte = {emoji: byte for byte, emoji in self.byte_to_emoji.items()}
217+
218+
def encode(self, bytes_) -> bytes:
219+
"""Encode bytes to emoji string.
220+
221+
:param bytes_: Bytes to encode
222+
:type bytes_: bytes or str
223+
:return: UTF-8 encoded emoji string
224+
:rtype: bytes
225+
"""
226+
bytes_ = ensure_bytes(bytes_)
227+
result = []
228+
for byte_val in bytes_:
229+
result.append(self.byte_to_emoji[byte_val])
230+
return "".join(result).encode("utf-8")
231+
232+
def decode(self, bytes_) -> bytes:
233+
"""Decode emoji string to bytes.
234+
235+
Decodes character-by-character, matching the behavior of js-multiformats
236+
and go-multibase reference implementations. Each emoji in the alphabet
237+
is a single Unicode code point, so we can safely iterate character by
238+
character.
239+
240+
:param bytes_: UTF-8 encoded emoji string
241+
:type bytes_: bytes or str
242+
:return: Decoded bytes
243+
:rtype: bytes
244+
:raises ValueError: if an invalid emoji character is encountered
245+
"""
246+
bytes_ = ensure_bytes(bytes_, "utf8")
247+
# Decode UTF-8 to get emoji string
248+
emoji_str = bytes_.decode("utf-8")
249+
result = bytearray()
250+
# Iterate character by character (Python string iteration handles
251+
# single code point emojis correctly, matching js-multiformats and go-multibase)
252+
for char in emoji_str:
253+
if char not in self.emoji_to_byte:
254+
raise ValueError(f"Non-base256emoji character: {char}")
255+
result.append(self.emoji_to_byte[char])
256+
return bytes(result)
257+
258+
121259
class IdentityConverter:
122260
def encode(self, x):
123261
return x
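
Note on the padding arithmetic: the RFC 4648 padding added in `_encode_bytes` can be checked independently of the converter classes. The sketch below is a minimal standalone version, assuming the same parameters the diff passes in (6 bits per character and 4-character groups for base64, 5 bits and 8-character groups for base32). The helper names are hypothetical, and Python's `base64` module is used only as a reference for the expected lengths. Unlike the diff, which computes padding only when the input ends in a partial group, this helper also handles the zero-remainder case explicitly.

```python
import base64


def unpadded_length(input_length: int, bits_per_char: int) -> int:
    """Characters needed to carry input_length bytes (ceiling division)."""
    return (input_length * 8 + bits_per_char - 1) // bits_per_char


def padding_needed(chars_produced: int, output_chars: int) -> int:
    """Count of '=' needed to reach the next multiple of output_chars."""
    remainder = chars_produced % output_chars
    return 0 if remainder == 0 else output_chars - remainder


def padded_length(input_length: int, bits_per_char: int, output_chars: int) -> int:
    chars = unpadded_length(input_length, bits_per_char)
    return chars + padding_needed(chars, output_chars)


# Compare against the standard library on the RFC 4648 test inputs.
for data in (b"", b"f", b"fo", b"foo", b"foob", b"fooba"):
    assert padded_length(len(data), 6, 4) == len(base64.b64encode(data))
    assert padded_length(len(data), 5, 8) == len(base64.b32encode(data))
```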

multibase/exceptions.py

Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
+"""Custom exceptions for multibase encoding/decoding errors."""
+
+
+class MultibaseError(ValueError):
+    """Base exception for all multibase errors."""
+
+    pass
+
+
+class UnsupportedEncodingError(MultibaseError):
+    """Raised when an encoding is not supported."""
+
+    pass
+
+
+class InvalidMultibaseStringError(MultibaseError):
+    """Raised when a multibase string is invalid or cannot be decoded."""
+
+    pass
+
+
+class DecodingError(MultibaseError):
+    """Raised when decoding fails."""
+
+    pass
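
Note on the exception hierarchy: because `MultibaseError` derives from `ValueError`, callers that already catch `ValueError` should keep working, while new code can catch the narrower classes. The sketch below redefines the classes locally so it runs on its own; `require_supported`, `describe_failure`, and the `KNOWN` set are hypothetical helpers for illustration and are not part of the package.

```python
class MultibaseError(ValueError):
    """Base exception for all multibase errors."""


class UnsupportedEncodingError(MultibaseError):
    """Raised when an encoding is not supported."""


class DecodingError(MultibaseError):
    """Raised when decoding fails."""


KNOWN = {"base16", "base32", "base64"}  # hypothetical subset, for illustration


def require_supported(name: str) -> str:
    if name not in KNOWN:
        raise UnsupportedEncodingError(f"Unsupported encoding: {name}")
    return name


def describe_failure(name: str) -> str:
    """Catch via the broad ValueError, as pre-existing caller code might."""
    try:
        require_supported(name)
    except ValueError as exc:
        # The concrete subclass is still available for finer-grained handling.
        return type(exc).__name__
    return "ok"


print(describe_failure("base64"))
print(describe_failure("base99"))
```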
