distfeat is a standalone Python package for manipulating phonological
features.
It provides:
- bundled phonological feature datasets
- pluggable feature systems
- feature geometry and distance functions
- query and analysis helpers for graphemes and feature sets
distfeat is dependency-free at runtime and is the standalone home for the
feature subsystem extracted from alteruphono.
The canonical modern API is built around native representations:
- use get_representation(...) when you want the system's native feature model
- use matches(...) and segment_distance(...) for system-native comparison
- treat get_features(...), partial_match(...), and sound_distance(...) as convenience helpers for categorical systems
Install from PyPI:
pip install distfeat
Requires Python 3.12+.
Development install:
git clone https://github.com/tresoldi/distfeat.git
cd distfeat
uv venv
uv pip install -e ".[dev]"
Run checks in the project environment:
uv run ruff check .
uv run mypy src
uv run pytest -q
uv run python scripts/verify_examples.py
The package is organized around:
- a bundled FeatureDataset
- a lazy default registry plus explicit Registry instances
- built-in systems:
  - ipa
  - tresoldi
  - distinctive
  - pbase-hc
  - pbase-jfh
  - pbase-spe
  - pbase-uftc
The package does not define a Sound object. It works directly with graphemes,
feature bundles, native multi-state feature tables, scalar dimensions, and
matrices.
import distfeat
# Built-in systems
print(distfeat.list_systems())
# ['ipa', 'tresoldi', 'distinctive', 'pbase-hc', 'pbase-jfh', 'pbase-spe', 'pbase-uftc']
# Basic grapheme lookup
print(distfeat.get_features("p"))
# frozenset({'consonant', 'voiceless', 'bilabial', 'stop'})
# Predefined sound classes
print(distfeat.get_class_features("V"))
# frozenset({'vowel'})
# Direct grapheme distance
print(distfeat.distance("a", "e"))
You can use the lazy default registry through top-level helpers, or you can work with a specific system object.
import distfeat
ipa = distfeat.get_system("ipa")
tresoldi = distfeat.get_system("tresoldi")
distinctive = distfeat.get_system("distinctive")
pbase = distfeat.get_system("pbase-hc")
print(ipa.grapheme_to_features("a"))
print(tresoldi.grapheme_to_features("a"))
print(distinctive.grapheme_to_features("a"))
print(pbase.grapheme_to_representation("a"))
Exact reverse lookup is available when a native representation maps directly to
a known grapheme. For categorical systems this is usually a frozenset[str];
for valued systems it can be a dict[str, FeatureState | str] or
ValuedFeatures.
import distfeat
ipa = distfeat.get_system("ipa")
grapheme = ipa.features_to_grapheme(
    frozenset({"consonant", "voiced", "bilabial", "stop"})
)
print(grapheme)
# 'b'
Use features_to_graphemes(...) to retrieve all graphemes satisfying a
feature query.
By default, matching is partial and uses the semantics of the selected system.
import distfeat
# All vowels in the default system
vowels = distfeat.features_to_graphemes(frozenset({"vowel"}))
print(vowels[:10])
# Voiceless consonants
voiceless_consonants = distfeat.features_to_graphemes(
    frozenset({"consonant", "-voiced"})
)
print(voiceless_consonants[:10])
You can also force exact matching:
import distfeat
ipa = distfeat.get_system("ipa")
features = ipa.grapheme_to_features("a")
print(distfeat.features_to_graphemes(features, exact=True))
distfeat also supports systems whose native representation is a named
feature-value table instead of a categorical set. The bundled P-base-derived
systems expose multi-state values such as +, -, n, ., o, and x
through FeatureState.
import distfeat
rep = distfeat.get_representation("a", system="pbase-hc")
print(rep.values["syllabic"])
# FeatureState.POSITIVE
matches = distfeat.features_to_graphemes({"syllabic": "+"}, system="pbase-hc")
print(matches[:10])
The bundled P-base table is intentionally described as derived rather than
verbatim. The source data contains duplicate IPA rows, including rows with
conflicting values in a small number of columns. distfeat merges duplicate
rows conservatively:
- identical duplicate rows collapse into one row
- if duplicate rows disagree, only the conflicting cells are downgraded to . (FeatureState.DOT)
This preserves a single usable row per grapheme without inventing new positive or negative values where the source disagrees.
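The merge rule above can be illustrated with a small, self-contained sketch. It is not distfeat's implementation: the State enum and the merge_rows function are hypothetical stand-ins, the rows are made-up values rather than P-base data, and every duplicate row is assumed to have the same columns.

```python
from enum import Enum


class State(Enum):
    """Hypothetical stand-in for a multi-state feature value."""

    POSITIVE = "+"
    NEGATIVE = "-"
    DOT = "."


def merge_rows(rows):
    """Merge duplicate rows for one grapheme, cell by cell.

    A cell on which all rows agree keeps its value; a cell on which
    the rows disagree is downgraded to State.DOT.
    """
    merged = {}
    for feature in rows[0]:
        values = {row[feature] for row in rows}
        merged[feature] = values.pop() if len(values) == 1 else State.DOT
    return merged


# Two made-up duplicate rows that disagree only on "nasal".
rows = [
    {"voice": State.POSITIVE, "nasal": State.NEGATIVE},
    {"voice": State.POSITIVE, "nasal": State.POSITIVE},
]
merged = merge_rows(rows)
print(merged)
```

Only the conflicting cell changes; the agreeing cell keeps its original value, so no new positive or negative value is introduced.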
Use derive_class_features(...) to compute the strict shared feature
intersection of a set of graphemes.
import distfeat
print(distfeat.derive_class_features(["t", "d"]))
# frozenset({'consonant', 'alveolar', 'stop', ...})
print(distfeat.derive_class_features(["t", "d", "s"]))
# fewer shared features than the pair above
For multi-state systems, the result is a dictionary of shared feature states:
import distfeat
print(distfeat.derive_class_features(["t", "d"], system="pbase-hc"))
# {'consonantal': <FeatureState.POSITIVE: '+'>, ...}
Use minimal_matrix(...) to compute the smallest feature set needed to
distinguish a given list of graphemes.
import distfeat
matrix = distfeat.minimal_matrix(["t", "d"], system="ipa")
print(matrix.columns)
print(matrix.rows)
For ipa and tresoldi, the matrix is categorical and boolean. For
distinctive, it uses scalar dimensions. For P-base-derived systems, it uses
native multi-state values.
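As a rough illustration of what "smallest distinguishing feature set" means in the boolean case, the sketch below searches feature subsets by increasing size until every grapheme has a unique row. This is a brute-force illustration only, not necessarily how minimal_matrix(...) works internally; the TABLE values and the minimal_columns function are hypothetical, and the table is assumed to be non-empty.

```python
from itertools import combinations

# Toy boolean feature table (illustrative values, not distfeat data).
TABLE = {
    "t": {"voiced": False, "continuant": False, "alveolar": True},
    "d": {"voiced": True, "continuant": False, "alveolar": True},
    "s": {"voiced": False, "continuant": True, "alveolar": True},
}


def minimal_columns(table):
    """Return the smallest tuple of features giving every row a unique signature."""
    features = sorted(next(iter(table.values())))
    # Try subsets from smallest to largest, so the first hit is minimal.
    for size in range(len(features) + 1):
        for subset in combinations(features, size):
            signatures = {
                tuple(row[feature] for feature in subset)
                for row in table.values()
            }
            if len(signatures) == len(table):
                return subset
    return None  # some rows are identical on every feature


columns = minimal_columns(TABLE)
print(columns)
# ('continuant', 'voiced')
```

The feature shared by all three graphemes ("alveolar") is dropped because it does not help tell them apart. Exhaustive search is exponential in the number of features, so it is only practical for small tables.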
import distfeat
matrix = distfeat.minimal_matrix(["t", "d", "s"], system="ipa")
print(distfeat.tabulate_matrix(matrix))
Example plain-text output:
grapheme | continuant | voiced
---------+------------+-------
t        | False      | False
d        | False      | True
s        | True       | False
Markdown output is also supported:
print(distfeat.tabulate_matrix(matrix, format="markdown"))
P-base-derived systems render symbolic state values directly:
import distfeat
matrix = distfeat.minimal_matrix(["t", "d"], system="pbase-hc")
print(distfeat.tabulate_matrix(matrix))
The distinctive system also exposes scalar representations.
from distfeat import DistinctiveFeatureSystem, load_builtin_dataset
system = DistinctiveFeatureSystem(dataset=load_builtin_dataset())
print(system.grapheme_to_scalars("a"))
print(system.features_to_scalars(system.grapheme_to_features("a")))
print(system.scalars_to_features({"voice": 1.0, "labial": 1.0}))
The default distance(...) helper resolves graphemes through the selected
system and uses that system's native distance.
import distfeat
print(distfeat.distance("a", "e"))
print(distfeat.distance("a", "u"))
print(distfeat.distance("p", "b"))
print(distfeat.distance("t", "d", system="pbase-hc"))
You can also supply a precomputed nested dictionary.
import distfeat
precomputed = {
    "a": {"e": 1.5, "u": 2.0},
    "p": {"b": 0.5},
}
print(distfeat.distance("a", "e", precomputed=precomputed))
print(distfeat.distance("b", "p", precomputed=precomputed))
If a requested pair is missing from the precomputed matrix, the function raises KeyError.
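The lookup behaviour can be sketched without distfeat. The example above queries ("b", "p") although only the "p" to "b" entry is stored, which suggests that the lookup is order-insensitive; that is an assumption here, and lookup_distance is a hypothetical helper rather than part of the package.

```python
def lookup_distance(a, b, precomputed):
    """Look up a pair in a nested dict, trying both orders.

    Symmetric lookup is an assumption of this sketch. A pair stored
    in neither order raises KeyError.
    """
    for first, second in ((a, b), (b, a)):
        try:
            return precomputed[first][second]
        except KeyError:
            continue
    raise KeyError((a, b))


precomputed = {
    "a": {"e": 1.5, "u": 2.0},
    "p": {"b": 0.5},
}
print(lookup_distance("a", "e", precomputed))  # 1.5
print(lookup_distance("b", "p", precomputed))  # 0.5
```

Storing each pair once and checking both orders halves the size of the table, at the cost of a second dictionary lookup on a miss.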
from distfeat import create_registry, load_dataset
dataset = load_dataset(directory="my_feature_data")
registry = create_registry(dataset=dataset)
system = registry.get_system("ipa")
print(system.grapheme_to_features("k"))
Expected files in my_feature_data/:
- sounds.tsv
- classes.tsv
- features.tsv
distfeat bundles a derived segment table based on the P-base distribution.
The bundled systems are:
- pbase-hc
- pbase-jfh
- pbase-spe
- pbase-uftc
These systems use the same registry and analysis APIs as the categorical and scalar systems, but operate on native multi-state feature values.
The P-base-derived data is bundled separately from the MIT-licensed code and
retains its own attribution and license notice in src/distfeat/data/pbase/.
from distfeat import create_registry, dataset_from_rows
from distfeat.systems.ipa import IPAFeatureSystem
dataset = dataset_from_rows(
    sounds={"a": "open front vowel", "p": "voiceless bilabial consonant stop"},
    classes={"V": ("vowel", "vowel", ["a"])},
    features=[("open", "height"), ("front", "centrality"), ("stop", "manner")],
)
registry = create_registry(dataset=dataset, register_builtin=False)
registry.register("ipa", IPAFeatureSystem(dataset))
print(registry.get_system("ipa").grapheme_to_features("a"))
Use explicit registries when you want isolated state instead of the default global registry.
from distfeat import create_registry, load_builtin_dataset
registry = create_registry(dataset=load_builtin_dataset())
registry.set_default("tresoldi")
print(registry.get_system().name)
print(registry.list_systems())
The current package intentionally does not provide:
- a legacy DistFeat facade class
- the old binary/tristate feature-table interface
- grapheme2features(..., t_values=False) style +/-/0 rendering
- vector output modes for feature tables or matrices
- a command-line interface
- ML-based distance training
The current public API is built around categorical feature bundles, native
multi-state feature tables, scalar dimensions for the distinctive system,
and analysis helpers over those representations.
- docs/index.md for the package overview
- docs/api.md for the public API
- docs/datasets.md for dataset loading
- docs/systems.md for built-in systems
- docs/recipes.md for task-oriented workflows
- docs/development.md for implementation constraints
alteruphono should be treated as a consumer of distfeat, not the owner of
the feature subsystem.