distfeat

distfeat is a standalone Python package for manipulating phonological features.

It provides:

- bundled phonological feature datasets
- pluggable feature systems
- feature geometry and distance functions
- query and analysis helpers for graphemes and feature sets

distfeat is dependency-free at runtime and is the standalone home for the feature subsystem extracted from alteruphono.

The canonical modern API is built around native representations:

- use get_representation(...) when you want the system's native feature model
- use matches(...) and segment_distance(...) for system-native comparison
- treat get_features(...), partial_match(...), and sound_distance(...) as convenience helpers for categorical systems

Installation

Install from PyPI:

pip install distfeat

Requires Python 3.12+.

Development install:

git clone https://github.com/tresoldi/distfeat.git
cd distfeat
uv venv
uv pip install -e ".[dev]"

Run checks in the project environment:

uv run ruff check .
uv run mypy src
uv run pytest -q
uv run python scripts/verify_examples.py

Core Concepts

The package is organized around:

- a bundled FeatureDataset
- a lazy default registry plus explicit Registry instances
- built-in systems:
  - ipa
  - tresoldi
  - distinctive
  - pbase-hc
  - pbase-jfh
  - pbase-spe
  - pbase-uftc

The package does not define a Sound object. It works directly with graphemes, feature bundles, native multi-state feature tables, scalar dimensions, and matrices.

Quick Start

import distfeat

# Built-in systems
print(distfeat.list_systems())
# ['ipa', 'tresoldi', 'distinctive', 'pbase-hc', 'pbase-jfh', 'pbase-spe', 'pbase-uftc']

# Basic grapheme lookup
print(distfeat.get_features("p"))
# frozenset({'consonant', 'voiceless', 'bilabial', 'stop'})

# Predefined sound classes
print(distfeat.get_class_features("V"))
# frozenset({'vowel'})

# Direct grapheme distance
print(distfeat.distance("a", "e"))

Working With Systems

You can use the lazy default registry through top-level helpers, or you can work with a specific system object.

import distfeat

ipa = distfeat.get_system("ipa")
tresoldi = distfeat.get_system("tresoldi")
distinctive = distfeat.get_system("distinctive")
pbase = distfeat.get_system("pbase-hc")

print(ipa.grapheme_to_features("a"))
print(tresoldi.grapheme_to_features("a"))
print(distinctive.grapheme_to_features("a"))
print(pbase.grapheme_to_representation("a"))

Exact reverse lookup is available when a native representation maps directly to a known grapheme. For categorical systems this is usually a frozenset[str]; for valued systems it can be a dict[str, FeatureState | str] or ValuedFeatures.

import distfeat

ipa = distfeat.get_system("ipa")

grapheme = ipa.features_to_grapheme(
    frozenset({"consonant", "voiced", "bilabial", "stop"})
)
print(grapheme)
# 'b'
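To illustrate why a frozenset is a natural key for exact reverse lookup, here is a minimal sketch that does not use distfeat at all. The toy inventory and the build_reverse_index helper are hypothetical, and the sketch makes no claim about how distfeat stores its own index.

```python
# Toy inventory for illustration only; not distfeat's bundled data.
INVENTORY = {
    "p": {"consonant", "voiceless", "bilabial", "stop"},
    "b": {"consonant", "voiced", "bilabial", "stop"},
}


def build_reverse_index(inventory):
    """Map each exact feature set to a grapheme (hypothetical helper)."""
    index = {}
    for grapheme, features in inventory.items():
        # frozenset is hashable, so it can serve as a dictionary key.
        # setdefault keeps the first grapheme seen for a given feature set.
        index.setdefault(frozenset(features), grapheme)
    return index


index = build_reverse_index(INVENTORY)
print(index.get(frozenset({"consonant", "voiced", "bilabial", "stop"})))
# b
print(index.get(frozenset({"consonant", "voiced"})))
# None
```

An incomplete feature set returns nothing in this sketch, which is the difference between exact reverse lookup and the partial queries described in the next section.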

Feature Queries

Find Graphemes Matching a Feature Set

Use features_to_graphemes(...) to retrieve all graphemes satisfying a feature query.

By default, matching is partial and uses the semantics of the selected system.

import distfeat

# All vowels in the default system
vowels = distfeat.features_to_graphemes(frozenset({"vowel"}))
print(vowels[:10])

# Voiceless consonants
voiceless_consonants = distfeat.features_to_graphemes(
    frozenset({"consonant", "-voiced"})
)
print(voiceless_consonants[:10])

You can also force exact matching:

import distfeat

ipa = distfeat.get_system("ipa")
features = ipa.grapheme_to_features("a")
print(distfeat.features_to_graphemes(features, exact=True))
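The difference between partial and exact matching can be sketched with plain Python sets. This is an illustration for categorical feature sets only: the toy inventory and the find_graphemes helper are hypothetical, and the sketch assumes that partial matching means "every queried feature is present", which may not cover system-specific semantics.

```python
# Toy inventory for illustration only; not distfeat's bundled data.
INVENTORY = {
    "p": frozenset({"consonant", "voiceless", "bilabial", "stop"}),
    "b": frozenset({"consonant", "voiced", "bilabial", "stop"}),
    "a": frozenset({"vowel", "open", "front"}),
}


def find_graphemes(query, inventory, exact=False):
    """Return matching graphemes, sorted for stable output (hypothetical helper)."""
    if exact:
        # Exact: the feature sets must be identical.
        return sorted(g for g, feats in inventory.items() if feats == query)
    # Partial: every queried feature must be present; extra features are allowed.
    return sorted(g for g, feats in inventory.items() if query <= feats)


print(find_graphemes(frozenset({"consonant"}), INVENTORY))
# ['b', 'p']
print(find_graphemes(frozenset({"consonant"}), INVENTORY, exact=True))
# []
```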

Native Multi-State Systems

distfeat also supports systems whose native representation is a named feature-value table instead of a categorical set. The bundled P-base-derived systems expose multi-state values such as +, -, n, ., o, and x through FeatureState.

import distfeat

rep = distfeat.get_representation("a", system="pbase-hc")
print(rep.values["syllabic"])
# FeatureState.POSITIVE

matches = distfeat.features_to_graphemes({"syllabic": "+"}, system="pbase-hc")
print(matches[:10])

The bundled P-base table is intentionally described as derived rather than verbatim. The source data contains duplicate IPA rows, including rows with conflicting values in a small number of columns. distfeat merges duplicate rows conservatively:

- identical duplicate rows collapse into one row
- if duplicate rows disagree, only the conflicting cells are downgraded to . (FeatureState.DOT)

This preserves a single usable row per grapheme without inventing new positive or negative values where the source disagrees.
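The merge rule described above can be sketched as follows. This is an illustration under stated assumptions, not distfeat's implementation: the State enum is a hypothetical stand-in for FeatureState, merge_rows is a hypothetical helper, the column names are made up, and all duplicate rows are assumed to share the same columns.

```python
from enum import Enum


class State(Enum):
    """Hypothetical stand-in for a multi-state value; not distfeat's FeatureState."""

    POSITIVE = "+"
    NEGATIVE = "-"
    DOT = "."


def merge_rows(rows):
    """Merge duplicate rows for one grapheme, downgrading only conflicting cells."""
    merged = dict(rows[0])
    for row in rows[1:]:
        for feature, value in row.items():
            if feature in merged and merged[feature] != value:
                # The sources disagree: downgrade this cell only.
                merged[feature] = State.DOT
            else:
                merged[feature] = value
    return merged


# Two duplicate rows that disagree in one (hypothetical) column.
duplicates = [
    {"voice": State.POSITIVE, "nasal": State.NEGATIVE},
    {"voice": State.NEGATIVE, "nasal": State.NEGATIVE},
]
merged = merge_rows(duplicates)
print({name: state.value for name, state in merged.items()})
# {'voice': '.', 'nasal': '-'}
```

Once a cell has been downgraded it stays downgraded in this sketch, so the result does not depend on the order in which further duplicates arrive.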

Derive Shared Class Features

Use derive_class_features(...) to compute the strict shared feature intersection of a set of graphemes.

import distfeat

print(distfeat.derive_class_features(["t", "d"]))
# frozenset({'consonant', 'alveolar', 'stop', ...})

print(distfeat.derive_class_features(["t", "d", "s"]))
# fewer shared features than the pair above

For multi-state systems, the result is a dictionary of shared feature states:

import distfeat

print(distfeat.derive_class_features(["t", "d"], system="pbase-hc"))
# {'consonantal': <FeatureState.POSITIVE: '+'>, ...}
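A strict shared intersection can be sketched in plain Python for both kinds of representation. The helpers and the toy feature values below are hypothetical and only illustrate the idea; they are not distfeat's implementation, and the real feature sets for these graphemes may contain more features.

```python
from functools import reduce


def shared_categorical(feature_sets):
    """Keep only the features present in every set (input must be non-empty)."""
    return frozenset(reduce(lambda a, b: a & b, feature_sets))


def shared_valued(tables):
    """Keep only the feature/value pairs that are identical in every table."""
    first, *rest = tables
    return {
        name: value
        for name, value in first.items()
        if all(name in table and table[name] == value for table in rest)
    }


# Toy categorical sets (assumed values, for illustration).
t = frozenset({"consonant", "voiceless", "alveolar", "stop"})
d = frozenset({"consonant", "voiced", "alveolar", "stop"})
print(sorted(shared_categorical([t, d])))
# ['alveolar', 'consonant', 'stop']

# Toy multi-state tables, with states written as plain strings.
print(shared_valued([
    {"consonantal": "+", "voice": "-"},
    {"consonantal": "+", "voice": "+"},
]))
# {'consonantal': '+'}
```

Adding a third grapheme can only shrink the result, which matches the "fewer shared features" note in the categorical example above.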

Minimal Distinguishing Matrices

Use minimal_matrix(...) to compute the smallest feature set needed to distinguish a given list of graphemes.

import distfeat

matrix = distfeat.minimal_matrix(["t", "d"], system="ipa")
print(matrix.columns)
print(matrix.rows)

For ipa and tresoldi, the matrix is categorical and boolean. For distinctive, it uses scalar dimensions. For P-base-derived systems, it uses native multi-state values.

import distfeat

matrix = distfeat.minimal_matrix(["t", "d", "s"], system="ipa")
print(distfeat.tabulate_matrix(matrix))

Example plain-text output:

grapheme | continuant | voiced
---------+------------+-------
t        | False      | False
d        | False      | True
s        | True       | False

Markdown output is also supported:

print(distfeat.tabulate_matrix(matrix, format="markdown"))

P-base-derived systems render symbolic state values directly:

import distfeat

matrix = distfeat.minimal_matrix(["t", "d"], system="pbase-hc")
print(distfeat.tabulate_matrix(matrix))
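To make the idea of a minimal distinguishing feature set concrete, here is a brute-force sketch in plain Python. It is not how minimal_matrix(...) is implemented (that is not described here); minimal_features is a hypothetical helper, and the consonant column is an assumed extra feature added to the values from the example table above.

```python
from itertools import combinations


def minimal_features(rows):
    """Return the smallest tuple of feature names that separates every row.

    `rows` maps grapheme -> {feature: value}. Subsets are tried by increasing
    size, so the first hit is minimal. Returns None if no subset works.
    """
    features = sorted({name for values in rows.values() for name in values})
    for size in range(len(features) + 1):
        for subset in combinations(features, size):
            # Each row's "signature" is its values on the candidate features.
            signatures = {
                tuple(values.get(name) for name in subset)
                for values in rows.values()
            }
            if len(signatures) == len(rows):
                return subset
    return None


rows = {
    "t": {"continuant": False, "voiced": False, "consonant": True},
    "d": {"continuant": False, "voiced": True, "consonant": True},
    "s": {"continuant": True, "voiced": False, "consonant": True},
}
print(minimal_features(rows))
# ('continuant', 'voiced')
```

The search is exponential in the number of features, so it only suits small illustrations like this one.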

Distinctive Scalars

The distinctive system also exposes scalar representations.

from distfeat import DistinctiveFeatureSystem, load_builtin_dataset

system = DistinctiveFeatureSystem(dataset=load_builtin_dataset())

print(system.grapheme_to_scalars("a"))
print(system.features_to_scalars(system.grapheme_to_features("a")))
print(system.scalars_to_features({"voice": 1.0, "labial": 1.0}))

Distance

System-Based Distance

The default distance(...) helper resolves graphemes through the selected system and uses that system's native distance.

import distfeat

print(distfeat.distance("a", "e"))
print(distfeat.distance("a", "u"))
print(distfeat.distance("p", "b"))
print(distfeat.distance("t", "d", system="pbase-hc"))

Precomputed Distance Matrices

You can also supply a precomputed nested dictionary.

import distfeat

precomputed = {
    "a": {"e": 1.5, "u": 2.0},
    "p": {"b": 0.5},
}

print(distfeat.distance("a", "e", precomputed=precomputed))
print(distfeat.distance("b", "p", precomputed=precomputed))

If a requested pair is missing from the precomputed matrix, the function raises KeyError.
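The example above stores the pair as "p" -> "b" but queries it as ("b", "p"), and whether the lookup is order-insensitive is not spelled out here. If you prefer not to rely on that, one option is to make the nested dictionary symmetric before passing it in. The symmetrize helper below is hypothetical and independent of distfeat.

```python
def symmetrize(matrix):
    """Return a copy of a nested distance dict with both (a, b) and (b, a) present."""
    result = {}
    for a, row in matrix.items():
        for b, dist in row.items():
            result.setdefault(a, {})[b] = dist
            # Fill in the reverse entry without overwriting one supplied explicitly.
            result.setdefault(b, {}).setdefault(a, dist)
    return result


precomputed = {
    "a": {"e": 1.5, "u": 2.0},
    "p": {"b": 0.5},
}
symmetric = symmetrize(precomputed)
print(symmetric["b"]["p"])
# 0.5
print(symmetric["e"]["a"])
# 1.5
```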

Custom Datasets

Load From a Directory

from distfeat import create_registry, load_dataset

dataset = load_dataset(directory="my_feature_data")
registry = create_registry(dataset=dataset)
system = registry.get_system("ipa")

print(system.grapheme_to_features("k"))

Expected files in my_feature_data/:

- sounds.tsv
- classes.tsv
- features.tsv

Bundled P-base-Derived Data

distfeat bundles a derived segment table based on the P-base distribution. The bundled systems are:

- pbase-hc
- pbase-jfh
- pbase-spe
- pbase-uftc

These systems use the same registry and analysis APIs as the categorical and scalar systems, but operate on native multi-state feature values.

The P-base-derived data is bundled separately from the MIT-licensed code and retains its own attribution and license notice in src/distfeat/data/pbase/.

Build From In-Memory Rows

from distfeat import create_registry, dataset_from_rows
from distfeat.systems.ipa import IPAFeatureSystem

dataset = dataset_from_rows(
    sounds={"a": "open front vowel", "p": "voiceless bilabial consonant stop"},
    classes={"V": ("vowel", "vowel", ["a"])},
    features=[("open", "height"), ("front", "centrality"), ("stop", "manner")],
)

registry = create_registry(dataset=dataset, register_builtin=False)
registry.register("ipa", IPAFeatureSystem(dataset))

print(registry.get_system("ipa").grapheme_to_features("a"))

Explicit Registries

Use explicit registries when you want isolated state instead of the default global registry.

from distfeat import create_registry, load_builtin_dataset

registry = create_registry(dataset=load_builtin_dataset())
registry.set_default("tresoldi")

print(registry.get_system().name)
print(registry.list_systems())

What The Package Does Not Do

The current package intentionally does not provide:

- a legacy DistFeat facade class
- the old binary/tristate feature-table interface
- grapheme2features(..., t_values=False) style +/-/0 rendering
- vector output modes for feature tables or matrices
- a command-line interface
- ML-based distance training

The current public API is built around categorical feature bundles, native multi-state feature tables, scalar dimensions for the distinctive system, and analysis helpers over those representations.

Documentation

- docs/index.md for the package overview
- docs/api.md for the public API
- docs/datasets.md for dataset loading
- docs/systems.md for built-in systems
- docs/recipes.md for task-oriented workflows
- docs/development.md for implementation constraints

Relationship to alteruphono

alteruphono should be treated as a consumer of distfeat, not the owner of the feature subsystem.
