distfeat

distfeat is a standalone Python package for manipulating phonological features.

It provides:

  • bundled phonological feature datasets
  • pluggable feature systems
  • feature geometry and distance functions
  • query and analysis helpers for graphemes and feature sets

distfeat is dependency-free at runtime and is the standalone home for the feature subsystem extracted from alteruphono.

The canonical modern API is built around native representations:

  • use get_representation(...) when you want the system's native feature model
  • use matches(...) and segment_distance(...) for system-native comparison
  • treat get_features(...), partial_match(...), and sound_distance(...) as convenience helpers for categorical systems

Installation

Install from PyPI:

pip install distfeat

Requires Python 3.12+.

Development install:

git clone https://github.com/tresoldi/distfeat.git
cd distfeat
uv venv
uv pip install -e ".[dev]"

Run checks in the project environment:

uv run ruff check .
uv run mypy src
uv run pytest -q
uv run python scripts/verify_examples.py

Core Concepts

The package is organized around:

  • a bundled FeatureDataset
  • a lazy default registry plus explicit Registry instances
  • built-in systems:
    • ipa
    • tresoldi
    • distinctive
    • pbase-hc
    • pbase-jfh
    • pbase-spe
    • pbase-uftc

The package does not define a Sound object. It works directly with graphemes, feature bundles, native multi-state feature tables, scalar dimensions, and matrices.

Quick Start

import distfeat

# Built-in systems
print(distfeat.list_systems())
# ['ipa', 'tresoldi', 'distinctive', 'pbase-hc', 'pbase-jfh', 'pbase-spe', 'pbase-uftc']

# Basic grapheme lookup
print(distfeat.get_features("p"))
# frozenset({'consonant', 'voiceless', 'bilabial', 'stop'})

# Predefined sound classes
print(distfeat.get_class_features("V"))
# frozenset({'vowel'})

# Direct grapheme distance
print(distfeat.distance("a", "e"))

Working With Systems

You can use the lazy default registry through top-level helpers, or you can work with a specific system object.

import distfeat

ipa = distfeat.get_system("ipa")
tresoldi = distfeat.get_system("tresoldi")
distinctive = distfeat.get_system("distinctive")
pbase = distfeat.get_system("pbase-hc")

print(ipa.grapheme_to_features("a"))
print(tresoldi.grapheme_to_features("a"))
print(distinctive.grapheme_to_features("a"))
print(pbase.grapheme_to_representation("a"))

Exact reverse lookup is available when a native representation maps directly to a known grapheme. For categorical systems this is usually a frozenset[str]; for valued systems it can be a dict[str, FeatureState | str] or ValuedFeatures.

import distfeat

ipa = distfeat.get_system("ipa")

grapheme = ipa.features_to_grapheme(
    frozenset({"consonant", "voiced", "bilabial", "stop"})
)
print(grapheme)
# 'b'

Feature Queries

Find Graphemes Matching a Feature Set

Use features_to_graphemes(...) to retrieve all graphemes satisfying a feature query.

By default, matching is partial and uses the semantics of the selected system.

import distfeat

# All vowels in the default system
vowels = distfeat.features_to_graphemes(frozenset({"vowel"}))
print(vowels[:10])

# Voiceless consonants
voiceless_consonants = distfeat.features_to_graphemes(
    frozenset({"consonant", "-voiced"})
)
print(voiceless_consonants[:10])

You can also force exact matching:

import distfeat

ipa = distfeat.get_system("ipa")
features = ipa.grapheme_to_features("a")
print(distfeat.features_to_graphemes(features, exact=True))
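The difference between partial and exact matching can be illustrated on a small hand-made inventory. This is only a sketch and does not use the bundled dataset. It assumes that, for categorical systems, partial matching is a subset test and exact matching is set equality. `TOY_INVENTORY` and `find_graphemes` are hypothetical names, not part of distfeat.

```python
# Toy categorical inventory (hypothetical data, not the bundled dataset).
TOY_INVENTORY: dict[str, frozenset[str]] = {
    "p": frozenset({"consonant", "voiceless", "bilabial", "stop"}),
    "b": frozenset({"consonant", "voiced", "bilabial", "stop"}),
    "a": frozenset({"vowel", "open", "front"}),
}


def find_graphemes(query: frozenset[str], exact: bool = False) -> list[str]:
    """Return the graphemes whose feature bundle matches the query.

    Assumption: partial matching is a subset test, exact matching is equality.
    """
    if exact:
        return sorted(g for g, feats in TOY_INVENTORY.items() if feats == query)
    return sorted(g for g, feats in TOY_INVENTORY.items() if query <= feats)


# Partial: every grapheme carrying both features, whatever else it has.
print(find_graphemes(frozenset({"consonant", "bilabial"})))
# Exact: no grapheme has exactly these two features and nothing more.
print(find_graphemes(frozenset({"consonant", "bilabial"}), exact=True))
# Exact match on a full bundle finds the grapheme it came from.
print(find_graphemes(TOY_INVENTORY["a"], exact=True))
```

Under these assumptions, an underspecified query is useful with partial matching but returns nothing with `exact=True`, which is why exact lookups usually start from a full bundle.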

Native Multi-State Systems

distfeat also supports systems whose native representation is a named feature-value table instead of a categorical set. The bundled P-base-derived systems expose multi-state values such as +, -, n, ., o, and x through FeatureState.

import distfeat

rep = distfeat.get_representation("a", system="pbase-hc")
print(rep.values["syllabic"])
# FeatureState.POSITIVE

matches = distfeat.features_to_graphemes({"syllabic": "+"}, system="pbase-hc")
print(matches[:10])

The bundled P-base table is intentionally described as derived rather than verbatim. The source data contains duplicate IPA rows, including rows with conflicting values in a small number of columns. distfeat merges duplicate rows conservatively:

  • identical duplicate rows collapse into one row
  • if duplicate rows disagree, only the conflicting cells are downgraded to . (FeatureState.DOT)

This preserves a single usable row per grapheme without inventing new positive or negative values where the source disagrees.
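The merge rule described above can be sketched in a few lines. This is an illustration of the stated policy, not the code distfeat uses. The function name `merge_duplicate_rows` and the sample rows are hypothetical, and feature states are written as plain strings rather than `FeatureState` members.

```python
def merge_duplicate_rows(rows: list[dict[str, str]]) -> dict[str, str]:
    """Merge duplicate rows for one grapheme into a single row.

    Cells on which all rows agree are kept; cells on which any two rows
    disagree are downgraded to "." (assumes all rows share the same columns).
    """
    merged = dict(rows[0])
    for row in rows[1:]:
        for feature, value in row.items():
            if merged[feature] != value:
                merged[feature] = "."
    return merged


# Hypothetical duplicate rows that disagree only on "voice".
duplicates = [
    {"syllabic": "-", "voice": "+", "nasal": "-"},
    {"syllabic": "-", "voice": "-", "nasal": "-"},
]
print(merge_duplicate_rows(duplicates))
```

Only the disagreeing cell changes, so the agreeing columns keep the value from the source rows.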

Derive Shared Class Features

Use derive_class_features(...) to compute the strict shared feature intersection of a set of graphemes.

import distfeat

print(distfeat.derive_class_features(["t", "d"]))
# frozenset({'consonant', 'alveolar', 'stop', ...})

print(distfeat.derive_class_features(["t", "d", "s"]))
# fewer shared features than the pair above

For multi-state systems, the result is a dictionary of shared feature states:

import distfeat

print(distfeat.derive_class_features(["t", "d"], system="pbase-hc"))
# {'consonantal': <FeatureState.POSITIVE: '+'>, ...}
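A strict shared intersection is simple to express directly. The sketch below uses hand-made toy data to show the idea for both categorical bundles and multi-state rows. `TOY_FEATURES`, `shared_features`, and `shared_states` are hypothetical names, and the code is not taken from distfeat.

```python
# Toy categorical bundles (hypothetical data, not the bundled dataset).
TOY_FEATURES: dict[str, frozenset[str]] = {
    "t": frozenset({"consonant", "voiceless", "alveolar", "stop"}),
    "d": frozenset({"consonant", "voiced", "alveolar", "stop"}),
    "s": frozenset({"consonant", "voiceless", "alveolar", "fricative"}),
}


def shared_features(graphemes: list[str]) -> frozenset[str]:
    """Features present in every grapheme's bundle (needs at least one grapheme)."""
    bundles = [TOY_FEATURES[g] for g in graphemes]
    return frozenset.intersection(*bundles)


def shared_states(rows: list[dict[str, str]]) -> dict[str, str]:
    """Feature/state pairs on which all multi-state rows agree."""
    first, *rest = rows
    return {f: v for f, v in first.items() if all(r.get(f) == v for r in rest)}


print(sorted(shared_features(["t", "d"])))
print(sorted(shared_features(["t", "d", "s"])))
print(shared_states([
    {"consonantal": "+", "voice": "-"},
    {"consonantal": "+", "voice": "+"},
]))
```

Adding a grapheme can only shrink the shared set, so the three-way intersection in this toy data is smaller than the two-way one.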

Minimal Distinguishing Matrices

Use minimal_matrix(...) to compute the smallest feature set needed to distinguish a given list of graphemes.

import distfeat

matrix = distfeat.minimal_matrix(["t", "d"], system="ipa")
print(matrix.columns)
print(matrix.rows)

For ipa and tresoldi, the matrix is categorical and boolean. For distinctive, it uses scalar dimensions. For P-base-derived systems, it uses native multi-state values.

import distfeat

matrix = distfeat.minimal_matrix(["t", "d", "s"], system="ipa")
print(distfeat.tabulate_matrix(matrix))

Example plain-text output:

grapheme | continuant | voiced
---------+------------+-------
t        | False      | False
d        | False      | True
s        | True       | False
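For readers who want to post-process such tables, the layout above can be reproduced with a short formatter. This is a sketch that mimics the example output. `render_plain` is a hypothetical helper, and nothing here reflects how `tabulate_matrix(...)` is implemented.

```python
def render_plain(columns: list[str], rows: dict[str, dict[str, object]]) -> str:
    """Render a grapheme-by-feature table in the plain-text layout shown above."""
    header = ["grapheme", *columns]
    body = [[g, *(str(values[c]) for c in columns)] for g, values in rows.items()]
    # Each column is as wide as its longest cell, header included.
    widths = [max(len(line[i]) for line in [header, *body]) for i in range(len(header))]

    def fmt(cells: list[str]) -> str:
        return " | ".join(c.ljust(w) for c, w in zip(cells, widths)).rstrip()

    separator = "-+-".join("-" * w for w in widths)
    return "\n".join([fmt(header), separator, *(fmt(line) for line in body)])


table = render_plain(
    ["continuant", "voiced"],
    {
        "t": {"continuant": False, "voiced": False},
        "d": {"continuant": False, "voiced": True},
        "s": {"continuant": True, "voiced": False},
    },
)
print(table)
```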

Markdown output is also supported:

# Reuses `matrix` from the minimal_matrix(["t", "d", "s"], system="ipa") example
print(distfeat.tabulate_matrix(matrix, format="markdown"))

P-base-derived systems render symbolic state values directly:

import distfeat

matrix = distfeat.minimal_matrix(["t", "d"], system="pbase-hc")
print(distfeat.tabulate_matrix(matrix))
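The idea of a minimal distinguishing feature set can be sketched as a brute-force search over feature subsets of increasing size. This is one straightforward way to solve the problem on toy data; whether distfeat uses this strategy is not stated here. `TOY_TABLE` and `minimal_distinguishing_features` are hypothetical names.

```python
from itertools import combinations

# Toy boolean feature table (hypothetical data, not the bundled dataset).
TOY_TABLE: dict[str, dict[str, bool]] = {
    "t": {"consonant": True, "continuant": False, "voiced": False},
    "d": {"consonant": True, "continuant": False, "voiced": True},
    "s": {"consonant": True, "continuant": True, "voiced": False},
}


def minimal_distinguishing_features(graphemes: list[str]) -> tuple[str, ...]:
    """Return the smallest feature subset giving every grapheme a unique row.

    Subsets are tried in order of increasing size, so the first hit is minimal.
    """
    features = sorted(TOY_TABLE[graphemes[0]])
    target = len(set(graphemes))
    for size in range(len(features) + 1):
        for subset in combinations(features, size):
            rows = {tuple(TOY_TABLE[g][f] for f in subset) for g in graphemes}
            if len(rows) == target:
                return subset
    raise ValueError("graphemes cannot be distinguished with these features")


print(minimal_distinguishing_features(["t", "d"]))
print(minimal_distinguishing_features(["t", "d", "s"]))
```

The search is exponential in the number of features, which is acceptable for a sketch on a handful of columns but would need pruning for large feature tables.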

Distinctive Scalars

The distinctive system also exposes scalar representations.

from distfeat import DistinctiveFeatureSystem, load_builtin_dataset

system = DistinctiveFeatureSystem(dataset=load_builtin_dataset())

print(system.grapheme_to_scalars("a"))
print(system.features_to_scalars(system.grapheme_to_features("a")))
print(system.scalars_to_features({"voice": 1.0, "labial": 1.0}))

Distance

System-Based Distance

The default distance(...) helper resolves graphemes through the selected system and uses that system's native distance.

import distfeat

print(distfeat.distance("a", "e"))
print(distfeat.distance("a", "u"))
print(distfeat.distance("p", "b"))
print(distfeat.distance("t", "d", system="pbase-hc"))

Precomputed Distance Matrices

You can also supply a precomputed nested dictionary.

import distfeat

precomputed = {
    "a": {"e": 1.5, "u": 2.0},
    "p": {"b": 0.5},
}

print(distfeat.distance("a", "e", precomputed=precomputed))
print(distfeat.distance("b", "p", precomputed=precomputed))

If a requested pair is missing from the precomputed matrix, the function raises KeyError.
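A lookup over a nested dictionary of this shape might look like the sketch below. It assumes the lookup is symmetric, so that ("b", "p") can be answered from the "p" row. The example above suggests this but does not state it, so treat it as an assumption. `lookup_distance` is a hypothetical helper, not the distfeat implementation.

```python
PRECOMPUTED: dict[str, dict[str, float]] = {
    "a": {"e": 1.5, "u": 2.0},
    "p": {"b": 0.5},
}


def lookup_distance(a: str, b: str, table: dict[str, dict[str, float]]) -> float:
    """Look up a pair in either order; raise KeyError when it is absent.

    Assumption: distances are symmetric, so (a, b) and (b, a) are equivalent.
    """
    if a in table and b in table[a]:
        return table[a][b]
    if b in table and a in table[b]:
        return table[b][a]
    raise KeyError((a, b))


print(lookup_distance("a", "e", PRECOMPUTED))
print(lookup_distance("b", "p", PRECOMPUTED))
```

A pair such as ("a", "p"), which appears in neither order, raises `KeyError` in this sketch.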

Custom Datasets

Load From a Directory

from distfeat import create_registry, load_dataset

dataset = load_dataset(directory="my_feature_data")
registry = create_registry(dataset=dataset)
system = registry.get_system("ipa")

print(system.grapheme_to_features("k"))

Expected files in my_feature_data/:

  • sounds.tsv
  • classes.tsv
  • features.tsv

Bundled P-base-Derived Data

distfeat bundles a derived segment table based on the P-base distribution. The bundled systems are:

  • pbase-hc
  • pbase-jfh
  • pbase-spe
  • pbase-uftc

These systems use the same registry and analysis APIs as the categorical and scalar systems, but operate on native multi-state feature values.

The P-base-derived data is bundled separately from the MIT-licensed code and retains its own attribution and license notice in src/distfeat/data/pbase/.

Build From In-Memory Rows

from distfeat import create_registry, dataset_from_rows
from distfeat.systems.ipa import IPAFeatureSystem

dataset = dataset_from_rows(
    sounds={"a": "open front vowel", "p": "voiceless bilabial consonant stop"},
    classes={"V": ("vowel", "vowel", ["a"])},
    features=[("open", "height"), ("front", "centrality"), ("stop", "manner")],
)

registry = create_registry(dataset=dataset, register_builtin=False)
registry.register("ipa", IPAFeatureSystem(dataset))

print(registry.get_system("ipa").grapheme_to_features("a"))

Explicit Registries

Use explicit registries when you want isolated state instead of the default global registry.

from distfeat import create_registry, load_builtin_dataset

registry = create_registry(dataset=load_builtin_dataset())
registry.set_default("tresoldi")

print(registry.get_system().name)
print(registry.list_systems())

What The Package Does Not Do

The current package intentionally does not provide:

  • a legacy DistFeat facade class
  • the old binary/tristate feature-table interface
  • grapheme2features(..., t_values=False) style +/-/0 rendering
  • vector output modes for feature tables or matrices
  • a command-line interface
  • ML-based distance training

The current public API is built around categorical feature bundles, native multi-state feature tables, scalar dimensions for the distinctive system, and analysis helpers over those representations.

Relationship to alteruphono

alteruphono should be treated as a consumer of distfeat, not the owner of the feature subsystem.
