Token-to-Marble is an experimental research repository exploring how to move from token-centric representations of language to meaning-centric semantic entities called Marbles.
A token is a surface symbol.
A Marble is a structured semantic unit with measurable properties in a vector space and in a corpus—so that meaning can be treated as something that can be located, compared, and eventually composed.
This repository focuses on construction, measurement, and reproducibility — not grand claims.
Most NLP pipelines treat tokens as flat, discrete symbols. This project asks a different question:
What if meaning is not only represented, but also behaves (clusters, attracts, repels) in a measurable semantic space?
Token → Marble is an attempt to add a layer on top of embeddings, where we can:
- define stable meaning anchors,
- measure cross-lingual stability,
- and build higher-level semantic structures from those anchors.
A Marble is a semantic particle-like data structure with properties such as:
- Position: derived from vector embeddings (e.g., FastText / Word2Vec) — where a concept sits in semantic space.
- Mass / Density: derived from corpus statistics (frequency, dispersion) — a proxy for “semantic pull” in a given corpus.
- Interactions: attraction / repulsion / resonance between Marbles, computed from proximity and calibrated constraints.
A Marble is not a metaphor in this repo. It’s a measurable structure we can compute and test.
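To make the properties above concrete, here is a minimal sketch of what such a structure could look like. Everything in it is an assumption for illustration: the `Marble` class, its field names, and the `interaction` formula (geometric mean of the masses times cosine similarity) are hypothetical and are not the repository's actual schema or calibrated interaction model.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Marble:
    """Illustrative container; field names are assumptions, not the repo's schema."""
    label: str
    position: Sequence[float]  # embedding vector (e.g., from FastText / Word2Vec)
    mass: float                # assumed to be derived from corpus frequency
    density: float             # assumed to be derived from corpus dispersion


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def interaction(a: Marble, b: Marble) -> float:
    """Toy signed interaction (assumed form): > 0 attraction, < 0 repulsion."""
    return math.sqrt(a.mass * b.mass) * cosine(a.position, b.position)


# Toy values, not measured data.
joy = Marble("joy", (1.0, 0.0), mass=4.0, density=0.5)
delight = Marble("delight", (1.0, 0.0), mass=1.0, density=0.5)
grief = Marble("grief", (-1.0, 0.0), mass=1.0, density=0.5)
print(interaction(joy, delight), interaction(joy, grief))
```

With these toy vectors, aligned positions yield a positive value and opposed positions a negative one, which is the minimum behavior an attraction / repulsion measure needs before any calibration is applied.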
This is:
- an exploratory research codebase,
- a reproducible pipeline for semantic experiments,
- a controlled sandbox for measuring cross-lingual meaning stability.
This is not:
- a finished model,
- a production system,
- a claim of “new physics.”
Treat all outputs as experimental.
Before generating any “Marbles,” we validate a prerequisite:
Do core human affective states (“core feelings”) form stable neighborhoods across languages in aligned semantic space?
We run a small seed set (“his20”) through aligned FastText vector spaces and measure:
- whether each anchor can be found (`rank`),
- how strongly it matches (`sim`),
- and how stable it is across languages (`anchor_sim_mean`).
If meaning clusters are cross-lingually stable, then “Marble candidates” can be defined as language-independent semantic attractors rather than as language-specific tokens.
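The three probe measurements can be sketched as follows. The exact definitions are not spelled out here, so this example assumes one plausible reading: `rank` is the 1-based position of the expected translation among a target language's words when sorted by cosine similarity to the anchor vector, `sim` is that cosine similarity, and `anchor_sim_mean` is the mean of `sim` over the languages where the anchor was found. The function names, the toy vectors, and the word pairs are all hypothetical.

```python
import math
from typing import Dict, Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def probe(anchor_vec: Vector, space: Dict[str, Vector],
          expected: str) -> Tuple[Optional[int], Optional[float]]:
    """Return (rank, sim) of `expected` in `space`; (None, None) if it is absent."""
    if expected not in space:
        return None, None
    sim = cosine(anchor_vec, space[expected])
    # Rank = 1 + number of other words that are strictly closer to the anchor.
    closer = sum(1 for word, vec in space.items()
                 if word != expected and cosine(anchor_vec, vec) > sim)
    return closer + 1, sim


def anchor_sim_mean(sims: Sequence[Optional[float]]) -> Optional[float]:
    """Mean similarity over the languages where the anchor was found."""
    found = [s for s in sims if s is not None]
    return sum(found) / len(found) if found else None


# Toy 2-d "aligned" spaces; real runs would use aligned FastText vectors.
en_joy = (1.0, 0.0)
de = {"freude": (0.9, 0.1), "trauer": (-1.0, 0.0)}
tr = {"sevinç": (0.8, 0.6), "keder": (-0.5, 0.5)}

results = {"de": probe(en_joy, de, "freude"), "tr": probe(en_joy, tr, "sevinç")}
mean_sim = anchor_sim_mean([sim for _, sim in results.values()])
print(results, mean_sim)
```

The brute-force scan is only suitable for small seed sets; a full-vocabulary probe would normally use a matrix product or an approximate nearest-neighbor index instead.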
- Aligned FastText vectors (examples): `wiki.en.align.vec`, `wiki.de.align.vec`, `wiki.fr.align.vec`, `wiki.tr.align.vec`
- TSV reports per run (e.g., `his20_probe.tsv`)
- Seed files (e.g., `nk_his20_seed.tsv`)
(Folder layout for these will be committed alongside the first public drop.)
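For orientation, FastText `.vec` files are plain text: a header line with the vocabulary size and vector dimension, followed by one line per word with the word and its components separated by spaces. The sketch below reads that format using only the standard library. The `read_vec` function and its `limit` parameter are hypothetical helpers for illustration, not part of this repository.

```python
import io
from typing import Dict, Iterable, List, Optional


def read_vec(lines: Iterable[str], limit: Optional[int] = None) -> Dict[str, List[float]]:
    """Parse .vec-style text into {word: vector}, keeping at most `limit` words."""
    it = iter(lines)
    header = next(it).split()
    dim = int(header[1])  # header is "<vocab_size> <dim>"
    vectors: Dict[str, List[float]] = {}
    for line in it:
        if limit is not None and len(vectors) >= limit:
            break
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip malformed rows rather than guess at their layout
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


# Tiny in-memory sample; with a real file you would pass an open file handle,
# e.g. open("wiki.tr.align.vec", encoding="utf-8").
sample = "2 3\njoy 0.1 0.2 0.3\ngrief -0.1 0.0 0.5 \n"
vectors = read_vec(io.StringIO(sample))
print(sorted(vectors))
```

Because these files are usually sorted by frequency, a `limit` keeps memory use small while still covering the most common words, which is often enough for a seed-set probe.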
- Corpus processing
  - streaming token extraction
  - frequency + dispersion analysis
- Token universe construction
  - unique token space
  - frequency-ranked vocabulary
  - candidate selection
- Marble generation (experimental)
  - token/anchor → Marble
  - mass/volume/density assignment
  - initial interaction fields
- Field calibration & mapping (research)
  - calibration experiments
  - semantic polarity constraints
  - early topography / clustering maps
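The first two stages (corpus processing and token universe construction) can be sketched roughly as below. This is an assumption-laden illustration, not the repository's pipeline: the function names are hypothetical, the tokenizer is a simple regular expression, and "dispersion" is taken to mean the fraction of documents that contain a token, which is only one of several possible definitions.

```python
import re
from collections import Counter
from typing import Iterable, Iterator, List, Tuple

TOKEN_RE = re.compile(r"\w+")


def stream_tokens(text: str) -> Iterator[str]:
    """Yield lowercased word tokens one at a time.

    Note: str.lower() is not locale-aware, so Turkish dotted/dotless I
    would need dedicated handling in a real pipeline.
    """
    for match in TOKEN_RE.finditer(text.lower()):
        yield match.group(0)


def corpus_stats(docs: Iterable[str]) -> Tuple[Counter, Counter, int]:
    """Return (token frequency, document frequency, number of documents)."""
    freq: Counter = Counter()
    doc_freq: Counter = Counter()
    n_docs = 0
    for doc in docs:
        n_docs += 1
        tokens = list(stream_tokens(doc))
        freq.update(tokens)
        doc_freq.update(set(tokens))  # count each token once per document
    return freq, doc_freq, n_docs


def select_candidates(freq: Counter, doc_freq: Counter, n_docs: int,
                      min_dispersion: float) -> List[Tuple[str, int, float]]:
    """Frequency-ranked tokens whose dispersion meets the (assumed) threshold."""
    rows = [(tok, count, doc_freq[tok] / n_docs) for tok, count in freq.items()]
    rows = [row for row in rows if row[2] >= min_dispersion]
    return sorted(rows, key=lambda row: (-row[1], row[0]))


# Toy corpus, not real data.
docs = ["sevgi ve korku", "sevgi sevgi", "korku yok"]
freq, doc_freq, n_docs = corpus_stats(docs)
candidates = select_candidates(freq, doc_freq, n_docs, min_dispersion=0.5)
print(candidates)
```

Filtering on dispersion as well as raw frequency keeps tokens that are common across the corpus and drops those that are frequent only inside a single document.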
- Primary working language: Turkish
- Cross-lingual validation: EN / TR / DE / FR (expandable)
- Embeddings are external and replaceable
- The framework is intended to be language-agnostic
- Version: v0.1 (experimental)
- Phase-0 cross-lingual probe: active / stabilizing
- Marble generation: not yet locked
- Field calibration: research phase
Meaning is not static. It has structure, weight, and neighborhoods.
This repository is an attempt to model and measure that — carefully, with reproducible steps.
Open research use (to be clarified as the project matures).