An experiment in generating domain-based meaning particles from raw linguistic data.

surmeliugur/token-to-marble

Token-to-Marble

Token-to-Marble is an experimental research repository exploring how to move from token-centric representations of language to meaning-centric semantic entities called Marbles.

A token is a surface symbol.

A Marble is a structured semantic unit with measurable properties, both in a vector space and in a corpus, so that meaning can be located, compared, and eventually composed.

This repository focuses on construction, measurement, and reproducibility — not grand claims.


Core idea

Most NLP pipelines treat tokens as flat, discrete symbols. This project asks a different question:

What if meaning is not only represented, but also behaves (clusters, attracts, repels) in a measurable semantic space?

Token → Marble is an attempt to add a layer on top of embeddings, where we can:

  • define stable meaning anchors,
  • measure cross-lingual stability,
  • and build higher-level semantic structures from those anchors.

What is a Marble?

A Marble is a semantic particle-like data structure with properties such as:

  • Position: derived from vector embeddings (e.g., FastText / Word2Vec) — where a concept sits in semantic space.
  • Mass / Density: derived from corpus statistics (frequency, dispersion) — a proxy for “semantic pull” in a given corpus.
  • Interactions: attraction / repulsion / resonance between Marbles, computed from proximity and calibrated constraints.

A Marble is not a metaphor in this repo. It’s a measurable structure we can compute and test.
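
To make the three properties concrete, here is a minimal sketch of what such a structure could look like. It is not the repository's implementation: the field names, the `mass` formula (log-scaled frequency weighted by dispersion), and the sign-based `interaction` rule are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class Marble:
    """Hypothetical container for one semantic unit (field names are illustrative)."""
    label: str
    position: tuple[float, ...]  # embedding vector, e.g. from FastText
    frequency: int               # raw count in the corpus
    dispersion: float            # share of documents containing the unit, in [0, 1]

    @property
    def mass(self) -> float:
        # Assumed proxy for "semantic pull": log-scaled frequency times dispersion.
        return math.log1p(self.frequency) * self.dispersion


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def interaction(a: Marble, b: Marble, threshold: float = 0.0) -> str:
    """Assumed rule: similarity above the threshold attracts, otherwise repels."""
    return "attraction" if cosine(a.position, b.position) > threshold else "repulsion"


joy = Marble("joy", (1.0, 0.0), frequency=100, dispersion=0.5)
grief = Marble("grief", (-1.0, 0.0), frequency=50, dispersion=0.4)
print(interaction(joy, grief))
```

A calibrated version would presumably replace the fixed threshold with constraints learned during field calibration; the sketch only shows that position, mass, and interaction can each be computed from embeddings and corpus counts.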


Repository scope (what this repo is / is not)

This is:

  • an exploratory research codebase,
  • a reproducible pipeline for semantic experiments,
  • a controlled sandbox for measuring cross-lingual meaning stability.

This is not:

  • a finished model,
  • a production system,
  • a claim of “new physics.”

Treat all outputs as experimental.


Phase-0 (current): Cross-lingual meaning stability probe (EN / DE / FR / TR)

Before generating any “Marbles,” we validate a prerequisite:

Do core human affective states (“core feelings”) form stable neighborhoods across languages in aligned semantic space?

We run a small seed set (“his20”) through aligned FastText vector spaces and measure:

  • whether each anchor can be found (rank),
  • how strongly it matches (sim),
  • and how stable it is across languages (anchor_sim_mean).
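
The three measurements could be computed roughly as follows. This is a sketch under stated assumptions, not the probe's actual code: it assumes `rank` is the 1-based position of the expected translation among cosine neighbours of the source-language vector, `sim` is that cosine value, and `anchor_sim_mean` is the mean of `sim` over the languages where the anchor was found. The function names and toy vectors are hypothetical.

```python
import math
from statistics import mean


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def probe(query_vec, target_space, expected_word):
    """Return (rank, sim) of expected_word among neighbours of query_vec.

    rank is 1-based; (None, None) means the word is missing from the space.
    """
    if expected_word not in target_space:
        return None, None
    sims = {word: cosine(query_vec, vec) for word, vec in target_space.items()}
    target_sim = sims[expected_word]
    # Rank = 1 + number of words that are strictly closer to the query.
    rank = 1 + sum(1 for s in sims.values() if s > target_sim)
    return rank, target_sim


def anchor_sim_mean(sims):
    """Mean similarity over the languages where the anchor was found."""
    found = [s for s in sims if s is not None]
    return mean(found) if found else None


# Toy 2-d "aligned" spaces, invented for the example.
en = {"joy": (1.0, 0.0)}
de = {"freude": (0.9, 0.1), "trauer": (-1.0, 0.0)}
fr = {"joie": (0.8, 0.2), "chagrin": (-0.9, 0.1)}

results = [probe(en["joy"], de, "freude"), probe(en["joy"], fr, "joie")]
print(results, anchor_sim_mean([sim for _, sim in results]))
```

With real aligned vectors the brute-force scan over the vocabulary would be slow; a matrix product over normalized vectors is the usual replacement.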

Why this matters for Marble work

If meaning clusters are cross-lingually stable, then “Marble candidates” can be defined as language-independent semantic attractors rather than as language-specific tokens.

Inputs

  • Aligned FastText vectors (examples):
    • wiki.en.align.vec
    • wiki.de.align.vec
    • wiki.fr.align.vec
    • wiki.tr.align.vec
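
Files like these use the fastText text format: a header line with the vocabulary size and the dimension, then one word per line followed by its vector components. A minimal loader might look like the sketch below; the function name `load_vec` and the `limit` parameter are hypothetical, and the in-memory sample stands in for a real file.

```python
import io
from itertools import islice


def load_vec(stream, limit=None):
    """Read word vectors from a text stream in .vec format.

    The first line is '<vocab_size> <dim>'; each later line is a word
    followed by <dim> floats. `limit` caps the number of rows read.
    """
    header = stream.readline().split()
    dim = int(header[1])
    vectors = {}
    for line in islice(stream, limit):
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip malformed rows (e.g. tokens containing spaces)
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


# In-memory stand-in for a real .vec file.
sample = io.StringIO("2 3\njoy 0.1 0.2 0.3\nfreude 0.3 0.2 0.1\n")
vectors = load_vec(sample)
print(len(vectors), vectors["joy"])
```

For a real run the stream would come from something like `open("wiki.en.align.vec", encoding="utf-8")`, with `limit` set to keep only the most frequent rows, since these files are sorted by frequency and are large.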

Outputs

  • TSV reports per run (e.g., his20_probe.tsv)
  • Seed files (e.g., nk_his20_seed.tsv)

(Folder layout for these will be committed alongside the first public drop.)
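
Until the layout is published, the sketch below shows one way a per-run TSV report could be written with the standard library. The column set (`anchor`, `lang`, `rank`, `sim`) is an assumption based on the measurements named above, not the actual schema of his20_probe.tsv.

```python
import csv
import io

# Assumed column layout; the real report may differ.
FIELDS = ["anchor", "lang", "rank", "sim"]


def write_report(rows, stream):
    """Write probe rows (dicts keyed by FIELDS) as tab-separated text."""
    writer = csv.DictWriter(
        stream, fieldnames=FIELDS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


buf = io.StringIO()
write_report([{"anchor": "joy", "lang": "de", "rank": 1, "sim": 0.9939}], buf)
print(buf.getvalue())
```

Writing to a file instead only requires passing `open(path, "w", encoding="utf-8", newline="")` as the stream.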


Planned pipeline stages (roadmap)

  1. Corpus processing
    • streaming token extraction
    • frequency + dispersion analysis
  2. Token universe construction
    • unique token space
    • frequency-ranked vocabulary
    • candidate selection
  3. Marble generation (experimental)
    • token/anchor → Marble
    • mass/volume/density assignment
    • initial interaction fields
  4. Field calibration & mapping (research)
    • calibration experiments
    • semantic polarity constraints
    • early topography / clustering maps
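
As an illustration of the corpus-processing stage, the sketch below streams tokens from documents and collects frequency and dispersion in one pass. It rests on assumptions: dispersion is taken to be the share of documents containing a token (other measures exist), the tokenizer is a plain `\w+` regex, and all function names are hypothetical. No case folding is applied, because Python's `str.lower()` does not follow Turkish dotted/dotless I rules.

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r"\w+")  # Unicode-aware for str patterns in Python 3


def stream_tokens(text):
    """Yield tokens one at a time without building a full list."""
    for match in TOKEN_RE.finditer(text):
        yield match.group()


def frequency_and_dispersion(documents):
    """Return (freq, dispersion) for an iterable of document strings.

    freq counts every occurrence; dispersion is the fraction of
    documents in which a token appears at least once.
    """
    freq = Counter()
    doc_freq = Counter()
    n_docs = 0
    for doc in documents:
        n_docs += 1
        seen = set()
        for token in stream_tokens(doc):
            freq[token] += 1
            seen.add(token)
        doc_freq.update(seen)
    dispersion = {token: count / n_docs for token, count in doc_freq.items()}
    return freq, dispersion


docs = ["sevinç ve korku", "korku korku", "öfke"]
freq, dispersion = frequency_and_dispersion(docs)
# A frequency-ranked vocabulary falls out of the counter directly.
print(freq.most_common(2), dispersion["korku"])
```

Because `documents` can be any iterable, a generator that reads files lazily keeps memory bounded by the vocabulary size rather than the corpus size.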

Language and data

  • Primary working language: Turkish
  • Cross-lingual validation: EN / DE / FR / TR (expandable)
  • Embeddings are external and replaceable
  • The framework is intended to be language-agnostic

Status

  • Version: v0.1 (experimental)
  • Phase-0 cross-lingual probe: active / stabilizing
  • Marble generation: not yet locked
  • Field calibration: research phase

Minimal philosophy

Meaning is not static. It has structure, weight, and neighborhoods.

This repository is an attempt to model and measure that — carefully, with reproducible steps.


License

Open research use (to be clarified as the project matures).
