Embeddings of Guardian headlines represented as a network by Semnet and visualised in Cosmograph
Semnet constructs graph structures from embeddings, enabling graph-based analysis and operations over collections of embedded documents.
Semnet uses Annoy to perform efficient pair-wise distance calculations, allowing for million-embedding network construction in under ten minutes on consumer hardware.
Graphs are returned as NetworkX objects, opening up a wide range of algorithms for downstream use.
The name "Semnet" derives from semantic network [1], as it was initially designed for an NLP use-case, but the tool will work well with any form of embedded document (e.g., images, audio, or even graphs).
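To make the construction idea concrete, here is a minimal sketch of turning embeddings into an edge list by thresholding pairwise cosine similarity. It is a brute-force illustration only: Semnet uses Annoy for the neighbour search rather than comparing every pair, and the helper names (`cosine_similarity`, `threshold_edges`), the toy vectors and the choice of a similarity (rather than distance) threshold are all assumptions made for this example.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def threshold_edges(vectors, thresh):
    """Return (i, j, similarity) for every pair whose similarity is at least thresh.

    Hypothetical helper: compares all pairs, so it is O(n^2) and only
    suitable for small inputs.
    """
    edges = []
    for i, j in combinations(range(len(vectors)), 2):
        sim = cosine_similarity(vectors[i], vectors[j])
        if sim >= thresh:
            edges.append((i, j, sim))
    return edges


# Toy 2-D "embeddings": the first two point in nearly the same direction
toy_vectors = [
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
]

toy_edges = threshold_edges(toy_vectors, thresh=0.8)
print(toy_edges)  # only the pair (0, 1) clears the threshold
```

Raising the threshold in this sketch removes edges, which matches the "larger values give sparser networks" behaviour noted in the quick-start comment.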
Semnet may be used for:
- Graph algorithms: enrich your data with communities, centrality and much more for downstream use in search, RAG and context engineering
- Deduplication: remove duplicate records (e.g., "Donald Trump", "Donald J. Trump") from datasets
- Exploratory data analysis and visualisation: Cosmograph works brilliantly for large corpora
Exposing the full NetworkX and Annoy APIs, Semnet offers plenty of opportunity for experimentation depending on your use-case.
Check out the launch blog for more about Semnet and the examples for inspiration.
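As a rough sketch of the deduplication and graph-algorithm use cases listed above, the snippet below works on a small hand-built NetworkX graph standing in for one returned by Semnet. The node names, the `weight` edge attribute and its values are illustrative assumptions, not necessarily what Semnet stores; only standard NetworkX calls are used.

```python
import networkx as nx

# Hand-built stand-in for a Semnet graph: nodes are records,
# edges link records judged to be near-duplicates.
G_toy = nx.Graph()
G_toy.add_nodes_from(
    ["Donald Trump", "Donald J. Trump", "Guardian", "The Guardian", "Cosmograph"]
)
G_toy.add_edge("Donald Trump", "Donald J. Trump", weight=0.95)  # illustrative weight
G_toy.add_edge("Guardian", "The Guardian", weight=0.90)  # illustrative weight

# Deduplication: treat each connected component as one entity and
# map every member to a single representative (shortest name, ties broken alphabetically).
canonical = {}
for component in nx.connected_components(G_toy):
    representative = min(component, key=lambda name: (len(name), name))
    for name in component:
        canonical[name] = representative

# Graph algorithms: degree centrality as one example of a per-node enrichment.
centrality = nx.degree_centrality(G_toy)

print(canonical)
print(centrality)
```

Connected components are a deliberately blunt grouping rule: a chain of pairwise-similar records collapses into one entity, so a stricter threshold or a community-detection step may suit noisier data better.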
pip install semnet

from semnet import SemanticNetwork
from sentence_transformers import SentenceTransformer
# Your documents
docs = [
"The cat sat on the mat",
"A cat was sitting on a mat",
"The dog ran in the park",
"I love Python",
"Python is a great programming language",
]
# Generate embeddings (use any embedding provider)
embedding_model = SentenceTransformer("BAAI/bge-base-en-v1.5")
embeddings = embedding_model.encode(docs)
# Create and configure semantic network
sem = SemanticNetwork(thresh=0.3, verbose=True) # Larger values give sparser networks
# Build a NetworkX graph object from your embeddings
G = sem.fit_transform(embeddings, labels=docs)
# Export to pandas using the standalone function
from semnet import to_pandas
nodes, edges = to_pandas(G)

- Python 3.8+
- networkx
- annoy
- numpy
- pandas
- tqdm
Recommended for examples:
- sentence-transformers
- cosmograph
I love network analysis, and have explored embedding-derived semantic networks in the past as an alternative approach to representing, clustering and querying news data.
Semnet started life as a few functions I'd been using for deduplication and disambiguation of structured output from LLMs. I could see a number of potential uses for my code, so I decided to package it up for others to use.
I kicked off the project by hand-refactoring my initial code into the class-based structure that forms the core functionality of the current module.
I then used GitHub Copilot in VS Code to:
- Bootstrap scaffolding, tests, documentation, examples and typing
- Refactor the core methods in the style of the scikit-learn API
- Add additional functionality, e.g., the ability to pass custom data to nodes
- Walk me through deployment to Read the Docs and PyPI
Semnet is a relatively simple project focused on core graph construction functionality. I don't have much in the way of immediate plans to expand it, but I can see the potential for a few future additions:
- Performance optimisations for very large datasets
- Utilities for deduplication, as that's my main use case
- Integration with graph visualisation tools
MIT License
If you use Semnet in academic work, please cite:
@software{semnet,
title={Semnet: Semantic Networks from Embeddings},
author={Ian Goodrich},
year={2025},
url={https://github.com/specialprocedures/semnet}
}

Footnotes
1. Technically speaking, a Semantic Similarity Network (SSN).