Ollama + Standardizing LLM-representations by MaartenGr · Pull Request #2467 · MaartenGr/BERTopic

MaartenGr · 2026-01-03T09:33:05Z

Implements ollama. This is also a first attempt at standardizing the LLM-based representations to make it easier for further development. I might add more features as I continue working on this or merge as is.

Done

Added ollama as a representation
Added structured output for ollama but it only extracts the topic label for now

To Do

Add ollama documentation
Add more structured generations other than the topic label
Create separate .py files for all variations
Update type hinting
Simplify MyLogger

This will be tricky to implement since each topic representation is typically added independently (multiple instances of LiteLLM for example) rather than a single topic representation (one instance of Ollama generating multiple representations).

I'm thinking of creating (data)classes for a single Topic as a way to more easily generate multiple representations and track them. This, however, requires significant changes to the codebase...

MaartenGr · 2026-01-05T09:19:10Z

It seems that downloading the data gives back errors, although it works without any issues locally... Perhaps it has something to do with how the containers are configured in gh-actions.

MaartenGr · 2026-01-20T08:48:34Z

This is merely a test but something I've wanted to do for a while now. This is my first attempt to re-organize much of the codebase such that further development will be MUCH easier. The idea is to have separate dataclasses to organize the topics (in Topics) and documents/images (in Corpus) with the intent of simplifying much of the codebase and reducing the likelihood of bugs appearing.

This revolves a lot around three main classes:

Corpus

The Corpus dataclass contains all information on a document-level, which includes:

Documents
Images
Document/image embeddings
Dimensionality-reduced embeddings
Topic assignment
Probabilities
etc.

The underlying idea is that it is much easier for me to track everything when it is all in one spot, which also simplifies updates. This class therefore contains all input data on a document-level, as well as everything that is generated during fitting (such as embeddings, assignments, probabilities, etc.).

Topic

The Topic dataclass contains all information on a topic-level, which includes:

Topic ID
Topic representations (c-TF-ID, LLM, etc.)
Representative documents/images
Topic Label
Topic Embeddings/c-TF-IDF
Additional. metadata (e.g., nr documents, type of topic)

Compared to the current structure, this makes it significantly easier to track all information related to a single topic and update them when necessary. In turn, this will also make it easier to merge, delete, and update topics (although I still need to implement that).

Topics

The Topics dataclass combines all Topic dataclasses to easily track the collection of Topic. At some point, I also want to use this to make tracking hierarchical topics easier.

This is also used for tracking the predictions and probabilities that were generated in the Corpus object. I'm not entirely sure yet if this is the best place for that, but I also do not want to combine Corpus with Topics. Have to think about this a bit more.

Conclusion

I'm not sure if I want to continue this route, but so far, this seems like a much easier and more elegant approach to BERTopic. In turn, I hope this will also allow for some cool new features I have been thinking of.

…labels

MaartenGr added 3 commits January 3, 2026 10:24

Start of standardizing LLM-representations and structured output

81fb4be

Simplify prompt creation and document truncation

83c1c7a

Simplify inheritance

808576c

Introducing Topics, Topic, Corpus, etc. classes

a844587

MaartenGr added 2 commits January 20, 2026 12:22

Start with variations, more simplification

7aef963

Zero-shot and topics over time

5aecef3

MaartenGr marked this pull request as draft January 21, 2026 09:55

MaartenGr added 12 commits January 21, 2026 11:24

Topics per class

19f9f49

Start integrating polars, reduce code in _bertopic.py

ba233da

Hierarchical Topics - also calculates other representations

8a26dd8

Update get_representative_docs, generate_topic_labels, and set_topic_…

cabaeff

…labels

Update get_topic_info

0d9a30a

Update get_document_info and update_topics

d52a55a

Update reduce_topics

0298751

Update merge_topics and delete_topics

e3cf6a9

Update partial_fit

8611f14

Update serialization

88e1fc5

Update merge_models

abcec99

Simplify here and there

8ef2b3b

MaartenGr mentioned this pull request Feb 3, 2026

BadRequestError: Error code: 400 #2469

Open

1 task

MaartenGr added 2 commits February 20, 2026 11:04

Plotting, tests, variations, etc.

b3827e6

Update representations with newer APIs

e7072db

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ollama + Standardizing LLM-representations#2467

Ollama + Standardizing LLM-representations#2467
MaartenGr wants to merge 20 commits intomasterfrom
structured_output

MaartenGr commented Jan 3, 2026 •

edited

Loading

Uh oh!

MaartenGr commented Jan 5, 2026

Uh oh!

MaartenGr commented Jan 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

MaartenGr commented Jan 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Done

To Do

Uh oh!

MaartenGr commented Jan 5, 2026

Uh oh!

MaartenGr commented Jan 20, 2026

Corpus

Topic

Topics

Conclusion

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

MaartenGr commented Jan 3, 2026 •

edited

Loading