@MaartenGr
Owner

@MaartenGr MaartenGr commented Jan 3, 2026

Implements Ollama as a representation model. This is also a first attempt at standardizing the LLM-based representations to make further development easier. I might add more features as I continue working on this, or merge it as is.

Done

  • Added ollama as a representation
  • Added structured output for ollama but it only extracts the topic label for now

To Do

  • Add ollama documentation
  • Add more structured generations other than the topic label
  • Create separate .py files for all variations
  • Update type hinting
  • Simplify MyLogger

Adding more structured generations will be tricky to implement, since each topic representation is typically added independently (multiple instances of LiteLLM, for example) rather than through a single topic representation (one instance of Ollama generating multiple representations).

I'm thinking of creating (data)classes for a single Topic as a way to more easily generate multiple representations and track them. This, however, requires significant changes to the codebase...

@MaartenGr
Owner Author

It seems that downloading the data fails in CI, although it works without any issues locally... Perhaps it has something to do with how the containers are configured in GitHub Actions.

@MaartenGr
Owner Author

This is merely a test but something I've wanted to do for a while now. This is my first attempt to re-organize much of the codebase such that further development will be MUCH easier. The idea is to have separate dataclasses to organize the topics (in Topics) and documents/images (in Corpus) with the intent of simplifying much of the codebase and reducing the likelihood of bugs appearing.

The design revolves around three main classes:

Corpus

The Corpus dataclass contains all information on a document-level, which includes:

  • Documents
  • Images
  • Document/image embeddings
  • Dimensionality-reduced embeddings
  • Topic assignment
  • Probabilities
  • etc.

The underlying idea is that it is much easier for me to track everything when it is all in one spot, which also simplifies updates. This class therefore contains all input data on a document-level, as well as everything that is generated during fitting (such as embeddings, assignments, probabilities, etc.).
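
As a rough illustration of the idea, a document-level container could look something like the sketch below. This is not the code in this PR: the field names (`reduced_embeddings`, `probabilities`, ...) and the `documents_in_topic` helper are hypothetical and only mirror the list above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Corpus:
    """Hypothetical document-level container; all names are illustrative."""

    documents: List[str]
    images: Optional[List[str]] = None                      # e.g. image paths
    embeddings: Optional[List[List[float]]] = None          # one vector per document
    reduced_embeddings: Optional[List[List[float]]] = None  # after dimensionality reduction
    topics: Optional[List[int]] = None                      # one topic ID per document
    probabilities: Optional[List[float]] = None             # one probability per document

    def __len__(self) -> int:
        return len(self.documents)

    def documents_in_topic(self, topic_id: int) -> List[str]:
        """Return the documents currently assigned to `topic_id`."""
        if self.topics is None:
            raise ValueError("No topic assignments yet; fit the model first.")
        return [doc for doc, t in zip(self.documents, self.topics) if t == topic_id]


# Input data goes in at construction; fitted attributes are filled in later.
corpus = Corpus(documents=["cats purr", "dogs bark", "kittens meow"])
corpus.topics = [0, 1, 0]
cat_docs = corpus.documents_in_topic(0)
```

Keeping inputs and fitted attributes on the same object means a lookup like `documents_in_topic` only has to consult one place.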

Topic

The Topic dataclass contains all information on a topic-level, which includes:

  • Topic ID
  • Topic representations (c-TF-IDF, LLM, etc.)
  • Representative documents/images
  • Topic Label
  • Topic Embeddings/c-TF-IDF
  • Additional metadata (e.g., nr documents, type of topic)

Compared to the current structure, this makes it significantly easier to track all information related to a single topic and update it when necessary. In turn, this will also make it easier to merge, delete, and update topics (although I still need to implement that).
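
To make this a bit more concrete, here is a minimal sketch of what a single-topic container might look like. It is an assumption-laden illustration rather than the implementation in this PR: the field names and the `add_representation` helper are hypothetical, and representations are simplified to plain lists of strings.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Topic:
    """Hypothetical topic-level container; all names are illustrative."""

    topic_id: int
    # Maps a representation name (e.g. "c-TF-IDF", "LLM") to its keywords or text.
    representations: Dict[str, List[str]] = field(default_factory=dict)
    representative_docs: List[str] = field(default_factory=list)
    label: Optional[str] = None
    embedding: Optional[List[float]] = None
    metadata: Dict[str, object] = field(default_factory=dict)

    def add_representation(self, name: str, values: List[str]) -> None:
        """Store (or overwrite) one named representation of this topic."""
        self.representations[name] = values


topic = Topic(topic_id=0, metadata={"nr_documents": 2})
topic.add_representation("c-TF-IDF", ["cat", "kitten", "purr"])
topic.add_representation("LLM", ["Domestic cats"])
topic.label = topic.representations["LLM"][0]
```

Because every representation lives on the same object, several of them can be attached to one topic and looked up by name.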

Topics

The Topics dataclass combines all Topic dataclasses to easily track the collection of Topic objects. At some point, I also want to use this to make tracking hierarchical topics easier.

This is also used for tracking the predictions and probabilities that were generated in the Corpus object. I'm not entirely sure yet if this is the best place for that, but I also do not want to combine Corpus with Topics. Have to think about this a bit more.
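
A sketch of how such a collection could work is shown below. Everything here is hypothetical: the pared-down `Topic`, the attribute names, and especially `merge`, which is only one way merging might eventually look and is not something implemented in this PR.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional


@dataclass
class Topic:
    """Pared-down, hypothetical stand-in for a single topic."""

    topic_id: int
    label: Optional[str] = None
    nr_documents: int = 0


@dataclass
class Topics:
    """Hypothetical collection of Topic objects plus per-document predictions."""

    topics: Dict[int, Topic] = field(default_factory=dict)
    predictions: List[int] = field(default_factory=list)      # topic ID per document
    probabilities: List[float] = field(default_factory=list)  # probability per document

    def add(self, topic: Topic) -> None:
        self.topics[topic.topic_id] = topic

    def __getitem__(self, topic_id: int) -> Topic:
        return self.topics[topic_id]

    def __iter__(self) -> Iterator[Topic]:
        return iter(self.topics.values())

    def __len__(self) -> int:
        return len(self.topics)

    def merge(self, keep_id: int, drop_id: int) -> None:
        """Fold topic `drop_id` into `keep_id` and remap the predictions."""
        dropped = self.topics.pop(drop_id)
        self.topics[keep_id].nr_documents += dropped.nr_documents
        self.predictions = [keep_id if t == drop_id else t for t in self.predictions]


topics = Topics(predictions=[0, 1, 0, 2])
for topic_id, size in [(0, 2), (1, 1), (2, 1)]:
    topics.add(Topic(topic_id, nr_documents=size))

topics.merge(keep_id=0, drop_id=2)
```

With the predictions stored next to the topics, a merge can update both in one step, which is one argument for keeping them here rather than in Corpus.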

Conclusion

I'm not sure if I want to continue this route, but so far, this seems like a much easier and more elegant approach to BERTopic. In turn, I hope this will also allow for some cool new features I have been thinking of.

@MaartenGr MaartenGr marked this pull request as draft January 21, 2026 09:55
