Milestones

  • We aim to build a helpful chat solution for users of the Bioconductor ecosystem using current Large Language Model (LLM) technologies. To help with questions about the Bioconductor ecosystem, the LLM requires good knowledge of Bioconductor packages, structure, and coding etiquette, which cannot be assumed to be present in any LLM. To address this, we aim to implement a custom solution with concrete knowledge from the Bioconductor ecosystem, including a Retrieval-Augmented Generation (RAG) process that can inject context into the LLM prompt for answering specific questions. To achieve this, we will use and extend the [BioChatter](https://github.com/biocypher/biochatter) library, an open-source library for the biomedical application of LLMs. The desired outcome is a chatbot instance that can inform Bioconductor users about the ecosystem (which packages to use for which purpose, where to get more information), the usage of specific libraries (including their vignettes and idiomatic programming style), and the troubleshooting of common problems (by integrating the Bioconductor support forum). To manage the complex, multi-layered knowledge required for this assistant, we will use the [BioCypher](https://github.com/biocypher/biocypher) library, which is designed to facilitate knowledge management and interacts natively with BioChatter. Building specific knowledge graphs for the layers of information in the Bioconductor ecosystem (from meta-information about the ecosystem down to the occurrence of errors in a specific package) will allow the RAG mechanism to give context-specific answers to users’ questions.

    No due date
    0/7 issues closed
  • Issues related to the podcasting feature. Low priority, since Google has been addressing this specifically.

    No due date
    0/3 issues closed
  • Includes all issues concerned with streamlining the user experience.

    No due date
    5/14 issues closed
  • The BioChatter benchmark should be continuously expanded and made more robust, both to keep up with developments in the field and to allow BioChatter to be used in more contexts.

    No due date
    15/25 issues closed
  • Issues to be addressed in the de.NBI biohackathon2

    No due date
    3/6 issues closed
  • LLMs can assist in curating large amounts of data, which until now has been a task reserved for human domain experts. Curation efforts can include:
      - cell type annotation based on free text, marker genes, foundation model embeddings, or other features
      - curation of interactions (e.g. metabolites) based on a combination of database and literature information

    No due date
    0/3 issues closed
  • Vector database handling is currently rudimentary: each run of the class creates a new vector DB instance. There is no way to manage the vector DB from BioChatter, and the only vector DB vendor available is Milvus, which must be available locally through the official Docker container. We should add a more sophisticated vector DB handling system that lets the user connect to an existing instance, locally or in the cloud, and manage it, i.e., report on its contents, wipe the database, access collections, etc. We should also be able to add metadata to DB entries, for example to record which document or other source the embedded information comes from. We can also add further DB vendors, such as Pinecone.

    No due date
    3/4 issues closed
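
The management operations requested in the vector DB milestone (report on contents, access collections, wipe the database, attach source metadata to entries) can be illustrated with a minimal in-memory sketch. This is not BioChatter's API and does not talk to Milvus or Pinecone: the class `InMemoryVectorStore`, the `Entry` type, and all method names are hypothetical, chosen only to show what such an interface might look like.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Entry:
    """One embedded text chunk plus metadata describing where it came from."""
    vector: List[float]
    text: str
    metadata: Dict[str, str] = field(default_factory=dict)


class InMemoryVectorStore:
    """Hypothetical stand-in for a managed vector DB connection."""

    def __init__(self) -> None:
        # Maps collection name -> entries stored in that collection.
        self._collections: Dict[str, List[Entry]] = {}

    def add(self, collection: str, vector: List[float], text: str,
            **metadata: str) -> None:
        """Store an embedding together with arbitrary metadata (e.g. source)."""
        entry = Entry(list(vector), text, dict(metadata))
        self._collections.setdefault(collection, []).append(entry)

    def list_collections(self) -> List[str]:
        """Return the names of all collections, sorted alphabetically."""
        return sorted(self._collections)

    def describe(self) -> Dict[str, int]:
        """Report the contents: number of entries per collection."""
        return {name: len(entries)
                for name, entries in self._collections.items()}

    def sources(self, collection: str) -> Set[str]:
        """Return the distinct 'source' metadata values in a collection."""
        return {entry.metadata.get("source", "unknown")
                for entry in self._collections.get(collection, [])}

    def wipe(self, collection: Optional[str] = None) -> None:
        """Delete one collection, or everything if no name is given."""
        if collection is None:
            self._collections.clear()
        else:
            self._collections.pop(collection, None)


# Example usage with made-up vectors and source labels.
store = InMemoryVectorStore()
store.add("vignettes", [0.1, 0.2], "Loading data", source="package vignette")
store.add("forum", [0.3, 0.4], "Install error", source="support forum post")
print(store.describe())
```

A real implementation would keep the same kind of interface but delegate each method to the chosen vendor's client, so that adding a vendor means writing one more backend class rather than changing calling code.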