Merged

46 commits
ff3fe68
feat: cache huggingface models
rti Feb 1, 2024
38a3bf9
fix: sentence_transformers version
rti Feb 1, 2024
3fb6fd0
chore: remove custom model based on modelfile
rti Feb 1, 2024
a4c7294
fix(frontend): do not filter by score for now TBD
rti Feb 1, 2024
d38c5f0
chore: remove debug/test code
rti Feb 1, 2024
dc4501a
fix: required sentence_transformers version was actually > 2.2.0
rti Feb 1, 2024
42cdcc5
docs: add notes about embedding models to readme
rti Feb 1, 2024
13bc12e
chore: add debug output to api.py
rti Feb 1, 2024
4933a9a
fix: question in prompt
rti Feb 1, 2024
b23833b
chore: top_k 3 results for now
rti Feb 1, 2024
da1017b
wip: embeddings cache
rti Feb 1, 2024
41ff046
feat: document splitter
rti Feb 1, 2024
4e69697
Update .dockerignore
exowanderer Feb 2, 2024
10103c6
Merge branch 'main' into integration
rti Feb 4, 2024
0ee6ed5
docs: note on how to dev locally
rti Feb 4, 2024
7a2c955
docs: add research_log.md
rti Feb 4, 2024
0a5e2be
feat: set top_k via api
rti Feb 5, 2024
332e3dc
feat: support en and de on the api to switch prompts
rti Feb 5, 2024
6225fcc
feat: cache embedding model during docker build
rti Feb 5, 2024
4877807
wip: smaller chunk size, 5 sentences for now
rti Feb 5, 2024
da9859d
chore: remove comment
rti Feb 5, 2024
291aaaf
feat: enable embeddings cache (for development)
rti Feb 9, 2024
936d83e
feat: add document cleaner
rti Feb 9, 2024
1b88437
Merge branch 'main' into integration
rti Feb 9, 2024
3e0b8f4
docs: long docker run options
rti Feb 9, 2024
edf5eb2
fix: access mode
rti Feb 9, 2024
63baf2b
fix: redraw loading animation on subsequent searches
rti Feb 9, 2024
56a7b8c
wip: workaround for runpod.io http port forwarding
rti Feb 9, 2024
8e05473
feat: switch to openchat 7b model
rti Feb 9, 2024
8276e35
Merge branch 'openchat' into integration
rti Feb 9, 2024
22b04d0
added logging via logger with Handler to api.py; PEP8 formatted api.py
exowanderer Feb 9, 2024
10f6b21
debugging use of homepage instead of hard coded endpoint values
exowanderer Feb 9, 2024
bfbd245
returning to previous to restart without errors
exowanderer Feb 9, 2024
7b6ba0a
renewed app.mount; bug fixed PEP8 changes in api.py; reformatted rag.…
exowanderer Feb 9, 2024
0428f87
returned to stablelm2 model for testing purposes. PEP8 upgrades in ap…
exowanderer Feb 9, 2024
8104dde
added OLLAMA_MODEL_NAME and OLLAMA_URL as environment variables; call…
exowanderer Feb 9, 2024
fbc4591
created logger.py to serve get_logger to all modules
exowanderer Feb 9, 2024
caecfd1
created a rag_pipeline in the rag.py based on the usage in api.py; re…
exowanderer Feb 9, 2024
5c0b4d0
Updated with PEP8 formatting in vector_store_interface.py
exowanderer Feb 9, 2024
8833af7
chore(Dockerfile): install python deps early
rti Feb 12, 2024
9ee8a32
fix(sentence-transformers): use cuda if available
rti Feb 12, 2024
b2357e3
fix(frontend): run from webserver root
rti Feb 12, 2024
b518abf
feat: store embedding cache in volume
rti Feb 12, 2024
69800b0
feat(start.sh): pull llm using ollama (if not built into container)
rti Feb 12, 2024
7803649
feat(ollama): use chat api to leverage prompt templates
rti Feb 12, 2024
ff1fcab
docs: fix run cmd
rti Feb 19, 2024
7 changes: 0 additions & 7 deletions Dockerfile
@@ -42,13 +42,6 @@ ARG MODEL=stablelm2:1.6b-zephyr
ENV MODEL=${MODEL}
RUN ollama serve & while ! curl http://localhost:11434; do sleep 1; done; ollama pull $MODEL

# Build a language model
# ARG MODEL=discolm
# ENV MODEL=${MODEL}
# WORKDIR /tmp/model
# COPY --chmod=644 Modelfile Modelfile
# RUN curl --location https://huggingface.co/TheBloke/DiscoLM_German_7b_v1-GGUF/resolve/main/discolm_german_7b_v1.Q5_K_S.gguf?download=true --output discolm_german_7b_v1.Q5_K_S.gguf; ollama serve & while ! curl http://localhost:11434; do sleep 1; done; ollama create ${MODEL} -f Modelfile && rm -rf /tmp/model


# Setup the custom API and frontend
WORKDIR /workspace
2 changes: 0 additions & 2 deletions Modelfile

This file was deleted.

60 changes: 57 additions & 3 deletions README.md
@@ -9,9 +9,14 @@
To build and run the container locally with hot reload on python files do:
```
DOCKER_BUILDKIT=1 docker build . -t gbnc
docker run -v "$(pwd)/gswikichat":/workspace/gswikichat \
-p 8000:8000 --rm --name gbnc -it gbnc \
-e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN
docker run \
-v "$(pwd)/gswikichat":/workspace/gswikichat \
-v "$(pwd)/cache":/root/.cache \
-e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
-p 8000:8000 \
--rm -it \
--name gbnc \
gbnc
```
Point your browser to http://localhost:8000/ and use the frontend.

@@ -44,3 +49,52 @@ A [FastAPI](https://fastapi.tiangolo.com/) server is running in the container. I
### Frontend

A minimal frontend lets the user input a question and renders the response from the system.

## Sentence Transformers Statistics

```
basic_transformer_models = [
"all-MiniLM-L6-v2",
"xlm-clm-ende-1024",
"xlm-mlm-ende-1024",
"bert-base-german-cased",
"bert-base-german-dbmdz-cased",
"bert-base-german-dbmdz-uncased",
"distilbert-base-german-cased",
"xlm-roberta-large-finetuned-conll03-german",
"deutsche-telekom/gbert-large-paraphrase-cosine"
]

https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
sentence_transformer_model = "all-MiniLM-L6-v2"
3 minutes to batch 82

https://huggingface.co/deutsche-telekom/gbert-large-paraphrase-cosine
sentence_transformer_model = 'deutsche-telekom/gbert-large-paraphrase-cosine'
76 minutes to batch 82

https://huggingface.co/jinaai/jina-embeddings-v2-base-de
sentence_transformer_model = 'jinaai/jina-embeddings-v2-base-de'
Cannot find or load the embedding model
Unknown minutes to batch 82

https://huggingface.co/aari1995/German_Semantic_STS_V2
sentence_transformer_model = 'aari1995/German_Semantic_STS_V2'
75 minutes to batch 82

https://huggingface.co/Sahajtomar/German-semantic
sentence_transformer_model = 'Sahajtomar/German-semantic'
72 minutes to batch 82

https://huggingface.co/svalabs/german-gpl-adapted-covid
sentence_transformer_model = 'svalabs/german-gpl-adapted-covid'
2 minutes to batch 82

https://huggingface.co/PM-AI/bi-encoder_msmarco_bert-base_german
sentence_transformer_model = 'PM-AI/bi-encoder_msmarco_bert-base_german'
14 minutes to batch 82

https://huggingface.co/JoBeer/german-semantic-base
sentence_transformer_model = 'JoBeer/german-semantic-base'
22 minutes to batch 82
```
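
For reference, here is a minimal sketch of how one of the timings above could be reproduced with the Haystack embedder used in this repo. This is a sketch under assumptions, not the script that produced the table: the model name and the 82-document batch mirror the table above, and absolute timings depend on hardware.

```
# Sketch: time one embedding run for a candidate model.
# Model name and document count mirror the table above; results vary by hardware.
import time

from haystack import Document
from haystack.components.embedders import SentenceTransformersDocumentEmbedder

docs = [Document(content=f"Beispieltext Nummer {i}") for i in range(82)]

embedder = SentenceTransformersDocumentEmbedder(model="svalabs/german-gpl-adapted-covid")
embedder.warm_up()

start = time.perf_counter()
result = embedder.run(docs)
minutes = (time.perf_counter() - start) / 60
print(f"Embedded {len(result['documents'])} documents in {minutes:.1f} minutes")
```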
Empty file added cache/.keep
Empty file.
2 changes: 1 addition & 1 deletion frontend/src/components/field/FieldAnswer.vue
@@ -12,7 +12,7 @@
<div v-else>
<div v-if="response && response.sources">
<div v-for="s in response.sources" :key="s.id">
<div v-if="s.score > 2" class="mb-2">
<div v-if="s.score > 0" class="mb-2">
<details
class="text-sm cursor-pointer text-light-distinct-text dark:text-dark-distinct-text"
>
1 change: 0 additions & 1 deletion gswikichat/__init__.py
@@ -1,2 +1 @@
from .api import *
# from .haystack2beta_tutorial_InMemoryEmbeddingRetriever import *
48 changes: 23 additions & 25 deletions gswikichat/api.py
@@ -2,7 +2,6 @@
from fastapi.staticfiles import StaticFiles
from fastapi import FastAPI

# from .rag import rag_pipeline
from .rag import embedder, retriever, prompt_builder, llm, answer_builder
from haystack import Document

@@ -22,50 +21,49 @@ async def root():

@app.get("/api")
async def api(q):
print("query: ", q)

embedder, retriever, prompt_builder, llm, answer_builder

# query = "How many languages are there?"
query = Document(content=q)

result = embedder.run([query])
queryEmbedded = embedder.run([query])
queryEmbedding = queryEmbedded['documents'][0].embedding

results = retriever.run(
query_embedding=list(result['documents'][0].embedding),
retrieverResults = retriever.run(
query_embedding=list(queryEmbedding),
filters=None,
top_k=None,
top_k=3,
scale_score=None,
return_embedding=None
)
# .run(
# result['documents'][0].embedding
# )

prompt = prompt_builder.run(documents=results['documents'])['prompt']
print("retriever results:")
Review comment (Collaborator): If we implement the logging as suggested above, we should include this as a debug statement:

logging.debug('retriever results:')

Reply (exowanderer, Feb 9, 2024): Updated in #24 by adding the get_logger function in the logger.py file. If you confirm, then we can close this review comment.

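For context, a rough sketch of what the get_logger helper in logger.py might look like — this is an illustration under assumptions, not the implementation that landed in #24:

# Hypothetical sketch of logger.py's get_logger; the real helper is in PR #24
# and may differ.
import logging

def get_logger(name: str, level: int = logging.DEBUG) -> logging.Logger:
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid attaching duplicate handlers on re-import
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter('%(asctime)s %(name)s %(levelname)s: %(message)s')
        )
        logger.addHandler(handler)
    logger.setLevel(level)
    return logger

With such a helper, the print calls in this file become logger = get_logger(__name__) once at import, then logger.debug(...) at each call site.
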
for retrieverResult in retrieverResults:
print(retrieverResult)
Review comment (Collaborator): If we use the suggestion above to include a logger, then we should replace this print statement with a debug call:

logging.debug(retriever_result_)

Note that this suggestion includes the trailing underscore that I prefer, but is non-standard.

Reply (exowanderer, Feb 9, 2024): Updated in #24 by adding the get_logger function in the logger.py file. If you confirm, then we can close this review comment.

response = llm.run(prompt=prompt, generation_kwargs=None)
# reply = response['replies'][0]
promptBuild = prompt_builder.run(question=q, documents=retrieverResults['documents'])
prompt = promptBuild['prompt']
Review comment (Collaborator): Following the PEP8 standards, it is highly recommended to use snake_case here:

prompt = prompt_build['prompt']

Reply (Collaborator): Updated in #24 by renaming promptBuild to prompt_build. If you confirm, then we can close this review comment.

print("prompt: ", prompt)

# rag_pipeline.connect("llm.replies", "answer_builder.replies")
# rag_pipeline.connect("llm.metadata", "answer_builder.meta")
# rag_pipeline.connect("retriever", "answer_builder.documents")
response = llm.run(prompt=prompt, generation_kwargs=None)

results = answer_builder.run(
answerBuild = answer_builder.run(
query=q,
replies=response['replies'],
meta=response['meta'],
documents=results['documents'],
documents=retrieverResults['documents'],
pattern=None,
reference_pattern=None
)
print("answerBuild", answerBuild)
Review comment (Collaborator): To follow the above suggestions of adding logging and using snake_case, we should change this line to

logging.debug(f'{answer_build=}')

Reply (Collaborator): Updated in #24 by renaming answerBuild to answer_build. If you confirm, then we can close this review comment.

answer = answerBuild['answers'][0]

sources = [{"src": d.meta['src'], "content": d.content, "score": d.score} for d in answer.documents]

answer = results['answers'][0]
print("answer", answer)
Review comment (Collaborator): If we implement the logging suggestion above, we should change this line to

logging.debug(f'{answer=}')

Reply (Collaborator): Updated in #24 by adding the get_logger function in the logger.py file. If you confirm, then we can close this review comment.

return {
"answer": answer.data,
"sources": [{
"src": d.meta['src'],
"content": d.content,
"score": d.score
} for d in answer.documents]
"sources": sources
}
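
Putting the endpoint together: it accepts a query string q, embeds it, retrieves the top documents, builds a prompt, runs the LLM, and returns the answer with scored sources. Below is a minimal sketch of calling it from Python — the host and port follow the README's docker run example, and the response fields match the return dict above:

# Sketch: query the running API from Python; assumes the container is up
# on localhost:8000 as described in the README.
import requests

resp = requests.get('http://localhost:8000/api', params={'q': 'Wie viele Sprachen gibt es?'})
resp.raise_for_status()
data = resp.json()

print('answer:', data['answer'])
for source in data['sources']:
    # each source carries its src URL, matched content, and retrieval score
    print(source['src'], source['score'])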
130 changes: 37 additions & 93 deletions gswikichat/vector_store_interface.py
@@ -1,24 +1,25 @@
import os
import json

# from sentence_transformers import SentenceTransformer
from tqdm import tqdm

from haystack import Document # , Pipeline
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
# from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.document_stores.in_memory import InMemoryDocumentStore
# from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
# from haystack.components.writers import DocumentWriter
from haystack.document_stores.types.policy import DuplicatePolicy
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.preprocessors import DocumentCleaner

HUGGING_FACE_HUB_TOKEN = os.environ.get('HUGGING_FACE_HUB_TOKEN')
EMBEDDING_CACHE_FILE = '/tmp/gbnc_embeddings.json'

top_k = 5
input_documents = []

json_dir = 'json_input'
json_fname = 'excellent-articles_10_paragraphs.json'
json_fname = 'excellent-articles_10.json'
Review comment (Collaborator): Have we tried our existing package with the full 2000 articles?

json_fpath = os.path.join(json_dir, json_fname)

if os.path.isfile(json_fpath):
@@ -30,11 +31,11 @@
for k, v in tqdm(json_obj.items()):
Review comment (Collaborator): To avoid single letter variable names, we should use the more standard key, val decomposition of the items() method:

for key, val in tqdm(json_obj.items()):

Note that I am not using my preferred trailing underscore here because the names key and val are standard as they are above: for key, val in struct.items():

Reply (exowanderer, Feb 9, 2024): Updated in #24 by renaming "k, v" into "url_, content_", which is symmetric with the other input_document list comprehensions. If you confirm, then we can close this review comment.

print(f"Loading {k}")
Review comment (Collaborator): If we include the logging suggestion above, then we need to add and configure the logging here:

import logging

logging.basicConfig(
    filename='gbnc.log',
    encoding='utf-8',
    level=logging.DEBUG
)

Then, we would update line 34 to

    logging.info(f'Loading {k}')

Reply (Collaborator): Updated in #24 by adding the get_logger function in the logger.py file. If you confirm, then we can close this review comment.

Reply (Collaborator): Actually I removed this print debug line in favor of the list comprehension. If we decide to keep the logger.debug, then we can make a function that outputs the same per-item value while still logging the debug/info we want.

input_documents.append(Document(content=v, meta={"src": k}))
Review comment (exowanderer, Feb 9, 2024): With function calls that include multiple kwargs, I prefer to use multi-line inputs to improve readability and maintainability.

Suggestion:

for key, val in tqdm(json_obj.items()):
    input_documents.append(
        Document(
            content=val,
            meta={"src": key}
        )
    )

Note that I included the above suggestion to change the variable names from k, v to key, val.

Final note: if we skip the logging.info step, then the above json input can be converted to a more streamlined list comprehension:

input_documents = [
    Document(
        content=val,
        meta={'src': key}
    )
    for key, val in tqdm(json_obj.items())
]

Reply (Collaborator): Updated in #24 by renaming "k, v" into "url_, content_", which is symmetric with the other input_document list comprehensions. If you confirm, then we can close this review comment.


elif isinstance(json_obj, list):
for obj_ in tqdm(json_obj):
Review comment (Collaborator): We could also update this line to use the .items() method inside a list comprehension as

input_documents = [
    Document(
        content=content_,
        meta={'src': url_}
    )
    for url_, content_ in tqdm(json_obj.items())
]

The above suggestion includes the trailing underscores that I prefer, but are not standard.

Reply (Collaborator): Updated in #24 by refactoring into a list comprehension, exactly as seen in the suggestion. If you confirm, then we can close this review comment.

url = obj_['meta']
content = obj_['content']

input_documents.append(
Document(
content=content,
@@ -57,112 +58,55 @@
),
]

# Write documents to InMemoryDocumentStore
# cleaner = DocumentCleaner(
# remove_empty_lines=True,
# remove_extra_whitespaces=True,
# remove_repeated_substrings=False)
# input_documents = cleaner.run(input_documents)['documents']

splitter = DocumentSplitter(split_by="sentence", split_length=20, split_overlap=0)
input_documents = splitter.run(input_documents)['documents']

document_store = InMemoryDocumentStore(
embedding_similarity_function="cosine",
# embedding_dim=768,
# duplicate_documents="overwrite"
)
# document_store.write_documents(input_documents)

# TODO Introduce Jina.AI from HuggingFace. Establish env-variable for trust_...

# basic_transformer_models = [
# "all-MiniLM-L6-v2",
# "xlm-clm-ende-1024",
# "xlm-mlm-ende-1024",
# "bert-base-german-cased",
# "bert-base-german-dbmdz-cased",
# "bert-base-german-dbmdz-uncased",
# "distilbert-base-german-cased",
# "xlm-roberta-large-finetuned-conll03-german",
# "deutsche-telekom/gbert-large-paraphrase-cosine"
# ]

# https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
# sentence_transformer_model = "all-MiniLM-L6-v2"
# 3 minutes to batch 82

# https://huggingface.co/deutsche-telekom/gbert-large-paraphrase-cosine
# sentence_transformer_model = 'deutsche-telekom/gbert-large-paraphrase-cosine'
# 76 minutes to batch 82

# https://huggingface.co/jinaai/jina-embeddings-v2-base-de
# sentence_transformer_model = 'jinaai/jina-embeddings-v2-base-de'
# Cannot find or load the embedding model
# Unknown minutes to batch 82

# https://huggingface.co/aari1995/German_Semantic_STS_V2
# sentence_transformer_model = 'aari1995/German_Semantic_STS_V2'
# 75 minutes to batch 82

# https://huggingface.co/Sahajtomar/German-semantic
# sentence_transformer_model = 'Sahajtomar/German-semantic'
# 72 minutes to batch 82

# https://huggingface.co/svalabs/german-gpl-adapted-covid
sentence_transformer_model = 'svalabs/german-gpl-adapted-covid'
# 2 minutes to batch 82

# https://huggingface.co/PM-AI/bi-encoder_msmarco_bert-base_german
# sentence_transformer_model = 'PM-AI/bi-encoder_msmarco_bert-base_german'
# 14 minutes to batch 82

# https://huggingface.co/JoBeer/german-semantic-base
# sentence_transformer_model = 'JoBeer/german-semantic-base'
# 22 minutes to batch 82

print(f'Sentence Transformer Name:{sentence_transformer_model}')
print(f'Sentence Transformer Name: {sentence_transformer_model}')

embedder = SentenceTransformersDocumentEmbedder(
model=sentence_transformer_model,
# model="T-Systems-onsite/german-roberta-sentence-transformer-v2",
# model="jinaai/jina-embeddings-v2-base-de",
# token=HUGGING_FACE_HUB_TOKEN
)

# hg_embedder = SentenceTransformer(
# "jinaai/jina-embeddings-v2-base-de",
# token=HUGGING_FACE_HUB_TOKEN
# )

embedder.warm_up()

documents_with_embeddings = embedder.run(input_documents)
# documents_with_embeddings = embedder.encode(input_documents)


# print('\n\n')
# # print(documents_with_embeddings['documents'])
# print(type(documents_with_embeddings['documents']))
# print(len(documents_with_embeddings['documents']))
# print(dir(documents_with_embeddings['documents'][0]))
# print('\n\n')
# print(type(embedder.model))
# print('\n\n')
# # print(dir(hg_embedder))


document_store.write_documents(
documents=documents_with_embeddings['documents'],
policy=DuplicatePolicy.OVERWRITE
)
# if os.path.isfile(EMBEDDING_CACHE_FILE):
# print("[INFO] Loading embeddings from cache")
#
# with open(EMBEDDING_CACHE_FILE, 'r') as f:
# documentsDict = json.load(f)
# document_store.write_documents(
# documents=[Document.from_dict(d) for d in documentsDict],
# policy=DuplicatePolicy.OVERWRITE
# )
#
# else:
if True:
embedded = embedder.run(input_documents)
document_store.write_documents(
documents=embedded['documents'],
policy=DuplicatePolicy.OVERWRITE
)

with open(EMBEDDING_CACHE_FILE, 'w') as f:
documentsDict = [Document.to_dict(d) for d in embedded['documents']]
json.dump(documentsDict, f)

retriever = InMemoryEmbeddingRetriever(
# embedding_model="sentence-transformers/all-MiniLM-L6-v2",
document_store=document_store,
top_k=top_k
)

# writer = DocumentWriter(document_store=document_store)

# indexing_pipeline = Pipeline()
# indexing_pipeline.add_component("embedder", embedder)
# indexing_pipeline.add_component("writer", writer)
# indexing_pipeline.connect("embedder", "writer")
# indexing_pipeline.run(
# {
# "embedder": {"documents": input_documents}
# }
# )
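
One note on the cache logic above: the commented-out branch would load embeddings from EMBEDDING_CACHE_FILE, but the committed `if True:` workaround always recomputes and only writes the cache. A sketch of the intended load-or-compute flow, reconstructed from the commented code (it reuses the module's EMBEDDING_CACHE_FILE, document_store, embedder, and input_documents defined above):

# Sketch of the intended load-or-compute cache flow; the committed code
# always takes the else-branch because of the `if True:` workaround.
if os.path.isfile(EMBEDDING_CACHE_FILE):
    print('[INFO] Loading embeddings from cache')
    with open(EMBEDDING_CACHE_FILE, 'r') as f:
        documents_dict = json.load(f)
    document_store.write_documents(
        documents=[Document.from_dict(d) for d in documents_dict],
        policy=DuplicatePolicy.OVERWRITE
    )
else:
    embedded = embedder.run(input_documents)
    document_store.write_documents(
        documents=embedded['documents'],
        policy=DuplicatePolicy.OVERWRITE
    )
    with open(EMBEDDING_CACHE_FILE, 'w') as f:
        json.dump([Document.to_dict(d) for d in embedded['documents']], f)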
2 changes: 1 addition & 1 deletion requirements.txt
@@ -32,7 +32,7 @@ python-dotenv==1.0.1
pytz==2023.3.post1
PyYAML==6.0.1
requests==2.31.0
sentence-transformers>=2.2.0
sentence-transformers==2.3.1
six==1.16.0
sniffio==1.3.0
starlette==0.35.1