Commit bda0565

Merge pull request #367 from weaviate/2.1.2 (v2.1.2)
2 parents: d9e0582 + 20329c9

9 files changed: +278 −86 lines

CHANGELOG.md (11 additions, 0 deletions)

```diff
@@ -2,6 +2,17 @@
 
 All notable changes to this project will be documented in this file.
 
+## [2.1.2] Adding Novita!
+
+## Added
+
+- Added Novita Generator (https://www.novita.ai/)
+- Added basic tests for Document class
+
+## Fixed
+
+- spaCy Language Issues (https://github.com/weaviate/Verba/issues/359#issuecomment-2612233766) (https://github.com/weaviate/Verba/issues/352)
+
 ## [2.1.1] More Bugs!
 
 ## Added
```

CONTRIBUTING.md (11 additions, 8 deletions)

````diff
@@ -8,7 +8,7 @@ Open source is at the heart of Verba. We appreciate feedback, ideas, and enhance
 
 ## 📚 Before You Begin
 
-Before contributing, please take a moment to read through the [README](https://github.com/weaviate/Verba/README.md) and the [Technical Documentation](https://github.com/weaviate/Verba/TECHNICAL.md). These documents provide a comprehensive understanding of the project and are essential reading to ensure that we're all on the same page.
+Before contributing, please take a moment to read through the [README](https://github.com/weaviate/Verba/README.md) and the [Technical Documentation](https://github.com/weaviate/Verba/TECHNICAL.md). These documents provide a comprehensive understanding of the project and are essential reading to ensure that we're all on the same page. Please note that the technical documentation is a work in progress and will be updated as we progress.
 
 ## 🐛 Reporting Issues
 
@@ -22,6 +22,16 @@ If you've identified a bug or have an idea for an enhancement, please begin by c
 
 We welcome all ideas and feedback. If you're not ready to open an Issue or if you're just looking for a place to discuss ideas, head over to our [GitHub Discussions](https://github.com/weaviate/Verba/discussions) or the [Weaviate Support Page](https://forum.weaviate.io/).
 
+## 🧪 Testing
+
+We use [pytest](https://docs.pytest.org) for testing. Please note that the tests are WIP and some are missing. We still encourage you to run the tests and add more tests as you see fit.
+
+To run the tests, use the following command:
+
+```bash
+pytest goldenverba/tests
+```
+
 ## 📝 Pull Requests
 
 If you're ready to contribute code or documentation, please submit a Pull Request (PR) to the dev branch. Here's the process:
@@ -34,13 +44,6 @@ If you're ready to contribute code or documentation, please submit a Pull Reques
 
 - Include a clear description of your changes in the PR.
 - Link to the Issue in your PR description.
 
-### 🧪 Tests and Formatting
-
-To maintain the quality of the codebase, we ask that all contributors:
-
-- Run unit tests to ensure that nothing is broken.
-- Use [Black](https://github.com/psf/black) to format your code before submitting.
-
 ### 🔄 Pull Request Process
 
 - PRs are reviewed on a regular basis.
````
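The Testing section added above runs pytest against `goldenverba/tests`. As an illustration of the shape pytest expects — file and function names here are hypothetical, not the tests added in this commit:

```python
# goldenverba/tests/test_sanity.py — hypothetical example; pytest collects
# files named test_*.py and runs every function whose name starts with test_.
def test_string_sanity():
    # Stand-in assertion; real tests would exercise the Document class.
    assert "Verba".lower() == "verba"
```

Running `pytest goldenverba/tests` from the repository root picks up any such file automatically.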

README.md (8 additions, 0 deletions)

```diff
@@ -25,6 +25,7 @@ pip install goldenverba
 - [OpenAI](#openai)
 - [HuggingFace](#huggingface)
 - [Groq](#groq)
+- [Novita AI](#novitaai)
 - [Quickstart: Deploy with pip](#how-to-deploy-with-pip)
 - [Quickstart: Build from Source](#how-to-build-from-source)
 - [Quickstart: Deploy with Docker](#how-to-install-verba-with-docker)
@@ -55,6 +56,7 @@ Verba is a fully-customizable personal assistant utilizing [Retrieval Augmented
 | Anthrophic (e.g. Claude Sonnet) | ✅ | Embedding and Generation Models by Anthrophic |
 | OpenAI (e.g. GPT4) | ✅ | Embedding and Generation Models by OpenAI |
 | Groq (e.g. Llama3) | ✅ | Generation Models by Groq (LPU inference) |
+| Novita AI (e.g. Llama3.3) | ✅ | Generation Models by Novita AI |
 | Upstage (e.g. Solar) | ✅ | Embedding and Generation Models by Upstage |
 
 | 🤖 Embedding Support | Implemented | Description |
@@ -168,6 +170,7 @@ Below is a comprehensive list of the API keys and variables you may require:
 | OPENAI_BASE_URL | URL to OpenAI instance | Models |
 | COHERE_API_KEY | Your API Key | Get Access to [Cohere](https://cohere.com/) Models |
 | GROQ_API_KEY | Your Groq API Key | Get Access to [Groq](https://groq.com/) Models |
+| NOVITA_API_KEY | Your Novita API Key | Get Access to [Novita AI](https://novita.ai?utm_source=github_verba&utm_medium=github_readme&utm_campaign=github_link) Models |
 | OLLAMA_URL | URL to your Ollama instance (e.g. http://localhost:11434 ) | Get Access to [Ollama](https://ollama.com/) Models |
 | UNSTRUCTURED_API_KEY | Your API Key | Get Access to [Unstructured](https://docs.unstructured.io/welcome) Data Ingestion |
 | UNSTRUCTURED_API_URL | URL to Unstructured Instance | Get Access to [Unstructured](https://docs.unstructured.io/welcome) Data Ingestion |
@@ -264,6 +267,11 @@ To use Groq LPUs as generation engine, you need to get an API key from [Groq](ht
 > Although you can provide it in the graphical interface when Verba is up, it is recommended to specify it as `GROQ_API_KEY` environment variable before you launch the application.
 > It will allow you to choose the generation model in an up-to-date available models list.
 
+## Novita
+
+To use Novita AI as generation engine, you need to get an API key from [Novita AI](https://novita.ai/settings/key-management?utm_source=github_verba&utm_medium=github_readme&utm_campaign=github_link).
+
 # How to deploy with pip
 
 `Python >=3.10.0`
```

goldenverba/.env.example (1 addition, 0 deletions)

```diff
@@ -25,3 +25,4 @@
 
 # UPSTAGE_API_KEY=
 
+# NOVITA_API_KEY=
```
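To use the new entry, uncomment it and supply a key — the value below is a placeholder, not a real key:

```shell
# goldenverba/.env — placeholder value; get a real key from the Novita dashboard
NOVITA_API_KEY=your-novita-api-key
```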

goldenverba/components/document.py (6 additions, 77 deletions)

```diff
@@ -7,15 +7,6 @@
 
 from langdetect import detect
 
-SUPPORTED_LANGUAGES = {
-    "en": "English",
-    "zh": "Simplified Chinese",
-    "zh-hant": "Traditional Chinese",
-    "fr": "French",
-    "de": "German",
-    "nl": "Dutch",
-}
-
 
 def load_nlp_for_language(language: str):
     """Load SpaCy models based on language"""
@@ -32,13 +23,10 @@ def load_nlp_for_language(language: str):
     elif language == "nl":
         nlp = spacy.blank("nl")
     else:
-        raise ValueError(f"Unsupported language: {language}")
+        nlp = spacy.blank("en")
+
+    nlp.add_pipe("sentencizer")
 
-    # Add sentence segmentation to languages
-    if language == "en":
-        nlp.add_pipe("sentencizer", config={"punct_chars": None})
-    else:
-        nlp.add_pipe("sentencizer")
     return nlp
 
 
@@ -55,57 +43,6 @@ def detect_language(text: str) -> str:
 
     return "unknown"
 
 
-def split_text_by_language(text: str):
-    """Separate text into language parts based on character ranges"""
-    chinese_simplified = "".join(
-        [char for char in text if "\u4e00" <= char <= "\u9fff"]
-    )
-    chinese_traditional = "".join(
-        [
-            char
-            for char in text
-            if "\u3400" <= char <= "\u4dbf" or "\u4e00" <= char <= "\u9fff"
-        ]
-    )
-    english_part = "".join([char for char in text if char.isascii()])
-    other_text = "".join(
-        [char for char in text if not (char.isascii() or "\u4e00" <= char <= "\u9fff")]
-    )
-
-    return chinese_simplified, chinese_traditional, english_part, other_text
-
-
-def process_mixed_language(content: str):
-    """Process mixed language text"""
-    chinese_simplified, chinese_traditional, english_text, other_text = (
-        split_text_by_language(content)
-    )
-
-    docs = []
-
-    if chinese_simplified:
-        nlp_zh = load_nlp_for_language("zh")
-        docs.append(nlp_zh(chinese_simplified))
-
-    if chinese_traditional:
-        nlp_zh_hant = load_nlp_for_language("zh-hant")
-        docs.append(nlp_zh_hant(chinese_traditional))
-
-    if english_text:
-        nlp_en = load_nlp_for_language("en")
-        docs.append(nlp_en(english_text))
-
-    if other_text:
-        detected_lang = detect_language(other_text)
-        if detected_lang in SUPPORTED_LANGUAGES:
-            nlp_other = load_nlp_for_language(detected_lang)
-            docs.append(nlp_other(other_text))
-
-    # Merge all processed documents
-    doc = Doc.from_docs(docs)
-    return doc
-
-
 class Document:
     def __init__(
         self,
@@ -132,13 +69,9 @@ def __init__(
 
         if len(content) > MAX_BATCH_SIZE:
             # Process content in batches
-            print("TOOO BIG!")
             docs = []
             detected_language = detect_language(content[0:MAX_BATCH_SIZE])
-            if detected_language in SUPPORTED_LANGUAGES:
-                nlp = load_nlp_for_language(detected_language)
-            else:
-                nlp = process_mixed_language
+            nlp = load_nlp_for_language(detected_language)
 
             for i in range(0, len(content), MAX_BATCH_SIZE):
                 docs.append(nlp(content[i : i + MAX_BATCH_SIZE]))
@@ -148,12 +81,8 @@ def __init__(
         else:
             # Process smaller content, directly based on language
             detected_language = detect_language(content)
-            if detected_language in SUPPORTED_LANGUAGES:
-                nlp = load_nlp_for_language(detected_language)
-                doc = nlp(content)
-            else:
-                # Process mixed language content
-                doc = process_mixed_language(content)
+            nlp = load_nlp_for_language(detected_language)
+            doc = nlp(content)
 
         self.spacy_doc = doc
```
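The net effect of this change is that unsupported language codes no longer raise a `ValueError` — they fall back to a blank English pipeline. A standalone sketch of that selection rule, with the spaCy calls replaced by a plain mapping so it runs without any models installed:

```python
# Mirrors the fallback now in load_nlp_for_language: any code outside the
# handled set gets the blank English pipeline instead of an exception.
HANDLED = {"en", "zh", "zh-hant", "fr", "de", "nl"}

def pipeline_language(language: str) -> str:
    return language if language in HANDLED else "en"

pipeline_language("de")  # -> "de"
pipeline_language("ja")  # -> "en" (fallback, no exception)
```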

goldenverba/components/generation/NovitaGenerator.py (new file, 139 additions)

```python
import os
from dotenv import load_dotenv
import json
import aiohttp
import requests

from goldenverba.components.interfaces import Generator
from goldenverba.components.types import InputConfig
from goldenverba.components.util import get_environment, get_token

load_dotenv()

base_url = "https://api.novita.ai/v3/openai"


class NovitaGenerator(Generator):
    """
    Novita Generator.
    """

    def __init__(self):
        super().__init__()
        self.name = "Novita AI"
        self.description = "Using Novita AI LLM models to generate answers to queries"
        self.context_window = 8192

        models = get_models()

        self.config["Model"] = InputConfig(
            type="dropdown",
            value=models[0],
            description="Select a Novita Model",
            values=models,
        )

        if get_token("NOVITA_API_KEY") is None:
            self.config["API Key"] = InputConfig(
                type="password",
                value="",
                description="You can set your Novita API Key here or set it as environment variable `NOVITA_API_KEY`",
                values=[],
            )

    async def generate_stream(
        self,
        config: dict,
        query: str,
        context: str,
        conversation: list[dict] = [],
    ):
        system_message = config.get("System Message").value
        model = config.get("Model", {"value": "deepseek/deepseek_v3"}).value
        novita_key = get_environment(
            config, "API Key", "NOVITA_API_KEY", "No Novita API Key found"
        )
        novita_url = base_url

        messages = self.prepare_messages(query, context, conversation, system_message)

        headers = {
            "Content-Type": "application/json",
            "Authorization": f"Bearer {novita_key}",
        }
        data = {
            "messages": messages,
            "model": model,
            "stream": True,
        }

        async with aiohttp.ClientSession() as client:
            async with client.post(
                url=f"{novita_url}/chat/completions",
                json=data,
                headers=headers,
                timeout=None,
            ) as response:
                if response.status == 200:
                    async for line in response.content:
                        if line.strip():
                            line = line.decode("utf-8").strip()
                            if line == "data: [DONE]":
                                yield {"message": "", "finish_reason": "stop"}
                            elif line.startswith("data:"):
                                line = line[5:].strip()
                                json_line = json.loads(line)
                                choice = json_line.get("choices")[0]
                                yield {
                                    "message": choice.get("delta", {}).get(
                                        "content", ""
                                    ),
                                    "finish_reason": (
                                        "stop"
                                        if choice.get("finish_reason", "") == "stop"
                                        else ""
                                    ),
                                }
                else:
                    error_message = await response.text()
                    yield {
                        "message": f"HTTP Error {response.status}: {error_message}",
                        "finish_reason": "stop",
                    }

    def prepare_messages(
        self, query: str, context: str, conversation: list[dict], system_message: str
    ) -> list[dict]:
        messages = [
            {
                "role": "system",
                "content": system_message,
            }
        ]

        for message in conversation:
            messages.append({"role": message.type, "content": message.content})

        messages.append(
            {
                "role": "user",
                "content": f"Answer this query: '{query}' with this provided context: {context}",
            }
        )

        return messages


def get_models():
    try:
        response = requests.get(base_url + "/models")
        models = [model.get("id") for model in response.json().get("data")]
        if len(models) > 0:
            return models
        else:
            return ["No Novita AI Model detected"]
    except Exception:
        return ["Couldn't connect to Novita AI"]
```
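The streaming branch above walks the response body line by line; a self-contained sketch of that per-line Server-Sent-Events handling, with the network layer omitted and the dict shapes mirrored from `generate_stream`:

```python
import json

def parse_sse_line(raw: bytes):
    """Map one SSE line from the chat/completions stream to Verba's chunk
    dict, mirroring the handling in NovitaGenerator.generate_stream.
    Returns None for lines that carry no payload."""
    text = raw.decode("utf-8").strip()
    if not text:
        return None
    if text == "data: [DONE]":
        # End-of-stream sentinel.
        return {"message": "", "finish_reason": "stop"}
    if text.startswith("data:"):
        payload = json.loads(text[5:].strip())
        choice = payload["choices"][0]
        return {
            "message": choice.get("delta", {}).get("content", ""),
            "finish_reason": "stop" if choice.get("finish_reason") == "stop" else "",
        }
    return None
```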

goldenverba/components/managers.py (2 additions, 0 deletions)

```diff
@@ -70,6 +70,7 @@
 from goldenverba.components.generation.OllamaGenerator import OllamaGenerator
 from goldenverba.components.generation.OpenAIGenerator import OpenAIGenerator
 from goldenverba.components.generation.GroqGenerator import GroqGenerator
+from goldenverba.components.generation.NovitaGenerator import NovitaGenerator
 from goldenverba.components.generation.UpstageGenerator import UpstageGenerator
 
 try:
@@ -116,6 +117,7 @@
             AnthropicGenerator(),
             CohereGenerator(),
             GroqGenerator(),
+            NovitaGenerator(),
             UpstageGenerator(),
         ]
     else:
```
