Commit cef786d

chore: README.md
1 parent 9e9b5aa commit cef786d

File tree

6 files changed (+239 −21 lines changed)


Makefile

Lines changed: 9 additions & 0 deletions
@@ -58,3 +58,12 @@ readme:

 docs:
 	mkdocs serve
+
+
+publish:
+	rm -rf build dist .egg gemma_template.egg-info
+	python -m pip install -r requirements-dev.txt
+	python -m pip install 'twine>=6.0.1'
+	python setup.py sdist bdist_wheel
+	twine upload --skip-existing dist/*
+	rm -rf build dist .egg gemma_template.egg-info

README.md

Lines changed: 180 additions & 0 deletions
@@ -57,3 +57,183 @@ It enhances text readability, aligns with linguistic nuances, and preserves orig
- Supports multiple output formats such as Alpaca, GPT, and SFT text.
- Can be used with other models such as Llama.
- Dynamic prompts are enhanced using a Round-Robin loop.

**Installation**
----------------

To install the library, you can choose between two methods:

#### **1\. Install via PyPI:**

```shell
pip install gemma-template
```

#### **2\. Install via GitHub Repository:**

```shell
pip install git+https://github.com/thewebscraping/gemma-template.git
```

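To verify the installation, you can import the version module that ships with the package (the module path below is taken from this repository's `gemma_template/__version__.py`):

```python
# Quick post-install sanity check; prints the installed package version.
from gemma_template.__version__ import __version__

print(__version__)  # 0.1.0 as of this commit
```
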
**Quick Start**
----------------
Start using Gemma Template with just a few lines of code:

```python
from gemma_template.models import *

prompt_instance = Template(
    structure_field=StructureField(
        title=["Custom Title"],
        description=["Custom Description"],
        document=["Custom Article"],
        main_points=["Custom Main Points"],
        categories=["Custom Categories"],
        tags=["Custom Tags"],
    ),
)  # Create fully customized structured prompts.

response = prompt_instance.template(
    template=GEMMA_TEMPLATE,
    user_template=USER_TEMPLATE,
    instruction_template=INSTRUCTION_TEMPLATE,
    structure_template=STRUCTURE_TEMPLATE,
    title="Gemma open models",
    description="Gemma: Introducing new state-of-the-art open models.",
    document="Gemma open models are built from the same research and technology as Gemini models. Gemma 2 comes in 2B, 9B and 27B and Gemma 1 comes in 2B and 7B sizes.",
    main_points=["Main point 1", "Main point 2"],
    categories=["Artificial Intelligence", "Gemma"],
    tags=["AI", "LLM", "Google"],
    output="A new family of open language models demonstrating strong performance across academic benchmarks for language understanding, reasoning, and safety.",
    max_hidden_words=0.1,  # Set 0 if you don't want to hide words.
    min_chars_length=2,  # Minimum character length of a word used to create unigrams, bigrams, and trigrams. Default is 2.
    max_chars_length=0,  # Maximum character length of a word used to create unigrams, bigrams, and trigrams. Default is 0.
)  # Remove kwargs if not used.
print(response)
```

### Output:

```text
<start_of_turn>user

You are a multilingual professional writer.

Rewrite the text with a more engaging and creative tone. Use vivid imagery, descriptive language, and a conversational style to captivate the reader.

# Role:
You are a highly skilled professional content writer, linguistic analyst, and multilingual expert specializing in structured writing and advanced text processing.

# Task:
Your primary objectives are:
1. Your primary task is to rewrite the provided content into a more structured, professional format that maintains its original intent and meaning.
2. Enhance vocabulary comprehension by analyzing text with unigrams (single words), bigrams (two words), and trigrams (three words).
3. Ensure your response adheres strictly to the prescribed structure format.
4. Respond in the primary language of the input text unless alternative instructions are explicitly given.

# Additional Expectations:
1. Provide a rewritten, enhanced version of the input text, ensuring professionalism, clarity, and improved structure.
2. Focus on multilingual proficiency, using complex vocabulary, grammar to improve your responses.
3. Preserve the context and cultural nuances of the original text when rewriting.

Topics: Artificial Intelligence, Gemma
Keywords: AI, LLM, Google

# Text Analysis:
Example 1: Unigrams (single words)
and => English
built => English
from => English
the => English
research => English
Text Analysis 3: These are common English words, indicating the text is in English.

Example 2: Bigrams (two words)
technology as => English
Text Analysis 2: Frequent bigrams in Vietnamese confirm the language context.

Example 3: Trigrams (three words)
technology as Gemini => English
Text Analysis 3: Trigrams further validate the linguistic analysis and the necessity to respond in English.

# Conclusion of Text Analysis:
The linguistic analysis confirms the text is predominantly in English. Consequently, the response should be structured and written in English to align with the original text and context.

# Response Structure Format:
You must follow the response structure:
**Custom Title (Title):** Rewrite the title to make it concise, memorable, and optimized for SEO.
**Custom Description (Description):** Write description of the article in one or two sentences while focusing on reader benefits and engage curiosity.
**Custom Article (Article):** Rewrite this content to be SEO-friendly. Include relevant tags, optimize the title and subheadings, and ensure the text flows naturally for search engines and readers.
**Custom Main Points (Main Points):** Simplify the original key points to make them clearer and more reader-friendly.
**Custom Categories (Categories):** Assign appropriate categories to the article based text or target audience.
**Custom Tags (Tags):** Create tags to include relevant keywords. Ensure the tags align with popular search queries.

By adhering to this format, the response will maintain linguistic integrity while enhancing professionalism, structure and alignment with user expectations.

# Text:
Gemma open models are built from _____ same research _____ technology as Gemini models. Gemma 2 comes in 2B, 9B _____ 27B and Gemma 1 comes in 2B and 7B sizes.

<end_of_turn>
<start_of_turn>model

## **Custom Title**:
### Gemma open models

## **Custom Description**:
Gemma: Introducing new state-of-the-art open models.

## **Custom Article**:
A new family of open language models demonstrating strong performance across academic benchmarks for language understanding, reasoning, and safety.

## **Custom Main Points**:
- Main point 1
- Main point 2

## **Custom Categories**:
- Artificial Intelligence
- Gemma

## **Custom Tags**:
- AI
- LLM
- Google<end_of_turn>
```

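The `_____` placeholders in the `# Text:` section above come from `max_hidden_words=0.1`, which hides roughly 10% of the input words so the model has to reconstruct them. A minimal, illustrative sketch of that masking idea (not the library's internal implementation):

```python
import random

def mask_words(text: str, ratio: float = 0.1, placeholder: str = "_____") -> str:
    """Replace roughly `ratio` of the words in `text` with a placeholder."""
    words = text.split()
    n_hidden = int(len(words) * ratio)
    for idx in random.sample(range(len(words)), n_hidden):
        words[idx] = placeholder
    return " ".join(words)

print(mask_words("Gemma open models are built from the same research and technology as Gemini models."))
```
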
## Load Dataset
Returns a Hugging Face `Dataset` or `DatasetDict` object containing the processed prompts.

**Load Dataset from local file path**
```python
prompt_instance = Template()
data_dict = [
    {
        "id": "JnZJolR76_u2",
        "title": "Sample title",
        "description": "Sample description",
        "document": "Sample document",
        "categories": ["Topic 1", "Topic 2"],
        "tags": ["Tag 1", "Tag 2"],
        "output": "Sample output",
        "main_points": ["Main point 1", "Main point 2"],
    }
]
dataset = prompt_instance.load_dataset(data_dict, output_format='text')  # enum: text, gpt, alpaca
print(dataset['text'][0])
```

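The same call accepts the other documented `output_format` values, so a quick way to compare the generated records is to sweep all three. This sketch reuses `prompt_instance` and `data_dict` from the example above:

```python
# Generate the same prompts in each supported output format.
for fmt in ("text", "gpt", "alpaca"):
    ds = prompt_instance.load_dataset(data_dict, output_format=fmt)
    print(fmt, ds)
```
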

**Load Dataset from Hugging Face Hub**
```python
dataset = gemma_template.load_dataset(
    "your_huggingface_dataset",
    output_format='gpt',  # enum: text, gpt, alpaca
    instruction_template=INSTRUCTION_TEMPLATE,  # Template for the instruction section of the user prompt.
    structure_template=STRUCTURE_TEMPLATE,  # Template for structuring the user prompt.
    max_hidden_ratio=0.1,  # Percentage of documents that need to be word masked. Min: 0, Max: 1. Default: 0.
    # Replace 10% of the words in the input document with '_____'.
    # Use an int to hide an exact number of words. The `max_hidden_ratio` parameter must be greater than 0.
    max_hidden_words=.1,
    min_chars_length=2,  # Minimum character length of a word used to create unigrams, bigrams, and trigrams. Default is 2.
    max_chars_length=8,  # Maximum character length of a word used to create unigrams, bigrams, and trigrams. Default is 0.
)
```
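As the comments above describe, `max_hidden_ratio` selects what fraction of documents get masked at all, while `max_hidden_words` controls how much of each selected document is hidden (a float ratio or an int word count). A rough, illustrative reading of those two parameters (plain arithmetic, not library code):

```python
# With 1,000 input documents and max_hidden_ratio=0.1, about 100 documents are masked.
n_documents = 1_000
n_masked_docs = int(n_documents * 0.1)

# Within each selected document, max_hidden_words=0.1 hides about 10% of its words;
# an int value (e.g. max_hidden_words=25) would hide that many words instead.
words_in_doc = 250
n_hidden_words = int(words_in_doc * 0.1)

print(n_masked_docs, n_hidden_words)  # 100 25
```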

gemma_template/__version__.py

Lines changed: 2 additions & 2 deletions
@@ -3,5 +3,5 @@
 __url__ = "https://github.com/thewebscraping/gemma-template"
 __author__ = "Tu Pham"
 __author_email__ = "[email protected]"
-__version__ = "0.0.1"
-__license__ = "MIT"
+__version__ = "0.1.0"
+__license__ = "Apache-2.0"

gemma_template/constants.py

Lines changed: 1 addition & 0 deletions
@@ -20,6 +20,7 @@
 <start_of_turn>model

 {model_template}<end_of_turn>
+
 """

 GEMMA_PROMPT_TEMPLATE = """<start_of_turn>user

gemma_template/models.py

Lines changed: 45 additions & 17 deletions
@@ -197,13 +197,16 @@ class Template(BaseTemplate):
 ...     tags=["AI", "LLM", "Google"],
 ...     document="Gemma open models are built from the same research and technology as Gemini models. Gemma 2 comes in 2B, 9B and 27B and Gemma 1 comes in 2B and 7B sizes.",
 ...     output="A new family of open language models demonstrating strong performance across academic benchmarks for language understanding, reasoning, and safety.",
+...     max_hidden_words=.1,  # set 0 if you don't want to hide words.
+...     min_chars_length=2,  # Minimum character of a word, used to create unigrams, bigrams, and trigrams. Default is 2.
+...     max_chars_length=0,  # Maximum character of a word, used to create unigrams, bigrams and trigrams. Default is 0.
 ... )  # remove kwargs if not used.
 >>> print(response)
 <start_of_turn>user

 You are a multilingual professional writer.

-Rewrite the text with a more engaging and creative tone. Use vivid imagery, descriptive language, and a conversational style to captivate the reader.
+Rewrite the text to be more search engine friendly. Incorporate relevant keywords naturally, improve readability, and ensure it aligns with SEO best practices.

 # Role:
 You are a highly skilled professional content writer, linguistic analyst, and multilingual expert specializing in structured writing and advanced text processing.
@@ -233,7 +236,6 @@ class Template(BaseTemplate):
 Text Analysis 3: These are common English words, indicating the text is in English.

 Example 2: Bigrams (two words)
-comes in => English
 technology as => English
 Text Analysis 2: Frequent bigrams in Vietnamese confirm the language context.

@@ -246,8 +248,8 @@ class Template(BaseTemplate):

 # Response Structure Format:
 You must follow the response structure:
-**Custom Title (Title):** Rewrite the title to reflect the main keyword and topic.
-**Custom Description (Description):** Rewrite the description with a bold claim or statistic to grab attention.
+**Custom Title (Title):** Rewrite the title to make it concise, memorable, and optimized for SEO.
+**Custom Description (Description):** Write description of the article in one or two sentences while focusing on reader benefits and engage curiosity.
 **Custom Article (Article):** Transform this text into a formal, professional tone suitable for business communication or an academic audience. Focus on improving vocabulary, grammar, and structure.
 **Custom Main Points (Main Points):** Summarize the main ideas into concise, actionable key points for added context to make them more engaging.
 **Custom Categories (Categories):** Rewrite categories to align with industry standards or popular topics.
@@ -256,11 +258,33 @@ class Template(BaseTemplate):
 By adhering to this format, the response will maintain linguistic integrity while enhancing professionalism, structure and alignment with user expectations.

 # Text:
-Gemma open models are built from the same research and technology as Gemini models. Gemma 2 comes in 2B, 9B and 27B and Gemma 1 comes in 2B and 7B sizes.
+Gemma open models are built from the same research _____ technology as Gemini models. Gemma 2 comes in 2B, 9B _____ 27B and Gemma 1 comes in 2B _____ 7B sizes.

 <end_of_turn>
 <start_of_turn>model

+## **Custom Title**:
+### Gemma open models
+
+## **Custom Description**:
+Gemma: Introducing new state-of-the-art open models.
+
+## **Custom Article**:
+A new family of open language models demonstrating strong performance across academic benchmarks for language understanding, reasoning, and safety.
+
+## **Custom Main Points**:
+- Main point 1
+- Main point 2
+
+## **Custom Categories**:
+- Artificial Intelligence
+- Gemma
+
+## **Custom Tags**:
+- AI
+- LLM
+- Google<end_of_turn>
+
 """  # noqa: E501

 _structure_items: dict[str, tuple[str, str, str]] = {}
@@ -394,7 +418,7 @@ def load_dataset(
 instruction_template (Optional[TemplateTypes]):
     Template for including specific instructions in the prompts.
 structure_template (Optional[TemplateTypes]):
-    Template for structuring the response content.
+    Template for structuring the user prompt.
 output_format (Union[str, Literal["text", "alpaca", "gpt"]]):
     Specifies the format for the generated prompts. Default is "text".
 eos_token_str (Optional[str]):
@@ -407,7 +431,7 @@ def load_dataset(
 min_chars_length (int):
     Minimum character of a word, used to create unigrams, bigrams, and trigrams. Default is 2.
 max_chars_length (int):
-    Maximum character of a word, used to create unigrams, bigrams and trigrams.. Default is 0.
+    Maximum character of a word, used to create unigrams, bigrams and trigrams. Default is 0.
 max_concurrency (int):
     Maximum number of concurrent threads for processing data. Default is 4.
 **kwargs: Additional parameters, including:
@@ -447,6 +471,8 @@ async def create_task(config, hidden_count: int = 0):
 config.update(dict(min_chars_length=min_chars_length, max_chars_length=max_chars_length))
 if max_hidden_ratio > 0 and hidden_count < max_hidden_count:
     config["max_hidden_words"] = max_hidden_words
+else:
+    config["max_hidden_words"] = 0

 if output_format == "alpaca":
     items.append(
@@ -622,7 +648,7 @@ def get_user_kwargs(
     n_words=n_words,
     language=language,
     bullet_style=bullet_style,
-    is_hidden=bool(kwargs.get("max_hidden_words")),
+    is_masked=bool(kwargs.get("max_hidden_words")),
 )
 if isinstance(instruction_template, Callable):
     instruction_template_str = instruction_template(
@@ -823,14 +849,16 @@ def generate_model_prompt(
 """  # noqa: E501

 output_document = kwargs.get("output", "")
-if isinstance(structure_template, Callable):
+if isinstance(structure_template, (str, Callable)):
+    kwargs["document"] = output_document
     if isinstance(structure_template, Callable):
-        output_document = structure_template(self._structure_items, **kwargs)
+    if isinstance(structure_template, Callable):
+        self._structure_items = structure_template(structure_data, **kwargs)

-    elif isinstance(structure_template, str):
-        output_document = self._formatting_structure_model_fn(
-            self._structure_items, bullet_style, **kwargs
-        )
+    else:
+        output_document = self._formatting_structure_model_fn(
+            self._structure_items, bullet_style, **kwargs
+        )

 return output_document.strip() + eos_token_str

@@ -880,7 +908,7 @@ def to_text(
     bigrams=user_kwargs.get("bigrams", []) or [],
     trigrams=user_kwargs.get("trigrams", []) or [],
     language=user_kwargs.get("language"),
-    is_hidden=bool(user_kwargs.get("is_hidden")),
+    is_masked=bool(user_kwargs.get("is_masked")),
 )

 def to_alpaca(
@@ -909,7 +937,7 @@ def to_alpaca(
     bigrams=user_kwargs.get("bigrams", []) or [],
     trigrams=user_kwargs.get("trigrams", []) or [],
     language=user_kwargs.get("language"),
-    is_hidden=bool(user_kwargs.get("is_hidden")),
+    is_masked=bool(user_kwargs.get("is_masked")),
 )

 def to_openai(
@@ -938,7 +966,7 @@ def to_openai(
     bigrams=user_kwargs.get("bigrams", []) or [],
     trigrams=user_kwargs.get("trigrams", []) or [],
     language=user_kwargs.get("language"),
-    is_hidden=bool(user_kwargs.get("is_hidden")),
+    is_masked=bool(user_kwargs.get("is_masked")),
 )

 def _get_prompts(

setup.cfg

Lines changed: 2 additions & 2 deletions
@@ -1,7 +1,7 @@
 [metadata]
 long_description = file: README.md
 long_description_content_type = text/markdown
-license = MIT
+license = Apache2
 license_file = LICENSE
 python_requires = >=3.9
 install_requires =
@@ -14,7 +14,7 @@ classifiers =
     Development Status :: 4 - Beta
     Intended Audience :: Developers
     Environment :: Web Environment
-    License :: OSI Approved :: Apache License 2.0
+    License :: OSI Approved :: Apache Software License
     Natural Language :: English
     Operating System :: OS Independent
     Programming Language :: Python
