Commit cef786d

chore: README.md
1 parent 9e9b5aa commit cef786d

File tree

6 files changed (+239 −21 lines changed)


Makefile

Lines changed: 9 additions & 0 deletions
@@ -58,3 +58,12 @@ readme:

 docs:
 	mkdocs serve
+
+
+publish:
+	rm -rf build dist .egg gemma_template.egg-info
+	python -m pip install -r requirements-dev.txt
+	python -m pip install 'twine>=6.0.1'
+	python setup.py sdist bdist_wheel
+	twine upload --skip-existing dist/*
+	rm -rf build dist .egg gemma_template.egg-info

README.md

Lines changed: 180 additions & 0 deletions
@@ -57,3 +57,183 @@ It enhances text readability, aligns with linguistic nuances, and preserves orig
- Supports multiple output formats such as Alpaca, GPT, and SFT text.
- Can be used with other models such as Llama.
- Dynamic prompts are enhanced using a Round-Robin loop.

**Installation**
----------------

To install the library, you can choose between two methods:

#### **1\. Install via PyPI:**

```shell
pip install gemma-template
```

#### **2\. Install via GitHub Repository:**

```shell
pip install git+https://github.com/thewebscraping/gemma-template.git
```

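To verify the installation, you can import the version module that ships with the package (the module path below is taken from this repository's `gemma_template/__version__.py`):

```python
# Quick post-install sanity check; prints the installed package version.
from gemma_template.__version__ import __version__

print(__version__)  # 0.1.0 as of this commit
```
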
**Quick Start**
----------------
Start using Gemma Template with just a few lines of code:

```python
from gemma_template.models import *

prompt_instance = Template(
    structure_field=StructureField(
        title=["Custom Title"],
        description=["Custom Description"],
        document=["Custom Article"],
        main_points=["Custom Main Points"],
        categories=["Custom Categories"],
        tags=["Custom Tags"],
    ),
)  # Create fully customized structured prompts.

response = prompt_instance.template(
    template=GEMMA_TEMPLATE,
    user_template=USER_TEMPLATE,
    instruction_template=INSTRUCTION_TEMPLATE,
    structure_template=STRUCTURE_TEMPLATE,
    title="Gemma open models",
    description="Gemma: Introducing new state-of-the-art open models.",
    document="Gemma open models are built from the same research and technology as Gemini models. Gemma 2 comes in 2B, 9B and 27B and Gemma 1 comes in 2B and 7B sizes.",
    main_points=["Main point 1", "Main point 2"],
    categories=["Artificial Intelligence", "Gemma"],
    tags=["AI", "LLM", "Google"],
    output="A new family of open language models demonstrating strong performance across academic benchmarks for language understanding, reasoning, and safety.",
    max_hidden_words=0.1,  # Set 0 if you don't want to hide words.
    min_chars_length=2,  # Minimum character length of a word used to create unigrams, bigrams, and trigrams. Default is 2.
    max_chars_length=0,  # Maximum character length of a word used to create unigrams, bigrams, and trigrams. Default is 0.
)  # Remove kwargs if not used.
print(response)
```

### Output:

```text
<start_of_turn>user

You are a multilingual professional writer.

Rewrite the text with a more engaging and creative tone. Use vivid imagery, descriptive language, and a conversational style to captivate the reader.

# Role:
You are a highly skilled professional content writer, linguistic analyst, and multilingual expert specializing in structured writing and advanced text processing.

# Task:
Your primary objectives are:
1. Your primary task is to rewrite the provided content into a more structured, professional format that maintains its original intent and meaning.
2. Enhance vocabulary comprehension by analyzing text with unigrams (single words), bigrams (two words), and trigrams (three words).
3. Ensure your response adheres strictly to the prescribed structure format.
4. Respond in the primary language of the input text unless alternative instructions are explicitly given.

# Additional Expectations:
1. Provide a rewritten, enhanced version of the input text, ensuring professionalism, clarity, and improved structure.
2. Focus on multilingual proficiency, using complex vocabulary, grammar to improve your responses.
3. Preserve the context and cultural nuances of the original text when rewriting.

Topics: Artificial Intelligence, Gemma
Keywords: AI, LLM, Google

# Text Analysis:
Example 1: Unigrams (single words)
and => English
built => English
from => English
the => English
research => English
Text Analysis 3: These are common English words, indicating the text is in English.

Example 2: Bigrams (two words)
technology as => English
Text Analysis 2: Frequent bigrams in Vietnamese confirm the language context.

Example 3: Trigrams (three words)
technology as Gemini => English
Text Analysis 3: Trigrams further validate the linguistic analysis and the necessity to respond in English.

# Conclusion of Text Analysis:
The linguistic analysis confirms the text is predominantly in English. Consequently, the response should be structured and written in English to align with the original text and context.

# Response Structure Format:
You must follow the response structure:
**Custom Title (Title):** Rewrite the title to make it concise, memorable, and optimized for SEO.
**Custom Description (Description):** Write description of the article in one or two sentences while focusing on reader benefits and engage curiosity.
**Custom Article (Article):** Rewrite this content to be SEO-friendly. Include relevant tags, optimize the title and subheadings, and ensure the text flows naturally for search engines and readers.
**Custom Main Points (Main Points):** Simplify the original key points to make them clearer and more reader-friendly.
**Custom Categories (Categories):** Assign appropriate categories to the article based text or target audience.
**Custom Tags (Tags):** Create tags to include relevant keywords. Ensure the tags align with popular search queries.

By adhering to this format, the response will maintain linguistic integrity while enhancing professionalism, structure and alignment with user expectations.

# Text:
Gemma open models are built from _____ same research _____ technology as Gemini models. Gemma 2 comes in 2B, 9B _____ 27B and Gemma 1 comes in 2B and 7B sizes.

<end_of_turn>
<start_of_turn>model

## **Custom Title**:
### Gemma open models

## **Custom Description**:
Gemma: Introducing new state-of-the-art open models.

## **Custom Article**:
A new family of open language models demonstrating strong performance across academic benchmarks for language understanding, reasoning, and safety.

## **Custom Main Points**:
- Main point 1
- Main point 2

## **Custom Categories**:
- Artificial Intelligence
- Gemma

## **Custom Tags**:
- AI
- LLM
- Google<end_of_turn>
```

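The `_____` placeholders in the `# Text:` section above come from `max_hidden_words=0.1`, which hides roughly 10% of the input words so the model has to reconstruct them. A minimal, illustrative sketch of that masking idea (not the library's internal implementation):

```python
import random

def mask_words(text: str, ratio: float = 0.1, placeholder: str = "_____") -> str:
    """Replace roughly `ratio` of the words in `text` with a placeholder."""
    words = text.split()
    n_hidden = int(len(words) * ratio)
    for idx in random.sample(range(len(words)), n_hidden):
        words[idx] = placeholder
    return " ".join(words)

print(mask_words("Gemma open models are built from the same research and technology as Gemini models."))
```
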
## Load Dataset
Returns a Hugging Face `Dataset` or `DatasetDict` object containing the processed prompts.

**Load Dataset from local file path**
```python
prompt_instance = Template()
data_dict = [
    {
        "id": "JnZJolR76_u2",
        "title": "Sample title",
        "description": "Sample description",
        "document": "Sample document",
        "categories": ["Topic 1", "Topic 2"],
        "tags": ["Tag 1", "Tag 2"],
        "output": "Sample output",
        "main_points": ["Main point 1", "Main point 2"],
    }
]
dataset = prompt_instance.load_dataset(data_dict, output_format='text')  # enum: text, gpt, alpaca
print(dataset['text'][0])
```

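The same call accepts the other documented `output_format` values, so a quick way to compare the generated records is to sweep all three. This sketch reuses `prompt_instance` and `data_dict` from the example above:

```python
# Generate the same prompts in each supported output format.
for fmt in ("text", "gpt", "alpaca"):
    ds = prompt_instance.load_dataset(data_dict, output_format=fmt)
    print(fmt, ds)
```
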

**Load Dataset from Hugging Face Hub**
```python
dataset = gemma_template.load_dataset(
    "your_huggingface_dataset",
    output_format='gpt',  # enum: text, gpt, alpaca
    instruction_template=INSTRUCTION_TEMPLATE,  # Template for the instruction section of the user prompt.
    structure_template=STRUCTURE_TEMPLATE,  # Template for structuring the user prompt.
    max_hidden_ratio=0.1,  # Percentage of documents that need to be word masked. Min: 0, Max: 1. Default: 0.
    # Replace 10% of the words in the input document with '_____'.
    # Use an int to hide an exact number of words. The `max_hidden_ratio` parameter must be greater than 0.
    max_hidden_words=.1,
    min_chars_length=2,  # Minimum character length of a word used to create unigrams, bigrams, and trigrams. Default is 2.
    max_chars_length=8,  # Maximum character length of a word used to create unigrams, bigrams, and trigrams. Default is 0.
)
```
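As the comments above describe, `max_hidden_ratio` selects what fraction of documents get masked at all, while `max_hidden_words` controls how much of each selected document is hidden (a float ratio or an int word count). A rough, illustrative reading of those two parameters (plain arithmetic, not library code):

```python
# With 1,000 input documents and max_hidden_ratio=0.1, about 100 documents are masked.
n_documents = 1_000
n_masked_docs = int(n_documents * 0.1)

# Within each selected document, max_hidden_words=0.1 hides about 10% of its words;
# an int value (e.g. max_hidden_words=25) would hide that many words instead.
words_in_doc = 250
n_hidden_words = int(words_in_doc * 0.1)

print(n_masked_docs, n_hidden_words)  # 100 25
```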

gemma_template/__version__.py

Lines changed: 2 additions & 2 deletions
@@ -3,5 +3,5 @@
 __url__ = "https://github.com/thewebscraping/gemma-template"
 __author__ = "Tu Pham"
 __author_email__ = "[email protected]"
-__version__ = "0.0.1"
-__license__ = "MIT"
+__version__ = "0.1.0"
+__license__ = "Apache-2.0"

gemma_template/constants.py

Lines changed: 1 addition & 0 deletions
@@ -20,6 +20,7 @@
 <start_of_turn>model

 {model_template}<end_of_turn>
+
 """

 GEMMA_PROMPT_TEMPLATE = """<start_of_turn>user

gemma_template/models.py

Lines changed: 45 additions & 17 deletions
@@ -197,13 +197,16 @@ class Template(BaseTemplate):
 ...     tags=["AI", "LLM", "Google"],
 ...     document="Gemma open models are built from the same research and technology as Gemini models. Gemma 2 comes in 2B, 9B and 27B and Gemma 1 comes in 2B and 7B sizes.",
 ...     output="A new family of open language models demonstrating strong performance across academic benchmarks for language understanding, reasoning, and safety.",
+...     max_hidden_words=.1,  # set 0 if you don't want to hide words.
+...     min_chars_length=2,  # Minimum character of a word, used to create unigrams, bigrams, and trigrams. Default is 2.
+...     max_chars_length=0,  # Maximum character of a word, used to create unigrams, bigrams and trigrams. Default is 0.
 ... )  # remove kwargs if not used.
 >>> print(response)
 <start_of_turn>user

 You are a multilingual professional writer.

-Rewrite the text with a more engaging and creative tone. Use vivid imagery, descriptive language, and a conversational style to captivate the reader.
+Rewrite the text to be more search engine friendly. Incorporate relevant keywords naturally, improve readability, and ensure it aligns with SEO best practices.

 # Role:
 You are a highly skilled professional content writer, linguistic analyst, and multilingual expert specializing in structured writing and advanced text processing.
@@ -233,7 +236,6 @@ class Template(BaseTemplate):
 Text Analysis 3: These are common English words, indicating the text is in English.

 Example 2: Bigrams (two words)
-comes in => English
 technology as => English
 Text Analysis 2: Frequent bigrams in Vietnamese confirm the language context.

@@ -246,8 +248,8 @@ class Template(BaseTemplate):

 # Response Structure Format:
 You must follow the response structure:
-**Custom Title (Title):** Rewrite the title to reflect the main keyword and topic.
-**Custom Description (Description):** Rewrite the description with a bold claim or statistic to grab attention.
+**Custom Title (Title):** Rewrite the title to make it concise, memorable, and optimized for SEO.
+**Custom Description (Description):** Write description of the article in one or two sentences while focusing on reader benefits and engage curiosity.
 **Custom Article (Article):** Transform this text into a formal, professional tone suitable for business communication or an academic audience. Focus on improving vocabulary, grammar, and structure.
 **Custom Main Points (Main Points):** Summarize the main ideas into concise, actionable key points for added context to make them more engaging.
 **Custom Categories (Categories):** Rewrite categories to align with industry standards or popular topics.
@@ -256,11 +258,33 @@ class Template(BaseTemplate):
 By adhering to this format, the response will maintain linguistic integrity while enhancing professionalism, structure and alignment with user expectations.

 # Text:
-Gemma open models are built from the same research and technology as Gemini models. Gemma 2 comes in 2B, 9B and 27B and Gemma 1 comes in 2B and 7B sizes.
+Gemma open models are built from the same research _____ technology as Gemini models. Gemma 2 comes in 2B, 9B _____ 27B and Gemma 1 comes in 2B _____ 7B sizes.

 <end_of_turn>
 <start_of_turn>model

+## **Custom Title**:
+### Gemma open models
+
+## **Custom Description**:
+Gemma: Introducing new state-of-the-art open models.
+
+## **Custom Article**:
+A new family of open language models demonstrating strong performance across academic benchmarks for language understanding, reasoning, and safety.
+
+## **Custom Main Points**:
+- Main point 1
+- Main point 2
+
+## **Custom Categories**:
+- Artificial Intelligence
+- Gemma
+
+## **Custom Tags**:
+- AI
+- LLM
+- Google<end_of_turn>
+
 """  # noqa: E501

 _structure_items: dict[str, tuple[str, str, str]] = {}
@@ -394,7 +418,7 @@ def load_dataset(
 instruction_template (Optional[TemplateTypes]):
     Template for including specific instructions in the prompts.
 structure_template (Optional[TemplateTypes]):
-    Template for structuring the response content.
+    Template for structuring the user prompt.
 output_format (Union[str, Literal["text", "alpaca", "gpt"]]):
     Specifies the format for the generated prompts. Default is "text".
 eos_token_str (Optional[str]):
@@ -407,7 +431,7 @@ def load_dataset(
 min_chars_length (int):
     Minimum character of a word, used to create unigrams, bigrams, and trigrams. Default is 2.
 max_chars_length (int):
-    Maximum character of a word, used to create unigrams, bigrams and trigrams.. Default is 0.
+    Maximum character of a word, used to create unigrams, bigrams and trigrams. Default is 0.
 max_concurrency (int):
     Maximum number of concurrent threads for processing data. Default is 4.
 **kwargs: Additional parameters, including:
@@ -447,6 +471,8 @@ async def create_task(config, hidden_count: int = 0):
 config.update(dict(min_chars_length=min_chars_length, max_chars_length=max_chars_length))
 if max_hidden_ratio > 0 and hidden_count < max_hidden_count:
     config["max_hidden_words"] = max_hidden_words
+else:
+    config["max_hidden_words"] = 0

 if output_format == "alpaca":
     items.append(
@@ -622,7 +648,7 @@ def get_user_kwargs(
     n_words=n_words,
     language=language,
     bullet_style=bullet_style,
-    is_hidden=bool(kwargs.get("max_hidden_words")),
+    is_masked=bool(kwargs.get("max_hidden_words")),
 )
 if isinstance(instruction_template, Callable):
     instruction_template_str = instruction_template(
@@ -823,14 +849,16 @@ def generate_model_prompt(
 """  # noqa: E501

 output_document = kwargs.get("output", "")
-if isinstance(structure_template, Callable):
+if isinstance(structure_template, (str, Callable)):
+    kwargs["document"] = output_document
     if isinstance(structure_template, Callable):
-        output_document = structure_template(self._structure_items, **kwargs)
+    if isinstance(structure_template, Callable):
+        self._structure_items = structure_template(structure_data, **kwargs)

-    elif isinstance(structure_template, str):
-        output_document = self._formatting_structure_model_fn(
-            self._structure_items, bullet_style, **kwargs
-        )
+    else:
+        output_document = self._formatting_structure_model_fn(
+            self._structure_items, bullet_style, **kwargs
+        )

 return output_document.strip() + eos_token_str

@@ -880,7 +908,7 @@ def to_text(
     bigrams=user_kwargs.get("bigrams", []) or [],
     trigrams=user_kwargs.get("trigrams", []) or [],
     language=user_kwargs.get("language"),
-    is_hidden=bool(user_kwargs.get("is_hidden")),
+    is_masked=bool(user_kwargs.get("is_masked")),
 )

 def to_alpaca(
@@ -909,7 +937,7 @@ def to_alpaca(
     bigrams=user_kwargs.get("bigrams", []) or [],
     trigrams=user_kwargs.get("trigrams", []) or [],
     language=user_kwargs.get("language"),
-    is_hidden=bool(user_kwargs.get("is_hidden")),
+    is_masked=bool(user_kwargs.get("is_masked")),
 )

 def to_openai(
@@ -938,7 +966,7 @@ def to_openai(
     bigrams=user_kwargs.get("bigrams", []) or [],
     trigrams=user_kwargs.get("trigrams", []) or [],
     language=user_kwargs.get("language"),
-    is_hidden=bool(user_kwargs.get("is_hidden")),
+    is_masked=bool(user_kwargs.get("is_masked")),
 )

 def _get_prompts(

setup.cfg

Lines changed: 2 additions & 2 deletions
@@ -1,7 +1,7 @@
 [metadata]
 long_description = file: README.md
 long_description_content_type = text/markdown
-license = MIT
+license = Apache2
 license_file = LICENSE
 python_requires = >=3.9
 install_requires =
@@ -14,7 +14,7 @@ classifiers =
     Development Status :: 4 - Beta
     Intended Audience :: Developers
     Environment :: Web Environment
-    License :: OSI Approved :: Apache License 2.0
+    License :: OSI Approved :: Apache Software License
     Natural Language :: English
     Operating System :: OS Independent
     Programming Language :: Python
