feat: Using the HuggingFace tokenizer #4329
Changes from all commits
```diff
@@ -6,9 +6,7 @@
 @date:2024/4/18 15:28
 @desc:
 """
-from typing import List, Dict
-
-from langchain_core.messages import BaseMessage, get_buffer_string
+from typing import Dict
 
 from common.config.tokenizer_manage_config import TokenizerManage
 from models_provider.base_model_provider import MaxKBBaseModel
```
Contributor (Author)

The provided Python code has a minor issue that you may want to address: a redundant `typing` import. Here's an updated version with the redundant import removed:

```diff
@@ -6,12 +6,7 @@
 @date:2024/4/18 15:28
 @desc:
 """
-import typing
-from typing import List, Dict
 from common.config.tokenizer_manage_config import TokenizerManage
 from models_provider.base_model_provider import MaxKBBaseModel
```

Remove the redundant import of `List`. This change makes your code cleaner and eliminates unnecessary redundancy. If there are no other issues, this will suffice.
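The hunk above shows only the import side of the change. As a rough illustration of the pattern those imports serve, here is a hypothetical sketch: token counting goes through the shared `TokenizerManage` tokenizer rather than through langchain message helpers. The class and method names are assumptions, not code from this PR.

```python
# Hypothetical sketch only: the diff shows just the imports, so the class name
# and the get_num_tokens method below are assumptions, not code from this PR.
from common.config.tokenizer_manage_config import TokenizerManage
from models_provider.base_model_provider import MaxKBBaseModel


class SomeChatModel(MaxKBBaseModel):
    def get_num_tokens(self, text: str) -> int:
        # Count tokens with the shared HuggingFace-backed tokenizer instead of
        # estimating from a flattened message string.
        tokenizer = TokenizerManage.get_tokenizer()
        return len(tokenizer.tokenize(text))
```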
```diff
@@ -0,0 +1,23 @@
+{
+  "architectures": [
+    "BertForMaskedLM"
+  ],
+  "attention_probs_dropout_prob": 0.1,
+  "gradient_checkpointing": false,
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.1,
+  "hidden_size": 768,
+  "initializer_range": 0.02,
+  "intermediate_size": 3072,
+  "layer_norm_eps": 1e-12,
+  "max_position_embeddings": 512,
+  "model_type": "bert",
+  "num_attention_heads": 12,
+  "num_hidden_layers": 12,
+  "pad_token_id": 0,
+  "position_embedding_type": "absolute",
+  "transformers_version": "4.6.0.dev0",
+  "type_vocab_size": 2,
+  "use_cache": true,
+  "vocab_size": 28996
+}
```
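This is a standard BERT model configuration as used by HuggingFace `transformers`. A quick way to sanity-check the fields the tokenizer relies on is to load it with `AutoConfig`; the local directory name below is an assumption, not the path used in the PR.

```python
from transformers import AutoConfig

# Assumed layout: the config.json shown above saved in "./bert_tokenizer"
# (the directory name is illustrative; the PR's real path is not shown here).
config = AutoConfig.from_pretrained("./bert_tokenizer")
print(config.model_type)               # "bert"
print(config.vocab_size)               # 28996 (cased BERT vocabulary size)
print(config.max_position_embeddings)  # 512
```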
Large diffs are not rendered by default.
```diff
@@ -0,0 +1 @@
+{"do_lower_case": false, "model_max_length": 512}
```
**Code Review and Suggestions**

- **Imports:**
- **Base Directory:** Using `os.path.join` along with `Path()` simplifies path handling and ensures cross-platform compatibility.
- **MKTokenizer Class:** The `tokenize` method can be simplified to just returning `tokenizer.encode(text)` without wrapping it inside another list.
- **TokenizerManage Class:** Construct the `MKTokenizer` using the loaded tokenizer within the `get_tokenizer` method.
- **File Paths:** Ensure paths derived from `__file__` are correctly calculated. The current absolute path (`BASE_DIR`) should point to the directory containing this file, not its parent or grandparent directories twice.
- **Static Method Considerations:**
- **Revised Code:**
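A minimal sketch of how these suggestions could fit together, assuming the bundled tokenizer files sit in a directory next to this module and that `transformers`' `AutoTokenizer` is used as the loader (both are assumptions rather than details taken from this PR):

```python
from pathlib import Path

from transformers import AutoTokenizer

# Directory containing this file; resolving against __file__ keeps the path
# anchored here instead of climbing to parent or grandparent directories.
BASE_DIR = Path(__file__).resolve().parent

# Assumed location of the bundled files (config.json, tokenizer_config.json,
# vocabulary); the actual layout in the PR may differ.
TOKENIZER_DIR = BASE_DIR / "tokenizer"


class MKTokenizer:
    """Thin wrapper so callers do not depend on the HuggingFace API directly."""

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def tokenize(self, text: str):
        # Return the encoded ids directly instead of wrapping them in another list.
        return self.tokenizer.encode(text)


class TokenizerManage:
    _instance = None

    @staticmethod
    def get_tokenizer() -> MKTokenizer:
        # Load the bundled tokenizer once, construct MKTokenizer from it, and
        # reuse the same instance on later calls.
        if TokenizerManage._instance is None:
            hf_tokenizer = AutoTokenizer.from_pretrained(str(TOKENIZER_DIR))
            TokenizerManage._instance = MKTokenizer(hf_tokenizer)
        return TokenizerManage._instance
```

Callers can then count tokens with `len(TokenizerManage.get_tokenizer().tokenize(text))`.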
By applying these improvements, the code will be cleaner, more maintainable, and closer to Python best practices.