Remove input_ids arg from dataset filter #188
base: main
```diff
@@ -62,7 +62,19 @@
   "id": "JruG4avjKgkz"
  },
  "outputs": [],
- "source": "%%capture\nimport os, re\nif \"COLAB_\" not in \"\".join(os.environ.keys()):\n    !pip install unsloth # Do this in local & cloud setups\nelse:\n    import torch; v = re.match(r'[\\d]{1,}\\.[\\d]{1,}', str(torch.__version__)).group(0)\n    xformers = 'xformers==' + {'2.10':'0.0.34','2.9':'0.0.33.post1','2.8':'0.0.32.post2'}.get(v, \"0.0.34\")\n    !pip install sentencepiece protobuf \"datasets==4.3.0\" \"huggingface_hub>=0.34.0\" hf_transfer\n    !pip install --no-deps unsloth_zoo bitsandbytes accelerate {xformers} peft trl triton unsloth\n!pip install transformers==5.1.0\n!pip install --no-deps trl==0.22.2"
+ "source": [
+  "%%capture\n",
+  "import os, re\n",
+  "if \"COLAB_\" not in \"\".join(os.environ.keys()):\n",
+  "    !pip install unsloth # Do this in local & cloud setups\n",
+  "else:\n",
+  "    import torch; v = re.match(r'[\\d]{1,}\\.[\\d]{1,}', str(torch.__version__)).group(0)\n",
+  "    xformers = 'xformers==' + {'2.10':'0.0.34','2.9':'0.0.33.post1','2.8':'0.0.32.post2'}.get(v, \"0.0.34\")\n",
+  "    !pip install sentencepiece protobuf \"datasets==4.3.0\" \"huggingface_hub>=0.34.0\" hf_transfer\n",
+  "    !pip install --no-deps unsloth_zoo bitsandbytes accelerate {xformers} peft trl triton unsloth\n",
+  "!pip install transformers==5.1.0\n",
+  "!pip install --no-deps trl==0.22.2"
+ ]
 },
 {
  "cell_type": "markdown",
```
```diff
@@ -602,7 +614,7 @@
 },
 {
  "cell_type": "code",
- "execution_count": 12,
+ "execution_count": null,
  "metadata": {
   "colab": {
    "base_uri": "https://localhost:8080/"
```
```diff
@@ -623,7 +635,7 @@
  }
 ],
 "source": [
- "dataset[\"N\"] = dataset[\"Messages\"].apply(lambda x: len(tokenizer.apply_chat_template(x)['input_ids']))\n",
+ "dataset[\"N\"] = dataset[\"Messages\"].apply(lambda x: len(tokenizer.apply_chat_template(x)))\n",
  "\n",
  "dataset = dataset.loc[dataset[\"N\"] <= max_seq_length/2].copy()\n",
  "dataset.shape"
```

Contributor: This change mirrors the one in
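The filtering step in the hunk above can be sketched end-to-end with a toy stand-in for the tokenizer. This is a minimal illustration, not the notebook's actual pipeline: `fake_token_count`, the sample messages, and the `max_seq_length` value are all made up, and one "token" per whitespace-separated word stands in for real tokenization.

```python
import pandas as pd

max_seq_length = 16  # made-up value for illustration

# Toy stand-in for len(tokenizer.apply_chat_template(x)):
# counts one "token" per whitespace-separated word.
def fake_token_count(messages):
    return sum(len(m["content"].split()) for m in messages)

dataset = pd.DataFrame({
    "Messages": [
        [{"role": "user", "content": "hi"}],
        [{"role": "user", "content": "a much longer message that blows past the cap easily yes"}],
    ]
})

# Same shape as the notebook cell: compute N per row,
# then keep only rows with N <= max_seq_length / 2.
dataset["N"] = dataset["Messages"].apply(fake_token_count)
dataset = dataset.loc[dataset["N"] <= max_seq_length / 2].copy()
print(dataset.shape)  # the 11-word row is dropped, leaving one row
```

The `.copy()` matters: `.loc[...]` returns a view-like slice, and copying avoids pandas' `SettingWithCopyWarning` on later assignments to the filtered frame.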
The change from `len(tokenizer.apply_chat_template(x)['input_ids'])` to `len(tokenizer.apply_chat_template(x))` alters how the sequence length `N` is calculated. In the context of `max_seq_length` for language models, `N` typically refers to the number of tokens, not characters.

If `tokenizer.apply_chat_template` is configured to return a string (e.g., `tokenize=False`), the original code would have raised a `TypeError`, and the changed code would compute the character length instead of the token count.

If it is configured to return a dictionary of encodings (e.g., `tokenize=True` with `return_dict=True`), the changed code would compute the number of keys in that dictionary rather than the token count.

To ensure `N` accurately reflects the token count when filtering against `max_seq_length`, explicitly tokenize the input and take the length of the resulting `input_ids`.
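The pitfall the reviewer describes can be reproduced with a plain dict standing in for the tokenizer's `return_dict=True` output (a sketch; the token IDs and keys below are made up, not real tokenizer output):

```python
# Stand-in for tokenizer.apply_chat_template(..., tokenize=True, return_dict=True),
# which returns a dict-like BatchEncoding with an "input_ids" key.
encoded = {
    "input_ids": [101, 2023, 2003, 1037, 3231, 102],  # 6 made-up token IDs
    "attention_mask": [1, 1, 1, 1, 1, 1],
}

# Correct token count: length of the input_ids list.
n_tokens = len(encoded["input_ids"])  # 6

# Buggy version after dropping ['input_ids']: len() of a dict
# counts its keys, not the tokens it encodes.
n_keys = len(encoded)  # 2

print(n_tokens, n_keys)  # prints: 6 2
```

Filtering on `n_keys` would pass every row regardless of actual length, silently defeating the `max_seq_length` check.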