20 changes: 16 additions & 4 deletions nb/Qwen3_MoE.ipynb
@@ -57,12 +57,24 @@
},
{
"cell_type": "code",
"execution_count": 1,
"execution_count": null,
"metadata": {
"id": "JruG4avjKgkz"
},
"outputs": [],
"source": "%%capture\nimport os, re\nif \"COLAB_\" not in \"\".join(os.environ.keys()):\n !pip install unsloth # Do this in local & cloud setups\nelse:\n import torch; v = re.match(r'[\\d]{1,}\\.[\\d]{1,}', str(torch.__version__)).group(0)\n xformers = 'xformers==' + {'2.10':'0.0.34','2.9':'0.0.33.post1','2.8':'0.0.32.post2'}.get(v, \"0.0.34\")\n !pip install sentencepiece protobuf \"datasets==4.3.0\" \"huggingface_hub>=0.34.0\" hf_transfer\n !pip install --no-deps unsloth_zoo bitsandbytes accelerate {xformers} peft trl triton unsloth\n!pip install transformers==5.1.0\n!pip install --no-deps trl==0.22.2"
"source": [
"%%capture\n",
"import os, re\n",
"if \"COLAB_\" not in \"\".join(os.environ.keys()):\n",
" !pip install unsloth # Do this in local & cloud setups\n",
"else:\n",
" import torch; v = re.match(r'[\\d]{1,}\\.[\\d]{1,}', str(torch.__version__)).group(0)\n",
" xformers = 'xformers==' + {'2.10':'0.0.34','2.9':'0.0.33.post1','2.8':'0.0.32.post2'}.get(v, \"0.0.34\")\n",
" !pip install sentencepiece protobuf \"datasets==4.3.0\" \"huggingface_hub>=0.34.0\" hf_transfer\n",
" !pip install --no-deps unsloth_zoo bitsandbytes accelerate {xformers} peft trl triton unsloth\n",
"!pip install transformers==5.1.0\n",
"!pip install --no-deps trl==0.22.2"
]
},
{
"cell_type": "markdown",
@@ -641,7 +653,7 @@
},
{
"cell_type": "code",
"execution_count": 14,
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
@@ -662,7 +674,7 @@
}
],
"source": [
"dataset[\"N\"] = dataset[\"Messages\"].apply(lambda x: len(tokenizer.apply_chat_template(x)['input_ids']))\n",
"dataset[\"N\"] = dataset[\"Messages\"].apply(lambda x: len(tokenizer.apply_chat_template(x)))\n",
Contributor
high

The change from `len(tokenizer.apply_chat_template(x)['input_ids'])` to `len(tokenizer.apply_chat_template(x))` alters how the sequence length `N` is calculated. In the context of `max_seq_length` for language models, `N` typically refers to the number of tokens, not characters.

- If `tokenizer.apply_chat_template` is configured to return a string (e.g., `tokenize=False`), the original code would have raised a `TypeError`; the changed code then measures character length instead.
- If it is configured to return a dictionary with token IDs (e.g., `tokenize=True`), the changed code counts the dictionary's keys rather than the number of tokens.

To ensure `N` accurately reflects the token count when filtering against `max_seq_length`, explicitly tokenize the input and then access `input_ids`:

dataset["N"] = dataset["Messages"].apply(lambda x: len(tokenizer.apply_chat_template(x, tokenize=True)['input_ids']))
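To make the pitfall concrete, here is a minimal sketch using a hypothetical stub that mimics the two return shapes of `apply_chat_template` (this is an illustration of the `len()` behavior, not the real `transformers` API):

```python
# Hypothetical stub mimicking the two return shapes of apply_chat_template.
# Illustrates the len() pitfall only; not the real transformers API.
def apply_chat_template(messages, tokenize=True):
    text = " ".join(m["content"] for m in messages)
    if not tokenize:
        return text  # plain string: len() gives character count
    # dict of lists: len() gives the number of KEYS, not tokens
    return {"input_ids": list(range(len(text.split()))),
            "attention_mask": [1] * len(text.split())}

msgs = [{"role": "user", "content": "one two three four"}]

out = apply_chat_template(msgs, tokenize=True)
print(len(out))                # 2  -> counts dict keys, not tokens!
print(len(out["input_ids"]))   # 4  -> the actual token count
```

With a dict return value, `len(...)` silently yields the number of keys (a small constant), so the `N <= max_seq_length/2` filter would pass every row regardless of its true length.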

"\n",
"dataset = dataset.loc[dataset[\"N\"] <= max_seq_length/2].copy()\n",
"dataset.shape"
18 changes: 15 additions & 3 deletions nb/TinyQwen3_MoE.ipynb
@@ -62,7 +62,19 @@
"id": "JruG4avjKgkz"
},
"outputs": [],
"source": "%%capture\nimport os, re\nif \"COLAB_\" not in \"\".join(os.environ.keys()):\n !pip install unsloth # Do this in local & cloud setups\nelse:\n import torch; v = re.match(r'[\\d]{1,}\\.[\\d]{1,}', str(torch.__version__)).group(0)\n xformers = 'xformers==' + {'2.10':'0.0.34','2.9':'0.0.33.post1','2.8':'0.0.32.post2'}.get(v, \"0.0.34\")\n !pip install sentencepiece protobuf \"datasets==4.3.0\" \"huggingface_hub>=0.34.0\" hf_transfer\n !pip install --no-deps unsloth_zoo bitsandbytes accelerate {xformers} peft trl triton unsloth\n!pip install transformers==5.1.0\n!pip install --no-deps trl==0.22.2"
"source": [
"%%capture\n",
"import os, re\n",
"if \"COLAB_\" not in \"\".join(os.environ.keys()):\n",
" !pip install unsloth # Do this in local & cloud setups\n",
"else:\n",
" import torch; v = re.match(r'[\\d]{1,}\\.[\\d]{1,}', str(torch.__version__)).group(0)\n",
" xformers = 'xformers==' + {'2.10':'0.0.34','2.9':'0.0.33.post1','2.8':'0.0.32.post2'}.get(v, \"0.0.34\")\n",
" !pip install sentencepiece protobuf \"datasets==4.3.0\" \"huggingface_hub>=0.34.0\" hf_transfer\n",
" !pip install --no-deps unsloth_zoo bitsandbytes accelerate {xformers} peft trl triton unsloth\n",
"!pip install transformers==5.1.0\n",
"!pip install --no-deps trl==0.22.2"
]
},
{
"cell_type": "markdown",
@@ -602,7 +614,7 @@
},
{
"cell_type": "code",
"execution_count": 12,
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
@@ -623,7 +635,7 @@
}
],
"source": [
"dataset[\"N\"] = dataset[\"Messages\"].apply(lambda x: len(tokenizer.apply_chat_template(x)['input_ids']))\n",
"dataset[\"N\"] = dataset[\"Messages\"].apply(lambda x: len(tokenizer.apply_chat_template(x)))\n",
Contributor
high

This change mirrors the one in nb/Qwen3_MoE.ipynb. Removing `['input_ids']` from `tokenizer.apply_chat_template(x)` shifts `N` from a token count to a character count (or a dict-key count, depending on the tokenizer configuration). Given that `max_seq_length` in LLM training usually refers to token length, this could lead to incorrect dataset filtering if token length was the original intent. Please clarify whether the intention is now to filter by character length, or whether `tokenize=True` and `['input_ids']` should be used explicitly to ensure token-based filtering.

dataset["N"] = dataset["Messages"].apply(lambda x: len(tokenizer.apply_chat_template(x, tokenize=True)['input_ids']))

"\n",
"dataset = dataset.loc[dataset[\"N\"] <= max_seq_length/2].copy()\n",
"dataset.shape"