This repository was archived by the owner on Sep 10, 2025. It is now read-only.
10 changes: 9 additions & 1 deletion tokenizer/sentencepiece.cpp
@@ -12,6 +12,8 @@
#include <tokenizer.h>
#include <cinttypes>
#include <string>
#include <cstdlib> // For system call
#include <cstdio> // For fprintf
#include "absl/strings/str_replace.h"

const char kSpaceSymbol[] = "\xe2\x96\x81";
@@ -38,7 +40,13 @@ void SPTokenizer::load(const std::string& tokenizer_path) {
  // read in the file
  const auto status = _processor->Load(tokenizer_path);
  if (!status.ok()) {
    fprintf(stderr, "couldn't load %s\n. If this tokenizer artifact is for llama3, please pass `-l 3`.", tokenizer_path.c_str());
    // Execute 'ls -al' on the tokenizer path
Contributor

Thanks for adding the print, great for debugging

Looks like the ls is spitting out the root torchchat directory instead of the tokenizer path, which is curious

Contributor Author

That would explain why the tokenizer can't be loaded, and the AOTI tests keep failing. #1429

Contributor Author

I added set -x to the command to echo the ls to make absolutely sure the path is not getting corrupted somehow (not sure how it would, but belts and suspenders)

Contributor

Neutral signal: looks like the arg is not being picked up by ls (which would explain why it just shows PWD)

Contributor Author

That is sooooo weird! Want to add a print of the command and rerun? Maybe there's some magic character that causes indigestion for both the shell running ls and the tokenizer model load?
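One way to check the "magic character" theory is to log the path with every non-printable byte escaped, so a stray newline, carriage return, or control character becomes visible. The sketch below is only an illustration: `escape_for_log` is a hypothetical helper, not something in this PR or in torchchat.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper (not part of the PR): copy printable ASCII as-is and
// render every other byte, plus the backslash itself, as \xNN.
std::string escape_for_log(const std::string& s) {
  std::string out;
  char buf[5];  // "\xNN" plus the terminating NUL
  for (unsigned char c : s) {
    if (c >= 0x20 && c < 0x7f && c != '\\') {
      out.push_back(static_cast<char>(c));
    } else {
      std::snprintf(buf, sizeof(buf), "\\x%02x", static_cast<unsigned>(c));
      out += buf;
    }
  }
  return out;
}

// Possible use inside the failure branch (assumed placement):
//   fprintf(stderr, "path bytes: [%s]\n", escape_for_log(tokenizer_path).c_str());
```

The square brackets in the suggested log line also make an empty path or trailing whitespace easy to spot.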

Contributor Author

Added print of command before execution

Contributor Author

Split the C-style string conversion and the C++ const string into two steps: convert, then append

    std::string command = "ls -al " + tokenizer_path;
    int ret = system(command.c_str());
    if (ret != 0) {
      fprintf(stderr, "Failed to execute 'ls -al' on path: %s\n", tokenizer_path.c_str());
    }
    fprintf(stderr, "Could not load `%s`.\n If this tokenizer artifact is for llama3, please pass `-l 3`.", tokenizer_path.c_str());
    exit(EXIT_FAILURE);
  }
  // load vocab_size, bos_tok, eos_tok
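The `system("ls -al ...")` call in this hunk depends on how the shell splits and interprets the path. An empty path, for instance, would make `ls -al` list the current directory, which fits the symptom in the thread, though the thread does not confirm that as the cause. A shell-free alternative is sketched below, assuming C++17 `std::filesystem` is available; `describe_path` is a hypothetical helper, not code from this PR.

```cpp
#include <filesystem>
#include <string>
#include <system_error>

// Hypothetical helper (not part of the PR): describe what a path points to
// without spawning a shell, so quoting and word splitting cannot interfere.
std::string describe_path(const std::string& path) {
  namespace fs = std::filesystem;
  std::error_code ec;
  const fs::file_status st = fs::status(path, ec);
  if (ec || !fs::exists(st)) {
    return "missing: " + path;
  }
  if (fs::is_directory(st)) {
    return "directory: " + path;
  }
  if (fs::is_regular_file(st)) {
    const auto size = fs::file_size(path, ec);
    if (ec) {
      return "file (size unknown): " + path;
    }
    return "file (" + std::to_string(size) + " bytes): " + path;
  }
  return "other: " + path;
}

// Possible use inside the failure branch (assumed placement):
//   fprintf(stderr, "%s\n", describe_path(tokenizer_path).c_str());
```

With this, an empty `tokenizer_path` is reported as `missing: ` rather than silently listing the working directory. Older GCC releases may need `-lstdc++fs` at link time.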