
Conversation

@pancak3
Contributor

@pancak3 pancak3 commented Sep 13, 2025

/resolve #196

@pancak3
Contributor Author

pancak3 commented Sep 15, 2025

Preview with LibreChat, using 46e5d1e as the backend:

[output screenshot]

@mayabar @irar2, this PR is big. I refactored a lot while implementing the custom dataset. Please don't hesitate to let me know what I should fix or improve.

I also created a Hugging Face organization to host the converted dataset. Let me know which accounts should own it. Thanks!
The org was created to host a prepared SQLite dataset file so the dataset conversion doesn't have to be repeated. I have written a simple README there; please take a look.

You have probably observed that if the prompt does not hit the dataset, the lookup has to go through all the keys, which takes some time and thus affects the desired TTFT if one is configured. However, I have passed this ball to another PR; discussion is needed to decide how to solve it. My initial thought is to run the lookup in a goroutine with the desired TTFT as a timeout: if time runs out, return random text instead. Then again, users who go with custom datasets may consider this lookup time minor. It is complicated : )
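To make the idea concrete, here is a minimal sketch of the timeout approach; `lookupResponse` and `randomText` are hypothetical placeholders, not functions from this PR:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Hypothetical placeholders for the dataset lookup and the existing
// random-text generator; not the actual simulator code.
func lookupResponse(prompt string) string { return "" }
func randomText() string                  { return "randomly generated text" }

// responseWithDeadline runs the dataset lookup in a goroutine bounded by
// the desired TTFT; if the lookup times out, fall back to random text.
func responseWithDeadline(ctx context.Context, prompt string, ttft time.Duration) string {
	ctx, cancel := context.WithTimeout(ctx, ttft)
	defer cancel()

	// Buffered so the goroutine can finish even after a timeout.
	found := make(chan string, 1)
	go func() {
		// Potentially slow: may scan every key when the prompt misses the dataset.
		found <- lookupResponse(prompt)
	}()

	select {
	case resp := <-found:
		return resp
	case <-ctx.Done():
		return randomText()
	}
}

func main() {
	fmt.Println(responseWithDeadline(context.Background(), "hello", 200*time.Millisecond))
}
```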

@pancak3 pancak3 marked this pull request as ready for review September 15, 2025 13:06
@mayabar
Collaborator

mayabar commented Sep 16, 2025

I also created a Hugging Face organization to host the converted dataset. Let me know which accounts should own it. Thanks!
The org was created to host a prepared SQLite dataset file so the dataset conversion doesn't have to be repeated. I have written a simple README there; please take a look.

I passed the question forward; I'll keep you updated.

@mayabar
Collaborator

mayabar commented Sep 16, 2025

Hi Qifan,
We briefly went over this PR and have some concerns about the custom dataset implementation. To make sure we understood the concept correctly, could you explain in a couple of sentences the main idea behind how generated text is selected?
Additionally, do you have a utility for converting the HF dataset into the format you are using? And why not simply store it in a text file and load it into a map in memory?

@pancak3
Contributor Author

pancak3 commented Sep 16, 2025

Hi Qifan, We briefly went over this PR and have some concerns about the custom dataset implementation. To make sure we understood the concept correctly, could you explain in a couple of sentences the main idea behind how generated text is selected? Additionally, do you have a utility for converting the HF dataset into the format you are using? And why not simply store it in a text file and load it into a map in memory?

@mayabar
Hi Maya,

Yes, I expected concerns. Perhaps I should have discussed this before implementing it.

For each request, a hash is generated: over the full messages for chat completions, and over the prompt for text completions. The hash is used to look up responses. If no response is found, or its number of tokens is greater than max tokens, the desired number of tokens is used to look up responses instead. The “desired number of tokens” is generated using the existing code.

If responses are found in the dataset, one is selected from them at random. Otherwise, text is generated randomly using the existing implementation.
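Roughly, the flow looks like this; a sketch only, where `findByHash`, `findByTokenCount`, `generateRandom`, and `desiredTokenCount` are placeholder names, not the actual functions in this PR:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"math/rand"
)

// Hypothetical helpers standing in for the real implementation.
func findByHash(hash string) []string { return nil }
func findByTokenCount(n int) []string { return nil }
func generateRandom(n int) string     { return "randomly generated text" }
func desiredTokenCount() int          { return 42 } // produced by the existing code

// selectResponse hashes the request payload (the full messages for chat,
// the prompt for text completion) and looks responses up by that hash;
// on a miss it falls back to a lookup by the desired number of tokens,
// then to random generation.
func selectResponse(payload string) string {
	sum := sha256.Sum256([]byte(payload))
	candidates := findByHash(hex.EncodeToString(sum[:]))
	if len(candidates) == 0 { // the max-tokens check is elided for brevity
		candidates = findByTokenCount(desiredTokenCount())
	}
	if len(candidates) > 0 {
		return candidates[rand.Intn(len(candidates))]
	}
	return generateRandom(desiredTokenCount())
}

func main() {
	fmt.Println(selectResponse(`[{"role":"user","content":"hello"}]`))
}
```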

And why not simply store it in a text file and load it into a map in memory?

The main purpose of using a prepared dataset is performance, along with some other considerations:

  1. Reduce load time and avoid repeated computation. Hash-based lookup requires every conversation to have a hash. If the hashes are generated when llmd-sim starts, that takes linear time, and the computation is repeated on every start of every instance. That does not seem eco-friendly to me.
  2. Memory capacity considerations. Loading the whole dataset consumes the host machine's memory and may make llmd-sim heavy. Disk, by contrast, is easier to obtain, especially for users who work with giant datasets. It is a trade-off, though; memory is faster.
  3. Decoupling the dataset preparation functionality. Datasets can have very different structures, so keeping the conversion in a separate place seems better for repo management, to me.

do you have a utility for converting the HF dataset into the format you are using?

Yes, the utility is kept with the dataset in the Hugging Face repo. Briefly, it computes a SHA-256 hash over the full messages and saves the result in a SQLite DB. The file is called main.ipynb, and there is a README as well.
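The actual utility is that Python notebook; purely to illustrate the scheme described above, here is a minimal Go sketch with an assumed table layout (the real schema in the HF repo may differ):

```go
package main

import (
	"crypto/sha256"
	"database/sql"
	"encoding/hex"
	"log"

	_ "github.com/mattn/go-sqlite3" // assumed driver choice for this sketch
)

func main() {
	db, err := sql.Open("sqlite3", "dataset.sqlite3")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Assumed layout: one row per (hash, response) pair, with the
	// response token count kept for the fallback lookup.
	if _, err := db.Exec(`CREATE TABLE IF NOT EXISTS responses (
		prompt_hash TEXT, n_tokens INTEGER, response TEXT)`); err != nil {
		log.Fatal(err)
	}

	fullMessages := `[{"role":"user","content":"hello"}]` // one example record
	response := "Hi! How can I help you today?"
	sum := sha256.Sum256([]byte(fullMessages))

	if _, err := db.Exec(`INSERT INTO responses VALUES (?, ?, ?)`,
		hex.EncodeToString(sum[:]), 8, response); err != nil {
		log.Fatal(err)
	}
}
```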


To wrap up, the implementation is mainly for performance. I can add a config option to let users decide the behaviour, e.g. loading the dataset in memory. Let me know what you think.

@nilig
Collaborator

nilig commented Sep 16, 2025

I also created a Hugging Face organization to host the converted dataset. Let me know which accounts should own it. Thanks!
The org was created to host a prepared SQLite dataset file so the dataset conversion doesn't have to be repeated. I have written a simple README there; please take a look.

I passed the question forward; I'll keep you updated.

@pancak3 Thanks a lot for setting this up and for preparing the dataset + README — really helpful!
That said, there's a naming conflict with the current Hugging Face org name. Could I kindly ask you to rename it to avoid confusion with the official llm-d project?
@chcost @smarterclayton @vitabortnikov @elevran

@pancak3
Contributor Author

pancak3 commented Sep 17, 2025

@pancak3 Thanks a lot for setting this up and for preparing the dataset + README — really helpful! That said, there's a naming conflict with the current Hugging Face org name. Could I kindly ask you to rename it to avoid confusion with the official llm-d project? @chcost @smarterclayton @vitabortnikov @elevran

Sure, I am waiting for Maya's team's decision on whether to merge the use of the remote dataset into this feature. Once that is decided, we will know whether to keep the HF org I created: if not, I will delete it; otherwise, the dataset repo may need to be created in the official llm-d HF org. We will see.

May I have the link to the official HF org? I failed to find one, which is why I created llm-d to reserve the name. Let me know if I should transfer ownership instead. Thanks.

@pancak3
Contributor Author

pancak3 commented Sep 17, 2025

Thanks for clarifying. At the moment there isn’t an official Hugging Face org for llm-d, but to avoid confusion with the project’s official name we kindly ask that you rename the organization you created.

Deleted: https://huggingface.co/organizations/llm-d
IMO, it's better to create the HF org under the community's control ASAP; it can be difficult to regain access if someone else claims the name.

Signed-off-by: Qifan Deng <[email protected]>
@pancak3
Contributor Author

pancak3 commented Sep 23, 2025

@mayabar please let me know how you'd like to proceed with this PR when you have a moment. Thanks!

@mayabar
Collaborator

mayabar commented Sep 25, 2025

@pancak3 Hi Qifan, sorry for the slow responsiveness, we are in a holiday period ;) Going to review your changes.

@pancak3
Contributor Author

pancak3 commented Sep 25, 2025

@pancak3 Hi Qifan, sorry for the slow responsiveness, we are in a holiday period ;) Going to review your changes.

Sorry, I did not know. No worries! Let's discuss this after your holiday. Enjoy! 🏖️

@mayabar
Collaborator

mayabar commented Sep 25, 2025

I'm working today; we can discuss any open issues.

Signed-off-by: Qifan Deng <[email protected]>
@mayabar
Collaborator

mayabar commented Sep 25, 2025

@pancak3 I just noticed that I forgot to finish the review and all my comments had been in a pending state for the last 3 days 🤦‍♀️

@pancak3
Contributor Author

pancak3 commented Sep 25, 2025

@pancak3 I just noticed that I forgot to finish the review and all my comments had been in a pending state for the last 3 days 🤦‍♀️

It happens. :) This PR is big, so let me know when you've finished the review, and then I'll start addressing the comments. Thanks!

@mayabar
Collaborator

mayabar commented Sep 25, 2025

I had pending comments and submitted them about an hour ago.

@pancak3 pancak3 requested a review from mayabar September 26, 2025 09:59
@pancak3
Contributor Author

pancak3 commented Sep 26, 2025

Hi Maya, I have resolved the comments above. I also added a dataset-in-memory config option to switch whether the dataset is loaded into memory. Please review, thanks!

Collaborator

@mayabar mayabar left a comment

@pancak3 thanks for your updates, I added some comments

Signed-off-by: Qifan Deng <[email protected]>
Signed-off-by: Qifan Deng <[email protected]>
@mayabar
Collaborator

mayabar commented Sep 30, 2025

/lgtm
/approve

@github-actions github-actions bot added the lgtm label Sep 30, 2025
@github-actions github-actions bot merged commit 1adeeb3 into llm-d:main Sep 30, 2025
4 checks passed