
Conversation

@pancak3
Contributor

@pancak3 pancak3 commented Sep 13, 2025

/resolve #196

@pancak3
Contributor Author

pancak3 commented Sep 15, 2025

Preview with LibreChat, using 46e5d1e as the backend:

[output screenshot]

@mayabar @irar2, this PR is big. I refactored a lot while implementing the custom dataset. Please don't hesitate to let me know what I should fix or improve.

I also created a Hugging Face organization to host the converted dataset. Let me know which accounts should own it. Thanks!
The org was created to host a prepared SQLite dataset file so the dataset conversion doesn't have to be repeated. I have written a simple README there; please take a look.

You have probably observed that if the prompt does not hit the dataset, the lookup has to go through all the keys, which takes some time and thus affects the desired TTFT if one is configured. However, I have passed this ball to another PR; discussion is needed to decide how to solve it. My initial thought is to run the lookup in a goroutine with the desired TTFT as a timeout: if time runs out, return random text instead. Then again, users who go with custom datasets may consider this lookup time minor. It is complicated : )
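To make the idea concrete, here is a minimal sketch of the timeout approach; `lookupResponse` and `randomText` are hypothetical placeholders, not functions from this PR:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Hypothetical placeholders for the dataset lookup and the existing
// random-text generator; not the actual simulator code.
func lookupResponse(prompt string) string { return "" }
func randomText() string                  { return "randomly generated text" }

// responseWithDeadline runs the dataset lookup in a goroutine bounded by
// the desired TTFT; if the lookup times out, fall back to random text.
func responseWithDeadline(ctx context.Context, prompt string, ttft time.Duration) string {
	ctx, cancel := context.WithTimeout(ctx, ttft)
	defer cancel()

	// Buffered so the goroutine can finish even after a timeout.
	found := make(chan string, 1)
	go func() {
		// Potentially slow: may scan every key when the prompt misses the dataset.
		found <- lookupResponse(prompt)
	}()

	select {
	case resp := <-found:
		return resp
	case <-ctx.Done():
		return randomText()
	}
}

func main() {
	fmt.Println(responseWithDeadline(context.Background(), "hello", 200*time.Millisecond))
}
```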

@pancak3 pancak3 marked this pull request as ready for review September 15, 2025 13:06
@mayabar
Collaborator

mayabar commented Sep 16, 2025

I also created a Hugging Face organization to host the converted dataset. Let me know which accounts should own it. Thanks!
The org was created to host a prepared SQLite dataset file so the dataset conversion doesn't have to be repeated. I have written a simple README there; please take a look.

I passed the question forward; I'll keep you updated.

@mayabar
Collaborator

mayabar commented Sep 16, 2025

Hi Qifan,
We briefly went over this PR and have some concerns about the custom dataset implementation. To make sure we understood the concept correctly, could you explain in a couple of sentences the main idea behind how generated text is selected?
Additionally, do you have a utility for converting the HF dataset into the format you are using? And why not simply store it in a text file and load it into a map in memory?

@pancak3
Contributor Author

pancak3 commented Sep 16, 2025

Hi Qifan, We briefly went over this PR and have some concerns about the custom dataset implementation. To make sure we understood the concept correctly, could you explain in a couple of sentences the main idea behind how generated text is selected? Additionally, do you have a utility for converting the HF dataset into the format you are using? And why not simply store it in a text file and load it into a map in memory?

@mayabar
Hi Maya,

Yes, I expected concerns. Perhaps I should have discussed this before implementing it.

For each request, a hash is generated: over the full messages for chat completions, and over the prompt for text completions. The hash is used to look up responses. If no response is found, or its number of tokens is greater than max tokens, the desired number of tokens is used to look up responses instead. The “desired number of tokens” is generated using the existing code.

If responses are found in the dataset, one is selected from them at random. Otherwise, text is generated randomly using the existing implementation.
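Roughly, the flow looks like this; a sketch only, where `findByHash`, `findByTokenCount`, `generateRandom`, and `desiredTokenCount` are placeholder names, not the actual functions in this PR:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"math/rand"
)

// Hypothetical helpers standing in for the real implementation.
func findByHash(hash string) []string { return nil }
func findByTokenCount(n int) []string { return nil }
func generateRandom(n int) string     { return "randomly generated text" }
func desiredTokenCount() int          { return 42 } // produced by the existing code

// selectResponse hashes the request payload (the full messages for chat,
// the prompt for text completion) and looks responses up by that hash;
// on a miss it falls back to a lookup by the desired number of tokens,
// then to random generation.
func selectResponse(payload string) string {
	sum := sha256.Sum256([]byte(payload))
	candidates := findByHash(hex.EncodeToString(sum[:]))
	if len(candidates) == 0 { // the max-tokens check is elided for brevity
		candidates = findByTokenCount(desiredTokenCount())
	}
	if len(candidates) > 0 {
		return candidates[rand.Intn(len(candidates))]
	}
	return generateRandom(desiredTokenCount())
}

func main() {
	fmt.Println(selectResponse(`[{"role":"user","content":"hello"}]`))
}
```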

And why not simply store it in a text file and load it into a map in memory?

The main purpose of using a prepared dataset is performance, along with some other considerations:

  1. Reduce load time and avoid repeated computation. Hash-based lookup requires every conversation to have a hash. If the hashes are generated when llmd-sim starts, that takes linear time, and the computation is repeated on every start of every instance. That does not seem eco-friendly to me.
  2. Memory capacity considerations. Loading the whole dataset consumes the host machine's memory and may make llmd-sim heavy. Disk, by contrast, is easier to obtain, especially for users who work with giant datasets. It is a trade-off, though; memory is faster.
  3. Decoupling the dataset preparation functionality. Datasets can have very different structures, so keeping the conversion in a separate place seems better for repo management, to me.

do you have a utility for converting the HF dataset into the format you are using?

Yes, the utility is kept with the dataset in the Hugging Face repo. Briefly, it computes a SHA-256 hash over the full messages and saves the result in a SQLite DB. The file is called main.ipynb, and there is a README as well.
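The actual utility is that Python notebook; purely to illustrate the scheme described above, here is a minimal Go sketch with an assumed table layout (the real schema in the HF repo may differ):

```go
package main

import (
	"crypto/sha256"
	"database/sql"
	"encoding/hex"
	"log"

	_ "github.com/mattn/go-sqlite3" // assumed driver choice for this sketch
)

func main() {
	db, err := sql.Open("sqlite3", "dataset.sqlite3")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Assumed layout: one row per (hash, response) pair, with the
	// response token count kept for the fallback lookup.
	if _, err := db.Exec(`CREATE TABLE IF NOT EXISTS responses (
		prompt_hash TEXT, n_tokens INTEGER, response TEXT)`); err != nil {
		log.Fatal(err)
	}

	fullMessages := `[{"role":"user","content":"hello"}]` // one example record
	response := "Hi! How can I help you today?"
	sum := sha256.Sum256([]byte(fullMessages))

	if _, err := db.Exec(`INSERT INTO responses VALUES (?, ?, ?)`,
		hex.EncodeToString(sum[:]), 8, response); err != nil {
		log.Fatal(err)
	}
}
```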


To wrap up, the implementation is mainly for performance. I can add a config option to let users decide the behaviour, e.g. loading the dataset in memory. Let me know what you think.

@nilig
Collaborator

nilig commented Sep 16, 2025

I also created a Hugging Face organization to host the converted dataset. Let me know which accounts should own it. Thanks!
The org was created to host a prepared SQLite dataset file so the dataset conversion doesn't have to be repeated. I have written a simple README there; please take a look.

I passed the question forward; I'll keep you updated.

@pancak3 Thanks a lot for setting this up and for preparing the dataset + README — really helpful!
That said, there's a naming conflict with the current Hugging Face org name. Could I kindly ask you to rename it to avoid confusion with the official llm-d project?
@chcost @smarterclayton @vitabortnikov @elevran

@pancak3
Contributor Author

pancak3 commented Sep 17, 2025

@pancak3 Thanks a lot for setting this up and for preparing the dataset + README — really helpful! That said, there's a naming conflict with the current Hugging Face org name. Could I kindly ask you to rename it to avoid confusion with the official llm-d project? @chcost @smarterclayton @vitabortnikov @elevran

Sure, I am waiting for Maya's team's decision on whether to merge the use of the remote dataset into this feature. Once that is decided, we will know whether to keep the HF org I created: if not, I will delete it; otherwise, the dataset repo may need to be created in the official llm-d HF org. We will see.

May I have the link to the official HF org? I failed to find one, which is why I created llm-d to reserve the name. Let me know if I should transfer ownership instead. Thanks.

@pancak3
Contributor Author

pancak3 commented Sep 17, 2025

Thanks for clarifying. At the moment there isn’t an official Hugging Face org for llm-d, but to avoid confusion with the project’s official name we kindly ask that you rename the organization you created.

Deleted: https://huggingface.co/organizations/llm-d
IMO, it's better to create the HF org under the community's control ASAP; it can be difficult to regain access if someone else claims the name.

Signed-off-by: Qifan Deng <[email protected]>
@pancak3
Contributor Author

pancak3 commented Sep 23, 2025

@mayabar please let me know how you'd like to proceed with this PR when you have a moment. Thanks!

@mayabar
Collaborator

mayabar commented Sep 25, 2025

@pancak3 Hi Qifan, sorry for the slow responsiveness, we are in a holiday period ;) Going to review your changes.

@pancak3
Contributor Author

pancak3 commented Sep 25, 2025

@pancak3 Hi Qifan, sorry for the slow responsiveness, we are in a holiday period ;) Going to review your changes.

Sorry, I did not know. No worries! Let's discuss this after your holiday. Enjoy! 🏖️

@mayabar
Collaborator

mayabar commented Sep 25, 2025

I'm working today; we can discuss any open issues.

Signed-off-by: Qifan Deng <[email protected]>
@mayabar
Collaborator

mayabar commented Sep 25, 2025

@pancak3 I just noticed that I forgot to finish the review and all my comments had been in a pending state for the last 3 days 🤦‍♀️

@pancak3
Contributor Author

pancak3 commented Sep 25, 2025

@pancak3 I just noticed that I forgot to finish the review and all my comments had been in a pending state for the last 3 days 🤦‍♀️

It happens. :) This PR is big, so let me know when you've finished the review, and then I'll start addressing the comments. Thanks!

@mayabar
Collaborator

mayabar commented Sep 25, 2025

I had pending comments and submitted them about an hour ago.

@pancak3 pancak3 requested a review from mayabar September 26, 2025 09:59
@pancak3
Contributor Author

pancak3 commented Sep 26, 2025

Hi Maya, I have resolved the comments above. I also added a dataset-in-memory config option to switch whether the dataset is loaded into memory. Please review, thanks!

Collaborator

@mayabar mayabar left a comment

@pancak3 thanks for your updates, I added some comments

Signed-off-by: Qifan Deng <[email protected]>
Signed-off-by: Qifan Deng <[email protected]>
@mayabar
Collaborator

mayabar commented Sep 30, 2025

/lgtm
/approve

@github-actions github-actions bot added the lgtm label Sep 30, 2025
@github-actions github-actions bot merged commit 1adeeb3 into llm-d:main Sep 30, 2025
4 checks passed