How to train a model online with TRL (without pre-existing dataset)? #2822

SepehrDehdashtian · 2025-02-10T18:08:23Z

SepehrDehdashtian
Feb 10, 2025

Hello,
I’m trying to fine-tune a model (e.g., DeepSeek-R1-Distill-Llama-8B) in an online RL style, without having a static dataset beforehand. Instead, I want to generate prompts on the fly, sample responses from my policy, compute a reward at the end of each response, and update the policy accordingly.

I am new to RL fine-tuning of LLMs, so I might be missing something obvious. Any guidance or clarification would be greatly appreciated!

Problem Setup

Each episode starts with a template prompt.
At each step within an episode, the context from previous steps is added to the next step’s prompt.
The model receives a reward only after completing the full response at the end of each step.

A pseudo-code for illustration

for episode in range(num_episodes):
    context = "..."  # Initial template prompt
    
    for step in range(episode_length):
        # 1. Generate a prompt (adding previous context)
        prompt_text = context  
        query_ids = tokenizer.encode(prompt_text, return_tensors="pt")

        # 2. Generate a response from the current policy
        response_ids = ppo_trainer.generate(query_ids, ...)[0]

        response_text = tokenizer.decode(response_ids)

        # 3. Update context with the new response
        context += "\n" + response_text  

        # 4. Compute reward
         reward = env_step(response_text)  # returns something like 0.91
            
        # 5. Try to run a PPO step
        stats = ppo_trainer.step([query_ids[0]], [response_ids], [reward])

My Questions

Is there a recommended way to do RL in an “online/offline loop,” where the data is generated at each iteration, previous steps affect future prompts, and the reward is given only after completing the full response?
Does TRL have an example that demonstrates on-the-fly prompt generation -> step-wise response accumulation -> final reward -> single-step PPO updates?
What are the best workarounds if ppo_trainer requires a dataset? Should I maintain a buffer of generated queries/responses and then construct a dataset dynamically?

Since I’m still learning about RL fine-tuning for LLMs, I’d appreciate any resources or explanations that clarify the best approach to achieve this in TRL.

Thanks in advance.

eryawww · 2025-04-07T10:36:38Z

eryawww
Apr 7, 2025

I'm experiencing a similar issue. As far as I know, the ppo_trainer.step() function is available in version 0.11. Starting from version 0.12, the only entry point seems to be the ppo_trainer.train() which is an offline style. I'm not sure why everything was grouped into one large function.

A possible solution might be to use the older version 0.11. I hope someone has a better solution.

EDIT:
I think the discussion in this issues will answer some question #3270

0 replies

gogo2464 · 2025-05-23T00:42:58Z

gogo2464
May 23, 2025

Hi @SepehrDehdashtian , I understand your wish of cli unix philo (no wget in the raw cli), and I have a great news for you!!

I developped a PR to support local dataset at https://github.com/huggingface/trl/pull/3470/files . It has not been merged yet. But in a few lines I got to test successfully on my how hard disk a chatbot training. It took 9 seconds for 3 prompt training in json.

It currently only supports sft. Do not hesistate to clone the specific branch if you can not wait for the local dataset.

Best regards.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

How to train a model online with TRL (without pre-existing dataset)? #2822

Uh oh!

{{title}}

Uh oh!

Replies: 2 comments

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

How to train a model online with TRL (without pre-existing dataset)? #2822

Uh oh!

SepehrDehdashtian Feb 10, 2025

Problem Setup

A pseudo-code for illustration

My Questions

Replies: 2 comments

Uh oh!

Uh oh!

eryawww Apr 7, 2025

Uh oh!

gogo2464 May 23, 2025

SepehrDehdashtian
Feb 10, 2025

eryawww
Apr 7, 2025

gogo2464
May 23, 2025