[Distributed] Implement universal batch_decode & decode_in_flight for llama2 & llama3, with deterministic or multinomial (topk) decoding (handle both sentencepiece (llama2) and tiktoken (llama3)) #1234
Conversation
✅ No failures as of commit 1e2eec0 with merge base 7ad9ba2. See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchchat/1234. Note: links to docs will display an error until the docs builds have completed. (This comment was automatically generated by Dr. CI and updates every 15 minutes.)
Rebase to fix failing torchao_experimental check
kwen2501 left a comment
Thanks for the new feature! LGTM. Just some minor comments.
```diff
 ) -> torch.Tensor:
     """
-    Decode the next token for each prompt in the batch.
+    Decode the next token for each prompt in the batch. Adds temperature option for non-deterministic decoding.
```
I wonder if torchchat's generate also has the temperature option? Shall we think about how to connect with generate in the next steps?
```python
if temperature != 1.0:
    next_token_logits = next_token_logits / temperature
```
nit: can we do the division unconditionally?
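The suggestion in code form, using the variable names from the diff:

```python
# Dividing unconditionally is a no-op when temperature == 1.0, so the branch can be dropped.
next_token_logits = next_token_logits / temperature
```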
```python
return next_token

batch_size, seq_len, vocab_size = output.shape

if step != -1:
```
nit: can step == -1 be represented by pos = [] or pos = [0, 0, ...]? (saving one argument)
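A small sketch of what that could look like; the helper name and the empty-list convention here are hypothetical, purely to illustrate the suggestion:

```python
import torch

def _select_next_token_logits(output: torch.Tensor, pos: list[int]) -> torch.Tensor:
    """Hypothetical variant: an empty `pos` plays the role of `step != -1`."""
    batch_size = output.shape[0]
    if not pos:
        # decode phase: output has seq_len == 1, so position 0 is the only choice
        return output[:, 0, :]
    # prefill phase: gather the logits at each prompt's last token
    idx = torch.tensor(pos, device=output.device) - 1
    return output[torch.arange(batch_size, device=output.device), idx]
```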
```python
top_k = min(topk, vocab_size)  # Ensure top-k is not greater than vocab size
top_k_logits, top_k_indices = torch.topk(next_token_logits, k=top_k, dim=-1)
probs = torch.softmax(top_k_logits, dim=-1)
next_token_indices = torch.multinomial(probs, num_samples=1).squeeze(-1)
next_tokens = top_k_indices.gather(
    -1, next_token_indices.unsqueeze(-1)
).squeeze(-1)
```
nit: do you mind adding more comments here for the multinomial, gather, squeeze and unsqueeze ops?
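In the spirit of this nit, a commented restatement of the sampling block above (variable names taken from the diff; illustrative rather than the PR's final code):

```python
top_k = min(topk, vocab_size)  # never request more candidates than the vocabulary holds
# values and vocabulary ids of the k largest logits per row: both (batch_size, top_k)
top_k_logits, top_k_indices = torch.topk(next_token_logits, k=top_k, dim=-1)
probs = torch.softmax(top_k_logits, dim=-1)  # renormalize over the k candidates only
# multinomial draws one sample per row -> (batch_size, 1); squeeze(-1) -> (batch_size,)
next_token_indices = torch.multinomial(probs, num_samples=1).squeeze(-1)
# the samples are positions *within* the top-k list, so map them back to vocabulary ids;
# gather needs an index of shape (batch_size, 1), hence the unsqueeze/squeeze pair
next_tokens = top_k_indices.gather(-1, next_token_indices.unsqueeze(-1)).squeeze(-1)
```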
```python
# New token generated each iteration
# need a row dimension for each prompt in the batch
new_token = torch.zeros(batch_size, 1, device=device, dtype=torch.int64)
logger.info(f"{color.green}{new_token.shape=}, {new_token=}{color.reset}")
```
nit: for debugging only?
```diff
 if not args.disable_in_flight_decode:
-    decode_in_flight(new_token)
+    _decode_in_flight(new_token, tokenizer, tp_rank)
```
nit: put tp_rank into the if condition?
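A sketch of the suggested call site; assuming rank 0 is the rank meant to print, which is an assumption rather than the PR's code:

```python
# Hypothetical rearrangement: gate on the rank in the condition itself so the
# helper does not need to re-check tp_rank internally.
if not args.disable_in_flight_decode and tp_rank == 0:
    _decode_in_flight(new_token, tokenizer, tp_rank)
```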
```python
# Decode the output
if pp_rank == last_pp_rank:
    new_token = _batch_decode_next_tokens(output, 0)
    # logger.info(f"{color.red}Decoding...{output.shape=}{color.reset}")
```
nit: remove log?
```python
# Decode the output as comprehension instead of loop
responses = [tokenizer.decode(sequence) for sequence in res_list]
```
Hmm, did the previous code not work in case of variable length? Just curious.
response = tokenizer.decode(res_list)
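A hedged sketch of the contrast, assuming the SentencePiece wrapper accepts a list of token-id lists while tiktoken's decode expects a flat list of ints:

```python
# Previous approach: one call over the whole batch; works with SentencePiece,
# which accepts a list of token-id lists.
# responses = tokenizer.decode(res_list)

# Universal approach: decode each sequence separately; also satisfies tiktoken's
# decode(list[int]) signature and naturally handles rows of different lengths.
responses = [tokenizer.decode(sequence) for sequence in res_list]
```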
```python
    next_token_logits = output[:, 0, :]
else:
    # get the logits for each prompt at the specified positions
    next_token_logits = output[torch.arange(batch_size), torch.tensor(pos) - 1]
```
Hmm, why "-1"?
From this function's perspective, if the caller has given the position, should it just faithfully decode that position?
(I understand this works correctly as long as the callsite provides prompt_length rather than prompt_length - 1.)
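For reference, a small sketch of the convention the "-1" implements, assuming pos carries prompt lengths (the helper name here is hypothetical):

```python
import torch

# For a prompt of length L, output[b, L-1, :] holds the logits that predict token L
# (the first generated token). Whether the caller passes L and the helper subtracts 1,
# or the caller passes L-1 directly, is purely a convention choice.
def logits_at_last_prompt_token(output: torch.Tensor, prompt_lengths: list[int]) -> torch.Tensor:
    batch_size = output.shape[0]
    idx = torch.tensor(prompt_lengths, device=output.device) - 1
    return output[torch.arange(batch_size, device=output.device), idx]
```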
This PR:
1 - updates batch_decode_next_tokens so it handles both llama2 and llama3 with their respective tokenizers (sentencepiece and tiktoken), giving us a single universal decoding path for in-flight decoding.
2 - adds a temperature option to enable non-deterministic (creative) decoding, using top-k and multinomial selection (see the sketch after this list).
3 - updates decode_in_flight, again to be compatible with both llama2 and llama3.
4 - minor tweaks: uses zip for the final display and renames decode_in_flight to _decode_in_flight, moving it alongside the other module-level functions for ease of reference.
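A minimal, self-contained sketch of the decoding flow described above. The function name, signature, defaults, and greedy fallback here are illustrative assumptions rather than the exact API added by this PR: greedy argmax when no topk is given, otherwise temperature-scaled top-k multinomial sampling over the last prompt position of each sequence.

```python
import torch

def batch_decode_next_tokens_sketch(
    output: torch.Tensor,          # (batch, seq_len, vocab) logits from the model
    pos: list[int] | None = None,  # per-prompt lengths during prefill; None during decode
    temperature: float = 1.0,
    topk: int | None = None,       # None => deterministic (greedy) decoding
) -> torch.Tensor:
    batch_size, seq_len, vocab_size = output.shape
    if pos is None:
        logits = output[:, 0, :]   # decode phase: seq_len == 1
    else:
        # prefill phase: logits at each prompt's last token predict the first new token
        idx = torch.tensor(pos, device=output.device) - 1
        logits = output[torch.arange(batch_size, device=output.device), idx]
    logits = logits / temperature  # no-op when temperature == 1.0
    if topk is None or topk <= 1:
        return torch.argmax(logits, dim=-1)            # deterministic path
    k = min(topk, vocab_size)
    top_logits, top_idx = torch.topk(logits, k=k, dim=-1)
    probs = torch.softmax(top_logits, dim=-1)
    sampled = torch.multinomial(probs, num_samples=1)  # positions within the top-k list
    return top_idx.gather(-1, sampled).squeeze(-1)     # map back to vocabulary ids
```

For example, with a prefill batch of two prompts of lengths 5 and 7, this would be called as batch_decode_next_tokens_sketch(output, pos=[5, 7], temperature=0.8, topk=10).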
Tested with both llama2 and llama3:
Example multi-prompt run with llama2: