Commit ebcd34b
Update on "Fix CUDA out of memory issue for eager runner"

This PR updates the eager runner to disable grad and reduce memory usage. It also updates the prompt format to not include the BOS token. Differential Revision: [D65962743](https://our.internmc.facebook.com/intern/diff/D65962743/) [ghstack-poisoned]
2 parents 3d89512 + 7106f4b commit ebcd34b

File tree: 1 file changed (+1, -1 lines)
examples/models/llama/runner/generation.py

Lines changed: 1 addition & 1 deletion

```diff
@@ -199,7 +199,7 @@ def chat_completion(
             temperature=temperature,
             top_p=top_p,
             echo=True,
-            pos_base=len(tokens),
+            pos_base=len(tokens) - 1 if len(tokens) > 0 else 0
         )
         tokens.extend(new_tokens)
         prompt = input("Me: ")
```
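The one-line change above can be sketched in isolation. This is a hypothetical helper (not part of the actual runner) illustrating the position arithmetic: on the first turn the token history is empty and generation starts at position 0, while on later turns it resumes at the index of the last existing token rather than one past it.

```python
# Illustrative sketch of the pos_base expression from the diff; the helper
# name `next_pos_base` is an assumption, not an identifier from the runner.
def next_pos_base(tokens):
    """Return the starting position for the next generation step."""
    return len(tokens) - 1 if len(tokens) > 0 else 0

print(next_pos_base([]))         # empty history: start at position 0
print(next_pos_base([1, 2, 3]))  # resume at the last token's index: 2
```

Under this reading, the old `pos_base=len(tokens)` would have pointed one position past the existing history on multi-turn calls.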
