<strong>Figure 3:</strong> Qualitative and quantitative examples of Verbalized Sampling on creative writing, dialogue simulation, and enumerative open-ended QA.
Motivated by the theoretical understanding of mode collapse, we propose Verbalized Sampling (VS).
Through comprehensive experiments across multiple tasks, we demonstrate that this approach significantly improves the diversity-quality trade-off across model families without compromising factual accuracy and safety.
</p>
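The core idea above can be sketched in a few lines. This is a minimal illustration only: the exact prompt wording, the parsing of model output, and the candidate responses below are assumptions for exposition, not the paper's verbatim template or real model output.

```python
import random

def verbalized_sampling_prompt(task: str, k: int = 5) -> str:
    # Hypothetical VS-style instruction: ask the model for k candidate
    # responses, each tagged with a verbalized probability.
    return (
        f"Generate {k} responses to the task below. "
        "For each response, also state the probability that you would "
        "produce it, so that the probabilities sum to 1.\n"
        f"Task: {task}"
    )

def sample_from_verbalized(candidates):
    # candidates: list of (response, probability) pairs parsed from the
    # model's output; draw one response according to those probabilities.
    responses, probs = zip(*candidates)
    return random.choices(responses, weights=probs, k=1)[0]

# Toy candidates (invented placeholders, not real model output).
cands = [
    ("Once upon a time...", 0.40),
    ("The rain fell sideways...", 0.35),
    ("In the year 2050...", 0.25),
]
print(verbalized_sampling_prompt("Write a story opening"))
print(sample_from_verbalized(cands))
```

Sampling from the verbalized distribution, rather than taking the model's single most likely answer, is what recovers diversity lost to mode collapse.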
<p>
<ul className="list-disc pl-6 space-y-2">
  <li>
    For <strong>story writing</strong>, VS improves output diversity.
  </li>
  <li>
    For <strong>dialogue simulation</strong>, VS matches the human donation-amount distribution much more closely and generates more realistic persuasion behaviors.
  </li>
  <li>
    For <strong>enumerative open-ended QA</strong>, we ask the model to "generate US states".
    The verbalized probability distribution generated by VS, averaged over 10 trials, closely aligns with the reference pretraining distribution queried from RedPajama (KL=0.12).
    In contrast, direct prompting collapses into a few modes, repeatedly outputting states like California and Texas.
  </li>
</ul>
We also observe an <strong>emergent trend</strong>: larger models benefit more from VS. Figure 5 shows the diversity gain over direct prompting, which suffers from mode collapse.
Across all VS variants, larger models (GPT-4.1, Gemini-2.5-Pro) achieve diversity gains 1.5 to 2 times greater than smaller models (GPT-4.1-Mini, Gemini-2.5-Flash).
</p>
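The reported KL=0.12 measures how closely the VS-verbalized distribution over state names tracks the reference distribution estimated from pretraining-corpus counts. A minimal sketch of that comparison, where the two distributions below are invented placeholders rather than the paper's measured values:

```python
import math

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) over a shared support; eps guards against zero mass in q.
    return sum(pi * math.log(pi / max(q.get(s, 0.0), eps))
               for s, pi in p.items() if pi > 0)

# Invented placeholder distributions over a few US states.
reference  = {"California": 0.30, "Texas": 0.25, "New York": 0.25, "Ohio": 0.20}
verbalized = {"California": 0.28, "Texas": 0.27, "New York": 0.24, "Ohio": 0.21}

print(round(kl_divergence(verbalized, reference), 4))
```

A lower KL means the verbalized distribution sits closer to the pretraining reference; a mode-collapsed model concentrating mass on a few states would score much higher.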
</div>