Toolformer: Language Models Can Teach Themselves to Use Tools

Timo Schick  Jane Dwivedi-Yu  Roberto Dessì†  Roberta Raileanu
Maria Lomeli  Luke Zettlemoyer  Nicola Cancedda  Thomas Scialom

Meta AI Research  †Universitat Pompeu Fabra
Abstract

Language models (LMs) exhibit remarkable abilities to solve new tasks from just a few examples or textual instructions, especially at scale. They also, paradoxically, struggle with basic functionality, such as arithmetic or factual lookup, where much simpler and smaller models excel. In this paper, we show that LMs can teach themselves to use external tools via simple APIs and achieve the best of both worlds. We introduce Toolformer, a model trained to decide which APIs to call, when to call them, what arguments to pass, and how to best incorporate the results into future token prediction. This is done in a self-supervised way, requiring nothing more than a handful of demonstrations for each API. We incorporate a range of tools, including a calculator, a Q&A system, a search engine, a translation system, and a calendar. Toolformer achieves substantially improved zero-shot performance across a variety of downstream tasks, often competitive with much larger models, without sacrificing its core language modeling abilities.
1 Introduction
Large language models achieve impressive zero- and few-shot results on a variety of natural language processing tasks (Brown et al., 2020; Chowdhery et al., 2022, i.a.) and show several emergent capabilities (Wei et al., 2022). However, all of these models have several inherent limitations that can at best be partially addressed by further scaling. These limitations include an inability to access up-to-date information on recent events (Komeili et al., 2022) and the related tendency to hallucinate facts (Maynez et al., 2020; Ji et al., 2022), difficulties in understanding low-resource languages (Lin et al., 2021), a lack of mathematical skills to perform precise calculations (Patel et al., 2021) and an unawareness of the progression of time (Dhingra et al., 2022).
The New England Journal of Medicine is a registered trademark of [QA("Who is the publisher of The New England Journal of Medicine?") → Massachusetts Medical Society] the MMS.

Out of 1400 participants, 400 (or [Calculator(400 / 1400) → 0.29] 29%) passed the test.

The name derives from "la tortuga", the Spanish word for [MT("tortuga") → turtle] turtle.

The Brown Act is California's law [WikiSearch("Brown Act") → The Ralph M. Brown Act is an act of the California State Legislature that guarantees the public's right to attend and participate in meetings of local legislative bodies.] that requires legislative bodies, like city councils, to hold their meetings open to the public.

Figure 1: Exemplary predictions of Toolformer. The model autonomously decides to call different APIs (from top to bottom: a question answering system, a calculator, a machine translation system, and a Wikipedia search engine) to obtain information that is useful for completing a piece of text.
A simple way to overcome these limitations of today's language models is to give them the ability to use external tools such as search engines, calculators, or calendars. However, existing approaches either rely on large amounts of human annotations (Komeili et al., 2022; Thoppilan et al., 2022) or limit tool use to task-specific settings only (e.g., Gao et al., 2022; Parisi et al., 2022), hindering a more widespread adoption of tool use in LMs.

arXiv:2302.04761v1 [cs.CL] 9 Feb 2023

Figure 2: Key steps in our approach, illustrated for a question answering tool: Given an input text x, we first sample a position i and corresponding API call candidates c_i^1, c_i^2, ..., c_i^k. We then execute these API calls and filter out all calls which do not reduce the loss L_i over the next tokens. All remaining API calls are interleaved with the original text, resulting in a new text x*. (The figure shows the three steps — 1. Sample API Calls, 2. Execute API Calls, 3. Filter API Calls — on the example "Pittsburgh is also known as the Steel City".)

Therefore, we propose Toolformer, a model that learns to use tools in a novel way, which fulfills the following desiderata:

•The use of tools should be learned in a self-supervised way without requiring large amounts of human annotations. This is important not only because of the costs associated with such annotations, but also because what humans find useful may be different from what a model finds useful.
•The LM should not lose any of its generality and should be able to decide for itself when and how to use which tool. In contrast to existing approaches, this enables a much more comprehensive use of tools that is not tied to specific tasks.
Our approach for achieving these goals is based on the recent idea of using large LMs with in-context learning (Brown et al., 2020) to generate entire datasets from scratch (Schick and Schütze, 2021b; Honovich et al., 2022; Wang et al., 2022): Given just a handful of human-written examples of how an API can be used, we let a LM annotate a huge language modeling dataset with potential API calls. We then use a self-supervised loss to determine which of these API calls actually help the model in predicting future tokens. Finally, we finetune the LM itself on the API calls that it considers useful. As illustrated in Figure 1, through this simple approach, LMs can learn to control a variety of tools, and to choose for themselves which tool to use when and how.
As our approach is agnostic of the dataset being used, we can apply it to the exact same dataset that was used to pretrain a model in the first place. This ensures that the model does not lose any of its generality and language modeling abilities. We conduct experiments on a variety of different downstream tasks, demonstrating that after learning to use tools, Toolformer, which is based on a pretrained GPT-J model (Wang and Komatsuzaki, 2021) with 6.7B parameters, achieves much stronger zero-shot results, clearly outperforming a much larger GPT-3 model (Brown et al., 2020) and several other baselines on various tasks.
2 Approach
Our aim is to equip a language model M with the ability to use different tools by means of API calls. We require that inputs and outputs for each API can be represented as text sequences. This allows seamless insertion of API calls into any given text, using special tokens to mark the start and end of each such call.

We represent each API call as a tuple c = (a_c, i_c) where a_c is the name of the API and i_c is the corresponding input. Given an API call c with a corresponding result r, we denote the linearized sequences of the API call not including and including its result, respectively, as:

    e(c) = <API> a_c(i_c) </API>
    e(c, r) = <API> a_c(i_c) → r </API>

where "<API>", "</API>" and "→" are special tokens.¹ Some examples of linearized API calls inserted into text sequences are shown in Figure 1.
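Following footnote 1's token choice, the linearizations e(c) and e(c, r) are easy to sketch. The snippet below is an illustration, not the authors' code; the function name and signature are ours:

```python
from typing import Optional

# Minimal sketch of the linearizations e(c) and e(c, r), using the
# plain-text tokens "[", "]" and "->" from footnote 1 in place of the
# special tokens <API>, </API> and "→".
def linearize(api_name: str, api_input: str, result: Optional[str] = None) -> str:
    call = f"{api_name}({api_input})"
    if result is None:
        return f"[{call}]"          # e(c): call without its result
    return f"[{call} -> {result}]"  # e(c, r): call including its result
```

For example, `linearize("Calculator", "400 / 1400")` yields `[Calculator(400 / 1400)]`, and passing `result="0.29"` yields `[Calculator(400 / 1400) -> 0.29]`, matching the examples in Figure 1.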
Given a dataset C = {x_1, ..., x_{|C|}} of plain texts, we first convert this dataset into a dataset C* augmented with API calls. This is done in three steps, illustrated in Figure 2: First, we exploit the in-context learning ability of M to sample a large number of potential API calls. We then execute these API calls and finally check whether the obtained responses are helpful for predicting future tokens; this is used as a filtering criterion. After filtering, we merge API calls for different tools, resulting in the augmented dataset C*, and finetune

¹In practice, we use the token sequences "[", "]" and "->" to represent "<API>", "</API>" and "→", respectively. This enables our approach to work without modifying the existing LM's vocabulary. For reasons of readability, we still refer to them as "<API>", "</API>" and "→" throughout this section.
Your task is to add calls to a Question Answering API to a
piece of text. The questions should help you get
information required to complete the text. You can call the
API by writing "[QA(question)]" where "question" is the
question you want to ask. Here are some examples of API
calls:
Input: Joe Biden was born in Scranton, Pennsylvania.
Output: Joe Biden was born in [QA("Where was Joe
Biden born?")] Scranton, [QA("In which state is
Scranton?")] Pennsylvania.
Input: Coca-Cola, or Coke, is a carbonated soft drink
manufactured by the Coca-Cola Company.
Output: Coca-Cola, or [QA("What other name is
Coca-Cola known by?")] Coke, is a carbonated soft drink
manufactured by [QA("Who manufactures Coca-Cola?")]
the Coca-Cola Company.
Input: x
Output:

Figure 3: An exemplary prompt P(x) used to generate API calls for the question answering tool.
M itself on this dataset. Each of these steps is described in more detail below.

Sampling API Calls  For each API, we write a prompt P(x) that encourages the LM to annotate an example x = x_1, ..., x_n with API calls. An example of such a prompt for a question answering tool is shown in Figure 3; all prompts used are shown in Appendix A.2. Let p_M(z_{n+1} | z_1, ..., z_n) be the probability that M assigns to token z_{n+1} as a continuation for the sequence z_1, ..., z_n. We first sample up to k candidate positions for doing API calls by computing, for each i ∈ {1, ..., n}, the probability

    p_i = p_M(<API> | P(x), x_{1:i−1})

that M assigns to starting an API call at position i. Given a sampling threshold τ_s, we keep all positions I = {i | p_i > τ_s}; if there are more than k such positions, we only keep the top k.

For each position i ∈ I, we then obtain up to m API calls c_i^1, ..., c_i^m by sampling from M given the sequence [P(x), x_1, ..., x_{i−1}, <API>] as a prefix and </API> as an end-of-sequence token.²

²We discard all examples where M does not generate the </API> token.

Executing API Calls  As a next step, we execute all API calls generated by M to obtain the corresponding results. How this is done depends entirely on the API itself – for example, it can involve calling another neural network, executing a Python script or using a retrieval system to perform search over a large corpus. The response for each API call c_i needs to be a single text sequence r_i.
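These two steps can be sketched as follows. The `lm` interface here is hypothetical (any LM exposing next-token probabilities and sampling would do); only the control flow mirrors the description above:

```python
# Sketch of steps 1 (sampling) and 2 (executing). The `lm` interface is
# an assumption: lm.prob_of(token, prefix) returns p_M(token | prefix),
# and lm.sample(prefix, stop) samples one continuation ending at `stop`.
def sample_api_calls(lm, prompt, tokens, k, m, tau_s):
    """Return {position i: candidate calls c_i^1..c_i^m} for x = x_1..x_n."""
    # Probability of starting an API call ("[" = <API>) at each position i
    probs = {i: lm.prob_of("[", prompt + tokens[:i]) for i in range(len(tokens))}
    # Keep positions I = {i | p_i > tau_s}, at most the top k of them
    positions = sorted((i for i, p in probs.items() if p > tau_s),
                       key=lambda i: -probs[i])[:k]
    # Sample up to m candidate calls per kept position, ending at "]"
    return {i: [lm.sample(prompt + tokens[:i] + ["["], stop="]")
                for _ in range(m)]
            for i in positions}

def execute_call(call, tools):
    """Dispatch a sampled call like 'QA("...")' to the matching tool."""
    name, _, arg = call.partition("(")
    return tools[name](arg.rstrip(")"))  # the response r_i: one text string
```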
Filtering API Calls  Let i be the position of the API call c_i in the sequence x = x_1, ..., x_n, and let r_i be the response from the API. Further, given a sequence (w_i | i ∈ ℕ) of weights, let

    L_i(z) = −Σ_{j=i}^{n} w_{j−i} · log p_M(x_j | z, x_{1:j−1})

be the weighted cross entropy loss for M over the tokens x_i, ..., x_n if the model is prefixed with z. We compare two different instantiations of this loss:

    L_i^+ = L_i(e(c_i, r_i))
    L_i^− = min(L_i(ε), L_i(e(c_i, ε)))

where ε denotes an empty sequence. The former is the weighted loss over all tokens x_i, ..., x_n if the API call and its result are given to M as a prefix;³ the latter is the minimum of the losses obtained from (i) doing no API call at all and (ii) doing an API call, but not providing the response. Intuitively, an API call is helpful to M if providing it with both the input and the output of this call makes it easier for the model to predict future tokens, compared to not receiving the API call at all, or receiving only its input. Given a filtering threshold τ_f, we thus only keep API calls for which

    L_i^− − L_i^+ ≥ τ_f

holds, i.e., adding the API call and its result reduces the loss by at least τ_f, compared to not doing any API call or obtaining no result from it.
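A sketch of this criterion, with a hypothetical `lm.log_prob` interface and the weighting function w passed in as a callable (its concrete form is given in Section 4.1):

```python
# Sketch of the filtering criterion L_i^- − L_i^+ ≥ tau_f. The
# lm.log_prob(token, prefix) interface is an assumption; weights(t)
# implements w_t.
def weighted_loss(lm, prefix, tokens, i, weights):
    """L_i(z): weighted cross-entropy over x_i..x_n given prefix z."""
    return -sum(weights(j - i) * lm.log_prob(tokens[j], prefix + tokens[:j])
                for j in range(i, len(tokens)))

def keep_call(lm, tokens, i, call, result, weights, tau_f):
    """Keep c_i iff adding the call and its result reduces the loss by tau_f."""
    loss_plus = weighted_loss(lm, [f"[{call} -> {result}]"], tokens, i, weights)
    loss_minus = min(weighted_loss(lm, [], tokens, i, weights),             # no call
                     weighted_loss(lm, [f"[{call}]"], tokens, i, weights))  # no result
    return loss_minus - loss_plus >= tau_f
```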
Model Finetuning  After sampling and filtering calls for all APIs, we finally merge the remaining API calls and interleave them with the original inputs. That is, for an input text x = x_1, ..., x_n with a corresponding API call and result (c_i, r_i) at position i, we construct the new sequence x* = x_{1:i−1}, e(c_i, r_i), x_{i:n}; we proceed analogously for texts with multiple API calls. Doing this for all x ∈ C results in the new dataset C* augmented with API calls. We use this new dataset to finetune M, using a standard language modeling objective. Crucially, apart from inserted API calls the augmented dataset C* contains the exact same texts as C, the original dataset. As a consequence, finetuning M on C* exposes it to the same content as finetuning on C. Moreover, as API calls are inserted in exactly those positions and with exactly those inputs that help M predict future tokens, finetuning on C* enables the language model to decide when and how to use which tool, based purely on its own feedback.

³We provide e(c_i, r_i) as a prefix instead of inserting it at position i because M is not yet finetuned on any examples containing API calls, so inserting it in the middle of x would interrupt the flow and not align with patterns in the pretraining corpus, thus hurting perplexity.
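The interleaving step amounts to a simple insertion; the sketch below (illustrative code, operating on already-linearized call strings) constructs x* from x and the kept calls:

```python
# Sketch of constructing x* = x_{1:i-1}, e(c_i, r_i), x_{i:n}: each kept
# call is inserted immediately before the position it was sampled at.
def interleave(tokens, kept_calls):
    """kept_calls maps a position i to the linearized string e(c_i, r_i)."""
    out = []
    for j, tok in enumerate(tokens):
        if j in kept_calls:
            out.append(kept_calls[j])  # insert e(c_i, r_i) before x_i
        out.append(tok)
    return out
```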
Inference  When generating text with M after finetuning with our approach, we perform regular decoding until M produces the "→" token, indicating that it next expects the response for an API call. At this point, we interrupt the decoding process, call the appropriate API to get a response, and continue the decoding process after inserting both the response and the </API> token.
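This decoding loop can be sketched as follows; `lm.next_token` is a hypothetical greedy-decoding step, and "[", "]", "->" again stand in for <API>, </API> and "→" as in footnote 1:

```python
# Sketch of inference: decode greedily until the model emits "->" (the
# "→" token), pause, execute the pending call, insert the response and
# "]", then resume decoding. The lm.next_token(seq) interface is assumed.
def generate(lm, tools, prompt_tokens, max_steps=50):
    seq = list(prompt_tokens)
    for _ in range(max_steps):
        tok = lm.next_token(seq)
        seq.append(tok)
        if tok == "->":  # model now expects the API response
            start = len(seq) - 1 - seq[::-1].index("[")  # most recent "["
            call = "".join(seq[start + 1:-1])
            name, _, arg = call.partition("(")
            seq += [tools[name](arg.rstrip(")")), "]"]   # insert r, then "]"
        elif tok == "<eos>":
            break
    return seq
```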
3 Tools
We explore a variety of tools to address different shortcomings of regular LMs. The only constraints we impose on these tools are that (i) both their inputs and outputs can be represented as text sequences, and (ii) we can obtain a few demonstrations of their intended use. Concretely, we explore the following five tools: a question answering system, a Wikipedia search engine, a calculator, a calendar, and a machine translation system. Some examples of potential calls and return strings for the APIs associated with each of these tools are shown in Table 1. We briefly discuss all tools below; further details can be found in Appendix A.
Question Answering  Our first tool is a question answering system based on another LM that can answer simple factoid questions. Specifically, we use Atlas (Izacard et al., 2022), a retrieval-augmented LM finetuned on Natural Questions (Kwiatkowski et al., 2019).
Calculator As a second tool, we use a calculator
that can perform simple numeric calculations; we
only support the four basic arithmetic operations.
Results are always rounded to two decimal places.
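Such a calculator is easy to sketch. The implementation below is an assumption, not the authors' code; the paper only specifies the four basic operations and two-decimal rounding:

```python
import ast
import operator

# Sketch of the calculator tool: evaluates only +, -, *, / (via the AST,
# so arbitrary code cannot run) and rounds results to two decimal places.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def calculator(expression: str) -> str:
    def ev(node):
        if isinstance(node, ast.Constant):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -ev(node.operand)
        raise ValueError("only +, -, *, / are supported")
    result = ev(ast.parse(expression, mode="eval").body)
    return str(round(result, 2))
```

For instance, `calculator("400 / 1400")` returns `"0.29"` as in Figure 1, and `calculator("27 + 4 * 2")` returns `"35"` as in Table 1.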
Wikipedia Search  Our third tool is a search engine that, given a search term, returns short text snippets from Wikipedia. Compared to our question answering tool, this search enables a model to get more comprehensive information on a subject, but requires it to extract the relevant parts by itself. As our search engine, we use a BM25 retriever (Robertson et al., 1995; Baeza-Yates et al., 1999) that indexes the Wikipedia dump from KILT (Petroni et al., 2021).
Machine Translation System  Our fourth tool is a machine translation system based on a LM that can translate a phrase from any language into English. More concretely, we use the 600M parameter NLLB (Costa-jussà et al., 2022) as our multilingual machine translation model that works for 200 languages (including low-resource ones). The source language is automatically detected using the fastText classifier (Joulin et al., 2016), while the target language is always set to English.
Calendar  Our final tool is a calendar API that, when queried, returns the current date without taking any input. This provides temporal context for predictions that require some awareness of time.
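For illustration, a minimal calendar tool (an assumed implementation) returning the date in the format shown in Table 1:

```python
import datetime

# Minimal sketch of the calendar tool: takes no input and returns the
# current date, e.g. "Today is Monday, January 30, 2023." (Table 1).
def calendar() -> str:
    t = datetime.date.today()
    return f"Today is {t.strftime('%A')}, {t.strftime('%B')} {t.day}, {t.year}."
```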
4 Experiments
We investigate whether our approach enables a model to use tools without any further supervision and to decide for itself when and how to call which of the available tools. To test this, we select a variety of downstream tasks where we assume at least one of the considered tools to be useful, and evaluate performance in zero-shot settings (Section 4.2). Beyond that, we also ensure that our approach does not hurt the model's core language modeling abilities; we verify this by looking at perplexity on two language modeling datasets (Section 4.3). Finally, we investigate how the ability to learn to use tools is affected by model size (Section 4.4).
4.1 Experimental Setup
Dataset Generation  Throughout all of our experiments, we use a subset of CCNet (Wenzek et al., 2020) as our language modeling dataset C and GPT-J (Wang and Komatsuzaki, 2021) as our language model M. To reduce the computational cost of annotating C with API calls, we define heuristics for some APIs to get a subset of C for which API calls are more likely to be helpful than for an average text. For example, we only consider texts for the calculator tool if they contain at least three numbers. Details of the heuristics used are given in
API Name            | Example Input                              | Example Output
Question Answering  | Where was the Knights of Columbus founded? | New Haven, Connecticut
Wikipedia Search    | Fishing Reel Types                         | Spin fishing > Spin fishing is distinguished between fly fishing and bait cast fishing by the type of rod and reel used. There are two types of reels used when spin fishing, the open faced reel and the closed faced reel.
Calculator          | 27 + 4 * 2                                 | 35
Calendar            | ε                                          | Today is Monday, January 30, 2023.
Machine Translation | sûreté nucléaire                           | nuclear safety

Table 1: Examples of inputs and outputs for all APIs used.
                      Number of Examples
API                 | τ_f = 0.5 | τ_f = 1.0 | τ_f = 2.0
Question Answering  |    51,987 |    18,526 |     5,135
Wikipedia Search    |   207,241 |    60,974 |    13,944
Calculator          |     3,680 |       994 |       138
Calendar            |    61,811 |    20,587 |     3,007
Machine Translation |     3,156 |     1,034 |       229

Table 2: Number of examples with API calls in C* for different values of our filtering threshold τ_f.
Appendix A. For obtaining C* from C, we perform all steps described in Section 2 and additionally filter out all examples for which all API calls were eliminated in the filtering step.⁴ For the weighting function, we use

    w_t = w̃_t / Σ_{s∈ℕ} w̃_s   with   w̃_t = max(0, 1 − 0.2·t)

to make sure that API calls happen close to where the information provided by the API is actually helpful for the model. The thresholds τ_s and τ_f are chosen individually for each tool to ensure a sufficiently large number of examples; see Appendix A for details. Table 2 shows relevant statistics of our final dataset augmented with API calls.
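Concretely, w̃_t is nonzero only for t = 0..4, so the normalizer Σ_{s∈ℕ} w̃_s reduces to the finite sum 1 + 0.8 + 0.6 + 0.4 + 0.2 = 3. A sketch:

```python
# Sketch of the weighting function: w_t = w̃_t / Σ_s w̃_s with
# w̃_t = max(0, 1 − 0.2·t). Only t = 0..4 contribute, so the sum is 3.
def weight(t: int) -> float:
    raw = max(0.0, 1 - 0.2 * t)
    normalizer = sum(max(0.0, 1 - 0.2 * s) for s in range(5))  # = 3.0
    return raw / normalizer
```

Tokens more than four positions after the API call thus receive zero weight, which is what keeps the filtering criterion local to the call.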
Model Finetuning  We finetune M on C* using a batch size of 128 and a learning rate of 1 × 10⁻⁵ with linear warmup for the first 10% of training. Details of our finetuning procedure are given in Appendix B.
Baseline Models  Throughout the remainder of this section, we mainly compare the following models:

•GPT-J: A regular GPT-J model without any finetuning.
•GPT-J + CC: GPT-J finetuned on C, our subset of CCNet without any API calls.
•Toolformer: GPT-J finetuned on C*, our subset of CCNet augmented with API calls.
•Toolformer (disabled): The same model as Toolformer, but API calls are disabled during decoding.⁵

For most tasks, we additionally compare to OPT (66B) (Zhang et al., 2022) and GPT-3⁶ (175B) (Brown et al., 2020), two models that are about 10 and 25 times larger than our other baseline models, respectively.

⁴While this filtering alters the distribution of training examples, we assume that the remaining examples are close enough to the original distribution so that M's language modeling abilities remain unaffected. This assumption is empirically validated in Section 4.3.
4.2 Downstream Tasks
We evaluate all models on a variety of downstream tasks. In all cases, we consider a prompted zero-shot setup – i.e., models are instructed to solve each task in natural language, but we do not provide any in-context examples. This is in contrast to prior work on tool use (e.g., Gao et al., 2022; Parisi et al., 2022), where models are provided with dataset-specific examples of how a tool can be used to solve a concrete task. We choose the more challenging zero-shot setup as we are interested in seeing whether Toolformer works in precisely those cases where a user does not specify in advance which tools should be used in which way for solving a specific problem.
We use standard greedy decoding, but with one modification for Toolformer: We let the model start an API call not just when <API> is the most likely token, but whenever it is one of the k most likely tokens. For k = 1, this corresponds to regular greedy decoding; we instead use k = 10 to increase the disposition of our model to make use of the APIs that it has access to. At the same time, we allow at most one API call per input to make sure the model does not get stuck in a loop where it constantly calls APIs without producing any actual output. The effect of these modifications is explored in Section 5.

⁵This is achieved by manually setting the probability of the <API> token to 0.

⁶We use the original davinci variant that is not finetuned on any instructions.
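The relaxed trigger can be sketched as follows (illustrative code; `probs` is a hypothetical map from candidate next tokens to their probabilities under M):

```python
# Sketch of the relaxed decoding rule: start an API call whenever "["
# (i.e. <API>) is among the k most likely next tokens, and allow at
# most one call per input. For k = 1 this is plain greedy decoding.
def choose_next_token(probs, k=10, api_already_called=False):
    top_k = sorted(probs, key=probs.get, reverse=True)[:k]
    if "[" in top_k and not api_already_called:
        return "["                        # force the API call to start
    return max(probs, key=probs.get)      # otherwise regular greedy decoding
```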
4.2.1 LAMA
We evaluate our models on the SQuAD, Google-RE and T-REx subsets of the LAMA benchmark (Petroni et al., 2019). For each of these subsets, the task is to complete a short statement with a missing fact (e.g., a date or a place). As LAMA was originally designed to evaluate masked language models (e.g., Devlin et al., 2019), we filter out examples where the mask token is not the final token, so that the remaining examples can be processed in a left-to-right fashion. To account for different tokenizations and added complexity from not informing the model that a single word is required, we use a slightly more lenient evaluation criterion than exact match and simply check whether the correct word is within the first five words predicted by the model. As LAMA is based on statements obtained directly from Wikipedia, we prevent Toolformer from using the Wikipedia Search API to avoid giving it an unfair advantage.
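The lenient criterion can be sketched as follows (illustrative; assumes a single-word gold answer):

```python
# Sketch of the lenient LAMA criterion: a prediction counts as correct
# if the gold word appears among the first five words the model
# generates (case-insensitive for robustness; an assumption of ours).
def lama_correct(prediction: str, gold: str) -> bool:
    return gold.lower() in (w.lower() for w in prediction.split()[:5])
```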
Results for all models can be seen in Table 3. All GPT-J models without tool use achieve similar performance. Crucially, Toolformer clearly outperforms these baseline models, improving upon the best baseline by 11.7, 5.2 and 18.6 points, respectively. It also clearly outperforms OPT (66B) and GPT-3 (175B), despite both models being much larger. This is achieved because the model independently decides to ask the question answering tool for the required information in almost all cases (98.1%); for only very few examples, it uses a different tool (0.7%) or no tool at all (1.2%).
4.2.2 Math Datasets
We test mathematical reasoning abilities on ASDiv (Miao et al., 2020), SVAMP (Patel et al., 2021) and the MAWPS benchmark (Koncel-Kedziorski et al., 2016). We again account for the fact that we test all models in a zero-shot setup by using a more lenient evaluation criterion: As the required output
is always a number, we simply check for the first number predicted by the model.⁷

Model                 | SQuAD | Google-RE | T-REx
GPT-J                 |  17.8 |       4.9 |  31.9
GPT-J + CC            |  19.2 |       5.6 |  33.2
Toolformer (disabled) |  22.1 |       6.3 |  34.9
Toolformer            |  33.8 |      11.5 |  53.5
OPT (66B)             |  21.6 |       2.9 |  30.1
GPT-3 (175B)          |  26.8 |       7.0 |  39.8

Table 3: Results on subsets of LAMA. Toolformer uses the question answering tool for most examples, clearly outperforming all baselines of the same size and achieving results competitive with GPT-3 (175B).

Model                 | ASDiv | SVAMP | MAWPS
GPT-J                 |   7.5 |   5.2 |   9.9
GPT-J + CC            |   9.6 |   5.0 |   9.3
Toolformer (disabled) |  14.8 |   6.3 |  15.0
Toolformer            |  40.4 |  29.4 |  44.0
OPT (66B)             |   6.0 |   4.9 |   7.9
GPT-3 (175B)          |  14.0 |  10.0 |  19.8

Table 4: Results for various benchmarks requiring mathematical reasoning. Toolformer makes use of the calculator tool for most examples, clearly outperforming even OPT (66B) and GPT-3 (175B).
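This criterion, including the equation exception from footnote 7, can be sketched as:

```python
import re
from typing import Optional

# Sketch of the lenient math criterion: take the first number the model
# predicts; if the prediction contains an equation, take the first
# number after the "=" sign instead (footnote 7).
def extract_predicted_number(prediction: str) -> Optional[str]:
    if "=" in prediction:
        prediction = prediction.split("=", 1)[1]
    match = re.search(r"-?\d+(?:\.\d+)?", prediction)
    return match.group(0) if match else None
```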
Table 4 shows results for all benchmarks. While GPT-J and GPT-J + CC perform about the same, Toolformer achieves stronger results even when API calls are disabled. We surmise that this is because the model is finetuned on many examples of API calls and their results, improving its own mathematical capabilities. Nonetheless, allowing the model to make API calls more than doubles performance for all tasks, and also clearly outperforms the much larger OPT and GPT-3 models. This is because across all benchmarks, for 97.9% of all examples the model decides to ask the calculator tool for help.
4.2.3 Question Answering

We look at Web Questions (Berant et al., 2013), Natural Questions (Kwiatkowski et al., 2019) and TriviaQA (Joshi et al., 2017), the three question answering datasets considered by Brown et al. (2020). For evaluation, we check whether the first 20 words predicted by a model contain the correct answer instead of requiring an exact match. For Toolformer, we disable the question answering tool as this would make solving the tasks trivial, especially given that the underlying QA system was finetuned on Natural Questions.

⁷An exception to this is if the model's prediction contains an equation (e.g., "The correct answer is 5+3=8"), in which case we consider the first number after the "=" sign to be its prediction.

Model                 | WebQS |   NQ | TriviaQA
GPT-J                 |  18.5 | 12.8 |     43.9
GPT-J + CC            |  18.4 | 12.2 |     45.6
Toolformer (disabled) |  18.9 | 12.6 |     46.7
Toolformer            |  26.3 | 17.7 |     48.8
OPT (66B)             |  18.6 | 11.4 |     45.7
GPT-3 (175B)          |  29.0 | 22.6 |     65.9

Table 5: Results for various question answering datasets. Using the Wikipedia search tool for most examples, Toolformer clearly outperforms baselines of the same size, but falls short of GPT-3 (175B).
Results are shown in Table 5. Once again, Toolformer clearly outperforms all other models based on GPT-J, this time mostly relying on the Wikipedia search API (99.3%) to find relevant information. However, Toolformer still lags behind the much larger GPT-3 (175B) model. This is likely due to both the simplicity of our search engine (in many cases, it returns results that are clearly not a good match for a given query) and the inability of Toolformer to interact with it, e.g., by reformulating its query if results are not helpful or by browsing through multiple of the top results. We believe that adding this functionality is an exciting direction for future work.
4.2.4 Multilingual Question Answering
We evaluate Toolformer and all baseline models on MLQA (Lewis et al., 2019), a multilingual question-answering benchmark. A context paragraph for each question is provided in English, while the question can be in Arabic, German, Spanish, Hindi, Vietnamese, or Simplified Chinese. In order to solve the task, the model needs to be able to understand both the paragraph and the question, so it may benefit from translating the question into English. Our evaluation metric is the percentage of times the model's generation, capped at 10 words, contains the correct answer.
Results are shown in Table 6. Using API calls consistently improves Toolformer's performance for all languages, suggesting that it has learned to make use of the machine translation tool. Depending on the language, this tool is used for 63.8% to 94.9% of all examples; the only exception to this is Hindi, for which the machine translation tool is used in only 7.3% of cases. However, Toolformer does not consistently outperform vanilla GPT-J. This is mainly because for some languages, finetuning on CCNet deteriorates performance; this might be due to a distribution shift compared to GPT-J's original pretraining data.

Model                 |   Es |   De |   Hi |   Vi |   Zh |   Ar
GPT-J                 | 15.2 | 16.5 |  1.3 |  8.2 | 18.2 |  8.2
GPT-J + CC            | 15.7 | 14.9 |  0.5 |  8.3 | 13.7 |  4.6
Toolformer (disabled) | 19.8 | 11.9 |  1.2 | 10.1 | 15.0 |  3.1
Toolformer            | 20.6 | 13.5 |  1.4 | 10.6 | 16.8 |  3.7
OPT (66B)             |  0.3 |  0.1 |  1.1 |  0.2 |  0.7 |  0.1
GPT-3 (175B)          |  3.4 |  1.1 |  0.1 |  1.7 | 17.7 |  0.1
GPT-J (All En)        | 24.3 | 27.0 | 23.9 | 23.3 | 23.1 | 23.6
GPT-3 (All En)        | 24.7 | 27.2 | 26.1 | 24.9 | 23.6 | 24.0

Table 6: Results on MLQA for Spanish (Es), German (De), Hindi (Hi), Vietnamese (Vi), Chinese (Zh) and Arabic (Ar). While using the machine translation tool to translate questions is helpful across all languages, further pretraining on CCNet deteriorates performance; consequently, Toolformer does not consistently outperform GPT-J. The final two rows correspond to models that are given contexts and questions in English.
OPT and GPT-3 perform surprisingly weakly
across all languages, mostly because they fail to
provide an answer in English despite being in-
structed to do so. A potential reason for GPT-J not
suffering from this problem is that it was trained on
more multilingual data than both OPT and GPT-3,
including the EuroParl corpus (Koehn, 2005; Gao
et al., 2020). As an upper bound, we also evaluate
GPT-J and GPT-3 on a variant of MLQA where
both the context and the question are provided in
English. In this setup, GPT-3 performs better than
all other models, supporting our hypothesis that
its subpar performance on MLQA is due to the
multilingual aspect of the task.
4.2.5 Temporal Datasets
To investigate the calendar API’s utility, we eval-
uate all models on TEMPLAMA (Dhingra et al.,
2022) and a new dataset that we call DATESET.
TEMPLAMA is a dataset built from Wikidata that
contains cloze queries about facts that change with
time (e.g., “Cristiano Ronaldo plays for ___”)
as well as the correct answer for the years be-
tween 2010 and 2020. DATESET, described in
Appendix D, is also generated through a series
of templates, but populated using a combination
of random dates/durations (e.g., “What day of the
week was it 30 days ago?”). Critically, knowing the
current date is required to answer these questions.
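One such template can be instantiated as in the sketch below; the question wording comes from the example above, while the helper name and the weekday-name answer format are our assumptions:

```python
import datetime

def dateset_example(today: datetime.date, days_ago: int) -> tuple[str, str]:
    """Instantiate one DATESET-style template ("What day of the week was it
    N days ago?"). Answering requires knowing the current date `today`;
    returning the weekday name as the answer is an assumption."""
    question = f"What day of the week was it {days_ago} days ago?"
    answer = (today - datetime.timedelta(days=days_ago)).strftime("%A")
    return question, answer
```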
Model                  TEMPLAMA  DATESET
GPT-J                  13.7      3.9
GPT-J + CC             12.9      2.9
Toolformer (disabled)  12.7      5.9
Toolformer             16.3      27.3
OPT (66B)              14.5      1.3
GPT-3 (175B)           15.5      0.8

Table 7: Results for the temporal datasets. Toolformer
outperforms all baselines, but does not make use of the
calendar tool for TEMPLAMA.
For both tasks, we use the same evaluation as for
the original LAMA dataset.
Results shown in Table 7 illustrate that Tool-
former outperforms all baselines for both TEM-
PLAMA and DATESET. However, closer inspec-
tion shows that improvements on TEMPLAMA
cannot be attributed to the calendar tool, which is
only used for 0.2% of all examples, but mostly to
the Wikipedia search and question answering tools,
which Toolformer calls the most. This makes sense
given that named entities in TEMPLAMA are often
so specific and rare that even knowing the exact
date alone would be of little help. The best course
of action for this dataset – first querying the calen-
dar API to get the current date, and then querying
the question answering system with this date – is
not only prohibited by our restriction of using at
most one API call per example, but also hard to
learn for Toolformer given that all API calls in its
training data are sampled independently.
For DATESET, on the other hand, the consider-
able improvement of Toolformer compared to other
models can be fully attributed to the calendar tool,
which it makes use of for 54.8% of all examples.
4.3 Language Modeling
In addition to verifying improved performance on
various downstream tasks, we also want to ensure
that language modeling performance of Toolformer
does not degrade through our finetuning with API
calls. To this end, we evaluate our models on
two language modeling datasets: WikiText (Mer-
ity et al., 2017) and a subset of 10,000 randomly
selected documents from CCNet (Wenzek et al.,
2020) that were not used during training. Perplex-
ities of various models are shown in Table 8. As
one would expect, finetuning on CCNet leads to
slightly improved performance on a different CC-
Net subset, but it slightly deteriorates performance
on WikiText, presumably because the original pre-
training data for GPT-J is more similar to Wiki-
Text than our randomly selected subset of CCNet.

Model                  WikiText  CCNet
GPT-J                  9.9       10.6
GPT-J + CC             10.3      10.5
Toolformer (disabled)  10.3      10.5

Table 8: Perplexities of different models on WikiText
and our validation subset of CCNet. Adding API calls
comes without a cost in terms of perplexity for lan-
guage modeling without any API calls.

Most importantly, however, training on C* (our
dataset annotated with API calls) does not lead to
an increase in perplexity compared to training on
C when API calls are disabled at inference time.8
4.4 Scaling Laws
We investigate how the ability to ask external tools
for help affects performance as we vary the size
of our LM. To this end, we apply our approach
not just to GPT-J, but also to four smaller mod-
els from the GPT-2 family (Radford et al., 2019),
with 124M, 355M, 775M and 1.6B parameters, re-
spectively. We do so using only a subset of three
tools: the question answering system, the calcula-
tor, and the Wikipedia search engine. Apart from
this, we follow the experimental setup described in
Section 4.1.
Figure 4 shows that the ability to leverage the
provided tools only emerges at around 775M pa-
rameters: smaller models achieve similar perfor-
mance both with and without tools. An exception
to this is the Wikipedia search engine used mostly
for QA benchmarks; we hypothesize that this is
because the API is comparably easy to use. While
models become better at solving tasks without API
calls as they grow in size, their ability to make good
use of the provided API improves at the same time.
As a consequence, there remains a large gap be-
tween predictions with and without API calls even
for our biggest model.
5 Analysis
Decoding Strategy We investigate the effect of
our modified decoding strategy introduced in Sec-
tion 4.2, where instead of always generating the
8 We do not evaluate the perplexity of Toolformer with
API calls enabled, as computing the probability pM(xt |
x1, ..., xt−1) of token xt given x1, ..., xt−1 would require
marginalizing over all potential API calls that the model could
make at position t, which is intractable.
Figure 4: Average performance on LAMA, our math benchmarks and our QA benchmarks for GPT-2 models of
different sizes and GPT-J finetuned with our approach, both with and without API calls, plotted against model
parameters (M). While API calls are not helpful to the smallest models, larger models learn how to make good
use of them. Even for bigger models, the gap between model predictions with and without API calls remains high.
most likely token, we generate the <API> token
if it is one of the k most likely tokens. Table 9
shows performance on the T-REx subset of LAMA
and on WebQS for different values of k. As ex-
pected, increasing k leads to the model doing API
calls for more examples – from 40.3% and 8.5%
with k = 1 (i.e., regular greedy decoding) to 98.1%
and 100% for k = 10. While for T-REx, there is
already a clear improvement in performance with
greedy decoding, on WebQS our model only starts
to make a substantial number of API calls as we
slightly increase k. Interestingly, for k = 1 the
model is calibrated to some extent: it decides to
call APIs for examples that it would perform partic-
ularly badly on without making API calls. This can
be seen from the fact that performance on examples
where it decides not to make an API call (44.3 and
19.9) is higher than average performance if no API
calls are made at all (34.9 and 18.9). However, this
calibration is lost for higher values of k.
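This modified decoding rule can be sketched as follows; `API_TOKEN` is a hypothetical vocabulary id for the special <API> token, and selecting the top k via `argsort` is our implementation assumption:

```python
import numpy as np

API_TOKEN = 50257  # hypothetical vocabulary id for the special <API> token

def next_token(logits: np.ndarray, k: int) -> int:
    """Modified greedy decoding: emit <API> whenever it is among the k most
    likely next tokens, otherwise fall back to the regular argmax.
    With k = 1 this reduces to standard greedy decoding."""
    top_k = np.argsort(logits)[-k:]  # indices of the k highest logits
    if API_TOKEN in top_k:
        return API_TOKEN
    return int(np.argmax(logits))
```

Increasing k makes the model start an API call on more examples, matching the trend in Table 9.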
Data Quality We qualitatively analyze some
API calls generated with our approach for different
APIs. Table 10 shows some examples of texts from
CCNet augmented with API calls, as well as the
corresponding score Lᵢ⁻ − Lᵢ⁺ that is used as a fil-
tering criterion, and whether the API calls made by
the model are intuitively useful in the given context.
As can be seen, high values of Lᵢ⁻ − Lᵢ⁺ typically
correspond to useful API calls, whereas low values
correspond to API calls that do not provide any in-
formation that is useful for predicting future tokens.
There are some exceptions, e.g., an API call for
"Fast train success" in the fourth example that does
not give any relevant information but still reduces
perplexity. However, some amount of noise in the
API calls that are not filtered can actually be useful,
as it forces the model finetuned on C* to not always
blindly follow the results of each call it makes.

         T-REx                    WebQS
k    All  AC   NC   %       All  AC   NC   %
0    34.9 –    34.9 0.0     18.9 –    18.9 0.0
1    47.8 53.0 44.3 40.3    19.3 17.1 19.9 8.5
3    52.9 58.0 29.0 82.8    26.3 26.5 6.6  99.3
10   53.5 54.0 22.5 98.1    26.3 26.4 –    100.0

Table 9: Toolformer results on the T-REx subset of
LAMA and on WebQS for different values of k used
during decoding. Numbers shown are overall perfor-
mance (All), performance on the subset where the
model decides to make an API call (AC) and all re-
maining examples (NC), as well as the percentage of
examples for which the model decides to call an API
(%).
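The filtering criterion Lᵢ⁻ − Lᵢ⁺ can be sketched as below. This is a simplified sketch: in the paper, Lᵢ⁻ is a minimum over doing no call and doing the call without its result, and losses are position-weighted; here we show only an unweighted no-call variant, and the threshold `tau` is an assumption:

```python
def sequence_loss(token_logprobs: list[float]) -> float:
    """Negative log-likelihood of the tokens following a candidate API call.
    (The paper additionally applies position-dependent weights, omitted here.)"""
    return -sum(token_logprobs)

def keep_api_call(logprobs_with_result: list[float],
                  logprobs_without_call: list[float],
                  tau: float = 1.0) -> bool:
    """Keep a sampled API call only if prefixing the call and its result
    reduces the loss on future tokens by at least tau: L_minus - L_plus >= tau."""
    l_plus = sequence_loss(logprobs_with_result)    # loss with call + result prefixed
    l_minus = sequence_loss(logprobs_without_call)  # simplified: loss with no call
    return (l_minus - l_plus) >= tau
```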
6 Related Work
Language Model Pretraining There are various
approaches that augment language models with
some form of additional textual information during
pretraining, including various forms of metadata
(Keskar et al., 2019), HTML tags (Aghajanyan
et al., 2021), Wikipedia markup (Schick et al.,
2022), or related texts obtained from an informa-
tion retrieval system (Guu et al., 2020; Borgeaud
et al., 2021; Izacard et al., 2022). For all of these
Example | Lᵢ⁻ − Lᵢ⁺ | Useful

The Flodden Window (a war memorial dedicated to The Middleton Archers), in the Grade I-listed
Church of St Leonard in Middleton is said to be the oldest war memorial in the United King-
dom. <API> WikiSearch(War memorial Flodden) → Battle of Flodden > Commemoration >
The stained-glass Flodden Window in Middleton Parish Church [...] was constructed by Sir
Richard Assheton in memory of the Battle of Flodden and the archers from Middleton who
fought in it. </API> Sir Richard Assheton of Middleton (who built St Leonard) was granted
knighthood [...] | 5.49 | ✓

Note: The WL will be open on Friday, <API> Calendar() → Today is Thursday, March 9, 2017.
</API> March 10, and Sunday, March 19 for regular hours. | 2.11 | ✓

The Nile has an approximate length of <API> QA(What is the approximate length of the Nile?)
→ 6,853 km </API> 6,853 kilometers, the White Nile being its main source. | 2.08 | ✓

If Venus had an atmosphere similar to Earth's then you would expect Venus' mean temperature to be
499 K (1.74 x 287) rather than 735 K which is <API> Calculator(735 / 499) → 1.47 </API> 1.47
(735 / 499) times hotter than it should be. | 1.59 | ✓

You are here: Home / Featured / Catch this fast train to success! <API> WikiSearch(Fast train
success) → Fast Train > It also peaked at #23 on the Canadian CHUM singles chart, on June
26, 1971. The success of this single established Myles Goodwyn as the band's main songwriter,
and made it possible for April Wine to record a second album. </API> Don't wait weeks and
[...] | 0.92 | ✗

Os Melhores Escolas em Jersey 2020 <API> MT(Os Melhores Escolas em Jersey) → The Best
Schools in Jersey </API> On this page you can search for Universities, Colleges and Business
schools in Jersey | 0.70 | ✓

Enjoy these pictures from the <API> Calendar() → Today is Friday, April 19, 2013. </API>
Easter Egg Hunt. | 0.33 | ✓

85 patients (23%) were hospitalised alive and admitted to a hospital ward. Of them, <API> Calcula-
tor(85 / 23) → 3.70 </API> 65% had a cardiac aetiology [...] | −0.02 | ✗

But hey, after the <API> Calendar() → Today is Saturday, June 25, 2011. </API> Disneyland
fiasco with the fire drill, I think it's safe to say Chewey won't let anyone die in a fire. | −0.41 | ✗

The last time I was with <API> QA(Who was last time I was with?) → The Last Time </API>
him I asked what he likes about me and he said he would tell me one day. | −1.23 | ✗

Table 10: Examples of API calls for different tools, sorted by the value of Lᵢ⁻ − Lᵢ⁺ that is used as a filtering
criterion. High values typically correspond to API calls that are intuitively useful for predicting future tokens.
approaches, additional information is always pro-
vided, regardless of whether it is helpful or not. In
contrast, Toolformer learns for itself to explicitly
ask for the right information.
Tool Use Several approaches aim to equip LMs
with the ability to use external tools such as search
engines (Komeili et al., 2022; Thoppilan et al.,
2022; Lazaridou et al., 2022; Shuster et al., 2022;
Yao et al., 2022), web browsers (Nakano et al.,
2021), calculators (Cobbe et al., 2021; Thoppilan
et al., 2022), translation systems (Thoppilan et al.,
2022) and Python interpreters (Gao et al., 2022).
The way these models learn to use tools can roughly
be divided into two approaches: Either they rely on
large amounts of human supervision (Komeili et al.,
2022; Nakano et al., 2021; Thoppilan et al., 2022)
or they work by prompting the language model in
a few-shot setup tailored towards a specific task
where it is known a priori which tools need to be used (Gao et al., 2022; Lazaridou et al., 2022; Yao
et al., 2022). In contrast, the self-supervised nature
of Toolformer enables it to learn how and when to
use tools without requiring a specific prompt that
shows task-specific examples of how a tool could
be used. Perhaps most closely related to our work
is TALM (Parisi et al., 2022), an approach that
uses a similar self-supervised objective for teach-
ing a model to use a calculator and a search engine,
but explores this only in settings where a model is
finetuned for downstream tasks.
Bootstrapping The idea of using self-training
and bootstrapping techniques to improve models
has been investigated in various contexts, rang-
ing from word sense disambiguation (Yarowsky,
1995), relation extraction (Brin, 1999; Agichtein
and Gravano, 2000), parsing (McClosky et al.,
2006; Reichart and Rappoport, 2007), sequence
generation (He et al., 2020), few-shot text classi-
fication (Schick and Schütze, 2021a) and retrieval
(Izacard and Grave, 2021) to reasoning (Zelikman
et al., 2022). In a similar spirit to these approaches,
Toolformer is trained on its own predictions after
applying a perplexity-based filtering step.
7 Limitations
While our approach enables LMs to learn how to
use a variety of tools in a self-supervised way, there
are some clear limitations to what can be achieved
with our method in its current form. One such limi-
tation is the inability of Toolformer to use tools in a
chain (i.e., using the output of one tool as an input
for another tool). This is due to the fact that API
calls for each tool are generated independently; as a
consequence, there are no examples of chained tool
use in the finetuning dataset. Our current approach
also does not allow the LM to use a tool in an
interactive way. Especially for tools such as search
engines, which could potentially return hundreds of
different results, enabling an LM to browse through
these results or to refine its search query in a simi-
lar spirit to Nakano et al. (2021) can be crucial for
certain applications. Beyond this, we found models
trained with Toolformer to often be sensitive to the
exact wording of their input when deciding whether
or not to call an API; this is perhaps unsurprising
given that LMs are known to be very sensitive to
the prompt they are provided with in both zero-
and few-shot settings (Jiang et al., 2020; Schick
and Schütze, 2021a). Depending on the tool, our
method is also very sample-inefficient; for example,
processing more than a million documents results
in only a few thousand examples of useful calls
to the calculator API. A potential solution to this
problem might be to iteratively apply our approach,
similar to how this is done in related bootstrapping
approaches (Schick and Schütze, 2021a; Izacard
and Grave, 2021; Parisi et al., 2022). Finally, when
deciding whether or not to make an API call, Tool-
former currently does not take into account the
tool-dependent, computational cost incurred from
making an API call.
8 Conclusion
We have introduced Toolformer, a language model
that learns in a self-supervised way how to use
different tools such as search engines, calculators,
and translation systems via simple API calls. This
is done by finetuning on a large number of sampled
API calls that are filtered based on whether they reduce perplexity on future tokens. Toolformer
considerably improves zero-shot performance of a
6.7B parameter GPT-J model, enabling it to even
outperform a much larger GPT-3 model on a range
of different downstream tasks.
References
Armen Aghajanyan, Dmytro Okhonko, Mike Lewis,
Mandar Joshi, Hu Xu, Gargi Ghosh, and Luke Zettle-
moyer. 2021. HTLM: Hyper-text pre-training and
prompting of language models.
Eugene Agichtein and Luis Gravano. 2000. Snowball:
Extracting relations from large plain-text collections.
In Proceedings of the Fifth ACM Conference on Dig-
ital Libraries, DL '00, pages 85–94, New York, NY,
USA. Association for Computing Machinery.
Ricardo Baeza-Yates, Berthier Ribeiro-Neto, et al.
1999. Modern Information Retrieval, volume 463.
ACM Press, New York.
Jonathan Berant, Andrew Chou, Roy Frostig, and Percy
Liang. 2013. Semantic parsing on Freebase from
question-answer pairs. In Proceedings of the 2013
Conference on Empirical Methods in Natural Lan-
guage Processing , pages 1533–1544, Seattle, Wash-
ington, USA. Association for Computational Lin-
guistics.
Sebastian Borgeaud, Arthur Mensch, Jordan Hoff-
mann, Trevor Cai, Eliza Rutherford, Katie Millican,
George van den Driessche, Jean-Baptiste Lespiau,
Bogdan Damoc, Aidan Clark, Diego de Las Casas,
Aurelia Guy, Jacob Menick, Roman Ring, Tom Hen-
nigan, Saffron Huang, Loren Maggiore, Chris Jones,
Albin Cassirer, Andy Brock, Michela Paganini, Ge-
offrey Irving, Oriol Vinyals, Simon Osindero, Karen
Simonyan, Jack W. Rae, Erich Elsen, and Laurent
Sifre. 2021. Improving language models by retriev-
ing from trillions of tokens.
Sergey Brin. 1999. Extracting patterns and relations
from the world wide web. In The World Wide Web
and Databases , pages 172–183, Berlin, Heidelberg.
Springer Berlin Heidelberg.
Tom Brown, Benjamin Mann, Nick Ryder, Melanie
Subbiah, Jared D Kaplan, Prafulla Dhariwal,
Arvind Neelakantan, Pranav Shyam, Girish Sastry,
Amanda Askell, Sandhini Agarwal, Ariel Herbert-
Voss, Gretchen Krueger, Tom Henighan, Rewon
Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu,
Clemens Winter, Chris Hesse, Mark Chen, Eric
Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess,
Jack Clark, Christopher Berner, Sam McCandlish,
Alec Radford, Ilya Sutskever, and Dario Amodei.
2020. Language models are few-shot learners. In
Advances in Neural Information Processing Systems ,
volume 33, pages 1877–1901. Curran Associates,
Inc.
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin,
Maarten Bosma, Gaurav Mishra, Adam Roberts,
Paul Barham, Hyung Won Chung, Charles Sutton,
Sebastian Gehrmann, Parker Schuh, Kensen Shi,
Sasha Tsvyashchenko, Joshua Maynez, Abhishek
Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vin-
odkumar Prabhakaran, Emily Reif, Nan Du, Ben
Hutchinson, Reiner Pope, James Bradbury, Jacob
Austin, Michael Isard, Guy Gur-Ari, Pengcheng
Yin, Toju Duke, Anselm Levskaya, Sanjay Ghe-
mawat, Sunipa Dev, Henryk Michalewski, Xavier