@@ -151,7 +151,12 @@ <h1 class="text-nowrap mt-5">
 🚩The <i>First</i> Benchmark for Long-Context Code Understanding.🚩<br />
 </div>
 <div class="d-flex flex-row justify-content-center gap-3">
-<a href="https://github.com/evalplus/repoqa"
+<a href="https://arxiv.org/abs/2406.06025"
+><img
+src="https://img.shields.io/badge/arXiv-2406.06025-b31b1b.svg?style=for-the-badge"
+alt="arxiv"
+class="img-fluid" /></a
+><a href="https://github.com/evalplus/repoqa"
 ><img
 src="https://img.shields.io/badge/github-%23121011.svg?style=for-the-badge&logo=github&logoColor=white"
 alt="github"
@@ -167,9 +172,8 @@ <h1 class="text-nowrap mt-5">
 <div class="container-fluid d-flex flex-row flex-nowrap">
 <div class="container-fluid d-flex flex-column align-items-center">
 <p>
-<b>🔊 The goal of RepoQA:</b> is to create a series of long-context
-code understanding tasks to challenge chat/instruction models for
-code:
+RepoQA aims to create a series of long-context code understanding tasks
+to challenge chat/instruction models for code:
 </p>
 <ul>
 <li>
@@ -299,17 +303,13 @@ <h2 id="faq" class="text-nowrap mt-5">🙋🏻♀️ FAQ</h2>
 <h3 id="yet-another" class="text-nowrap mt-5">
 Just yet another needle test?
 </h3>
-No. Here are some notes:
 <ul>
 <li>
-<b>SNF != RepoQA, SNF \in RepoQA:</b> Yes, SNF is a variant of
-needle test, but SNF != RepoQA. SNF is a start point and
-elementary test:
+SNF is a variant of needle test and is part of RepoQA as the elementary test:
 <b
 >if a model can't pass SNF, don't expect it to pass more
 challenging tasks.</b
 >
-We will build more challenging tasks in the future.
 </li>
 <li>
 Unlike vanilla needle tests which use single test to perform fully
@@ -339,25 +339,8 @@ <h3 id="limit" class="text-nowrap mt-5">Known limitations</h3>
 </ul>
 <h2 id="sponsor" class="text-nowrap mt-5">🤗 Acknowledgment</h2>
 <p>
-Running long-context evaluations can be costly -- we thank
-<a href="https://deepmind.google/">Google DeepMind</a>
-and
-<a href="https://openai.com/form/researcher-access-program/"
->OpenAI Researcher Access Program</a
->
-for their generous API credits!
-</p>
-<p>
-Meanwhile, note that RepoQA is a transparent research project
-started by students at UIUC. We assure the reproducibility and
-fairness of the evaluation as well as the indenpendence of our
-benchmark design that none of these will be optimized or compromised
-for models from specific organizations. The outputs and results of
-evaluated models can be found at our
-<a
-href="https://github.com/evalplus/repoqa/releases/tag/dev-results"
->GitHub release page</a
->.
+Part of the compute is generously provided by <a href="https://deepmind.google/">Google DeepMind</a>
+and <a href="https://wandb.ai/site">Weights & Biases</a>.
 </p>
 </div>
 </div>