# LLM Querying (`/gpu-flopbench/llm-querying`)
Once the datasets are created, we can begin making LLM queries.
We designed the `run_llm_queries.py` script to accept multiple input arguments to control the type of experiment you want to run.
It relies on a few environment variables, depending on your LLM service provider.
We performed our runs only with [OpenRouter](https://openrouter.ai/) and [Microsoft Azure AI](https://ai.azure.com/).
Before continuing, we want to note that these queries can easily cost hundreds of dollars depending on the model used.
Feel free to interrupt the scripts as they run to check your balances and the cost of queries; we provide an estimate based on current 2025 prices, which you can manually update in the `io_cost.py` file.
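The arithmetic behind such an estimate is straightforward: token counts times per-token rates. A minimal sketch follows; the rate table below is an illustrative placeholder, not necessarily what ships in `io_cost.py`, so update the numbers from your provider's current pricing page.

```python
# Hypothetical USD prices per 1M tokens: (input_rate, output_rate).
# These are placeholder values -- check your provider's pricing page.
PRICES_PER_MTOK = {
    "openai/gpt-5-mini": (0.25, 2.00),
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Rough USD cost of one query, from input/output token counts."""
    in_rate, out_rate = PRICES_PER_MTOK[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000
```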
## OpenRouter Querying
For OpenRouter querying to work, please set the following environment variable:
```
export OPENAI_API_KEY=sk-or-v1-b5e0bed80...
```
This should be the API key you get from the OpenRouter UI.
Then you can run the following commands:
```
python3 ./run_llm_queries.py --skipConfirm --modelName openai/gpt-5-mini --numTrials 3 --verbose 2>&1 | tee -a ./gpt-5-mini-easy-simplePrompt.log

python3 ./run_llm_queries.py --skipConfirm --modelName openai/gpt-5-mini --numTrials 3 --verbose --hardDataset 2>&1 | tee -a ./gpt-5-mini-hard-simplePrompt.log
```
The first command runs against the *easy* dataset using the `gpt-5-mini` model, while the second command uses the *hard* data subset.
The queries and outputs will be logged to the specified `*.log` file, while the full LangGraph conversations will be stored in the `./checkpoints` directory.
This process typically takes 10+ hours for one script, so please leave it running overnight or with a babysitter.
It is inherently serial, making one query at a time, so you could run both scripts simultaneously to cut down on wait times.
There is a restart mechanism in place (in case of an unexpected crash).
We suggest re-running the script after a first collection pass, since OpenRouter requests sometimes time out or fail completely and thus need to be re-run.
NOTE: we set a limit on the maximum query time of 2 minutes.
If a model doesn't return, we consider it a failure.
Two minutes is quite reasonable, given that a user probably wouldn't wait that long for a response anyway.
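The timeout-as-failure pattern can be sketched as follows. This is a minimal illustration, not the script's actual implementation; note that with a thread-based timeout the underlying request keeps running in the background after the deadline passes.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

QUERY_TIMEOUT_S = 120  # 2-minute cap, per the note above

def query_with_timeout(fn, timeout=QUERY_TIMEOUT_S):
    """Run fn() with a deadline; a timeout counts as a failed query.

    Returns (ok, result): (True, value) on success, (False, None) on timeout.
    """
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fn)
        try:
            return True, future.result(timeout=timeout)
        except TimeoutError:
            return False, None
```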
## Microsoft Azure Querying
For Azure querying to work, please set the following environment variable:
```
AZURE_OPENAI_API_KEY=...
```
We can then similarly run the *easy* and *hard* data collection scripts for `o3-mini` as follows:
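The invocations mirror the OpenRouter ones, with the Azure endpoint details added. As a hedged sketch only, the `--provider_url` and `--api_version` values below are placeholders you must fill in from your Azure AI account, and the log-file names are illustrative; consult the script's `--help` for the authoritative flags.

```
python3 ./run_llm_queries.py --skipConfirm --modelName o3-mini --provider_url https://<your-resource>.openai.azure.com/ --api_version <api-version> --numTrials 3 --verbose 2>&1 | tee -a ./o3-mini-easy-simplePrompt.log

python3 ./run_llm_queries.py --skipConfirm --modelName o3-mini --provider_url https://<your-resource>.openai.azure.com/ --api_version <api-version> --numTrials 3 --verbose --hardDataset 2>&1 | tee -a ./o3-mini-hard-simplePrompt.log
```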
Be sure to replace the `--provider_url` with the corresponding URL from your Azure AI account.
You will also need to provide the corresponding `--api_version` from the Azure link.
Although the models we test above have hard-coded `--top_p` and `--temp` arguments, these are provided as-is so that the Azure API will allow us to connect and run.
Any other values would return invalid request errors.
## Results Visualization / Tabulation
The final results can be visualized using the `visualizeSQLResults.ipynb` notebook.
This will calculate the Matthews Correlation Coefficient (MCC) and Mean Absolute Log Error (MALE) of the predictions.
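For readers unfamiliar with these metrics, a minimal sketch of both is below. It assumes the standard binary-classification definition of MCC and takes MALE to be the mean absolute difference of natural logs; the notebook's exact formulas may differ.

```python
import math

def mcc(y_true, y_pred):
    """Matthews Correlation Coefficient for binary 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def male(y_true, y_pred):
    """Mean Absolute Log Error over positive true/predicted values."""
    return sum(abs(math.log(p) - math.log(t))
               for t, p in zip(y_true, y_pred)) / len(y_true)
```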
It creates the plots shown in our paper, along with various additional visualizations that aid in data analysis.
<br/><br/> <br/><br/>
# Solo (no Docker) Building & CUDA Profiling Instructions
Below is a list of instructions for reproducing what is done in the above Docker container, but instead on your own system.
This is primarily for those who don't want any overhead from GPU profiling through a Docker container (i.e., more accurate profiling results).