# LLM Querying (`/gpu-flopbench/llm-querying`)

Once the datasets are created, we can begin making LLM queries.
We designed the `run_llm_queries.py` script to accept multiple input arguments controlling the type of experiment you want to run.
It depends on a few environment variables, which vary by LLM service provider.
We tested only with [OpenRouter](https://openrouter.ai/) and [Microsoft Azure AI](https://ai.azure.com/) to perform our runs.
Before continuing, note that these queries can easily cost hundreds of dollars depending on the model used.
Feel free to interrupt the scripts as they run to check your balances and query costs; we provide an estimate based on current 2025 prices, which you can manually update in the `io_cost.py` file.

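As a rough illustration, a per-query cost estimate can be derived from token counts and per-token prices. The sketch below is only an assumption of how such an estimate works; the price values and model key shown are hypothetical placeholders, not the actual numbers in `io_cost.py`:

```python
# Hypothetical price table, USD per 1M tokens (assumed values, NOT authoritative;
# the real prices live in io_cost.py and should be updated there).
PRICES_PER_MTOK = {
    "openai/gpt-5-mini": {"input": 0.25, "output": 2.00},
}

def estimate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate the USD cost of one query from its prompt/completion token counts."""
    p = PRICES_PER_MTOK[model]
    return (prompt_tokens * p["input"] + completion_tokens * p["output"]) / 1_000_000

# e.g. a 10k-token prompt with a 2k-token completion
print(round(estimate_cost("openai/gpt-5-mini", 10_000, 2_000), 6))
```

Summing this over all planned queries before launching a run gives a quick sanity check against your provider balance.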
## OpenRouter Querying
For OpenRouter querying to work, please set the following environment variable:
```
export OPENAI_API_KEY=sk-or-v1-b5e0bed80...
```
This should be the API key you get from the OpenRouter UI.

Then you can run the following commands:
```
python3 ./run_llm_queries.py --skipConfirm --modelName openai/gpt-5-mini --numTrials 3 --verbose 2>&1 | tee -a ./gpt-5-mini-easy-simplePrompt.log

python3 ./run_llm_queries.py --skipConfirm --modelName openai/gpt-5-mini --numTrials 3 --verbose --hardDataset 2>&1 | tee -a ./gpt-5-mini-hard-simplePrompt.log
```
The first command runs against the *easy* dataset, using the `gpt-5-mini` model, while the second uses the *hard* data subset.
The queries and outputs are logged to the specified `*.log` file, while the full LangGraph conversations are stored in the `./checkpoints` directory.
This process typically takes 10+ hours per script, so please leave it running overnight or under supervision.
It is inherently serial, making one query at a time, so you can run both scripts simultaneously to cut down on wait times.

There is a restart mechanism in place (in case of an unexpected crash).
We suggest re-running the script after a first collection pass, since OpenRouter requests sometimes time out or fail outright and thus need to be re-run.

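Conceptually, a restart pass only needs to skip queries that already have a saved checkpoint. The sketch below illustrates that idea under an assumed one-JSON-file-per-query naming scheme; the actual script's checkpoint layout may differ:

```python
from pathlib import Path

def pending_queries(query_ids, checkpoint_dir="./checkpoints"):
    """Return only the query IDs with no saved checkpoint yet, so a re-run
    after a crash (or after failed/timed-out requests) resumes where it left off.

    Assumes checkpoints are stored as <query_id>.json (hypothetical scheme).
    """
    done = {p.stem for p in Path(checkpoint_dir).glob("*.json")}
    return [q for q in query_ids if q not in done]
```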
NOTE: we set a limit of 2 minutes on the maximum query time.
If a model doesn't return within that window, we consider the query a failure.
Two minutes is quite reasonable, given that a user probably wouldn't wait that long for a response anyway.

## Microsoft Azure Querying
For Azure querying to work, we need to supply a particular environment variable:
```
export AZURE_OPENAI_API_KEY=...
```

We can then similarly run the *easy* and *hard* data collection scripts for `o3-mini` as follows:
```
python3 ./run_llm_queries.py --useAzure --api_version 2025-01-01-preview --provider_url https://galor-m8yvytc2-swedencentral.cognitiveservices.azure.com --skipConfirm --modelName o3-mini --numTrials 3 --top_p 1.0 --temp 1.0 --verbose 2>&1 | tee -a ./o3-mini-simplePrompt-easyDataset.log

python3 ./run_llm_queries.py --useAzure --api_version 2025-01-01-preview --provider_url https://galor-m8yvytc2-swedencentral.cognitiveservices.azure.com --skipConfirm --modelName o3-mini --numTrials 3 --top_p 1.0 --temp 1.0 --verbose --hardDataset 2>&1 | tee -a ./o3-mini-simplePrompt-hardDataset.log
```

Be sure to replace the `--provider_url` with the corresponding URL from your Azure AI account.
You will also need to provide the matching `--api_version` from the Azure link.
Although the models we test above have hard-coded `top_p` and `temperature` settings, the `--top_p 1.0` and `--temp 1.0` arguments are supplied as-is so the Azure API will accept the request; any other values would return invalid-request errors.

## Results Visualization / Tabulation

The final results can be visualized using the `visualizeSQLResults.ipynb` notebook.
This calculates the Matthews Correlation Coefficient (MCC) and Mean Absolute Log Error (MALE) of the predictions.
It creates the plots shown in our paper, along with various additional visualizations that aid in data analysis.
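For reference, both metrics can be computed directly from predictions. The snippet below is a self-contained sketch of the standard definitions, not the notebook's code:

```python
import math

def mcc(y_true, y_pred):
    """Matthews Correlation Coefficient for binary labels (0/1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def male(actual, predicted):
    """Mean Absolute Log Error over positive-valued quantities."""
    return sum(abs(math.log(p) - math.log(a))
               for a, p in zip(actual, predicted)) / len(actual)
```

MALE is well suited to quantities spanning many orders of magnitude (such as FLOP counts), since it penalizes relative rather than absolute error.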

<br/><br/> <br/><br/>

# Solo (no Docker) Building & CUDA Profiling Instructions

Below is a list of instructions for reproducing what is done in the above Docker container, but on your own system.
This is primarily for those who want to avoid the overhead of GPU profiling through a Docker container (i.e., more accurate profiling results).