To use API models (e.g., GPT-4v, Gemini-Pro-V) for inference, you must set up API keys first.
> **Note:** Some datasets require an LLM as a Judge and have default evaluation models configured (see *Extra Notes*). You also need to configure the corresponding APIs when evaluating these datasets.
You can place the required keys in `$SciEvalKit/.env` or set them directly as environment variables. If you choose to create a `.env` file, the content should look like this:
```bash
# .env file, place it under $SciEvalKit

# --- API Keys for Proprietary VLMs ---
# QwenVL APIs
```

Fill in your keys where applicable. These API keys will be automatically loaded when running inference and evaluation.
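Alternatively, as a quick sketch of the environment-variable route mentioned above (only `OPENAI_API_KEY`, `OPENAI_API_BASE`, and `ANTHROPIC_API_KEY` are referenced elsewhere in this guide; any additional variable names depend on the models you evaluate):

```bash
# Export keys directly instead of using a .env file (values are placeholders).
export OPENAI_API_KEY=sk-...
export OPENAI_API_BASE=https://api.openai.com/v1
export ANTHROPIC_API_KEY=sk-ant-...
```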
## Step 1: Configuration
**VLM Configuration:** All VLMs are configured in `scieval/config.py`. For some VLMs (e.g., MiniGPT-4, LLaVA-v1-7B), additional configuration is required (setting the code/model weight root directory in the config file).
When evaluating, you should use the model name specified in `supported_VLM` in `scieval/config.py`. Ensure you can successfully run inference with the VLM before starting the evaluation.
**Check Command:**
```bash
vlmutil check {MODEL_NAME}
```
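For example, to confirm a model can run inference before starting an evaluation (the model name here is illustrative; use any key defined in `supported_VLM`):

```bash
# Illustrative model name -- substitute any entry from supported_VLM in scieval/config.py.
vlmutil check GPT4o
```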
## Step 2: Evaluation
We use `run.py` for evaluation. You can use `$SciEvalKit/run.py` or create a soft link to the script (a complete example invocation is shown after the argument list below).
### Basic Arguments
* `--data` (list[str]): Set the dataset names supported in SciEvalKit (refer to `scieval/dataset/__init__.py` or use `vlmutil dlist all` to check).
* `--model` (list[str]): Set the VLM names supported in SciEvalKit (defined in `supported_VLM` in `scieval/config.py`).
* `--judge` (str): Specify the evaluation model for datasets that require model-based evaluation.
  * If not specified, the configured default model will be used.
  * The model can be a VLM supported in SciEvalKit or a custom model.
* `--judge-args` (str): Arguments for the judge model (in JSON string format).
  * You can pass parameters like `temperature`, `max_tokens` when specifying the judge via `--judge`.
  * Specific args depend on the model initialization class (e.g., `scieval.api.gpt.OpenAIWrapper`).
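Putting these arguments together, a typical invocation might look like the following sketch. The placeholders follow the `{MODEL_NAME}` convention used above; the judge arguments are illustrative values, not defaults.

```bash
# Evaluate one model on one dataset; the names must match entries in
# scieval/dataset/__init__.py and supported_VLM in scieval/config.py.
python run.py --data {DATASET_NAME} --model {MODEL_NAME}

# Optionally override the judge model and its sampling parameters (illustrative values).
python run.py --data {DATASET_NAME} --model {MODEL_NAME} \
    --judge {JUDGE_MODEL_NAME} \
    --judge-args '{"temperature": 0, "max_tokens": 2048}'
```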
Some datasets have specific requirements during evaluation:
* If you want to use a model other than the default GPT-4o, you must specify `base_url` and `api_key` separately (defaults to `OPENAI_API_KEY`, `OPENAI_API_BASE` in env).
* **AstroVisBench:**
  * **Environment:** Must download dependencies following the official guide and set the `AstroVisBench_Env` environment variable.
  * **Python Env:** Due to complex dependencies, it is recommended to create a separate environment, install SciEvalKit dependencies, and then install the official dependencies to avoid conflicts.
  * **Concurrency:** Default concurrency is 4. Can be changed via `--judge-args '{"max_workers": <nums>}'`.
  * **Judge Model:** Requires Claude 3.5 Sonnet for evaluation. Ensure `ANTHROPIC_API_KEY` is set (see the example command after this list).
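As a sketch of how these requirements fit together, an AstroVisBench run might look like the block below. The dataset identifier and the dependency path are assumptions; verify the registered name with `vlmutil dlist all`.

```bash
# Assumed dataset identifier and dependency path -- adjust to your setup.
export AstroVisBench_Env=/path/to/astrovisbench/deps
export ANTHROPIC_API_KEY=sk-ant-...   # Claude 3.5 Sonnet is the default judge
python run.py --data AstroVisBench --model {MODEL_NAME} \
    --judge-args '{"max_workers": 8}'
```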
If the model output for a benchmark does not match expectations, it might be due to incorrect prompt construction.
In SciEvalKit, each dataset class has a `build_prompt()` function. For example, `ImageMCQDataset.build_prompt()` combines hint, question, and options into a standard format:
```text
HINT
QUESTION
A. Option A
B. Option B
Please select the correct answer from the options above.
```
SciEvalKit also supports **Model-Level custom prompt building** via `model.build_prompt()`.