Commit 1b42650 (parent 78b77e3)
Authored by you-n-g and SunsetWolf

feat: data improve, support parquet (#1966)

* refactor: relocate CLI modules to qlib.cli and update references
* refactor: introduce read_as_df and rename csv_path to data_path
* lint
* refactor: rename csv_path to data_path and use QSettings.provider_uri
* fix pylint error
* fix get_data command
* add comments to CI yaml
* update docs

Co-authored-by: Linlang <[email protected]>

File tree: 21 files changed, +105 −62 lines

.github/workflows/test_qlib_from_pip.yml
Lines changed: 2 additions & 0 deletions

@@ -60,6 +60,8 @@ jobs:
           brew unlink libomp
           brew install libomp.rb
+          # When the new version is released it should be changed to:
+          # python -m qlib.cli.data qlib_data --target_dir ~/.qlib/qlib_data/cn_data --region cn
       - name: Downloads dependencies data
         run: |
           cd ..

.github/workflows/test_qlib_from_source.yml
Lines changed: 1 addition & 1 deletion

@@ -104,7 +104,7 @@ jobs:
       - name: Test workflow by config (install from source)
         run: |
           python -m pip install numba
-          python qlib/workflow/cli.py examples/benchmarks/LightGBM/workflow_config_lightgbm_Alpha158.yaml
+          python qlib/cli/run.py examples/benchmarks/LightGBM/workflow_config_lightgbm_Alpha158.yaml

       - name: Unit tests with Pytest
         uses: nick-fields/retry@v2

README.md
Lines changed: 4 additions & 4 deletions

@@ -229,10 +229,10 @@ Load and prepare data by running the following code:
 ### Get with module
 ```bash
 # get 1d data
-python -m qlib.run.get_data qlib_data --target_dir ~/.qlib/qlib_data/cn_data --region cn
+python -m qlib.cli.data qlib_data --target_dir ~/.qlib/qlib_data/cn_data --region cn

 # get 1min data
-python -m qlib.run.get_data qlib_data --target_dir ~/.qlib/qlib_data/cn_data_1min --region cn --interval 1min
+python -m qlib.cli.data qlib_data --target_dir ~/.qlib/qlib_data/cn_data_1min --region cn --interval 1min
 ```

@@ -329,7 +329,7 @@ We recommend users to prepare their own data if they have a high-quality dataset
 3. At this point you are in the docker environment and can run the qlib scripts. An example:
    ```bash
    >>> python scripts/get_data.py qlib_data --name qlib_data_simple --target_dir ~/.qlib/qlib_data/cn_data --interval 1d --region cn
-   >>> python qlib/workflow/cli.py examples/benchmarks/LightGBM/workflow_config_lightgbm_Alpha158.yaml
+   >>> python qlib/cli/run.py examples/benchmarks/LightGBM/workflow_config_lightgbm_Alpha158.yaml
    ```
 4. Exit the container
    ```bash

@@ -359,7 +359,7 @@ Qlib provides a tool named `qrun` to run the whole workflow automatically (inclu
 ```
 If users want to use `qrun` under debug mode, please use the following command:
 ```bash
-python -m pdb qlib/workflow/cli.py examples/benchmarks/LightGBM/workflow_config_lightgbm_Alpha158.yaml
+python -m pdb qlib/cli/run.py examples/benchmarks/LightGBM/workflow_config_lightgbm_Alpha158.yaml
 ```
 The result of `qrun` is as follows, please refer to [docs](https://qlib.readthedocs.io/en/latest/component/strategy.html#result) for more explanations about the result.

docs/component/data.rst
Lines changed: 16 additions & 16 deletions

@@ -108,10 +108,10 @@ Automatic update of daily frequency data


-Converting CSV Format into Qlib Format
---------------------------------------
+Converting CSV and Parquet Format into Qlib Format
+--------------------------------------------------

-``Qlib`` has provided the script ``scripts/dump_bin.py`` to convert **any** data in CSV format into `.bin` files (``Qlib`` format) as long as they are in the correct format.
+``Qlib`` has provided the script ``scripts/dump_bin.py`` to convert **any** data in CSV or Parquet format into `.bin` files (``Qlib`` format) as long as they are in the correct format.

 Besides downloading the prepared demo data, users could download demo data directly from the Collector as follows for reference to the CSV format.
 Here are some example:

@@ -126,17 +126,17 @@ for 1min data:

     python scripts/data_collector/yahoo/collector.py download_data --source_dir ~/.qlib/stock_data/source/cn_1min --region CN --start 2021-05-20 --end 2021-05-23 --delay 0.1 --interval 1min --limit_nums 10

-Users can also provide their own data in CSV format. However, the CSV data **must satisfies** following criterions:
+Users can also provide their own data in CSV or Parquet format. However, the data **must satisfies** following criterions:

-- CSV file is named after a specific stock *or* the CSV file includes a column of the stock name
+- CSV or Parquet file is named after a specific stock *or* the CSV or Parquet file includes a column of the stock name

-    - Name the CSV file after a stock: `SH600000.csv`, `AAPL.csv` (not case sensitive).
+    - Name the CSV or Parquet file after a stock: `SH600000.csv`, `AAPL.csv` or `SH600000.parquet`, `AAPL.parquet` (not case sensitive).

-    - CSV file includes a column of the stock name. User **must** specify the column name when dumping the data. Here is an example:
+    - CSV or Parquet file includes a column of the stock name. User **must** specify the column name when dumping the data. Here is an example:

       .. code-block:: bash

-          python scripts/dump_bin.py dump_all ... --symbol_field_name symbol
+          python scripts/dump_bin.py dump_all ... --symbol_field_name symbol --file_suffix <.csv or .parquet>

      where the data are in the following format:

@@ -146,11 +146,11 @@ Users can also provide their own data in CSV format. However, the CSV data **mus
      | SH600000 | 120   |
      +-----------+-------+

-- CSV file **must** include a column for the date, and when dumping the data, user must specify the date column name. Here is an example:
+- CSV or Parquet file **must** include a column for the date, and when dumping the data, user must specify the date column name. Here is an example:

   .. code-block:: bash

-      python scripts/dump_bin.py dump_all ... --date_field_name date
+      python scripts/dump_bin.py dump_all ... --date_field_name date --file_suffix <.csv or .parquet>

  where the data are in the following format:

@@ -163,23 +163,23 @@ Users can also provide their own data in CSV format. However, the CSV data **mus
  +---------+------------+-------+------+----------+


-Supposed that users prepare their CSV format data in the directory ``~/.qlib/csv_data/my_data``, they can run the following command to start the conversion.
+Supposed that users prepare their CSV or Parquet format data in the directory ``~/.qlib/my_data``, they can run the following command to start the conversion.

 .. code-block:: bash

-    python scripts/dump_bin.py dump_all --csv_path ~/.qlib/csv_data/my_data --qlib_dir ~/.qlib/qlib_data/my_data --include_fields open,close,high,low,volume,factor
+    python scripts/dump_bin.py dump_all --data_path ~/.qlib/my_data --qlib_dir ~/.qlib/qlib_data/ --include_fields open,close,high,low,volume,factor --file_suffix <.csv or .parquet>

 For other supported parameters when dumping the data into `.bin` file, users can refer to the information by running the following commands:

 .. code-block:: bash

-    python dump_bin.py dump_all --help
+    python scripts/dump_bin.py dump_all --help

-After conversion, users can find their Qlib format data in the directory `~/.qlib/qlib_data/my_data`.
+After conversion, users can find their Qlib format data in the directory `~/.qlib/qlib_data/`.

 .. note::

-    The arguments of `--include_fields` should correspond with the column names of CSV files. The columns names of dataset provided by ``Qlib`` should include open, close, high, low, volume and factor at least.
+    The arguments of `--include_fields` should correspond with the column names of CSV or Parquet files. The columns names of dataset provided by ``Qlib`` should include open, close, high, low, volume and factor at least.

 - `open`
     The adjusted opening price

@@ -195,7 +195,7 @@ After conversion, users can find their Qlib format data in the directory `~/.qli
     The Restoration factor. Normally, ``factor = adjusted_price / original_price``, `adjusted price` reference: `split adjusted <https://www.investopedia.com/terms/s/splitadjusted.asp>`_

 In the convention of `Qlib` data processing, `open, close, high, low, volume, money and factor` will be set to NaN if the stock is suspended.
-If you want to use your own alpha-factor which can't be calculate by OCHLV, like PE, EPS and so on, you could add it to the CSV files with OHCLV together and then dump it to the Qlib format data.
+If you want to use your own alpha-factor which can't be calculate by OCHLV, like PE, EPS and so on, you could add it to the CSV or Parquet files with OHCLV together and then dump it to the Qlib format data.

 Checking the health of the data
 -------------------------------
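The input layout this diff documents (a symbol column, a date column, and at least the OHLCV and factor fields) can be produced with pandas. A minimal sketch, with illustrative ticker and price values; the resulting file should be something `scripts/dump_bin.py dump_all` accepts with the matching `--file_suffix`:

```python
import pandas as pd

# Columns follow the documented requirements: a symbol column, a date
# column, and at least open, close, high, low, volume, factor.
df = pd.DataFrame(
    {
        "symbol": ["SH600000", "SH600000"],
        "date": ["2021-05-20", "2021-05-21"],
        "open": [10.0, 10.2],
        "close": [10.1, 10.3],
        "high": [10.3, 10.4],
        "low": [9.9, 10.1],
        "volume": [120, 130],
        "factor": [1.0, 1.0],
    }
)

df.to_csv("SH600000.csv", index=False)  # for --file_suffix .csv
# df.to_parquet("SH600000.parquet", index=False)  # needs pyarrow; for --file_suffix .parquet
```

Since the file is named after the stock, `--symbol_field_name` could be omitted; keeping the `symbol` column also works with `--symbol_field_name symbol`.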

docs/component/workflow.rst
Lines changed: 1 addition & 1 deletion

@@ -110,7 +110,7 @@ If users want to use ``qrun`` under debug mode, please use the following command

 .. code-block:: bash

-    python -m pdb qlib/workflow/cli.py examples/benchmarks/LightGBM/workflow_config_lightgbm_Alpha158.yaml
+    python -m pdb qlib/cli/run.py examples/benchmarks/LightGBM/workflow_config_lightgbm_Alpha158.yaml

 .. note::

docs/developer/how_to_build_image.rst
Lines changed: 1 addition & 1 deletion

@@ -52,7 +52,7 @@ How to use qlib images
    .. code-block:: bash

        >>> python scripts/get_data.py qlib_data --name qlib_data_simple --target_dir ~/.qlib/qlib_data/cn_data --interval 1d --region cn
-       >>> python qlib/workflow/cli.py examples/benchmarks/LightGBM/workflow_config_lightgbm_Alpha158.yaml
+       >>> python qlib/cli/run.py examples/benchmarks/LightGBM/workflow_config_lightgbm_Alpha158.yaml

 3. Exit the container

examples/rl_order_execution/README.md
Lines changed: 1 addition & 1 deletion

@@ -7,7 +7,7 @@ This folder comprises an example of Reinforcement Learning (RL) workflows for or
 ### Get Data

 ```
-python -m qlib.run.get_data qlib_data qlib_data --target_dir ./data/bin --region hs300 --interval 5min
+python -m qlib.cli.data qlib_data --target_dir ./data/bin --region hs300 --interval 5min
 ```

 ### Generate Pickle-Style Data

pyproject.toml
Lines changed: 1 addition & 1 deletion

@@ -103,4 +103,4 @@ packages = [
 ]

 [project.scripts]
-qrun = "qlib.workflow.cli:run"
+qrun = "qlib.cli.run:run"
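A `[project.scripts]` entry maps a console command to a `module:callable` string, so after this change the installed `qrun` command invokes the `run` function in `qlib.cli.run`. A small sketch of how such a string is resolved (demonstrated here with a stdlib callable, not Qlib itself):

```python
import importlib


def resolve_entry_point(spec: str):
    """Resolve a 'module:attr' entry-point string, the same shape
    used by [project.scripts] entries like 'qlib.cli.run:run'."""
    module_name, _, attr = spec.partition(":")
    module = importlib.import_module(module_name)
    return getattr(module, attr)


# Resolves to os.path.join; 'qlib.cli.run:run' would resolve the same way
# once qlib is installed.
func = resolve_entry_point("os.path:join")
print(func("a", "b"))
```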

qlib/backtest/exchange.py
Lines changed: 1 addition & 0 deletions

@@ -897,6 +897,7 @@ def _calc_trade_info_by_order(
         # if we don't know current position, we choose to sell all
         # Otherwise, we clip the amount based on current position
         if position is not None:
+            # TODO: make the trading shortable
             current_amount = (
                 position.get_stock_amount(order.stock_id) if position.check_stock(order.stock_id) else 0
             )
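The TODO added here marks the branch where a sell order is clipped to the amount currently held, which is exactly what rules out shorting. A minimal sketch of that clipping logic, with simplified names (a plain dict stands in for Qlib's actual `Position` object):

```python
from typing import Optional


def clip_sell_amount(order_amount: float, position: Optional[dict], stock_id: str) -> float:
    """Clip a sell order to the currently held amount.

    If the position is unknown (None), the requested amount is kept
    (the source comment says "we choose to sell all"); otherwise never
    sell more than is held. Allowing the result to exceed the holding
    is what "make the trading shortable" would require.
    """
    if position is None:
        return order_amount
    current_amount = position.get(stock_id, 0)
    return min(order_amount, current_amount)


print(clip_sell_amount(150.0, {"SH600000": 100.0}, "SH600000"))  # clipped to 100.0
```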
File renamed without changes.
