Skip to content

Commit 82f1ef2

Browse files
DRAFT add Data Health Checker (#1574)
* #854 implement first data health checker draft * #854 added support for qlib's data format, implemented factor check, reformatted summary * adaptation current dataset * format with black * add data health check to docs * fix sphinx error * fix pylint error * update code * format with black * format with pylint --------- Co-authored-by: Linlang <[email protected]>
1 parent 186512f commit 82f1ef2

File tree

3 files changed

+264
-0
lines changed

3 files changed

+264
-0
lines changed

README.md

Lines changed: 10 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -264,6 +264,16 @@ We recommend users to prepare their own data if they have a high-quality dataset
264264
* *trading_date*: start of trading day
265265
* *end_date*: end of trading day(not included)
266266
267+
### Checking the health of the data
268+
* We provide a script to check the health of the data, you can run the following commands to check whether the data is healthy or not.
269+
```
270+
python scripts/check_data_health.py check_data --qlib_dir ~/.qlib/qlib_data/cn_data
271+
```
272+
* Of course, you can also add some parameters to adjust the test results, such as this.
273+
```
274+
python scripts/check_data_health.py check_data --qlib_dir ~/.qlib/qlib_data/cn_data --missing_data_num 30055 --large_step_threshold_volume 94485 --large_step_threshold_price 20
275+
```
276+
* If you want more information about `check_data_health`, please refer to the [documentation](https://qlib.readthedocs.io/en/latest/component/data.html#checking-the-health-of-the-data).
267277
268278
<!--
269279
- Run the initialization code and get stock data:

docs/component/data.rst

Lines changed: 51 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -197,6 +197,57 @@ After conversion, users can find their Qlib format data in the directory `~/.qli
197197
In the convention of `Qlib` data processing, `open, close, high, low, volume, money and factor` will be set to NaN if the stock is suspended.
198198
If you want to use your own alpha-factor which can't be calculate by OCHLV, like PE, EPS and so on, you could add it to the CSV files with OHCLV together and then dump it to the Qlib format data.
199199

200+
Checking the health of the data
201+
-------------------------------
202+
203+
``Qlib`` provides a script to check the health of the data.
204+
205+
- The main points to check are as follows
206+
207+
- Check if any data is missing in the DataFrame.
208+
209+
- Check if there are any large step changes above the threshold in the OHLCV columns.
210+
211+
- Check if any of the required columns (OLHCV) are missing in the DataFrame.
212+
213+
- Check if the 'factor' column is missing in the DataFrame.
214+
215+
- You can run the following commands to check whether the data is healthy or not.
216+
217+
for daily data:
218+
.. code-block:: bash
219+
220+
python scripts/check_data_health.py check_data --qlib_dir ~/.qlib/qlib_data/cn_data
221+
222+
for 1min data:
223+
.. code-block:: bash
224+
225+
python scripts/check_data_health.py check_data --qlib_dir ~/.qlib/qlib_data/cn_data_1min --freq 1min
226+
227+
- Of course, you can also add some parameters to adjust the test results.
228+
229+
- The available parameters are these.
230+
231+
- freq: Frequency of data.
232+
233+
- large_step_threshold_price: Maximum permitted price change
234+
235+
- large_step_threshold_volume: Maximum permitted volume change.
236+
237+
- missing_data_num: Maximum value for which data is allowed to be null.
238+
239+
- You can run the following commands to check whether the data is healthy or not.
240+
241+
for daily data:
242+
.. code-block:: bash
243+
244+
python scripts/check_data_health.py check_data --qlib_dir ~/.qlib/qlib_data/cn_data --missing_data_num 30055 --large_step_threshold_volume 94485 --large_step_threshold_price 20
245+
246+
for 1min data:
247+
.. code-block:: bash
248+
249+
python scripts/check_data_health.py check_data --qlib_dir ~/.qlib/qlib_data/cn_data --freq 1min --missing_data_num 35806 --large_step_threshold_volume 3205452000000 --large_step_threshold_price 0.91
250+
200251
Stock Pool (Market)
201252
-------------------
202253

scripts/check_data_health.py

Lines changed: 203 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,203 @@
1+
from loguru import logger
2+
import os
3+
from typing import Optional
4+
5+
import fire
6+
import pandas as pd
7+
import qlib
8+
from tqdm import tqdm
9+
10+
from qlib.data import D
11+
12+
13+
class DataHealthChecker:
14+
"""Checks a dataset for data completeness and correctness. The data will be converted to a pd.DataFrame and checked for the following problems:
15+
- any of the columns ["open", "high", "low", "close", "volume"] are missing
16+
- any data is missing
17+
- any step change in the OHLCV columns is above a threshold (default: 0.5 for price, 3 for volume)
18+
- any factor is missing
19+
"""
20+
21+
def __init__(
22+
self,
23+
csv_path=None,
24+
qlib_dir=None,
25+
freq="day",
26+
large_step_threshold_price=0.5,
27+
large_step_threshold_volume=3,
28+
missing_data_num=0,
29+
):
30+
assert csv_path or qlib_dir, "One of csv_path or qlib_dir should be provided."
31+
assert not (csv_path and qlib_dir), "Only one of csv_path or qlib_dir should be provided."
32+
33+
self.data = {}
34+
self.problems = {}
35+
self.freq = freq
36+
self.large_step_threshold_price = large_step_threshold_price
37+
self.large_step_threshold_volume = large_step_threshold_volume
38+
self.missing_data_num = missing_data_num
39+
40+
if csv_path:
41+
assert os.path.isdir(csv_path), f"{csv_path} should be a directory."
42+
files = [f for f in os.listdir(csv_path) if f.endswith(".csv")]
43+
for filename in tqdm(files, desc="Loading data"):
44+
df = pd.read_csv(os.path.join(csv_path, filename))
45+
self.data[filename] = df
46+
47+
elif qlib_dir:
48+
qlib.init(provider_uri=qlib_dir)
49+
self.load_qlib_data()
50+
51+
def load_qlib_data(self):
52+
instruments = D.instruments(market="all")
53+
instrument_list = D.list_instruments(instruments=instruments, as_list=True, freq=self.freq)
54+
required_fields = ["$open", "$close", "$low", "$high", "$volume", "$factor"]
55+
for instrument in instrument_list:
56+
df = D.features([instrument], required_fields, freq=self.freq)
57+
df.rename(
58+
columns={
59+
"$open": "open",
60+
"$close": "close",
61+
"$low": "low",
62+
"$high": "high",
63+
"$volume": "volume",
64+
"$factor": "factor",
65+
},
66+
inplace=True,
67+
)
68+
self.data[instrument] = df
69+
print(df)
70+
71+
def check_missing_data(self) -> Optional[pd.DataFrame]:
72+
"""Check if any data is missing in the DataFrame."""
73+
result_dict = {
74+
"instruments": [],
75+
"open": [],
76+
"high": [],
77+
"low": [],
78+
"close": [],
79+
"volume": [],
80+
}
81+
for filename, df in self.data.items():
82+
missing_data_columns = df.isnull().sum()[df.isnull().sum() > self.missing_data_num].index.tolist()
83+
if len(missing_data_columns) > 0:
84+
result_dict["instruments"].append(filename)
85+
result_dict["open"].append(df.isnull().sum()["open"])
86+
result_dict["high"].append(df.isnull().sum()["high"])
87+
result_dict["low"].append(df.isnull().sum()["low"])
88+
result_dict["close"].append(df.isnull().sum()["close"])
89+
result_dict["volume"].append(df.isnull().sum()["volume"])
90+
91+
result_df = pd.DataFrame(result_dict).set_index("instruments")
92+
if not result_df.empty:
93+
return result_df
94+
else:
95+
logger.info(f"✅ There are no missing data.")
96+
return None
97+
98+
def check_large_step_changes(self) -> Optional[pd.DataFrame]:
99+
"""Check if there are any large step changes above the threshold in the OHLCV columns."""
100+
result_dict = {
101+
"instruments": [],
102+
"col_name": [],
103+
"date": [],
104+
"pct_change": [],
105+
}
106+
for filename, df in self.data.items():
107+
affected_columns = []
108+
for col in ["open", "high", "low", "close", "volume"]:
109+
if col in df.columns:
110+
pct_change = df[col].pct_change(fill_method=None).abs()
111+
threshold = self.large_step_threshold_volume if col == "volume" else self.large_step_threshold_price
112+
if pct_change.max() > threshold:
113+
large_steps = pct_change[pct_change > threshold]
114+
result_dict["instruments"].append(filename)
115+
result_dict["col_name"].append(col)
116+
result_dict["date"].append(large_steps.index.to_list()[0][1].strftime("%Y-%m-%d"))
117+
result_dict["pct_change"].append(pct_change.max())
118+
affected_columns.append(col)
119+
120+
result_df = pd.DataFrame(result_dict).set_index("instruments")
121+
if not result_df.empty:
122+
return result_df
123+
else:
124+
logger.info(f"✅ There are no large step changes in the OHLCV column above the threshold.")
125+
return None
126+
127+
def check_required_columns(self) -> Optional[pd.DataFrame]:
128+
"""Check if any of the required columns (OLHCV) are missing in the DataFrame."""
129+
required_columns = ["open", "high", "low", "close", "volume"]
130+
result_dict = {
131+
"instruments": [],
132+
"missing_col": [],
133+
}
134+
for filename, df in self.data.items():
135+
if not all(column in df.columns for column in required_columns):
136+
missing_required_columns = [column for column in required_columns if column not in df.columns]
137+
result_dict["instruments"].append(filename)
138+
result_dict["missing_col"] += missing_required_columns
139+
140+
result_df = pd.DataFrame(result_dict).set_index("instruments")
141+
if not result_df.empty:
142+
return result_df
143+
else:
144+
logger.info(f"✅ The columns (OLHCV) are complete and not missing.")
145+
return None
146+
147+
def check_missing_factor(self) -> Optional[pd.DataFrame]:
148+
"""Check if the 'factor' column is missing in the DataFrame."""
149+
result_dict = {
150+
"instruments": [],
151+
"missing_factor_col": [],
152+
"missing_factor_data": [],
153+
}
154+
for filename, df in self.data.items():
155+
if "000300" in filename or "000903" in filename or "000905" in filename:
156+
continue
157+
if "factor" not in df.columns:
158+
result_dict["instruments"].append(filename)
159+
result_dict["missing_factor_col"].append(True)
160+
if df["factor"].isnull().all():
161+
if filename in result_dict["instruments"]:
162+
result_dict["missing_factor_data"].append(True)
163+
else:
164+
result_dict["instruments"].append(filename)
165+
result_dict["missing_factor_col"].append(False)
166+
result_dict["missing_factor_data"].append(True)
167+
168+
result_df = pd.DataFrame(result_dict).set_index("instruments")
169+
if not result_df.empty:
170+
return result_df
171+
else:
172+
logger.info(f"✅ The `factor` column already exists and is not empty.")
173+
return None
174+
175+
def check_data(self):
176+
check_missing_data_result = self.check_missing_data()
177+
check_large_step_changes_result = self.check_large_step_changes()
178+
check_required_columns_result = self.check_required_columns()
179+
check_missing_factor_result = self.check_missing_factor()
180+
if (
181+
check_large_step_changes_result is not None
182+
or check_large_step_changes_result is not None
183+
or check_required_columns_result is not None
184+
or check_missing_factor_result is not None
185+
):
186+
print(f"\nSummary of data health check ({len(self.data)} files checked):")
187+
print("-------------------------------------------------")
188+
if isinstance(check_missing_data_result, pd.DataFrame):
189+
logger.warning(f"There is missing data.")
190+
print(check_missing_data_result)
191+
if isinstance(check_large_step_changes_result, pd.DataFrame):
192+
logger.warning(f"The OHLCV column has large step changes.")
193+
print(check_large_step_changes_result)
194+
if isinstance(check_required_columns_result, pd.DataFrame):
195+
logger.warning(f"Columns (OLHCV) are missing.")
196+
print(check_required_columns_result)
197+
if isinstance(check_missing_factor_result, pd.DataFrame):
198+
logger.warning(f"The factor column does not exist or is empty")
199+
print(check_missing_factor_result)
200+
201+
202+
if __name__ == "__main__":
203+
fire.Fire(DataHealthChecker)

0 commit comments

Comments
 (0)