Bug Description
Service Version: 0.9.0
The discrete feature values in the gcformat sample data generated by the OpenMLDB SQL feature extraction script are inconsistent with those calculated by the PICO script.
Expected Behavior
Current incorrect format: label| slot:sign:origin-value
Correct format: label index| slot:sign:origin-value
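A minimal sketch of how the two headers could be told apart when checking generated samples. The grammar is only inferred from the two format strings above (an optional index between the label and the `|`), and the names `parse_head` and `has_index` are hypothetical helpers, not part of OpenMLDB or PICO.

```python
import re

# Assumed grammar, inferred from the format strings in this report only:
#   correct: "<label> <index>| <feature> <feature> ..."
#   current: "<label>| <feature> <feature> ..."
HEAD = re.compile(r"^(?P<label>[^\s|]+)(?:\s+(?P<index>[^\s|]+))?\|")


def parse_head(line):
    """Return (label, index) from a sample line; index is None when missing."""
    match = HEAD.match(line)
    if match is None:
        raise ValueError("no 'label|' or 'label index|' header found")
    return match.group("label"), match.group("index")


def has_index(line):
    """True when the line carries the 'label index|' header."""
    return parse_head(line)[1] is not None


if __name__ == "__main__":
    for sample in ("0| 1:0:1", "0 0| 2:-8773247204422130117:1"):
        print(sample, "->", parse_head(sample))
```

Running `has_index` over a file of generated samples would flag every line that still uses the header without an index.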
Related Case
OpenMLDB SQL Feature Extraction Example:
0| 1:0:1 2:4599670039981440374 3:6365000770384461703 4:0:93.200000
1| 1:0:2 2:5613161932270271752 3:-1384602352766124944 4:0:93.075000
0| 1:0:3 2:4599670039981440374 3:-6239076729344379818 4:0:92.893000
PICO Feature Extraction Example:
0 0| 2:-8773247204422130117:1 3:4042412524814531440 4:6048373541161169225 5:4681710344575317709:0x1.74ccccccccccdp6
1 1| 2:-8773247204422130117:2 3:6142047291687075953 4:1461111459061395210 5:4681710344575317709:0x1.744cccccccccdp6
0 2| 2:-8773247204422130117:3 3:4042412524814531440 4:3353218529862650678 5:4681710344575317709:0x1.73926e978d4fep6
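The two outputs above can be compared programmatically. The sketch below rests on assumptions read off the sample rows, not on either tool's specification: tokens have the shape `slot:sign` or `slot:sign:value`, PICO slots are shifted by one relative to OpenMLDB slots, and PICO prints continuous values as hexadecimal floats. The helpers `parse_features`, `to_float`, and `sign_mismatches` are hypothetical names.

```python
OPENMLDB_ROW = "0| 1:0:1 2:4599670039981440374 3:6365000770384461703 4:0:93.200000"
PICO_ROW = (
    "0 0| 2:-8773247204422130117:1 3:4042412524814531440 "
    "4:6048373541161169225 5:4681710344575317709:0x1.74ccccccccccdp6"
)


def parse_features(line):
    """Map slot -> (sign, value); value is None when the token has no third field."""
    _, _, body = line.partition("|")
    features = {}
    for token in body.split():
        parts = token.split(":")
        value = parts[2] if len(parts) > 2 else None
        features[int(parts[0])] = (int(parts[1]), value)
    return features


def to_float(text):
    """Decode a value printed either as a decimal or as a hexadecimal float."""
    if text.lower().startswith(("0x", "-0x", "+0x")):
        return float.fromhex(text)
    return float(text)


def sign_mismatches(openmldb_line, pico_line, slot_offset=1):
    """List OpenMLDB slots whose sign differs from the PICO slot paired with them.

    slot_offset=1 is an assumption taken from the samples (slots 1-4 vs 2-5).
    """
    left = parse_features(openmldb_line)
    right = parse_features(pico_line)
    return [
        slot
        for slot in sorted(left)
        if right.get(slot + slot_offset, (None, None))[0] != left[slot][0]
    ]


if __name__ == "__main__":
    print("slots with differing signs:", sign_mismatches(OPENMLDB_ROW, PICO_ROW))
    print("OpenMLDB value:", to_float(parse_features(OPENMLDB_ROW)[4][1]))
    print("PICO value:", to_float(parse_features(PICO_ROW)[5][1]))
```

Decoding the hexadecimal float makes it possible to compare the origin values numerically, so any remaining difference between the two rows can be attributed to the header and the signs.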
Steps to Reproduce
- Data schema:
id[Int],age[Int],job[String],cons_price_idx[Double],y[Int]
- PICO Feature Extraction Script:
target_y = binary_label(y)
f_id = continuous(id)
f_age = discrete(age)
f_job = discrete(job)
f_cons_price_idx = continuous(cons_price_idx)
- OpenMLDB SQL Feature Extraction Script:
select gcformat(
    binary_label(bool(y)),
    continuous(id),
    discrete(age),
    discrete(job),
    continuous(cons_price_idx)
) as instance from main_table