
Commit e132f12

Merge branch 'realese/v0.5.0'
2 parents e549140 + 73f3ae2

File tree

16 files changed: +369 -45 lines changed

.readthedocs.yml

Lines changed: 11 additions & 5 deletions
@@ -4,21 +4,27 @@
 # Required
 version: 2
 
+# Build documentation in the docs/ directory with Sphinx
+build:
+  os: ubuntu-20.04
+  tools:
+    python: "3.8"
+  # jobs:
+  #   pre_build:
+  #     - cp -r notebooks docs/
+
 # Build documentation in the docs/ directory with Sphinx
 sphinx:
+  builder: html
   configuration: docs/conf.py
-
-# Build documentation with MkDocs
-#mkdocs:
-#  configuration: mkdocs.yml
+  fail_on_warning: false
 
 # Optionally build your docs in additional formats such as PDF and ePub
 formats:
   - htmlzip
 
 # Optionally set the version of Python and requirements required to build your docs
 python:
-  version: 3.7
   install:
     - requirements: docs/requirements.txt
     - requirements: requirements.txt

Readme.rst

Lines changed: 1 addition & 1 deletion
@@ -265,5 +265,5 @@ Tags
 
 **RU**: аплифт моделирование, Uplift модель
 
-**ZH**: 隆起建模,因果推断,因果效应,因果关系,个人治疗效应,真正的电梯,净电梯
+**ZH**: uplift增量建模, 因果推断, 因果效应, 因果关系, 个体干预因果效应, 真实增量, 净增量, 增量建模
 

Binary image file added (38.8 KB; content not shown)

docs/api/metrics/index.rst

Lines changed: 1 addition & 0 deletions
@@ -17,4 +17,5 @@
     ./response_rate_by_percentile
     ./treatment_balance_curve
     ./average_squared_deviation
+    ./max_prof_uplift
     ./make_uplift_scorer
docs/api/metrics/max_prof_uplift.rst

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+**********************************************
+`sklift.metrics <./>`_.max_prof_uplift
+**********************************************
+
+.. autofunction:: sklift.metrics.metrics.max_prof_uplift

docs/changelog.md

Lines changed: 20 additions & 1 deletion
@@ -8,9 +8,28 @@
 * 🔨 something that previously didn’t work as documented – or according to reasonable expectations – should now work.
 * ❗️ you will need to change your code to have the same effect in the future; or a feature will be removed in the future.
 
+## Version 0.5.0
+
+### [sklift.models](https://www.uplift-modeling.com/en/v0.5.0/api/models/index.html)
+
+* 🔥 Add [ClassTransformationReg](https://www.uplift-modeling.com/en/v0.5.0/api/models.html#sklift.models.models.TwoModels) model by [@mcullan](https://github.com/mcullan) and [@ElisovaIra](https://github.com/ElisovaIra).
+* 🔨 Add the ability to process a series with different indexes in the [TwoModels](https://www.uplift-modeling.com/en/v0.5.0/api/models.html#sklift.models.models.TwoModels) by [@flashlight101](https://github.com/flashlight101).
+
+### [sklift.metrics](https://www.uplift-modeling.com/en/v0.5.0/api/index/metrics.html)
+
+* 🔥 Add new metric [Maximum profit uplift measure](https://www.uplift-modeling.com/en/v0.5.0/api/metrics/max_prof_uplift.html) by [@rooti123](https://github.com/rooti123).
+
+### [sklift.datasets](https://www.uplift-modeling.com/en/v0.5.0/api/datasets/index.html)
+
+* 💥 Add a hash-based checker for all datasets by [@flashlight101](https://github.com/flashlight101).
+* 📝 Add [scheme](https://www.uplift-modeling.com/en/v0.5.0/api/datasets/fetch_x5.html) of x5 dataframes.
+
+### Miscellaneous
+* 📝 Improve Chinese tags by [@00helloworld](https://github.com/00helloworld).
+
 ## Version 0.4.1
 
-### [sklift.datasets](https://www.uplift-modeling.com/en/v0.4.0/api/datasets/index.html)
+### [sklift.datasets](https://www.uplift-modeling.com/en/v0.4.1/api/datasets/index.html)
 
 * 🔨 Fix bug in dataset links.
 * 📝 Add about a company section
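The hash-based checker noted in this changelog changes how a corrupted cache shows up to users: the fetch functions now raise a ValueError instead of silently returning a broken file. A minimal sketch of the expected workflow, assuming fetch_lenta and clear_data_dir (defined in sklift/datasets/datasets.py below) are importable from sklift.datasets; what the fetch function returns is unchanged by this release:

    # Sketch: recovering from a corrupted cached dataset (sklift >= 0.5.0 behaviour assumed)
    from sklift.datasets import fetch_lenta, clear_data_dir

    try:
        # downloads lenta_dataset.csv.gz on first call, then reuses the cached copy;
        # since v0.5.0 the cached file's MD5 is compared against a stored hash
        dataset = fetch_lenta()
    except ValueError:
        # raised when the cached file does not match the expected hash,
        # e.g. after an interrupted download: drop the cache and retry
        clear_data_dir()
        dataset = fetch_lenta()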

docs/index.rst

Lines changed: 1 addition & 1 deletion
@@ -153,4 +153,4 @@ Tags
 
 **RU**: аплифт моделирование, Uplift модель
 
-**ZH**: 隆起建模,因果推断,因果效应,因果关系,个人治疗效应,真正的电梯,净电梯
+**ZH**: uplift增量建模, 因果推断, 因果效应, 因果关系, 个体干预因果效应, 真实增量, 净增量, 增量建模

docs/requirements.txt

Lines changed: 2 additions & 2 deletions
@@ -1,4 +1,4 @@
-sphinx-autobuild
-sphinx_rtd_theme
+sphinx==5.1.1
+sphinx-rtd-theme==1.0.0
 myst-parser
 sphinxcontrib-bibtex

sklift/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-__version__ = '0.4.1'
+__version__ = '0.5.0'
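Since sklift/__init__.py is the top-level package module, the bumped version is visible at runtime; a quick check:

    import sklift
    print(sklift.__version__)  # expected to print '0.5.0' once this release is installed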

sklift/datasets/datasets.py

Lines changed: 87 additions & 24 deletions
@@ -1,5 +1,6 @@
 import os
 import shutil
+import hashlib
 
 import pandas as pd
 import requests
@@ -95,6 +96,11 @@ def _get_data(data_home, url, dest_subdir, dest_filename, download_if_missing,
         raise IOError("Dataset missing")
     return dest_path
 
+def _get_file_hash(csv_path):
+    with open(csv_path, 'rb') as file_to_check:
+        data = file_to_check.read()
+    return hashlib.md5(data).hexdigest()
+
 
 def clear_data_dir(path=None):
     """Delete all the content of the data home cache.
@@ -170,11 +176,19 @@ def fetch_lenta(data_home=None, dest_subdir=None, download_if_missing=True, retu
         :func:`.fetch_megafon`: Load and return the MegaFon Uplift Competition dataset (classification).
     """
 
-    url = 'https://sklift.s3.eu-west-2.amazonaws.com/lenta_dataset.csv.gz'
-    filename = url.split('/')[-1]
-    csv_path = _get_data(data_home=data_home, url=url, dest_subdir=dest_subdir,
+    lenta_metadata = {
+        'url': 'https://sklift.s3.eu-west-2.amazonaws.com/lenta_dataset.csv.gz',
+        'hash': '6ab28ff0989ed8b8647f530e2e86452f'
+    }
+
+    filename = lenta_metadata['url'].split('/')[-1]
+    csv_path = _get_data(data_home=data_home, url=lenta_metadata['url'], dest_subdir=dest_subdir,
                          dest_filename=filename,
                          download_if_missing=download_if_missing)
+
+    if _get_file_hash(csv_path) != lenta_metadata['hash']:
+        raise ValueError(f"The {filename} file is broken,\
+            please clean the directory with the clean_data_dir function, and run the function again")
 
     target_col = 'response_att'
     treatment_col = 'group'
@@ -262,11 +276,24 @@ def fetch_x5(data_home=None, dest_subdir=None, download_if_missing=True):
 
         :func:`.fetch_megafon`: Load and return the MegaFon Uplift Competition dataset (classification).
     """
-    url_train = 'https://sklift.s3.eu-west-2.amazonaws.com/uplift_train.csv.gz'
-    file_train = url_train.split('/')[-1]
-    csv_train_path = _get_data(data_home=data_home, url=url_train, dest_subdir=dest_subdir,
+
+    x5_metadata = {
+        'url_train': 'https://sklift.s3.eu-west-2.amazonaws.com/uplift_train.csv.gz',
+        'url_clients': 'https://sklift.s3.eu-west-2.amazonaws.com/clients.csv.gz',
+        'url_purchases': 'https://sklift.s3.eu-west-2.amazonaws.com/purchases.csv.gz',
+        'uplift_hash': '2720bbb659daa9e0989b2777b6a42d19',
+        'clients_hash': 'b9cdeb2806b732771de03e819b3354c5',
+        'purchases_hash': '48d2de13428e24e8b61d66fef02957a8'
+    }
+    file_train = x5_metadata['url_train'].split('/')[-1]
+    csv_train_path = _get_data(data_home=data_home, url=x5_metadata['url_train'], dest_subdir=dest_subdir,
                                dest_filename=file_train,
                                download_if_missing=download_if_missing)
+
+    if _get_file_hash(csv_train_path) != x5_metadata['uplift_hash']:
+        raise ValueError(f"The {file_train} file is broken,\
+            please clean the directory with the clean_data_dir function, and run the function again")
+
     train = pd.read_csv(csv_train_path)
     train_features = list(train.columns)
@@ -277,19 +304,27 @@ def fetch_x5(data_home=None, dest_subdir=None, download_if_missing=True):
 
     train = train.drop([target_col, treatment_col], axis=1)
 
-    url_clients = 'https://sklift.s3.eu-west-2.amazonaws.com/clients.csv.gz'
-    file_clients = url_clients.split('/')[-1]
-    csv_clients_path = _get_data(data_home=data_home, url=url_clients, dest_subdir=dest_subdir,
+    file_clients = x5_metadata['url_clients'].split('/')[-1]
+    csv_clients_path = _get_data(data_home=data_home, url=x5_metadata['url_clients'], dest_subdir=dest_subdir,
                                  dest_filename=file_clients,
                                  download_if_missing=download_if_missing)
+
+    if _get_file_hash(csv_clients_path) != x5_metadata['clients_hash']:
+        raise ValueError(f"The {file_clients} file is broken,\
+            please clean the directory with the clean_data_dir function, and run the function again")
+
     clients = pd.read_csv(csv_clients_path)
     clients_features = list(clients.columns)
 
-    url_purchases = 'https://sklift.s3.eu-west-2.amazonaws.com/purchases.csv.gz'
-    file_purchases = url_purchases.split('/')[-1]
-    csv_purchases_path = _get_data(data_home=data_home, url=url_purchases, dest_subdir=dest_subdir,
+    file_purchases = x5_metadata['url_purchases'].split('/')[-1]
+    csv_purchases_path = _get_data(data_home=data_home, url=x5_metadata['url_purchases'], dest_subdir=dest_subdir,
                                    dest_filename=file_purchases,
                                    download_if_missing=download_if_missing)
+
+    if _get_file_hash(csv_purchases_path) != x5_metadata['purchases_hash']:
+        raise ValueError(f"The {file_purchases} file is broken,\
+            please clean the directory with the clean_data_dir function, and run the function again")
+
     purchases = pd.read_csv(csv_purchases_path)
     purchases_features = list(purchases.columns)
 
@@ -391,16 +426,27 @@ def fetch_criteo(target_col='visit', treatment_col='treatment', data_home=None,
         raise ValueError(f"The target_col must be an element of {target_cols + ['all']}. "
                          f"Got value target_col={target_col}.")
 
+    criteo_metadata = {
+        'url': '',
+        'criteo_hash': ''
+    }
+
     if percent10:
-        url = 'https://criteo-bucket.s3.eu-central-1.amazonaws.com/criteo10.csv.gz'
+        criteo_metadata['url'] = 'https://criteo-bucket.s3.eu-central-1.amazonaws.com/criteo10.csv.gz'
+        criteo_metadata['criteo_hash'] = 'fe159bcee2cea57548e48eb2d7d5d00c'
     else:
-        url = "https://criteo-bucket.s3.eu-central-1.amazonaws.com/criteo.csv.gz"
+        criteo_metadata['url'] = "https://criteo-bucket.s3.eu-central-1.amazonaws.com/criteo.csv.gz"
+        criteo_metadata['criteo_hash'] = 'd2236769ef69e9be52556110102911ec'
 
-    filename = url.split('/')[-1]
-    csv_path = _get_data(data_home=data_home, url=url, dest_subdir=dest_subdir,
+    filename = criteo_metadata['url'].split('/')[-1]
+    csv_path = _get_data(data_home=data_home, url=criteo_metadata['url'], dest_subdir=dest_subdir,
                          dest_filename=filename,
                          download_if_missing=download_if_missing)
 
+    if _get_file_hash(csv_path) != criteo_metadata['criteo_hash']:
+        raise ValueError(f"The {filename} file is broken,\
+            please clean the directory with the clean_data_dir function, and run the function again")
+
     dtypes = {
         'exposure': 'Int8',
         'treatment': 'Int8',
@@ -497,11 +543,19 @@ def fetch_hillstrom(target_col='visit', data_home=None, dest_subdir=None, downlo
         raise ValueError(f"The target_col must be an element of {target_cols + ['all']}. "
                          f"Got value target_col={target_col}.")
 
-    url = 'https://hillstorm1.s3.us-east-2.amazonaws.com/hillstorm_no_indices.csv.gz'
-    filename = url.split('/')[-1]
-    csv_path = _get_data(data_home=data_home, url=url, dest_subdir=dest_subdir,
+    hillstrom_metadata = {
+        'url': 'https://hillstorm1.s3.us-east-2.amazonaws.com/hillstorm_no_indices.csv.gz',
+        'hillstrom_hash': 'a68a81291f53a14f4e29002629803ba3'
+    }
+
+    filename = hillstrom_metadata['url'].split('/')[-1]
+    csv_path = _get_data(data_home=data_home, url=hillstrom_metadata['url'], dest_subdir=dest_subdir,
                          dest_filename=filename,
                          download_if_missing=download_if_missing)
+
+    if _get_file_hash(csv_path) != hillstrom_metadata['hillstrom_hash']:
+        raise ValueError(f"The {filename} file is broken,\
+            please clean the directory with the clean_data_dir function, and run the function again")
 
     treatment_col = 'segment'
 
@@ -582,12 +636,21 @@ def fetch_megafon(data_home=None, dest_subdir=None, download_if_missing=True,
         :func:`.fetch_hillstrom`: Load and return Kevin Hillstrom Dataset MineThatData (classification or regression).
 
     """
-    url_train = 'https://sklift.s3.eu-west-2.amazonaws.com/megafon_dataset.csv.gz'
-    file_train = url_train.split('/')[-1]
-    csv_train_path = _get_data(data_home=data_home, url=url_train, dest_subdir=dest_subdir,
-                               dest_filename=file_train,
+    megafon_metadata = {
+        'url': 'https://sklift.s3.eu-west-2.amazonaws.com/megafon_dataset.csv.gz',
+        'megafon_hash': 'ee8d45a343d4d2cf90bb756c93959ecd'
+    }
+
+    filename = megafon_metadata['url'].split('/')[-1]
+    csv_path = _get_data(data_home=data_home, url=megafon_metadata['url'], dest_subdir=dest_subdir,
+                         dest_filename=filename,
                                download_if_missing=download_if_missing)
-    train = pd.read_csv(csv_train_path)
+
+    if _get_file_hash(csv_path) != megafon_metadata['megafon_hash']:
+        raise ValueError(f"The {filename} file is broken,\
+            please clean the directory with the clean_data_dir function, and run the function again")
+
+    train = pd.read_csv(csv_path)
 
     target_col = 'conversion'
     treatment_col = 'treatment_group'
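One design note on the checker added throughout this file: _get_file_hash reads each archive fully into memory before hashing, which is the simplest approach but can be costly for the larger downloads such as the full Criteo archive. A hedged sketch of a chunked alternative, not part of this commit and with an illustrative name, that keeps memory usage flat:

    import hashlib

    def _get_file_hash_chunked(csv_path, chunk_size=1 << 20):
        # stream the file through hashlib.md5 in 1 MiB chunks instead of read()-ing it whole
        md5 = hashlib.md5()
        with open(csv_path, 'rb') as file_to_check:
            for chunk in iter(lambda: file_to_check.read(chunk_size), b''):
                md5.update(chunk)
        return md5.hexdigest()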
