Welcome to verstack Discussions! #3
Replies: 10 comments 61 replies
-
Hi Danil, I wanted to try it myself, but I cannot install it via pip. The trouble starts at 'Preparing metadata (pyproject.toml)'; the error is 'subprocess-exited-with-error'. Do you have any advice on how to get past this? I am using the latest scikit-learn version. Thank you,
-
This is very unlikely to be the cause, but try downgrading pandas to 1.3.0.
-
Okay. And regarding n_estimators - I have run three experiments, two classification and one regression, and in each of them I received a different number of n_estimators.
Seems to be working as expected. Let's have a look at the versions of the LGBMTuner dependencies you have installed.
I've got: lightgbm: 3.3.2
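For completeness, the installed versions of the other dependencies can be listed with the standard library; the package names below are the usual suspects for an LGBMTuner setup (adjust as needed):

```python
from importlib.metadata import version, PackageNotFoundError

# Print the installed version of each relevant package, if present.
for pkg in ("lightgbm", "optuna", "pandas", "scikit-learn"):
    try:
        print(f"{pkg}: {version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```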
-
Hi Danil, I started testing scsplit. It looks like it introduces NaNs into the dataframe it is splitting into train and test. I use a dataframe "dataset" with continuous labels in the last column. Before using scsplit ("train, test = scsplit(dataset, stratify=dataset['trns_cnt'])") I check for NaNs with "dataset.isna().sum().sum()" and get zero, but when I run scsplit as "X_train, X_test, y_train, y_test = scsplit(X, y, stratify=y, test_size=0.3)" I get "ValueError: Input y contains NaN". What could be the reason for that? Thank you, Nick
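One way to narrow this down, independent of scsplit, is to verify that the X/y pair passed to the second call is itself NaN-free; a common source of such NaNs in pandas is index misalignment when X and y are built from different objects. A minimal pandas-only diagnostic (the column names are placeholders standing in for the real "dataset"):

```python
import pandas as pd

# Hypothetical stand-in for the user's "dataset"; 'trns_cnt' is the
# continuous label column mentioned above.
dataset = pd.DataFrame({"feature": [1.0, 2.0, 3.0, 4.0],
                        "trns_cnt": [0.1, 0.2, 0.3, 0.4]})

X = dataset.drop(columns="trns_cnt")
y = dataset["trns_cnt"]

# If all three checks pass, the NaN reported by scsplit appears inside
# the split itself rather than in the inputs.
assert dataset.isna().sum().sum() == 0
assert not y.isna().any()
assert X.index.equals(y.index)  # misaligned indices silently produce NaN
```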
-
Hi Danil, hope you had a nice New Year celebration. Have a nice Staryi Novyi God as well. I wanted to share some thoughts on the package related to dealing with categorical features. In my view, LightGBM has a great advantage in treating categorical features, especially on very large datasets: you do not need one-hot encoding, which blows up memory requirements for sets with a very large number of records. When I use verstack to get the integration with Optuna, I am forced to encode the categorical columns first, and this severely limits the size of the dataset I can process with verstack. Do you plan to eliminate the need for categorical column encoding, to enable the use of your package with large datasets? Nikolay
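For context, LightGBM's native handling only requires that categorical columns carry pandas' "category" dtype, with no one-hot expansion; a minimal sketch of that preparation step, with made-up column names:

```python
import pandas as pd

df = pd.DataFrame({"city": ["kyiv", "lviv", "kyiv", "odesa"],
                   "target": [0, 1, 0, 1]})

# LightGBM consumes 'category' columns directly, working off the compact
# integer codes instead of a memory-hungry one-hot encoding.
df["city"] = df["city"].astype("category")
print(df["city"].cat.codes.tolist())  # → [0, 1, 0, 2]
```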
-
Yes, Danil, I was talking about LGBMTuner when I mentioned Optuna. My point here is very simple: LightGBM has a great built-in capability to handle categorical features. Why give that up and process these features in an additional step, at the expense of performance and resources? In my case, for instance, the dataset size limitation is prohibitive to the extent that I have to give up LGBMTuner and use Optuna together with LightGBM as if there were no verstack. So it's a wish-list item for me: keep verstack's great integration with Optuna, but without the separate encoding of categorical features (which LightGBM can already do itself).
-
Hi Danil, the version of verstack on conda is 0.4.0, while the latest pip version is 3.6.6. Is it possible to push an updated verstack to conda? Nikolay
-
I know, but it is for the cloud, and the very old version of verstack on conda is considered a risk by our IT :-(
-
Hi Danil, how can I suppress the automatic output of "Best threshold(s)" while running ThreshTuner()? Nick
-
@nicktishchenko
It is covered in the documentation under 'LGBMTuner/Examples': https://verstack.readthedocs.io/en/latest/#id16
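As a generic fallback, if no suitable verbosity switch is found in those docs, any printed output can be silenced with the standard library's `contextlib.redirect_stdout`; the `noisy_fit` function below is only a stand-in for a ThreshTuner call:

```python
import io
from contextlib import redirect_stdout

def noisy_fit():
    # Hypothetical stand-in for a tuner call that prints its results.
    print("Best threshold(s): 0.42")
    return 0.42

# Capture stdout so the tuner's prints never reach the console.
buf = io.StringIO()
with redirect_stdout(buf):
    result = noisy_fit()

print(result)  # → 0.42 (the return value is unaffected; only prints are captured)
```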
-
👋 Welcome!
We’re using Discussions as a place to connect with other members of our community. We hope that you:
Remember that this is a community we build together 💪.
To get started, comment below with an introduction of yourself and tell us about what you do with this community.