Commit 1cbcb45

fix truncated feature value error when padding string sequence in run_multivalue_movielens_hash.py
1 parent fc49d2f commit 1cbcb45
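
The truncation named in the commit message can be reproduced with NumPy alone. The sketch below is a minimal illustration, not the Keras implementation: `pad_post` is a hypothetical stand-in that assumes `pad_sequences` allocates its output with a call like `np.full(shape, value, dtype=dtype)`, and the genre strings are made-up toy data. Under that assumption, `dtype=str` with `value=0` yields a one-character string array, so every longer value is silently cut, while `dtype=object` followed by `.astype(str)` keeps the full strings.

```python
import numpy as np

# Hypothetical toy data standing in for the split MovieLens genre strings.
genres_list = [["Comedy", "Drama"], ["Action"]]
max_len = max(len(row) for row in genres_list)


def pad_post(rows, maxlen, dtype, value=0):
    """Simplified stand-in for post-padding (assumption, not the Keras code)."""
    # With dtype=str the array width is fixed by the fill value "0": one character.
    out = np.full((len(rows), maxlen), value, dtype=dtype)
    for i, row in enumerate(rows):
        # Assigning longer strings into a fixed-width array truncates them silently.
        out[i, :len(row)] = row
    return out


# Old call style: fixed-width string array, values are cut to one character.
truncated = pad_post(genres_list, max_len, dtype=str)

# New call style: pad as Python objects, then convert to strings of full width.
fixed = pad_post(genres_list, max_len, dtype=object).astype(str)

print(truncated.dtype, truncated.tolist())
print(fixed.dtype, fixed.tolist())
```

Truncated values matter for the hashing examples in particular, because distinct genres that share a first character would be hashed to the same bucket.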

3 files changed: 14 additions & 18 deletions

docs/source/Examples.md

Lines changed: 12 additions & 14 deletions
@@ -187,7 +187,6 @@ if __name__ == "__main__":
 ```
 
 ## Multi-value Input : Movielens
-----------------------------------
 
 The MovieLens data has been used for personalized tag recommendation,which contains 668, 953 tag applications of users
 on movies. Here is a small fraction of data include sparse fields and a multivalent field.
@@ -275,7 +274,6 @@ if __name__ == "__main__":
 ```
 
 ## Multi-value Input : Movielens with feature hashing on the fly
-----------------------------------
 
 ```python
 import numpy as np
@@ -300,7 +298,7 @@ if __name__ == "__main__":
 max_len = max(genres_length)
 
 # Notice : padding=`post`
-genres_list = pad_sequences(genres_list, maxlen=max_len, padding='post', dtype=str, value=0)
+genres_list = pad_sequences(genres_list, maxlen=max_len, padding='post', dtype=object, value=0).astype(str)
 
 # 2.set hashing space for each sparse field and generate feature config for sequence feature
 
@@ -358,7 +356,7 @@ if __name__ == "__main__":
 max_len = max(genres_length)
 
 # Notice : padding=`post`
-genres_list = pad_sequences(genres_list, maxlen=max_len, padding='post', dtype=str, value=0)
+genres_list = pad_sequences(genres_list, maxlen=max_len, padding='post', dtype=object, value=0).astype(str)
 
 # 2.set hashing space for each sparse field and generate feature config for sequence feature
 
@@ -521,11 +519,11 @@ if __name__ == "__main__":
 The UCI census-income dataset is extracted from the 1994 census database. It contains 299,285 instances of demographic
 information of American adults. There are 40 features in total. We construct a multi-task learning problem from this
 dataset by setting some of the features as prediction targets :
+
 - Task 1: Predict whether the income exceeds $50K;
-- Task 2: Predict whether this person’s marital status is never married.
+- Task 2: Predict whether this person’s marital status is never married.
 
-This example shows how to use ``MMOE`` to solve a multi
-task learning problem. You can get the demo
+This example shows how to use ``MMOE`` to solve a multi task learning problem. You can get the demo
 data [census-income.sample](https://github.com/shenweichen/DeepCTR/tree/master/examples/census-income.sample) and run
 the following codes.
 
@@ -572,29 +570,29 @@ if __name__ == "__main__":
 data[feat] = lbe.fit_transform(data[feat])
 
 fixlen_feature_columns = [SparseFeat(feat, data[feat].max() + 1, embedding_dim=4) for feat in sparse_features]
-+ [DenseFeat(feat, 1, ) for feat in dense_features]
++ [DenseFeat(feat, 1, ) for feat in dense_features]
 
 dnn_feature_columns = fixlen_feature_columns
 linear_feature_columns = fixlen_feature_columns
-
+
 feature_names = get_feature_names(linear_feature_columns + dnn_feature_columns)
-
+
 # 3.generate input data for model
-
+
 train, test = train_test_split(data, test_size=0.2, random_state=2020)
 train_model_input = {name: train[name] for name in feature_names}
 test_model_input = {name: test[name] for name in feature_names}
-
+
 # 4.Define Model,train,predict and evaluate
 model = MMOE(dnn_feature_columns, tower_dnn_hidden_units=[], task_types=['binary', 'binary'],
 task_names=['label_income', 'label_marital'])
 model.compile("adam", loss=["binary_crossentropy", "binary_crossentropy"],
 metrics=['binary_crossentropy'], )
-
+
 history = model.fit(train_model_input, [train['label_income'].values, train['label_marital'].values],
 batch_size=256, epochs=10, verbose=2, validation_split=0.2)
 pred_ans = model.predict(test_model_input, batch_size=256)
-
+
 print("test income AUC", round(roc_auc_score(test['label_income'], pred_ans[0]), 4))
 print("test marital AUC", round(roc_auc_score(test['label_marital'], pred_ans[1]), 4))
 
examples/run_multivalue_movielens_hash.py

Lines changed: 1 addition & 2 deletions
@@ -20,8 +20,7 @@
 max_len = max(genres_length)
 
 # Notice : padding=`post`
-genres_list = pad_sequences(genres_list, maxlen=max_len, padding='post', dtype=str, value=0)
-
+genres_list = pad_sequences(genres_list, maxlen=max_len, padding='post', dtype=object, value=0).astype(str)
 # 2.set hashing space for each sparse field and generate feature config for sequence feature
 
 fixlen_feature_columns = [SparseFeat(feat, data[feat].nunique() * 5, embedding_dim=4, use_hash=True, dtype='string')

examples/run_multivalue_movielens_vocab_hash.py

Lines changed: 1 addition & 2 deletions
@@ -24,8 +24,7 @@
 max_len = max(genres_length)
 
 # Notice : padding=`post`
-genres_list = pad_sequences(genres_list, maxlen=max_len, padding='post', dtype=str, value=0)
-
+genres_list = pad_sequences(genres_list, maxlen=max_len, padding='post', dtype=object, value=0).astype(str)
 # 2.set hashing space for each sparse field and generate feature config for sequence feature
 
 fixlen_feature_columns = [SparseFeat(feat, data[feat].nunique() * 5, embedding_dim=4, use_hash=True,
