@@ -477,6 +477,38 @@ class DataIter(ABC): # pylint: disable=too-many-instance-attributes
`X`) as key. Don't repeat the `X` for multiple batches with different meta data
(like `label`), make a copy if necessary.
+ .. note::
+
+     When the input for each batch is a DataFrame, we assume categories are
+     consistently encoded for all batches. For example, given two dataframes for two
+     batches, this is invalid:
+
+     .. code-block::
+
+         import pandas as pd
+
+         x0 = pd.DataFrame({"a": [0, 1]}, dtype="category")
+         x1 = pd.DataFrame({"a": [1, 2]}, dtype="category")
+
+     This is invalid because `x0` has `[0, 1]` as categories while `x1` has
+     `[1, 2]`. Both should share the same set of categories and the same encoding:
+
+     .. code-block::
+
+         import numpy as np
+
+         categories = np.array([0, 1, 2])
+         x0["a"] = pd.Categorical.from_codes(
+             codes=np.array([0, 1]), categories=categories
+         )
+         x1["a"] = pd.Categorical.from_codes(
+             codes=np.array([1, 2]), categories=categories
+         )
+
+     You can ensure consistent encoding in your preprocessing step. Be careful
+     that the data is stored in formats that preserve the encoding when chunking the
+     data.
+
Parameters
----------
cache_prefix :
cache_prefix :
@@ -861,15 +893,16 @@ def __init__(
Experimental support of specializing for categorical features.
- If passing 'True' and 'data' is a data frame (from supported libraries such
- as Pandas, Modin or cuDF), columns of categorical types will automatically
- be set to be of categorical type (feature_type='c') in the resulting
- DMatrix.
+ If passing `True` and `data` is a data frame (from supported libraries such as
+ Pandas, Modin or cuDF), the DMatrix recognizes categorical columns and
+ automatically sets the `feature_types` parameter. If `data` is not a data
+ frame, this argument is ignored.
- If passing 'False' and 'data' is a data frame with categorical columns,
- it will result in an error being thrown.
+ If passing `False` and `data` is a data frame with categorical columns, an
+ error is raised.
- If 'data' is not a data frame, this argument is ignored.
+ See the notes in :py:class:`DataIter` for the consistency requirement when the
+ input is an iterator.
JSON/UBJSON serialization format is required for this.
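
The consistency requirement described in the `DataIter` note can be met during preprocessing. The following is a minimal sketch, not part of this change, assuming the chunks fit in memory long enough to collect their distinct values; `raw_chunks`, `shared_dtype`, and `batches` are hypothetical names used only for illustration. It builds one `pandas.CategoricalDtype` from the union of values and casts every chunk with it, so the same value maps to the same integer code in every batch.

```python
import pandas as pd

# Hypothetical raw chunks; in practice these might come from separate files.
raw_chunks = [
    pd.DataFrame({"a": [0, 1]}),
    pd.DataFrame({"a": [1, 2]}),
]

# Collect every value seen across all chunks and fix a single ordering.
all_values = pd.concat([chunk["a"] for chunk in raw_chunks]).unique()
shared_dtype = pd.CategoricalDtype(categories=sorted(all_values))

# Cast each chunk with the shared dtype so the integer codes agree everywhere.
batches = [chunk.astype({"a": shared_dtype}) for chunk in raw_chunks]

for i, batch in enumerate(batches):
    col = batch["a"]
    print(i, col.cat.categories.tolist(), col.cat.codes.tolist())
```

Each resulting frame carries the full category list, so a value that is absent from one batch still keeps its code in the others. If the batches are written to disk between preprocessing and training, a plain-text format such as CSV does not retain the categorical dtype, and the cast would need to be reapplied after loading.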