Add data interface requirement on output of features, target, weights

ablaom · ablaom · commit 0d6669829a92 · 2024-10-31T10:46:19.000+13:00
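
The requirement documented by this commit can be illustrated with a minimal sketch. Everything named `ToyRidge` and `ToyFitObs` is hypothetical and not part of LearnAPI.jl. The sketch also assumes that the learner keeps the default data interface and that this default is the random-access interface of MLUtils.jl (`getobs`/`numobs`). It omits `fit`, `predict`, and the `getobs`/`numobs` methods that the `obs` output itself would need.

```julia
using LearnAPI
import MLUtils  # assumed provider of `getobs`/`numobs` for the default data interface

# Hypothetical learner and observation types, for illustration only.
struct ToyRidge
    lambda::Float64
end

struct ToyFitObs{T}
    A::Matrix{T}   # p x n feature matrix; observations are the columns
    y::Vector{T}   # target, one entry per observation
end

# `obs` converts user data `(X, y)` into the learner-specific representation.
# Here `X` is assumed to be an n x p matrix (observations as rows).
LearnAPI.obs(::ToyRidge, data::Tuple) = ToyFitObs(permutedims(data[1]), data[2])

# Because `obs` now returns a new type, `features` and `target` need methods for it.
# Their outputs (a matrix and a vector) support `getobs`/`numobs` out of the box.
LearnAPI.features(::ToyRidge, observations::ToyFitObs) = observations.A
LearnAPI.target(::ToyRidge, observations::ToyFitObs) = observations.y

learner = ToyRidge(0.1)
X = rand(10, 3)
y = rand(10)
observations = LearnAPI.obs(learner, (X, y))

Xobs = LearnAPI.features(learner, observations)
yobs = LearnAPI.target(learner, observations)

# Both outputs can be resampled through the same generic interface.
@assert MLUtils.numobs(Xobs) == MLUtils.numobs(yobs) == 10
@assert MLUtils.getobs(Xobs, 1:2) == permutedims(X[1:2, :])
@assert MLUtils.getobs(yobs, 1:2) == y[1:2]
```

Returning plain arrays from `features` and `target` is the simplest way to satisfy the requirement under these assumptions. A learner returning a custom wrapper instead would have to implement the interface for that wrapper as well.
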
diff --git a/docs/src/anatomy_of_an_implementation.md b/docs/src/anatomy_of_an_implementation.md
@@ -35,7 +35,7 @@ A transformer ordinarily implements `transform` instead of `predict`. For more o
     then an implementation must: (i) overload [`obs`](@ref) to articulate how
     provided data can be transformed into a form that does support
     this interface, as illustrated below under
-    [Providing an advanced data interface](@ref), and which may additionally
+    [Providing a separate data front end](@ref), and which may additionally
     enable certain performance benefits; or (ii) overload the trait
     [`LearnAPI.data_interface`](@ref) to specify a more relaxed data
     API.
@@ -314,7 +314,7 @@ recovered_model = deserialize(filename)
 @assert predict(recovered_model, X) == predict(model, X)
 ```
 
-## Providing an advanced data interface
+## Providing a separate data front end
 
 ```@setup anatomy2
 using LearnAPI
@@ -364,9 +364,13 @@ y = 2a - b + 3c + 0.05*rand(n)
 
 An implementation may optionally implement [`obs`](@ref), to expose to the user (or some
 meta-algorithm like cross-validation) the representation of input data internal to `fit`
-or `predict`, such as the matrix version `A` of `X` in the ridge example.  Here we
-specifically wrap all the pre-processed data into single object, for which we introduce a
-new type:
+or `predict`, such as the matrix version `A` of `X` in the ridge example.  That is, we may
+factor out of `fit` (and also `predict`) the data pre-processing step, `obs`, to expose
+its outcomes. These outcomes become alternative user inputs to `fit`. For a demonstration
+of `obs` in action, see [below](@ref advanced_demo).
+
+Here we specifically wrap all the pre-processed data into a single object, for which we
+introduce a new type:
 
 ```@example anatomy2
 struct RidgeFitObs{T,M<:AbstractMatrix{T}}
@@ -503,7 +507,7 @@ As above, we add a signature which plays no role vis-à-vis LearnAPI.jl.
 LearnAPI.fit(learner::Ridge, X, y; kwargs...)  = fit(learner, (X, y); kwargs...)
 ```
 
-## Demonstration of an advanced `obs` workflow
+## [Demonstration of an advanced `obs` workflow](@id advanced_demo)
 
 We now can train and predict using internal data representations, resampled using the
 generic MLUtils.jl interface:
diff --git a/docs/src/obs.md b/docs/src/obs.md
@@ -83,7 +83,7 @@ end
 | [`obs(model, data)`](@ref)     | here `data` is `predict`-consumable | not typically | returns `data` |
 
 
-A sample implementation is given in [Providing an advanced data interface](@ref). 
+A sample implementation is given in [Providing a separate data front end](@ref). 
 
 
 ## Reference
diff --git a/docs/src/reference.md b/docs/src/reference.md
@@ -16,9 +16,7 @@ ML/statistical algorithms are typically applied in conjunction with resampling o
 *observations*, as in
 [cross-validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)). In this
 document *data* will always refer to objects encapsulating an ordered sequence of
-individual observations. If a learner is trained using multiple data objects, it is
-undertood that individual objects share the same number of observations, and that
-resampling of one component implies synchronized resampling of the others.
+individual observations.
 
 A `DataFrame` instance, from [DataFrames.jl](https://dataframes.juliadata.org/stable/), is
 an example of data, the observations being the rows. Typically, data provided to
@@ -97,9 +95,11 @@ which can be tested with `@assert `[`LearnAPI.clone(learner)`](@ref)` == learner
 Note that if `learner` is an instance of a *mutable* struct, this requirement
 generally requires overloading `Base.==` for the struct.
 
-No LearnAPI.jl method is permitted to mutate a learner. In particular, one should make
-deep copies of RNG hyperparameters before using them in a new implementation of
-[`fit`](@ref).
+!!! important
+
+    No LearnAPI.jl method is permitted to mutate a learner. In particular, one should make
+    deep copies of RNG hyperparameters before using them in a new implementation of
+    [`fit`](@ref).
 
 #### Composite learners (wrappers)
 
@@ -145,7 +145,7 @@ for each.
     [`LearnAPI.functions`](@ref).
 
 Most learners will also implement [`predict`](@ref) and/or [`transform`](@ref). For a
-bare minimum implementation, see the implementation of `SmallLearner`
+minimal (but useless) implementation, see the implementation of `SmallLearner`
 [here](https://github.com/JuliaAI/LearnAPI.jl/blob/dev/test/traits.jl).
 
 ### List of methods
diff --git a/src/fit_update.jl b/src/fit_update.jl
@@ -17,7 +17,7 @@ model = fit(learner, (X, y))
 ŷ = predict(model, Xnew)
 ```
 
-The second signature, with `data` omitted, is provided by learners that do not
+The signature `fit(learner; verbosity=1)` (no `data`) is provided by learners that do not
 generalize to new observations (called *static algorithms*). In that case,
 `transform(model, data)` or `predict(model, ..., data)` carries out the actual algorithm
 execution, writing any byproducts of that operation to the mutable object `model` returned
diff --git a/src/obs.jl b/src/obs.jl
@@ -77,11 +77,13 @@ only of suitable tables and arrays, then `obs` and `LearnAPI.data_interface` do
 to be overloaded. However, the user will get no performance benefits by using `obs` in
 that case.
 
-When overloading `obs(learner, data)` to output new model-specific representations of
+If overloading `obs(learner, data)` to output new model-specific representations of
 data, it may be necessary to also overload [`LearnAPI.features(learner,
 observations)`](@ref), [`LearnAPI.target(learner, observations)`](@ref) (supervised
 learners), and/or [`LearnAPI.weights(learner, observations)`](@ref) (if weights are
-supported), for each kind output `observations` of `obs(learner, data)`.
+supported), for each kind of output `observations` of `obs(learner, data)`. Moreover, the
+outputs of these methods, applied to `observations`, must also implement the interface
+specified by [`LearnAPI.data_interface(learner)`](@ref).
 
 ## Sample implementation
 
diff --git a/src/target_weights_features.jl b/src/target_weights_features.jl
@@ -5,20 +5,28 @@ Return, for each form of `data` supported in a call of the form [`fit(learner,
 data)`](@ref), the target variable part of `data`. If `nothing` is returned, the
 `learner` does not see a target variable in training (is unsupervised).
 
+The returned object `y` has the same number of observations as `data`. If `data` is the
+output of an [`obs`](@ref) call, then `y` is additionally guaranteed to implement the
+data interface specified by [`LearnAPI.data_interface(learner)`](@ref).
+
 # Extended help
 
 ## What is a target variable?
 
-Examples of target variables are house prices in realestate pricing estimates, the
+Examples of target variables are house prices in real estate pricing estimates, the
 "spam"/"not spam" labels in an email spam filtering task, "outlier"/"inlier" labels in
 outlier detection, cluster labels in clustering problems, and censored survival times in
 survival analysis. For more on targets and target proxies, see the "Reference" section of
 the LearnAPI.jl documentation.
 
 ## New implementations
 
-A fallback returns `nothing`. Must be implemented if `fit` consumes data including a
-target variable.
+A fallback returns `nothing`. The method must be overloaded if `fit` consumes data
+including a target variable.
+
+If overloading [`obs`](@ref), ensure that the return value, unless `nothing`, implements
+the data interface specified by [`LearnAPI.data_interface(learner)`](@ref), in the special
+case that `data` is the output of an `obs` call.
 
 $(DOC_IMPLEMENTED_METHODS(":(LearnAPI.target)"; overloaded=true))
 
@@ -32,10 +40,20 @@ Return, for each form of `data` supported in a call of the form [`fit(learner,
 data)`](@ref), the per-observation weights part of `data`. Where `nothing` is returned, no
 weights are part of `data`, which is to be interpreted as uniform weighting.
 
+The returned object `w` has the same number of observations as `data`. If `data` is the
+output of an [`obs`](@ref) call, then `w` is additionally guaranteed to implement the
+data interface specified by [`LearnAPI.data_interface(learner)`](@ref).
+
+# Extended help
+
 # New implementations
 
 Overloading is optional. A fallback returns `nothing`.
 
+If overloading [`obs`](@ref), ensure that the return value, unless `nothing`, implements
+the data interface specified by [`LearnAPI.data_interface(learner)`](@ref), in the special
+case that `data` is the output of an `obs` call.
+
 $(DOC_IMPLEMENTED_METHODS(":(LearnAPI.weights)"; overloaded=true))
 
 """
@@ -53,26 +71,34 @@ implemented, as in the following sample workflow:
 
 ```julia
 model = fit(learner, data)
-X = features(data)
+X = LearnAPI.features(learner, data)
 ŷ = predict(model, kind_of_proxy, X) # eg, `kind_of_proxy = Point()`
 ```
 
-The returned object has the same number of observations as `data`. For supervised models
-(i.e., where `:(LearnAPI.target) in LearnAPI.functions(learner)`) `ŷ` above is generally
-intended to be an approximate proxy for `LearnAPI.target(learner, data)`, the training
-target.
+For supervised models (i.e., where `:(LearnAPI.target) in LearnAPI.functions(learner)`)
+`ŷ` above is generally intended to be an approximate proxy for `LearnAPI.target(learner,
+data)`, the training target.
+
+The object `X` returned by `LearnAPI.features` has the same number of observations as
+`data`. If `data` is the output of an [`obs`](@ref) call, then `X` is additionally
+guaranteed to implement the data interface specified by
+[`LearnAPI.data_interface(learner)`](@ref).
 
+# Extended help
 
 # New implementations
 
-That the output can be passed to `predict` and/or `transform`, and has the same number of
-observations as `data`, are the only contracts. A fallback returns `first(data)` if `data`
-is a tuple, and otherwise returns `data`.
+For density estimators, whose `fit` typically consumes *only* a target variable, you
+should overload this method to return `nothing`.
+
+It must otherwise be possible to pass the return value `X` to `predict` and/or
+`transform`, and `X` must have the same number of observations as `data`. A fallback returns
+`first(data)` if `data` is a tuple, and otherwise returns `data`.
 
-Overloading may be necessary if [`obs(learner, data)`](@ref) is overloaded to return
-some learner-specific representation of training `data`. For density estimators, whose
-`fit` typically consumes *only* a target variable, you should overload this method to
-return `nothing`.
+Further overloadings may be necessary to handle the case that `data` is the output of
+[`obs(learner, data)`](@ref), if `obs` is being overloaded. In this case, be sure that
+`X`, unless `nothing`, implements the data interface specified by
+[`LearnAPI.data_interface(learner)`](@ref).
 
 """
 features(learner, data) = _first(data)