3 changes: 1 addition & 2 deletions .github/workflows/ci.yml
@@ -17,8 +17,7 @@ jobs:
fail-fast: false
matrix:
version:
- - '1.6' # previous LTS release
- - '1.10' # new LTS release
+ - '1.10' # LTS release
- '1' # automatically expands to the latest stable 1.x release of Julia.
os:
- ubuntu-latest
24 changes: 2 additions & 22 deletions Project.toml
@@ -4,30 +4,10 @@ authors = ["Anthony D. Blaom <[email protected]>"]
version = "0.1.0"

[compat]
- julia = "1.6"
+ julia = "1.10"

[extras]
- DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
- Distributions = "31c24e10-a181-5473-b8eb-7969acd0382f"
- LinearAlgebra = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e"
- MLUtils = "f1d291b0-491e-4a28-83b9-f70985020b54"
- Random = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c"
- Serialization = "9e88b42a-f829-5b0c-bbe9-9e923198166b"
- StableRNGs = "860ef19b-820b-49d6-a774-d7a799459cd3"
- Statistics = "10745b16-79ce-11e8-11f9-7d13ad32a3b2"
- Tables = "bd369af6-aec1-5ad0-b16a-f7cc5008161c"
Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"

[targets]
- test = [
- "DataFrames",
- "Distributions",
- "LinearAlgebra",
- "MLUtils",
- "Random",
- "Serialization",
- "StableRNGs",
- "Statistics",
- "Tables",
- "Test",
- ]
+ test = ["Test",]
2 changes: 1 addition & 1 deletion docs/Project.toml
@@ -7,4 +7,4 @@ Tables = "bd369af6-aec1-5ad0-b16a-f7cc5008161c"

[compat]
Documenter = "1"
- julia = "1.6"
+ julia = "1.10"
39 changes: 27 additions & 12 deletions docs/src/anatomy_of_an_implementation.md
@@ -35,7 +35,7 @@ A transformer ordinarily implements `transform` instead of `predict`. For more o
then an implementation must: (i) overload [`obs`](@ref) to articulate how
provided data can be transformed into a form that does support
this interface, as illustrated below under
- [Providing an advanced data interface](@ref), and which may additionally
+ [Providing a separate data front end](@ref), and which may additionally
enable certain performance benefits; or (ii) overload the trait
[`LearnAPI.data_interface`](@ref) to specify a more relaxed data
API.
@@ -314,7 +314,7 @@ recovered_model = deserialize(filename)
@assert predict(recovered_model, X) == predict(model, X)
```

- ## Providing an advanced data interface
+ ## Providing a separate data front end

```@setup anatomy2
using LearnAPI
@@ -364,9 +364,13 @@ y = 2a - b + 3c + 0.05*rand(n)

An implementation may optionally implement [`obs`](@ref), to expose to the user (or some
meta-algorithm like cross-validation) the representation of input data internal to `fit`
- or `predict`, such as the matrix version `A` of `X` in the ridge example. Here we
- specifically wrap all the pre-processed data into single object, for which we introduce a
- new type:
+ or `predict`, such as the matrix version `A` of `X` in the ridge example. That is, we may
+ factor out of `fit` (and also `predict`) the data pre-processing step, `obs`, to expose
+ its outcomes. These outcomes become alternative user inputs to `fit`. For a demonstration
+ of `obs` in action, see [below](@ref advanced_demo).
+
+ Here we specifically wrap all the pre-processed data into a single object, for which we
+ introduce a new type:

```@example anatomy2
struct RidgeFitObs{T,M<:AbstractMatrix{T}}
@@ -420,10 +424,21 @@ LearnAPI.fit(learner::Ridge, data; kwargs...) =

### The `obs` contract

- Providing `fit` signatures matching the output of `obs`, is the first part of the `obs`
- contract. The second part is this: *The output of `obs` must implement the interface
- specified by the trait* [`LearnAPI.data_interface(learner)`](@ref). Assuming this is
- [`LearnAPI.RandomAccess()`](@ref) (the default) it usually suffices to overload
+ Providing `fit` signatures matching the output of [`obs`](@ref) is the first part of the
+ `obs` contract. Since `obs(learner, data)` should support all `data` that
+ `fit(learner, data)` supports, we must be able to apply `obs(learner, _)` to its own
+ output (`observations` below). This leads to the additional "no-op" declaration
+
+ ```@example anatomy2
+ LearnAPI.obs(::Ridge, observations::RidgeFitObs) = observations
+ ```
+
+ In other words, we ensure that `obs(learner, _)` is
+ [idempotent](https://en.wikipedia.org/wiki/Idempotence).
+
+ The second part of the `obs` contract is this: *The output of `obs` must implement the
+ interface specified by the trait* [`LearnAPI.data_interface(learner)`](@ref). Assuming
+ this is [`LearnAPI.RandomAccess()`](@ref) (the default), it usually suffices to overload
`Base.getindex` and `Base.length`:

```@example anatomy2
@@ -432,11 +447,11 @@ Base.getindex(data::RidgeFitObs, I) =
Base.length(data::RidgeFitObs) = length(data.y)
```

- We can do something similar for `predict`, but there's no need for a new type in this
- case:
+ We do something similar for `predict`, but there's no need for a new type in this case:

```@example anatomy2
LearnAPI.obs(::RidgeFitted, Xnew) = Tables.matrix(Xnew)'
+ LearnAPI.obs(::RidgeFitted, observations::AbstractArray) = observations # idempotence

LearnAPI.predict(model::RidgeFitted, ::Point, observations::AbstractMatrix) =
observations'*model.coefficients
@@ -492,7 +507,7 @@ As above, we add a signature which plays no role vis-à-vis LearnAPI.jl.
LearnAPI.fit(learner::Ridge, X, y; kwargs...) = fit(learner, (X, y); kwargs...)
```

- ## Demonstration of an advanced `obs` workflow
+ ## [Demonstration of an advanced `obs` workflow](@id advanced_demo)

We now can train and predict using internal data representations, resampled using the
generic MLUtils.jl interface:
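The elided demonstration is, in outline, something like the following sketch (assuming the `Ridge` learner and the data `X`, `y`, and `n` defined earlier in the chapter):

```julia
using MLUtils

learner = Ridge()

# internal representation of all the training data:
fit_observations = obs(learner, (X, y))

# train on the first half of the observations only:
model = fit(learner, MLUtils.getobs(fit_observations, 1:div(n, 2)))

# predict on the second half, resampling the internal representation of `X`:
predict_observations = obs(model, X)
ŷ = predict(model, Point(), MLUtils.getobs(predict_observations, div(n, 2)+1:n))
```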
6 changes: 3 additions & 3 deletions docs/src/common_implementation_patterns.md
@@ -27,12 +27,12 @@ implementations fall into one (or more) of the following informally understood p

- [Feature Engineering](@ref): Algorithms for selecting or combining features

- - Dimension Reduction: Transformers that learn to reduce feature space dimension
+ - [Dimension Reduction](@ref): Transformers that learn to reduce feature space dimension

- Missing Value Imputation

- - Transformers: Other transformers, such as standardizers, and categorical
- encoders.
+ - [Transformers](@ref transformers): Other transformers, such as standardizers and
+ categorical encoders.

- [Static Algorithms](@ref): Algorithms that do not learn, in the sense they must be
re-executed for each new data set (do not generalize), but which have hyperparameters
2 changes: 1 addition & 1 deletion docs/src/obs.md
@@ -83,7 +83,7 @@ end
| [`obs(model, data)`](@ref) | here `data` is `predict`-consumable | not typically | returns `data` |


- A sample implementation is given in [Providing an advanced data interface](@ref).
+ A sample implementation is given in [Providing a separate data front end](@ref).


## Reference
6 changes: 6 additions & 0 deletions docs/src/patterns/dimension_reduction.md
@@ -1 +1,7 @@
# Dimension Reduction

+ Check out the following examples:

+ - [Truncated SVD](https://github.com/JuliaAI/LearnTestAPI.jl/blob/dev/test/patterns/dimension_reduction.jl)
+   (from the LearnTestAPI.jl test suite)
8 changes: 7 additions & 1 deletion docs/src/patterns/transformers.md
@@ -1 +1,7 @@
- # Transformers
+ # [Transformers](@id transformers)

+ Check out the following examples:

+ - [Truncated SVD](https://github.com/JuliaAI/LearnTestAPI.jl/blob/dev/test/patterns/dimension_reduction.jl)
+   (from the LearnTestAPI.jl test suite)
16 changes: 8 additions & 8 deletions docs/src/reference.md
@@ -16,9 +16,7 @@ ML/statistical algorithms are typically applied in conjunction with resampling o
*observations*, as in
[cross-validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)). In this
document *data* will always refer to objects encapsulating an ordered sequence of
- individual observations. If a learner is trained using multiple data objects, it is
- undertood that individual objects share the same number of observations, and that
- resampling of one component implies synchronized resampling of the others.
+ individual observations.

A `DataFrame` instance, from [DataFrames.jl](https://dataframes.juliadata.org/stable/), is
an example of data, the observations being the rows. Typically, data provided to
@@ -97,9 +95,11 @@ which can be tested with `@assert `[`LearnAPI.clone(learner)`](@ref)` == learner
Note that if `learner` is an instance of a *mutable* struct, this requirement
generally requires overloading `Base.==` for the struct.

- No LearnAPI.jl method is permitted to mutate a learner. In particular, one should make
- deep copies of RNG hyperparameters before using them in a new implementation of
- [`fit`](@ref).
+ !!! important
+
+     No LearnAPI.jl method is permitted to mutate a learner. In particular, one should make
+     deep copies of RNG hyperparameters before using them in a new implementation of
+     [`fit`](@ref).
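For example, a `fit` implementation respecting this rule might begin as in the following sketch (`MyLearner` and its `rng` hyperparameter are illustrative assumptions, not part of the API):

```julia
function LearnAPI.fit(learner::MyLearner, data; verbosity=1)
    rng = deepcopy(learner.rng)  # copy, so the learner's own RNG is never mutated
    # ... train using `rng`, not `learner.rng` ...
end
```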

#### Composite learners (wrappers)

Expand All @@ -116,7 +116,7 @@ understood to have a valid implementation of the LearnAPI.jl interface.

#### Example

- Any instance of `GradientRidgeRegressor` defined below is a valid learner.
+ Below is an example of a learner type with a valid constructor:

```julia
struct GradientRidgeRegressor{T<:Real}
@@ -145,7 +145,7 @@ for each.
[`LearnAPI.functions`](@ref).

Most learners will also implement [`predict`](@ref) and/or [`transform`](@ref). For a
- bare minimum implementation, see the implementation of `SmallLearner`
+ minimal (but useless) implementation, see `SmallLearner`
[here](https://github.com/JuliaAI/LearnAPI.jl/blob/dev/test/traits.jl).

### List of methods
2 changes: 1 addition & 1 deletion src/clone.jl
@@ -7,7 +7,7 @@ Return a shallow copy of `learner` with the specified hyperparameter replacement
clone(learner; epochs=100, learning_rate=0.01)
```

- It is guaranteed that `LearnAPI.clone(learner) == learner`.
+ A LearnAPI.jl contract ensures that `LearnAPI.clone(learner) == learner`.
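For example (a sketch, assuming `learner` has an `epochs` hyperparameter):

```julia
fast = LearnAPI.clone(learner; epochs=10)
@assert fast.epochs == 10
@assert LearnAPI.clone(learner) == learner  # the contract just stated
```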

"""
function clone(learner; replacements...)
2 changes: 1 addition & 1 deletion src/fit_update.jl
@@ -17,7 +17,7 @@ model = fit(learner, (X, y))
ŷ = predict(model, Xnew)
```

- The second signature, with `data` omitted, is provided by learners that do not
+ The signature `fit(learner; verbosity=1)` (no `data`) is provided by learners that do not
generalize to new observations (called *static algorithms*). In that case,
`transform(model, data)` or `predict(model, ..., data)` carries out the actual algorithm
execution, writing any byproducts of that operation to the mutable object `model` returned
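In outline, the static workflow looks something like this sketch (the transformer type below is hypothetical):

```julia
learner = SomeStaticTransformer()  # hypothetical static learner
model = fit(learner)               # no data consumed at this point
W = transform(model, data)         # actual algorithm execution happens here
```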
37 changes: 29 additions & 8 deletions src/obs.jl
@@ -54,8 +54,25 @@ For each supported form of `data` in `fit(learner, data)`, it must be true that
fit(learner, observations)` is equivalent to `model = fit(learner, data)`, whenever
`observations = obs(learner, data)`. For each supported form of `data` in calls
`predict(model, ..., data)` and `transform(model, data)`, where implemented, the calls
- `predict(model, ..., observations)` and `transform(model, observations)` are supported
- alternatives, whenever `observations = obs(model, data)`.
+ `predict(model, ..., observations)` and `transform(model, observations)` must be supported
+ alternatives with the same output, whenever `observations = obs(model, data)`.

+ If `LearnAPI.data_interface(learner) == RandomAccess()` (the default), then `fit`,
+ `predict` and `transform` must additionally accept `obs` output that has been *subsampled*
+ using `MLUtils.getobs`, with the obvious interpretation applying to the outcomes of such
+ calls (e.g., if *all* observations are subsampled, then outcomes should be the same as if
+ using the original data).

+ Implicit in the preceding requirements is that `obs(learner, _)` and `obs(model, _)` are
+ idempotent, meaning both the following hold:
+
+ ```julia
+ obs(learner, obs(learner, data)) == obs(learner, data)
+ obs(model, obs(model, data)) == obs(model, data)
+ ```
+
+ If one overloads `obs`, one typically needs additional overloadings to guarantee
+ idempotence.
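For example, in the `RandomAccess()` case, training on the full set of subsampled observations must be equivalent to training on `data` directly, as in this sketch (assuming MLUtils is available):

```julia
import MLUtils
observations = obs(learner, data)
n = MLUtils.numobs(observations)
model = fit(learner, MLUtils.getobs(observations, 1:n))
# `model` must be equivalent to the output of `fit(learner, data)`
```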

The fallback for `obs` is `obs(model_or_learner, data) = data`, and the fallback for
`LearnAPI.data_interface(learner)` is `LearnAPI.RandomAccess()`. For details refer to
@@ -66,15 +83,19 @@ only of suitable tables and arrays, then `obs` and `LearnAPI.data_interface` do
to be overloaded. However, the user will get no performance benefits by using `obs` in
that case.

- When overloading `obs(learner, data)` to output new model-specific representations of
- data, it may be necessary to also overload [`LearnAPI.features`](@ref),
- [`LearnAPI.target`](@ref) (supervised learners), and/or [`LearnAPI.weights`](@ref) (if
- weights are supported), for extracting relevant parts of the representation.
+ If overloading `obs(learner, data)` to output new model-specific representations of
+ data, it may be necessary to also overload [`LearnAPI.features(learner,
+ observations)`](@ref), [`LearnAPI.target(learner, observations)`](@ref) (supervised
+ learners), and/or [`LearnAPI.weights(learner, observations)`](@ref) (if weights are
+ supported), for each kind of output `observations` of `obs(learner, data)`. Moreover, the
+ outputs of these methods, applied to `observations`, must also implement the interface
+ specified by [`LearnAPI.data_interface(learner)`](@ref).
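For instance, the ridge example from the manual (assumed here) stores its pre-processed training data in a `RidgeFitObs` object with fields `A` (feature matrix) and `y` (target), and adds overloadings like these:

```julia
LearnAPI.target(::Ridge, observations::RidgeFitObs) = observations.y
LearnAPI.features(::Ridge, observations::RidgeFitObs) = observations.A
```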

## Sample implementation

Refer to the "Anatomy of an Implementation" section of the LearnAPI.jl
[manual](https://juliaai.github.io/LearnAPI.jl/dev/).
Refer to the ["Anatomy of an
Implementation"](https://juliaai.github.io/LearnAPI.jl/dev/anatomy_of_an_implementation/#Providing-an-advanced-data-interface)
section of the LearnAPI.jl manual.


"""
58 changes: 42 additions & 16 deletions src/target_weights_features.jl
@@ -5,20 +5,28 @@ Return, for each form of `data` supported in a call of the form [`fit(learner,
data)`](@ref), the target variable part of `data`. If `nothing` is returned, the
`learner` does not see a target variable in training (is unsupervised).

+ The returned object `y` has the same number of observations as `data`. If `data` is the
+ output of an [`obs`](@ref) call, then `y` is additionally guaranteed to implement the
+ data interface specified by [`LearnAPI.data_interface(learner)`](@ref).

# Extended help

## What is a target variable?

- Examples of target variables are house prices in realestate pricing estimates, the
+ Examples of target variables are house prices in real estate pricing estimates, the
"spam"/"not spam" labels in an email spam filtering task, "outlier"/"inlier" labels in
outlier detection, cluster labels in clustering problems, and censored survival times in
survival analysis. For more on targets and target proxies, see the "Reference" section of
the LearnAPI.jl documentation.

## New implementations

- A fallback returns `nothing`. Must be implemented if `fit` consumes data including a
- target variable.
+ A fallback returns `nothing`. The method must be overloaded if `fit` consumes data
+ including a target variable.

+ If overloading [`obs`](@ref), ensure that the return value, unless `nothing`, implements
+ the data interface specified by [`LearnAPI.data_interface(learner)`](@ref), in the special
+ case that `data` is the output of an `obs` call.

$(DOC_IMPLEMENTED_METHODS(":(LearnAPI.target)"; overloaded=true))

@@ -32,10 +40,20 @@ Return, for each form of `data` supported in a call of the form [`fit(learner,
data)`](@ref), the per-observation weights part of `data`. Where `nothing` is returned, no
weights are part of `data`, which is to be interpreted as uniform weighting.

+ The returned object `w` has the same number of observations as `data`. If `data` is the
+ output of an [`obs`](@ref) call, then `w` is additionally guaranteed to implement the
+ data interface specified by [`LearnAPI.data_interface(learner)`](@ref).

# Extended help

# New implementations

Overloading is optional. A fallback returns `nothing`.

+ If overloading [`obs`](@ref), ensure that the return value, unless `nothing`, implements
+ the data interface specified by [`LearnAPI.data_interface(learner)`](@ref), in the special
+ case that `data` is the output of an `obs` call.

$(DOC_IMPLEMENTED_METHODS(":(LearnAPI.weights)"; overloaded=true))

"""
@@ -53,26 +71,34 @@ implemented, as in the following sample workflow:

```julia
model = fit(learner, data)
- X = features(data)
- ŷ = predict(learner, kind_of_proxy, X) # eg, `kind_of_proxy = Point()`
+ X = LearnAPI.features(learner, data)
+ ŷ = predict(model, kind_of_proxy, X) # eg, `kind_of_proxy = Point()`
```

- The returned object has the same number of observations as `data`. For supervised models
- (i.e., where `:(LearnAPI.target) in LearnAPI.functions(learner)`) `ŷ` above is generally
- intended to be an approximate proxy for `LearnAPI.target(learner, data)`, the training
- target.
+ For supervised models (i.e., where `:(LearnAPI.target) in LearnAPI.functions(learner)`)
+ `ŷ` above is generally intended to be an approximate proxy for `LearnAPI.target(learner,
+ data)`, the training target.

+ The object `X` returned by `LearnAPI.features` has the same number of observations as
+ `data`. If `data` is the output of an [`obs`](@ref) call, then `X` is additionally
+ guaranteed to implement the data interface specified by
+ [`LearnAPI.data_interface(learner)`](@ref).

# Extended help

# New implementations

- That the output can be passed to `predict` and/or `transform`, and has the same number of
- observations as `data`, are the only contracts. A fallback returns `first(data)` if `data`
- is a tuple, and otherwise returns `data`.
+ For density estimators, whose `fit` typically consumes *only* a target variable, you
+ should overload this method to return `nothing`.
+
+ It must otherwise be possible to pass the return value `X` to `predict` and/or
+ `transform`, and `X` must have the same number of observations as `data`. A fallback
+ returns `first(data)` if `data` is a tuple, and otherwise returns `data`.
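A sketch of the density-estimator case (the learner type is hypothetical):

```julia
LearnAPI.features(::MyDensityEstimator, data) = nothing  # `fit` consumes only a target
```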

- Overloading may be necessary if [`obs(learner, data)`](@ref) is overloaded to return
- some learner-specific representation of training `data`. For density estimators, whose
- `fit` typically consumes *only* a target variable, you should overload this method to
- return `nothing`.
+ Further overloadings may be necessary to handle the case that `data` is the output of
+ [`obs(learner, data)`](@ref), if `obs` is being overloaded. In this case, be sure that
+ `X`, unless `nothing`, implements the data interface specified by
+ [`LearnAPI.data_interface(learner)`](@ref).

"""
features(learner, data) = _first(data)