3 changes: 1 addition & 2 deletions .github/workflows/ci.yml
@@ -17,8 +17,7 @@ jobs:
fail-fast: false
matrix:
version:
- - '1.6' # previous LTS release
- - '1.10' # new LTS release
+ - '1.10' # LTS release
- '1' # automatically expands to the latest stable 1.x release of Julia.
os:
- ubuntu-latest
24 changes: 2 additions & 22 deletions Project.toml
@@ -4,30 +4,10 @@ authors = ["Anthony D. Blaom <[email protected]>"]
version = "0.1.0"

[compat]
- julia = "1.6"
+ julia = "1.10"

[extras]
- DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
- Distributions = "31c24e10-a181-5473-b8eb-7969acd0382f"
- LinearAlgebra = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e"
- MLUtils = "f1d291b0-491e-4a28-83b9-f70985020b54"
- Random = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c"
- Serialization = "9e88b42a-f829-5b0c-bbe9-9e923198166b"
- StableRNGs = "860ef19b-820b-49d6-a774-d7a799459cd3"
- Statistics = "10745b16-79ce-11e8-11f9-7d13ad32a3b2"
- Tables = "bd369af6-aec1-5ad0-b16a-f7cc5008161c"
Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"

[targets]
- test = [
- "DataFrames",
- "Distributions",
- "LinearAlgebra",
- "MLUtils",
- "Random",
- "Serialization",
- "StableRNGs",
- "Statistics",
- "Tables",
- "Test",
- ]
+ test = ["Test",]
2 changes: 1 addition & 1 deletion docs/Project.toml
@@ -7,4 +7,4 @@ Tables = "bd369af6-aec1-5ad0-b16a-f7cc5008161c"

[compat]
Documenter = "1"
- julia = "1.6"
+ julia = "1.10"
39 changes: 27 additions & 12 deletions docs/src/anatomy_of_an_implementation.md
@@ -35,7 +35,7 @@ A transformer ordinarily implements `transform` instead of `predict`. For more o
then an implementation must: (i) overload [`obs`](@ref) to articulate how
provided data can be transformed into a form that does support
this interface, as illustrated below under
- [Providing an advanced data interface](@ref), and which may additionally
+ [Providing a separate data front end](@ref), and which may additionally
enable certain performance benefits; or (ii) overload the trait
[`LearnAPI.data_interface`](@ref) to specify a more relaxed data
API.
@@ -314,7 +314,7 @@ recovered_model = deserialize(filename)
@assert predict(recovered_model, X) == predict(model, X)
```

- ## Providing an advanced data interface
+ ## Providing a separate data front end

```@setup anatomy2
using LearnAPI
@@ -364,9 +364,13 @@ y = 2a - b + 3c + 0.05*rand(n)

An implementation may optionally implement [`obs`](@ref), to expose to the user (or some
meta-algorithm like cross-validation) the representation of input data internal to `fit`
- or `predict`, such as the matrix version `A` of `X` in the ridge example. Here we
- specifically wrap all the pre-processed data into single object, for which we introduce a
- new type:
+ or `predict`, such as the matrix version `A` of `X` in the ridge example. That is, we may
+ factor out of `fit` (and also `predict`) the data pre-processing step, `obs`, to expose
+ its outcomes. These outcomes become alternative user inputs to `fit`. For a demonstration
+ of `obs` in action, see [below](@ref advanced_demo).
+
+ Here we specifically wrap all the pre-processed data into a single object, for which we
+ introduce a new type:

```@example anatomy2
struct RidgeFitObs{T,M<:AbstractMatrix{T}}
@@ -420,10 +424,21 @@ LearnAPI.fit(learner::Ridge, data; kwargs...) =

### The `obs` contract

- Providing `fit` signatures matching the output of `obs`, is the first part of the `obs`
- contract. The second part is this: *The output of `obs` must implement the interface
- specified by the trait* [`LearnAPI.data_interface(learner)`](@ref). Assuming this is
- [`LearnAPI.RandomAccess()`](@ref) (the default) it usually suffices to overload
+ Providing `fit` signatures matching the output of [`obs`](@ref) is the first part of the
+ `obs` contract. Since `obs(learner, data)` should support all `data` that
+ `fit(learner, data)` supports, we must be able to apply `obs(learner, _)` to its own
+ output (`observations` below). This leads to the additional "no-op" declaration
+
+ ```@example anatomy2
+ LearnAPI.obs(::Ridge, observations::RidgeFitObs) = observations
+ ```
+
+ In other words, we ensure that `obs(learner, _)` is
+ [idempotent](https://en.wikipedia.org/wiki/Idempotence).
+
+ The second part of the `obs` contract is this: *The output of `obs` must implement the
+ interface specified by the trait* [`LearnAPI.data_interface(learner)`](@ref). Assuming
+ this is [`LearnAPI.RandomAccess()`](@ref) (the default), it usually suffices to overload
`Base.getindex` and `Base.length`:

```@example anatomy2
@@ -432,11 +447,11 @@ Base.getindex(data::RidgeFitObs, I) =
Base.length(data::RidgeFitObs) = length(data.y)
```

- We can do something similar for `predict`, but there's no need for a new type in this
- case:
+ We do something similar for `predict`, but there's no need for a new type in this case:

```@example anatomy2
LearnAPI.obs(::RidgeFitted, Xnew) = Tables.matrix(Xnew)'
+ LearnAPI.obs(::RidgeFitted, observations::AbstractArray) = observations # idempotence

LearnAPI.predict(model::RidgeFitted, ::Point, observations::AbstractMatrix) =
observations'*model.coefficients
@@ -492,7 +507,7 @@ As above, we add a signature which plays no role vis-à-vis LearnAPI.jl.
LearnAPI.fit(learner::Ridge, X, y; kwargs...) = fit(learner, (X, y); kwargs...)
```

- ## Demonstration of an advanced `obs` workflow
+ ## [Demonstration of an advanced `obs` workflow](@id advanced_demo)

We now can train and predict using internal data representations, resampled using the
generic MLUtils.jl interface:
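The elided demonstration is, in outline, something like the following sketch (assuming the `Ridge` learner and the data `X`, `y`, and `n` defined earlier in the chapter):

```julia
using MLUtils

learner = Ridge()

# internal representation of all the training data:
fit_observations = obs(learner, (X, y))

# train on the first half of the observations only:
model = fit(learner, MLUtils.getobs(fit_observations, 1:div(n, 2)))

# predict on the second half, resampling the internal representation of `X`:
predict_observations = obs(model, X)
ŷ = predict(model, Point(), MLUtils.getobs(predict_observations, div(n, 2)+1:n))
```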
6 changes: 3 additions & 3 deletions docs/src/common_implementation_patterns.md
@@ -27,12 +27,12 @@ implementations fall into one (or more) of the following informally understood p

- [Feature Engineering](@ref): Algorithms for selecting or combining features

- - Dimension Reduction: Transformers that learn to reduce feature space dimension
+ - [Dimension Reduction](@ref): Transformers that learn to reduce feature space dimension

- Missing Value Imputation

- - Transformers: Other transformers, such as standardizers, and categorical
- encoders.
+ - [Transformers](@ref transformers): Other transformers, such as standardizers and
+ categorical encoders.

- [Static Algorithms](@ref): Algorithms that do not learn, in the sense they must be
re-executed for each new data set (do not generalize), but which have hyperparameters
2 changes: 1 addition & 1 deletion docs/src/obs.md
@@ -83,7 +83,7 @@ end
| [`obs(model, data)`](@ref) | here `data` is `predict`-consumable | not typically | returns `data` |


- A sample implementation is given in [Providing an advanced data interface](@ref).
+ A sample implementation is given in [Providing a separate data front end](@ref).


## Reference
6 changes: 6 additions & 0 deletions docs/src/patterns/dimension_reduction.md
@@ -1 +1,7 @@
# Dimension Reduction

+ Check out the following examples:

+ - [Truncated SVD](https://github.com/JuliaAI/LearnTestAPI.jl/blob/dev/test/patterns/dimension_reduction.jl)
+   (from the LearnTestAPI.jl test suite)
8 changes: 7 additions & 1 deletion docs/src/patterns/transformers.md
@@ -1 +1,7 @@
- # Transformers
+ # [Transformers](@id transformers)

+ Check out the following examples:

+ - [Truncated SVD](https://github.com/JuliaAI/LearnTestAPI.jl/blob/dev/test/patterns/dimension_reduction.jl)
+   (from the LearnTestAPI.jl test suite)
16 changes: 8 additions & 8 deletions docs/src/reference.md
@@ -16,9 +16,7 @@ ML/statistical algorithms are typically applied in conjunction with resampling o
*observations*, as in
[cross-validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)). In this
document *data* will always refer to objects encapsulating an ordered sequence of
- individual observations. If a learner is trained using multiple data objects, it is
- undertood that individual objects share the same number of observations, and that
- resampling of one component implies synchronized resampling of the others.
+ individual observations.

A `DataFrame` instance, from [DataFrames.jl](https://dataframes.juliadata.org/stable/), is
an example of data, the observations being the rows. Typically, data provided to
@@ -97,9 +95,11 @@ which can be tested with `@assert `[`LearnAPI.clone(learner)`](@ref)` == learner
Note that if `learner` is an instance of a *mutable* struct, this requirement
generally requires overloading `Base.==` for the struct.

- No LearnAPI.jl method is permitted to mutate a learner. In particular, one should make
- deep copies of RNG hyperparameters before using them in a new implementation of
- [`fit`](@ref).
+ !!! important
+
+     No LearnAPI.jl method is permitted to mutate a learner. In particular, one should make
+     deep copies of RNG hyperparameters before using them in a new implementation of
+     [`fit`](@ref).
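For example, a `fit` implementation respecting this rule might begin as in the following sketch (`MyLearner` and its `rng` hyperparameter are illustrative assumptions, not part of the API):

```julia
function LearnAPI.fit(learner::MyLearner, data; verbosity=1)
    rng = deepcopy(learner.rng)  # copy, so the learner's own RNG is never mutated
    # ... train using `rng`, not `learner.rng` ...
end
```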

#### Composite learners (wrappers)

Expand All @@ -116,7 +116,7 @@ understood to have a valid implementation of the LearnAPI.jl interface.

#### Example

- Any instance of `GradientRidgeRegressor` defined below is a valid learner.
+ Below is an example of a learner type with a valid constructor:

```julia
struct GradientRidgeRegressor{T<:Real}
@@ -145,7 +145,7 @@ for each.
[`LearnAPI.functions`](@ref).

Most learners will also implement [`predict`](@ref) and/or [`transform`](@ref). For a
- bare minimum implementation, see the implementation of `SmallLearner`
+ minimal (but useless) implementation, see `SmallLearner`
[here](https://github.com/JuliaAI/LearnAPI.jl/blob/dev/test/traits.jl).

### List of methods
2 changes: 1 addition & 1 deletion src/clone.jl
@@ -7,7 +7,7 @@ Return a shallow copy of `learner` with the specified hyperparameter replacement
clone(learner; epochs=100, learning_rate=0.01)
```

- It is guaranteed that `LearnAPI.clone(learner) == learner`.
+ A LearnAPI.jl contract ensures that `LearnAPI.clone(learner) == learner`.
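For example (a sketch, assuming `learner` has an `epochs` hyperparameter):

```julia
fast = LearnAPI.clone(learner; epochs=10)
@assert fast.epochs == 10
@assert LearnAPI.clone(learner) == learner  # the contract just stated
```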

"""
function clone(learner; replacements...)
2 changes: 1 addition & 1 deletion src/fit_update.jl
@@ -17,7 +17,7 @@ model = fit(learner, (X, y))
ŷ = predict(model, Xnew)
```

- The second signature, with `data` omitted, is provided by learners that do not
+ The signature `fit(learner; verbosity=1)` (no `data`) is provided by learners that do not
generalize to new observations (called *static algorithms*). In that case,
`transform(model, data)` or `predict(model, ..., data)` carries out the actual algorithm
execution, writing any byproducts of that operation to the mutable object `model` returned
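In outline, the static workflow looks something like this sketch (the transformer type below is hypothetical):

```julia
learner = SomeStaticTransformer()  # hypothetical static learner
model = fit(learner)               # no data consumed at this point
W = transform(model, data)         # actual algorithm execution happens here
```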
37 changes: 29 additions & 8 deletions src/obs.jl
@@ -54,8 +54,25 @@ For each supported form of `data` in `fit(learner, data)`, it must be true that
fit(learner, observations)` is equivalent to `model = fit(learner, data)`, whenever
`observations = obs(learner, data)`. For each supported form of `data` in calls
`predict(model, ..., data)` and `transform(model, data)`, where implemented, the calls
- `predict(model, ..., observations)` and `transform(model, observations)` are supported
- alternatives, whenever `observations = obs(model, data)`.
+ `predict(model, ..., observations)` and `transform(model, observations)` must be supported
+ alternatives with the same output, whenever `observations = obs(model, data)`.

+ If `LearnAPI.data_interface(learner) == RandomAccess()` (the default), then `fit`,
+ `predict` and `transform` must additionally accept `obs` output that has been *subsampled*
+ using `MLUtils.getobs`, with the obvious interpretation applying to the outcomes of such
+ calls (e.g., if *all* observations are subsampled, then outcomes should be the same as if
+ using the original data).

+ Implicit in the preceding requirements is that `obs(learner, _)` and `obs(model, _)` are
+ idempotent, meaning both the following hold:
+
+ ```julia
+ obs(learner, obs(learner, data)) == obs(learner, data)
+ obs(model, obs(model, data)) == obs(model, data)
+ ```
+
+ If one overloads `obs`, one typically needs additional overloadings to guarantee
+ idempotence.
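For example, in the `RandomAccess()` case, training on the full set of subsampled observations must be equivalent to training on `data` directly, as in this sketch (assuming MLUtils is available):

```julia
import MLUtils
observations = obs(learner, data)
n = MLUtils.numobs(observations)
model = fit(learner, MLUtils.getobs(observations, 1:n))
# `model` must be equivalent to the output of `fit(learner, data)`
```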

The fallback for `obs` is `obs(model_or_learner, data) = data`, and the fallback for
`LearnAPI.data_interface(learner)` is `LearnAPI.RandomAccess()`. For details refer to
@@ -66,15 +83,19 @@ only of suitable tables and arrays, then `obs` and `LearnAPI.data_interface` do
to be overloaded. However, the user will get no performance benefits by using `obs` in
that case.

- When overloading `obs(learner, data)` to output new model-specific representations of
- data, it may be necessary to also overload [`LearnAPI.features`](@ref),
- [`LearnAPI.target`](@ref) (supervised learners), and/or [`LearnAPI.weights`](@ref) (if
- weights are supported), for extracting relevant parts of the representation.
+ If overloading `obs(learner, data)` to output new model-specific representations of
+ data, it may be necessary to also overload [`LearnAPI.features(learner,
+ observations)`](@ref), [`LearnAPI.target(learner, observations)`](@ref) (supervised
+ learners), and/or [`LearnAPI.weights(learner, observations)`](@ref) (if weights are
+ supported), for each kind of output `observations` of `obs(learner, data)`. Moreover, the
+ outputs of these methods, applied to `observations`, must also implement the interface
+ specified by [`LearnAPI.data_interface(learner)`](@ref).
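For instance, the ridge example from the manual (assumed here) stores its pre-processed training data in a `RidgeFitObs` object with fields `A` (feature matrix) and `y` (target), and adds overloadings like these:

```julia
LearnAPI.target(::Ridge, observations::RidgeFitObs) = observations.y
LearnAPI.features(::Ridge, observations::RidgeFitObs) = observations.A
```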

## Sample implementation

Refer to the "Anatomy of an Implementation" section of the LearnAPI.jl
[manual](https://juliaai.github.io/LearnAPI.jl/dev/).
Refer to the ["Anatomy of an
Implementation"](https://juliaai.github.io/LearnAPI.jl/dev/anatomy_of_an_implementation/#Providing-an-advanced-data-interface)
section of the LearnAPI.jl manual.


"""
58 changes: 42 additions & 16 deletions src/target_weights_features.jl
@@ -5,20 +5,28 @@ Return, for each form of `data` supported in a call of the form [`fit(learner,
data)`](@ref), the target variable part of `data`. If `nothing` is returned, the
`learner` does not see a target variable in training (is unsupervised).

+ The returned object `y` has the same number of observations as `data`. If `data` is the
+ output of an [`obs`](@ref) call, then `y` is additionally guaranteed to implement the
+ data interface specified by [`LearnAPI.data_interface(learner)`](@ref).

# Extended help

## What is a target variable?

- Examples of target variables are house prices in realestate pricing estimates, the
+ Examples of target variables are house prices in real estate pricing estimates, the
"spam"/"not spam" labels in an email spam filtering task, "outlier"/"inlier" labels in
outlier detection, cluster labels in clustering problems, and censored survival times in
survival analysis. For more on targets and target proxies, see the "Reference" section of
the LearnAPI.jl documentation.

## New implementations

- A fallback returns `nothing`. Must be implemented if `fit` consumes data including a
- target variable.
+ A fallback returns `nothing`. The method must be overloaded if `fit` consumes data
+ including a target variable.

+ If overloading [`obs`](@ref), ensure that the return value, unless `nothing`, implements
+ the data interface specified by [`LearnAPI.data_interface(learner)`](@ref), in the special
+ case that `data` is the output of an `obs` call.

$(DOC_IMPLEMENTED_METHODS(":(LearnAPI.target)"; overloaded=true))

@@ -32,10 +40,20 @@ Return, for each form of `data` supported in a call of the form [`fit(learner,
data)`](@ref), the per-observation weights part of `data`. Where `nothing` is returned, no
weights are part of `data`, which is to be interpreted as uniform weighting.

+ The returned object `w` has the same number of observations as `data`. If `data` is the
+ output of an [`obs`](@ref) call, then `w` is additionally guaranteed to implement the
+ data interface specified by [`LearnAPI.data_interface(learner)`](@ref).

# Extended help

# New implementations

Overloading is optional. A fallback returns `nothing`.

+ If overloading [`obs`](@ref), ensure that the return value, unless `nothing`, implements
+ the data interface specified by [`LearnAPI.data_interface(learner)`](@ref), in the special
+ case that `data` is the output of an `obs` call.

$(DOC_IMPLEMENTED_METHODS(":(LearnAPI.weights)"; overloaded=true))

"""
@@ -53,26 +71,34 @@ implemented, as in the following sample workflow:

```julia
model = fit(learner, data)
- X = features(data)
- ŷ = predict(learner, kind_of_proxy, X) # eg, `kind_of_proxy = Point()`
+ X = LearnAPI.features(learner, data)
+ ŷ = predict(model, kind_of_proxy, X) # eg, `kind_of_proxy = Point()`
```

- The returned object has the same number of observations as `data`. For supervised models
- (i.e., where `:(LearnAPI.target) in LearnAPI.functions(learner)`) `ŷ` above is generally
- intended to be an approximate proxy for `LearnAPI.target(learner, data)`, the training
- target.
+ For supervised models (i.e., where `:(LearnAPI.target) in LearnAPI.functions(learner)`)
+ `ŷ` above is generally intended to be an approximate proxy for `LearnAPI.target(learner,
+ data)`, the training target.

+ The object `X` returned by `LearnAPI.features` has the same number of observations as
+ `data`. If `data` is the output of an [`obs`](@ref) call, then `X` is additionally
+ guaranteed to implement the data interface specified by
+ [`LearnAPI.data_interface(learner)`](@ref).

# Extended help

# New implementations

- That the output can be passed to `predict` and/or `transform`, and has the same number of
- observations as `data`, are the only contracts. A fallback returns `first(data)` if `data`
- is a tuple, and otherwise returns `data`.
+ For density estimators, whose `fit` typically consumes *only* a target variable, you
+ should overload this method to return `nothing`.
+
+ It must otherwise be possible to pass the return value `X` to `predict` and/or
+ `transform`, and `X` must have the same number of observations as `data`. A fallback
+ returns `first(data)` if `data` is a tuple, and otherwise returns `data`.
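A sketch of the density-estimator case (the learner type is hypothetical):

```julia
LearnAPI.features(::MyDensityEstimator, data) = nothing  # `fit` consumes only a target
```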

- Overloading may be necessary if [`obs(learner, data)`](@ref) is overloaded to return
- some learner-specific representation of training `data`. For density estimators, whose
- `fit` typically consumes *only* a target variable, you should overload this method to
- return `nothing`.
+ Further overloadings may be necessary to handle the case that `data` is the output of
+ [`obs(learner, data)`](@ref), if `obs` is being overloaded. In this case, be sure that
+ `X`, unless `nothing`, implements the data interface specified by
+ [`LearnAPI.data_interface(learner)`](@ref).

"""
features(learner, data) = _first(data)