Skip to content

Commit 0630f46

Browse files
Implement HypothesisCoding and deprecate ContrastsCoding (#158)
* add HypothesisCoding, tests, don't export ContrastsCoding * actually commit tests * update docs * add seqdiff coding, update docs and tests * Apply suggestions from code review Co-Authored-By: Milan Bouchet-Valat <[email protected]> * outcome → outcomes Co-Authored-By: Milan Bouchet-Valat <[email protected]> * deprecate but export ContrastsCoding; clarify docstring * fix typo Co-Authored-By: Milan Bouchet-Valat <[email protected]> * bump patch version Co-authored-by: Milan Bouchet-Valat <[email protected]>
1 parent 51f1749 commit 0630f46

File tree

5 files changed

+225
-14
lines changed

5 files changed

+225
-14
lines changed

Project.toml

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
name = "StatsModels"
22
uuid = "3eaba693-59b7-5ba5-a881-562e759f1c8d"
3-
version = "0.6.7"
3+
version = "0.6.8"
44

55
[deps]
66
DataAPI = "9a962f9c-6df0-11e9-0e5d-c546b8b5ee8a"

docs/src/contrasts.md

Lines changed: 9 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,9 @@
11
```@meta
22
CurrentModule = StatsModels
3+
DocTestSetup = quote
4+
using StatsModels
5+
using LinearAlgebra
6+
end
37
```
48

59
# Modeling categorical data
@@ -14,7 +18,8 @@ The following contrast coding systems are implemented:
1418
* [`DummyCoding`](@ref)
1519
* [`EffectsCoding`](@ref)
1620
* [`HelmertCoding`](@ref)
17-
* [`ContrastsCoding`](@ref)
21+
* [`HypothesisCoding`](@ref)
22+
* [`SeqDiffCoding`](@ref)
1823

1924
## How to specify contrast coding
2025

@@ -44,13 +49,15 @@ ContrastsMatrix
4449
DummyCoding
4550
EffectsCoding
4651
HelmertCoding
47-
ContrastsCoding
52+
SeqDiffCoding
53+
HypothesisCoding
4854
```
4955

5056
### Special internal contrasts
5157

5258
```@docs
5359
FullDummyCoding
60+
ContrastsCoding
5461
```
5562

5663
## Further details

src/StatsModels.jl

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -24,8 +24,10 @@ export
2424
EffectsCoding,
2525
DummyCoding,
2626
HelmertCoding,
27+
SeqDiffCoding,
28+
HypothesisCoding,
2729
ContrastsCoding,
28-
30+
2931
coefnames,
3032
dropterm,
3133
setcontrasts!,

src/contrasts.jl

Lines changed: 135 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -213,7 +213,7 @@ Base.getindex(contrasts::ContrastsMatrix{C,T}, rowinds, colinds) where {C,T} =
213213
# Making a contrast type T only requires that there be a method for
214214
# contrasts_matrix(T, baseind, n) and optionally termnames(T, levels, baseind)
215215
# The rest is boilerplate.
216-
for contrastType in [:DummyCoding, :EffectsCoding, :HelmertCoding]
216+
for contrastType in [:DummyCoding, :EffectsCoding, :HelmertCoding, :SeqDiffCoding]
217217
@eval begin
218218
mutable struct $contrastType <: AbstractContrasts
219219
base::Any
@@ -368,25 +368,153 @@ function contrasts_matrix(C::HelmertCoding, baseind, n)
368368
end
369369

370370
"""
371-
ContrastsCoding(mat::Matrix[, base[, levels]])
371+
SeqDiffCoding([base[, levels]])
372+
373+
Code each level in order to test "sequential difference" hypotheses, which
374+
compares each level to the level below it (starting with the second level).
375+
Specifically, the ``n``th predictor tests the hypothesis that the difference
376+
between levels ``n`` and ``n+1`` is zero.
377+
378+
# Examples
379+
380+
```jldoctest seqdiff
381+
julia> seqdiff = StatsModels.ContrastsMatrix(SeqDiffCoding(), ["a", "b", "c", "d"]).matrix
382+
4×3 Array{Float64,2}:
383+
-0.75 -0.5 -0.25
384+
0.25 -0.5 -0.25
385+
0.25 0.5 -0.25
386+
0.25 0.5 0.75
387+
```
388+
389+
The interpretation of sequential difference coding may be hard to see from the
390+
contrasts matrix itself. The corresponding hypothesis matrix shows a clearer
391+
picture. From the rows of the hypothesis matrix, we can see that these
392+
contrasts test the difference between the first and second levels, the second
393+
and third, and the third and fourth, respectively:
394+
395+
```jldoctest seqdiff
396+
julia> round.(pinv(seqdiff), digits=2)
397+
3×4 Array{Float64,2}:
398+
-1.0 1.0 -0.0 0.0
399+
-0.0 -1.0 1.0 -0.0
400+
0.0 -0.0 -1.0 1.0
401+
```
402+
403+
"""
404+
SeqDiffCoding
405+
406+
function contrasts_matrix(C::SeqDiffCoding, baseind, n)
407+
mat = zeros(n, n-1)
408+
for col in 1:n-1
409+
mat[1:col, col] .= col-n
410+
mat[col+1:end, col] .= col
411+
end
412+
return mat ./ n
413+
end
414+
415+
416+
"""
417+
HypothesisCoding(hypotheses::Matrix[, levels])
418+
419+
Specify how to code a categorical variable in terms of a *hypothesis matrix*.
420+
For a variable with ``k`` levels, this should be a ``k-1 \times k`` matrix.
421+
Each row of the matrix corresponds to a hypothesis about the mean
422+
outcomes under each of the ``k`` levels of the predictor. The entries in the
423+
row give the weights assigned to each of these ``k`` means, and the
424+
corresponding predictor in a regression model estimates the weighted sum of
425+
these cell means.
426+
427+
For instance, if we have a variable which has four levels A, B, C, and D, and we
428+
want to test the hypothesis that the difference between the average outcomes for
429+
levels A and B is different from zero, the corresponding row of the hypothesis
430+
matrix would be `[-1, 1, 0, 0]`. Likewise, to test whether the difference
431+
between B and C is different from zero, the hypothesis vector would be `[0, -1,
432+
1, 0]`. To test each "successive difference" hypothesis, the full hypothesis
433+
matrix would be
434+
435+
```jldoctest hyp
436+
julia> sdiff_hypothesis = [-1 1 0 0
437+
0 -1 1 0
438+
0 0 -1 1];
439+
```
440+
441+
Contrasts are derived the hypothesis matrix by taking the pseudoinverse:
442+
443+
```jldoctest hyp
444+
julia> sdiff_contrasts = pinv(sdiff_hypothesis)
445+
4×3 Array{Float64,2}:
446+
-0.75 -0.5 -0.25
447+
0.25 -0.5 -0.25
448+
0.25 0.5 -0.25
449+
0.25 0.5 0.75
450+
```
451+
452+
The above matrix is what is produced by constructing a [`ContrastsMatrix`](@ref) from a
453+
`HypothesisCoding` instance:
454+
455+
```jldoctest hyp
456+
julia> StatsModels.ContrastsMatrix(HypothesisCoding(sdiff_hypothesis), ["a", "b", "c", "d"]).matrix
457+
4×3 Array{Float64,2}:
458+
-0.75 -0.5 -0.25
459+
0.25 -0.5 -0.25
460+
0.25 0.5 -0.25
461+
0.25 0.5 0.75
462+
```
463+
464+
The interpretation of the such "sequential difference" contrasts are clear when
465+
expressed as a hypothesis matrix, but it is not obvious just from looking at the
466+
contrasts matrix. For this reason `HypothesisCoding` is preferred for
467+
specifying custom contrast coding schemes over `ContrastsCoding`.
468+
469+
"""
470+
mutable struct HypothesisCoding <: AbstractContrasts
471+
hypotheses::Matrix
472+
contrasts::Matrix
473+
base::Nothing
474+
levels::Union{Vector,Nothing}
475+
476+
function HypothesisCoding(hypotheses, base, levels)
477+
contrasts = pinv(hypotheses)
478+
check_contrasts_size(contrasts, levels)
479+
new(hypotheses, contrasts, base, levels)
480+
end
481+
end
482+
483+
HypothesisCoding(mat::Matrix; levels=nothing) =
484+
HypothesisCoding(mat, nothing, levels)
485+
486+
function contrasts_matrix(C::HypothesisCoding, baseind, n)
487+
check_contrasts_size(C.contrasts, n)
488+
C.contrasts
489+
end
490+
491+
492+
"""
493+
StatsModels.ContrastsCoding(mat::Matrix[, base[, levels]])
372494
373495
Coding by manual specification of contrasts matrix. For k levels, the contrasts
374-
must be a k by k-1 Matrix.
496+
must be a k by k-1 Matrix. The contrasts in this matrix will be copied directly
497+
into the model matrix; if you want to specify your contrasts as hypotheses (i.e.,
498+
weights assigned to each group's cell mean), you should use
499+
[`HypothesisCoding`](@ref) instead.
375500
"""
376501
mutable struct ContrastsCoding <: AbstractContrasts
377502
mat::Matrix
378503
base::Any
379504
levels::Union{Vector,Nothing}
380505

381506
function ContrastsCoding(mat, base, levels)
382-
if levels !== nothing
383-
check_contrasts_size(mat, length(levels))
384-
end
507+
Base.depwarn("`ContrastsCoding(contrasts)` is deprecated and will not be exported" *
508+
" in the future, use `HypothesisCoding(pinv(contrasts))` instead.",
509+
:ContrastsCoding)
510+
check_contrasts_size(mat, levels)
385511
new(mat, base, levels)
386512
end
387513
end
388514

389-
check_contrasts_size(mat::Matrix, n_lev) =
515+
check_contrasts_size(mat::Matrix, ::Nothing) = check_contrasts_size(mat, size(mat,1))
516+
check_contrasts_size(mat::Matrix, levels::Vector) = check_contrasts_size(mat, length(levels))
517+
check_contrasts_size(mat::Matrix, n_lev::Int) =
390518
size(mat) == (n_lev, n_lev-1) ||
391519
throw(ArgumentError("contrasts matrix wrong size for $n_lev levels. " *
392520
"Expected $((n_lev, n_lev-1)), got $(size(mat))"))

test/contrasts.jl

Lines changed: 77 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -113,6 +113,12 @@
113113
1 1]
114114
@test coefnames(mf_missing) == ["(Intercept)"; "x: b"]
115115

116+
# Sequential difference coding
117+
setcontrasts!(mf, x = SeqDiffCoding())
118+
seqdiff_contr = pinv([-1 1 0
119+
0 -1 1]);
120+
@test ModelMatrix(mf).m hcat(ones(6), seqdiff_contr[[2, 1, 3, 1, 1, 2], :])
121+
116122
# Things that are bad to do:
117123
# Applying contrasts that only have a subset of data levels:
118124
@test_throws ArgumentError setcontrasts!(mf, x = EffectsCoding(levels = [:a, :b]))
@@ -125,20 +131,88 @@
125131
contrasts = [0 1
126132
-1 -.5
127133
1 -.5]
128-
setcontrasts!(mf, x = ContrastsCoding(contrasts))
134+
setcontrasts!(mf, x = StatsModels.ContrastsCoding(contrasts))
129135
@test ModelMatrix(mf).m == [1 -1 -.5
130136
1 0 1
131137
1 1 -.5
132138
1 0 1
133139
1 0 1
134140
1 -1 -.5]
135141

142+
contrasts2 = [1 0
143+
1 1
144+
0 1]
145+
setcontrasts!(mf, x = StatsModels.ContrastsCoding(contrasts2))
146+
@test ModelMatrix(mf).m == [1 1 1
147+
1 1 0
148+
1 0 1
149+
1 1 0
150+
1 1 0
151+
1 1 1]
152+
153+
hypotheses2 = pinv(contrasts2)
154+
setcontrasts!(mf, x = StatsModels.HypothesisCoding(hypotheses2))
155+
@test ModelMatrix(mf).m [1 1 1
156+
1 1 0
157+
1 0 1
158+
1 1 0
159+
1 1 0
160+
1 1 1]
161+
162+
# different results for non-orthogonal hypotheses/contrasts:
163+
hypotheses3 = [1 1 0
164+
0 1 1]
165+
setcontrasts!(mf, x = StatsModels.HypothesisCoding(hypotheses3))
166+
@test !(ModelMatrix(mf).m [1 1 1
167+
1 1 0
168+
1 0 1
169+
1 1 0
170+
1 1 0
171+
1 1 1])
172+
136173
# throw argument error if number of levels mismatches
137-
@test_throws ArgumentError setcontrasts!(mf, x = ContrastsCoding(contrasts[1:2, :]))
138-
@test_throws ArgumentError setcontrasts!(mf, x = ContrastsCoding(hcat(contrasts, contrasts)))
174+
@test_throws ArgumentError setcontrasts!(mf, x = StatsModels.ContrastsCoding(contrasts[1:2, :]))
175+
@test_throws ArgumentError setcontrasts!(mf, x = StatsModels.ContrastsCoding(hcat(contrasts, contrasts)))
139176

140177
# contrasts types must be instantiated (should throw ArgumentError, currently
141178
# MethodError on apply_schema)
142179
@test_broken setcontrasts!(mf, x = DummyCoding)
143180

181+
@testset "hypothesis coding" begin
182+
183+
# to get the scaling right, divide by three (because three levels)
184+
effects_hyp = [-1 2 -1
185+
-1 -1 2] ./ 3
186+
187+
@test modelmatrix(setcontrasts!(mf, x = HypothesisCoding(effects_hyp)))
188+
modelmatrix(setcontrasts!(mf, x = EffectsCoding()))
189+
190+
d2 = DataFrame(y = rand(100),
191+
x = repeat([:a, :b, :c, :d], inner=25))
192+
193+
sdiff_hyp = HypothesisCoding([-1 1 0 0
194+
0 -1 1 0
195+
0 0 -1 1])
196+
197+
effects_hyp = HypothesisCoding([-1 3 -1 -1
198+
-1 -1 3 -1
199+
-1 -1 -1 3] ./ 4)
200+
201+
f = apply_schema(@formula(y ~ 1 + x), schema(d2))
202+
203+
f_sdiff = apply_schema(f, schema(d2, Dict(:x => sdiff_hyp)))
204+
f_effects = apply_schema(f, schema(d2, Dict(:x => effects_hyp)))
205+
206+
y_means = by(d2, :x, :y => mean).y_mean
207+
208+
y, X_sdiff = modelcols(f_sdiff, d2)
209+
@test X_sdiff \ y [mean(y_means); diff(y_means)]
210+
211+
y, X_effects = modelcols(f_effects, d2)
212+
@test X_effects \ y [mean(y_means); y_means[2:end] .- mean(y_means)]
213+
214+
@test X_effects modelcols(apply_schema(f.rhs,
215+
schema(d2, Dict(:x=>EffectsCoding()))),
216+
d2)
217+
end
144218
end

0 commit comments

Comments
 (0)