diff --git a/LDT_accuracy.qmd b/LDT_accuracy.qmd index b729406..24dafff 100644 --- a/LDT_accuracy.qmd +++ b/LDT_accuracy.qmd @@ -27,19 +27,23 @@ and define some constants ```{julia} #| code-fold: true #| output: false -@isdefined(contrasts) || const contrasts = Dict{Symbol, Any}() +@isdefined(contrasts) || const contrasts = Dict{Symbol,Any}() @isdefined(progress) || const progress = false ``` ## Create the dataset - ```{julia} #| output: false trials = innerjoin( - DataFrame(dataset(:ELP_ldt_trial)), - select(DataFrame(dataset(:ELP_ldt_item)), :item, :isword, :wrdlen), - on = :item + DataFrame(dataset(:ELP_ldt_trial)), + select( + DataFrame(dataset(:ELP_ldt_item)), + :item, + :isword, + :wrdlen, + ), + on=:item, ) ``` @@ -54,20 +58,28 @@ This takes about ten to fifteen minutes on a recent laptop ```{julia} contrasts[:isword] = EffectsCoding() contrasts[:wrdlen] = Center(8) -@time gm1 = let f = @formula(acc ~ 1 + isword * wrdlen + (1|item) + (1|subj)) - fit(MixedModel, f, trials, Bernoulli(); contrasts, progress, init_from_lmm=(:β, :θ)) -end +@time gm1 = + let f = + @formula(acc ~ 1 + isword * wrdlen + (1 | item) + (1 | subj)) + fit( + MixedModel, + f, + trials, + Bernoulli(); + contrasts, + progress, + init_from_lmm=(:β, :θ), + ) + end ``` - ```{julia} print(gm1) ``` - ```{julia} #| fig-cap: Conditional modes and 95% prediction intervals on random effects for subject in model gm1 #| label: fig-gm1condmodesubj #| code-fold: true -qqcaterpillar!(Figure(; size=(800,800)), gm1, :subj) -``` \ No newline at end of file +qqcaterpillar!(Figure(; size=(800, 800)), gm1, :subj) +``` diff --git a/Project.toml b/Project.toml index 5a824dd..a6ccda7 100644 --- a/Project.toml +++ b/Project.toml @@ -6,11 +6,11 @@ version = "0.1.0" [deps] AlgebraOfGraphics = "cbdf2221-f076-402e-a563-3d30da359d67" Arrow = "69666777-d1a9-59fb-9406-91d4454c9d45" -BenchmarkTools = "6e4b80f9-dd63-53aa-95a3-0cdb28fa8baf" CSV = "336ed68f-0bac-5ca0-87d4-7b16caf5d00b" CairoMakie = "13f3f980-e62b-5c42-98c6-ff1f3baf88f0" CategoricalArrays = "324d7699-5711-5eae-9e2f-1d82baa6b597" Chain = "8be319e6-bccf-4806-a6f7-6fae938471bc" +Chairmarks = "0ca39b1e-fe0b-4e98-acfc-b1656634c4de" DataAPI = "9a962f9c-6df0-11e9-0e5d-c546b8b5ee8a" DataFrameMacros = "75880514-38bc-4a95-a458-c2aea5a3a702" DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0" @@ -71,6 +71,7 @@ StatsAPI = "1" StatsBase = "0.33, 0.34" StatsModels = "0.7" Tables = "1" +TidierPlots = "0.7,0.8" TypedTables = "1" ZipFile = "0.10" julia = "1.8" diff --git a/_quarto.yml b/_quarto.yml index e5684fc..2028a55 100644 --- a/_quarto.yml +++ b/_quarto.yml @@ -70,9 +70,7 @@ format: engine: julia julia: - exeflags: - - --project - - -tauto + exeflags: ["-tauto", "--project"] execute: freeze: auto diff --git a/aGHQ.qmd b/aGHQ.qmd index a8a82ea..389f2b8 100644 --- a/aGHQ.qmd +++ b/aGHQ.qmd @@ -5,7 +5,7 @@ fig-dpi: 150 fig-format: png engine: julia julia: - exeflags: ["--project"] + exeflags: ["-tauto", "--project"] --- # GLMM log-likelihood {#sec-GLMMdeviance} @@ -136,8 +136,9 @@ Load the packages to be used #| output: false #| label: packagesA03 using AlgebraOfGraphics -using BenchmarkTools using CairoMakie +using Chairmarks # a more modern BenchmarkTools +using DataFrames using EmbraceUncertainty: dataset using FreqTables using LinearAlgebra @@ -146,6 +147,7 @@ using MixedModelsMakie using NLopt using PooledArrays using StatsAPI +using TidierPlots ``` and define some constants @@ -156,6 +158,8 @@ and define some constants #| label: constantsA03 @isdefined(contrasts) || const contrasts = 
Dict{Symbol,Any}() @isdefined(progress) || const progress = false +TidierPlots_set("plot_show", false) +TidierPlots_set("plot_log", false) ``` ## Generalized linear models for binary data {#sec-BernoulliGLM} @@ -402,8 +406,7 @@ Each evaluation of the deviance is fast, requiring only a fraction of a millisec ```{julia} βopt = copy(com05fe.β) -@benchmark deviance(setβ!(m, β)) seconds = 1 setup = - (m = com05fe; β = βopt) +@be deviance(setβ!($com05fe, $βopt)) ``` but the already large number of evaluations for these six coefficients would not scale well as this dimension increases. @@ -635,7 +638,7 @@ The IRLS algorithm has converged in 4 iterations to essentially the same devianc Each iteration of the IRLS algorithm takes more time than a deviance evaluation, but still only a fraction of a millisecond on a laptop computer. ```{julia} -@benchmark deviance(updateβ!(m)) seconds = 1 setup = (m = com05fe) +@be deviance(updateβ!($com05fe)) ``` ## GLMMs and the PIRLS algorithm {#sec-PIRLS} @@ -866,7 +869,7 @@ pirls!(m; verbose=true); As with IRLS, PIRLS is a fast and stable algorithm for determining the mode of the conditional distribution $(\mcU|\mcY=\bby)$ with $\bbtheta$ and $\bbbeta$ held fixed. ```{julia} -@benchmark pirls!(mm) seconds = 1 setup = (mm = m) +@be pirls!($m) ``` The time taken for the four iterations to determine the conditional mode of $\bbu$ is comparable to the time taken for a single call to `updateβ!`. @@ -1127,28 +1130,34 @@ The weights and positions for the 9th order rule are shown in @fig-ghnine. #| code-fold: true #| fig-cap: Weights and positions for the 9th order normalized Gauss-Hermite quadrature rule #| label: fig-ghnine -#| warning: false -draw( - data(gausshermitenorm(9)) * - mapping(:abscissae => "Positions", :weights); - figure=(; size=(600, 450)), -) +df9 = DataFrame(gausshermitenorm(9)) +ggplot(df9, aes(; x=:abscissae, y=:weights)) + +geom_point() + +labs(; x="Positions", y="Weights") +# draw( +# data(gausshermitenorm(9)) * +# mapping(:abscissae => "Positions", :weights); +# figure=(; size=(600,450)), +# ) ``` -Notice that the magnitudes of the weights drop quite dramatically away from zero, even on a logarithmic scale (@fig-ghninelog) +Notice that the magnitudes of the weights drop quite dramatically as the position moves away from zero, even on a logarithmic scale (@fig-ghninelog) ```{julia} #| code-fold: true -#| fig-cap: Weights (logarithm base 2) and positions for the 9th order normalized Gauss-Hermite quadrature rule +#| fig-cap: Weights (logarithm base 10) and positions for the 9th order normalized Gauss-Hermite quadrature rule #| label: fig-ghninelog -#| warning: false -draw( - data(gausshermitenorm(9)) * mapping( - :abscissae => "Positions", - :weights => log2 => "log₂(weight)", - ); - figure=(; size=(600, 450)), -) +ggplot(df9, aes(; x=:abscissae, y=:weights)) + +geom_point() + +labs(; x="Positions", y="Weights") + +scale_y_log10() +# draw( +# data(gausshermitenorm(9)) * mapping( +# :abscissae => "Positions", +# :weights => log2 => "log₂(weight)", +# ); +# figure=(; size=(600,450)), +# ) ``` The definition of `MixedModels.GHnorm` is similar to the `gausshermitenorm` function with some extra provisions for ensuring symmetry of the abscissae and of the weights and for caching values once they have been calculated. 
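+
+For example, the expected value of a quadratic function of a Gaussian random variable can be evaluated exactly with a low-order rule (a sketch, assuming the value returned by `GHnorm` exposes the abscissae as the field `z` and the weights as the field `w`):
+
+```julia
+gh5 = GHnorm(5)   # evaluated once, then cached for later calls
+μ, σ = 2.0, 1.5
+# E[X²] for X ~ N(μ, σ²) is μ² + σ² = 6.25; a 5-point rule is exact
+# for a quadratic integrand
+sum(@. gh5.w * abs2(μ + σ * gh5.z))
+```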
diff --git a/datatables.qmd b/datatables.qmd
index 9eea74f..94f4823 100644
--- a/datatables.qmd
+++ b/datatables.qmd
@@ -11,8 +11,8 @@ Load the packages to be used
using DataFrames
using EmbraceUncertainty: dataset
using MixedModels
-using Tables
-using TypedTables
+using Tables: Tables, columnnames, table
+using TypedTables: Table, getproperties
```

A call to fit a mixed-effects model using the `MixedModels` package follows the *formula/data* specification that is common to many statistical model-fitting packages, especially those in [R](https://r-project.org).
@@ -27,7 +27,7 @@ These include the *data.frame* type in [R](https://R-Project.org) and the [data.
-An increasing popular representation of data frames is as [Arrow](https://arrow.apache.org) tables.
+An increasingly popular representation of data frames is as [Arrow](https://arrow.apache.org) tables.
[Polars](https://pola.rs) uses the Arrow format internally as does the [DuckDB](https://duckdb.org) database.
-Recently it was announced that the 2.0 release of [pandas](https://pandas.pydata.org) will allow for Arrow tables.
+Also, the 2.0 and later releases of [pandas](https://pandas.pydata.org) allow for Arrow tables.

Arrow specifies a language-independent tabular memory format that provides many desirable properties, such as provision for missing data and compact representation of categorical data vectors, for data science.
The memory format also defines a file format for storing and exchanging data tables.
@@ -55,9 +55,10 @@ Before going into detail about the properties and use of the `Table` type, let u
An important characteristic of any system for working with data tables is whether the table is stored in memory column-wise or row-wise.
As described above, most implementations of data frames for data science store the data column-wise.
+That is, elements of a column are stored in adjacent memory locations, allowing for fast traversals of columns.

In *relational database management systems* (RDBMS), such as [PostgreSQL](https://postgresql.org), [SQLite](https://sqlite.org), and a multitude of commercial systems, a data table, called a *relation*, is typically stored row-wise.
-Such systems typically use [SQL](https://en.wikipedia.org/wiki/SQL), the *structured query language*, to define and access the data in tables, which is why that acronym appears in many of the names.
+Such systems generally use [SQL](https://en.wikipedia.org/wiki/SQL), the *structured query language*, to define and access the data in tables, which is why that acronym appears in many of the names.
-There are exceptions to the row-wise rule, such as [DuckDB](https://duckdb.org), an SQL-based RDBMS, that, as mentioned above, represents relations as Arrow tables.
+There are exceptions to the row-wise rule, such as [DuckDB](https://duckdb.org), an SQL-based RDBMS that, as mentioned above, represents relations as Arrow tables.

Many external representations of data tables, such as in comma-separated-value (CSV) files, are row-oriented.
@@ -82,11 +83,61 @@ Tables.RowTable

The actual implementation of a row-table or column-table type may be different from these prototypes but it must provide access methods as if it were one of these types.
`Tables.jl` provides the "glue" to treat a particular data table type as if it were row-oriented, by calling `Tables.rows` or `Tables.rowtable` on it, or column-oriented, by calling `Tables.columntable` on it.

+### MatrixTable as a concrete table type
+
+[Tables.jl](https://github.com/JuliaData/Tables.jl) does provide one concrete table type, which is a `MatrixTable`.
+As the name implies, this is a column-table view of a `Matrix`.
+Because the data values are stored in a matrix, the columns must be homogeneous; that is, the element types of all the columns must be the same.
+
+We will use this type of table in @sec-GLMMdeviance to evaluate and store several related properties of each observation in a generalized linear model.
+In this case, all of the properties are expressed as floating-point numbers.
+We construct the table as
+
+```{julia}
+ytbl =
+  let header = (:y, :offset, :η, :μ, :dev, :rtwwt, :wwresp)
+
+    tb = table(
+      zeros(Float64, length(contra.use), length(header));
+      header,
+    )
+    tb.y .= contra.use .≠ "N"
+    tb
+  end
+last(Table(ytbl), 8)
+```
+
+The columns of the table are accessed as *properties*, e.g. `ytbl.η`.
+
+A common idiom when passing a column table to a Julia function is to *destructure* the table into individual columns, which binds several names to properties of the table in one statement.
+That is, instead of beginning a function with several statements like
+
+```julia
+function updateμ!(ytbl)
+  y = ytbl.y
+  η = ytbl.η
+  offset = ytbl.offset
+  μ = ytbl.μ
+  ...
+  return ytbl
+end
+```
+
+one can write
+
+```julia
+function updateμ!(ytbl)
+  (; y, η, offset, μ) = ytbl
+  ...
+  return ytbl
+end
+```
+
## The Table type from TypedTables

[TypedTables.jl](https://github.com/JuliaData/TypedTables.jl) is a lightweight package (about 1500 lines of source code) that provides a concrete implementation of column-tables, called simply `Table`, as a `NamedTuple` of vectors.
-A `Table` that is constructed from another type of column-table, such as an `Arrow.Table` or a `DataFrame` or an explicit `NamedTuple` of vectors, is simply a wrapper around the original table's contents.
+A `Table` that is constructed from another type of column-table, such as an `Arrow.Table` or a `DataFrame` or an explicit `NamedTuple` of vectors, is just a wrapper around the original table's contents.
On the other hand, constructing a `Table` from a row table first creates a `ColumnTable`, then wraps it.

```{julia}
diff --git a/glmmbernoulli.qmd b/glmmbernoulli.qmd
index fd49de2..3abba66 100644
--- a/glmmbernoulli.qmd
+++ b/glmmbernoulli.qmd
@@ -149,11 +149,12 @@ contrasts[:urban] = HelmertCoding()
com01 = let d = contra,
  ds = Bernoulli(),
-  f = @formula(use ~
-    1 + livch + (age + abs2(age)) * urban + (1 | dist))
+  f = @formula(
+    use ~ 1 + livch + (age + abs2(age)) * urban + (1 | dist)
+  )

-  fit(MixedModel, f, d, ds; contrasts, nAGQ, progress)
-end
+    fit(MixedModel, f, d, ds; contrasts, nAGQ, progress)
+  end
```

Notice that in the formula language defined by the [StatsModels](https://github.com/JuliaStats/StatsModels.jl) package, an interaction term is written with the `&` operator.
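+
+For example (a sketch of the formula-expansion rules only, separate from the model fits above), the `*` operator denotes main effects plus interactions, so these two formulas specify the same model:
+
+```julia
+using StatsModels
+# `a * b` expands to `a + b + a & b`
+f1 = @formula(use ~ 1 + age * urban)
+# the equivalent fully expanded form, writing the interaction with `&`
+f2 = @formula(use ~ 1 + age + urban + age & urban)
+```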
@@ -284,8 +285,9 @@ A series of such model fits led to a model with random effects for the combinati

```{julia}
com05 =
-  let f = @formula(use ~
-      1 + urban + ch * age + abs2(age) + (1 | dist & urban)),
+  let f = @formula(
+      use ~ 1 + urban + ch * age + abs2(age) + (1 | dist & urban)
+    ),
    d = contra,
    ds = Bernoulli()

diff --git a/intro.qmd b/intro.qmd
index df93173..746ac88 100644
--- a/intro.qmd
+++ b/intro.qmd
@@ -5,7 +5,7 @@ fig-dpi: 150
fig-format: png
engine: julia
julia:
-  exeflags: ["--project"]
+  exeflags: ["-tauto", "--project"]
---

# A simple, linear, mixed-effects model {#sec-examlmm}
@@ -117,8 +117,11 @@ using MixedModels # fit and examine mixed-effects models
using MixedModelsMakie # graphics for mixed-effects models
using Random # random number generation
using StatsBase # basic statistical summaries
+using TidierPlots # ggplot2-like graphics in Julia

using EmbraceUncertainty: dataset # `dataset` means this one
+TidierPlots_set("plot_show", false)
+TidierPlots_set("plot_log", false)
```

A package must be attached before any of the data sets or functions in the package can be used.
@@ -620,6 +623,7 @@ Plots of the bootstrap estimates for individual parameters are obtained by extra
For example,

```{julia}
+tbl = DataFrame(dsm01samp.tbl)
βdf = @subset(dsm01pars, :type == "β")
```

@@ -632,15 +636,9 @@ You can check the details by clicking on the "Code" button in the HTML version o
#| fig-cap: Kernel density plot of bootstrap fixed-effects parameter estimates from dsm01
#| label: fig-dsm01_bs_beta_density
#| warning: false
-draw(
-  data(@subset(dsm01pars, :type == "β")) *
-  mapping(
-    :value => "Bootstrap samples of β";
-    color=(:names => "Names"),
-  ) *
-  AlgebraOfGraphics.density();
-  figure=(; size=(600, 340)),
-)
+ggplot(tbl, aes(; x=:β1)) +
+geom_density() +
+labs(; x="Bootstrap samples of β₁")
```

The distribution of the estimates of `β₁` is more-or-less a Gaussian (or "normal") shape, with a mean value of `{julia} repr(mean(βdf.value), context=:compact=>true)` which is close to the estimated `β₁` of `{julia} repr(only(dsm01.β), context=:compact=>true)`.
@@ -657,15 +655,18 @@ The situation is different for the estimates of the standard deviation parameter
#| fig-cap: Kernel density plot of bootstrap variance-component parameter estimates from model dsm01
#| label: fig-dsm01_bs_sigma_density
#| warning: false
-draw(
-  data(@subset(dsm01pars, :type == "σ")) *
-  mapping(
-    :value => "Bootstrap samples of σ";
-    color=(:group => "Group"),
-  ) *
-  AlgebraOfGraphics.density();
-  figure=(; size=(600, 340)),
-)
+ggplot(stack(tbl, [:σ, :σ1]), aes(; x=:value, color=:variable)) +
+geom_density(; alpha=0.6) +
+labs(; x="Bootstrap samples of σ")
+# draw(
+#   data(@subset(dsm01pars, :type == "σ")) *
+#   mapping(
+#     :value => "Bootstrap samples of σ";
+#     color=(:group => "Group"),
+#   ) *
+#   AlgebraOfGraphics.density();
+#   figure=(; size=(600, 340)),
+# )
```

-The estimator for the residual standard deviation, $\sigma$, is approximately normally distributed but the estimator for $\sigma_1$, the standard deviation of the `batch` random effects is bimodal (i.e. has two "modes" or local maxima).
+The estimator for the residual standard deviation, $\sigma$, is approximately normally distributed, but the estimator for $\sigma_1$, the standard deviation of the `batch` random effects, is bimodal (i.e. it has two "modes" or local maxima).
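+
+One way to quantify the bimodality is to count the bootstrap replicates in which $\sigma_1$ is estimated as essentially zero (a sketch using the `tbl` data frame extracted above; the 0.01 threshold is an arbitrary illustrative choice):
+
+```julia
+# proportion of bootstrap replicates with σ₁ estimated as essentially zero
+count(<(0.01), tbl.σ1) / length(tbl.σ1)
+```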
diff --git a/largescaledesigned.qmd b/largescaledesigned.qmd
index cdb2bca..2066a54 100644
--- a/largescaledesigned.qmd
+++ b/largescaledesigned.qmd
@@ -5,7 +5,7 @@ fig-dpi: 150
fig-format: png
engine: julia
julia:
-  exeflags: ["--project"]
+  exeflags: ["-tauto", "--project"]
---

# A large-scale designed experiment {#sec-largescaledesigned}
@@ -17,7 +17,6 @@ Load the packages to be used
#| output: false
#| label: packages04
using AlgebraOfGraphics
-using Arrow
using CairoMakie
using Chain
using DataFrameMacros
@@ -29,6 +28,7 @@ using MixedModels
using MixedModelsMakie
using StandardizedPredictors
using StatsBase
+using TidierPlots
```

and define some constants, if not already defined.
@@ -39,6 +39,8 @@ and define some constants, if not already defined.
#| label: constants04
@isdefined(contrasts) || const contrasts = Dict{Symbol,Any}()
@isdefined(progress) || const progress = false
+TidierPlots_set("plot_show", false)
+TidierPlots_set("plot_log", false)
```

As with many techniques in data science, the place where "the rubber meets the road", as they say in the automotive industry, for mixed-effects models is when working on large-scale studies.
@@ -156,21 +158,24 @@ The named argument `error=false` is required because there is one column, `acc`,
If `error=false` were not given then the error thrown when trying to `disallowmissing` on the `acc` column would be propagated and the top-level call would fail.
:::

-A barchart of the word length counts, @fig-ldtwrdlenhist, shows that the majority of the items are between 3 and 14 characters.
+A bar plot of the word length counts, @fig-ldtwrdlenhist, shows that the majority of the items are between 3 and 14 characters.

```{julia}
#| code-fold: true
-#| fig-cap: "Histogram of word lengths in the items used in the lexical decision task."
+#| fig-cap: "Bar plot of word lengths in the items used in the lexical decision task."
#| label: fig-ldtwrdlenhist
-#| warning: false
-let wlen = 1:21
-  draw(
-    data((; wrdlen=wlen, count=counts(byitem.wrdlen, wlen))) *
-    mapping(:wrdlen => "Length of word", :count) *
-    visual(BarPlot);
-    figure=(; size=(600, 450)),
-  )
-end
+ggplot(byitem, aes(; x=:wrdlen)) +
+geom_bar() +
+labs(; x="Word length")
+# #| warning: false
+# let wlen = 1:21
+#   draw(
+#     data((; wrdlen=wlen, count=counts(byitem.wrdlen, wlen))) *
+#     mapping(:wrdlen => "Length of word", :count) *
+#     visual(BarPlot);
+#     figure=(; size=(600, 450))
+#   )
+# end
```

To examine trends in accuracy by word length we use a scatterplot smoother on the binary response, as described in @sec-plottingbinary.
@@ -285,16 +290,10 @@ A plot of the median response time versus proportion accurate, @fig-ldtmedianrtv
#| code-fold: true
#| fig-cap: "Median response time versus proportion accurate by subject in the LDT."
#| label: fig-ldtmedianrtvspropacc
-#| warning: false
-draw(
-  data(bysubj) *
-  mapping(
-    :spropacc => "Proportion accurate",
-    :smedianrt => "Median response time (ms)",
-  ) *
-  (smooth() + visual(Scatter));
-  figure=(; size=(600, 450)),
-)
+ggplot(bysubj, aes(; x=:spropacc, y=:smedianrt)) +
+geom_point() +
+geom_smooth(; method="smooth") +
+labs(; x="Proportion accurate", y="Median response time (ms)")
```

As described in @Balota_2007, the participants performed the trials in blocks of 250 followed by a short break.
@@ -350,13 +349,15 @@ A density plot of the pruned response times, @fig-elpldtrtdens, shows they are s
#| code-fold: true
#| fig-cap: Kernel density plot of the pruned response times (ms.) in the LDT.
#| label: fig-elpldtrtdens -#| warning: false -draw( - data(pruned) * - mapping(:rt => "Response time (ms.) for correct responses") * - AlgebraOfGraphics.density(); - figure=(; size=(600, 340)), -) +ggplot(pruned, aes(; x=:rt)) + +geom_density() + +labs(; x="Response time (ms.) for correct responses") +# draw( +# data(pruned) * +# mapping(:rt => "Response time (ms.) for correct responses") * +# AlgebraOfGraphics.density(); +# figure=(; size=(600, 340)), +# ) ``` In such cases it is common to transform the response to a scale such as the logarithm of the response time or to the speed of the response, which is the inverse of the response time. @@ -367,18 +368,20 @@ The density of the response speed, in responses per second, is shown in @fig-elp #| code-fold: true #| fig-cap: Kernel density plot of the pruned response speed in the LDT. #| label: fig-elpldtspeeddens -#| warning: false -draw( - data(pruned) * - mapping( - :rt => - ( - x -> 1000 / x - ) => "Response speed (s⁻¹) for correct responses", - ) * - AlgebraOfGraphics.density(); - figure=(; size=(600, 340)), -) +ggplot(transform!(pruned, :rt => (x -> 1000 ./ x) => :speed)) + +geom_density(aes(; x=:speed)) + +labs(; x="Response speed (s⁻¹) for correct responses") +# draw( +# data(pruned) * +# mapping( +# :rt => +# ( +# x -> 1000 / x +# ) => "Response speed (s⁻¹) for correct responses", +# ) * +# AlgebraOfGraphics.density(); +# figure=(; size=(600, 340)), +# ) ``` @fig-elpldtrtdens and @fig-elpldtspeeddens indicate that it may be more reasonable to establish a lower bound of 1/3 second (333 ms) on the response latency, corresponding to an upper bound of 3 per second on the response speed. @@ -520,9 +523,23 @@ To compare the conditional means of the random effects for item in these two mod #| code-fold: true #| fig-cap: "Conditional means of scalar random effects for item in model elm01, fit to the pruned data, versus those for model elm02, fit to the pruned data with inaccurate subjects removed." #| label: fig-itemreelm01vselm02 -#| warning: false - -CairoMakie.activate!(; type="png") +## FIXME: figure out how to change the alpha on the points +# ggplot( # this throws an error on the use of color=:isword - can't work out why +# leftjoin!( +# leftjoin!( +# rename!(DataFrame(raneftables(elm01)[:item]), [:item, :elm01]), +# rename!(DataFrame(raneftables(elm02)[:item]), [:item, :elm02]); +# on=:item, +# ), +# select(byitem, :item, :isword => string; renamecols=false); +# on=:item, +# ), +# aes(; x=:elm01, y=:elm02), +# ) + geom_point(aes(; color=:isword); alpha=0.2) + +# labs(; +# x="Conditional means of item random effects for model elm01", +# y="Conditional means of item random effects for model elm02", +# ) condmeans = leftjoin!( leftjoin!( rename!(DataFrame(raneftables(elm01)[:item]), [:item, :elm01]), @@ -542,10 +559,6 @@ draw( ) ``` -::: {.callout-note} -Adjust the alpha on @fig-itemreelm01vselm02. -::: - @fig-itemreelm01vselm02 is exactly of the form that would be expected in a sample from a correlated multivariate Gaussian distribution. The correlation of the two sets of conditional means is about 96%. diff --git a/longitudinal.qmd b/longitudinal.qmd index 9000d24..c4b8557 100644 --- a/longitudinal.qmd +++ b/longitudinal.qmd @@ -46,6 +46,7 @@ using MixedModelsMakie using Random using RCall using StandardizedPredictors +using TidierPlots ``` and declare some constants, if not already defined. @@ -56,6 +57,8 @@ and declare some constants, if not already defined. 
#| label: constants03 @isdefined(contrasts) || const contrasts = Dict{Symbol,Any}() @isdefined(progress) || const progress = false +TidierPlots_set("plot_show", false) +TidierPlots_set("plot_log", false) ``` Longitudinal data consist of repeated measurements on the same subject, or some other observational unit, taken over time. @@ -88,7 +91,8 @@ A common way of plotting such longitudinal data is response versus time on a sin #| code-fold: true #| fig-cap: Length of ramus bone versus age for a sample of 20 boys. #| label: fig-egspaghetti -#| warning: false +# ggplot(egdf, aes(; x=:time, y=:resp, color=:Subj)) + # need to figure out colormap +# geom_point(; size=10)# + geom_line() draw( data(egdf) * mapping( @@ -587,6 +591,7 @@ bxm03samp = parametricbootstrap( progress=false, ) bxm03pars = DataFrame(bxm03samp.allpars) +tbl = DataFrame(bxm03samp.tbl) DataFrame(shortestcovint(bxm03samp)) ``` @@ -598,16 +603,18 @@ A kernel density plot, @fig-bxm03rhodens, of the parametric bootstrap estimates #| code-fold: true #| fig-cap: Kernel density plots of parametric bootstrap estimates of correlation estimates from model bxm03 #| label: fig-bxm03rhodens -#| warning: false -draw( - data(@subset(bxm03pars, :type == "ρ")) * - mapping( - :value => "Bootstrap replicates of correlation estimates"; - color=(:names => "Variables"), - ) * - AlgebraOfGraphics.density(); - figure=(; size=(600, 400)), -) +ggplot(stack(tbl, [:ρ1, :ρ2, :ρ3]), aes(; x=:value, color=:variable)) + + geom_density() + + labs(; x="Bootstrap replicates of correlation estimates") +# draw( +# data(@subset(bxm03pars, :type == "ρ")) * +# mapping( +# :value => "Bootstrap replicates of correlation estimates"; +# color=(:names => "Variables"), +# ) * +# AlgebraOfGraphics.density(); +# figure=(; size=(600, 400)), +# ) ``` Even on the scale of [Fisher's z transformation](https://en.wikipedia.org/wiki/Fisher_transformation), @fig-bxm03rhodensatanh, these estimates are highly skewed. diff --git a/multiple.qmd b/multiple.qmd index 011a4a4..c88613a 100644 --- a/multiple.qmd +++ b/multiple.qmd @@ -5,7 +5,7 @@ fig-dpi: 150 fig-format: png engine: julia julia: - exeflags: ["--project"] + exeflags: ["-tauto", "--project"] --- # Models with multiple random-effects terms {#sec-Multiple} @@ -47,6 +47,7 @@ using MixedModelsMakie using Random using RCall using StatsBase +using TidierPlots ``` and define some constants, if not already defined, @@ -57,6 +58,8 @@ and define some constants, if not already defined, #| label: constants02 @isdefined(contrasts) || const contrasts = Dict{Symbol,Any}() @isdefined(progress) || const progress = false +TidierPlots_set("plot_show", false) +TidierPlots_set("plot_log", false) ``` The mixed models considered in the previous chapter had only one random-effects term, which was a simple, scalar random-effects term, and a single fixed-effects coefficient. @@ -111,45 +114,28 @@ of the data, then plot it #| code-fold: true #| fig-cap: "Diameter of inhibition zone by plate and sample. Plates are ordered by increasing mean response." #| label: fig-penicillindot -#| warning: false -""" - _meanrespfrm(df, :resp::Symbol, :grps::Symbol; sumryf::Function=mean) - -Returns a `DataFrame` created from df with the levels of `grps` reordered according to -`combine(groupby(df, grps), resp => sumryf)` and this summary DataFrame, also with the -levels of `grps` reordered. 
-""" -function _meanrespfrm( - df, - resp::Symbol, - grps::Symbol; - sumryf::Function=mean, -) - # ensure the relevant columns are types that Makie can deal with - df = transform( - df, - resp => Array, - grps => CategoricalArray; - renamecols=false, +let sumry = sort!( + combine( + groupby(penicillin, :plate), + :diameter => mean => :meandia, + ), + :meandia, + ), + df = sort( + transform( + penicillin, + :plate => ByRow(sorter(sumry.plate)); + renamecols=false, + ), + :plate, ) - # create a summary table by mean resp - sumry = - sort!(combine(groupby(df, grps), resp => sumryf => resp), resp) - glevs = string.(sumry[!, grps]) # group levels in ascending order of mean resp - levels!(df[!, grps], glevs) - levels!(sumry[!, grps], glevs) - return df, sumry -end - -let - df, _ = _meanrespfrm(penicillin, :diameter, :plate) - sort!(df, [:plate, :sample]) mp = mapping( :diameter => "Diameter of inhibition zone [mm]", - :plate => "Plate"; + :plate => "Plate", color=:sample, ) + draw( data(df) * mp * visual(ScatterLines; marker='○', markersize=12); figure=(; size=(600, 450)), @@ -255,6 +241,7 @@ A parametric bootstrap sample of the parameter estimates #| code-fold: true bsrng = Random.seed!(9876789) pnm01samp = parametricbootstrap(bsrng, 10_000, pnm01; progress) +tbl = DataFrame(pnm01samp.tbl) pnm01pars = DataFrame(pnm01samp.allpars); ``` @@ -271,15 +258,9 @@ As for model `dsm01` the bootstrap parameter estimates of the fixed-effects para #| fig-cap: "Parametric bootstrap estimates of fixed-effects parameters in model pnm01" #| label: fig-pnm01bsbeta #| warning: false -draw( - data(@subset(pnm01pars, :type == "β")) * - mapping( - :value => "Bootstrap samples of β"; - color=(:names => "Names"), - ) * - AlgebraOfGraphics.density(); - figure=(; size=(600, 340)), -) +ggplot(tbl, aes(x=:β1)) + +geom_density() + +labs(; x="Bootstrap samples of β₁") ``` and the shortest coverage interval on this parameter is close to the Wald interval @@ -295,16 +276,18 @@ The densities of the variance-components, on the scale of the standard deviation #| fig-cap: "Parametric bootstrap estimates of variance components in model pnm01" #| label: fig-pnm01bssigma #| code-fold: true -#| warning: false -draw( - data(@subset(pnm01pars, :type == "σ")) * - mapping( - :value => "Bootstrap samples of σ"; - color=(:group => "Group"), - ) * - AlgebraOfGraphics.density(); - figure=(; size=(600, 340)), -) +ggplot(stack(tbl, [:σ, :σ1, :σ2]), aes(; x=:value, color=:variable)) + + geom_density() + + labs(; x="Bootstrap samples of σ") +# draw( +# data(@subset(pnm01pars, :type == "σ")) * +# mapping( +# :value => "Bootstrap samples of σ"; +# color=(:group => "Group"), +# ) * +# AlgebraOfGraphics.density(); +# figure=(; size=(600, 340)), +# ) ``` The lack of precision in the estimate of $\sigma_2$, the standard deviation of the random effects for `sample`, is a consequence of only having 6 distinct levels of the `sample` factor. 
@@ -459,6 +442,7 @@ Furthermore, kernel density estimates from a parametric bootstrap sample of the ```{julia} Random.seed!(4567654) psm01samp = parametricbootstrap(10_000, psm01; progress) +tbl = DataFrame(psm01samp.tbl) psm01pars = DataFrame(psm01samp.allpars); ``` @@ -466,16 +450,18 @@ psm01pars = DataFrame(psm01samp.allpars); #| fig-cap: "Kernel density plots of bootstrap estimates of σ for model psm01" #| label: fig-psm01bssampdens #| code-fold: true -#| warning: false -draw( - data(@subset(psm01pars, :type == "σ")) * - mapping( - :value => "Bootstrap samples of σ"; - color=(:group => "Group"), - ) * - AlgebraOfGraphics.density(); - figure=(; size=(600, 340)), -) +ggplot(stack(tbl, [:σ, :σ1, :σ2]), aes(; x=:value, color=:variable)) + + geom_density() + + labs(; x="Bootstrap samples of σ") +# draw( +# data(@subset(psm01pars, :type == "σ")) * +# mapping( +# :value => "Bootstrap samples of σ"; +# color=(:group => "Group"), +# ) * +# AlgebraOfGraphics.density(); +# figure=(; size=(600, 340)), +# ) ``` Because there are several indications that $\sigma_2$ could reasonably be zero, resulting in a simpler model incorporating random effects for only `sample`, we perform a statistical test of this hypothesis. @@ -587,12 +573,9 @@ Although the response, `y`, is on a scale of 1 to 5, #| fig-cap: "Histogram of instructor ratings in the *insteval* data" #| label: fig-instevalhist #| warning: false -draw( - data(Table(response=1:5, count=counts(insteval.y))) * - mapping(:response => "Rating", :count => "Count") * - visual(BarPlot); - figure=(; size=(600, 340)), -) +ggplot(DataFrame(insteval), aes(x=:y)) + +geom_bar() + +labs(; x="Rating") ``` it is sufficiently diffuse to warrant treating it as if it were a continuous response.
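+
+The counts behind the bar plot in @fig-instevalhist can be tabulated directly (a sketch; `counts` is from StatsBase, which is loaded in this chapter, and was used for this purpose in the previous version of the figure's code):
+
+```julia
+counts(insteval.y)   # counts of the ratings 1 through 5
+```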