additive.tex: 2 additions & 2 deletions
@@ -153,7 +153,7 @@ \subsection{Weighting different orders of interaction}
On different datasets, the dominant order of interaction estimated by the additive model varies widely.
In some cases, the variance is concentrated almost entirely onto a single order of interaction.
-This may may be a side-effect of using the same lengthscales for all orders of interaction.; lengthscales appropriate for low-dimensional regression might not be appropriate for high-dimensional regression.
+This may be a side-effect of using the same lengthscales for all orders of interaction; lengthscales appropriate for low-dimensional regression might not be appropriate for high-dimensional regression.
A re-scaling of lengthscales which preserves relative average distances between datapoints might be expected to improve the model.
%An additive \gp{} with all of its variance coming from the 1st order is equivalent to a sum of one-dimensional functions.
%An additive \gp{} with all its variance coming from the $D$th order is equivalent to a \gp{} with an \seard{} kernel.
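The per-order additive kernels whose variances are weighted in this subsection can be computed efficiently with the Newton-Girard recursion over elementary symmetric polynomials of the one-dimensional base kernels. A minimal NumPy sketch (illustrative only; the function and variable names are my own, not the thesis code):

```python
import numpy as np

def additive_orders(base_kernels):
    """Combine D one-dimensional kernel matrices k_j into the additive kernel
    of each interaction order 1..D, using the Newton-Girard recursion over
    elementary symmetric polynomials e_d and power sums p_k."""
    D = len(base_kernels)
    # Power sums p_k = sum_j k_j ** k (elementwise powers of base kernel matrices).
    p = {k: sum(kj ** k for kj in base_kernels) for k in range(1, D + 1)}
    e = [np.ones_like(base_kernels[0])]          # e_0 = 1
    for d in range(1, D + 1):
        e.append(sum((-1) ** (k - 1) * e[d - k] * p[k]
                     for k in range(1, d + 1)) / d)
    return e[1:]                                  # orders 1..D

# Sanity check with D = 2: order 1 should be k1 + k2, order 2 should be k1 * k2.
x = np.linspace(0, 1, 5)
sq = (x[:, None] - x[None, :]) ** 2
k1, k2 = np.exp(-0.5 * sq / 0.5 ** 2), np.exp(-0.5 * sq / 1.5 ** 2)
orders = additive_orders([k1, k2])
print(np.allclose(orders[0], k1 + k2), np.allclose(orders[1], k1 * k2))
```

For D = 2 the recursion reduces to the familiar identities e1 = k1 + k2 and e2 = k1 * k2, which the final check confirms.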
@@ -219,7 +219,7 @@ \subsubsection{Evaluation of derivatives}
\label{eq:additive-derivatives}
\end{align}
%
-\Cref{eq:additive-derivatives} gives the terms that $k_j$ is multiplied by in the original polynomial, which are the terms required by the chain rule.
+\Cref{eq:additive-derivatives} gives all terms that $k_j$ is multiplied by in the original polynomial, which are exactly the terms required by the chain rule.
These derivatives allow gradient-based optimization of the base kernel parameters with respect to the marginal likelihood.
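The gradient-based optimization mentioned in that last line can be sketched in a few lines of NumPy. This toy uses a single one-dimensional SE kernel rather than the additive kernel, and all names and values are hypothetical; it checks the standard analytic gradient of the log marginal likelihood against a finite difference:

```python
import numpy as np

# Toy 1-D dataset (hypothetical, for illustration only).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=20)
y = np.sin(X) + 0.1 * rng.standard_normal(20)

def se_kernel(X, ell):
    """SE kernel matrix and its derivative w.r.t. the log-lengthscale."""
    d2 = (X[:, None] - X[None, :]) ** 2
    K = np.exp(-0.5 * d2 / ell ** 2)
    return K, K * d2 / ell ** 2          # dK / d(log ell)

def log_marginal_and_grad(log_ell, noise_var=0.01):
    ell = np.exp(log_ell)
    K, dK = se_kernel(X, ell)
    Ky = K + noise_var * np.eye(len(X))
    alpha = np.linalg.solve(Ky, y)
    _, logdet = np.linalg.slogdet(Ky)
    lml = -0.5 * y @ alpha - 0.5 * logdet - 0.5 * len(X) * np.log(2 * np.pi)
    # Standard GP identity: d lml/d theta = 0.5 tr((alpha alpha^T - Ky^-1) dK/d theta)
    grad = 0.5 * np.trace((np.outer(alpha, alpha) - np.linalg.inv(Ky)) @ dK)
    return lml, grad

# Check the analytic gradient against a central finite difference.
_, g = log_marginal_and_grad(0.0)
eps = 1e-6
g_fd = (log_marginal_and_grad(eps)[0] - log_marginal_and_grad(-eps)[0]) / (2 * eps)
print(abs(g - g_fd) < 1e-5)  # True
```

Any gradient-based optimizer (e.g. L-BFGS on the negative of `lml`) can then tune the lengthscale.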
grammar.tex: 1 addition & 1 deletion
@@ -676,7 +676,7 @@ \subsection{Structure recovery on synthetic data}
\Cref{tbl:synthetic} shows the results.
-For the highest signal-to-noise ratio, \procedurename{} usually recoveres the correct structure.
+For the highest signal-to-noise ratio, \procedurename{} usually recovers the correct structure.
The reported additional linear structure in the last row can be explained by the fact that functions sampled from \kSE{} kernels with long length-scales occasionally have near-linear trends.
As the noise increases, our method generally backs off to simpler structures rather than reporting spurious structure.
intro.tex: 2 additions & 2 deletions
@@ -25,7 +25,7 @@ \chapter{Introduction}
%This thesis will be concerned with finding structure in functions.
%The types of structure examined in this thesis
-One can construct models of functions having many different types of structure, such as additivity, symmetry, periodicity, changepoints, or combinations of these, using Gaussian processes (\gp{}s).
+One can construct models of functions that have many different types of structure, such as additivity, symmetry, periodicity, changepoints, or combinations of these, using Gaussian processes (\gp{}s).
%To be able to learn a wide variety of structures, we would like to have an expressive language of models of functions.
%We would like to be able to represent simple kinds of functions, such as linear functions or polynomials.
%We would also like to have models of arbitrarily complex functions, specified in terms of high-level properties such as how smooth they are, whether they repeat over time, or which symmetries they have.
@@ -35,7 +35,7 @@ \chapter{Introduction}
%
%This chapter will introduce the basic properties of \gp{}s.
Chapter \ref{ch:kernels} will describe how to model these different types of structure using \gp{}s.
-This short chapter will introduce the basic properties of \gp{}s, and provide an outline of the thesis.
+This short chapter introduces the basic properties of \gp{}s, and provides an outline of the thesis.
%Chapter \ref{ch:grammar} will show how searching over many
kernels.tex: 10 additions & 10 deletions
@@ -146,15 +146,15 @@ \subsection{Combining properties through multiplication}
Here, we discuss a few examples:
\begin{itemize}
-\item {\bf Locally Periodic Functions.}
-In univariate data, multiplying a kernel by \kSE{} gives a way of converting global structure to local structure.
-For example, $\Per$ corresponds to exactly periodic structure, whereas $\Per\kerntimes\SE$ corresponds to locally periodic structure, as shown in the second column of \cref{fig:kernels_times}.
-
\item {\bf Polynomial Regression.}
By multiplying together $T$ linear kernels, we obtain a prior on polynomials of degree $T$.
%This class of functions also has a simple parametric form.
The first column of \cref{fig:kernels_times} shows a quadratic kernel.
+\item {\bf Locally Periodic Functions.}
+In univariate data, multiplying a kernel by \kSE{} gives a way of converting global structure to local structure.
+For example, $\Per$ corresponds to exactly periodic structure, whereas $\Per\kerntimes\SE$ corresponds to locally periodic structure, as shown in the second column of \cref{fig:kernels_times}.
+
\item {\bf Functions with Growing Amplitude.}
Multiplying by a linear kernel means that the marginal standard deviation of the function being modeled grows linearly away from the location given by kernel parameter $c$.
The third and fourth columns of \cref{fig:kernels_times} show two examples.
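The Per-versus-Per-times-SE contrast in the reordered bullet above can be checked numerically; a small sketch with hypothetical parameter values:

```python
import numpy as np

def se(x1, x2, ell=1.0):
    """Squared-exponential kernel: correlation decays with distance."""
    return np.exp(-0.5 * (x1 - x2) ** 2 / ell ** 2)

def per(x1, x2, period=1.0, ell=1.0):
    """Periodic kernel: correlation depends only on phase within the period."""
    return np.exp(-2 * np.sin(np.pi * np.abs(x1 - x2) / period) ** 2 / ell ** 2)

# Two points exactly five periods apart:
exact = per(0.0, 5.0)                  # Per alone: still perfectly correlated
local = per(0.0, 5.0) * se(0.0, 5.0)   # Per x SE: correlation has decayed
print(exact, local)
```

Per alone reports correlation 1 between points any whole number of periods apart, however distant, while the product decays towards zero over a few lengthscales, matching the global-to-local conversion described in the bullet.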
@@ -381,7 +381,7 @@ \subsection{Example: An additive model of concrete strength}
To illustrate how additive kernels give rise to interpretable models, we built an additive model of the strength of concrete as a function of the amount of seven different ingredients (cement, slag, fly ash, water, plasticizer, coarse aggregate and fine aggregate), and the age of the concrete \citep{yeh1998modeling}.
%We model measurements of the compressive strength of concrete, as a function of the concentration of 7 ingredients, plus the age of the concrete.
-Our simple model is a sum of 8 different one-dimensional functions, each depending on one of these variables:
+Our simple model is a sum of 8 different one-dimensional functions, each depending on only one of these quantities:
%
\begin{align}
f(\vx) & =
@@ -521,12 +521,12 @@ \subsubsection{Posterior covariance of additive components}
@@ -539,7 +539,7 @@ \subsubsection{Posterior covariance of additive components}
%Each plot shows the posterior correlations between the height of two functions, evaluated across the range of the data upon which they depend.
%Color indicates the amount of correlation between the function value of the two components.
Red indicates high correlation, teal indicates no correlation, and blue indicates negative correlation.
-Plots on the diagonal show posterior correlations within each function.
+Plots on the diagonal show posterior correlations between different values of the same function.
Correlations are evaluated over the same input ranges as in \cref{fig:interpretablefunctions}.
%Off-diagonal plots show posterior covariance between each pair of functions, as a function of both inputs.
%Negative correlation means that one function is high and the other low, but which one is uncertain.
@@ -549,7 +549,7 @@ \subsubsection{Posterior covariance of additive components}
\end{figure}
%
For example, \cref{fig:interpretableinteractions} shows the posterior correlation between all non-zero components of the concrete model.
-This figure shows that most of the correlation occurs within components, but there is also negative correlation between the ``cement'' and ``slag'' variables.
+This figure shows that most of the correlation occurs within components, but there is also negative correlation between the height of $f_1(\textnormal{cement})$ and $f_2(\textnormal{slag})$.
This reflects an ambiguity in the model about which one of these functions is high and the other low.
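This ambiguity already appears in the smallest possible additive model; a toy calculation (hypothetical numbers, unrelated to the concrete data):

```python
import numpy as np

# An additive GP f = f1 + f2 observed at a single point. The datum pins down
# the sum f1 + f2 but not the split, so the posterior over the two components
# is strongly negatively correlated.
k1, k2, noise = 1.0, 1.0, 0.01   # prior variances and noise variance (made up)
ky = k1 + k2 + noise             # Var(y) under the prior

# Gaussian conditioning on y (prior components independent):
cross = -k1 * k2 / ky            # Cov(f1, f2 | y)
var1 = k1 - k1 ** 2 / ky         # Var(f1 | y)
var2 = k2 - k2 ** 2 / ky
corr = cross / np.sqrt(var1 * var2)
print(round(corr, 4))  # -0.9901
```

The near-minus-one posterior correlation is exactly the "one function high, the other low" trade-off described in the text.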
0 commit comments