additive.tex (2 additions, 2 deletions)
@@ -154,7 +154,7 @@ \subsection{Weighting different orders of interaction}
 On different datasets, the dominant order of interaction estimated by the additive model varies widely.
 In some cases, the variance is concentrated almost entirely onto a single order of interaction.
 This may be a side-effect of using the same lengthscales for all orders of interaction; lengthscales appropriate for low-dimensional regression might not be appropriate for high-dimensional regression.
-A re-scaling of lengthscales which preserves relative average distances between datapoints might be expected to improve the model.
+%A re-scaling of lengthscales which enforces similar distances between datapoints might improve the model.
 %An additive \gp{} with all of its variance coming from the 1st order is equivalent to a sum of one-dimensional functions.
 %An additive \gp{} with all its variance coming from the $D$th order is equivalent to a \gp{} with an \seard{} kernel.
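The additive model discussed in this hunk assigns a separate variance to each order of interaction, while sharing one lengthscale per input dimension across all orders. As a point of reference, here is a minimal numpy sketch of such an additive kernel, computed with the Newton-Girard recursion over elementary symmetric polynomials; the function and variable names are illustrative, not taken from the thesis code:

```python
import numpy as np

def se_kernel_1d(x, xp, lengthscale):
    """One-dimensional squared-exponential base kernel."""
    return np.exp(-0.5 * ((x - xp) / lengthscale) ** 2)

def additive_kernel(x, xp, lengthscales, order_variances):
    """Additive kernel: a weighted sum over orders of interaction, with
    order_variances[d-1] weighting the d-th order.  The orders are computed
    with the Newton-Girard recursion over elementary symmetric polynomials
    of the D one-dimensional base kernels."""
    D = len(x)
    z = np.array([se_kernel_1d(x[d], xp[d], lengthscales[d]) for d in range(D)])
    p = np.array([np.sum(z ** k) for k in range(D + 1)])  # power sums of z; p[0] is unused
    e = np.zeros(D + 1)  # elementary symmetric polynomials e[0..D]
    e[0] = 1.0
    for n in range(1, D + 1):
        e[n] = sum((-1) ** (k - 1) * e[n - k] * p[k] for k in range(1, n + 1)) / n
    return float(np.dot(order_variances, e[1:]))
```

Putting all of `order_variances` on the first order recovers a sum of one-dimensional functions, and putting it all on the $D$th order recovers a product of the base kernels, matching the two commented-out sentences at the end of the hunk.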
intro.tex (7 additions, 4 deletions)
@@ -23,9 +23,12 @@ \chapter{Introduction}
 First, keeping all hypotheses that match the data helps to guard against over-fitting.
 Second, comparing how well a dataset is fit by different models gives a way of finding which sorts of structure are present in that data. % models having different types of structure is a way to find which sorts of structure are present in a dataset.
 
-%This thesis will be concerned with finding structure in functions.
+This thesis focuses on constructing models of functions.
 %The types of structure examined in this thesis
-One can construct models of functions that have many different types of structure, such as additivity, symmetry, periodicity, changepoints, or combinations of these, using Gaussian processes (\gp{}s).
+Chapter \ref{ch:kernels} describes how to model functions having many different types of structure, such as additivity, symmetry, periodicity, changepoints, or combinations of these, using Gaussian processes (\gp{}s).
+Chapters \ref{ch:grammar} and \ref{ch:description} show how such models can be automatically constructed from data, and then automatically described.
+Later chapters explore several extensions of these models.
+%will describe how to model functions having many different types of structure, such as additivity, symmetry, periodicity, changepoints, or combinations of these, using Gaussian processes (\gp{}s).
 %To be able to learn a wide variety of structures, we would like to have an expressive language of models of functions.
 %We would like to be able to represent simple kinds of functions, such as linear functions or polynomials.
 %We would also like to have models of arbitrarily complex functions, specified in terms of high-level properties such as how smooth they are, whether they repeat over time, or which symmetries they have.
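The structures named in the added outline (additivity, symmetry, periodicity, changepoints, and combinations of these) arise from composing kernels by addition and multiplication. A small illustrative sketch in numpy, assuming squared-exponential and periodic base kernels with made-up parameter values:

```python
import numpy as np

def se(x, xp, ell=1.0):
    """Squared-exponential kernel: smooth functions."""
    return np.exp(-0.5 * ((x - xp) / ell) ** 2)

def periodic(x, xp, period=1.0, ell=1.0):
    """Periodic kernel: functions repeating with the given period."""
    return np.exp(-2.0 * np.sin(np.pi * np.abs(x - xp) / period) ** 2 / ell ** 2)

# Composite structures come from sums and products of simpler kernels:
def periodic_plus_trend(x, xp):
    return periodic(x, xp) + se(x, xp, ell=10.0)   # repeating pattern plus smooth trend

def locally_periodic(x, xp):
    return periodic(x, xp) * se(x, xp, ell=5.0)    # periodicity whose shape drifts slowly
```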
@@ -34,7 +37,7 @@ \chapter{Introduction}
 %All of these models of function can be constructed using Gaussian processes (\gp{}s).%, a tractable class of models of functions.
 %
 %This chapter will introduce the basic properties of \gp{}s.
-Chapter \ref{ch:kernels} will describe how to model these different types of structure using \gp{}s.
+%Chapter \ref{ch:kernels} will describe how to model these different types of structure using \gp{}s.
 This short chapter introduces the basic properties of \gp{}s, and provides an outline of the thesis.
 %Chapter \ref{ch:grammar} will show how searching over many
 
@@ -96,7 +99,7 @@ \section{Gaussian process models}
 \subsection{Model selection}
 
 The crucial property of \gp{}s that allows us to automatically construct models is that we can compute the \emph{marginal likelihood} of a dataset given a particular model, also known as the \emph{evidence} \citep{mackay1992bayesian}.
-The marginal likelihood allows one to compare models, automatically balancing between the capacity of a model and its fit to the data~\citep{rasmussen2001occam,mackay2003information}.
+The marginal likelihood allows one to compare models, balancing between the capacity of a model and its fit to the data~\citep{rasmussen2001occam,mackay2003information}.
 %discover the appropriate amount of detail to use, due to Bayesian Occam's razor
 %
 %Choosing a kernel, or kernel parameters, by maximizing the marginal likelihood will typically result in selecting the \emph{least} flexible model which still captures all the structure in the data.
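For concreteness, the marginal likelihood referred to in this hunk has a closed form for \gp{} regression with Gaussian noise. A minimal sketch, assuming a precomputed kernel matrix `K` over the training inputs (names here are illustrative):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def log_marginal_likelihood(K, y, noise_var=1e-2):
    """log p(y | X) = -1/2 y^T (K + s^2 I)^{-1} y
                      - 1/2 log|K + s^2 I| - (n/2) log(2 pi).
    The data-fit term rewards models that explain y; the log-determinant
    term penalizes flexible models, giving the capacity/fit trade-off
    the text describes."""
    n = len(y)
    Ky = K + noise_var * np.eye(n)
    L, lower = cho_factor(Ky, lower=True)   # Cholesky factor of K + s^2 I
    alpha = cho_solve((L, lower), y)        # (K + s^2 I)^{-1} y
    log_det = 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * y @ alpha - 0.5 * log_det - 0.5 * n * np.log(2.0 * np.pi)
```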
kernels.tex (4 additions, 4 deletions)
@@ -429,7 +429,7 @@ \subsection{Example: An additive model of concrete strength}
 
 \Cref{fig:interpretablefunctions} shows the marginal posterior distribution of each of the eight one-dimensional functions in the model. %\cref{eq:concrete}.
 The parameters controlling the variance of two of the functions, $f_6(\textnormal{coarse})$ and $f_7(\textnormal{fine})$, were set to zero, meaning that the marginal likelihood preferred a parsimonious model which did not depend on these inputs.
-This is an example of the automatic sparsity that arises by maximizing marginal likelihood in \gp{} models, and another example of automatic relevance determination (\ARD) \citep{neal1995bayesian}.
+This is an example of the automatic sparsity that arises by maximizing marginal likelihood in \gp{} models, and is another example of automatic relevance determination (\ARD) \citep{neal1995bayesian}.
 
 The ability to learn kernel parameters in this way is much more difficult when using non-probabilistic methods such as Support Vector Machines \citep{cortes1995support}, for which cross-validation is often the best method to select kernel parameters.
 
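The automatic sparsity described in this hunk can be reproduced by maximizing the marginal likelihood over a per-input signal variance. A self-contained sketch under the same Gaussian-noise assumption, using a first-order additive kernel; all names are illustrative, not the thesis code:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve
from scipy.optimize import minimize

def additive_ard_kernel(X1, X2, lengthscales, signal_vars):
    """First-order additive SE kernel with one signal variance per input."""
    K = np.zeros((X1.shape[0], X2.shape[0]))
    for d in range(X1.shape[1]):
        sqdist = (X1[:, d][:, None] - X2[:, d][None, :]) ** 2
        K += signal_vars[d] * np.exp(-0.5 * sqdist / lengthscales[d] ** 2)
    return K

def fit_ard_variances(X, y, lengthscales, noise_var=1e-2):
    """Fit per-input signal variances by maximizing the marginal likelihood.
    Variances of irrelevant inputs are typically driven toward zero,
    pruning those inputs from the model (the ARD effect described above)."""
    n = X.shape[0]
    def neg_lml(log_vars):
        K = additive_ard_kernel(X, X, lengthscales, np.exp(log_vars)) + noise_var * np.eye(n)
        L, low = cho_factor(K, lower=True)
        alpha = cho_solve((L, low), y)
        return 0.5 * y @ alpha + np.sum(np.log(np.diag(L))) + 0.5 * n * np.log(2.0 * np.pi)
    result = minimize(neg_lml, x0=np.zeros(X.shape[1]))
    return np.exp(result.x)   # fitted per-input signal variances
```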
@@ -535,11 +535,11 @@ \subsubsection{Posterior covariance of additive components}
 \end{tabular}
 }
 \caption[Visualizing posterior correlations between components]
-{Posterior correlations between the heights of the one-dimensional functions in \cref{eq:concrete}, whose sum models concrete strength.
+{Posterior correlations between the heights of different one-dimensional functions in \cref{eq:concrete}, whose sum models concrete strength.
 %Each plot shows the posterior correlations between the height of two functions, evaluated across the range of the data upon which they depend.
 %Color indicates the amount of correlation between the function value of the two components.
 Red indicates high correlation, teal indicates no correlation, and blue indicates negative correlation.
-Plots on the diagonal show posterior correlations between different values of the same function.
+Plots on the diagonal show posterior correlations between different evaluations of the same function.
 Correlations are evaluated over the same input ranges as in \cref{fig:interpretablefunctions}.
 %Off-diagonal plots show posterior covariance between each pair of functions, as a function of both inputs.
 %Negative correlation means that one function is high and the other low, but which one is uncertain.
@@ -550,7 +550,7 @@ \subsubsection{Posterior covariance of additive components}
 %
 For example, \cref{fig:interpretableinteractions} shows the posterior correlation between all non-zero components of the concrete model.
 This figure shows that most of the correlation occurs within components, but there is also negative correlation between the height of $f_1(\textnormal{cement})$ and $f_2(\textnormal{slag})$.
-This reflects an ambiguity in the model about which one of these functions is high and the other low.
+%This reflects an ambiguity in the model about which one of these functions is high and the other low.
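The posterior correlations discussed in these hunks come from the closed-form cross-covariance between additive components. For a model $f = \sum_d f_d$ observed with Gaussian noise, a sketch of the computation; matrix names are mine, and this follows the standard Gaussian conditioning formula rather than any specific code from the thesis:

```python
import numpy as np

def component_cross_cov(Ki_star_X, Kj_X_star, Ki_star_star, K_noisy, same_component):
    """Posterior covariance between additive components f_i and f_j at test
    points, for f = sum_d f_d observed with Gaussian noise:
        Cov(f_i*, f_j*) = [i == j] K_i(X*, X*) - K_i(X*, X) K^{-1} K_j(X, X*),
    where K = sum_d K_d(X, X) + noise * I over the training inputs."""
    prior = Ki_star_star if same_component else np.zeros_like(Ki_star_star)
    return prior - Ki_star_X @ np.linalg.solve(K_noisy, Kj_X_star)
```

Since the components have independent priors, the off-diagonal ($i \neq j$) blocks have no prior term, so any structure there, such as the negative correlation between $f_1(\textnormal{cement})$ and $f_2(\textnormal{slag})$, is induced purely by conditioning on the data.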