
Commit 0661929

Sohl-Dickstein authored and Sam Schoenholz committed
Update docs now that improved standard parameterization note is on arXiv.
PiperOrigin-RevId: 291473673
1 parent 7c50657 commit 0661929

File tree

2 files changed (+14, -7 lines)


README.md

Lines changed: 8 additions & 3 deletions
@@ -245,7 +245,7 @@ import neural_tangents as nt # 64-bit precision enabled
We remark the following differences between our library and the JAX one.

* All `nt.stax` layers are instantiated with a function call, i.e. `nt.stax.Relu()` vs `jax.experimental.stax.Relu`.
- * All layers with trainable parameters use the _NTK parameterization_ by default (see [[10]](#5-neural-tangent-kernel-convergence-and-generalization-in-neural-networks-neurips-2018-arthur-jacot-franck-gabriel-clément-hongler), Remark 1). However, Dense and Conv layers also support the _standard parameterization_ via a `parameterization` keyword argument. <!-- TODO(jaschasd) add link to note deriving NTK for standard parameterization -->
+ * All layers with trainable parameters use the _NTK parameterization_ by default (see [[10]](#5-neural-tangent-kernel-convergence-and-generalization-in-neural-networks-neurips-2018-arthur-jacot-franck-gabriel-clément-hongler), Remark 1). However, Dense and Conv layers also support the _standard parameterization_ via a `parameterization` keyword argument (see [[15]](#15-on-the-infinite-width-limit-of-neural-networks-with-a-standard-parameterization)).
* `nt.stax` and `jax.experimental.stax` may have different layers and options available (for example `nt.stax` layers support `CIRCULAR` padding, but only `NHWC` data format).

### Python 2 is not supported
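As a quick illustration of the `parameterization` keyword argument referenced in the hunk above, here is a minimal sketch of requesting the standard parameterization when building an `nt.stax` network. The `serial` combinator, the returned `(init_fn, apply_fn, kernel_fn)` triple, and the `W_std`/`b_std` values are assumptions following the usual `stax` convention, not something fixed by this commit.

```python
# Hedged sketch: passing parameterization='standard' to Dense layers, per the
# README bullet above. serial and the returned triple are assumed to follow the
# usual stax convention; W_std/b_std values are illustrative only.
import neural_tangents as nt

init_fn, apply_fn, kernel_fn = nt.stax.serial(
    nt.stax.Dense(512, W_std=1.5, b_std=0.05, parameterization='standard'),
    nt.stax.Relu(),
    nt.stax.Dense(1, W_std=1.5, b_std=0.05, parameterization='standard'),
)
```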
@@ -358,10 +358,10 @@ a small dataset using a small learning rate.
Neural Tangents has been used in the following papers:

- * [Disentangling Trainability and Generalization in Deep Learning](https://arxiv.org/abs/1912.13053) \
+ * [Disentangling Trainability and Generalization in Deep Learning.](https://arxiv.org/abs/1912.13053) \
Lechao Xiao, Jeffrey Pennington, Samuel S. Schoenholz

- * [Information in Infinite Ensembles of Infinitely-Wide Neural Networks](https://arxiv.org/abs/1911.09189) \
+ * [Information in Infinite Ensembles of Infinitely-Wide Neural Networks.](https://arxiv.org/abs/1911.09189) \
Ravid Shwartz-Ziv, Alexander A. Alemi

* [Training Dynamics of Deep Networks using Stochastic Gradient Descent via Neural Tangent Kernel.](https://arxiv.org/abs/1905.13654) \
@@ -372,6 +372,9 @@ Descent.](https://arxiv.org/abs/1902.06720) \
Jaehoon Lee*, Lechao Xiao*, Samuel S. Schoenholz, Yasaman Bahri, Roman Novak, Jascha
Sohl-Dickstein, Jeffrey Pennington

+ * [On the Infinite Width Limit of Neural Networks with a Standard Parameterization.](https://arxiv.org/pdf/2001.07301.pdf) \
+ Jascha Sohl-Dickstein, Roman Novak, Samuel S. Schoenholz, Jaehoon Lee
+
Please let us know if you make use of the code in a publication and we'll add it
to the list!
@@ -423,3 +426,5 @@ If you use the code in a publication, please cite the repo using the .bib,
##### [13] [Mean Field Residual Networks: On the Edge of Chaos.](https://arxiv.org/abs/1712.08969) *NeurIPS 2017.* Greg Yang, Samuel S. Schoenholz

##### [14] [Wide Residual Networks.](https://arxiv.org/abs/1605.07146) *BMVC 2018.* Sergey Zagoruyko, Nikos Komodakis
+
+ ##### [15] [On the Infinite Width Limit of Neural Networks with a Standard Parameterization.](https://arxiv.org/pdf/2001.07301.pdf) *arXiv 2020.* Jascha Sohl-Dickstein, Roman Novak, Samuel S. Schoenholz, Jaehoon Lee

neural_tangents/stax.py

Lines changed: 6 additions & 4 deletions
@@ -25,8 +25,9 @@
similarly to `init_fn` and `apply_fn`.

2) In layers with random weights, NTK parameterization is used by default
- (https://arxiv.org/abs/1806.07572, page 3). Standard parameterization can
- be specified for Conv and Dense layers by a keyword argument.
+ (https://arxiv.org/abs/1806.07572, page 3). Standard parameterization
+ (https://arxiv.org/abs/2001.07301) can be specified for Conv and Dense layers
+ by a keyword argument.

3) Some functionality may be missing (e.g. `BatchNorm`), and some may be present
only in our library (e.g. `CIRCULAR` padding, `LayerNorm`, `GlobalAvgPool`,
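To make the two parameterizations named in this docstring concrete, the finite-width layer equations spelled out in the `Dense` docstring below can be written, for input width n, as:

```latex
% NTK parameterization: unit-variance parameters, width scaling kept explicit.
z_i = \frac{W_{\mathrm{std}}}{\sqrt{n}} \sum_{j=1}^{n} W_{ij} x_j + b_{\mathrm{std}}\, b_i,
\qquad W_{ij} \sim \mathcal{N}(0, 1), \quad b_i \sim \mathcal{N}(0, 1).

% Standard parameterization: the same scaling absorbed into the parameter variances.
z_i = \sum_{j=1}^{n} W_{ij} x_j + b_i,
\qquad W_{ij} \sim \mathcal{N}\bigl(0, W_{\mathrm{std}}^{2}/n\bigr), \quad b_i \sim \mathcal{N}\bigl(0, b_{\mathrm{std}}^{2}\bigr).
```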
@@ -951,8 +952,9 @@ def Dense(out_dim,
Under ntk parameterization (https://arxiv.org/abs/1806.07572, page 3),
weights and biases are initialized as W_ij ~ N(0,1), b_i ~ N(0,1), and
the finite width layer equation is z_i = W_std / sqrt([width]) sum_j
- W_ij x_j + b_std b_i Under standard parameterization, weights and biases
- are initialized as W_ij ~ N(0,W_std^2/[width]), b_i ~ N(0,b_std^2), and
+ W_ij x_j + b_std b_i Under standard parameterization
+ (https://arxiv.org/abs/2001.07301), weights and biases are initialized
+ as W_ij ~ N(0,W_std^2/[width]), b_i ~ N(0,b_std^2), and
the finite width layer equation is z_i = \sum_j W_ij x_j + b_i.