
Commit fee90b5

Author: Yancey
Add async sgd document (#8474)
* add async sgd document
* fix ci
* update by comment
* update doc
1 parent e84615b commit fee90b5

File tree: 3 files changed, 69 additions (+) and 18 deletions (-)


doc/howto/cluster/cmd_argument_cn.md

Lines changed: 32 additions & 7 deletions
@@ -1,14 +1,17 @@
-## Startup arguments
+# Startup arguments

Below we use the code in `doc/howto/cluster/src/word2vec` as an example to introduce distributed training with the PaddlePaddle v2 API.

-### Starting the parameter server
+## Starting the parameter server
+
Run the following command to start a parameter server, which will wait to exchange data with the trainer nodes:
+
```bash
$ paddle pserver --port=7164 --ports_num=1 --ports_num_for_sparse=1 --num_gradient_servers=1
```

If you want to run the pserver program in the background and save its output to a log file, run:
+
```bash
$ stdbuf -oL /usr/bin/nohup paddle pserver --port=7164 --ports_num=1 --ports_num_for_sparse=1 --num_gradient_servers=1 &> pserver.log
```
@@ -20,8 +23,10 @@ $ stdbuf -oL /usr/bin/nohup paddle pserver --port=7164 --ports_num=1 --ports_num
- ports_num_for_sparse: **required, default 0**, number of ports used for sparse parameter communication
- num_gradient_servers: **required, default 1**, total number of pservers in the current training job

-### Starting the trainer
+## Starting the trainer
+
Run the following command to start the trainer program, written in Python (the file can have any name, e.g. train.py):
+
```bash
$ python train.py
```
@@ -67,7 +72,7 @@ paddle.init(
- pservers: **required, default 127.0.0.1**, list of IPs of the pservers started for the current training job; separate multiple IPs with ","


-### Preparing the dataset
+## Preparing the dataset

Refer to the sample data-preparation script [prepare.py](https://github.com/PaddlePaddle/Paddle/tree/develop/doc/howto/usage/cluster/src/word2vec/prepare.py) to prepare the training and validation datasets. We use the paddle.dataset.imikolov dataset and, according to the distributed-training parallelism (the number of trainer nodes), set `SPLIT_COUNT` at the beginning of `prepare.py` to split the data into multiple parts.

@@ -84,7 +89,8 @@ for f in flist:
```

The example program `prepare.py` splits the training set and the test set into multiple files (3 in this example, with suffixes `-00000`, `-00001` and `-00002`):
-```
+
+```bash
train.txt
train.txt-00000
train.txt-00001
@@ -99,12 +105,13 @@ test.txt-00002

The training-data format and the training program's `reader()` can differ a lot between training jobs, so developers need to split the training data and write the `reader()` according to the actual scenario of their own training job.

-### Preparing the training program
+## Preparing the training program

For every training job we create a workspace on each node, containing the user's training program, its dependencies, and the mounted or downloaded training-data shards.

In the end, the workspace should look like this:
-```
+
+```bash
.
|-- my_lib.py
|-- word_dict.pickle
|-- word_dict.pickle
@@ -133,3 +140,21 @@ test.txt-00002

- `train_data_dir`: directory containing the training data; it can be mounted from distributed storage or downloaded locally before the job starts.
- `test_data_dir`: directory containing the test dataset.
+
+## Async SGD updates
+
+We can make the optimizer support async SGD updates by setting its parameters.
+For example, set the `is_async` and `async_lagged_grad_discard_ratio` parameters of the `AdaGrad` optimizer:
+
+```python
+adagrad = paddle.optimizer.AdaGrad(
+    is_async=True,
+    async_lagged_grad_discard_ratio=1.6,
+    learning_rate=3e-3,
+    regularization=paddle.optimizer.L2Regularization(8e-4))
+```
+
+- `is_async`: whether to use the async SGD update mode.
+- `async_lagged_grad_discard_ratio`: controls gradient commits for async SGD updates; after enough gradients
+  (`async_lagged_grad_discard_ratio * num_gradient_servers`) have been received, later lagged gradients are discarded.
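For context on the trainer side of these startup arguments, the doc's `paddle.init(` block (partially visible in the hunk above) passes the same settings programmatically. Below is a minimal sketch, assuming a single pserver on localhost; `use_gpu` and `trainer_count` are standard `paddle.init` arguments not shown in this diff, and all values are illustrative rather than part of this commit.

```python
# Sketch of a trainer-side init call; values are illustrative, and the
# keyword names mirror the command-line arguments documented above.
import paddle.v2 as paddle

paddle.init(
    use_gpu=False,              # assumed default, not from this diff
    trainer_count=1,            # assumed default, not from this diff
    port=7164,                  # pserver listening port
    ports_num=1,                # number of listening ports
    ports_num_for_sparse=1,     # ports for sparse parameter communication
    num_gradient_servers=1,     # total number of gradient servers in the job
    trainer_id=0,               # ID of this trainer, starting from 0
    pservers="127.0.0.1")       # comma-separated pserver IP list
```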

doc/howto/cluster/cmd_argument_en.md

Lines changed: 31 additions & 10 deletions
@@ -1,18 +1,19 @@
-## Command-line arguments
+# Command-line arguments

We'll take `doc/howto/cluster/src/word2vec` as an example to introduce distributed training using the PaddlePaddle v2 API.

-### Starting parameter server
+## Starting parameter server

Type the command below to start a parameter server, which will wait for trainers to connect:

```bash
-$ paddle pserver --port=7164 --ports_num=1 --ports_num_for_sparse=1 --num_gradient_servers=1
+$ paddle pserver --port=7164 --ports_num=1 --ports_num_for_sparse=1 --num_gradient_servers=1 --nics=eth0
```

If you wish to run parameter servers in the background and save their output to a log file, you can type:
+
```bash
-$ stdbuf -oL /usr/bin/nohup paddle pserver --port=7164 --ports_num=1 --ports_num_for_sparse=1 --num_gradient_servers=1 &> pserver.log
+$ stdbuf -oL /usr/bin/nohup paddle pserver --port=7164 --ports_num=1 --ports_num_for_sparse=1 --num_gradient_servers=1 --nics=eth0 &> pserver.log &
```

Parameter Description
@@ -21,8 +22,10 @@ Parameter Description
- ports_num: **required, default 1**, total number of ports the pserver will listen on.
- ports_num_for_sparse: **required, default 0**, number of ports which serve sparse parameter updates.
- num_gradient_servers: **required, default 1**, total number of gradient servers.
+- nics: **optional, default xgbe0,xgbe1**, network device names which the parameter server will listen on.
+
+## Starting trainer

-### Starting trainer
Type the command below to start the trainer (name the file whatever you want, e.g. "train.py"):

```bash
@@ -70,7 +73,7 @@ Parameter Description
- trainer_id: **required, default 0**, ID of each trainer, starting from 0.
- pservers: **required, default 127.0.0.1**, list of IPs of parameter servers, separated by ",".

-### Prepare Training Dataset
+## Prepare Training Dataset

Here's some example code [prepare.py](https://github.com/PaddlePaddle/Paddle/tree/develop/doc/howto/usage/cluster/src/word2vec/prepare.py); it downloads the public `imikolov` dataset and splits it into multiple files according to the job parallelism (trainer count). Modify `SPLIT_COUNT` at the beginning of `prepare.py` to change the number of output files.

@@ -88,7 +91,7 @@ for f in flist:

Example code `prepare.py` will split the training data and testing data into 3 files with numeric suffixes like `-00000`, `-00001` and `-00002`:

-```
+```bash
train.txt
train.txt-00000
train.txt-00001
@@ -103,13 +106,13 @@ When job started, every trainer needs to get it's own part of data. In some dist

Different training jobs may have different data formats and `reader()` functions, so developers may need to write different data-preparation scripts and `reader()` functions for their jobs.

-### Prepare Training program
+## Prepare Training program

We'll create a *workspace* directory on each node, storing your training program, its dependencies, and the mounted or downloaded dataset directory.

-
Your workspace may look like:
-```
+
+```bash
.
|-- my_lib.py
|-- word_dict.pickle
@@ -138,3 +141,21 @@ Your workspace may looks like:

- `train_data_dir`: directory containing the training data. Mount it from a storage service or copy the training data here.
- `test_data_dir`: directory containing the testing data.
+
+## Async SGD Update
+
+We can set some parameters of the optimizer to make it support async SGD updates.
+For example, we can set the `is_async` and `async_lagged_grad_discard_ratio` parameters of the `AdaGrad` optimizer:
+
+```python
+adagrad = paddle.optimizer.AdaGrad(
+    is_async=True,
+    async_lagged_grad_discard_ratio=1.6,
+    learning_rate=3e-3,
+    regularization=paddle.optimizer.L2Regularization(8e-4))
+```
+
+- `is_async`: whether to run in Async-SGD mode or not.
+- `async_lagged_grad_discard_ratio`: controls async SGD gradient commits; once
+  `async_lagged_grad_discard_ratio * num_gradient_servers` commits have passed, the current (lagged)
+  async gradient is discarded silently. For example, with `num_gradient_servers=5` and a ratio of 1.6,
+  a gradient that lags behind by more than 1.6 * 5 = 8 commits is dropped.
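Both documents leave the sharded `reader()` to the developer. Below is a minimal sketch of one possible implementation, assuming the `train.txt-0000N` layout produced by `prepare.py` and one sample of space-separated integer word ids per line; the helper name `cluster_reader` and the file format are assumptions, not part of this commit.

```python
# Sketch: a reader() that yields only this trainer's shards, so all trainers
# together cover the whole dataset exactly once.
import glob
import os


def cluster_reader(data_dir, trainer_id, trainer_count):
    """Return a reader() over the shards assigned to this trainer."""

    def reader():
        shards = sorted(glob.glob(os.path.join(data_dir, "train.txt-*")))
        for i, shard in enumerate(shards):
            # Round-robin assignment: trainer k reads shards k, k+N, k+2N, ...
            if i % trainer_count != trainer_id:
                continue
            with open(shard) as f:
                for line in f:
                    # Assumes space-separated integer word ids per line;
                    # adapt to the actual output format of prepare.py.
                    yield [int(w) for w in line.split()]

    return reader
```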

python/paddle/trainer_config_helpers/optimizers.py

Lines changed: 6 additions & 1 deletion
@@ -361,6 +361,7 @@ def settings(batch_size,
             learning_rate_decay_b=0.,
             learning_rate_schedule='poly',
             learning_rate_args='',
+             async_lagged_grad_discard_ratio=1.5,
             learning_method=None,
             regularization=None,
             is_async=False,
@@ -396,6 +397,10 @@ def settings(batch_size,
                                        value larger than some value, will be
                                        clipped.
    :type gradient_clipping_threshold: float
+    :param async_lagged_grad_discard_ratio: async SGD gradient commit control;
+        when async_lagged_grad_discard_ratio * num_gradient_servers commits have
+        passed, the current async SGD gradient is discarded.
+    :type async_lagged_grad_discard_ratio: float
    """
    if isinstance(regularization, BaseRegularization):
        regularization = [regularization]
@@ -409,7 +414,7 @@ def settings(batch_size,
    args = [
        'batch_size', 'learning_rate', 'learning_rate_decay_a',
        'learning_rate_decay_b', 'learning_rate_schedule', 'learning_rate_args',
-        'gradient_clipping_threshold'
+        'gradient_clipping_threshold', 'async_lagged_grad_discard_ratio'
    ]
    kwargs = dict()
    kwargs['algorithm'] = algorithm
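For the v1-style `trainer_config_helpers` configuration that this file serves, the new argument would be passed through `settings()` roughly as sketched below. This is only an illustration of the diff above: `batch_size` and the `AdaGradOptimizer()` defaults are assumed values, not recommendations from the commit.

```python
# Sketch of a v1-style trainer config using the new settings() argument.
from paddle.trainer_config_helpers import *

settings(
    batch_size=128,                       # illustrative value
    learning_rate=3e-3,
    learning_method=AdaGradOptimizer(),
    is_async=True,
    # Discard async gradients that lag behind by more than
    # async_lagged_grad_discard_ratio * num_gradient_servers commits.
    async_lagged_grad_discard_ratio=1.6)
```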
