
Commit 232b6fc

Merge pull request #9633 from weixing02/img
Upload fluid image sources to github
2 parents bc8f436 + d988b9a commit 232b6fc

80 files changed: +463 additions, -110 deletions

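Most hunks in this commit follow the same pattern: a relative image reference (`./images/...`, `../images/...`, or `src/...`) is replaced with an absolute URL under `doc/fluid/images` on GitHub; the remaining hunks are trailing-whitespace cleanups, which is why some removed and added lines look identical. As a rough illustration of how such a bulk rewrite could be scripted, here is a minimal Python sketch; the regex, the flattened target directory, and the file layout are assumptions for illustration, not the tooling actually used for this PR.

```python
import re
from pathlib import Path

# Hypothetical helper mirroring the pattern of the hunks in this commit:
# rewrite relative image references in the fluid design docs to absolute
# GitHub URLs under doc/fluid/images. The regex and the assumption that all
# images end up flat in one directory are illustrative only.
BASE_URL = "https://github.com/PaddlePaddle/Paddle/tree/develop/doc/fluid/images"
IMG_REF = re.compile(r'(?:\./images/|\.\./images/|src/)([\w@./-]+\.(?:png|gif|jpg))')

def rewrite(md_path: Path) -> None:
    text = md_path.read_text()
    # Keep only the file name so every reference points into doc/fluid/images.
    new_text = IMG_REF.sub(lambda m: f"{BASE_URL}/{Path(m.group(1)).name}", text)
    if new_text != text:
        md_path.write_text(new_text)

if __name__ == "__main__":
    for md in Path("doc/fluid").rglob("*.md"):
        rewrite(md)
```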

doc/fluid/design/algorithm/parameter_average.md

Lines changed: 3 additions & 1 deletion
@@ -7,7 +7,9 @@ Polyak and Juditsky (1992) showed that the test performance of simple average of
 
 Hence, to accelerate the speed of Stochastic Gradient Descent, Averaged Stochastic Gradient Descent (ASGD) was proposed in Polyak and Juditsky (1992). For ASGD, the running average of parameters obtained by SGD, is used as the estimator for <img src="./images/theta_star.gif"/><br/> . The averaging is done as follows:
 
-![](./images/asgd.gif)
+<p align="center">
+<img src="https://github.com/PaddlePaddle/Paddle/tree/develop/doc/fluid/images/asgd.gif"><br />
+</p>
 
 We propose averaging for any optimizer similar to how ASGD performs it, as mentioned above.
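The averaging that asgd.gif depicts is just the running mean of the SGD iterates, theta_bar_t = theta_bar_{t-1} + (theta_t - theta_bar_{t-1}) / t. A small NumPy sketch of that update (the toy loss and learning rate are illustrative assumptions):

```python
import numpy as np

def asgd_running_average(theta_bar, theta_t, t):
    """Running mean of SGD iterates: equivalent to (1/t) * sum_i theta_i."""
    return theta_bar + (theta_t - theta_bar) / t

# Toy usage: average the iterates of plain SGD on f(theta) = ||theta||^2.
theta = np.array([5.0, -3.0])    # current SGD parameters (illustrative)
theta_bar = theta.copy()         # the average starts at the first iterate
for t in range(2, 101):
    grad = 2.0 * theta           # gradient of ||theta||^2
    theta = theta - 0.01 * grad  # one SGD step
    theta_bar = asgd_running_average(theta_bar, theta, t)
```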

doc/fluid/design/concurrent/channel.md

Lines changed: 14 additions & 14 deletions
@@ -2,7 +2,7 @@
 
 ## Introduction
 
-A Channel is a data structure that allows for synchronous interprocess
+A Channel is a data structure that allows for synchronous interprocess
 communication via message passing. It is a fundemental component of CSP
 (communicating sequential processes), and allows for users to pass data
 between threads without having to worry about synchronization.
@@ -18,7 +18,7 @@ Creates a new channel that takes in variables of a specific dtype.
 
 - **fluid.make_channel(dtype, capacity=0)**
 - **dtype**: The data type of variables being sent/received through channel
-- **capacity**: The capacity of the channel. A capacity of 0 represents
+- **capacity**: The capacity of the channel. A capacity of 0 represents
 an unbuffered channel. Capacity > 0 represents a buffered channel
 
 ```
@@ -40,8 +40,8 @@ fluid.channel_close(ch)
 
 ### Send data to a channel
 
-Sends a variable to a channel. Currently, variables of dtype `LoDTensor`,
-`LoDRankTable`, `LoDTensorArray`, `SelectedRows`, `ReaderHolder`, and
+Sends a variable to a channel. Currently, variables of dtype `LoDTensor`,
+`LoDRankTable`, `LoDTensorArray`, `SelectedRows`, `ReaderHolder`, and
 `ChannelHolder` are supported.
 
 By default, the data of the Variable is moved from the sender to the receiver,
@@ -52,7 +52,7 @@ however the user can optionally copy the data before performing the send.
 - **variable**: The variable to send to the channel
 - **is_copy**: If set to True, channel_send will perform a variable assign
 to copy the source variable to a new variable to be sent.
-
+
 ```
 ch = fluid.make_channel(dtype=core.VarDesc.VarType.LOD_TENSOR)
 var = fill_constant(shape=[1],dtype=core.VarDesc.VarType.INT32, value=100)
@@ -68,7 +68,7 @@ receiving variable.
 - **channel**: The channel to receive the variable from
 - **return_variable**: The destination variable used to store the data of the
 variable received from the channel
-
+
 ```
 ch = fluid.make_channel(dtype=core.VarDesc.VarType.LOD_TENSOR)
 var = fill_constant(shape=[1],dtype=core.VarDesc.VarType.INT32, value=-1)
@@ -84,9 +84,9 @@ internal queues, locks, and conditional variables.
 ### QueueMessage
 
 QueueMessage encapsulates the state of the channel send/receive operation to be
-put in the **sendq/recvq**. It contains a condition variable used to lock the
+put in the **sendq/recvq**. It contains a condition variable used to lock the
 thread (when there are no available sends/receives). In addition, it contains
-a callback function to notify a thread when the QueueMessage is being
+a callback function to notify a thread when the QueueMessage is being
 processed by the channel.
 
 ### Queues
@@ -108,21 +108,21 @@ channel_recv operation will put a new QueueMessage on the recvq and block the
 current thread under two conditions:
 1. The channel is buffered and there is no data on the buff_
 2. The channel is unbuffered and does not have a sender
-
+
 ### State diagram
 
 #### Channel Send
 
 <p align="center">
-<img src="./images/channel_send.png"/><br/>
+<img src="https://github.com/PaddlePaddle/Paddle/tree/develop/doc/fluid/images/channel_send.png"/><br/>
 </p>
-
+
 #### Channel Receive
 
 <p align="center">
-<img src="./images/channel_recv.png"/><br/>
+<img src="https://github.com/PaddlePaddle/Paddle/tree/develop/doc/fluid/images/channel_recv.png"/><br/>
 </p>
-
+
 ## Limitations and Considerations
 
 ### Variable Copy
@@ -135,5 +135,5 @@ be sent before it is sent.
 
 Please note that this is acheived by adding an **assign** operator and creating
 a temporary variable that is sent in place of the original variable. Please
-note that **assign** operator has limited support for only certain variables
+note that **assign** operator has limited support for only certain variables
 datatypes.
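For convenience, the individual calls shown in the hunks above can be stitched into one end-to-end sketch. The import paths below are assumptions, and the CSP/channel API described in this design doc was experimental, so treat this as pseudocode that follows the doc rather than a verified public API.

```python
# Sketch only: pieced together from the snippets in channel.md above.
import paddle.fluid as fluid
import paddle.fluid.core as core
from paddle.fluid.layers import fill_constant

ch = fluid.make_channel(dtype=core.VarDesc.VarType.LOD_TENSOR, capacity=0)

# Send a constant tensor through the (unbuffered) channel, copying it first.
var = fill_constant(shape=[1], dtype=core.VarDesc.VarType.INT32, value=100)
fluid.channel_send(ch, var, is_copy=True)

# Receive it into a destination variable.
result = fill_constant(shape=[1], dtype=core.VarDesc.VarType.INT32, value=-1)
fluid.channel_recv(ch, result)

fluid.channel_close(ch)
```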

doc/fluid/design/concurrent/select_op.md

Lines changed: 21 additions & 21 deletions
@@ -2,13 +2,13 @@
 
 ## Introduction
 
-In golang, the [**select**](https://golang.org/ref/spec#Select_statements)
-statement lets a goroutine wait on multiple communication operations at the
-same time. The **select** blocks until one of its cases can run, then
-executes the case. If multiple cases are ready to run, then one case is
+In golang, the [**select**](https://golang.org/ref/spec#Select_statements)
+statement lets a goroutine wait on multiple communication operations at the
+same time. The **select** blocks until one of its cases can run, then
+executes the case. If multiple cases are ready to run, then one case is
 choosen at random to be executed.
 
-With the introduction of CSP for Paddle, we mimic this behavior by
+With the introduction of CSP for Paddle, we mimic this behavior by
 creating a ***select_op***.
 
 ## How to use it
@@ -17,11 +17,11 @@ The **select_op** is available as a c++ operator. However most users
 will prefer to use the much simplier Python API.
 
 - **fluid.Select()**: Creates a select operator and adds it to the current
-block within the main program. Also creates a sub block and adds it to the
-main program. This sub block is used to hold all variables and operators
+block within the main program. Also creates a sub block and adds it to the
+main program. This sub block is used to hold all variables and operators
 used by the case statements.
-
-Within the select block, users can add cases by
+
+Within the select block, users can add cases by
 calling **select.case** or **select.default** method.
 
 - **fluid.Select.case(channel_action, channel, result_variable)**: Represents
@@ -37,13 +37,13 @@ execute.
 ```
 ch1 = fluid.make_channel(dtype=core.VarDesc.VarType.LOD_TENSOR)
 quit_ch = fluid.make_channel(dtype=core.VarDesc.VarType.LOD_TENSOR)
-
+
 x = fill_constant(shape=[1], dtype=core.VarDesc.VarType.INT32, value=0)
 y = fill_constant(shape=[1], dtype=core.VarDesc.VarType.INT32, value=1)
-
+
 while_cond = fill_constant(shape=[1], dtype=core.VarDesc.VarType.BOOL, value=True)
 while_op = While(cond=while_cond)
-
+
 with while_op.block():
     with fluid.Select() as select:
         with select.case(fluid.channel_send, channel, x):
@@ -99,17 +99,17 @@ blocks {
 }
 }
 // Create "select" operator.
-// inputs:
+// inputs:
 // X: All input variables used by operators within the select block
 // case_to_execute: Variable filled in by select_op when it determines
 // which case to execute.
 //
 // outputs:
-// Out: All output variables referenced by operators within select block.
-//
+// Out: All output variables referenced by operators within select block.
+//
 // attrs:
 // sub_block: The block id containing the select "cases"
-// cases: Serialized list of all cases in the select op.
+// cases: Serialized list of all cases in the select op.
 // Each case is serialized as: '<index>,<type>,<channel>,<value>'
 // where type is 0 for default, 1 for send, and 2 for receive.
 // No channel and values are needed for default cases.
@@ -150,7 +150,7 @@ into **X**. It will also create a temp variable called **case_to_execute**. Th
 filled in by the select_op after it has completed processing the case statements.
 
 If there are no available cases to execute (ie: all cases are blocked on channel operations, and
-there is no default statement), then the select_op will block the current thread. The thread will
+there is no default statement), then the select_op will block the current thread. The thread will
 unblock once there is a channel operation affecting one of the case statements, at which point, the
 **select_op** will set the **case_to_execute** variable to the index of the case to execute.
 
@@ -247,17 +247,17 @@ blocks {
 
 ```
 
-Cases are represented by a **conditional_block operator**, whose's condition is set as the output of
-equal(**case_to_execute**, **case_index**). Since each case index is unique in this sub-block,
+Cases are represented by a **conditional_block operator**, whose's condition is set as the output of
+equal(**case_to_execute**, **case_index**). Since each case index is unique in this sub-block,
 only one case will be executed.
 
 ### select_op flow
 
 <p align="center">
-<img src="./images/select_op_workflow.png"/><br/>
+<img src="https://github.com/PaddlePaddle/Paddle/tree/develop/doc/fluid/images/select_op_workflow.png"/><br/>
 </p>
 
-The select algorithm is inspired by golang's select routine. Please refer to
+The select algorithm is inspired by golang's select routine. Please refer to
 http://www.tapirgames.com/blog/golang-concurrent-select-implementation for more information.
 
 ## Backward Pass
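Putting the pieces from the select_op.md hunks above together, a hedged sketch of the Python-side usage; the imports and the case bodies are assumptions, and the select/CSP API was a design-stage feature, so this is illustrative pseudocode rather than a verified public API.

```python
# Sketch assembled from the select_op.md snippets above.
import paddle.fluid as fluid
import paddle.fluid.core as core
from paddle.fluid.layers import fill_constant, While

ch1 = fluid.make_channel(dtype=core.VarDesc.VarType.LOD_TENSOR)
quit_ch = fluid.make_channel(dtype=core.VarDesc.VarType.LOD_TENSOR)

x = fill_constant(shape=[1], dtype=core.VarDesc.VarType.INT32, value=0)
y = fill_constant(shape=[1], dtype=core.VarDesc.VarType.INT32, value=1)

while_cond = fill_constant(shape=[1], dtype=core.VarDesc.VarType.BOOL, value=True)
while_op = While(cond=while_cond)

with while_op.block():
    with fluid.Select() as select:
        # Ready when some other thread can receive from ch1.
        with select.case(fluid.channel_send, ch1, x):
            pass  # e.g. advance x and y here
        # Ready when a value arrives on quit_ch.
        with select.case(fluid.channel_recv, quit_ch, y):
            pass  # e.g. exit the surrounding while loop
```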

doc/fluid/design/dist_train/distributed_architecture.md

Lines changed: 5 additions & 5 deletions
@@ -40,11 +40,11 @@ computation is only specified in Python code which sits outside of PaddlePaddle,
 
 Similar to how a compiler uses an intermediate representation (IR) so that the programmer does not need to manually optimize their code for most of the cases, we can have an intermediate representation in PaddlePaddle as well. The compiler optimizes the IR as follows:
 
-<img src="src/compiler.png"/>
+<img src="https://github.com/PaddlePaddle/Paddle/tree/develop/doc/fluid/images/compiler.png"/>
 
 PaddlePaddle can support model parallelism by converting the IR so that the user no longer needs to manually perform the computation and operations in the Python component:
 
-<img src="src/paddle-compile.png"/>
+<img src="https://github.com/PaddlePaddle/Paddle/tree/develop/doc/fluid/images/paddle-compile.png"/>
 
 The IR for PaddlePaddle after refactoring is called a `Block`, it specifies the computation dependency graph and the variables used in the computation.
 
@@ -60,7 +60,7 @@ For a detailed explanation, refer to this document -
 
 The revamped distributed training architecture can address the above discussed limitations. Below is the illustration of how it does so:
 
-<img src="src/distributed_architecture.png"/>
+<img src="https://github.com/PaddlePaddle/Paddle/tree/develop/doc/fluid/images/distributed_architecture.png"/>
 
 The major components are: *Python API*, *Distribute Transpiler* and *Remote Executor*.
 
@@ -152,7 +152,7 @@ for data in train_reader():
 `JobDesc` object describe the distributed job resource specification to run on
 Cluster environment.
 
-<img src="src/remote_executor.png" width="500" align="center" />
+<img src="https://github.com/PaddlePaddle/Paddle/tree/develop/doc/fluid/images/remote_executor.png" width="500" align="center" />
 
 `RemoteExecutor.run` sends the `ProgramDesc` and
 [TrainingJob](https://github.com/PaddlePaddle/cloud/blob/unreleased-tpr/doc/autoscale/README.md#training-job-resource)
@@ -171,7 +171,7 @@ In the future, a more general placement algorithm should be implemented, which m
 
 The local training architecture will be the same as the distributed training architecture, the difference is that everything runs locally, and there is just one PaddlePaddle runtime:
 
-<img src="src/local_architecture.png"/>
+<img src="https://github.com/PaddlePaddle/Paddle/tree/develop/doc/fluid/images/local_architecture.png"/>
 
 
 ### Training Data

doc/fluid/design/dist_train/multi_cpu.md

Lines changed: 2 additions & 2 deletions
@@ -8,11 +8,11 @@ Op graph to a multi-CPU Op graph, and run `ParallelDo` Op to run the graph.
 
 ## Transpiler
 
-<img src="src/multi-threads/[email protected]" width="300">
+<img src="https://github.com/PaddlePaddle/Paddle/tree/develop/doc/fluid/images/[email protected]" width="300">
 
 After converted:
 
-<img src="src/multi-threads/[email protected]" width="1000">
+<img src="https://github.com/PaddlePaddle/Paddle/tree/develop/doc/fluid/images/[email protected]" width="1000">
 
 ## Implement

doc/fluid/design/dist_train/parameter_server.md

Lines changed: 3 additions & 3 deletions
@@ -41,11 +41,11 @@ We will need these OPs: *Send*, *Recv*, *Enqueue*, *Dequeue*.
 Below is an example of converting the user defined graph to the
 subgraphs for the trainer and the parameter server:
 
-<img src="src/local-graph.png" width="300"/>
+<img src="https://github.com/PaddlePaddle/Paddle/tree/develop/doc/fluid/images/local-graph.png" width="300"/>
 
 After converting:
 
-<img src="src/dist-graph.png" width="700"/>
+<img src="https://github.com/PaddlePaddle/Paddle/tree/develop/doc/fluid/images/dist-graph.png" width="700"/>
 
 1. The parameter variable W and its optimizer program are placed on the parameter server.
 1. Operators are added to the program.
@@ -69,7 +69,7 @@ In Fluid, we introduce [SelectedRows](../selected_rows.md) to represent a list o
 non-zero gradient data. So when we do parameter optimization both locally and remotely,
 we only need to send those non-zero rows to the optimizer operators:
 
-<img src="src/sparse_update.png" width="700" />
+<img src="https://github.com/PaddlePaddle/Paddle/tree/develop/doc/fluid/images/sparse_update.png" width="700" />
 
 ### Benefits
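The sparse-update hunk above is about sending only the non-zero rows of a gradient to the parameter server. A small NumPy sketch of that idea; the (rows, values) pair mirrors the SelectedRows description, but the concrete layout here is illustrative, not Fluid's actual data structure.

```python
import numpy as np

def to_selected_rows(dense_grad):
    """Keep only the non-zero rows of a gradient, SelectedRows-style.

    Returns (rows, values): the indices of the non-zero rows and their data,
    which is all that would need to be sent to the parameter server.
    """
    rows = np.flatnonzero(np.any(dense_grad != 0, axis=1))
    return rows, dense_grad[rows]

# Toy embedding gradient: only rows 1 and 3 were touched in this mini-batch.
grad = np.zeros((5, 4))
grad[1] = 0.5
grad[3] = -1.0
rows, values = to_selected_rows(grad)   # rows == [1, 3], values.shape == (2, 4)
```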

doc/fluid/design/dynamic_rnn/rnn.md

Lines changed: 4 additions & 4 deletions
@@ -5,7 +5,7 @@ This document describes the RNN (Recurrent Neural Network) operator and how it i
 ## RNN Algorithm Implementation
 
 <p align="center">
-<img src="./rnn.jpg"/>
+<img src="https://github.com/PaddlePaddle/Paddle/tree/develop/doc/fluid/images/rnn.jpg"/>
 </p>
 
 The above diagram shows an RNN unrolled into a full network.
@@ -22,7 +22,7 @@ There are several important concepts here:
 There could be local variables defined in each step-net. PaddlePaddle runtime realizes these variables in *step-scopes* which are created for each step.
 
 <p align="center">
-<img src="./rnn.png"/><br/>
+<img src="https://github.com/PaddlePaddle/Paddle/tree/develop/doc/fluid/images/rnn.png"/><br/>
 Figure 2 illustrates the RNN's data flow
 </p>
 
@@ -93,7 +93,7 @@ For example, we could have a 2-level RNN, where the top level corresponds to par
 The following figure illustrates feeding in text into the lower level, one sentence at a step, and the feeding in step outputs to the top level. The final top level output is about the whole text.
 
 <p align="center">
-<img src="./2_level_rnn.png"/>
+<img src="https://github.com/PaddlePaddle/Paddle/tree/develop/doc/fluid/images/2_level_rnn.png"/>
 </p>
 
 ```python
@@ -149,5 +149,5 @@ If the `output_all_steps` is set to False, it will only output the final time st
 
 
 <p align="center">
-<img src="./rnn_2level_data.png"/>
+<img src="https://github.com/PaddlePaddle/Paddle/tree/develop/doc/fluid/images/rnn_2level_data.png"/>
 </p>

doc/fluid/design/modules/batch_norm_op.md

Lines changed: 11 additions & 11 deletions
@@ -2,7 +2,7 @@
 
 ## What is batch normalization
 
-Batch normalization is a frequently-used method in deep network training. It adjusts the mean and variance of a layer's output, and make the data distribution easier for next layer's training.
+Batch normalization is a frequently-used method in deep network training. It adjusts the mean and variance of a layer's output, and make the data distribution easier for next layer's training.
 
 The principle of batch normalization can be summarized into a simple function:
 
@@ -66,21 +66,21 @@ As most C++ operators do, `batch_norm_op` is defined by inputs, outputs, attribu
 
 The following graph showes the training computational process of `batch_norm_op`:
 
-<img src="../images/batch_norm_op_kernel.png" width="800"/>
+<img src="https://github.com/PaddlePaddle/Paddle/tree/develop/doc/fluid/images/batch_norm_op_kernel.png" width="800"/>
 
 cudnn provides APIs to finish the whole series of computation, we can use them in our GPU kernel.
 
 ### Python
 
 `batch_norm_op` is warpped as a layer in Python:
 
-```python
-def batch_norm_layer(net,
+```python
+def batch_norm_layer(net,
                      input,
-                     output,
-                     scale,
-                     bias,
-                     use_global_est = False,
+                     output,
+                     scale,
+                     bias,
+                     use_global_est = False,
                      epsilon = 1e-6,
                      momentum = 0.99):
     mean_cache = scope.new_var(name = 'estimated_mean', trainable = False)
@@ -119,15 +119,15 @@ for pass_id in range(PASS_NUM):
     if pass_id % 100 == 0:
         net.infer(test_image) # run inferencing model
     # ...
-```
+```
 
 `is_infer` is an attribute. Once an operator is created, its attributes can not be changed. It suggests us that we shall maintain two `batch_norm_op` in the model, one's `is_infer` is `True`(we call it `infer_batch_norm_op`) and the other one's is `False`(we call it `train_batch_norm_op`). They share all parameters and variables, but be placed in two different branches. That is to say, if a network contains a `batch_norm_op`, it will fork into two branches, one go through `train_batch_norm_op` and the other one go through `infer_batch_norm_op`:
 
 <div align=center>
-<img src="../images/batch_norm_fork.png" width="500"/>
+<img src="https://github.com/PaddlePaddle/Paddle/tree/develop/doc/fluid/images/batch_norm_fork.png" width="500"/>
 </div>
 
-Just like what is shown in the above graph, the net forks before `batch_norm_op` and will never merge again. All the operators after `batch_norm_op` will duplicate.
+Just like what is shown in the above graph, the net forks before `batch_norm_op` and will never merge again. All the operators after `batch_norm_op` will duplicate.
 
 When the net runs in training mode, the end of the left branch will be set as the running target, so the dependency tracking process will ignore right branch automatically. When the net runs in inferencing mode, the process is reversed.
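As a companion to the batch_norm_op hunks above: the "simple function" the doc refers to is the usual normalize-then-scale-and-shift computation. A NumPy sketch of the training-time forward pass follows; the epsilon and momentum defaults are taken from the batch_norm_layer signature shown in the diff, everything else is illustrative.

```python
import numpy as np

def batch_norm_forward(x, scale, bias, est_mean, est_var,
                       epsilon=1e-6, momentum=0.99):
    """Training-time batch norm on a (batch, channels) input.

    Normalizes with the mini-batch statistics, applies scale and bias, and
    updates the running (estimated) mean/variance used at inference time.
    """
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + epsilon)
    y = scale * x_hat + bias
    est_mean = momentum * est_mean + (1.0 - momentum) * mu
    est_var = momentum * est_var + (1.0 - momentum) * var
    return y, est_mean, est_var

# Toy usage on a batch of 8 examples with 3 channels.
x = np.random.randn(8, 3)
y, m, v = batch_norm_forward(x, scale=np.ones(3), bias=np.zeros(3),
                             est_mean=np.zeros(3), est_var=np.ones(3))
```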
