virtaitech
diff --git a/README.md b/README.md
Lines changed: 8 additions & 2 deletions
diff --git a/blogposts/pytorch_models.md b/blogposts/pytorch_models.md
Lines changed: 84 additions & 23 deletions
diff --git a/dockerfiles/README.md b/dockerfiles/README.md
Lines changed: 44 additions & 24 deletions
diff --git a/dockerfiles/client-pytorch-1.0.1-py3/Dockerfile b/dockerfiles/client-pytorch-1.0.1-py3/Dockerfile
Lines changed: 37 additions & 0 deletions
diff --git a/dockerfiles/client-pytorch-1.0.1-py3/README.md b/dockerfiles/client-pytorch-1.0.1-py3/README.md
Lines changed: 25 additions & 0 deletions
diff --git a/dockerfiles/client-pytorch-1.0.1-py3/pip.conf b/dockerfiles/client-pytorch-1.0.1-py3/pip.conf
Lines changed: 3 additions & 0 deletions
diff --git a/dockerfiles/client-pytorch-1.0.1-py3/requirement.txt b/dockerfiles/client-pytorch-1.0.1-py3/requirement.txt
Lines changed: 9 additions & 0 deletions
diff --git a/dockerfiles/client-pytorch-1.1.0-py3/README.md b/dockerfiles/client-pytorch-1.1.0-py3/README.md
Lines changed: 1 addition & 1 deletion
diff --git a/dockerfiles/client-tf1.12-base/build-docker-images.sh b/dockerfiles/client-tf1.12-base/build-docker-images.sh
Lines changed: 0 additions & 7 deletions
diff --git a/dockerfiles/client-tf1.12-py2/build-docker-images.sh b/dockerfiles/client-tf1.12-py2/build-docker-images.sh
Lines changed: 0 additions & 7 deletions
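@@ -26,9 +26,15 @@ Orion vGPU Software User Manual
 
 # What's New
 
-* 2019/07/06 Orion vGPU software update: more precise GPU memory control
+* **2019/07/08** Docker image update: [PyTorch 1.0.1](./dockerfiles/client-pytorch-1.0.1-py3), [TensorFlow 1.8.0](./dockerfiles/client-tf1.8-base)
 
-  Users must make sure that Orion Controller, Orion Server and Orion Client are all at the latest version. Orion vGPU components of different versions cannot work together.
+  In the PyTorch 1.0.1 image, both PyTorch 1.0.1 and torchvision 0.2.2 are installed from the official wheels, so no build from source is needed.
+  
+  Before using the latest images, users must make sure that Orion Server has been updated to the latest version.
+
+* **2019/07/06** Orion vGPU software update: more precise GPU memory control
+
+  Users must make sure that Orion Controller, Orion Server and Orion Client are all updated to the latest version. Orion vGPU components of different versions cannot work together.
 
   * `Orion Controller`: use the latest `orion-controller`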
 
 
@@ -5,38 +5,34 @@
 We recommend running PyTorch models inside the Orion Client containers we provide.
 
 ```bash
-docker pull virtaitech/orion-client:pytorch-1.1.0-py3
+docker pull virtaitech/orion-client:pytorch-1.0.1-py3
+# (or use pytorch 1.1.0)
+# docker pull virtaitech/orion-client:pytorch-1.1.0-py3
 ```
 
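 Before starting the container, make sure that Orion Controller and Orion Server are running normally. When starting the container, mount the `/dev/shm/orionsock<index>` file created by the `orion-shm` tool at the same path inside the container. Users also need to set the `ORION_CONTROLLER` environment variable correctly.
 
-<a id="run-container"></a>**Using PyTorch Multiprocessing inside the container**
+<a id="run-container"></a>**Usage inside the container**
 
-To use PyTorch's Multiprocessing module inside the container, including the PyTorch DataLoader, users need to raise the maximum amount of shared memory available to the container. To do so, pass to `docker run` either
+If users need the PyTorch Multiprocessing module inside the container (for example, loading the Imagenet dataset with multi-process DataLoader workers), they need to raise the container's shared memory limit used for IPC. To do so, add the following argument to `docker run`: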
 
 ```bash
 --ipc=host
+# (or specify max shm size)
+# --shm-size=8G
 ```
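-or
-```bash
---shm-size=8G
-```
-as an argument. This requirement is unrelated to the Orion vGPU software; it also applies when using a local physical GPU through `nvidia-docker`. For details, see the [NVIDIA official PyTorch image notes](https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/running.html).
 
-**Running PyTorch with Orion vGPU on a KVM virtual machine or a bare-metal host**
 
-If users want to install PyTorch and the Orion Client Runtime in a non-container environment to use Orion vGPU resources, we recommend installing the Orion Client runtime to `/usr/local/cuda-9.0` and creating the symlink `/usr/local/cuda => /usr/local/cuda-9.0`:
+This requirement is unrelated to the Orion vGPU software; it also applies when using a local physical GPU through `nvidia-docker`. For details, see the [NVIDIA official PyTorch image notes](https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/running.html).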
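
The container requirements described above (mounting `/dev/shm/orionsock<index>` at the same path, setting `ORION_CONTROLLER`, and raising the shared memory limit with `--ipc=host`) can be combined into one command line. The following is a minimal sketch, not the project's official launcher: the helper name `orion_run_cmd` is hypothetical, the `-it --rm` flags are an illustrative choice, and the expected format of the `ORION_CONTROLLER` value is an assumption that should be checked against your deployment. The function only prints the command, it does not start a container.

```bash
#!/usr/bin/env bash

# Hypothetical helper: print a `docker run` command line for an Orion Client container.
# $1: index of the /dev/shm/orionsock<index> file created by orion-shm
# $2: value for ORION_CONTROLLER (format is deployment-specific, assumed here)
# $3: image tag (defaults to the PyTorch 1.0.1 image named in this document)
orion_run_cmd() {
    local index="$1"
    local controller="$2"
    local image="${3:-virtaitech/orion-client:pytorch-1.0.1-py3}"
    local sock="/dev/shm/orionsock${index}"

    # --ipc=host lifts the shared memory limit needed by DataLoader workers;
    # the socket file is mounted at the same path inside the container.
    printf 'docker run -it --rm --ipc=host -v %s:%s -e ORION_CONTROLLER=%s %s\n' \
        "$sock" "$sock" "$controller" "$image"
}

# Example: print the command for orionsock0 with a placeholder controller address.
orion_run_cmd 0 "<controller-address>"
```

Replacing `--ipc=host` with `--shm-size=8G` in the format string gives the alternative mentioned above.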
 
-```bash
-mkdir -p /usr/local/cuda-9.0
-./install-client -d /usr/local/cuda-9.0
-ln -s /usr/local/cuda-9.0 /usr/local/cuda
-```
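+**Running PyTorch with Orion vGPU on a KVM virtual machine or a bare-metal host**
 
-In addition, users may need to follow the steps in the [PyTorch Dockerfile documentation](../dockerfiles/client-pytorch-1.1.0-py3/README.md) to build a PyTorch with fewer dependencies from source.
+* For PyTorch 1.0.x, support is available once users run the `install-client` installer to install the Orion Client Runtime. PyTorch can be obtained directly from
+
+* For PyTorch 1.1.0, we currently require users to build from source in order to drop some component dependencies. Users can follow the steps and build options described in the [PyTorch 1.1.0 image documentation](../dockerfiles/client-pytorch-1.1.0-py3/README.md) to build a PyTorch 1.1.0 with fewer dependencies from source.
 
 ## Support status
-Orion vGPU support for PyTorch is still under active development. Currently, we support PyTorch 1.0.1 and 1.1.0.
+Orion vGPU support for PyTorch is still under active development. Currently, we support PyTorch 1.0.x and 1.1.0.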
 
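 Note that
 * Orion vGPU does not currently support PyTorch using remote GPU resources over an RDMA network
@@ -60,7 +56,7 @@ Orion vGPU support for PyTorch is still under active development. Currently, we support PyTorc
                 ...
     ```
 
-    Users can train on the extracted dataset with the `--dataset lfw --dataroot=path/to/celeba` arguments. Since we currently support only a single Orion vGPU, add the `--ngpu 1` argument:
+    Users can train on the extracted dataset with the `--dataset lfw --dataroot=path/to/celeba` arguments. Since we cannot yet fully support the NCCL communication library, add the `--ngpu 1` argument to train on a single Orion vGPU: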
     ```bash
     python3 main.py --dataset lfw --dataroot /path/to/celeba --cuda --ngpu 1
     ```
@@ -105,21 +101,86 @@ Orion vGPU support for PyTorch is still under active development. Currently, we support PyTorc
     so that the DataLoader worker processes inside the container can exchange data through shared memory.
 
 * [MNIST Convnets](https://github.com/pytorch/examples/tree/master/mnist) Supported
+
+    ```bash
+    python3 main.py
+    ```
+
+    Trains for 10 epochs by default.
+
 * [MNIST Hogwild](https://github.com/pytorch/examples/tree/master/mnist_hogwild) Not supported yet; full support for CUDA IPC is still under development.
 * [Linear Regression](https://github.com/pytorch/examples/tree/master/regression) Supported
+
+    ```bash
+    python3 main.py
+    ```
+
 * [Reinforcement Learning](https://github.com/pytorch/examples/tree/master/reinforcement_learning) Supported
+
+    ```bash
+    pip3 install -r requirements.txt
+    # For REINFORCE:
+    python3 reinforce.py
+    # For actor critic:
+    python3 actor_critic.py
+    ```
+
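 * [SNLI with GloVe vectors and LSTMs](https://github.com/pytorch/examples/tree/master/snli) Supported
 
-    Users need to install `spacy` and download the corpus:
+    Users need to install torchtext and spacy, and download the spacy model:
+    In addition, users need to install `spacy`,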
     ```bash
-    pip3 install spacy
+    pip3 install torchtext spacy
+
     python3 -m spacy download en
     ```
-* [Super Resolution](https://github.com/pytorch/examples/tree/master/super_resolution) Not supported in the provided container, because it was built without Lapack support.
+
+    Then run the model:
+
+    ```bash
+    python3 main.py
+    ```
+
+* [Super Resolution](https://github.com/pytorch/examples/tree/master/super_resolution) ~~Not supported in the provided container~~ Not supported in the provided PyTorch 1.1.0 image, because PyTorch was built without Lapack support.
+
+    **Update 2019/07/08** Users can run this example in the PyTorch 1.0.1 image we provide:
+
+    ```bash
+    # Train 100 epochs
+    python3 main.py --upscale_factor 3 --batchSize 4 --testBatchSize 100 --nEpochs 100 --lr 0.001
+
+    # Super Resolution (use the checkpoint saved after the last of the 100 epochs)
+    python3 super_resolve.py --input_image dataset/BSDS300/images/test/16077.jpg --model model_epoch_100.pth --output_filename out.png
+    ```
+
+    The generated image is `out.png`; users can compare it with the input image `dataset/BSDS300/images/test/16077.jpg`.
+
 * [Time Sequence Prediction](https://github.com/pytorch/examples/tree/master/time_sequence_prediction) Supported
+
+    ```bash
+    # Generate input data
+    python3 generate_sine_wave.py
+    # Train
+    python3 train.py
+    ```
+
+    After training, plots of the predicted waveform are generated in the current directory.
+
 * [Variational Auto-Encoders](https://github.com/pytorch/examples/tree/master/vae) Supported
+
+    ```bash
+    python3 main.py
+    ```
+
 * [Word Language Model using LSTM](https://github.com/pytorch/examples/tree/master/word_language_model) Supported
 
+    ```bash
+    # Train a tied LSTM on Wikitext-2 with CUDA
+    python3 main.py --cuda --epochs 6 --tied
+    # Generate samples from the trained LSTM model.
+    python3 generate.py
+    ```
+
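 ## Multi-GPU Resnet50 training example
 
 In this section, we show an interesting scenario: virtualizing two local Tesla P100 16GB cards into 4 Orion vGPUs to train Resnet50 on the Imagenet dataset. Inside the Orion Client, the resource request environment variables are `ORION_VGPU=4` and `ORION_GMEM=7800`, which ensures that every two Orion vGPUs reside on one Tesla P100 card.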
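
The capacity reasoning behind `ORION_VGPU=4` and `ORION_GMEM=7800` can be checked with a little integer arithmetic. This is a sketch under two assumptions that the text does not state explicitly: `ORION_GMEM` is taken to be a per-vGPU memory quota in MiB, and a 16GB card is taken as a nominal 16384 MiB. The function names are hypothetical, and the sketch only checks how many quotas fit on a card; actual placement is decided by the Orion software.

```bash
#!/usr/bin/env bash

# How many vGPU memory quotas fit on one physical card (integer division).
vgpus_per_card() {
    local card_mem_mib="$1"   # physical memory of one card, in MiB (assumed unit)
    local gmem_mib="$2"       # ORION_GMEM value, assumed to be MiB per vGPU
    echo $(( card_mem_mib / gmem_mib ))
}

# How many cards are needed for a given number of vGPUs (ceiling division).
cards_needed() {
    local vgpus="$1"
    local per_card="$2"
    echo $(( (vgpus + per_card - 1) / per_card ))
}

per_card="$(vgpus_per_card 16384 7800)"
echo "vGPUs per card: ${per_card}"
echo "cards needed for ORION_VGPU=4: $(cards_needed 4 "${per_card}")"
```

With these numbers, two quotas of 7800 fit on one card (15600 of 16384) while a third does not, so four vGPUs occupy exactly the two P100 cards.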
@@ -179,9 +240,9 @@ python3 main.py --arch resnet50 \
 
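 As shown, the actual compute work is taken over entirely by the Orion Server process `oriond`.
 
-Users may notice that the GPU memory usage reported by `orion-smi` is lower than the actual physical usage reported by `nvidia-smi`. This is because `orion-smi` currently reports only heap-allocated GPU memory and does not count implicit overhead such as the CUDA context and cuDNN.
+~~Users may notice that the GPU memory usage reported by `orion-smi` is lower than the actual physical usage reported by `nvidia-smi`. This is because `orion-smi` currently reports only heap-allocated GPU memory and does not count implicit overhead such as the CUDA context and cuDNN.~~
 
-**[2019-07-06] Update**  The Orion vGPU software now controls GPU memory usage more precisely and brings the main implicit allocations, including the CUDA context, under its control, so the results shown by `orion-smi` are very close to those of `nvidia-smi`. For PyTorch model training, the reported memory usage is typically off by less than 20 seconds.
+**Update 2019/07/06**  The Orion vGPU software now controls GPU memory usage more precisely and brings the main implicit allocations, including the CUDA context, under its control, so the results shown by `orion-smi` are very close to those of `nvidia-smi`. For PyTorch model training, the reported memory usage is typically off by less than 20MB.
 
 
 After 7 epochs, our training reached 49.362% top-1 accuracy and 75.680% top-5 accuracy.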
 
@@ -2,28 +2,37 @@
 
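 We provide several images with the Orion Client Runtime plus TensorFlow or PyTorch installed. Specifically,
 * TensorFlow 1.12 is installed directly from the `pip` index
-* PyTorch 1.1.0 is built directly from the official source
+* PyTorch 1.0.1 is installed directly from the `pip` index; 1.1.0 is built directly from the official source
 * All images run `Ubuntu 16.04`
-* Some images also include the `MLNX_OFED 4.5.1` RDMA driver
+* Some images include the `MLNX_OFED 4.5.1` user-space driver to support RDMA
 
-The Dockerfiles in this repo correspond to the official Orion vGPU [Docker Hub Registry](https://hub.docker.com/r/virtaitech/orion-client).
+The Dockerfiles in this repository correspond to the official Orion vGPU [Docker Hub Registry](https://hub.docker.com/r/virtaitech/orion-client).
 
-Note that the following files, required in each image's directory,
-* the `install-client` installer
-* the MLNX_OFED 4.5-1.0.1.0 driver
-* and the wheel package built from the PyTorch source
-  
-must be placed there by the user before `docker build` can succeed.
+## TensorFlow base images
 
-## [TensorFlow 1.12 base image](./client-tf1.12-base)
+### [TensorFlow 1.12](./client-tf1.12-base)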
 
 ```bash
 docker pull virtaitech/orion-client:tf1.12-base
 ```
 
 This image installs the official TensorFlow via `pip3 install tensorflow-gpu==1.12`, then installs the Orion Client runtime with the `install-client` installer.
 
-## [TensorFlow 1.12 with MLNX driver, Python 3.5 environment](./client-tf1.12-py3)
+For convenience, we cloned the official TensorFlow CNN benchmarks into `/root/benchmarks`.
+
+### [TensorFlow 1.8](./client-tf1.8-base)
+
+```bash
+docker pull virtaitech/orion-client:tf1.8-base
+```
+
+This image installs the official TensorFlow via `pip3 install tensorflow-gpu==1.8`, then installs the Orion Client runtime with the `install-client` installer.
+
+For convenience, we cloned the official TensorFlow CNN benchmarks into `/root/benchmarks`.
+
+### TensorFlow images with RDMA support
+
+### [TensorFlow 1.12 with MLNX driver, Python 3.5 environment](./client-tf1.12-py3)
 
 ```bash
 docker pull virtaitech/orion-client:tf1.12-py3
@@ -35,7 +44,7 @@ docker pull virtaitech/orion-client:tf1.12-py3
 
 For demonstration purposes, we also installed Jupyter Notebook and some Python packages.
 
-## [TensorFlow 1.12 with MLNX driver, Python 2.7 environment](./client-tf1.12-py2)
+### [TensorFlow 1.12 with MLNX driver, Python 2.7 environment](./client-tf1.12-py2)
 
 ```bash
 docker pull virtaitech/orion-client:tf1.12-py2
@@ -47,23 +56,34 @@ docker pull virtaitech/orion-client:tf1.12-py2
 
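 In this image, we installed some Python packages so that users can run the [TensorFlow Object Detection](https://github.com/tensorflow/models/tree/master/research/object_detection) models as well as the other [official models](https://github.com/tensorflow/models).
 
-## [PyTorch 1.1.0, Python 3.5 environment](./client-pytorch-1.1.0-py3)
+## PyTorch images
 
-Because the official PyTorch `pip` wheel bundles too many components, some of which this version of the Orion vGPU software does not yet support, we built a 1.1.0 wheel from the PyTorch source. We made no changes to the source code and only changed the build options.
+### Notes
+When loading training data with the PyTorch DataLoader, start the container with `--ipc=host` so that the DataLoader processes can communicate via IPC. This requirement is **unrelated** to the Orion vGPU software; it is also required when running PyTorch in a container through `nvidia-docker`.
 
-We likewise built torchvision 0.3.0 from source with the default build options and packaged it into the image. We also installed some Python packages so that users can run the official PyTorch examples directly in the image: https://github.com/pytorch/examples
+In [a technical blog post](../blogposts/pytorch_models.md), we describe how to have PyTorch use multiple Orion vGPUs to train Resnet50 on the Imagenet dataset.
 
-We have cloned the [official model examples](https://github.com/pytorch/examples) into `/root/examples` in the image; users can enter each model's subdirectory to run it.
+### [PyTorch 1.0.1, Python 3.5 environment](./client-pytorch-1.0.1-py3)
 
-Finally, we installed the Orion Client runtime with the `install-client` installer.
+We installed PyTorch 1.0.1 from the official Python 3 wheel.
 
-In the [PyTorch 1.1.0 Python 3.5 image](./client-pytorch-1.1.0-py3) documentation, we describe the steps we used to build PyTorch 1.1.0 and TorchVision 0.3.0 and to install the Orion Client Runtime, for users' reference.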
+```bash
+RUN pip3 install torch==1.0.1 -f https://download.pytorch.org/whl/cu90/stable
+```
 
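-### Notes
-Because the PyTorch DataLoader communicates over IPC, start the container with `--shm-size=8G` so that the DataLoader works properly. The same applies to native environments.
+We have cloned the [official model examples](https://github.com/pytorch/examples) into `/root/examples` in the image; users can enter each model's subdirectory to run it. We also installed a set of Python packages, including torchvision 0.2.2.
+
+Finally, we installed the Orion Client runtime with the `install-client` installer.
+
+### [PyTorch 1.1.0, Python 3.5 environment](./client-pytorch-1.1.0-py3)
+
+The official PyTorch 1.1.0 `pip` wheel includes some components that this version of the Orion vGPU software does not yet support, so we changed the build options and built a slimmed-down PyTorch 1.1.0 wheel (**source code unmodified**).
 
-In addition, since our PyTorch support is still under active development, users should note:
-* We do not yet support PyTorch using remote GPU resources over an RDMA network
-* For multi-GPU training, use GLOO as the backend instead of the default NCCL
+We likewise built torchvision 0.3.0 from source with the default build options and packaged it into the image.
+
+We have cloned the [official model examples](https://github.com/pytorch/examples) into `/root/examples` in the image; users can enter each model's subdirectory to run it.
+
+Finally, we ran the `install-client` installer to install the Orion Client runtime.
+
+In the [PyTorch 1.1.0 Python 3.5 image](./client-pytorch-1.1.0-py3) documentation, we describe the steps we used to build PyTorch 1.1.0 and TorchVision 0.3.0 and to install the Orion Client Runtime, for users' reference.
 
-In [a technical blog post](../blogposts/pytorch_models.md), we describe how to have PyTorch use multiple Orion vGPUs to train Resnet50 on the Imagenet dataset.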
 
@@ -0,0 +1,37 @@
+FROM ubuntu:16.04
+MAINTAINER zoumao@virtaitech.com
+
+RUN sed -i 's/archive.ubuntu.com/mirrors.ustc.edu.cn/g' /etc/apt/sources.list
+
+RUN apt update -y &&\
+    apt install -y libcurl4-openssl-dev &&\
+    apt install -y python3-dev python3-pip &&\
+    apt install -y git wget curl bc net-tools &&\
+    apt install -y lsb-core &&\
+    apt install -y libjpeg-dev zlib1g-dev libopenmpi-dev libomp-dev &&\
+    apt clean
+
+# Setup pip source
+COPY pip.conf /etc/
+
+WORKDIR /root
+
+# Install PyTorch, torchvision and other python packages
+RUN pip3 install torch==1.0.1 -f https://download.pytorch.org/whl/cu90/stable
+COPY requirement.txt .
+RUN pip3 install -r requirement.txt && rm requirement.txt
+
+# Prepare PyTorch examples
+RUN git clone https://github.com/pytorch/examples.git
+# Also package the processed MNIST data
+# COPY data examples/data
+
+# Install Orion Client runtime
+COPY install-client .
+RUN chmod +x install-client && ./install-client -q && rm install-client
+
+# Set the num of Orion vGPU each process requests from Orion Controller
+ENV ORION_VGPU=1
+
+WORKDIR /root
+CMD ["/bin/bash"]
@@ -0,0 +1,25 @@
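+# Building the image
+Users only need to place the `install-client` installer in the directory containing the Dockerfile, and can then build the image with `docker build`.
+
+PyTorch 1.0.1 is installed from the official wheel packages: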
+
+```bash
+RUN pip3 install torch==1.0.1 -f https://download.pytorch.org/whl/cu90/stable
+```
+
+In addition, we installed some Python packages, including torchvision 0.2.2.
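
Because the Dockerfile in this directory copies `pip.conf`, `requirement.txt` and `install-client` from the build context, a missing file only surfaces partway through `docker build`. The following is a minimal pre-flight sketch; the function name `check_build_context` and the local image tag in the usage comment are hypothetical and not part of the repository.

```bash
#!/usr/bin/env bash

# Hypothetical pre-flight check: verify that every file the Dockerfile COPYs
# is present in the build context before invoking `docker build`.
# $1: build context directory (defaults to the current directory)
check_build_context() {
    local dir="${1:-.}"
    local missing=0
    local f
    for f in Dockerfile install-client pip.conf requirement.txt; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $f" >&2
            missing=1
        fi
    done
    # Return 0 when everything is present, 1 otherwise.
    return "$missing"
}

# Usage (the tag below is an arbitrary local name):
#   check_build_context . && docker build -t orion-client-local:pytorch-1.0.1-py3 .
```

Only `install-client` has to be supplied by the user; the other files are added by this commit, so the check mostly guards against building from the wrong directory.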
+
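+# Using the image
+
+In [a technical blog post](../../blogposts/pytorch_models.md), we describe how to run the various [official PyTorch model examples](https://github.com/pytorch/examples) in the container.
+
+# Notes
+
+* When loading training data with the PyTorch DataLoader, start the container with `--ipc=host` so that the DataLoader processes can communicate via IPC. This requirement is **unrelated** to the Orion vGPU software; it is also required when running PyTorch in a container through `nvidia-docker`.
+
+* The Orion vGPU software does not currently support PyTorch using remote physical GPU resources over an RDMA network. Users who need Remote Orion vGPU must use TCP.
+
+* The Orion vGPU software does not currently support PyTorch using NCCL as the backend for multi-GPU training, so users need to use Facebook GLOO as the communication backend.
+
+
+For details, see the example in the [technical blog post](../../blogposts/pytorch_models.md) that uses GLOO as PyTorch's multi-process communication backend to train Resnet50 on the Imagenet dataset with multiple Orion vGPUs.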
@@ -0,0 +1,3 @@
+[global]
+index-url=https://pypi.doubanio.com/simple/
+trusted-host=pypi.doubanio.com
@@ -0,0 +1,9 @@
+pillow
+scipy==1.2.0
+matplotlib==3.0
+pandas
+cython
+tqdm
+contextlib2
+lxml
+torchvision==0.2.2
@@ -11,7 +11,7 @@
 
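 To build the image, users need to build PyTorch and TorchVision from source following the steps below.
 
-## Building PyTorch 1.1.0 for Python 3.5 from source
+## Building PyTorch 1.1.0 from source
 
 We use an Ubuntu 16.04 environment as the example.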