A lightweight project that runs PyTorch on Angel, giving PyTorch the ability to handle high-dimensional models.
The architecture of PyTorch on Angel consists of three modules:
- python client: generates the PyTorch script module.
- angel ps: provides a common Parameter Server (PS) service, responsible for distributed model storage, communication synchronization, and coordination of computing.
- spark executor: the worker process, responsible for data processing, loading the PyTorch script module, and communicating with the Angel PS server to complete model training and prediction. The PyTorch C++ backend runs in native mode as the actual computing backend.
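To make the division of labor between the parameter server and the workers concrete, here is a minimal in-memory sketch. It is not Angel's API: `ToyParameterServer`, `pull`, `push`, and `worker_step` are hypothetical names, and a plain linear model with squared-error loss stands in for the PyTorch script module. The point it illustrates is that a worker only pulls the weights its batch touches, which is what keeps very high-dimensional models tractable.

```python
from typing import Dict, List, Tuple

# A sample is (sparse features {index: value}, label).
Sample = Tuple[Dict[int, float], float]


class ToyParameterServer:
    """In-memory stand-in for a parameter server (hypothetical, not Angel's API)."""

    def __init__(self) -> None:
        # Sparse storage: only indices that were ever updated take memory.
        self.weights: Dict[int, float] = {}

    def pull(self, indices: List[int]) -> Dict[int, float]:
        """Return current weights for the requested indices only."""
        return {i: self.weights.get(i, 0.0) for i in indices}

    def push(self, grads: Dict[int, float], step_size: float) -> None:
        """Apply a plain SGD update to the indices present in `grads`."""
        for i, g in grads.items():
            self.weights[i] = self.weights.get(i, 0.0) - step_size * g


def worker_step(ps: ToyParameterServer, batch: List[Sample], step_size: float) -> float:
    """One worker iteration: pull, compute locally, push. Returns the mean loss."""
    # 1. Collect the feature indices this batch actually uses.
    indices = sorted({i for features, _ in batch for i in features})
    # 2. Pull only those weights from the server.
    w = ps.pull(indices)
    # 3. Local computation (in the real system this is the native PyTorch backend).
    grads = {i: 0.0 for i in indices}
    loss = 0.0
    for features, label in batch:
        err = sum(w[i] * x for i, x in features.items()) - label
        loss += 0.5 * err * err
        for i, x in features.items():
            grads[i] += err * x / len(batch)
    # 4. Push the gradients back so the server can update the shared model.
    ps.push(grads, step_size)
    return loss / len(batch)
```

Even when a feature index is as large as one billion, the server in this sketch stores only the handful of weights that some batch has touched.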
- pytorch >= v1.1.0
We recommend using Anaconda to install PyTorch. Run the command:
    conda install -c pytorch pytorch
For detailed installation instructions, refer to the PyTorch installation documentation.
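As a quick sanity check of the version requirement above, the following sketch compares a version string against the 1.1.0 minimum. The helper names `parse_version` and `meets_minimum` are hypothetical, and the parsing is deliberately simple: it keeps only the leading numeric release components and ignores suffixes such as `+cu117`.

```python
import re
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Extract the leading numeric release, e.g. '1.13.1+cu117' -> (1, 13, 1)."""
    match = re.match(r"v?(\d+(?:\.\d+)*)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    parts = [int(p) for p in match.group(1).split(".")]
    # Pad to three components so that '1.1' compares equal to '1.1.0'.
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def meets_minimum(version: str, minimum: str = "1.1.0") -> bool:
    """True if `version` is at least `minimum`, compared numerically."""
    return parse_version(version) >= parse_version(minimum)
```

In an environment where PyTorch is installed, `meets_minimum(torch.__version__)` performs the check on the installed version. Comparing numerically matters here: a plain string comparison would wrongly rank `1.10.0` below `1.2.0`.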
- Compiling Environment Dependencies
  - Jdk >= 1.8
  - Maven >= 3.0.5
- Source Code Download
    git clone https://github.com/Angel-ML/PyTorch-On-Angel.git
- Compile
  Run the following command in the java root directory of the source code:
    mvn clean package -Dmaven.test.skip=true
  After compiling, a jar package named 'pytorch-on-angel-<version>.jar' will be generated in target under the java root directory.
- Compiling Environment Dependencies
  - gcc >= 5
  - cmake >= 3.12
- LibTorch Download
  - Download the libtorch package from the PyTorch website and extract it to a directory of your choice.
  - Set TORCH_HOME (the path to libtorch) in CMakeLists.txt under the cpp root directory.
- Compile
  Run the following commands in the cmake-build-debug directory under the cpp root directory:
    cmake ..
    make
  After compiling, a shared library named 'libtorch_angel.so' will be generated in cmake-build-debug under the cpp root directory.
PyTorch on Angel runs on Spark on Angel, so you must deploy the Spark on Angel client first. For the deployment process, refer to the Spark on Angel documentation.
Use $SPARK_HOME/bin/spark-submit from the PyTorch on Angel client to submit the application to the cluster.
Here is a submit example for DeepFM.
- Generate pytorch script model
  Users can implement their own algorithms using PyTorch. We have implemented some algorithms in python/recommendation under the root directory. For example, run the following command to generate a DeepFM model:
    python deepfm.py --input_dim 148 --n_fields 13 --embedding_dim 10 --fc_dims 10 5 1
  After executing this command, you will get a model file named deepfm.pt.
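For readers writing their own export script, the sketch below shows one way such a command line could be parsed. It is an assumption about the interface, not a copy of deepfm.py: the flag names come from the command above, but their interpretation and the helper `mlp_layer_shapes` are hypothetical. In particular, it assumes the fully connected part takes the concatenated field embeddings as input, which is the usual DeepFM layout but is not confirmed here.

```python
import argparse
from typing import List, Tuple


def build_parser() -> argparse.ArgumentParser:
    """Parser accepting the same flags as the command above (interpretations assumed)."""
    parser = argparse.ArgumentParser(description="Export a DeepFM-style script module (sketch).")
    parser.add_argument("--input_dim", type=int, required=True, help="total number of features")
    parser.add_argument("--n_fields", type=int, required=True, help="number of feature fields")
    parser.add_argument("--embedding_dim", type=int, required=True, help="embedding size per field")
    # nargs="+" lets '--fc_dims 10 5 1' arrive as the list [10, 5, 1].
    parser.add_argument("--fc_dims", type=int, nargs="+", required=True, help="fully connected layer sizes")
    return parser


def mlp_layer_shapes(n_fields: int, embedding_dim: int, fc_dims: List[int]) -> List[Tuple[int, int]]:
    """(in, out) sizes of each fully connected layer, assuming concatenated embeddings as input."""
    sizes = [n_fields * embedding_dim] + list(fc_dims)
    return list(zip(sizes[:-1], sizes[1:]))


args = build_parser().parse_args(
    "--input_dim 148 --n_fields 13 --embedding_dim 10 --fc_dims 10 5 1".split()
)
shapes = mlp_layer_shapes(args.n_fields, args.embedding_dim, args.fc_dims)
```

Under these assumptions, the example flags describe three fully connected layers whose final output has size 1.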
- Package c++ library files
  Package the libtorch/lib library files together with the shared library file libtorch_angel.so generated by the compiled cpp submodule. In this example, the package is named angel_libtorch.zip.
- Upload training data to hdfs
  Upload the training data python/recommendation/census_148d_train.libsvm.tmp to an hdfs directory.
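The packaging step can be scripted. The sketch below builds the zip with Python's standard `zipfile` module. The function name `package_libs` and the archive layout are assumptions: the submit example refers to the libraries as `./torch/angel_libtorch` after shipping `angel_libtorch.zip#torch`, which suggests that the files sit under a top-level `angel_libtorch/` folder inside the archive, so that is what this sketch produces.

```python
import zipfile
from pathlib import Path
from typing import List


def package_libs(libtorch_lib_dir, angel_so, out_zip="angel_libtorch.zip",
                 top_dir="angel_libtorch") -> List[str]:
    """Zip every file in libtorch/lib plus libtorch_angel.so under `top_dir`/.

    Returns the archive member names that were written.
    """
    lib_dir = Path(libtorch_lib_dir)
    # Regular files only; subdirectories of libtorch/lib are skipped in this sketch.
    members = sorted(p for p in lib_dir.iterdir() if p.is_file()) + [Path(angel_so)]
    names = []
    with zipfile.ZipFile(out_zip, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in members:
            arcname = f"{top_dir}/{path.name}"
            zf.write(path, arcname)
            names.append(arcname)
    return names
```

A call such as `package_libs("libtorch/lib", "cpp/cmake-build-debug/libtorch_angel.so")` would write angel_libtorch.zip to the current directory; both paths are placeholders for wherever libtorch was extracted and the cpp submodule was built.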
- Submit to Cluster
  In the command below, --archives points to the packaged c++ library files, --files points to the pytorch script model, and the application jar is the one produced by compiling the java submodule.
    source ./spark-on-angel-env.sh
    $SPARK_HOME/bin/spark-submit \
      --master yarn-cluster \
      --conf spark.ps.instances=5 \
      --conf spark.ps.cores=1 \
      --conf spark.ps.jars=$SONA_ANGEL_JARS \
      --conf spark.ps.memory=5g \
      --conf spark.ps.log.level=INFO \
      --conf spark.driver.extraJavaOptions=-Djava.library.path=$JAVA_LIBRARY_PATH:.:./torch/angel_libtorch \
      --conf spark.executor.extraJavaOptions=-Djava.library.path=$JAVA_LIBRARY_PATH:.:./torch/angel_libtorch \
      --conf spark.executor.extraLibraryPath=./torch/angel_libtorch \
      --conf spark.driver.extraLibraryPath=./torch/angel_libtorch \
      --conf spark.executorEnv.OMP_NUM_THREADS=2 \
      --conf spark.executorEnv.MKL_NUM_THREADS=2 \
      --queue $queue \
      --name "deepfm for torch on angel" \
      --jars $SONA_SPARK_JARS \
      --archives angel_libtorch.zip#torch \
      --files deepfm.pt \
      --driver-memory 5g \
      --num-executors 5 \
      --executor-cores 1 \
      --executor-memory 5g \
      --class com.tencent.angel.pytorch.examples.ClusterExample \
      ./pytorch-on-angel-1.0-SNAPSHOT.jar \
      input:$input batchSize:128 torchModelPath:deepfm.pt \
      stepSize:0.001 numEpoch:10 partitionNum:5 \
      modulePath:$output
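Long submit commands are easy to break with a misplaced backslash or comment, so some users may prefer to assemble the argument list programmatically. The following is a sketch only: `build_submit_command` is a hypothetical helper, not part of this project, and it covers just the shape of the command above (Spark options first, then the application jar, then `key:value` application arguments).

```python
import shlex
from typing import Dict, List


def build_submit_command(spark_home: str, app_jar: str, main_class: str,
                         conf: Dict[str, object], options: Dict[str, str],
                         app_args: Dict[str, object]) -> List[str]:
    """Assemble a spark-submit argument list in the order used by the example above."""
    cmd = [f"{spark_home}/bin/spark-submit", "--master", "yarn-cluster"]
    # Each Spark property becomes its own '--conf key=value' pair.
    for key, value in conf.items():
        cmd += ["--conf", f"{key}={value}"]
    # Plain options such as --queue, --archives, or --files.
    for flag, value in options.items():
        cmd += [flag, value]
    # Everything after the jar is passed to the application as 'key:value'.
    cmd += ["--class", main_class, app_jar]
    cmd += [f"{key}:{value}" for key, value in app_args.items()]
    return cmd


example = build_submit_command(
    spark_home="/path/to/spark",  # placeholder for $SPARK_HOME
    app_jar="./pytorch-on-angel-1.0-SNAPSHOT.jar",
    main_class="com.tencent.angel.pytorch.examples.ClusterExample",
    conf={"spark.ps.instances": 5, "spark.ps.memory": "5g"},
    options={"--archives": "angel_libtorch.zip#torch", "--files": "deepfm.pt"},
    app_args={"batchSize": 128, "torchModelPath": "deepfm.pt", "numEpoch": 10},
)
# shlex.join quotes each argument so the printed line can be pasted into a shell.
printable = shlex.join(example)
```

The resulting list can be handed to `subprocess.run(example, check=True)` on a machine where the Spark on Angel client is set up; passing a list avoids shell quoting problems with values such as `angel_libtorch.zip#torch`.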
Currently, PyTorch on Angel supports a series of recommendation algorithms and deep graph convolutional network algorithms.
