Merge pull request #146 from iArpanPatel/master

RadhakrishnaJ · web-flow · commit 211cbd61aec2 · 2022-12-06T18:39:08.000+05:30
Community release 1.2.0 minor document updates
diff --git a/docs/User/Troubleshooting.md b/docs/User/Troubleshooting.md
@@ -1,8 +1,20 @@
+# Generic troubleshooting tips
+
+- x.509 certificates are not configured correctly – See [https://www.linuxjournal.com/content/understanding-public-key-infrastructure-and-x509-certificates](https://www.linuxjournal.com/content/understanding-public-key-infrastructure-and-x509-certificates).
+- License server is not running or Swarm licenses are not installed - See chapter "HPE AutoPass License Server License Management" in **AutoPass License Server User Guide** for details of the web GUI management interface and how to install licenses.
+- Swarm core components (Docker containers) are not started or report errors while starting – For more information on how to start Swarm Learning, see [Running Swarm Learning](/docs/Install/Running_Swarm_Learning.md).
+- Swarm components are not able to see each other - See [Exposed Ports](/docs/Install/Exposed_port_numbers.md) to check whether the required ports are exposed.
+- User is not using the Swarm APIs correctly – See [Swarm Wheels Package](/docs/User/Swarm_client_interface-wheels_package.md) for details of the API.
+- Errors related to SWOP task definition, profile schema, or SWCI init script – These are user-defined artifacts. Verify these files for correctness.
+- Any experimental release of Ubuntu later than 20.04 LTS may result in the following error message when running SWOP tasks.
+  ```SWOP MAKE_USER_CONTAINER fails.```
+  This occurs because SWOP is not able to obtain an image of itself, owing to Docker setup differences in this experimental Ubuntu release. Switch to 20.04 LTS to resolve this issue.
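For the "components are not able to see each other" tip above, a quick reachability probe can help narrow things down before reviewing the exposed ports list. The following is a minimal sketch, not part of Swarm Learning: `port_open` and `check_port` are hypothetical helper names, and the sketch assumes that `bash` and the coreutils `timeout` command are available on the host. The host and port in the usage comment are placeholders borrowed from the examples in this repository; substitute the values from your own setup.

```shell
# Hypothetical helpers for illustration; not part of Swarm Learning.

# Returns 0 if a TCP connection to host:port succeeds within 3 seconds.
# Uses bash's /dev/tcp redirection, so bash must be installed.
port_open() {
    host="$1"
    port="$2"
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Prints a one-line verdict for the given host and port.
check_port() {
    if port_open "$1" "$2"; then
        echo "$1:$2 reachable"
    else
        echo "$1:$2 NOT reachable"
    fi
}

# Example with placeholder values:
#   check_port 172.1.1.1 30304
```

A "NOT reachable" result only shows that the TCP connection could not be opened from this host; the cause may be a firewall, a container that is not running, or an unexposed port.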
+
 # <a name="GUID-96BB1337-2B99-45C7-BA9F-3D7D3B76663E"/> Troubleshooting
 
 Troubleshooting provides solutions to commonly observed issues during Swarm Learning set up and execution.
 
-## <a name="GUID-EDAB2731-9CF3-4770-B54C-40C56D2FFDAC"/> Error code: 6002
+## 1. <a name="GUID-EDAB2731-9CF3-4770-B54C-40C56D2FFDAC"/> Error code: 6002
 
 ```
 > Error message: Unable to connect to server. Server might be wrongly configured or down.
@@ -26,7 +38,7 @@ Error code: 6002, as shown in the following screenshot occurs when Swarm Learnin
 3.  Verify if the Swarm licenses are installed using APLS web management console. For more information, see APLS User Guide.
 
 
-## Installation of HPE Swarm Learning on air-gaped systems or if the Web UI Installer runs into any issue and not able to install
+## 2. Installation of HPE Swarm Learning on air-gapped systems, or when the Web UI Installer runs into an issue and is not able to install
 
 - Download the following from HPE My Support Center(MSC) on a host system that has internet access - tar file (HPE_SWARM_LEARNING_DOCS_EXAMPLES_SCRIPTS_Q2V41-11033.tar.gz) and the signature file for the above tar file.
 - Untar the tar file under `/opt/hpe/swarm-learning`.
@@ -43,26 +55,22 @@ Error code: 6002, as shown in the following screenshot occurs when Swarm Learnin
    ```
 - Copy the tar file and Docker images to the air-gapped systems.
 
-## System resource issues if too many SLs are mapped to the same SN
+## 3. System resource issues if too many SLs are mapped to the same SN
 
 When configuring Swarm Learning you may encounter system resource issues if too many SLs are mapped to same SN. For example:
     ```
     “swarm.blCnt : WARNING: SLBlackBoardObj : errCheckinNotAllowed:CHECKIN NOT ALLOWED”
     ```
 The suggested workaround is to start by mapping 4 SLs to 1 SN, and then gradually increase the number of SLs per SN.
 
-## SWCI waits for task-runner indefinitely even after task completed or failed
+## 4. SWCI waits for task-runner indefinitely even after task completed or failed
 
 Before Swarm training starts, ensure that the ML code has no failures. Check this using the `SWARM_LOOPBACK` environment variable, and ensure that the user code runs fine and local training completes successfully.
 
-# Generic troubleshooting tips
+## 5. Error while pulling Swarm Learning images with `docker pull`: 'could not rotate trust to a new trusted root'
 
-- x.509 certificates are not configured correctly – See [https://www.linuxjournal.com/content/understanding-public-key-infrastructure-and-x509-certificates](https://www.linuxjournal.com/content/understanding-public-key-infrastructure-and-x509-certificates).
-- License server is not running or Swarm licenses are not installed - See chapter "HPE AutoPass License Server License Management" in **AutoPass License Server User Guide** for details of the web GUI management interface and how to install license.
-- Swarm core components (Docker containers) are not started or errors while starting. – For more information on how to start Swarm Learning, see [Running Swarm Learning](/docs/Install/Running_Swarm_Learning.md).
-- Swarm components are not able to see each other - See the [Exposed Ports](/docs/Install/Exposed_port_numbers.md) to see if the required ports are exposed.
-- User is not using the Swarm APIs correctly – See [Swarm Wheels Package](/docs/User/Swarm_client_interface-wheels_package.md) for details of API.
-- Errors related to SWOP task definition, profile schema, or SWCI init script – These are user defined artifacts. Verify these files for correctness.
-- Any experimental release of Ubuntu greater than LTS 20.04 may result in the following error message when running SWOP tasks.
-  ```SWOP MAKE_USER_CONTAINER fails.```
-  This occurs as SWOP is not able to obtain image of itself because of Docker setup differences in this experimental Ubuntu release. Switch to 20.04 LTS to resolve  this issue.
+Remove the following directories and retry pulling the images: <br> </br>
+~/.docker/trust/tuf/hub.myenterpriselicense.hpe.com/hpe_eval/swarm-learning/swci/
+~/.docker/trust/tuf/hub.myenterpriselicense.hpe.com/hpe_eval/swarm-learning/sn/
+~/.docker/trust/tuf/hub.myenterpriselicense.hpe.com/hpe_eval/swarm-learning/swop/
+~/.docker/trust/tuf/hub.myenterpriselicense.hpe.com/hpe_eval/swarm-learning/sl/
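The directory cleanup listed above can be sketched as a small shell function. This is an illustrative sketch only: `remove_swarm_trust_dirs` is a hypothetical helper name, not something shipped with Swarm Learning or Docker. It assumes the default trust data location shown in the paths above, and accepts an alternative base directory as an optional first argument.

```shell
# Hypothetical helper for illustration; not part of Swarm Learning or Docker.
# Removes the per-image Docker trust directories listed above.
remove_swarm_trust_dirs() {
    # Default base path is taken from the directories listed above;
    # pass a different base directory as $1 to override it.
    base="${1:-$HOME/.docker/trust/tuf/hub.myenterpriselicense.hpe.com/hpe_eval/swarm-learning}"
    for component in swci sn swop sl; do
        dir="$base/$component"
        if [ -d "$dir" ]; then
            rm -rf -- "$dir"
            echo "removed $dir"
        else
            echo "skipped $dir (not found)"
        fi
    done
}

# Usage: remove_swarm_trust_dirs
```

After the directories are removed, retry pulling the images.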
diff --git a/examples/fraud-detection/Credit_card_fraud_detection.md b/examples/fraud-detection/Credit_card_fraud_detection.md
diff --git a/examples/fraud-detection/README.md b/examples/fraud-detection/README.md
@@ -17,7 +17,7 @@ This example shows the Swarm training of the credit card fraud detection model u
 
 The following image illustrates a cluster setup that uses only one host:
 
-![Credit Card Fraud Detection](/docs/User/GUID-BE2185B8-5C3B-4BD3-91FF-9ABC77D0720C-high.png)
+<img width="80%" height="100%" src="/docs/User/GUID-BE2185B8-5C3B-4BD3-91FF-9ABC77D0720C-high.png">
 
 -   This example uses one SN node. The name of the Docker container representing this node is SN1. SN1 is also the Sentinel Node. SN1 runs on the host 172.1.1.1.
 
@@ -119,8 +119,13 @@ NOTE: If required, according to environment, modify IP and proxy in the profile
 --key=workspace/fraud-detection/cert/swop-1-key.pem \
 --cert=workspace/fraud-detection/cert/swop-1-cert.pem \
 --capath=workspace/fraud-detection/cert/ca/capath \
+-e SWOP_KEEP_CONTAINERS=True \
 -e http_proxy= -e https_proxy= --apls-ip=172.1.1.1
 ```
+<blockquote>
+   NOTE: `-e SWOP_KEEP_CONTAINERS=True` is an optional argument; by default, it is `False`.
+   SWOP_KEEP_CONTAINERS is set to True so that SWOP doesn't remove stopped SL and ML containers. Without this setting, SWOP removes them automatically if there is any internal error in SL or ML. Refer to the documentation of SWOP_KEEP_CONTAINERS for more details.
+</blockquote>
 
 10. Run SWCI node \(SWCI1\). It creates, finalizes and assigns below task to task-framework for sequential execution:
 
diff --git a/examples/mnist/README.md b/examples/mnist/README.md
@@ -144,9 +144,14 @@ swarm.blCnt : INFO : Starting SWARM-API-SERVER on port: 30304
 --usr-dir=workspace/mnist/swop --profile-file-name=swop1_profile.yaml \
 --key=workspace/mnist/cert/swop-1-key.pem \
 --cert=workspace/mnist/cert/swop-1-cert.pem \
---capath=workspace/mnist/cert/ca/capath -e http_proxy= -e \
-https_proxy= --apls-ip=172.1.1.1
+--capath=workspace/mnist/cert/ca/capath \
+-e SWOP_KEEP_CONTAINERS=True \
+-e http_proxy= -e https_proxy= --apls-ip=172.1.1.1
 ```
+<blockquote>
+   NOTE: `-e SWOP_KEEP_CONTAINERS=True` is an optional argument; by default, it is `False`.
+   SWOP_KEEP_CONTAINERS is set to True so that SWOP doesn't remove stopped SL and ML containers. Without this setting, SWOP removes them automatically if there is any internal error in SL or ML. Refer to the documentation of SWOP_KEEP_CONTAINERS for more details.
+</blockquote>
 
    On host-2, run SWOP node (SWOP2).
 
@@ -159,9 +164,15 @@ https_proxy= --apls-ip=172.1.1.1
    --usr-dir=workspace/mnist/swop --profile-file-name=swop2_profile.yaml \
    --key=workspace/mnist/cert/swop-2-key.pem \
    --cert=workspace/mnist/cert/swop-2-cert.pem \
-   --capath=workspace/mnist/cert/ca/capath -e http_proxy= -e \
-   https_proxy= --apls-ip=172.1.1.1
+   --capath=workspace/mnist/cert/ca/capath \
+   -e SWOP_KEEP_CONTAINERS=True \
+   -e http_proxy= -e https_proxy= --apls-ip=172.1.1.1
    ```
+<blockquote>
+   NOTE: `-e SWOP_KEEP_CONTAINERS=True` is an optional argument; by default, it is `False`.
+   SWOP_KEEP_CONTAINERS is set to True so that SWOP doesn't remove stopped SL and ML containers. Without this setting, SWOP removes them automatically if there is any internal error in SL or ML. Refer to the documentation of SWOP_KEEP_CONTAINERS for more details.
+</blockquote>
+   
 
 10. On host-1, run SWCI node. It creates, finalizes, and assigns two tasks sequentially for execution: