Commit 211cbd6

Merge pull request #146 from iArpanPatel/master
Community release 1.2.0 minor document updates
2 parents 3f35f85 + ffc081d commit 211cbd6

4 files changed

+43
-177
lines changed

docs/User/Troubleshooting.md

Lines changed: 22 additions & 14 deletions
@@ -1,8 +1,20 @@
1+
# Generic troubleshooting tips
2+
3+
- x.509 certificates are not configured correctly – See [https://www.linuxjournal.com/content/understanding-public-key-infrastructure-and-x509-certificates](https://www.linuxjournal.com/content/understanding-public-key-infrastructure-and-x509-certificates).
4+
- License server is not running or Swarm licenses are not installed - See chapter "HPE AutoPass License Server License Management" in **AutoPass License Server User Guide** for details of the web GUI management interface and how to install license.
5+
- Swarm core components (Docker containers) are not started or errors while starting. – For more information on how to start Swarm Learning, see [Running Swarm Learning](/docs/Install/Running_Swarm_Learning.md).
6+
- Swarm components are not able to see each other - See the [Exposed Ports](/docs/Install/Exposed_port_numbers.md) to see if the required ports are exposed.
7+
- User is not using the Swarm APIs correctly – See [Swarm Wheels Package](/docs/User/Swarm_client_interface-wheels_package.md) for details of API.
8+
- Errors related to SWOP task definition, profile schema, or SWCI init script – These are user defined artifacts. Verify these files for correctness.
9+
- Any experimental release of Ubuntu greater than LTS 20.04 may result in the following error message when running SWOP tasks.
10+
```SWOP MAKE_USER_CONTAINER fails.```
11+
This occurs as SWOP is not able to obtain image of itself because of Docker setup differences in this experimental Ubuntu release. Switch to 20.04 LTS to resolve this issue.
12+
113
# <a name="GUID-96BB1337-2B99-45C7-BA9F-3D7D3B76663E"/> Troubleshooting
214

315
Troubleshooting provides solutions to commonly observed issues during Swarm Learning set up and execution.
416

5-
## <a name="GUID-EDAB2731-9CF3-4770-B54C-40C56D2FFDAC"/> Error code: 6002
17+
## 1. <a name="GUID-EDAB2731-9CF3-4770-B54C-40C56D2FFDAC"/> Error code: 6002
618

719
```
820
> Error message: Unable to connect to server. Server might be wrongly configured or down.
@@ -26,7 +38,7 @@ Error code: 6002, as shown in the following screenshot occurs when Swarm Learnin
2638
3. Verify if the Swarm licenses are installed using APLS web management console. For more information, see APLS User Guide.
2739

2840

29-
## Installation of HPE Swarm Learning on air-gapped systems, or when the Web UI Installer runs into an issue and cannot install
41+
## 2. Installation of HPE Swarm Learning on air-gapped systems, or when the Web UI Installer runs into an issue and cannot install
3042

3143
- Download the following from HPE My Support Center(MSC) on a host system that has internet access - tar file (HPE_SWARM_LEARNING_DOCS_EXAMPLES_SCRIPTS_Q2V41-11033.tar.gz) and the signature file for the above tar file.
3244
- Untar the tar file under `/opt/hpe/swarm-learning`.
@@ -43,26 +55,22 @@ Error code: 6002, as shown in the following screenshot occurs when Swarm Learnin
4355
```
4456
- Copy the tar file and Docker images to the air-gapped systems.
4557

46-
## System resource issues if too many SLs are mapped to the same SN
58+
## 3. System resource issues if too many SLs are mapped to the same SN
4759

4860
When configuring Swarm Learning, you may encounter system resource issues if too many SLs are mapped to the same SN. For example:
4961
```
5062
“swarm.blCnt : WARNING: SLBlackBoardObj : errCheckinNotAllowed:CHECKIN NOT ALLOWED”
5163
```
5264
The suggested workaround is to start by mapping 4 SLs to 1 SN, and then gradually increase the number of SLs per SN.
5365

54-
## SWCI waits for task-runner indefinitely even after task completed or failed
66+
## 4. SWCI waits for task-runner indefinitely even after task completed or failed
5567

5668
Ensure that the ML code has no failures before Swarm training starts. Check using the `SWARM_LOOPBACK` environment variable, and confirm that the user code runs fine and local training completes successfully.
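
The loopback check above could be wrapped in a small shell helper. This is only a sketch under assumptions: the function name `run_loopback_check` is hypothetical, and the value `True` for `SWARM_LOOPBACK` is assumed rather than taken from this page, so confirm the accepted value in the Swarm Learning documentation.

```shell
# Hypothetical helper (not part of Swarm Learning): runs the given command
# with SWARM_LOOPBACK set and reports whether it exited cleanly.
run_loopback_check() {
    # "$@" is the command that starts the ML program.
    # Assumption: "True" is the value that enables loopback mode.
    if env SWARM_LOOPBACK=True "$@"; then
        echo "local run completed successfully"
    else
        local rc=$?
        echo "local run failed with exit status $rc" >&2
        return "$rc"
    fi
}
```

For example, `run_loopback_check python3 model.py` would run a (hypothetical) `model.py` locally; a non-zero exit status suggests the ML code should be fixed before starting Swarm training.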
5769

58-
# Generic troubleshooting tips
70+
## 5. Error while docker pull Swarm Learning images: 'could not rotate trust to a new trusted root'
5971

60-
- x.509 certificates are not configured correctly – See [https://www.linuxjournal.com/content/understanding-public-key-infrastructure-and-x509-certificates](https://www.linuxjournal.com/content/understanding-public-key-infrastructure-and-x509-certificates).
61-
- License server is not running or Swarm licenses are not installed - See chapter "HPE AutoPass License Server License Management" in **AutoPass License Server User Guide** for details of the web GUI management interface and how to install license.
62-
- Swarm core components (Docker containers) are not started or errors while starting. – For more information on how to start Swarm Learning, see [Running Swarm Learning](/docs/Install/Running_Swarm_Learning.md).
63-
- Swarm components are not able to see each other - See the [Exposed Ports](/docs/Install/Exposed_port_numbers.md) to see if the required ports are exposed.
64-
- User is not using the Swarm APIs correctly – See [Swarm Wheels Package](/docs/User/Swarm_client_interface-wheels_package.md) for details of API.
65-
- Errors related to SWOP task definition, profile schema, or SWCI init script – These are user defined artifacts. Verify these files for correctness.
66-
- Any experimental release of Ubuntu greater than LTS 20.04 may result in the following error message when running SWOP tasks.
67-
```SWOP MAKE_USER_CONTAINER fails.```
68-
This occurs as SWOP is not able to obtain image of itself because of Docker setup differences in this experimental Ubuntu release. Switch to 20.04 LTS to resolve this issue.
72+
Remove the directories below and retry the image pull: <br> </br>
73+
~/.docker/trust/tuf/hub.myenterpriselicense.hpe.com/hpe_eval/swarm-learning/swci/
74+
~/.docker/trust/tuf/hub.myenterpriselicense.hpe.com/hpe_eval/swarm-learning/sn/
75+
~/.docker/trust/tuf/hub.myenterpriselicense.hpe.com/hpe_eval/swarm-learning/swop/
76+
~/.docker/trust/tuf/hub.myenterpriselicense.hpe.com/hpe_eval/swarm-learning/sl/
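
The cleanup above can be scripted. The following is a minimal sketch, not an official HPE tool: the function name `remove_swarm_trust_dirs` and its optional base-directory argument are assumptions added for illustration, while the directory names come from the list above.

```shell
# Hypothetical helper (not part of Swarm Learning): removes the cached Docker
# trust metadata for the four Swarm Learning image repositories listed above.
remove_swarm_trust_dirs() {
    # Optional first argument overrides the base directory (for example, to try
    # the function on a copy); the default is the path listed above.
    local base="${1:-$HOME/.docker/trust/tuf/hub.myenterpriselicense.hpe.com/hpe_eval/swarm-learning}"
    local component
    for component in swci sn swop sl; do
        if [ -d "$base/$component" ]; then
            rm -rf -- "${base:?}/${component}"
            echo "removed $base/$component"
        else
            echo "not present: $base/$component"
        fi
    done
}
```

Calling `remove_swarm_trust_dirs` with no arguments targets the directories under `~/.docker/trust/tuf/`; the image pull can then be retried.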

examples/fraud-detection/Credit_card_fraud_detection.md

Lines changed: 0 additions & 158 deletions
This file was deleted.

examples/fraud-detection/README.md

Lines changed: 6 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -17,7 +17,7 @@ This example shows the Swarm training of the credit card fraud detection model u
1717

1818
The following image illustrates a cluster setup that uses only one host:
1919

20-
![Credit Card Fraud Detection](/docs/User/GUID-BE2185B8-5C3B-4BD3-91FF-9ABC77D0720C-high.png)
20+
<img width="80%" height="100%" src="/docs/User/GUID-BE2185B8-5C3B-4BD3-91FF-9ABC77D0720C-high.png">
2121

2222
- This example uses one SN node. The Docker container representing this node is named SN1. SN1 is also the Sentinel Node. SN1 runs on the host 172.1.1.1.
2323

@@ -119,8 +119,13 @@ NOTE: If required, according to environment, modify IP and proxy in the profile
119119
--key=workspace/fraud-detection/cert/swop-1-key.pem \
120120
--cert=workspace/fraud-detection/cert/swop-1-cert.pem \
121121
--capath=workspace/fraud-detection/cert/ca/capath \
122+
-e SWOP_KEEP_CONTAINERS=True \
122123
-e http_proxy= -e https_proxy= --apls-ip=172.1.1.1
123124
```
125+
<blockquote>
126+
NOTE: `-e SWOP_KEEP_CONTAINERS=True` is an optional argument; by default it is `False`.
127+
SWOP_KEEP_CONTAINERS is set to True so that SWOP doesn't remove stopped SL and ML containers. Without this setting, if there is any internal error in SL or ML, SWOP removes them automatically. Refer to the documentation of SWOP_KEEP_CONTAINERS for more details.
128+
</blockquote>
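
When stopped containers are kept as described in the note, their logs can be inspected with standard Docker commands. The helper below is a sketch only: `show_exited_container_logs` is a hypothetical name, and it lists every exited container on the host rather than only SL and ML containers, so filter further as needed.

```shell
# Hypothetical helper (not part of Swarm Learning): prints the last log lines
# of every exited container on this host.
show_exited_container_logs() {
    # Optional first argument: number of log lines to show per container.
    local tail_lines="${1:-50}"
    local id
    # `docker ps -a --filter status=exited -q` prints only the IDs of
    # containers that have exited.
    for id in $(docker ps -a --filter status=exited -q); do
        echo "=== container $id ==="
        docker logs --tail "$tail_lines" "$id" 2>&1
    done
}
```

For example, `show_exited_container_logs 100` shows the last 100 log lines of each stopped container.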
124129
125130
10. Run SWCI node \(SWCI1\). It creates, finalizes and assigns below task to task-framework for sequential execution:
126131

examples/mnist/README.md

Lines changed: 15 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -144,9 +144,14 @@ swarm.blCnt : INFO : Starting SWARM-API-SERVER on port: 30304
144144
--usr-dir=workspace/mnist/swop --profile-file-name=swop1_profile.yaml \
145145
--key=workspace/mnist/cert/swop-1-key.pem \
146146
--cert=workspace/mnist/cert/swop-1-cert.pem \
147-
--capath=workspace/mnist/cert/ca/capath -e http_proxy= -e \
148-
https_proxy= --apls-ip=172.1.1.1
147+
--capath=workspace/mnist/cert/ca/capath \
148+
-e SWOP_KEEP_CONTAINERS=True \
149+
-e http_proxy= -e https_proxy= --apls-ip=172.1.1.1
149150
```
151+
<blockquote>
152+
NOTE: `-e SWOP_KEEP_CONTAINERS=True` is an optional argument; by default it is `False`.
153+
SWOP_KEEP_CONTAINERS is set to True so that SWOP doesn't remove stopped SL and ML containers. Without this setting, if there is any internal error in SL or ML, SWOP removes them automatically. Refer to the documentation of SWOP_KEEP_CONTAINERS for more details.
154+
</blockquote>
150155
151156
On host-2, run SWOP node (SWOP2).
152157
@@ -159,9 +164,15 @@ https_proxy= --apls-ip=172.1.1.1
159164
--usr-dir=workspace/mnist/swop --profile-file-name=swop2_profile.yaml \
160165
--key=workspace/mnist/cert/swop-2-key.pem \
161166
--cert=workspace/mnist/cert/swop-2-cert.pem \
162-
--capath=workspace/mnist/cert/ca/capath -e http_proxy= -e \
163-
https_proxy= --apls-ip=172.1.1.1
167+
--capath=workspace/mnist/cert/ca/capath \
168+
-e SWOP_KEEP_CONTAINERS=True \
169+
-e http_proxy= -e https_proxy= --apls-ip=172.1.1.1
164170
```
171+
<blockquote>
172+
NOTE: `-e SWOP_KEEP_CONTAINERS=True` is an optional argument; by default it is `False`.
173+
SWOP_KEEP_CONTAINERS is set to True so that SWOP doesn't remove stopped SL and ML containers. Without this setting, if there is any internal error in SL or ML, SWOP removes them automatically. Refer to the documentation of SWOP_KEEP_CONTAINERS for more details.
174+
</blockquote>
175+
165176
166177
10. On host-1, run SWCI node. It creates, finalizes, and assigns two tasks sequentially for execution:
167178
