docs/User/Troubleshooting.md
+22 -14 (22 additions & 14 deletions)
@@ -1,8 +1,20 @@
+# Generic troubleshooting tips
+
+- x.509 certificates are not configured correctly – See [https://www.linuxjournal.com/content/understanding-public-key-infrastructure-and-x509-certificates](https://www.linuxjournal.com/content/understanding-public-key-infrastructure-and-x509-certificates).
+- License server is not running or Swarm licenses are not installed - See chapter "HPE AutoPass License Server License Management" in **AutoPass License Server User Guide** for details of the web GUI management interface and how to install license.
+- Swarm core components (Docker containers) are not started or errors while starting. – For more information on how to start Swarm Learning, see [Running Swarm Learning](/docs/Install/Running_Swarm_Learning.md).
+- Swarm components are not able to see each other - See the [Exposed Ports](/docs/Install/Exposed_port_numbers.md) to see if the required ports are exposed.
+- User is not using the Swarm APIs correctly – See [Swarm Wheels Package](/docs/User/Swarm_client_interface-wheels_package.md) for details of API.
+- Errors related to SWOP task definition, profile schema, or SWCI init script – These are user defined artifacts. Verify these files for correctness.
+- Any experimental release of Ubuntu greater than LTS 20.04 may result in the following error message when running SWOP tasks.
+```SWOP MAKE_USER_CONTAINER fails.```
+This occurs as SWOP is not able to obtain image of itself because of Docker setup differences in this experimental Ubuntu release. Switch to 20.04 LTS to resolve this issue.
 > Error message: Unable to connect to server. Server might be wrongly configured or down.
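For the "Swarm components are not able to see each other" tip in the hunk above, a quick TCP reachability probe can help narrow things down before digging into container logs. The following is a minimal Python sketch and not part of Swarm Learning: the helper name `port_is_reachable` is hypothetical, and the host and port in the usage comment are placeholders to be replaced with values from the Exposed Ports document.

```python
import socket


def port_is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


# Hypothetical usage (placeholder host and port, not taken from the docs):
#   print(port_is_reachable("172.1.1.1", 30303))
```

A `False` result only says that a TCP handshake did not complete from the machine running the probe; it does not distinguish between a closed port, a firewall rule, or a container that is not running.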
@@ -26,7 +38,7 @@ Error code: 6002, as shown in the following screenshot occurs when Swarm Learnin
 3. Verify if the Swarm licenses are installed using APLS web management console. For more information, see APLS User Guide.


-## Installation of HPE Swarm Learning on air-gaped systems or if the Web UI Installer runs into any issue and not able to install
+## 2. Installation of HPE Swarm Learning on air-gaped systems or if the Web UI Installer runs into any issue and not able to install

 - Download the following from HPE My Support Center(MSC) on a host system that has internet access - tar file (HPE_SWARM_LEARNING_DOCS_EXAMPLES_SCRIPTS_Q2V41-11033.tar.gz) and the signature file for the above tar file.
 - Untar the tar file under `/opt/hpe/swarm-learning`.
@@ -43,26 +55,22 @@ Error code: 6002, as shown in the following screenshot occurs when Swarm Learnin
 ```
 - Copy the tar file and Docker images to the air-gaped systems.

-## System resource issues if too many SLs are mapped to the same SN
+## 3. System resource issues if too many SLs are mapped to the same SN

 When configuring Swarm Learning you may encounter system resource issues if too many SLs are mapped to same SN. For example:
 ```
 “swarm.blCnt : WARNING: SLBlackBoardObj : errCheckinNotAllowed:CHECKIN NOT ALLOWED”
 ```
 The suggested workaround is to start with mapping 4 SL to 1 SN. Then after, slowly scale no of SLs to SN
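The 4:1 starting point in the line above can be turned into a quick sizing check. This is only an arithmetic sketch: the function name `sn_nodes_needed` and its `sl_per_sn` parameter are illustrative assumptions, and the ratio that a given host can actually sustain depends on its resources.

```python
import math


def sn_nodes_needed(num_sl: int, sl_per_sn: int = 4) -> int:
    """Smallest number of SN nodes such that no SN serves more than `sl_per_sn` SL nodes."""
    if num_sl < 0 or sl_per_sn < 1:
        raise ValueError("num_sl must be >= 0 and sl_per_sn must be >= 1")
    # Round up: a partially filled SN still counts as one SN.
    return math.ceil(num_sl / sl_per_sn)


# Example: 10 SL nodes at the suggested starting ratio of 4 SL per SN.
print(sn_nodes_needed(10))  # 3
```

Raising `sl_per_sn` step by step while watching for the `CHECKIN NOT ALLOWED` warning mirrors the "scale slowly" advice.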

-## SWCI waits for task-runner indefinitely even after task completed or failed
+## 4. SWCI waits for task-runner indefinitely even after task completed or failed

 User to ensure no failure in ML code before Swarm training starts. Check using `SWARM_LOOPBACK ENV` and ensure, user coderuns fine and local training completes successfully.
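One way to run the check described in the line above is to launch the user's training script once with `SWARM_LOOPBACK` set in its environment and inspect the exit code. The sketch below is a hedged illustration, not the documented procedure: the helper name `run_in_loopback`, the value `"True"`, and the script name in the usage comment are assumptions, and how the variable reaches the ML container in a real deployment depends on how that container is started.

```python
import os
import subprocess
import sys
from typing import Sequence


def run_in_loopback(cmd: Sequence[str]) -> int:
    """Run `cmd` with SWARM_LOOPBACK set in its environment and return its exit code."""
    env = dict(os.environ)           # copy, so the parent environment is untouched
    env["SWARM_LOOPBACK"] = "True"   # assumed value; check the Swarm Learning docs
    completed = subprocess.run(list(cmd), env=env)
    return completed.returncode


# Hypothetical usage: "model.py" stands in for the user's own training script.
#   exit_code = run_in_loopback([sys.executable, "model.py"])
#   if exit_code != 0:
#       print("Local training failed; fix the ML code before starting Swarm training.")
```

A non-zero exit code from the local run points at a problem in the ML code itself rather than in the Swarm setup.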

-#Generic troubleshooting tips
+## 5. Error while docker pull Swarm Learning images: 'could not rotate trust to a new trusted root'

-- x.509 certificates are not configured correctly – See [https://www.linuxjournal.com/content/understanding-public-key-infrastructure-and-x509-certificates](https://www.linuxjournal.com/content/understanding-public-key-infrastructure-and-x509-certificates).
-- License server is not running or Swarm licenses are not installed - See chapter "HPE AutoPass License Server License Management" in **AutoPass License Server User Guide** for details of the web GUI management interface and how to install license.
-- Swarm core components (Docker containers) are not started or errors while starting. – For more information on how to start Swarm Learning, see [Running Swarm Learning](/docs/Install/Running_Swarm_Learning.md).
-- Swarm components are not able to see each other - See the [Exposed Ports](/docs/Install/Exposed_port_numbers.md) to see if the required ports are exposed.
-- User is not using the Swarm APIs correctly – See [Swarm Wheels Package](/docs/User/Swarm_client_interface-wheels_package.md) for details of API.
-- Errors related to SWOP task definition, profile schema, or SWCI init script – These are user defined artifacts. Verify these files for correctness.
-- Any experimental release of Ubuntu greater than LTS 20.04 may result in the following error message when running SWOP tasks.
-```SWOP MAKE_USER_CONTAINER fails.```
-This occurs as SWOP is not able to obtain image of itself because of Docker setup differences in this experimental Ubuntu release. Switch to 20.04 LTS to resolve this issue.
+Please remove below directories and re-try pull images: <br> </br>
 - This example uses one SN node. The names of the docker containers representing this node is SN1. SN1 is also the Sentinel Node. SN1 runs on the host 172.1.1.1.

@@ -119,8 +119,13 @@ NOTE: If required, according to environment, modify IP and proxy in the profile
 NOTE: `-e SWOP_KEEP_CONTAINERS=True` is an optional argument, by default it would be `False`.
+SWOP_KEEP_CONTAINERS is set to True so that SWOP doesn't remove stopped SL and ML containers. With out this setting if there is any internal error in SL or ML then SWOP removes them automatically. Refer documentation of SWOP_KEEP_CONTAINERS for more details.
+</blockquote>

 10. Run SWCI node \(SWCI1\). It creates, finalizes and assigns below task to task-framework for sequential execution:
 NOTE: `-e SWOP_KEEP_CONTAINERS=True` is an optional argument, by default it would be `False`.
+SWOP_KEEP_CONTAINERS is set to True so that SWOP doesn't remove stopped SL and ML containers. With out this setting if there is any internal error in SL or ML then SWOP removes them automatically. Refer documentation of SWOP_KEEP_CONTAINERS for more details.
 NOTE: `-e SWOP_KEEP_CONTAINERS=True` is an optional argument, by default it would be `False`.
+SWOP_KEEP_CONTAINERS is set to True so that SWOP doesn't remove stopped SL and ML containers. With out this setting if there is any internal error in SL or ML then SWOP removes them automatically. Refer documentation of SWOP_KEEP_CONTAINERS for more details.
+</blockquote>
+

 10. On host-1, run SWCI node. It creates, finalizes, and assigns two tasks sequentially for execution: