Commit 6ac993b

Add docs detailing our ASF Jenkins usage (#2767)

Hopefully these docs will answer a lot of questions I had to dig up answers to on my own following a recent outage of our ASF Jenkins agents.

1 parent d0000b4 commit 6ac993b

1 file changed: dev-docs/asf-jenkins.adoc (+72, -0)
= ASF Jenkins Setup
:toc: left

The Solr project uses a Jenkins instance provided by the Apache Software Foundation ("ASF") for running tests, validation, etc.

This file aims to document our https://ci-builds.apache.org/job/Solr/[ASF Jenkins] usage and administration, to prevent it from becoming "tribal knowledge" understood by just a few.
7+
8+
== Jobs
9+
10+
We run a number of jobs on Jenkins, each validating an overlapping set of concerns:
11+
12+
* `Solr-Artifacts-*` - daily jobs that run `./gradlew assemble` to ensure that build artifacts (except Docker images) can be created successfully
* `Solr-check-*` - "hourly" jobs that run all project tests and static analysis (i.e. `test`, `integrationTest`, and `check`)
* `Solr-Docker-Nightly-*` - daily jobs that run `./gradlew testDocker dockerPush` to validate Docker image packaging. Snapshot images are pushed to hub.docker.com
* `Solr-reference-guide-*` - hourly jobs that build the Solr reference guide via `./gradlew checkSite` and push the resulting artifact to the staging/preview site `nightlies.apache.org`
* `Solr-Smoketest-*` - daily jobs that produce a snapshot release (via the `assembleRelease` task) and run the release smoketester

Most jobs that validate particular build artifacts run "daily", which is sufficient to prevent any large breaks from creeping into the build.

On the other hand, jobs that run tests are triggered "hourly" in order to squeeze as many test runs as possible out of our Jenkins hardware.
This is a necessary consequence of Solr's heavy use of randomization in its test suite.
"Hourly" scheduling ensures that a test run is either in progress or waiting in the build queue at all times, which gets us the maximum number of data points from our hardware.
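For reference, "hourly" and "daily" correspond to Jenkins cron-style build triggers roughly like the following. This is only a sketch; the authoritative schedules live in the job configurations on ci-builds.apache.org:

```
# "Build periodically" triggers, in Jenkins cron syntax
# ("H" hashes the job name to spread load across the hour/day):
H * * * *    # "hourly" jobs, e.g. Solr-check-*
H H * * *    # "daily" jobs, e.g. Solr-Artifacts-*, Solr-Smoketest-*
```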
23+
24+
== Jenkins Agents
25+
26+
All Solr jobs run on Jenkins agents marked with the 'solr' label.
27+
Currently, this maps to two Jenkins agents:
* `lucene-solr-1` - available at lucene1-us-west.apache.org
* `lucene-solr-2` - available (confusingly) at lucene-us-west.apache.org
These agents are "project-specific" VMs shared by the Lucene and Solr projects.
That is: they are VMs requested by a project for its exclusive use.
(INFRA policy appears to be that each Apache project may request 1 dedicated VM; it's unclear how Solr ended up with 2.)
Maintenance of these agent VMs falls into a bit of a gray area.
INFRA will still intervene when asked: to reboot nodes, to deploy OS upgrades, etc.
But some of the burden also falls on Lucene and Solr as project teams to monitor the VMs and keep them healthy.

=== Accessing Jenkins Agents

With a few steps, Solr committers can access our project's Jenkins agent VMs via SSH to troubleshoot and resolve issues.
1. Ensure your account on id.apache.org has an SSH key associated with it.
2. Ask INFRA to give your Apache ID SSH access to these boxes. (See https://issues.apache.org/jira/browse/INFRA-3682[this JIRA ticket] for an example.)
3. SSH into the desired box with: `ssh <apache-id>@$HOSTNAME` (where `$HOSTNAME` is either `lucene1-us-west.apache.org` or `lucene-us-west.apache.org`)

Often, SSH access alone is not sufficient, and administrators require "root" access to diagnose and solve problems.
Sudo/su privileges can be obtained via a one-time password ("OTP") challenge, managed by the "Orthrus PAM" module.
Users in need of root access can perform the following steps:
1. Open the ASF's https://selfserve.apache.org/otp-calculator.html[OTP Generator Tool] in your browser of choice.
2. Run `ortpasswd` on the machine. This will print an OTP "challenge" (e.g. `otp-md5 497 lu6126`) and provide a password prompt. This prompt should be given an OTP password, generated in steps 3-5 below.
3. Copy the "challenge" from the previous step into the relevant field on the "OTP Generator Tool" form.
4. Choose a password to use for OTP challenges (or recall one you've used in the past), and type it into the relevant field on the "OTP Generator Tool" form.
5. Click "Compute", and copy the first line from the "Response" box into your SSH session's password prompt. You're now established in the "Orthrus PAM" system.
6. Run a command requesting `su` escalation (e.g. `sudo su -`). This should print another "challenge" and password prompt. Repeat steps 3-5.
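The challenge/response exchange above follows the RFC 2289 one-time password scheme (`otp-md5`). Assuming the ASF generator tool implements that RFC, the computation behind its "Compute" button can be sketched in Python; the passphrase below is just a placeholder:

```python
import hashlib

def fold_md5(data: bytes) -> bytes:
    """Fold a 16-byte MD5 digest to 8 bytes by XOR-ing its two halves (RFC 2289)."""
    digest = hashlib.md5(data).digest()
    return bytes(a ^ b for a, b in zip(digest[:8], digest[8:]))

def otp_md5(password: str, seed: str, count: int) -> bytes:
    """Compute the 64-bit one-time key for a challenge like 'otp-md5 <count> <seed>'."""
    # Initial step: hash seed + password (the seed is lowercased per RFC 2289)
    key = fold_md5((seed.lower() + password).encode())
    # Then hash the folded result <count> more times
    for _ in range(count):
        key = fold_md5(key)
    return key

# e.g. for the example challenge "otp-md5 497 lu6126" above:
print(otp_md5("placeholder passphrase", "lu6126", 497).hex())
```

The generator tool renders this 64-bit key as hex and/or the RFC 2289 six-word encoding; either form is typically accepted at the password prompt.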

If this fails at any point, open a ticket with INFRA.
You may need to be added to the 'sudoers' file for the VM(s) in question.

=== Known Jenkins Issues

One recurring problem with the Jenkins agents is that they periodically run out of disk space.
Usually this happens when enough "workspaces" are orphaned or left behind, consuming all of the agent's disk space.

Solr Jenkins jobs are currently configured to clean up the previous workspace at the *start* of the subsequent run.
This avoids orphans in the common case, but leaves workspaces behind any time a job is renamed or deleted (as happens during the Solr release process).

Luckily, this has an easy fix: SSH into the agent VM and delete any workspaces that are no longer needed in `/home/jenkins/jenkins-slave/workspace/Solr`.
Any workspace that doesn't correspond to a https://ci-builds.apache.org/job/Solr/[currently existing job] can be safely deleted.
(It may also be worth comparing the Lucene workspaces in `/home/jenkins/jenkins-slave/workspace/Lucene` to https://ci-builds.apache.org/job/Lucene/[that project's list of jobs].)
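That comparison can also be scripted. A minimal sketch, run here against a throwaway temp directory rather than the real workspace root, with hypothetical job names; on a real agent you would point `ws_root` at the path above and fill `live_jobs` by hand from the Jenkins job list:

```python
import shutil, tempfile
from pathlib import Path

# Stand-in for /home/jenkins/jenkins-slave/workspace/Solr on a real agent.
ws_root = Path(tempfile.mkdtemp())
for name in ["Solr-check-main", "Solr-Smoketest-9.4"]:  # hypothetical workspaces
    (ws_root / name).mkdir()

# Jobs that still exist on https://ci-builds.apache.org/job/Solr/ (checked by hand)
live_jobs = {"Solr-check-main"}

for ws in ws_root.iterdir():
    if ws.is_dir() and ws.name not in live_jobs:
        shutil.rmtree(ws)  # workspace no longer matches a job: safe to delete
```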
