
Commit 2998460

ndokos authored and portante committed
Decouple pbench-move-results from server
pbench-move-results knows how to query a server to get a destination for the tarballs that it has to move. It then checks that it can get to the destination. If that is successful, it packages up a tarball and calculates an md5. These two are copied to the destination. It checks that the tarball's md5 is correct and, if so, it marks the transaction complete by renaming the md5 file at the destination.

The implementation in this commit will apply to version-002 clients: when they query the server they will get a different destination than the version-001 clients. See below for the server-side handling. Version-001 agents will continue to do things as they do today (but see below for modified server-side handling of version-001 agents).

pbench-move-results gets a new --user option. The "user = <value>" option gets added to the metadata.log in the [run] section before the tarball is created. Its default value comes from the env variable PBENCH_USER (if it exists), but may be overridden on the command line. Any prefix specified on the command line is also handled the same way: the "prefix = <value>" option gets added to the [run] section of the metadata log.

Tarballs and md5 sum files are created in a temp directory which is copied (recursively) to the server. The md5 file is called <resultname>.tar.xz.md5.check and is renamed to omit the .check suffix after a successful md5 check, thus signalling completion of the moving and checking of the tarball.

The number of ssh invocations has thus been reduced to one initial one to check connectivity plus two for each tarball: one to copy the temp directory over and one to check the MD5 sum and rename the file.

Duplicate detection has been eliminated from the agent side. It is done on the server side.

pbench-server-prep-shim-002 is a new script: it will scour the version-002 reception directory for xxx.tar.xz.md5 files (i.e. files that the agent has already checked and renamed).
For each file, it will check the md5 sum and, if successful, it will copy the corresponding tarball and its md5 to the archive directory and then create the link in the TODO state directory. That will get the ball rolling on the rest of the processing. Errors like a missing tarball, a missing or bad md5, or duplicate names are detected and handled appropriately (mostly by quarantining the tarball). This takes care of version-002 agents.

Version-001 agents are now also taken care of by a server shim script: they end up in a different reception directory, and the script uses the TODO link that version 001 agents create as an indication that a tarball is ready to process. Most of the processing is similar to the version 002 shim, except that this shim also has to handle an optional prefix file.

This commit also includes changes to the example config files for the server to accommodate both version 001 and 002 agents, and the corresponding changes to various pbench-server-activate-* scripts which are called on server installation to set up the appropriate structures.

Handling --user and --prefix options server-side:

Version 002 pbench-move-results options --user and --prefix (the first new, the second one preexisting) are handled by inserting their values into the metadata.log file. The server side has to be modified to handle them. At the same time, older agents that package the prefix option into a separate file that is copied to the server also need to be handled properly. N.B. older agents do not have the user option capability at all.

The handling is done at several levels:

- The version 001 shim renames the prefix file (if there is one) to $resultname.prefix from prefix.$resultname (the latter was an unfortunate choice that required name surgery in later processing).
- pbench-dispatch looks for the (now renamed) prefix file and moves it into the .prefix subdirectory, just as it did before (except for the renaming).
- pbench-unpack-tarballs takes the prefix (either out of the metadata.log file for version 002 agents or the prefix file for version 001 agents) and makes a link at the appropriate subdirectory of the results/ directory. In addition, it retrieves the user option out of the metadata.log file and (if non-empty) uses it to make a link in a new hierarchy:

      users/$user/$controller/$prefix/$resultname

  Of course, the metadata log is indexed into ES, so these values (in particular the user value) are going to be available for the dashboard to use.
- pbench-sync-package-tarballs now uses the modified prefix form.

One problem that is *NOT* addressed by this PR at all is what the (agent-side) pbench-edit-prefix is supposed to do. For now, we punt.

Revert inotify stuff from pbench-base.sh and pbench-sync-satellite: it will go in as part of a different PR.

Clean up header comment in pbench-sync-satellite.

(After review)

Fix error handling in the shims. There is a "quarantine" directory and three subdirs for each version (001 and 002): md5, duplicates and errors. The handling goes as follows:

- Errors in quarantine are fatal.
- MD5 errors go to "md5".
- Duplicate errors go to "duplicates".
- Operational errors (mkdir/mv/ln failures) in the shims quarantine into a different subdir, "errors". After whatever caused any of these errors is fixed, the quarantined tarballs should be retried by moving them into the appropriate reception area.
- A quarantine setting is added to the config file.
- create-results-dir-structure is modified to create the quarantine directory and its subdirs.

Emit error messages before calling quarantine.

Status formatting: all status on a single line. Forget about prefix stats. Exit with code $nerrs.

The quarantine function now logs some information: it makes an assumption that it is called within a log_init/log_finish context and logs any error to the error file of the program that called it.
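
The md5 and duplicate quarantine rules described above can be sketched as a small routing function. This is only an illustration under stated assumptions, not the shim code from this commit: the function names `classify_tarball` and `route_tarball` and their arguments are hypothetical, the TODO link is left out, and operational failures (the "errors" subdirectory) are not modelled.

```shell
# Hypothetical sketch: decide what a shim should do with one tarball.
# Prints "ok", "md5" or "duplicates"; the last two are also the names of
# the quarantine subdirectories described in the commit message.
classify_tarball() {
    local tb=$1 archive_dir=$2
    local name dir
    name=$(basename "$tb")
    dir=$(dirname "$tb")
    # A missing md5 file and a failed check are treated the same way.
    if [ ! -f "$tb.md5" ] ||
       ! (cd "$dir" && md5sum --check --status "$name.md5"); then
        echo md5
    elif [ -e "$archive_dir/$name" ]; then
        echo duplicates
    else
        echo ok
    fi
}

# Hypothetical sketch: archive a good tarball, quarantine a bad one.
# Prints the verdict so that the caller can count it.
route_tarball() {
    local tb=$1 archive_dir=$2 quarantine=$3
    local verdict
    verdict=$(classify_tarball "$tb" "$archive_dir")
    if [ "$verdict" = ok ]; then
        mkdir -p "$archive_dir" &&
            cp "$tb" "$tb.md5" "$archive_dir"/ &&
            rm -f "$tb" "$tb.md5"
    else
        mkdir -p "$quarantine/$verdict" &&
            mv "$tb" "$quarantine/$verdict"/
        # The md5 file may be missing, so move it only if it is there.
        [ ! -f "$tb.md5" ] || mv "$tb.md5" "$quarantine/$verdict"/
    fi
    echo "$verdict"
}
```

A caller would loop over the reception directory and call `route_tarball "$tb" "$archive" "$quarantine"` for each tarball. Retrying a quarantined tarball then amounts to moving it back into the reception directory once the underlying problem is fixed.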
Fix prefix handling in pbench-move-unpacked: duplicate the handling of the prefix in pbench-unpack-tarballs into pbench-move-unpacked.

(After further review)

Fixes to the two shims after review and discussion.

- Add more error checking.
- Simplify the counting of various error conditions to maintain the condition

      ntotal = ntbs + nquarantines + ndups + nerrs

- Avoid pushd/popd when fixing up the prefix in the -001 shim.
- Annotate the error messages that go into the status file and the error log with "Quarantined", "Duplicate", or "Error" tags, as the case might be.
- Fix bug in -002: check $qdir for existence, not $quarantine.
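
The counting condition mentioned above can be illustrated with a short sketch. The counter names come from the commit message; the helper functions `tally`, `counts_consistent` and `status_line`, and the exact wording of the status line, are assumptions made for this example only.

```shell
# Counters named as in the commit message.
ntotal=0
ntbs=0
nquarantines=0
ndups=0
nerrs=0

# Hypothetical helper: record the outcome of one tarball. Every call
# bumps ntotal and exactly one of the other counters, which is what
# keeps ntotal = ntbs + nquarantines + ndups + nerrs true.
tally() {
    ntotal=$((ntotal + 1))
    case "$1" in
        ok)          ntbs=$((ntbs + 1)) ;;
        Quarantined) nquarantines=$((nquarantines + 1)) ;;
        Duplicate)   ndups=$((ndups + 1)) ;;
        *)           nerrs=$((nerrs + 1)) ;;
    esac
}

# Hypothetical helper: succeeds only if the invariant holds.
counts_consistent() {
    [ "$ntotal" -eq $((ntbs + nquarantines + ndups + nerrs)) ]
}

# Hypothetical helper: all status on a single line.
status_line() {
    echo "total $ntotal, processed $ntbs, quarantined $nquarantines, duplicates $ndups, errors $nerrs"
}
```

A shim written this way would call `tally` once per tarball, print `status_line` at the end, and finish with `exit $nerrs`, matching the "Exit with code $nerrs" note in the commit message.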
1 parent 55a92a0 commit 2998460

14 files changed, +769 -226 lines changed

agent/util-scripts/pbench-move-results

Lines changed: 67 additions & 88 deletions
@@ -8,13 +8,15 @@ pbench_bin="`cd ${script_path}/..; /bin/pwd`"
 # source the base script
 . "$pbench_bin"/base
 
+controller=$hostname
+
 function usage() {
     printf "usage:\n"
     printf "$script_name [--prefix=<path>] [--xz-single-threaded] [--show-server]\n"
 }
 
 # Process options and arguments
-opts=$(getopt -q -o p:xS --longoptions "prefix:,xz-single-threaded,show-server" -n "getopt.sh" -- "$@");
+opts=$(getopt -q -o u:p:xS --longoptions "user:,prefix:,xz-single-threaded,show-server" -n "getopt.sh" -- "$@");
 if [ $? -ne 0 ]; then
     printf "\n"
     printf "%s\n" $*
@@ -24,11 +26,20 @@ if [ $? -ne 0 ]; then
     exit 1
 fi
 
+user=${PBENCH_USER}
+prefix=
 xz_single_threaded=
 show_server=
 eval set -- "$opts";
 while true; do
     case "$1" in
+        -u|--user)
+            shift;
+            if [ -n "$1" ]; then
+                user="$1"
+                shift;
+            fi
+            ;;
         -p|--prefix)
             shift;
             if [ -n "$1" ]; then
@@ -43,7 +54,7 @@ while true; do
         -S|--show-server)
             shift;
             show_server=1
-            ;;
+            ;;
         --)
             shift;
             break;
@@ -61,6 +72,7 @@ if [ ! -f "$pbench_bin/id_rsa" ]; then
     exit 1
 fi
 
+# ask the server where to send the tarballs
 results_webserver=$(getconf.py webserver results)
 if [ -z "$results_webserver" ]; then
     error_log "ERROR: No web server host configured from which we can fetch the FQDN of the host to which we copy/move results"
@@ -123,7 +135,6 @@ if [ -z "$results_path_prefix" ]; then
     debug_log "expected the results_host_info to have the form: <results_user>@<results_host(FQDN)>:<results_path_prefix>"
     exit 1
 fi
-results_full_path="$results_path_prefix/$hostname"
 
 if [[ ! -z "$show_server" ]] ;then
     echo ${results_repo}
@@ -137,17 +148,20 @@ if [ $? -ne 0 ]; then
     debug_log "the following ssh command failed: \"ssh -q -i $pbench_bin/id_rsa $ssh_opts $results_repo exit\""
     exit 1
 fi
-ssh -i $pbench_bin/id_rsa $ssh_opts $results_repo "mkdir -p $results_full_path"
-if [ $? -ne 0 ]; then
-    error_log "ERROR: unable to create remote results path, $results_repo:$results_full_path"
-    exit 1
-fi
 
 let runs_copied=0
 let failures=0
 
-trap "rm -f $pbench_tmp/prefix.*" EXIT INT QUIT
+tmp=${pbench_tmp}/${script_name}.$$
+trap "rm -rf $tmp" EXIT INT QUIT
 
+mkdir -p $tmp/$controller
+sts=$?
+if [ $sts -ne 0 ] ;then
+    error_log "Failed: \"mkdir -p $tmp/$controller\", status $sts"
+    exit 1
+fi
+# We can now start copying tarballs to the server
 
 # Move into pbench run collection directory
 pushd $pbench_run >/dev/null
@@ -172,120 +186,85 @@ for dir in `/bin/ls -ort -d */ | awk '{print $8}' | grep -v "^tools-" | grep -v
         /bin/cp pbench.log $pbench_run_name/
     fi
 
+    # if -u was specified, store the specified user in metadata.log
+    if [ ! -z $user ] ;then
+        mdlog=${pbench_run_name}/metadata.log
+        echo $user | pbench-add-metalog-option ${mdlog} run user
+    fi
+
+    # if -p was specified, store the specified prefix in metadata.log
+    if [ ! -z $prefix ] ;then
+        mdlog=${pbench_run_name}/metadata.log
+        echo $prefix | pbench-add-metalog-option ${mdlog} run prefix
+    fi
+
     results_size=`du -sm $pbench_run_name | awk '{print $1}'`
     debug_log "preparing to copy $results_size MB of data from $pbench_run/$pbench_run_name"
 
-    tarball="$pbench_run_name.tar.xz"
+    # Create a temp directory $tmp/$controller to contain the tarball
+    # and the md5 file (as ${tb}.tar.xz.md5.check). Copy the directory
+    # with scp -r $tmp/$controller $remote: that will create the
+    # $controller subdirectory on the remote (if necessary) OR fail.
+
+    # If it does not fail, then check the MD5 sum and rename the foo.tar.xz.md5.check file
+    # to foo.tar.xz.md5. That's the signal that the agent has finished with this tarball.
+
+    tarball="$tmp/$controller/$pbench_run_name.tar.xz"
     if [[ ${xz_single_threaded} != "1" ]] ;then
         echo "tar --create --force-local \"$pbench_run_name\" | xz -T0 > \"$tarball\" "
         tar --create --force-local "$pbench_run_name" | xz -T0 > "$tarball"
     else
         echo "tar --create --xz --force-local --file=\"$tarball\" \"$pbench_run_name\" "
-        tar --create --xz --force-local --file="$tarball" "$pbench_run_name"
+        tar --create --xz --force-local --file="$tarball" "$pbench_run_name"
     fi
-
+
     if [ $? -ne 0 ]; then
         error_log "ERROR: tar failed for $pbench_run/$pbench_run_name, skipping"
         rm -f "$tarball"
         let failures=failures+1
         continue
     fi
-    md5sum "$tarball" > "$tarball.md5"
+
+    tarballmd5="$tarball.md5.check"
+    # we need to calculate the md5 sum in the temp directory
+    # in order to get the filename right.
+    pushd $(dirname $tarball) > /dev/null
+    md5sum "$(basename $tarball)" > "$tarballmd5"
     if [ $? -ne 0 ]; then
         error_log "ERROR: md5sum failed for $tarball, skipping"
-        rm -f $tarball $tarball.md5
+        rm -f "$tarball" "$tarballmd5"
         let failures=failures+1
+        popd >/dev/null
         continue
     fi
-
-    # Perform the actual copy
-    # if a prefix is provided, copy it to the other side - maybe this should be part of the tarball?
-    prefixfile=""
-    if [ ! -z "$prefix" ] ;then
-        prefixfile=$pbench_tmp/prefix.$pbench_run_name
-        echo "$prefix" > $prefixfile
-    fi
-
-    ssh -i $pbench_bin/id_rsa $ssh_opts $results_repo "mkdir -p $results_full_path"
-
-    # check if there is a name collision and resolve it
-    typeset -i i=1
-    rtarball=$tarball
-    rprefixfile=$prefixfile
-    while (( 1 )) ; do
-        ssh -i $pbench_bin/id_rsa $ssh_opts $results_repo "test -f $results_full_path/$rtarball"
-        if [ $? -eq 0 ] ;then
-            # collision - warn the first time around
-            if [ $i -eq 1 ] ;then
-                log "WARNING: name collision - $results_repo:$results_full_path/$rtarball exists"
-            fi
-        else
-            # collision resolution found
-            if [ "$tarball" != "$rtarball" ] ;then
-                mv $tarball $rtarball
-                mv $tarball.md5 $rtarball.md5
-                tarball=$rtarball
-                if [ ! -z "$prefixfile" ]; then
-                    mv $prefixfile $rprefixfile
-                    prefixfile=$rprefixfile
-                fi
-            fi
-            break
-        fi
-        rtarball="DUPLICATE__NAME.$i.$pbench_run_name.tar.xz"
-        if [ ! -z "$prefixfile" ] ;then
-            rprefixfile="$prefixfile.$i"
-        fi
-        i=$i+1
-    done
-
-    # FIXME: don't assume final path contains /incoming in it
-    if [[ $i -eq 1 ]]; then
-        if [[ ! -z "$prefix" ]]; then
-            debug_log "copying $tarball to http://$results_webserver/results/$hostname/$prefix/..."
-        else
-            debug_log "copying $tarball to http://$results_webserver/results/$hostname/..."
-        fi
-    else
-        debug_log "archiving $tarball to $results_webserver, but not being made available via the web"
-    fi
+    popd >/dev/null
 
     # finally do the copy
-    scp $scp_opts -i $pbench_bin/id_rsa $ssh_opts ./$tarball ./$tarball.md5 $prefixfile $results_repo:$results_full_path
+    scp -r $scp_opts -i $pbench_bin/id_rsa $ssh_opts $tmp/$controller $results_repo:$results_path_prefix
     if [ $? -ne 0 ]; then
-        error_log "ERROR: unable to copy results tarball, $tarball, to $results_repo:$results_full_path"
-        rm -f $tarball $tarball.md5
+        error_log "ERROR: unable to copy results tarball, $tarball, to $results_repo:$results_path_prefix"
+        rm -f $tarball $tarballmd5
         let failures=failures+1
         continue
     fi
 
-    # clean up the prefix file (if present)
-    if [ ! -z "$prefixfile" -a -f "$prefixfile" ] ;then
-        rm -f $prefixfile
-    fi
-
     # Verify the bits copied are good
-    ssh -i $pbench_bin/id_rsa $ssh_opts $results_repo "cd $results_full_path; md5sum --check $pbench_run_name.tar.xz.md5"
+    md5name=$(basename $tarball).md5
+    ssh -i $pbench_bin/id_rsa $ssh_opts $results_repo "cd $results_path_prefix/$controller; md5sum --check ${md5name}.check && mv ${md5name}.check ${md5name}"
     chk_res=$?
-    rm -f $tarball $tarball.md5
     if [ $chk_res -ne 0 ]; then
         error_log "ERROR: remote copy failed, remote tarball MD5 does not match original"
-        rm -f $tarball $tarball.md5
+        rm -f $tarball $tarballmd5
         let failures=failures+1
         continue
-    else
-        if [ "$script_name" == "pbench-move-results" ]; then
-            rm -rf $pbench_run_name
-        else
-            touch $pbench_run_name.copied
-        fi
     fi
+    rm -f $tarball $tarballmd5
 
-    # set the state of the result appropriately so that it will be processed
-    # by the server scripts.
-    ssh -i $pbench_bin/id_rsa $ssh_opts $results_repo \
-        /opt/pbench-server/bin/pbench-server-set-result-state $results_full_path $tarball
-
+    if [ "$script_name" == "pbench-move-results" ]; then
+        rm -rf $pbench_run_name
+    else
+        touch $pbench_run_name.copied
+    fi
     let runs_copied=runs_copied+1
 done
 
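
The ".md5.check" handshake in the diff above can be tried out locally with a sketch like the following. It is a simplified illustration, not the agent code: it works on a local directory instead of going through scp and ssh, and the function names `write_md5_check` and `verify_and_publish` are hypothetical.

```shell
# Hypothetical helper: write <tarball>.md5.check next to the tarball.
# md5sum is run from inside the directory so that the recorded file
# name is relative and stays valid after the directory is copied.
write_md5_check() {
    local tarball=$1
    local name
    name=$(basename "$tarball")
    (cd "$(dirname "$tarball")" && md5sum "$name" > "$name.md5.check")
}

# Hypothetical helper: verify the tarball at the destination and, only
# if the check succeeds, drop the .check suffix. The renamed file is
# the signal that the sender has finished with this tarball.
verify_and_publish() {
    local dest=$1 name=$2
    (cd "$dest" &&
        md5sum --check --status "$name.tar.xz.md5.check" &&
        mv "$name.tar.xz.md5.check" "$name.tar.xz.md5")
}
```

Because the rename only happens after a successful check, a reader of the destination directory that looks only for `*.tar.xz.md5` files never sees a tarball that is still in flight or that failed verification.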
server/pbench/bin/pbench-base.sh

Lines changed: 25 additions & 24 deletions
@@ -42,11 +42,10 @@ else
 fi
 
 ARCHIVE=${TOP}/archive/fs-version-001
-INOTIFY_STATE_DIR=${ARCHIVE}/inotify_state
 INCOMING=${TOP}/public_html/incoming
 # this is where the symlink forest is going to go
 RESULTS=${TOP}/public_html/results
-
+USERS=${TOP}/public_html/users
 
 if [[ -z "$_PBENCH_SERVER_TEST" ]]; then
     function timestamp {
@@ -125,27 +124,29 @@ function log_finish {
     exec 4>&- # Close error file
 }
 
-# The inotify script runs the server scripts like dispatch
-# and unpack asynchronously (more will be added in future),
-# results in multiple instances of those scripts running in
-# parallel. If every instance tries to write in the same file
-# then it will be chaos and make things difficult to debug.
-# In that case, this function will acquire a lock on the main
-# log file and allow every instance to append the log saved in
-# the /tmp directory (with different PID) to the main log file.
-
-function log_append {
-    #log_append $TMP/$(basename $0).$$ $LOGSDIR/$(basename $0)
-    TMP_DIR=$1
-    LOG_DIR=$2
-    mkdir -p $LOG_DIR
-    if [[ $? -ne 0 || ! -d "$LOG_DIR" ]]; then
-        doexit "Unable to find/create logging directory, $LOG_DIR"
+# Function used by the shims to quarantine problematic tarballs. It
+# is assumed that the function is called within a log_init/log_finish
+# context. Errors here are fatal but we log an error message to help
+# diagnose problems.
+function quarantine () {
+    dest=$1
+    shift
+    files="$@"
+
+    mkdir -p $dest
+    sts=$?
+    if [ $sts -ne 0 ] ;then
+        # log error
+        echo "$TS: quarantine $dest $files: \"mkdir -p $dest\" failed with status $sts" >&4
+        log_finish
+        exit 101
+    fi
+    mv $files $dest
+    sts=$?
+    if [ $sts -ne 0 ] ;then
+        # log error
+        echo "$TS: quarantine $dest $files: \"mv $files $dest\" failed with status $sts" >&4
+        log_finish
+        exit 102
     fi
-
-    log_file=$LOG_DIR/$(basename $0).log
-    error_file=$LOG_DIR/$(basename $0).error
-
-    flock -n $log_file cat $TMP_DIR/$(basename $0).log >> $log_file
-    flock -n $error_file cat $TMP_DIR/$(basename $0).error >> $error_file
 }

server/pbench/bin/pbench-dispatch

Lines changed: 5 additions & 5 deletions
@@ -16,7 +16,7 @@
 # this tarball again; but if there are errors, we may
 # keep it in TODO and try again, if the error is recoverable.
 # Any errors are reported for possible action by an admin.
-#
+#
 
 # assumptions:
 # - this script runs as a cron job
@@ -97,7 +97,7 @@ else
 
     link=$(readlink -e $result)
     if [ ! -f "$link" ] ;then
-        echo "$TS: $link does not exist" >&4
+        echo "$TS: $result->$link does not exist" >&4
         nerrs=$nerrs+1
         continue
     fi
@@ -119,15 +119,15 @@ else
         nerrs=$nerrs+1
         continue
     fi
-
+
     mkdir -p $TMP/$PROG/$hostname
     status=$?
     if [[ $status -ne 0 ]] ;then
         echo "$TS: mkdir -p $TMP/$PROG/$hostname failed: code $status" >&4
         nerrs=$nerrs+1
         continue
     fi
-
+
     # XXXX - for now, if it's a duplicate name, just punt and avoid producing the error - the full
     # solution will involve renaming the unpacked directory appropriately.
     if [ ${resultname%%.*} == "DUPLICATE__NAME" ] ;then
@@ -148,7 +148,7 @@ else
 
     # move any prefix file to the .prefix subdir
     basedir=$(dirname $link)
-    prefixfile=$basedir/prefix.$resultname
+    prefixfile=$basedir/$resultname.prefix
     if [ -f $prefixfile ] ;then
         mkdir -p $basedir/.prefix
         mv $prefixfile $basedir/.prefix
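
The two prefix-file steps (the rename done by the version 001 shim and the move into .prefix done by pbench-dispatch) can be sketched together as follows. This is an illustration under assumptions, not the code from this commit: the function names `rename_prefix_file` and `stash_prefix_file` and their arguments are hypothetical. Full paths are used so that no pushd/popd is needed.

```shell
# Hypothetical helper: rename prefix.<resultname> to <resultname>.prefix
# in directory $1. A missing prefix file is not an error, since the
# prefix is optional.
rename_prefix_file() {
    local dir=$1 resultname=$2
    local old="$dir/prefix.$resultname"
    local new="$dir/$resultname.prefix"
    [ -f "$old" ] || return 0
    mv "$old" "$new"
}

# Hypothetical helper: move <resultname>.prefix into the .prefix
# subdirectory of $1, creating that subdirectory if necessary.
stash_prefix_file() {
    local dir=$1 resultname=$2
    local f="$dir/$resultname.prefix"
    [ -f "$f" ] || return 0
    mkdir -p "$dir/.prefix" && mv "$f" "$dir/.prefix"/
}
```

With the suffix form, later stages can derive the result name by stripping `.prefix` from the end of the file name instead of parsing it out of the middle.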
