In the first section of your logs, 12 out of 17 jobs hit the walltime and were subsequently resubmitted by the merge job. However, it seems that at least one of these resubmitted jobs failed. It would be helpful to understand what happened to that same job during its initial run (the one that reached the walltime). In most cases, re-launching jobs that hit the walltime does not solve the problem; the better approach is usually to cancel or stop the run, increase the walltime, and then re-run the job. There are only a few special cases where re-launching is the right option.
Hello,
I got an error during particle refinement and was looking for advice. Here is the command I ran:
"/opt/pyp/bin/run/csp" -data_parent "/archive/qnap/nextpyp/shared/users/Jake/projects/MLVs-24jun04a-24Sm8YFT1CR2sLEx5V/tomo-preprocessing-fYHvdxz3b5w42nGF" -particle_mw 135.0 -particle_rad 70.0 -particle_sym "C3" -extract_box 56 -extract_bin 4 -extract_fmt frealign -refine_model "/archive/qnap/Jake/Active_projects/24jun04a_MMTV-MLVs_krios/eman_project_folder/spt_06/MLV-MA-ref_EMAN_spt_06.mrc" -refine_rhref "20" -no-refine_resume -refine_maxiter 5 -refine_srad 120.0 -reconstruct_cutoff "0.75" -csp_NumberOfRandomIterations 50000 -csp_ToleranceParticlesShifts 50.0 -slurm_tasks 36 -slurm_memory 300 -slurm_walltime "24:00:00" -slurm_merge_tasks 6 -slurm_merge_memory 20 -slurm_merge_walltime "24:00:00"

Here is the error from the merge logs:
Afterwards, it resubmitted the split job, which got this error:
Traceback (most recent call last):
  File "/opt/pyp/bin/run/pyp", line 3877, in <module>
    csp_swarm(args.file, parameters, int(args.iter), args.skip, args.debug)
  File "/opt/pyp/src/pyp/utils/timer.py", line 78, in wrapper_timer
    return func(*args, **kwargs)
  File "/opt/pyp/bin/run/pyp", line 2395, in csp_swarm
    [allboxes, allparxs] = csp_extract_coordinates(
  File "/opt/pyp/src/pyp/utils/timer.py", line 78, in wrapper_timer
    return func(*args, **kwargs)
  File "/opt/pyp/src/pyp/inout/metadata/core.py", line 1890, in csp_extract_coordinates
    shutil.copy2("csp/{}_boxes3d.txt".format(filename), working_path)
  File "/usr/local/envs/pyp/lib/python3.8/shutil.py", line 435, in copy2
    copyfile(src, dst, follow_symlinks=follow_symlinks)
  File "/usr/local/envs/pyp/lib/python3.8/shutil.py", line 264, in copyfile
    with open(src, 'rb') as fsrc, open(dst, 'wb') as fdst:
FileNotFoundError: [Errno 2] No such file or directory: 'csp/MMTV-MLVs-24jun04a-TS-71_boxes3d.txt'
2025-02-10 19:15:09 [ERROR] /opt/pyp/bin/run/pyp:3891 | PYP (cspswarm) failed
2025-02-10 19:15:13 [INFO] /opt/pyp/bin/run/pyp:3232 | Job 5182196512542810902_1 (v0.6.4) launching on eggg.medchem.washington.edu using 36 task(s)
2025-02-10 19:15:13 [INFO] pyp/refine/csp/particle_cspt.py:57 | Total time elapsed (csp_local_merge): 00h 00m 00s
2025-02-10 19:15:13 [INFO] /opt/pyp/bin/run/pyp:3178 | Filesystem Size Used Avail Use% Mounted on
2025-02-10 19:15:13 [INFO] /opt/pyp/bin/run/pyp:3178 | /dev/nvme1n1p1 7.3T 2.8T 4.1T 41% /home/nextpyp/tmp/pyp
Traceback (most recent call last):
  File "/opt/pyp/bin/run/pyp", line 3957, in <module>
    particle_cspt.merge_movie_files_in_job_arr(
  File "/opt/pyp/src/pyp/utils/timer.py", line 78, in wrapper_timer
    return func(*args, **kwargs)
  File "/opt/pyp/src/pyp/refine/csp/particle_cspt.py", line 380, in merge_movie_files_in_job_arr
    with open(movie_file) as f:
FileNotFoundError: [Errno 2] No such file or directory: 'stacks.txt'
2025-02-10 19:15:13 [ERROR] /opt/pyp/bin/run/pyp:3978 | PYP (csp_local_merge) failed
The merge job after that then got a similar error to the first one, and the job timed out.
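For context, both tracebacks are the same failure mode: a later step (`shutil.copy2`, then `open`) trips over a file an earlier step never wrote (`csp/..._boxes3d.txt`, then `stacks.txt`). A minimal sketch of that behavior and of a fail-fast pre-check (the `require_inputs` helper and file names below are illustrative, not part of pyp):

```python
import os
import shutil
import tempfile

def require_inputs(*paths):
    """Raise immediately, with a clear message, if any expected input is missing."""
    missing = [p for p in paths if not os.path.isfile(p)]
    if missing:
        raise FileNotFoundError(f"missing expected input file(s): {missing}")

with tempfile.TemporaryDirectory() as tmp:
    boxes = os.path.join(tmp, "TS-71_boxes3d.txt")  # illustrative name

    # The upstream step has not written the file, so the check raises --
    # just as shutil.copy2(boxes, dst) or open(boxes) would deeper in the stack.
    try:
        require_inputs(boxes)
        raised = False
    except FileNotFoundError:
        raised = True
    assert raised

    # Once the file exists, the check passes and the copy succeeds.
    with open(boxes, "w") as f:
        f.write("0 0 0\n")
    require_inputs(boxes)
    shutil.copy2(boxes, os.path.join(tmp, "copy.txt"))
```

Checking for the file up front does not fix the root cause (the upstream job that timed out never produced its output), but it turns a deep traceback into an immediate error that names the missing upstream file.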
Thanks!
Jake Croft