In the first section of your logs, 12 out of 17 jobs hit the walltime and were subsequently resubmitted by the merge job. However, it seems that at least one of these resubmitted jobs failed. It would be helpful to understand what happened to that same job during its initial run (the one that reached the walltime). In most cases, re-launching jobs that hit the walltime does not solve the problem; the better approach is usually to cancel or stop the run, increase the walltime, and then re-run the job. There are only a few special cases where re-launching is the right option.
Hello,
I got an error during particle refinement and was looking for advice. Here is the command I ran:
"/opt/pyp/bin/run/csp" -data_parent "/archive/qnap/nextpyp/shared/users/Jake/projects/MLVs-24jun04a-24Sm8YFT1CR2sLEx5V/tomo-preprocessing-fYHvdxz3b5w42nGF" -particle_mw 135.0 -particle_rad 70.0 -particle_sym "C3" -extract_box 56 -extract_bin 4 -extract_fmt frealign -refine_model "/archive/qnap/Jake/Active_projects/24jun04a_MMTV-MLVs_krios/eman_project_folder/spt_06/MLV-MA-ref_EMAN_spt_06.mrc" -refine_rhref "20" -no-refine_resume -refine_maxiter 5 -refine_srad 120.0 -reconstruct_cutoff "0.75" -csp_NumberOfRandomIterations 50000 -csp_ToleranceParticlesShifts 50.0 -slurm_tasks 36 -slurm_memory 300 -slurm_walltime "24:00:00" -slurm_merge_tasks 6 -slurm_merge_memory 20 -slurm_merge_walltime "24:00:00"

Here is the error from the merge logs:
Afterwards, it resubmitted the split job, which got this error:
Traceback (most recent call last):
  File "/opt/pyp/bin/run/pyp", line 3877, in <module>
    csp_swarm(args.file, parameters, int(args.iter), args.skip, args.debug)
  File "/opt/pyp/src/pyp/utils/timer.py", line 78, in wrapper_timer
    return func(*args, **kwargs)
  File "/opt/pyp/bin/run/pyp", line 2395, in csp_swarm
    [allboxes, allparxs] = csp_extract_coordinates(
  File "/opt/pyp/src/pyp/utils/timer.py", line 78, in wrapper_timer
    return func(*args, **kwargs)
  File "/opt/pyp/src/pyp/inout/metadata/core.py", line 1890, in csp_extract_coordinates
    shutil.copy2("csp/{}_boxes3d.txt".format(filename), working_path)
  File "/usr/local/envs/pyp/lib/python3.8/shutil.py", line 435, in copy2
    copyfile(src, dst, follow_symlinks=follow_symlinks)
  File "/usr/local/envs/pyp/lib/python3.8/shutil.py", line 264, in copyfile
    with open(src, 'rb') as fsrc, open(dst, 'wb') as fdst:
FileNotFoundError: [Errno 2] No such file or directory: 'csp/MMTV-MLVs-24jun04a-TS-71_boxes3d.txt'
2025-02-10 19:15:09 [ERROR] /opt/pyp/bin/run/pyp:3891 | PYP (cspswarm) failed
2025-02-10 19:15:13 [INFO] /opt/pyp/bin/run/pyp:3232 | Job 5182196512542810902_1 (v0.6.4) launching on eggg.medchem.washington.edu using 36 task(s)
2025-02-10 19:15:13 [INFO] pyp/refine/csp/particle_cspt.py:57 | Total time elapsed (csp_local_merge): 00h 00m 00s
2025-02-10 19:15:13 [INFO] /opt/pyp/bin/run/pyp:3178 | Filesystem Size Used Avail Use% Mounted on
2025-02-10 19:15:13 [INFO] /opt/pyp/bin/run/pyp:3178 | /dev/nvme1n1p1 7.3T 2.8T 4.1T 41% /home/nextpyp/tmp/pyp
Traceback (most recent call last):
  File "/opt/pyp/bin/run/pyp", line 3957, in <module>
    particle_cspt.merge_movie_files_in_job_arr(
  File "/opt/pyp/src/pyp/utils/timer.py", line 78, in wrapper_timer
    return func(*args, **kwargs)
  File "/opt/pyp/src/pyp/refine/csp/particle_cspt.py", line 380, in merge_movie_files_in_job_arr
    with open(movie_file) as f:
FileNotFoundError: [Errno 2] No such file or directory: 'stacks.txt'
2025-02-10 19:15:13 [ERROR] /opt/pyp/bin/run/pyp:3978 | PYP (csp_local_merge) failed
The merge job after that then got a similar error to the first one, and the job timed out.
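For context, both tracebacks are the same failure mode: a later step (`shutil.copy2`, then `open`) trips over a file an earlier step never wrote (`csp/..._boxes3d.txt`, then `stacks.txt`). A minimal sketch of that behavior and of a fail-fast pre-check (the `require_inputs` helper and file names below are illustrative, not part of pyp):

```python
import os
import shutil
import tempfile

def require_inputs(*paths):
    """Raise immediately, with a clear message, if any expected input is missing."""
    missing = [p for p in paths if not os.path.isfile(p)]
    if missing:
        raise FileNotFoundError(f"missing expected input file(s): {missing}")

with tempfile.TemporaryDirectory() as tmp:
    boxes = os.path.join(tmp, "TS-71_boxes3d.txt")  # illustrative name

    # The upstream step has not written the file, so the check raises --
    # just as shutil.copy2(boxes, dst) or open(boxes) would deeper in the stack.
    try:
        require_inputs(boxes)
        raised = False
    except FileNotFoundError:
        raised = True
    assert raised

    # Once the file exists, the check passes and the copy succeeds.
    with open(boxes, "w") as f:
        f.write("0 0 0\n")
    require_inputs(boxes)
    shutil.copy2(boxes, os.path.join(tmp, "copy.txt"))
```

Checking for the file up front does not fix the root cause (the upstream job that timed out never produced its output), but it turns a deep traceback into an immediate error that names the missing upstream file.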
Thanks!
Jake Croft