ICON simulations don't run with gnu_openmpi and gnu_psmpi #2

@AGonzalezNicolas

FYI @kvrigor

Note:

  • ICON runs successfully with the intel_psmpi build.
  • All runs use grid 0070.

1. Paths to the "bin" and "simulation_run" folders

gnu_openmpi

/p/scratch/cslts/gonzalez5/TSMP2/BUILDS/TSMP2/bin/JURECADC_ICON_2025_gnu_openmpi
/p/scratch/cslts/gonzalez5/TSMP2/tsmp2_eclm-parflow_tests/TSMP2_WFE_simexp_ideal_scal/run/sim_pft13-sid02-sv06_0070_icon_20150701_gnu_openmpi

gnu_psmpi

/p/scratch/cslts/gonzalez5/TSMP2/BUILDS/TSMP2/bin/JURECADC_ICON_2025_gnu_psmpi
/p/scratch/cslts/gonzalez5/TSMP2/tsmp2_eclm-parflow_tests/TSMP2_WFE_simexp_ideal_scal/run/sim_pft13-sid02-sv06_0070_icon_20150701_gnu_psmpi

2. ERRORS

gnu_openmpi

ICON built with gnu_openmpi fails with the following errors:

adding new var_list ext_data_atm_td_D01
corrupted double-linked list
corrupted double-linked list

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
corrupted double-linked list

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
corrupted double-linked list
corrupted double-linked list
corrupted double-linked list
corrupted double-linked list
corrupted double-linked list
corrupted double-linked list
 mo_ext_data_state:construct_ext_data: Construction of data structure for external data finished
 mo_ext_data_init:init_ext_data: Running with analytical topography
corrupted double-linked list
corrupted double-linked list

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
corrupted double-linked list
corrupted double-linked list

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
 mo_ext_data_init:init_ext_data: read_ext_data_atm completed

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
corrupted double-linked list

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
corrupted double-linked list
corrupted double-linked list
corrupted double-linked list
corrupted double-linked list
corrupted double-linked list
corrupted double-linked list

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
corrupted double-linked list
corrupted double-linked list
corrupted double-linked list

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
corrupted double-linked list

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
corrupted double-linked list
corrupted double-linked list
corrupted double-linked list
corrupted double-linked list
corrupted double-linked list
corrupted double-linked list

Program received signal SIGABRT: Process abort signal.

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
corrupted double-linked list
corrupted double-linked list
corrupted double-linked list

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
corrupted double-linked list

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
corrupted double-linked list

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
corrupted double-linked list
corrupted double-linked list

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
corrupted double-linked list
corrupted double-linked list
corrupted double-linked list

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
corrupted double-linked list

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
corrupted double-linked list
corrupted double-linked list
 (mo_nh_testcases) init_nh_testtopo:: running Convective Boundary Layer Experiment

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
corrupted double-linked list

Program received signal SIGABRT: Process abort signal.
corrupted double-linked list
corrupted double-linked list
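
The output above appears to interleave messages from many MPI ranks, so a quick tally of the distinct messages can make it easier to compare the failure pattern between builds. The following is a minimal sketch and not part of TSMP2 or ICON: the function name `tally_crash_messages` and the `sample` list are hypothetical, and the markers are simply the message strings that occur in the logs of this issue.

```python
from collections import Counter

# Message fragments copied from the logs in this issue (not an exhaustive list).
MARKERS = (
    "Program received signal SIGABRT",
    "Program received signal SIGSEGV",
    "Program received signal SIGBUS",
    "corrupted double-linked list",
)


def tally_crash_messages(lines):
    """Count how often each known crash message occurs in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        for marker in MARKERS:
            if marker in line:
                counts[marker] += 1
    return counts


# Hypothetical sample: a few lines copied from the gnu_openmpi output above.
sample = [
    "corrupted double-linked list",
    "corrupted double-linked list",
    "Program received signal SIGSEGV: Segmentation fault - invalid memory reference.",
    "Backtrace for this error:",
    "corrupted double-linked list",
    "Program received signal SIGABRT: Process abort signal.",
]

for marker, count in tally_crash_messages(sample).most_common():
    print(f"{count:4d}  {marker}")
```

To run it on a real log, pass an open file object instead of `sample`; the log file name depends on the job setup and is not given in this issue.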

gnu_psmpi

ICON built with gnu_psmpi fails with the following errors:

adding new var_list ext_data_atm_td_D01
[jrc0715:930019:0:930019] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
 mo_ext_data_state:construct_ext_data: Construction of data structure for external data finished
 mo_ext_data_init:init_ext_data: Running with analytical topography
 mo_ext_data_init:init_ext_data: read_ext_data_atm completed
[jrc0715:930099:0:930099] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
 (mo_nh_testcases) init_nh_testtopo:: running Convective Boundary Layer Experiment
[jrc0715:930015:0:930015] Caught signal 7 (Bus error: Sent by the kernel)
[jrc0715:930086:0:930086] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
[jrc0715:929988:0:929988] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
malloc(): unaligned tcache chunk detected

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
double free or corruption (out)

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
==== backtrace (tid: 930015) ====
 0 0x000000000003ebf0 __GI___sigaction()  :0
 1 0x00000000011a5dc3 __mo_hash_table_MOD_hashtable_destruct()  ???:0
 2 0x0000000000573e69 __mo_key_value_store_MOD_key_value_store_destruct()  ???:0
 3 0x000000000067243c __mo_dictionary_MOD_dict_finalize()  ???:0
 4 0x00000000004df782 __mo_ext_data_init_MOD_init_ext_data()  ???:0
 5 0x0000000000433c37 __mo_atmo_model_MOD_construct_atmo_model()  ???:0
 6 0x0000000000434729 __mo_atmo_model_MOD_atmo_model()  ???:0
 7 0x000000000040c448 MAIN__()  icon.f90:0
 8 0x000000000040be1d main()  ???:0
 9 0x00000000000295d0 __libc_start_call_main()  ???:0
10 0x0000000000029680 __libc_start_main_alias_2()  :0
11 0x000000000040be75 _start()  ???:0
=================================

Program received signal SIGBUS: Access to an undefined portion of a memory object.

Backtrace for this error:
#0  0x14f32de2bbef in ???
#1  0x14f32de78edc in ???
#2  0x14f32de2bb45 in ???
#3  0x14f32de15832 in ???
#4  0x14f32de16171 in ???
#5  0x14f32de82f86 in ???
#6  0x14f32de86ffb in ???
#7  0x14f2bddd187e in pscom_req_create
	at /dev/shm/swmanage/jurecadc/pscom/5-default/GCCcore-13.3.0/pscom-5.8.0-1/lib/pscom/pscom_req.c:152
#8  0x14f2bddc90d8 in pscom_request_create
	at /dev/shm/swmanage/jurecadc/pscom/5-default/GCCcore-13.3.0/pscom-5.8.0-1/lib/pscom/pscom_io.c:1602
#9  0x14f2c02d457b in ???
#10  0x14f2c02ca2ce in ???
#11  0x14f2c017f404 in ???
#12  0x14f32e080a2b in ???
#13  0x423a53 in ???
#14  0xa12017 in ???
#15  0x55c6a7 in ???
#16  0x490d63 in ???
#17  0x7b001c in ???
#18  0x53b9c2 in ???
#19  0x433dbc in ???
#20  0x434728 in ???
#21  0x40c447 in ???
#22  0x40be1c in ???
#23  0x14f32de165cf in ???
#24  0x14f32de1667f in ???
#25  0x40be74 in ???
#26  0xffffffffffffffff in ???
#0  0x1535da2c7bef in ???
#1  0x1535da314edc in ???
#2  0x1535da2c7b45 in ???
#3  0x1535da2b1832 in ???
#4  0x1535da2b2171 in ???
#5  0x1535da31ef86 in ???
#6  0x1535da320c6f in ???
#7  0x1535da3232c4 in ???
#8  0x15356739d16e in ???
#9  0x15356739eea5 in ???
#10  0x1535673a0b70 in ???
#11  0x1535673722e6 in ???
#12  0x153567372430 in ???
#13  0x153567488759 in find_address_in_section
	at debug/debug.c:338
#14  0x153567345bab in ???
#15  0x1535674893ca in get_line_info
	at debug/debug.c:370
#16  0x1535674893ca in ucs_debug_backtrace_create
	at debug/debug.c:401
#17  0x1535674898c4 in ucs_debug_backtrace_create
	at debug/debug.c:390
#18  0x153567489d61 in ucs_debug_show_innermost_source_file
	at debug/debug.c:551
#19  0x15356748ae6f in ucs_handle_error
	at debug/debug.c:1091
#20  0x15356748b043 in ucs_debug_handle_error_signal
	at debug/debug.c:1044
#21  0x15356748b1e9 in ucs_error_signal_handler
	at debug/debug.c:1066
#22  0x1535da2c7bef in ???
#23  0x4f4d14 in ???
#24  0x4346e5 in ???
#25  0x434728 in ???
#26  0x40c447 in ???
#27  0x40be1c in ???
#28  0x1535da2b25cf in ???
#29  0x1535da2b267f in ???
#30  0x40be74 in ???
#31  0xffffffffffffffff in ???
#0  0x152adb23fbef in ???
#1  0x11a5dc3 in ???
#2  0x573e68 in ???
#3  0x67243b in ???
#4  0x4df781 in ???
#5  0x433c36 in ???
#6  0x434728 in ???
#7  0x40c447 in ???
#8  0x40be1c in ???
#9  0x152adb22a5cf in ???
#10  0x152adb22a67f in ???
#11  0x40be74 in ???
#12  0xffffffffffffffff in ???
srun: error: jrc0715: tasks 0-39,41-73,75,77-127: Terminated
srun: error: jrc0715: task 40: Bus error (core dumped)
srun: error: jrc0715: tasks 74,76: Aborted (core dumped)
srun: Force Terminated StepId=14189510.0
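
The symbolized backtrace for tid 930015 uses gfortran's naming scheme for module procedures, `__<module>_MOD_<procedure>`. Read that way, frames 1 to 4 show `hashtable_destruct` (module `mo_hash_table`) being reached via `key_value_store_destruct`, `dict_finalize`, and `init_ext_data`. The sketch below splits such frame lines into module and procedure names; the helper name `split_gfortran_symbol` is hypothetical, and the approach assumes the frame format printed in this log.

```python
import re

# Matches gfortran module-procedure symbols as printed in the backtrace,
# e.g. "__mo_hash_table_MOD_hashtable_destruct()".
SYMBOL = re.compile(r"__(?P<module>\w+?)_MOD_(?P<procedure>\w+)\(\)")


def split_gfortran_symbol(frame_line):
    """Return (module, procedure) for a gfortran module procedure, else None."""
    match = SYMBOL.search(frame_line)
    if match is None:
        return None
    return match.group("module"), match.group("procedure")


# Frames copied from the "==== backtrace (tid: 930015) ====" block above.
frames = [
    " 0 0x000000000003ebf0 __GI___sigaction()  :0",
    " 1 0x00000000011a5dc3 __mo_hash_table_MOD_hashtable_destruct()  ???:0",
    " 2 0x0000000000573e69 __mo_key_value_store_MOD_key_value_store_destruct()  ???:0",
    " 3 0x000000000067243c __mo_dictionary_MOD_dict_finalize()  ???:0",
    " 4 0x00000000004df782 __mo_ext_data_init_MOD_init_ext_data()  ???:0",
]

for frame in frames:
    parsed = split_gfortran_symbol(frame)
    if parsed is not None:
        module, procedure = parsed
        print(f"{module}: {procedure}")
```

Frames that are not Fortran module procedures, such as `__GI___sigaction()`, are skipped because they contain no `_MOD_` separator.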
