Configuration Examples
Default Configuration
Using FTI Processes
Using only selected ckpt level with FTI_Snapshot
Keeping last checkpoint
Using different IO mode
Restart after a failure
[basic]
head = 0
node_size = 2
ckpt_dir = ./Local
glbl_dir = ./Global
meta_dir = ./Meta
ckpt_l1 = 3
ckpt_l2 = 5
ckpt_l3 = 7
ckpt_l4 = 11
dcp_l4 = 0
inline_l2 = 1
inline_l3 = 1
inline_l4 = 1
keep_last_ckpt = 0
keep_l4_ckpt = 0
group_size = 4
max_sync_intv = 0
ckpt_io = 1
enable_staging = 0
enable_dcp = 0
dcp_mode = 0
dcp_block_size = 16384
verbosity = 2
[restart]
failure = 0
exec_id = 2018-09-17_09-50-30
[injection]
rank = 0
number = 0
position = 0
frequency = 0
[advanced]
block_size = 1024
transfer_size = 16
general_tag = 2612
ckpt_tag = 711
stage_tag = 406
final_tag = 3107
local_test = 1
lustre_striping_unit = 4194304
lustre_striping_factor = -1
lustre_striping_offset = -1
DESCRIPTION
This configuration is made of default values (see: 5). FTI processes are not created (head = 0). Note that if there are no FTI processes, all post-checkpoint work must be done by application processes, which is why inline_l2, inline_l3 and inline_l4 are set to 1. The last checkpoint won’t be kept (keep_last_ckpt = 0). FTI_Snapshot() will take an L1 checkpoint every 3 min, L2 every 5 min, L3 every 7 min and L4 every 11 min. FTI will print errors and a few important messages (verbosity = 2), and the IO mode is set to POSIX (ckpt_io = 1). This is a normal launch of a job because failure is set to 0 (exec_id is only needed when restarting, so a fresh run can also leave it as NULL). local_test = 1 makes this a local test.
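The rule that head = 0 requires every inline option to be 1 can be checked before a job is submitted. The following is a minimal sketch using Python's standard configparser, not part of FTI; the helper names load_basic and inline_problems and the shortened SAMPLE text are assumptions made for illustration.

```python
import configparser

# Shortened copy of the default [basic] section shown above.
SAMPLE = """
[basic]
head = 0
ckpt_l1 = 3
ckpt_l2 = 5
ckpt_l3 = 7
ckpt_l4 = 11
inline_l2 = 1
inline_l3 = 1
inline_l4 = 1
"""

def load_basic(text):
    """Return the basic section as a dict of strings with lowercase keys."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    # Accept "[basic]" as well as padded or capitalised forms like "[ Basic ]".
    for name in parser.sections():
        if name.strip().lower() == "basic":
            return dict(parser[name])
    raise KeyError("no basic section found")

def inline_problems(basic):
    """List the inline_lN options that are not 1 although head = 0."""
    if int(basic["head"]) != 0:
        return []
    return [f"inline_l{n}" for n in (2, 3, 4) if int(basic[f"inline_l{n}"]) != 1]

basic = load_basic(SAMPLE)
print({f"L{n}": int(basic[f"ckpt_l{n}"]) for n in (1, 2, 3, 4)})
print(inline_problems(basic))
```

An empty list from inline_problems means the inline settings are consistent with head = 0. configparser lowercases option names by default, so the same check also works on listings that spell the keys as ckpt_L1 or inline_L2.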
[ Basic ]
head = 1
node_size = 2
ckpt_dir = /scratch/username/
glbl_dir = /work/project/
meta_dir = /home/username/.fti/
ckpt_L1 = 3
ckpt_L2 = 5
ckpt_L3 = 7
ckpt_L4 = 11
inline_L2 = 0
inline_L3 = 0
inline_L4 = 0
keep_last_ckpt = 0
group_size = 4
max_sync_intv = 0
ckpt_io = 1
verbosity = 2
[ Restart ]
failure = 0
exec_id = NULL
[ Advanced ]
block_size = 1024
transfer_size = 16
mpi_tag = 2612
lustre_striping_unit = 4194304
lustre_striping_factor = -1
lustre_striping_offset = -1
local_test = 1
DESCRIPTION
FTI processes are created (head = 1) and all post-checkpointing is done by them, thus inline_L2, inline_L3 and inline_L4 are set to 0. Note that it is possible to select which checkpoint levels are post-processed by heads and which by application processes (e.g. inline_L2 = 1, inline_L3 = 0, inline_L4 = 0). L1 post-checkpointing is always done by application processes, because it’s a local checkpoint. Be aware that when head = 1 and inline_L2, inline_L3 and inline_L4 are all set to 1, all post-checkpointing is still done by application processes.
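The combinations described above can be summarised as a small decision function. This is only an illustrative sketch of the rules as stated on this page, not FTI code; the function name post_processing_owner is hypothetical, and raising an error for head = 0 with an inline option of 0 is this sketch's own choice, not documented FTI behaviour.

```python
def post_processing_owner(head, inline):
    """Return {level: "application" or "head"} for checkpoint levels 1 to 4.

    head   -- value of the head option (0 or 1)
    inline -- dict mapping levels 2, 3 and 4 to their inline_LN value (0 or 1)
    """
    owners = {1: "application"}  # L1 is a local checkpoint
    for level in (2, 3, 4):
        if inline[level] == 1:
            # Inline post-processing is done by application processes,
            # even when head processes exist.
            owners[level] = "application"
        elif head == 1:
            owners[level] = "head"
        else:
            # Assumption of this sketch: treat the combination as an error.
            raise ValueError(f"inline_L{level} = 0 needs head = 1")
    return owners

# Mixed setup from the description: L2 inline, L3 and L4 handled by heads.
print(post_processing_owner(1, {2: 1, 3: 0, 4: 0}))
```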
[ Basic ]
head = 0
node_size = 2
ckpt_dir = /scratch/username/
glbl_dir = /work/project/
meta_dir = /home/username/.fti/
ckpt_L1 = 0
ckpt_L2 = 5
ckpt_L3 = 0
ckpt_L4 = 0
inline_L2 = 1
inline_L3 = 1
inline_L4 = 1
keep_last_ckpt = 0
group_size = 4
max_sync_intv = 0
ckpt_io = 1
verbosity = 2
[ Restart ]
failure = 0
exec_id = NULL
[ Advanced ]
block_size = 1024
transfer_size = 16
mpi_tag = 2612
lustre_striping_unit = 4194304
lustre_striping_factor = -1
lustre_striping_offset = -1
local_test = 1
DESCRIPTION
FTI_Snapshot() will take only an L2 checkpoint, every 5 min. Note that other configurations are also possible (e.g. take an L1 ckpt every 5 min and an L4 ckpt every 30 min).
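As a sketch of how such a selection translates into the four interval lines, the hypothetical helper below (ckpt_interval_lines is not an FTI function) writes an interval of 0 for every level that is not wanted, mirroring the listing above where only ckpt_L2 is non-zero.

```python
def ckpt_interval_lines(wanted):
    """Return the four ckpt_LN lines for the [ Basic ] section.

    wanted -- dict mapping a level (1 to 4) to its interval in minutes;
              levels that are left out get an interval of 0.
    """
    for level, minutes in wanted.items():
        if level not in (1, 2, 3, 4):
            raise ValueError(f"unknown checkpoint level: {level}")
        if minutes <= 0:
            raise ValueError(f"interval for L{level} must be positive")
    return [f"ckpt_L{level} = {wanted.get(level, 0)}" for level in (1, 2, 3, 4)]

# The alternative mentioned above: L1 every 5 min and L4 every 30 min.
print("\n".join(ckpt_interval_lines({1: 5, 4: 30})))
```

The printed lines can be pasted over the ckpt_L1 to ckpt_L4 entries of any listing on this page.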
[ Basic ]
head = 0
node_size = 2
ckpt_dir = /scratch/username/
glbl_dir = /work/project/
meta_dir = /home/username/.fti/
ckpt_L1 = 3
ckpt_L2 = 5
ckpt_L3 = 7
ckpt_L4 = 11
inline_L2 = 1
inline_L3 = 1
inline_L4 = 1
keep_last_ckpt = 1
group_size = 4
max_sync_intv = 0
ckpt_io = 1
verbosity = 2
[ Restart ]
failure = 0
exec_id = NULL
[ Advanced ]
block_size = 1024
transfer_size = 16
mpi_tag = 2612
lustre_striping_unit = 4194304
lustre_striping_factor = -1
lustre_striping_offset = -1
local_test = 1
DESCRIPTION
FTI will keep the last checkpoint (keep_last_ckpt = 1), thus after the job finishes failure will be set to 2.
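A script that runs after the job could read the failure value back to see which state the configuration is in. The sketch below assumes the value is recorded in the same configuration file; the helper failure_state, the wording of the labels and the AFTER_JOB sample are illustrative, and only the three values mentioned on this page (0, 1 and 2) are covered.

```python
import configparser

# Labels paraphrase the descriptions on this page; they are not FTI output.
FAILURE_MEANING = {
    0: "normal launch",
    1: "restart after a failure",
    2: "previous job finished and kept its last checkpoint",
}

def failure_state(config_text):
    """Return (failure value, short description) from the restart section."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    # Accept "[restart]" as well as padded forms like "[ Restart ]".
    for name in parser.sections():
        if name.strip().lower() == "restart":
            value = parser[name].getint("failure")
            return value, FAILURE_MEANING.get(value, "unknown")
    raise KeyError("no restart section found")

# Hypothetical state of the restart section after a job with keep_last_ckpt = 1.
AFTER_JOB = """
[ Restart ]
failure = 2
"""
print(failure_state(AFTER_JOB))
```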
For instance MPI-I/O:
[ Basic ]
head = 0
node_size = 2
ckpt_dir = /scratch/username/
glbl_dir = /work/project/
meta_dir = /home/username/.fti/
ckpt_L1 = 3
ckpt_L2 = 5
ckpt_L3 = 7
ckpt_L4 = 11
inline_L2 = 1
inline_L3 = 1
inline_L4 = 1
keep_last_ckpt = 0
group_size = 4
max_sync_intv = 0
ckpt_io = 2
verbosity = 2
[ Restart ]
failure = 0
exec_id = NULL
[ Advanced ]
block_size = 1024
transfer_size = 16
mpi_tag = 2612
lustre_striping_unit = 4194304
lustre_striping_factor = -1
lustre_striping_offset = -1
local_test = 1
DESCRIPTION
FTI IO mode is set to MPI-IO (ckpt_io = 2). The third option is SIONlib IO mode (ckpt_io = 3).
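Switching between the three modes only changes one line, so it can be scripted. The following sketch edits the ckpt_io line in a configuration text and leaves everything else untouched; set_ckpt_io and the mode names used as dictionary keys are assumptions of this example, while the numeric values are the ones given on this page.

```python
import re

# Numeric values as listed on this page; the string keys are illustrative.
IO_MODES = {"posix": 1, "mpiio": 2, "sionlib": 3}

def set_ckpt_io(config_text, mode):
    """Return config_text with its single ckpt_io line set to the given mode."""
    value = IO_MODES[mode]
    pattern = r"(?mi)^([ \t]*ckpt_io[ \t]*=[ \t]*)\d+[ \t]*$"
    new_text, count = re.subn(pattern, rf"\g<1>{value}", config_text)
    if count != 1:
        raise ValueError(f"expected exactly one ckpt_io line, found {count}")
    return new_text

sample = "[ Basic ]\nhead = 0\nckpt_io = 1\nverbosity = 2\n"
print(set_ckpt_io(sample, "mpiio"))
```

A regular expression is used instead of rewriting the file with configparser so that the layout of the file is preserved.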
[ Basic ]
head = 0
node_size = 2
ckpt_dir = /scratch/username/
glbl_dir = /work/project/
meta_dir = /home/username/.fti/
ckpt_L1 = 3
ckpt_L2 = 5
ckpt_L3 = 7
ckpt_L4 = 11
inline_L2 = 1
inline_L3 = 1
inline_L4 = 1
keep_last_ckpt = 0
group_size = 4
max_sync_intv = 0
ckpt_io = 1
verbosity = 2
[ Restart ]
failure = 1
exec_id = 2017-07-26_13-22-11
[ Advanced ]
block_size = 1024
transfer_size = 16
mpi_tag = 2612
lustre_striping_unit = 4194304
lustre_striping_factor = -1
lustre_striping_offset = -1
local_test = 1
DESCRIPTION
This config tells FTI that this job is a restart after a failure (failure is set to 1 and exec_id is a date in the format YYYY-MM-DD_HH-mm-ss, where YYYY is the year, MM the month, DD the day, HH the hours, mm the minutes and ss the seconds). When recovery is not possible, FTI will abort the job (when using FTI_Snapshot()) and/or signal the failed recovery through FTI_Status().
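Because a restart depends on exec_id matching the format above, it can help to validate the value before resubmitting. This is a minimal sketch with Python's datetime module; parse_exec_id and restart_problems are hypothetical helpers, and requiring zero-padded fields is an assumption based on the example value 2017-07-26_13-22-11.

```python
from datetime import datetime

# YYYY-MM-DD_HH-mm-ss expressed as a strptime/strftime pattern.
EXEC_ID_FORMAT = "%Y-%m-%d_%H-%M-%S"

def parse_exec_id(exec_id):
    """Parse an exec_id string, raising ValueError if it does not fit the format."""
    parsed = datetime.strptime(exec_id, EXEC_ID_FORMAT)
    # strptime also accepts fields without zero padding, so round-trip to reject them.
    if parsed.strftime(EXEC_ID_FORMAT) != exec_id:
        raise ValueError(f"exec_id is not zero-padded: {exec_id!r}")
    return parsed

def restart_problems(failure, exec_id):
    """Return a list of problems for a restart configuration (empty if none)."""
    if failure != 1:
        return []  # exec_id is only checked for a restart after a failure
    try:
        parse_exec_id(exec_id)
    except ValueError as error:
        return [f"failure = 1 needs a valid exec_id: {error}"]
    return []

print(restart_problems(1, "2017-07-26_13-22-11"))
print(restart_problems(1, "NULL"))
```

The first call prints an empty list; the second reports that NULL is not a usable exec_id for a restart.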