API Reference
FTI Datatypes
FTI Constants
FTI_Init
FTI_InitType
FTI_Protect
FTI_GetStoredSize
FTI_Realloc
FTI_Checkpoint
FTI_Status
FTI_Recover
FTI_Snapshot
FTI_Finalize
FTI Datatypes
FTI_CHAR : FTI data type for chars.
FTI_SHRT : FTI data type for short integers.
FTI_INTG : FTI data type for integers.
FTI_LONG : FTI data type for long integers.
FTI_UCHR : FTI data type for unsigned chars.
FTI_USHT : FTI data type for unsigned short integers.
FTI_UINT : FTI data type for unsigned integers.
FTI_ULNG : FTI data type for unsigned long integers.
FTI_SFLT : FTI data type for single floating point.
FTI_DBLE : FTI data type for double floating point.
FTI_LDBE : FTI data type for long double floating point.
FTI Constants
FTI_BUFS : 256
FTI_DONE : 1
FTI_SCES : 0
FTI_NSCS : -1
FTI_NREC : -2
FTI_Init
- Reads configuration file.
- Creates checkpoint directories.
- Detects topology of the system.
- Regenerates data upon recovery.
DEFINITION
int FTI_Init ( char * configFile , MPI_Comm globalComm )
INPUT

| Variable | What for? |
|---|---|
| char * configFile | Path to the config file |
| MPI_Comm globalComm | MPI communicator used for the execution |

OUTPUT

| Value | Reason |
|---|---|
| FTI_SCES | Success |
| FTI_NSCS | No success |
| FTI_NREC | FTI could not recover checkpoint files |

DESCRIPTION
FTI_Init initializes the FTI context. It must be called before any other FTI
function and after MPI_Init.
EXAMPLE
int main ( int argc , char **argv ) {
    MPI_Init (&argc , &argv );
    char *path = "config.fti"; // config file path
    int res = FTI_Init ( path , MPI_COMM_WORLD );
    if (res == FTI_NREC) {
        printf("Recovery not possible, terminating...\n");
        FTI_Finalize();
        MPI_Finalize();
        return 1;
    }
    .
    .
    .
    return 0;
}
FTI_InitType
- Initializes a data type.
DEFINITION
int FTI_InitType ( FTIT_type *type , int size )
INPUT

| Variable | What for? |
|---|---|
| FTIT_type * type | The data-type to be initialized |
| int size | The size of the data-type to be initialized |

OUTPUT

| Value | Reason |
|---|---|
| FTI_SCES | Success |

DESCRIPTION
FTI_InitType initializes an FTI data-type. A data-type that FTI does not define by default (see FTI Datatypes) must be defined with this function before variables of that type can be protected with FTI_Protect.
EXAMPLE
typedef struct A {
    int a;
    int b;
} A;
FTIT_type structAinfo;
FTI_InitType(&structAinfo, 2 * sizeof(int));
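The example passes the size as a byte count built from `sizeof`. For a struct whose members have different alignment requirements the compiler may insert padding, so adding up member sizes can undercount. The following is a minimal sketch in plain C with no FTI calls; `Particle`, `particle_member_sum`, and `particle_full_size` are hypothetical names used only for illustration.

```c
#include <stddef.h>

/* Hypothetical struct, not part of FTI: a char followed by a double
 * usually forces padding after `tag` so that `mass` is aligned. */
typedef struct Particle {
    char   tag;
    double mass;
} Particle;

/* Sum of the member sizes, ignoring any padding. */
size_t particle_member_sum(void) {
    return sizeof(char) + sizeof(double);
}

/* Size the compiler actually reserves for one Particle. */
size_t particle_full_size(void) {
    return sizeof(Particle);
}
```

Under this assumption, passing `sizeof(Particle)` as the size argument would cover the padding bytes as well. For the struct `A` in the example both computations typically agree, because both members are `int`.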
FTI_Protect
- Stores metadata concerning the variable to protect.
DEFINITION
int FTI_Protect ( int id, void *ptr, long count, FTIT_type type )
INPUT

| Variable | What for? |
|---|---|
| int id | Unique ID of the variable to protect |
| void * ptr | Pointer to memory address of variable |
| long count | Number of elements at memory address |
| FTIT_type type | FTI data type of variable to protect |

OUTPUT

| Value | Reason |
|---|---|
| FTI_SCES | Success |
| FTI_NSCS | No success |

DESCRIPTION
FTI_Protect adds data fields to the list of protected variables. Data protected by this function is stored during a call to FTI_Checkpoint or FTI_Snapshot and restored during a call to FTI_Recover.
If the dimension of a protected variable changes during execution, a subsequent call to FTI_Protect with the same ID updates the metadata within FTI, so that the correct dimensions are stored by the next call to FTI_Checkpoint or FTI_Snapshot.
EXAMPLE
int A;
float *B = malloc (sizeof(float) * 10) ;
FTI_Protect(1, &A, 1, FTI_INTG );
FTI_Protect(2, B, 10, FTI_SFLT );
// changing B size
B = realloc(B, sizeof(float) * 20) ;
// updating B size in protected list
FTI_Protect(2, B, 20, FTI_SFLT);
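The update-on-repeat behaviour described above can be pictured as a small table keyed by ID. The following is a minimal sketch in plain C and not FTI's implementation: `ToyEntry`, `TOY_MAX_VARS`, `toy_protect`, `toy_stored_bytes`, and `toy_num_protected` are hypothetical names, and the element size is passed directly as a byte count instead of an `FTIT_type`.

```c
#include <stddef.h>

#define TOY_MAX_VARS 16   /* arbitrary capacity for this sketch */

/* One entry of the toy table: what a checkpoint would need to know. */
typedef struct ToyEntry {
    int    id;
    void  *ptr;
    long   count;
    size_t elemSize;
} ToyEntry;

static ToyEntry toyTable[TOY_MAX_VARS];
static int      toyUsed = 0;

/* Register a variable, or refresh the entry if the id is already known.
 * Returns 0 on success and -1 if the table is full. */
int toy_protect(int id, void *ptr, long count, size_t elemSize) {
    for (int i = 0; i < toyUsed; i++) {
        if (toyTable[i].id == id) {          /* known id: update metadata */
            toyTable[i].ptr = ptr;
            toyTable[i].count = count;
            toyTable[i].elemSize = elemSize;
            return 0;
        }
    }
    if (toyUsed == TOY_MAX_VARS) return -1;
    toyTable[toyUsed++] = (ToyEntry){ id, ptr, count, elemSize };
    return 0;
}

/* Number of bytes a checkpoint would write for `id`, or 0 if unknown. */
long toy_stored_bytes(int id) {
    for (int i = 0; i < toyUsed; i++) {
        if (toyTable[i].id == id)
            return toyTable[i].count * (long)toyTable[i].elemSize;
    }
    return 0;
}

/* Number of distinct ids registered so far. */
int toy_num_protected(void) {
    return toyUsed;
}
```

Registering ID 2 twice, first with 10 and then with 20 elements, leaves a single entry whose byte count reflects the second call. This mirrors the `realloc` followed by a second `FTI_Protect(2, ...)` in the example.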
FTI_GetStoredSize
- Returns size of protected variable saved in metadata.
DEFINITION
long FTI_GetStoredSize ( int id )
INPUT

| Variable | What for? |
|---|---|
| int id | ID of the protected variable |

OUTPUT

| Value | Reason |
|---|---|
| long | Size of a variable |
| 0 | No success |

DESCRIPTION
FTI_GetStoredSize returns the size of the protected variable with ID id, as recorded in the FTI metadata. The result may differ from the size of the variable known to the application at that moment. If the function is called on a restart, it returns the size stored in the metadata file. If it is called during execution, it returns the value stored in the FTI runtime metadata, that is, the size of the variable at the moment of the last checkpoint.
The function is needed to manually reallocate memory on recovery for protected variables whose size varies. FTI_Realloc offers another way to reallocate the memory.
EXAMPLE
...
long* array = calloc(arraySize, sizeof(long));
FTI_Protect(1, array, arraySize, FTI_LONG);
if (FTI_Status() != 0) {
    long arraySizeInBytes = FTI_GetStoredSize(1);
    if (arraySizeInBytes == 0) {
        printf("No stored size in metadata!\n");
        return GETSTOREDSIZE_FAILED;
    }
    array = realloc(array, arraySizeInBytes);
    //update arraySize
    arraySize = arraySizeInBytes / sizeof(long);
    //realloc may move the block: register the new address and size again
    FTI_Protect(1, array, arraySize, FTI_LONG);
    int res = FTI_Recover();
    if (res != 0) {
        printf("Recovery failed!\n");
        return RECOVERY_FAILED;
    }
}
for (i = 0; i < max; i++) {
    if (i % CKTP_STEP == 0) {
        //update FTI array size information
        FTI_Protect(1, array, arraySize, FTI_LONG);
        int res = FTI_Checkpoint((i / CKTP_STEP) + 1, 1);
        if (res != FTI_DONE) {
            printf("Checkpoint failed!\n");
            return CHECKPOINT_FAILED;
        }
    }
    ...
    //add element to array
    arraySize += 1;
    array = realloc(array, arraySize * sizeof(long));
}
...
FTI_Realloc
- Reallocates dataset to last checkpoint size.
DEFINITION
void* FTI_Realloc ( int id, void* ptr )
INPUT

| Variable | What for? |
|---|---|
| int id | ID of the protected variable |
| void * ptr | Pointer to memory address of variable |

OUTPUT

| Value | Reason |
|---|---|
| void* | Pointer to reallocated data |
| NULL | On failure |

DESCRIPTION
FTI_Realloc is called on recovery for protected variables with dynamic size. It reallocates enough memory at the given address to hold the checkpoint data. It must be called before FTI_Recover to prevent segmentation faults. If the application has to, or prefers to, perform the reallocation itself, FTI provides the function FTI_GetStoredSize to query the size of the variable in the checkpoint to be recovered.
EXAMPLE
...
FTI_Protect(1, &arraySize, 1, FTI_INTG);
long* array = calloc(arraySize, sizeof(long));
FTI_Protect(2, array, arraySize, FTI_LONG);
if (FTI_Status() != 0) {
    array = FTI_Realloc(2, array);
    if (array == NULL) {
        printf("Reallocation failed!\n");
        return REALLOC_FAILED;
    }
    int res = FTI_Recover();
    if (res != 0) {
        printf("Recovery failed!\n");
        return RECOVERY_FAILED;
    }
}
for (i = 0; i < max; i++) {
    if (i % CKTP_STEP == 0) {
        //update FTI array size information
        FTI_Protect(2, array, arraySize, FTI_LONG);
        int res = FTI_Checkpoint((i / CKTP_STEP) + 1, 1);
        if (res != FTI_DONE) {
            printf("Checkpoint failed!\n");
            return CHECKPOINT_FAILED;
        }
    }
    ...
    //add element to array
    arraySize += 1;
    array = realloc(array, arraySize * sizeof(long));
}
...
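When the application performs the reallocation itself, the byte count obtained from FTI_GetStoredSize has to be turned into a buffer of the right size without losing the old block if the allocation fails. The following is a minimal sketch under those assumptions; `resize_to_stored` is a hypothetical helper, not an FTI function, and `storedBytes` stands in for the value the application got from FTI_GetStoredSize.

```c
#include <stdlib.h>

/* Resize *buf so that it holds storedBytes bytes.
 * A storedBytes of 0 is treated as "no size available", matching the
 * failure value documented for FTI_GetStoredSize.
 * Returns the resulting element count, or -1 and leaves *buf untouched. */
long resize_to_stored(void **buf, long storedBytes, size_t elemSize) {
    if (buf == NULL || storedBytes <= 0 || elemSize == 0) return -1;
    if ((size_t)storedBytes % elemSize != 0) return -1;  /* not a whole number of elements */

    void *tmp = realloc(*buf, (size_t)storedBytes);
    if (tmp == NULL) return -1;   /* the old block is still valid */

    *buf = tmp;
    return (long)((size_t)storedBytes / elemSize);
}
```

Because `realloc` may move the block, the new address would presumably have to be registered again with FTI_Protect before FTI_Recover is called. That step is an inference from the FTI_Protect description and is not performed by this sketch.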
FTI_Checkpoint
- Stores protected variables in the checkpoint of a desired safety level.
DEFINITION
int FTI_Checkpoint( int id, int level )
INPUT

| Variable | What for? |
|---|---|
| int id | Unique checkpoint ID |
| int level | Checkpoint level (1=L1, 2=L2, 3=L3, 4=L4) |

OUTPUT

| Value | Reason |
|---|---|
| FTI_DONE | Success |
| FTI_NSCS | Failure |

DESCRIPTION
FTI_Checkpoint stores the current values of the protected variables in a checkpoint of safety level level (see Multilevel-Checkpointing for descriptions of the individual levels).
NOTICE: The checkpoint id must be different from 0!
EXAMPLE
int i;
for (i = 0; i < 100; i++) {
    if (i % 10 == 0) {
        FTI_Checkpoint(i / 10 + 1, 1);
    }
    .
    . // some computations
    .
}
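The id argument is described as unique and must not be 0. One way to derive such an id from the loop counter, mirroring the arithmetic of the example above, is sketched below. `is_ckpt_iteration` and `ckpt_id_for` are hypothetical helpers, not FTI functions, and `step` is assumed to be positive.

```c
/* Nonzero when iteration i falls on a checkpoint boundary. */
int is_ckpt_iteration(int i, int step) {
    return step > 0 && i >= 0 && i % step == 0;
}

/* Checkpoint id for iteration i: 1 for the first interval, 2 for the
 * second, and so on. For i >= 0 and step > 0 the result is never 0. */
int ckpt_id_for(int i, int step) {
    return i / step + 1;
}
```

Using integer division gives every checkpoint interval its own id, whereas an id built from `i % step` would repeat from one interval to the next.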
FTI_Status
- Returns the current status of the recovery flag.
DEFINITION
int FTI_Status()
OUTPUT

| Value | Reason |
|---|---|
| 0 | No checkpoints taken yet or recovered successfully |
| 1 | At least one checkpoint is taken. If execution fails, the next start will be a restart |
| 2 | The execution is a restart from checkpoint level L4 and keep_last_checkpoint was enabled during the last execution |

DESCRIPTION
FTI_Status returns the current status of the recovery flag.
EXAMPLE
if ( FTI_Status () != 0) {
    .
    . // this section will be executed during restart
    .
}
FTI_Recover
- Recovers the data of the protected variables from the checkpoint file.
DEFINITION
int FTI_Recover()
OUTPUT

| Value | Reason |
|---|---|
| FTI_SCES | Success |
| FTI_NSCS | Failure |

DESCRIPTION
FTI_Recover loads the data from the checkpoint file into the protected variables. It only recovers variables that were protected by a preceding call to FTI_Protect. If a variable changes its size during execution, the proper amount of memory has to be allocated for that variable before the call to FTI_Recover. FTI provides the API functions FTI_GetStoredSize and FTI_Realloc for this case.
EXAMPLE
Basic example:
if ( FTI_Status() == 1 ) {
    FTI_Recover() ;
}
FTI_Snapshot
- Invokes the recovery of protected variables on a restart.
- Writes multilevel checkpoints regarding their requested frequencies during execution.
DEFINITION
int FTI_Snapshot()
OUTPUT

| Value | Reason |
|---|---|
| FTI_SCES | Successful call (without checkpointing) or successful recovery |
| FTI_NSCS | Failure of FTI_Checkpoint |
| FTI_DONE | Success of FTI_Checkpoint |
| FTI_NREC | Failure on recovery |

DESCRIPTION
On a restart, FTI_Snapshot loads the data from the checkpoint file into the protected variables. During execution it takes checkpoints according to the checkpoint frequencies of the various safety levels. The frequencies can be set in the configuration file (see e.g. ckpt_L1).
FTI_Snapshot can only take care of variables that are protected by a preceding call to FTI_Protect.
EXAMPLE
int res = FTI_Snapshot();
if ( res == FTI_SCES ) {
    .
    . // executed after successful recover
    . // or when checkpoint is not required
}
else if ( res == FTI_DONE ) {
    .
    . // executed after successful checkpointing
    .
}
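Since FTI_Snapshot can return four different values, a caller may want to branch on each of them explicitly, including the two failure codes. The following is a minimal sketch: the numeric values are copied from the FTI Constants list of this reference so that the block compiles without the FTI header, and `SnapOutcome` and `classify_snapshot` are hypothetical names.

```c
/* Values copied from the FTI Constants list; with the FTI header
 * available, FTI_SCES, FTI_DONE, FTI_NSCS and FTI_NREC would be used. */
enum { SKETCH_SCES = 0, SKETCH_DONE = 1, SKETCH_NSCS = -1, SKETCH_NREC = -2 };

typedef enum {
    SNAP_CONTINUE,        /* nothing written, or data recovered */
    SNAP_CHECKPOINTED,    /* a checkpoint was written */
    SNAP_CKPT_FAILED,     /* the checkpoint attempt failed */
    SNAP_RECOVER_FAILED,  /* restart data could not be loaded */
    SNAP_UNKNOWN          /* value not listed in the OUTPUT table */
} SnapOutcome;

/* Map a return value of FTI_Snapshot to one of the outcomes above. */
SnapOutcome classify_snapshot(int res) {
    switch (res) {
    case SKETCH_SCES: return SNAP_CONTINUE;
    case SKETCH_DONE: return SNAP_CHECKPOINTED;
    case SKETCH_NSCS: return SNAP_CKPT_FAILED;
    case SKETCH_NREC: return SNAP_RECOVER_FAILED;
    default:          return SNAP_UNKNOWN;
    }
}
```

An application could, for instance, keep computing on `SNAP_CONTINUE` and `SNAP_CHECKPOINTED`, log a warning on `SNAP_CKPT_FAILED`, and abort on `SNAP_RECOVER_FAILED`. Which reaction is appropriate is a design choice of the application and is not prescribed by this reference.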
FTI_Finalize
- Frees the allocated memory.
- Communicates the end of the execution to dedicated threads.
- Cleans checkpoints and metadata.
DEFINITION
int FTI_Finalize()
OUTPUT

| Value | Reason |
|---|---|
| FTI_SCES | For application process |
| exit(0) | For FTI process |

DESCRIPTION
FTI_Finalize notifies the FTI processes that the execution is over, frees the FTI internal data structures, and cleans up the checkpoint folders after a normal execution. If the setting keep_last_ckpt is set, it flushes local checkpoint files (if present) to the PFS. If the setting head is set to 1, it also terminates the FTI processes. It should be called before MPI_Finalize().
EXAMPLE
int main ( int argc , char ** argv ) {
    .
    .
    .
    FTI_Finalize () ;
    MPI_Finalize () ;
    return 0;
}