DAS scripting needs to clean up bkg files from previous segment before GEOSgcm.x runs #188

@bena-nasa

@rtodling
I was trying to run what is essentially the develop branch of GEOSadas (for reasons not pertinent to this issue) with the basic C48f test case, which apparently has not been tested on this branch.
I hit an issue that you will need to resolve: the bkg files needed by mkiau.x must be removed after it runs and BEFORE GEOSgcm.x runs. Here are the details:

When I ran that test case, GEOSgcm.x failed at 3z with a netCDF error: it was inquiring for the varid of a variable, by name, in a file it was in the process of writing, and the error code said the variable did not exist in the file. I added a print to find out which variable it was and, as you can see, it was EVAP:

 AGCM Date: 2019/01/17  Time: 02:52:30  Throughput(days/day)[Avg Tot Run]:    303.1    338.0    338.8  TimeRemaining(Est) 000:01:12   26.8% :  14.9% Mem Comm:Used

 Writing:    549 Slices to File:  C48f_ben.inst3_3d_asm_Np.20190117_0300z.nc4
 Writing:   1083 Slices to File:  C48f_ben.inst3_3d_asm_Nv.20190117_0300z.nc4
 Writing:    378 Slices to File:  C48f_ben.tavg3_3d_cld_Cp.20190117_0130z.nc4
 Writing:    365 Slices to File:  C48f_ben.tavg3_3d_mst_Ne.20190117_0130z.nc4
 Writing:    729 Slices to File:  C48f_ben.bkg.eta.20190117_0300z.nc4
 Writing:     52 Slices to File:  C48f_ben.bkg.sfc.20190117_0300z.nc4
 Writing:    290 Slices to File:  C48f_ben.cbkg.eta.20190117_0300z.nc4
 Writing:    587 Slices to File:  C48f_ben.vtx.mix.20190117_03z.nc4
 Writing:    729 Slices to File:  C48f_ben.asm.eta.20190117_0300z.nc4
 bmaa failed write variable EVAP
pe=00085 FAIL at line=00030    NetCDF4_put_var.H                        <status=-49>
pe=00085 FAIL at line=00842    ServerThread.F90                         <status=-49>
pe=00085 FAIL at line=00138    BaseServer.F90                           <status=-49>

This was weird. The only plausible way for the variable to be missing from the file is if the file already existed, so I added more prints and made the code stop if it tries to open an already existing file whose name contains the experiment ID. I saw this:

 AGCM Date: 2019/01/17  Time: 02:52:30  Throughput(days/day)[Avg Tot Run]:    311.5    353.6    354.4  TimeRemaining(Est) 000:01:10   31.8% :  28.1% Mem Comm:Used

 Writing:    549 Slices to File:  C48f_ben.inst3_3d_asm_Np.20190117_0300z.nc4
 Writing:   1083 Slices to File:  C48f_ben.inst3_3d_asm_Nv.20190117_0300z.nc4
 Writing:    378 Slices to File:  C48f_ben.tavg3_3d_cld_Cp.20190117_0130z.nc4
 Writing:    365 Slices to File:  C48f_ben.tavg3_3d_mst_Ne.20190117_0130z.nc4
 Writing:    729 Slices to File:  C48f_ben.bkg.eta.20190117_0300z.nc4
 Writing:     52 Slices to File:  C48f_ben.bkg.sfc.20190117_0300z.nc4
 Writing:    290 Slices to File:  C48f_ben.cbkg.eta.20190117_0300z.nc4
 Writing:    587 Slices to File:  C48f_ben.vtx.mix.20190117_03z.nc4
 Writing:    729 Slices to File:  C48f_ben.asm.eta.20190117_0300z.nc4
pe=00049 FAIL at line=00265    NetCDF4_FileFormatter.F90                <file exists: C48f_ben.bkg.eta.20190117_0300z.nc4>
pe=00006 FAIL at line=00265    NetCDF4_FileFormatter.F90                <file exists: C48f_ben.cbkg.eta.20190117_0300z.nc4>
pe=00095 FAIL at line=00265    NetCDF4_FileFormatter.F90                <file exists: C48f_ben.bkg.sfc.20190117_0300z.nc4>

I thought that was odd; why does the file exist? I re-ran the experiment and stopped it as soon as the GSI started. When I did an

ls C48f_ben*.nc4

in the fvwork I saw this:

(noback/fvwork.48856) > ls C48f_ben.*.nc4
C48f_ben.bkg.eta.20190116_2100z.nc4  C48f_ben.bkg.sfc.20190116_2100z.nc4  C48f_ben.cbkg.eta.20190116_2100z.nc4
C48f_ben.bkg.eta.20190117_0000z.nc4  C48f_ben.bkg.sfc.20190117_0000z.nc4  C48f_ben.cbkg.eta.20190117_0000z.nc4
C48f_ben.bkg.eta.20190117_0300z.nc4  C48f_ben.bkg.sfc.20190117_0300z.nc4  C48f_ben.cbkg.eta.20190117_0300z.nc4

So those files were already there at the time the experiment was created. I realized they must be the backgrounds from the previous segment that mkiau.x needs. If you look at the History.rc.tmpl you get with the develop branch of GEOSadas, you will see that the bkg.sfc collection has an "EVAP" variable, and that the collection does not start writing until 3z, when it produces the backgrounds for the next segment. That is exactly when GEOSgcm.x was crashing. BUT the bkg.sfc files that get copied in when the experiment is created, to produce the increments for the current segment, don't have EVAP.

So what is going on is that at 3z History tries to write the bkg.sfc file, but it already exists. When a file already exists, the server just opens it and tries to write to it, so of course the varid inquiry for EVAP fails!
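
To make the failure mode concrete, here is a toy Python model of it. It is only a sketch: `ToyFileServer` is a hypothetical stand-in, not the actual output server or the netCDF API, and `VAR_A`/`VAR_B` are placeholder variable names. The one behaviour it copies from the description above is that an existing file is reused as-is instead of being recreated.

```python
class ToyFileServer:
    """Toy stand-in for the output server: file name -> set of variable names."""

    def __init__(self):
        self.files = {}

    def open_for_write(self, name, variables):
        # An existing file is reused untouched; only a missing file is
        # created with the variables of the current collection.
        if name not in self.files:
            self.files[name] = set(variables)
        return name

    def put_var(self, name, variable):
        # Analogue of the varid inquiry that precedes a write.
        if variable not in self.files[name]:
            raise KeyError(f"variable {variable} not found in {name}")
        return f"wrote {variable} to {name}"


server = ToyFileServer()
stale = "C48f_ben.bkg.sfc.20190117_0300z.nc4"
# Stale background left over from the previous segment (placeholder variables).
server.files[stale] = {"VAR_A", "VAR_B"}

# The current collection definition includes EVAP.
handle = server.open_for_write(stale, ["VAR_A", "VAR_B", "EVAP"])
try:
    outcome = server.put_var(handle, "EVAP")
except KeyError as err:
    outcome = f"failed: {err.args[0]}"
print(outcome)
```

With the stale entry removed before `open_for_write`, the same call creates the file with EVAP and the write goes through.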

This is really a problem with the DAS scripting.
The DAS scripting should remove the old background files before GEOSgcm.x runs; no file that History will try to write should already be there. The fact that this worked before means you were just lucky and apparently were not changing the contents of the bkg files.
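
As a rough illustration of that cleanup step, the sketch below deletes the leftover background files from a work directory. It is written in Python purely for illustration and is not the DAS scripting itself; the function name `remove_stale_bkg` is hypothetical, and the list of file-name stems is an assumption taken from the directory listing above. It would have to run after mkiau.x has read the files and before GEOSgcm.x starts.

```python
from pathlib import Path

# Stems taken from the fvwork listing above; treat this list as an assumption.
BKG_STEMS = ("bkg.eta", "bkg.sfc", "cbkg.eta")


def remove_stale_bkg(workdir, expid, stems=BKG_STEMS):
    """Delete <expid>.<stem>.*.nc4 from workdir and return the removed names."""
    removed = []
    for stem in stems:
        for path in sorted(Path(workdir).glob(f"{expid}.{stem}.*.nc4")):
            path.unlink()
            removed.append(path.name)
    return removed
```

For the case above, the call would look like `remove_stale_bkg("fvwork.48856", "C48f_ben")`. Other History output in the same directory, such as the asm.eta files, is left untouched because it does not match any of the stems.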

HistoryGridComp should also check, when it decides to write a new file, whether that file already exists and error out if it does; otherwise the problem can surface at a different point in the code where the error is less clear. I will make that change in our development branch so that an existing file is caught and reported when History decides it is time to write a new file, rather than during the actual write, where the error is more confusing.
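
The idea of failing early can be sketched in a few lines of Python. This is not the HistoryGridComp change itself (that code is Fortran); `create_new_output` is a hypothetical helper that only shows the pattern of refusing to reuse an existing file at creation time and reporting its name.

```python
import os


def create_new_output(path):
    """Create path for writing, failing immediately if it already exists."""
    try:
        # Mode "x" is exclusive creation: it raises FileExistsError instead
        # of reopening a file left behind by an earlier run.
        return open(path, "xb")
    except FileExistsError:
        raise RuntimeError(f"file exists: {os.path.basename(path)}") from None
```

The error then names the offending file at the moment the write is scheduled, instead of showing up later as a missing-variable error in the middle of the write.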

Labels

bug (Something isn't working)
