Description
Brief summary of bug
In my work on memory scaling for #2995, I have to set the memory padding that the OS allows on allocations to zero in order to see deallocations decrease memory use. When I do that, I also see a segmentation fault that occurs later on, in a data read in UrbanParamsType.
NOTE:
It turns out the above statement is incorrect, and the problem really is due to my changes in the decompInit code. The discussion below on malloc_trim is still a valid concern, but it is not relevant right now.
General bug information
CTSM version you are using: ctsm5.3.071-93-g92c9a3c10
Specifically this is my ekluzek/decomp_init_for_testing_work branch in #3412
Does this bug cause significantly incorrect results in the model's science? No (although memory overruns can do random things).
Configurations affected:
Only my highly specialized case, where I use the glibc C function malloc_trim to set the trim size to 0 (from John Dennis).
That is pretty much just my decomp_init work branches.
The testcase I'm running here is:
SMS_D_Ln1_Mmpi-serial.1x1_brazil.I2000Clm45Sp.derecho_intel.clm-run_self_tests
This does pass in vanilla ctsm5.3.071
Details of bug
My setup is merely exposing an existing bug that we don't normally see, because the OS has allocated extra memory that the memory overrun is accessing.
Important details of your setup / configuration so we can reproduce the bug
Use my add_jdennis_procstatus_module branch of cesm_share code, which adds the proc_status_vm.F90 module. Make a call to
call shr_malloc_trim()
in decompInit_lnd
Later on the code aborts with a seg fault.
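For readers unfamiliar with the underlying call, here is a minimal C sketch of what trimming with zero padding looks like. It assumes glibc on Linux (malloc_trim is a glibc extension declared in malloc.h); the helper name free_then_trim is hypothetical and is not the implementation of shr_malloc_trim, which is not shown in this issue.

```c
#include <assert.h>
#include <malloc.h>   /* glibc-specific: declares malloc_trim */
#include <stdlib.h>

/* Hypothetical helper (not from cesm_share): allocate and free `nblocks`
 * blocks of `block_size` bytes, then ask glibc to hand free heap memory
 * back to the OS while keeping zero bytes of padding.
 * Returns malloc_trim's result (1 if memory was released, 0 if not),
 * or -1 if an allocation failed. */
int free_then_trim(size_t nblocks, size_t block_size)
{
    assert(nblocks > 0 && block_size > 0);

    void **blocks = malloc(nblocks * sizeof *blocks);
    if (blocks == NULL) return -1;

    size_t allocated = 0;
    for (; allocated < nblocks; allocated++) {
        blocks[allocated] = malloc(block_size);
        if (blocks[allocated] == NULL) break;
    }
    int ok = (allocated == nblocks);

    /* Free everything that was successfully allocated. */
    for (size_t i = 0; i < allocated; i++) free(blocks[i]);
    free(blocks);
    if (!ok) return -1;

    /* pad = 0: leave no extra free space at the top of the heap. */
    return malloc_trim(0);
}
```

Whether the call reports 1 or 0 depends on the allocator state, so the return value is only informational here.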
What John tells me is that this means the code is doing a memory overrun (a subscript overflow or some other access of memory outside of what's allocated). It isn't caught under normal conditions because the OS has extra memory available as "trim", so the out-of-bounds access still lands in memory the process owns. Since going outside of memory bounds is never good behavior, it can be useful to track down. Hopefully it's not doing anything harmful or changing answers, but you can't know until you track down the problem. He also notes that tracking this kind of thing down can be difficult.
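One generic way to catch an overrun that padding would otherwise hide is to place a guard region after a buffer and check it after the suspect write. The following is a minimal C sketch of that idea, not something taken from CTSM or PIO; every name in it (guard_intact_after, fill_exact, fill_one_past, GUARD_DOUBLES, GUARD_BYTE) is hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define GUARD_DOUBLES 8     /* size of the guard region, in doubles */
#define GUARD_BYTE    0xA5  /* byte pattern written into the guard  */

typedef void (*fill_fn)(double *data, size_t n);

/* Allocate n doubles followed by a guard region, run `fill`, and report
 * whether the guard survived: 1 = intact, 0 = overwritten, -1 = no memory.
 * The guard is part of the allocation, so a small overrun is detected
 * without touching memory the program does not own. */
int guard_intact_after(fill_fn fill, size_t n)
{
    size_t data_bytes  = n * sizeof(double);
    size_t guard_bytes = GUARD_DOUBLES * sizeof(double);
    unsigned char *raw = malloc(data_bytes + guard_bytes);
    if (raw == NULL) return -1;

    memset(raw + data_bytes, GUARD_BYTE, guard_bytes);
    fill((double *)raw, n);

    int intact = 1;
    for (size_t i = 0; i < guard_bytes; i++) {
        if (raw[data_bytes + i] != GUARD_BYTE) { intact = 0; break; }
    }
    free(raw);
    return intact;
}

/* Well-behaved writer: touches exactly n elements. */
void fill_exact(double *data, size_t n)
{
    for (size_t i = 0; i < n; i++) data[i] = 1.0;
}

/* Buggy writer: off-by-one, writes n + 1 elements. */
void fill_one_past(double *data, size_t n)
{
    for (size_t i = 0; i <= n; i++) data[i] = 1.0;
}
```

Calling guard_intact_after(fill_exact, 16) leaves the guard untouched, while guard_intact_after(fill_one_past, 16) reports the overwrite. Compiler bounds checking or a memory debugger does the same job more thoroughly; the sketch only shows why an overrun can go unnoticed while spare memory sits behind the buffer.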
Important output or errors that show the problem
(t_initf) profile_papi_enable= F
GPTLprint_memusage: Using Kbytesperpage=4
sysmem size=1164.9 MB rss=726.6 MB share=41.9 MB text=83.2 MB datastack=0.0 MB
sysmem size=1164.9 MB rss=726.6 MB share=42.0 MB text=83.2 MB datastack=0.0 MB
sysmem size=1164.9 MB rss=726.7 MB share=42.0 MB text=83.2 MB datastack=0.0 MB
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
libpthread-2.31.s 00001484F0AFD8C0 Unknown Unknown Unknown
libpioc.so 00001484FE318B8C pio_sorted_copy Unknown Unknown
libpioc.so 00001484FE314E80 PIOc_read_darray Unknown Unknown
libpiof.so 00001484FE5AB8F3 piodarray_mp_read Unknown Unknown
libpiof.so 00001484FE5A872C piodarray_mp_read Unknown Unknown
cesm.exe 0000000000A35A29 ncdio_pio_mp_read 2262 ncdio_pio.F90.in
cesm.exe 0000000000A31348 ncdio_pio_mp_read 2237 ncdio_pio.F90.in
cesm.exe 0000000000A2DF00 ncdio_pio_mp_ncd_ 2166 ncdio_pio.F90.in
cesm.exe 0000000003A4F2AB urbanparamstype_m 572 UrbanParamsType.F90
cesm.exe 0000000000AAAFB0 clm_initializemod 244 clm_initializeMod.F90
cesm.exe 00000000009B0340 lnd_comp_nuopc_mp 681 lnd_comp_nuopc.F90
libesmf.so 00001484F35CC0F9 callVFuncPtr 2187 ESMCI_FTable.C
libesmf.so 00001484F35CB138 ESMCI_FTableCallE 844 ESMCI_FTable.C
libesmf.so 00001484F3A9A4B2 enter 2557 ESMCI_VMKernel.C
libesmf.so 00001484F3A81DA6 enter 1303 ESMCI_VM.C
The line in question is the read of ALB_IMPROAD_DIR from the surface dataset here:
call ncd_io(ncid=ncid, varname='ALB_IMPROAD_DIR', flag='read', data=urbinp%alb_improad_dir, &
dim1name=grlnd, readvar=readvar)
There are plenty of reads that happen before this, so I don't know what the problem might be here. The problem also could be in the ncdio_pio layer in CTSM or the PIO layer?
The line in ncdio_pio at the lowest level is:
call pio_read_darray(ncid, vardesc, iodesc, data, status)
The memory overrun could be in one of the PIO-typed arguments: ncid, vardesc, or iodesc.
For CTSM it could be data, where data is double precision and three-dimensional, allocated with this shape:
urbinp%alb_improad_dir(begg:endg, numurbl, numrad)
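To make the size question concrete, here is a small C sketch of the element count implied by a (begg:endg, numurbl, numrad) shape, and a check that a destination buffer is large enough before a read fills it. The function names are hypothetical, and the extents used in the usage note (one gridcell, numurbl = 3, numrad = 2) are illustrative assumptions rather than values taken from this run.

```c
#include <assert.h>
#include <stddef.h>

/* Element count of a Fortran array declared (begg:endg, numurbl, numrad).
 * Returns 0 if any extent is empty or negative. */
size_t expected_elems(long begg, long endg, long numurbl, long numrad)
{
    if (endg < begg || numurbl <= 0 || numrad <= 0) return 0;
    return (size_t)(endg - begg + 1) * (size_t)numurbl * (size_t)numrad;
}

/* 1 if a buffer holding buf_elems doubles can take the full read,
 * 0 if the read would run past the end of the buffer (or the shape is empty). */
int read_fits(size_t buf_elems, long begg, long endg, long numurbl, long numrad)
{
    size_t need = expected_elems(begg, endg, numurbl, numrad);
    return need > 0 && buf_elems >= need;
}
```

With the illustrative extents, expected_elems(1, 1, 3, 2) is 6, so read_fits(6, 1, 1, 3, 2) holds and read_fits(5, 1, 1, 3, 2) does not.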
NOTE: THE FOLLOWING COMMENT TURNS OUT TO BE WRONG:
So if numurbl or numrad is wrong there, that could be the explanation. And actually they aren't input, so how does this even work at all?