Seg fault in UrbanParamsType that happens as a result of changing the decompInit code #3447

@ekluzek

Brief summary of bug

In my work on memory scaling for #2995, I have to set the memory padding that the OS allows on allocations to zero in order to see deallocations reduce memory use. When I do that, I also see a segmentation fault later on, in a data read in UrbanParamsType.

NOTE:
It turns out the above statement is incorrect: the problem really is due to my changes in the decompInit code. The discussion below on malloc_trim is still a valid concern, but it is not relevant right now.

General bug information

CTSM version you are using: ctsm5.3.071-93-g92c9a3c10
Specifically this is my ekluzek/decomp_init_for_testing_work branch in #3412

Does this bug cause significantly incorrect results in the model's science? No (although memory overruns can do random things).

Configurations affected:

Only my highly specialized case, where I use the glibc C function malloc_trim to set the trim size to 0 (a suggestion from John Dennis).

Which is pretty much just my decomp_init work branches.

The testcase I'm running here is:

SMS_D_Ln1_Mmpi-serial.1x1_brazil.I2000Clm45Sp.derecho_intel.clm-run_self_tests

This does pass in vanilla ctsm5.3.071

Details of bug

My setup is merely exposing an existing bug that we don't normally see, because the OS has allocated extra memory and the memory overrun lands in that extra memory.

Important details of your setup / configuration so we can reproduce the bug

Use my add_jdennis_procstatus_module branch of cesm_share code, which adds the proc_status_vm.F90 module. Make a call to

   call shr_malloc_trim()

in decompInit_lnd

Later on the code aborts with a seg fault.

What John tells me is that this means the code is doing a memory overrun (a subscript overflow or some other access of memory outside of what's allocated). Under normal conditions this isn't caught, because the OS has extra memory available as "trim", so the out-of-bounds access lands in memory the process still owns and no fault is raised. Going outside of memory bounds is never good behavior, so this is worth tracking down. Hopefully it's not doing anything harmful or changing answers, but you can't know until you track down the problem. He also warns that this kind of thing can be difficult to track down.
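For reference, here is a minimal sketch of how a Fortran program can ask glibc to hand its free heap padding back to the OS. It is an assumption-laden illustration, not the actual shr_malloc_trim implementation from proc_status_vm.F90: the program name and the interface block are made up here, and it only links on systems that use glibc, which provides `int malloc_trim(size_t pad)`.

```fortran
program malloc_trim_sketch
  ! Hypothetical stand-alone example; NOT the cesm_share implementation.
  use, intrinsic :: iso_c_binding, only: c_int, c_size_t
  implicit none

  interface
     ! glibc prototype: int malloc_trim(size_t pad);
     function malloc_trim(pad) bind(C, name="malloc_trim") result(rc)
       import :: c_int, c_size_t
       integer(c_size_t), value :: pad   ! bytes of free space to keep at the top of the heap
       integer(c_int) :: rc              ! 1 if memory was released to the OS, 0 otherwise
     end function malloc_trim
  end interface

  real(kind(1.0d0)), allocatable :: work(:)
  integer(c_int) :: rc

  ! Allocate, touch, and free some memory so the allocator has something to give back.
  allocate(work(1000000))
  work = 1.0d0
  deallocate(work)

  ! A pad of 0 asks glibc to keep no slack at the top of the heap.
  rc = malloc_trim(0_c_size_t)
  print *, 'malloc_trim returned', rc
end program malloc_trim_sketch
```

With the slack gone, an access past the end of an allocation is more likely to hit an unmapped page and fault right away, which matches the behavior described above.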

Important output or errors that show the problem

 (t_initf)       profile_papi_enable=      F
GPTLprint_memusage: Using Kbytesperpage=4
  sysmem size=1164.9 MB rss=726.6 MB share=41.9 MB text=83.2 MB datastack=0.0 MB
  sysmem size=1164.9 MB rss=726.6 MB share=42.0 MB text=83.2 MB datastack=0.0 MB
  sysmem size=1164.9 MB rss=726.7 MB share=42.0 MB text=83.2 MB datastack=0.0 MB
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
libpthread-2.31.s  00001484F0AFD8C0  Unknown               Unknown  Unknown
libpioc.so         00001484FE318B8C  pio_sorted_copy       Unknown  Unknown
libpioc.so         00001484FE314E80  PIOc_read_darray      Unknown  Unknown
libpiof.so         00001484FE5AB8F3  piodarray_mp_read     Unknown  Unknown
libpiof.so         00001484FE5A872C  piodarray_mp_read     Unknown  Unknown
cesm.exe           0000000000A35A29  ncdio_pio_mp_read        2262  ncdio_pio.F90.in
cesm.exe           0000000000A31348  ncdio_pio_mp_read        2237  ncdio_pio.F90.in
cesm.exe           0000000000A2DF00  ncdio_pio_mp_ncd_        2166  ncdio_pio.F90.in
cesm.exe           0000000003A4F2AB  urbanparamstype_m         572  UrbanParamsType.F90
cesm.exe           0000000000AAAFB0  clm_initializemod         244  clm_initializeMod.F90
cesm.exe           00000000009B0340  lnd_comp_nuopc_mp         681  lnd_comp_nuopc.F90
libesmf.so         00001484F35CC0F9  callVFuncPtr             2187  ESMCI_FTable.C
libesmf.so         00001484F35CB138  ESMCI_FTableCallE         844  ESMCI_FTable.C
libesmf.so         00001484F3A9A4B2  enter                    2557  ESMCI_VMKernel.C
libesmf.so         00001484F3A81DA6  enter                    1303  ESMCI_VM.C

The line in question is the read of ALB_IMPROAD_DIR from the surface dataset here:

       call ncd_io(ncid=ncid, varname='ALB_IMPROAD_DIR', flag='read', data=urbinp%alb_improad_dir, &
            dim1name=grlnd, readvar=readvar)

There are plenty of reads that happen before this one, so I don't know what the problem might be here. The problem could also be in the ncdio_pio layer in CTSM or in the PIO layer.

The line in ncdio_pio at the lowest level is:

    call pio_read_darray(ncid, vardesc, iodesc, data, status)

The memory overrun could be in one of the PIO types: ncid, vardesc, or iodesc.

For CTSM it could be data, which is double precision and three-dimensional, allocated with this shape:

 urbinp%alb_improad_dir(begg:endg, numurbl, numrad)
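One cheap sanity check for a buffer like this is to compare its element count against the number of elements the read is expected to deliver. The sketch below is a stand-alone illustration under stated assumptions, not CTSM code: the values of numurbl, numrad, begg, and endg are placeholders, and n_described is a hypothetical stand-in for whatever count the I/O decomposition actually describes.

```fortran
program urban_buffer_check
  ! Hypothetical stand-alone example; all sizes below are assumed placeholder values.
  implicit none
  integer, parameter :: r8 = selected_real_kind(12)  ! double precision kind
  integer, parameter :: numurbl = 3                  ! assumed number of urban landunit types
  integer, parameter :: numrad  = 2                  ! assumed number of radiation bands
  integer :: begg, endg                              ! local gridcell bounds (assumed)
  integer :: n_described                             ! stand-in for the decomposition's element count
  real(r8), allocatable :: alb_improad_dir(:,:,:)

  begg = 1
  endg = 1                                           ! single-point case, for illustration
  allocate(alb_improad_dir(begg:endg, numurbl, numrad))
  alb_improad_dir = 0.0_r8

  ! In real code this count would come from the I/O decomposition; here it is hard-coded.
  n_described = 6

  print *, 'shape    =', shape(alb_improad_dir)
  print *, 'elements =', size(alb_improad_dir)
  print *, 'bytes    =', size(alb_improad_dir) * storage_size(alb_improad_dir) / 8

  ! A read that delivers more elements than the buffer holds would write past its end.
  if (n_described > size(alb_improad_dir)) then
     error stop 'buffer is smaller than the element count the read expects'
  end if
end program urban_buffer_check
```

If the two counts disagree, the mismatch can come from either side: the array dimensions or the decomposition that describes the local gridcells.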

NOTE: THE FOLLOWING COMMENT TURNS OUT TO BE WRONG:
So if numurbl or numrad is wrong there, that could be the explanation. And actually they aren't input, so how does this even work at all?

Labels

bfb (bit-for-bit), bug (something is working incorrectly)
