ConnectX-3 Pro query segfaults after commit 4f91c645 from v4.30.0-1 #1157

@cebtenzzre

System Information

  • OS: Arch Linux
  • Kernel: linux-zen 6.13.4
# lspci -nn | grep Mellanox
0c:00.0 Ethernet controller [0200]: Mellanox Technologies MT27520 Family [ConnectX-3 Pro] [15b3:1007]

The build is configured with --disable-inband.

Problem Description

On v4.30.0-1 (or 0975751, which is the newest commit I can get to build from source):

# mstflint -d 0c:00.0 query
Segmentation fault (core dumped)

On v4.29.0-1:

# mstflint -d 0c:00.0 query
Image type:            FS2
FW Version:            2.42.5044
FW Release Date:       21.10.2018
Product Version:       02.42.50.44
Rom Info:              version_id=8025 type=CLP 
                       type=UEFI version=14.11.48 cpu=AMD64
                       type=PXE version=3.4.754
Device ID:             4103
Description:           Node             Port1            Port2            Sys image
GUIDs:                 b88303ffff93f430 b88303ffff93f431 b88303ffff93f432 b88303ffff93f433 
MACs:                                       b8830393f431     b8830393f432
VSD:                   
PSID:                  HP_1370110017
INI revision:          0xc592a23f

git bisect points to commit 4f91c64:

$ git bisect bad
4f91c64509a0eeab2c53e0d59bf8b9039b1def42 is the first bad commit
commit 4f91c64509a0eeab2c53e0d59bf8b9039b1def42 (HEAD)
Author: Shelly Sela <[email protected]>
Date:   Sun Sep 8 09:06:00 2024

    [CX8, QTM3 and above] - add support for Late LF mode
    
    Description:
    Late LF support: the driver will detect that the device is in Late LF mode to determine the right access method.
    LF in newer devices: in contrast to legacy LF, newer devices have VSC exposed. The driver will distinguish between a functional device and a LF device by the VSC type.
    When running MFT tools on a device in late LF mode, the user will get the same behavior received from a device in LF mode.
    
    HLD - https://confluence.nvidia.com/pages/viewpage.action?pageId=3113761891
    
    Tested OS: linux
    Tested devices: ConnectX8
    Tested flows: see detailed test and outputs in the bottom of the HLD
    
    Known gaps (with RM ticket): n/a
    
    Issue: 3909481 4043079

 include/mtcr_ul/mtcr_com_defs.h |  2 +-
 include/mtcr_ul/mtcr_mf.h       |  3 +-
 kernel/mst.h                    |  2 +-
 kernel/mst_kernel.h             |  9 +++--
 kernel/mst_main.c               | 85 +++++++++++++++++++++++++++++++++-------------
 mtcr_freebsd/mtcr_ul.c          | 38 +++++++++++++++------
 mtcr_ul/mtcr_ul.c               |  2 +-
 mtcr_ul/mtcr_ul_com.c           | 88 ++++++++++++++++++++++++++++--------------------
 mtcr_ul/mtcr_ul_com.h           |  6 ++++
 mtcr_ul/mtcr_ul_icmd_cif.c      | 12 +++----
 small_utils/mtserver.c          |  2 +-
 11 files changed, 164 insertions(+), 85 deletions(-)

Backtrace on 0975751:

>>> bt
#0  0x0000555555650ef8 in mtcr_pcicr_mread4 (mf=<optimized out>, offset=<optimized out>, 
    value=0x7fffffffced8) at mtcr_ul_com.c:585
#1  mtcr_pcicr_mread4 (mf=0x55555572df60, offset=<optimized out>, value=0x7fffffffced8)
    at mtcr_ul_com.c:570
#2  0x0000555555651142 in mread4_ul (mf=mf@entry=0x55555572df60, offset=<optimized out>, 
    value=value@entry=0x7fffffffced8) at mtcr_ul_com.c:274
#3  0x000055555564b58d in mread4 (mf=mf@entry=0x55555572df60, offset=<optimized out>, 
    value=value@entry=0x7fffffffced8) at mtcr_ul.c:46
#4  0x0000555555657b69 in read_device_id (mf=mf@entry=0x55555572df60, 
    device_id=device_id@entry=0x7fffffffced8) at mtcr_ul_com.c:4073
#5  0x000055555564a5af in dm_get_device_id_inner (mf=0x55555572df60, 
    ptr_dm_dev_id=ptr_dm_dev_id@entry=0x7fffffffcfdc, ptr_hw_dev_id=0x7fffffffcfe0, 
    ptr_hw_rev=0x7fffffffcfe4) at tools_dev_types.c:696
#6  0x000055555564a825 in dm_get_device_id (mf=<optimized out>, ptr_dm_dev_id=0x7fffffffcfdc, 
    ptr_hw_dev_id=<optimized out>, ptr_hw_rev=<optimized out>) at tools_dev_types.c:715
#7  0x00005555555bcc7e in FwOperations::IsDeviceSupported (fwParams=...) at fw_ops.cpp:930
#8  0x00005555555bf7a8 in FwOperations::FwOperationsCreate (fwParams=...) at fw_ops.cpp:949
#9  0x0000555555591eea in SubCommand::openOps (this=this@entry=0x5555556fb3a0, 
    ignoreSecurityAttributes=ignoreSecurityAttributes@entry=false, ignoreDToc=ignoreDToc@entry=false)
    at subcommands.cpp:633
#10 0x00005555555aa500 in SubCommand::preFwOps (this=0x5555556fb3a0, 
    ignoreSecurityAttributes=<optimized out>, ignoreDToc=<optimized out>) at subcommands.cpp:847
#11 0x00005555555ae94d in QuerySubCommand::executeCommand (this=0x5555556fb3a0) at subcommands.cpp:4713
#12 0x000055555558e659 in Flint::run (this=this@entry=0x5555556f8d50, argc=argc@entry=4, 
    argv=argv@entry=0x7fffffffe758) at flint.cpp:278
#13 0x0000555555577dc8 in main (argc=4, argv=0x7fffffffe758) at flint.cpp:287

Frame #0 (mtcr_ul_com.c:585) is this line:

u_int32_t tmp = ((u_int32_t*)mf->bar_virtual_addr)[offset / 4];

mf->bar_virtual_addr is 0xffffffffffffffff (i.e. (void *)-1) here, so the read dereferences an invalid pointer.
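For reference, on Linux 0xffffffffffffffff is the value of MAP_FAILED, i.e. what mmap() returns on failure, so one possible explanation is that the BAR mapping failed (or was never attempted) and the sentinel was stored and later dereferenced. I have not confirmed that this is what happens in mtcr_ul_com.c. The sketch below only illustrates the kind of guard that would turn the segfault into an error return. `struct bar_map`, `bar_is_mapped` and `guarded_read4` are hypothetical names for illustration, not mstflint code.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Hypothetical stand-in for the one mfile field that matters here. */
struct bar_map {
    void *bar_virtual_addr;
};

/* A BAR pointer is usable only if it is neither NULL nor MAP_FAILED.
 * MAP_FAILED is (void *)-1, which prints as 0xffffffffffffffff on x86-64. */
static int bar_is_mapped(const struct bar_map *m)
{
    return m->bar_virtual_addr != NULL && m->bar_virtual_addr != MAP_FAILED;
}

/* Guarded 32-bit read: returns 4 (bytes read) on success, or -1 instead of
 * dereferencing the pointer when the BAR is not mapped. */
static int guarded_read4(const struct bar_map *m, unsigned int offset,
                         uint32_t *value)
{
    if (!bar_is_mapped(m)) {
        return -1;
    }
    *value = ((const volatile uint32_t *)m->bar_virtual_addr)[offset / 4];
    return 4;
}
```

With a check like this, `query` would fail with an error rather than crash, though the real fix is presumably to make the ConnectX-3 Pro take the access path it used before 4f91c64.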
