    RDMA/nes: Fix hang issues for large cluster dynamic connections · 109d67e4
    Faisal Latif authored
    Running a large cluster setup, we hit a hang after many hours of
    testing. Fixing this required going over the code and making sure the
    rexmit entry is properly removed based on the cm_node's state and the
    packet received. Also, when a FIN packet is received, check its seq#
    and make sure there were no errors before calling handle_fin().
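    The idea of removing the rexmit entry according to the cm_node's state
    can be illustrated with a minimal sketch. This is a simplified,
    hypothetical model and not the nes_cm.c code: the state enum, struct
    cm_node, and the retrans_pending flag (standing in for the rexmit entry)
    are assumptions made for illustration. The FIN handler releases the
    entry only in states that accept the FIN and drops the packet otherwise,
    and the RST handler includes a LAST_ACK case.

```c
#include <stdbool.h>

/* Hypothetical, simplified connection states (not the driver's enum). */
enum cm_state {
	STATE_ESTABLISHED,
	STATE_FIN_WAIT1,
	STATE_FIN_WAIT2,
	STATE_LAST_ACK,
	STATE_TIME_WAIT,
	STATE_CLOSED,
};

/* Hypothetical node: retrans_pending stands in for the rexmit entry. */
struct cm_node {
	enum cm_state state;
	bool retrans_pending;
};

static void cleanup_retrans_entry(struct cm_node *node)
{
	node->retrans_pending = false;
}

/* FIN: release the rexmit entry only in states that accept the FIN. */
static int handle_fin_pkt(struct cm_node *node)
{
	switch (node->state) {
	case STATE_ESTABLISHED:
		cleanup_retrans_entry(node);
		node->state = STATE_LAST_ACK;	/* our own FIN would be sent here */
		return 0;
	case STATE_FIN_WAIT2:
		cleanup_retrans_entry(node);
		node->state = STATE_TIME_WAIT;
		return 0;
	default:
		/* Unexpected FIN: drop it and leave the rexmit entry alone. */
		return -1;
	}
}

/* RST: a node in LAST_ACK must also drop its rexmit entry and close. */
static int handle_rst_pkt(struct cm_node *node)
{
	switch (node->state) {
	case STATE_ESTABLISHED:
	case STATE_FIN_WAIT1:
	case STATE_FIN_WAIT2:
	case STATE_LAST_ACK:	/* the case described as missing above */
		cleanup_retrans_entry(node);
		node->state = STATE_CLOSED;
		return 0;
	default:
		return -1;
	}
}
```

    In this model, a FIN arriving on a node that is already CLOSED returns
    -1 and leaves retrans_pending untouched, while an RST in LAST_ACK moves
    the node to CLOSED and clears the entry.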
    
    The following changes were made in nes_cm.c:
    
    * handle_ack_pkt() now returns an error value, so that handle_fin() is
      not called when an error occurs. Some cleanup was also done while
      going over the code.
    
    * handle_rst_pkt(): handling of the cm_node's NES_CM_STATE_LAST_ACK
      state was missing.
    
    * process_packet(): when a FIN-only packet is received, call
      check_seq() before processing it.
    
    * handle_fin_pkt(): cleanup_retrans_entry() was being called under all
      conditions, even when the packet needed to be dropped.
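    The FIN-path ordering in the first and third items above can be
    sketched as follows. This is a hypothetical, simplified model rather
    than the driver's code: struct pkt, struct conn, and the exact-match
    sequence check are assumptions, and the real check_seq() and
    handle_ack_pkt() do considerably more. The point shown is that
    handle_fin() runs only after either the ACK handling or, for a FIN-only
    packet, check_seq() has reported success.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical packet summary: only the fields this sketch needs. */
struct pkt {
	bool ack, fin;
	uint32_t seq;
};

/* Hypothetical per-connection state for this sketch. */
struct conn {
	uint32_t rcv_nxt;	/* next sequence number we expect */
	int fins_handled;	/* counts calls to handle_fin() */
};

/* Accept only the in-order seq# (real window checks are more involved). */
static int check_seq(const struct conn *c, const struct pkt *p)
{
	return p->seq == c->rcv_nxt ? 0 : -1;
}

/* Returns an error instead of void so the caller can skip FIN handling. */
static int handle_ack_pkt(struct conn *c, const struct pkt *p)
{
	return check_seq(c, p);
}

static void handle_fin(struct conn *c)
{
	c->rcv_nxt += 1;	/* a FIN consumes one sequence number */
	c->fins_handled++;
}

/* Returns 0 if the packet was accepted, -1 if it was dropped. */
static int process_packet(struct conn *c, const struct pkt *p)
{
	int ret = 0;

	if (p->ack)
		ret = handle_ack_pkt(c, p);
	else if (p->fin)
		ret = check_seq(c, p);	/* FIN-only: validate before handling */

	if (ret)
		return -1;		/* drop; handle_fin() is never reached */
	if (p->fin)
		handle_fin(c);
	return 0;
}
```

    With rcv_nxt at 100, a FIN-only packet carrying seq 99 is dropped
    without touching the connection, while one carrying seq 100 is handled
    and advances rcv_nxt to 101.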

    Signed-off-by: Faisal Latif <faisal.latif@intel.com>
    Signed-off-by: Roland Dreier <rolandd@cisco.com>