OpenMPI  0.1.1
btl_openib_failover.c File Reference

Functions specific to implementing failover support.

#include "ompi_config.h"
#include "opal_stdint.h"
#include "btl_openib.h"
#include "btl_openib_endpoint.h"
#include "btl_openib_proc.h"
#include "btl_openib_failover.h"
#include "opal/util/opal_sos.h"

Functions

static void error_out_all_pending_frags (mca_btl_base_endpoint_t *ep, struct mca_btl_base_module_t *module, bool errout)
 This function will find all the pending fragments on an endpoint and call the callback function with OMPI_ERROR.
 
static void mca_btl_openib_endpoint_notify (mca_btl_base_endpoint_t *endpoint, uint8_t type, int index)
 This function is used to send a message to the remote side indicating the endpoint is broken and telling the remote side to bring its endpoint down as well.
 
void mca_btl_openib_dump_all_local_rdma_frags (mca_btl_openib_device_t *device)
 
void mca_btl_openib_dump_all_internal_queues (bool errout)
 This function is a debugging tool.
 
static void dump_local_rdma_frags (mca_btl_openib_endpoint_t *endpoint)
 
void mca_btl_openib_handle_endpoint_error (mca_btl_openib_module_t *openib_btl, mca_btl_base_descriptor_t *des, int qp, ompi_proc_t *remote_proc, mca_btl_openib_endpoint_t *endpoint)
 This function is called when we get an error on the completion event of a fragment.
 
void mca_btl_openib_handle_btl_error (mca_btl_openib_module_t *openib_btl)
 This function allows an error to map out the entire BTL.
 
void btl_openib_handle_failover_control_messages (mca_btl_openib_control_header_t *ctl_hdr, mca_btl_openib_endpoint_t *ep)
 This function gets called when a control message is received that is one of the following types: MCA_BTL_OPENIB_CONTROL_EP_BROKEN or MCA_BTL_OPENIB_CONTROL_EP_EAGER_RDMA_ERROR. Note that we are using the working connection to send information about the broken connection.
 
static void mca_btl_openib_endpoint_notify_cb (mca_btl_base_module_t *btl, struct mca_btl_base_endpoint_t *endpoint, struct mca_btl_base_descriptor_t *descriptor, int status)
 

Detailed Description

Functions specific to implementing failover support.

This file is conditionally compiled into the BTL when Open MPI is configured with --enable-openib-failover. When this file is compiled in, multi-BTL configurations can handle errors. More than one openib BTL must be in use so that all the traffic can move to another openib BTL. Failing over to a different type of BTL, such as TCP, is not supported.

Function Documentation

void btl_openib_handle_failover_control_messages ( mca_btl_openib_control_header_t *ctl_hdr,
mca_btl_openib_endpoint_t *ep
)

This function gets called when a control message is received that is one of the following types: MCA_BTL_OPENIB_CONTROL_EP_BROKEN or MCA_BTL_OPENIB_CONTROL_EP_EAGER_RDMA_ERROR. Note that we are using the working connection to send information about the broken connection.

That is why we have to look at the information in the control message to figure out which endpoint is broken. It is (obviously) not the one the message was received on, because we would not have received the message in that case. A BROKEN message means the remote side is notifying us that it has brought down its half of the connection, so we need to bring our half down as well. This is done because it has been observed that there are cases where only one side of the connection actually sees the error. We can then be left in a state where one side believes it has two BTLs while the other side believes it has only one, so the two sides disagree about which connections are usable. The EAGER_RDMA_ERROR case is handled elsewhere in the code; see there for what is done.
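The lookup described above can be sketched in a few lines of C. This is only an illustrative model, not the code in this file: every `sketch_` name is hypothetical, and the assumption that the message identifies the broken connection by a LID and a process rank (vpid) is inferred from the members listed under References, not from the real header layout. The EAGER_RDMA_ERROR case is deliberately not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the control message types and endpoint states. */
enum { SKETCH_EP_BROKEN = 1, SKETCH_EP_EAGER_RDMA_ERROR = 2 };
enum { SKETCH_CONNECTED = 0, SKETCH_FAILED = 1 };

/* Assumed message contents: which kind of notice, plus a LID and a vpid
 * that together identify the broken connection. */
typedef struct {
    uint8_t  type;
    uint16_t lid;
    uint32_t vpid;
} sketch_ctl_hdr_t;

/* Assumed per-endpoint data used for matching. */
typedef struct {
    uint16_t rem_lid;    /* LID of the remote port this endpoint talks to */
    uint32_t peer_vpid;  /* rank of the remote process */
    int      state;
} sketch_endpoint_t;

/* Find the endpoint the message describes. The endpoint the message
 * arrived on is skipped: it is the working connection. */
static sketch_endpoint_t *sketch_find_broken(sketch_endpoint_t *eps, size_t n,
                                             const sketch_ctl_hdr_t *hdr,
                                             const sketch_endpoint_t *received_on)
{
    for (size_t i = 0; i < n; i++) {
        if (&eps[i] == received_on) {
            continue;
        }
        if (eps[i].rem_lid == hdr->lid && eps[i].peer_vpid == hdr->vpid) {
            return &eps[i];
        }
    }
    return NULL;
}

/* Handle a BROKEN notice: bring our half of the connection down.
 * Returns 1 if an endpoint changed state, 0 otherwise. */
static int sketch_handle_failover_control(sketch_endpoint_t *eps, size_t n,
                                          const sketch_ctl_hdr_t *hdr,
                                          const sketch_endpoint_t *received_on)
{
    sketch_endpoint_t *broken;

    assert(eps != NULL && hdr != NULL);
    if (hdr->type != SKETCH_EP_BROKEN) {
        return 0;  /* other message types are not modelled here */
    }
    broken = sketch_find_broken(eps, n, hdr, received_on);
    if (broken == NULL || broken->state == SKETCH_FAILED) {
        return 0;  /* nothing matched, or we already saw the error locally */
    }
    broken->state = SKETCH_FAILED;
    return 1;
}
```

Returning 0 when the endpoint is already marked failed keeps the handler idempotent, which matters when both sides detect the error and each notifies the other.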

Parameters
ctl_hdr  Pointer to the control header that was received

References mca_btl_base_endpoint_t::eager_rdma_local, mca_btl_base_endpoint_t::endpoint_proc, mca_btl_base_endpoint_t::endpoint_state, mca_btl_openib_device_t::endpoints, mca_btl_openib_module_t::error_cb, error_out_all_pending_frags(), mca_btl_openib_eager_rdma_local_t::head, mca_btl_openib_component_t::ib_num_btls, mca_btl_openib_module_t::lid, mca_btl_base_endpoint_t::nbo, opal_output_verbose(), opal_pointer_array_get_item(), opal_pointer_array_get_size(), mca_btl_openib_component_t::openib_btls, ORTE_PROC_MY_NAME, ompi_proc_t::proc_name, mca_btl_elan_proc_t::proc_ompi, mca_btl_base_endpoint_t::rem_info, and orte_process_name_t::vpid.

static void error_out_all_pending_frags ( mca_btl_base_endpoint_t *ep,
struct mca_btl_base_module_t *module,
bool  errout
)

This function will find all the pending fragments on an endpoint and call the callback function with OMPI_ERROR.

It walks through each qp with each priority and looks for both no_credits_pending_frags and no_wqe_pending_frags. It then looks for any pending_lazy_frags, pending_put_frags, and pending_get_frags. This function is only called when running with failover support enabled. Note that the errout parameter allows the function to also be used as a debugging tool to see if there are any fragments on any of the queues.
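The walk over the queues can be illustrated with a small self-contained model. It is a sketch under stated assumptions, not the function's actual body: all `sketch_` types are hypothetical replacements for `opal_list_t` and the openib endpoint structures, the number of queue pairs and priorities is made up, and `SKETCH_ERROR` merely plays the role of OMPI_ERROR. Only the queue names are taken from the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sizes and error code, chosen for illustration only. */
#define SKETCH_NUM_QPS  2
#define SKETCH_NUM_PRIO 2
#define SKETCH_ERROR    (-1)

typedef struct sketch_frag {
    struct sketch_frag *next;
    void (*cbfunc)(struct sketch_frag *frag, int status); /* may be NULL */
    int status;                                            /* last status reported */
} sketch_frag_t;

typedef struct {
    sketch_frag_t *head;
    size_t len;
} sketch_list_t;

typedef struct {
    sketch_list_t no_credits_pending_frags[SKETCH_NUM_PRIO];
    sketch_list_t no_wqe_pending_frags[SKETCH_NUM_PRIO];
} sketch_qp_t;

typedef struct {
    sketch_qp_t   qps[SKETCH_NUM_QPS];
    sketch_list_t pending_lazy_frags;
    sketch_list_t pending_put_frags;
    sketch_list_t pending_get_frags;
} sketch_endpoint_t;

/* Example completion callback: records the status on the fragment. */
static void sketch_record_status(sketch_frag_t *frag, int status)
{
    frag->status = status;
}

/* Push at the front; ordering does not matter for this sketch. */
static void sketch_list_push(sketch_list_t *list, sketch_frag_t *frag)
{
    frag->next = list->head;
    list->head = frag;
    list->len++;
}

/* Count the fragments on one list. If errout is set, also remove each
 * fragment and invoke its callback with the error status. */
static size_t sketch_scan_list(sketch_list_t *list, bool errout)
{
    size_t found = list->len;

    while (errout && list->head != NULL) {
        sketch_frag_t *frag = list->head;
        list->head = frag->next;
        frag->next = NULL;
        list->len--;
        if (frag->cbfunc != NULL) {
            frag->cbfunc(frag, SKETCH_ERROR);
        }
    }
    return found;
}

/* Walk every queue on the endpoint; returns how many fragments were found. */
static size_t sketch_error_out_all_pending_frags(sketch_endpoint_t *ep, bool errout)
{
    size_t total = 0;

    assert(ep != NULL);
    for (int qp = 0; qp < SKETCH_NUM_QPS; qp++) {
        for (int pri = 0; pri < SKETCH_NUM_PRIO; pri++) {
            total += sketch_scan_list(&ep->qps[qp].no_credits_pending_frags[pri], errout);
            total += sketch_scan_list(&ep->qps[qp].no_wqe_pending_frags[pri], errout);
        }
    }
    total += sketch_scan_list(&ep->pending_lazy_frags, errout);
    total += sketch_scan_list(&ep->pending_put_frags, errout);
    total += sketch_scan_list(&ep->pending_get_frags, errout);
    return total;
}
```

Calling the sketch with `errout` set to false leaves every queue untouched and only reports how many fragments are waiting, which corresponds to the debugging use mentioned above.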

Parameters
ep  Pointer to endpoint that had error
module  Pointer to module that had error
errout  Boolean which says whether to error them out or not

References mca_btl_base_descriptor_t::des_cbfunc, mca_btl_base_descriptor_t::des_flags, mca_btl_base_endpoint_t::endpoint_btl, mca_btl_openib_endpoint_qp_t::no_credits_pending_frags, mca_btl_openib_endpoint_qp_t::no_wqe_pending_frags, mca_btl_openib_component_t::num_qps, opal_list_get_size(), opal_list_remove_first(), opal_output_verbose(), mca_btl_base_endpoint_t::pending_get_frags, mca_btl_base_endpoint_t::pending_lazy_frags, and mca_btl_base_endpoint_t::pending_put_frags.

Referenced by btl_openib_handle_failover_control_messages(), mca_btl_openib_dump_all_internal_queues(), mca_btl_openib_handle_btl_error(), and mca_btl_openib_handle_endpoint_error().

void mca_btl_openib_dump_all_internal_queues ( bool  errout)

This function is a debugging tool.

If you notice a hang, you can call this function from a debugger and see if there are any messages stuck in any of the queues. If you call it with errout=true, then it will error them out. Otherwise, it will just print out the size of the queues with data in them.

References mca_btl_openib_device_t::endpoints, error_out_all_pending_frags(), mca_btl_openib_component_t::ib_num_btls, opal_pointer_array_get_item(), opal_pointer_array_get_size(), and mca_btl_openib_component_t::openib_btls.

static void mca_btl_openib_endpoint_notify ( mca_btl_base_endpoint_t *endpoint,
uint8_t  type,
int  index
)

This function is used to send a message to the remote side indicating the endpoint is broken and telling the remote side to bring its endpoint down as well.

This is needed because there are cases where only one side of the connection determines that there was a problem.
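Since the broken endpoint cannot carry the notice itself, the sender has to pick another working endpoint to the same peer. A minimal sketch of that selection follows, assuming a flat array of endpoints. All `sketch_` names are hypothetical, and the notice fields (the local LID of the broken connection and the sender's own vpid) are assumptions inferred from the members listed under References, not the real wire format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical message types. */
enum { SKETCH_EP_BROKEN = 1, SKETCH_EP_EAGER_RDMA_ERROR = 2 };

/* Assumed per-endpoint data. */
typedef struct {
    uint16_t local_lid;  /* LID of the local port this endpoint uses */
    uint32_t peer_vpid;  /* rank of the remote process */
    int      connected;  /* nonzero while the endpoint is usable */
} sketch_endpoint_t;

/* Assumed notice contents. */
typedef struct {
    uint8_t  type;
    uint16_t lid;    /* identifies the broken connection */
    uint32_t vpid;   /* identifies the sending process */
    int      index;  /* nonzero only for the RDMA error message */
} sketch_notice_t;

/* Pick a different, still connected endpoint that reaches the same peer. */
static const sketch_endpoint_t *sketch_pick_alternate(const sketch_endpoint_t *eps,
                                                      size_t n,
                                                      const sketch_endpoint_t *broken)
{
    for (size_t i = 0; i < n; i++) {
        if (&eps[i] != broken && eps[i].connected &&
            eps[i].peer_vpid == broken->peer_vpid) {
            return &eps[i];
        }
    }
    return NULL;
}

/* Fill in the notice and return the endpoint it should be sent on,
 * or NULL if no working path to the peer is left. */
static const sketch_endpoint_t *sketch_endpoint_notify(const sketch_endpoint_t *eps,
                                                       size_t n,
                                                       const sketch_endpoint_t *broken,
                                                       uint32_t my_vpid,
                                                       uint8_t type, int index,
                                                       sketch_notice_t *out)
{
    const sketch_endpoint_t *via;

    assert(eps != NULL && broken != NULL && out != NULL);
    via = sketch_pick_alternate(eps, n, broken);
    if (via == NULL) {
        return NULL;
    }
    out->type  = type;
    out->lid   = broken->local_lid;
    out->vpid  = my_vpid;
    out->index = index;
    return via;
}
```

When the sketch returns NULL there is no second openib path to the peer, which is the situation the multi-BTL requirement in the Detailed Description is meant to rule out.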

Parameters
endpoint  Pointer to endpoint with error
type  Type of message to be sent, can be one of two types
index  When sending RDMA error message, index is non zero

References mca_btl_base_endpoint_t::endpoint_btl, mca_btl_base_endpoint_t::endpoint_proc, mca_btl_openib_device_t::endpoints, mca_btl_openib_component_t::ib_num_btls, mca_btl_base_endpoint_t::nbo, opal_output_verbose(), opal_pointer_array_get_item(), opal_pointer_array_get_size(), mca_btl_openib_component_t::openib_btls, ORTE_PROC_MY_NAME, and mca_btl_elan_proc_t::proc_ompi.

Referenced by mca_btl_openib_handle_btl_error(), and mca_btl_openib_handle_endpoint_error().

void mca_btl_openib_handle_btl_error ( mca_btl_openib_module_t *openib_btl)

This function allows an error to map out the entire BTL.

First a call is made up to the PML to map out all connections from this BTL. Then a message is sent to all the endpoints connected to this BTL. This function is enabled by the btl_openib_port_error_failover MCA parameter. If that parameter is not set, then this function does not do anything.
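The order of operations described above (parameter check, one call up to the PML, then one notice per endpoint) can be modelled as follows. This is a simplified sketch, not the real function: the `sketch_` types are hypothetical, the MCA parameter is represented by a plain boolean argument, the notice is reduced to setting a flag, and erroring out pending fragments is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { SKETCH_CONNECTED = 0, SKETCH_FAILED = 1 };

typedef struct {
    int state;
    int notified;  /* set when a notice would have been sent to the peer */
} sketch_endpoint_t;

typedef struct sketch_btl {
    void (*error_cb)(struct sketch_btl *btl);  /* stands in for the call up to the PML */
    int error_cb_calls;
    sketch_endpoint_t **endpoints;             /* slots may be NULL, as in a pointer array */
    size_t num_endpoints;
} sketch_btl_t;

/* Example error callback: counts how often the upper layer was told. */
static void sketch_count_error_cb(sketch_btl_t *btl)
{
    btl->error_cb_calls++;
}

/* Map out a whole BTL. failover_enabled models the
 * btl_openib_port_error_failover MCA parameter. Returns the number of
 * endpoints that were marked failed and notified. */
static size_t sketch_handle_btl_error(sketch_btl_t *btl, bool failover_enabled)
{
    size_t notified = 0;

    assert(btl != NULL);
    if (!failover_enabled) {
        return 0;  /* parameter not set: do nothing */
    }
    if (btl->error_cb != NULL) {
        btl->error_cb(btl);  /* first: tell the upper layer once */
    }
    for (size_t i = 0; i < btl->num_endpoints; i++) {
        sketch_endpoint_t *ep = btl->endpoints[i];
        if (ep == NULL) {
            continue;  /* empty slot */
        }
        ep->state = SKETCH_FAILED;
        ep->notified = 1;  /* then: one notice per connected endpoint */
        notified++;
    }
    return notified;
}
```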

Parameters
openib_btl  Pointer to BTL that had the error

References mca_btl_base_endpoint_t::endpoint_state, mca_btl_openib_device_t::endpoints, mca_btl_openib_module_t::error_cb, error_out_all_pending_frags(), mca_btl_openib_module_t::lid, mca_btl_openib_endpoint_notify(), opal_pointer_array_get_item(), and opal_pointer_array_get_size().

void mca_btl_openib_handle_endpoint_error ( mca_btl_openib_module_t *openib_btl,
mca_btl_base_descriptor_t *des,
int  qp,
ompi_proc_t *remote_proc,
mca_btl_openib_endpoint_t *endpoint
)

This function is called when we get an error on the completion event of a fragment.

We check to see what type of fragment it is and act accordingly. In most cases, we first call up into the PML and have it map out this connection for any future communication. In addition, this function will possibly send some control messages over the other openib BTL. The first control message will tell the remote side to also map out this connection. The second control message makes sure the eager RDMA connection remains in a sane state. See that function for more details.

Parameters
openib_btl  Pointer to BTL that had the error
des  Pointer to descriptor that had the error
qp  Queue pair that had the error
remote_proc  Pointer to process that had the error
endpoint  Pointer to endpoint that had the error

References mca_btl_base_descriptor_t::des_cbfunc, mca_btl_base_descriptor_t::des_flags, mca_btl_base_descriptor_t::des_src, mca_btl_base_endpoint_t::endpoint_state, mca_btl_openib_module_t::error_cb, error_out_all_pending_frags(), mca_btl_base_endpoint_t::get_tokens, mca_btl_openib_module_t::lid, mca_btl_openib_endpoint_notify(), opal_list_remove_first(), opal_output_verbose(), and OPAL_THREAD_ADD32.