OpenMPI
0.1.1
Functions called by BTL to handle error events.
Functions
void | mca_btl_openib_handle_endpoint_error (mca_btl_openib_module_t *openib_btl, mca_btl_base_descriptor_t *des, int qp, ompi_proc_t *remote_proc, mca_btl_openib_endpoint_t *endpoint) |
This function is called when we get an error on the completion event of a fragment.
void | mca_btl_openib_handle_btl_error (mca_btl_openib_module_t *openib_btl) |
This function allows an error to map out the entire BTL.
void | btl_openib_handle_failover_control_messages (mca_btl_openib_control_header_t *ctl_hdr, mca_btl_openib_endpoint_t *ep) |
This function is called when a control message of one of the following types is received: MCA_BTL_OPENIB_CONTROL_EP_BROKEN or MCA_BTL_OPENIB_CONTROL_EP_EAGER_RDMA_ERROR. Note that the working connection is used to send information about the broken connection.
void btl_openib_handle_failover_control_messages (mca_btl_openib_control_header_t *ctl_hdr, mca_btl_openib_endpoint_t *ep)
This function is called when a control message of one of the following types is received: MCA_BTL_OPENIB_CONTROL_EP_BROKEN or MCA_BTL_OPENIB_CONTROL_EP_EAGER_RDMA_ERROR. Note that the working connection is used to send information about the broken connection.
That is why we have to examine the information in the control message to figure out which endpoint is broken. It is (obviously) not the one the message was received on, because we would not have received the message in that case. A BROKEN message means the remote side is notifying us that it has brought down its half of the connection, so we need to bring our half down as well. This is done because it has been observed that in some cases only one side of the connection actually sees the error. That can leave one side believing it has two BTLs while the other side believes it has only one, which can cause problems. For the EAGER_RDMA_ERROR case, see elsewhere in the code what we are doing.
ctl_hdr | Pointer to the control header that was received |
References mca_btl_base_endpoint_t::eager_rdma_local, mca_btl_base_endpoint_t::endpoint_proc, mca_btl_base_endpoint_t::endpoint_state, mca_btl_openib_device_t::endpoints, mca_btl_openib_module_t::error_cb, error_out_all_pending_frags(), mca_btl_openib_eager_rdma_local_t::head, mca_btl_openib_component_t::ib_num_btls, mca_btl_openib_module_t::lid, mca_btl_base_endpoint_t::nbo, opal_output_verbose(), opal_pointer_array_get_item(), opal_pointer_array_get_size(), mca_btl_openib_component_t::openib_btls, ORTE_PROC_MY_NAME, ompi_proc_t::proc_name, mca_btl_elan_proc_t::proc_ompi, mca_btl_base_endpoint_t::rem_info, and orte_process_name_t::vpid.
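The lookup described for btl_openib_handle_failover_control_messages can be illustrated with a minimal, self-contained sketch. Everything in it is a hypothetical stand-in rather than the real Open MPI code: the `sketch_` types, the idea of matching on a LID plus a process id, and the counter used for the eager RDMA case are assumptions made only to show the shape of the dispatch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; these are NOT the real Open MPI definitions. */
enum sketch_ctl_type {
    SKETCH_CTL_EP_BROKEN,
    SKETCH_CTL_EP_EAGER_RDMA_ERROR,
    SKETCH_CTL_OTHER
};
enum sketch_ep_state { SKETCH_EP_CONNECTED, SKETCH_EP_FAILED };

typedef struct {
    enum sketch_ctl_type type;
    uint16_t lid;   /* assumed key: port of the broken connection */
    uint32_t vpid;  /* assumed key: process at the other end */
} sketch_ctl_hdr_t;

typedef struct {
    uint16_t lid;
    uint32_t vpid;
    enum sketch_ep_state state;
    int eager_rdma_errors;  /* placeholder for the real eager RDMA handling */
} sketch_endpoint_t;

/* Find the endpoint the message describes. The endpoint the message
 * arrived on is skipped: it is working, so it cannot be the broken one. */
static sketch_endpoint_t *sketch_find_broken(sketch_endpoint_t *eps, size_t n,
                                             const sketch_ctl_hdr_t *hdr,
                                             const sketch_endpoint_t *received_on)
{
    for (size_t i = 0; i < n; i++) {
        if (&eps[i] == received_on) {
            continue;
        }
        if (eps[i].lid == hdr->lid && eps[i].vpid == hdr->vpid) {
            return &eps[i];
        }
    }
    return NULL;
}

/* Returns 0 if the message was acted on, -1 otherwise. */
static int sketch_handle_failover_ctl(sketch_endpoint_t *eps, size_t n,
                                      const sketch_ctl_hdr_t *hdr,
                                      const sketch_endpoint_t *received_on)
{
    assert(eps != NULL && hdr != NULL);
    sketch_endpoint_t *broken = sketch_find_broken(eps, n, hdr, received_on);
    if (NULL == broken) {
        return -1;
    }
    switch (hdr->type) {
    case SKETCH_CTL_EP_BROKEN:
        /* The remote side brought its half down; bring ours down too. */
        broken->state = SKETCH_EP_FAILED;
        return 0;
    case SKETCH_CTL_EP_EAGER_RDMA_ERROR:
        /* Placeholder only: the real recovery steps are not shown here. */
        broken->eager_rdma_errors++;
        return 0;
    default:
        return -1;
    }
}
```

The point of the sketch is that the receiving endpoint is passed in only so it can be excluded; the broken endpoint is always located from the contents of the control header.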
void mca_btl_openib_handle_btl_error (mca_btl_openib_module_t *openib_btl)
This function allows an error to map out the entire BTL.
First, a call is made up to the PML to map out all connections from this BTL. Then a message is sent to all the endpoints connected to this BTL. This function is enabled by the btl_openib_port_error_failover MCA parameter. If that parameter is not set, this function does nothing.
openib_btl | Pointer to BTL that had the error |
References mca_btl_base_endpoint_t::endpoint_state, mca_btl_openib_device_t::endpoints, mca_btl_openib_module_t::error_cb, error_out_all_pending_frags(), mca_btl_openib_module_t::lid, mca_btl_openib_endpoint_notify(), opal_pointer_array_get_item(), and opal_pointer_array_get_size().
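The sequence described for mca_btl_openib_handle_btl_error (check the parameter, call up to the PML, then notify every connected endpoint) might be sketched as follows. This is an illustration under stated assumptions, not the actual implementation: the `sketch_` types, the boolean `failover_enabled` standing in for btl_openib_port_error_failover, and the choice to mark notified peers as disconnected are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_PEERS 8

/* Hypothetical stand-ins; these are NOT the real Open MPI definitions. */
typedef struct {
    bool connected;  /* still usable for traffic */
    bool notified;   /* a "broken" message was queued for this peer */
} sketch_peer_t;

typedef struct sketch_btl {
    bool failover_enabled;                    /* stand-in for the MCA parameter */
    void (*error_cb)(struct sketch_btl *btl); /* stand-in for the PML callback */
    bool mapped_out;
    size_t num_peers;
    sketch_peer_t peers[SKETCH_MAX_PEERS];
} sketch_btl_t;

/* Stand-in for the PML mapping out every connection that uses this BTL. */
static void sketch_pml_map_out(sketch_btl_t *btl)
{
    btl->mapped_out = true;
}

/* Returns the number of peers that were notified. */
static size_t sketch_handle_btl_error(sketch_btl_t *btl)
{
    assert(btl != NULL && btl->num_peers <= SKETCH_MAX_PEERS);

    /* Without the failover parameter the function is a no-op. */
    if (!btl->failover_enabled) {
        return 0;
    }

    /* Step 1: call up to the PML so it stops using this BTL. */
    if (btl->error_cb != NULL) {
        btl->error_cb(btl);
    }

    /* Step 2: tell every connected peer that this side is going down.
     * Assumption: peers that are already disconnected are skipped. */
    size_t notified = 0;
    for (size_t i = 0; i < btl->num_peers; i++) {
        if (!btl->peers[i].connected) {
            continue;
        }
        btl->peers[i].connected = false;
        btl->peers[i].notified = true;
        notified++;
    }
    return notified;
}
```

Placing the parameter check first keeps the disabled path free of side effects, which matches the statement that the function does nothing when the parameter is not set.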
void mca_btl_openib_handle_endpoint_error (mca_btl_openib_module_t *openib_btl, mca_btl_base_descriptor_t *des, int qp, ompi_proc_t *remote_proc, mca_btl_openib_endpoint_t *endpoint)
This function is called when we get an error on the completion event of a fragment.
We check what type of fragment it is and act accordingly. In most cases, we first call up into the PML and have it map out this connection for any future communication. In addition, this function may send control messages over the other openib BTL. The first control message tells the remote side to also map out this connection. The second control message makes sure the eager RDMA connection remains in a sane state. See that function for more details.
openib_btl | Pointer to BTL that had the error |
des | Pointer to descriptor that had the error |
qp | Queue pair that had the error |
remote_proc | Pointer to process that had the error |
endpoint | Pointer to endpoint that had the error |
References mca_btl_base_descriptor_t::des_cbfunc, mca_btl_base_descriptor_t::des_flags, mca_btl_base_descriptor_t::des_src, mca_btl_base_endpoint_t::endpoint_state, mca_btl_openib_module_t::error_cb, error_out_all_pending_frags(), mca_btl_base_endpoint_t::get_tokens, mca_btl_openib_module_t::lid, mca_btl_openib_endpoint_notify(), opal_list_remove_first(), opal_output_verbose(), and OPAL_THREAD_ADD32.