This memory pool is used for getting the memory handle of remote GPU memory when using CUDA.
Hence, the name is "rgpusm" for "remote CUDA" GPU memory. There is a cache that can be used to store the remote handles in case they are reused, to save on the registration cost, as that can be expensive: on the order of 100 usecs. The cache can also be used just to track how many handles are in use at a time. It is best to look at this with the four different scenarios that are possible.
1. mpool_rgpusm_leave_pinned=0, cache_size=unlimited
2. mpool_rgpusm_leave_pinned=0, cache_size=limited
3. mpool_rgpusm_leave_pinned=1, cache_size=unlimited (default)
4. mpool_rgpusm_leave_pinned=1, cache_size=limited
Case 1: The cache is unused and remote memory is registered and unregistered for each transaction. The amount of outstanding registered memory is unlimited.

Case 2: The cache keeps track of how much memory is registered at a time. Since leave_pinned is 0, any memory that is registered is in use. If the amount to register would exceed the limit, we error out. This could be handled more gracefully, but this is not a common way to run, so we leave it as is.

Case 3: The cache is needed to track current and past transactions. However, there is no limit on the number of registrations that can be stored. Therefore, once memory enters the cache and gets registered, it stays that way forever.

Case 4: The cache is needed to track current and past transactions. In addition, a list of the most recently used (but no longer in use) registrations is kept so that registrations can be evicted from the cache; those evicted registrations are also deregistered.
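To make the four modes concrete, here is a minimal, self-contained C sketch of the registration path. It is illustrative only: the names (`entry_t`, `cuda_register`, `rgpusm_register`, `evict_until_fits`) are hypothetical stand-ins rather than the real rgpusm symbols, the MRU bookkeeping is reduced to a plain linked list, and the release/deregistration path for leave_pinned=0 is omitted.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct entry {
    void         *base;    /* base address of the remote allocation */
    size_t        size;    /* bytes covered by this registration    */
    int           refcnt;  /* > 0 while a transaction is using it   */
    struct entry *next;    /* singly linked cache list              */
} entry_t;

static entry_t *cache = NULL;
static size_t   cached_bytes = 0;

/* Stand-ins for the expensive (~100 usec) CUDA IPC registration calls. */
static int  cuda_register(void *b, size_t s) { (void)b; (void)s; return 0; }
static void cuda_deregister(void *b)         { (void)b; }

/* Case 4 helper: deregister not-in-use entries until `need` more bytes
 * fit under `limit`.  Entries still in use (refcnt > 0) are skipped. */
static bool evict_until_fits(size_t need, size_t limit)
{
    entry_t **pp = &cache;
    while (cached_bytes + need > limit && *pp != NULL) {
        entry_t *e = *pp;
        if (e->refcnt == 0) {
            cuda_deregister(e->base);
            cached_bytes -= e->size;
            *pp = e->next;
            free(e);
        } else {
            pp = &e->next;
        }
    }
    return cached_bytes + need <= limit;
}

/* limit == 0 means "unlimited" in this sketch. */
static int rgpusm_register(void *base, size_t size,
                           bool leave_pinned, size_t limit)
{
    /* Case 1: no caching; the caller deregisters after the transaction. */
    if (!leave_pinned && limit == 0)
        return cuda_register(base, size);

    /* Cases 2-4: look for an existing registration covering this buffer. */
    for (entry_t *e = cache; e != NULL; e = e->next) {
        if (e->base == base && size <= e->size) {
            e->refcnt++;                 /* cache hit: reuse the handle */
            return 0;
        }
    }

    if (limit != 0 && cached_bytes + size > limit) {
        if (!leave_pinned)               /* case 2: everything cached is */
            return -1;                   /* in use, so just error out    */
        if (!evict_until_fits(size, limit))
            return -1;                   /* case 4: evict unused entries */
    }

    entry_t *e = malloc(sizeof(*e));
    if (e == NULL)
        return -1;
    e->base = base; e->size = size; e->refcnt = 1;
    e->next = cache; cache = e;
    cached_bytes += size;
    return cuda_register(base, size);    /* case 3: stays pinned forever */
}
```

The key design point the sketch tries to show is that leave_pinned only changes what happens when the limit is hit: in case 2 everything cached is in use, so the register fails outright, while in case 4 the pool can reclaim room by deregistering entries that are no longer in use.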
I also want to capture how we can run into the case where we do not find a buffer in the cache, yet when we try to register it, the CUDA library returns an error saying the memory is already in use. This can happen in the following scenario:

1. The application mallocs a buffer of size 32K.
2. The library loads this buffer into the cache and registers it.
3. The application frees the buffer, then mallocs a buffer of size 64K. This malloc returns the same base address as the first 32K allocation.
4. The library searches the cache, but since the size is larger than the original allocation, it does not find the registration.
5. The library attempts to register the buffer, and the CUDA library returns an error saying it is already mapped.

To handle this, we return OMPI_ERR_WOULD_BLOCK to the memory pool. The memory pool then looks for the registration based on the base address and a size of 4; we use the small size to make sure that we find the registration. That registration is evicted, and we try to register again.
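The recovery path can be sketched as a simple retry loop. Again, the helper names (`cuda_ipc_register`, `evict_by_base`) and the numeric error value are hypothetical; only the shape of the logic, returning OMPI_ERR_WOULD_BLOCK, evicting the stale registration found via the base address and a size of 4, and retrying, follows the description above.

```c
#include <stddef.h>
#include <stdio.h>

#define OMPI_SUCCESS          0
#define OMPI_ERR_WOULD_BLOCK -2   /* illustrative value, not the real one */

/* Stub: pretend the first attempt hits CUDA's "already mapped" error
 * because a registration from a freed, smaller allocation lingers. */
static int cuda_ipc_register(void *base, size_t size)
{
    static int stale = 1;
    (void)base; (void)size;
    if (stale) { stale = 0; return OMPI_ERR_WOULD_BLOCK; }
    return OMPI_SUCCESS;
}

/* Stub: find the old registration by (base, 4) and deregister it.
 * The tiny size guarantees a match inside any prior allocation that
 * started at this address, whatever its original length. */
static void evict_by_base(void *base)
{
    printf("evicting stale registration at %p\n", base);
}

int rgpusm_register_retry(void *base, size_t size)
{
    int rc = cuda_ipc_register(base, size);
    if (rc == OMPI_ERR_WOULD_BLOCK) {
        evict_by_base(base);
        rc = cuda_ipc_register(base, size);  /* retry after eviction */
    }
    return rc;
}
```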