OpenMPI  0.1.1
ras.h
/*
 * Copyright (c) 2004-2008 The Trustees of Indiana University and Indiana
 *                         University Research and Technology
 *                         Corporation.  All rights reserved.
 * Copyright (c) 2004-2005 The University of Tennessee and The University
 *                         of Tennessee Research Foundation.  All rights
 *                         reserved.
 * Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
 *                         University of Stuttgart.  All rights reserved.
 * Copyright (c) 2004-2005 The Regents of the University of California.
 *                         All rights reserved.
 * $COPYRIGHT$
 *
 * Additional copyrights may follow
 *
 * $HEADER$
 */
/** @file:
 *
 * The Open RTE Resource Allocation Subsystem (RAS)
 *
 * The resource allocation subsystem is responsible for determining
 * what (if any) resources have been allocated to the specified job
 * (via some prior action), and for obtaining an allocation (if possible)
 * if resources have NOT been previously allocated. It is anticipated
 * that ORTE users will execute an "mpirun" or other command that
 * invokes ORTE through one of two channels:
 *
 * 1. local: the user will log in to the computing resource they intend
 * to use, request a resource allocation from that system, and then
 * execute the mpirun or other command. Thus, the allocation has
 * already been obtained prior to ORTE's initialization. In most
 * cases, systems pass allocation information via environmental
 * parameters. Thus, the RAS components must know the correct
 * environmental parameter to look for within the environment they
 * seek to support (e.g., an LSF component should know that LSF passes
 * allocation parameters as a specific LSF-named entity).
 *
 * 2. remote: the user issues an mpirun command on their notebook or
 * desktop computer, indicating that the application is to be executed
 * on a specific remote resource. In this case, the allocation may
 * not have been previously requested or made. Thus, the associated
 * RAS component must know how to request an allocation from the
 * designated resource. To assist in this process, the RAS can turn to
 * the information provided by the resource discovery subsystem (RDS)
 * to learn what allocator resides on the designated resource.
 *
 * The RAS operates on a per-job basis - i.e., it serves to allocate
 * the resources for a specific job. It takes several inputs,
 * depending upon what is provided and desired:
 *
 * - the jobid for which the resources are to be allocated. There are
 * two options here: (a) the jobid can be predefined and provided to
 * the allocator. In this case, the allocator will simply allocate
 * resources to the job; or (b) the jobid can be set by the allocator
 * via a request to the ORTE name services (NS) subsystem. This option
 * is selected by calling the allocate function with the illegal jobid
 * of ORTE_JOBID_MAX. In this case, the new jobid (set by the
 * allocator) will be returned in the provided address (the allocate
 * function takes a pointer to the jobid as its argument).
 *
 * - MCA parameters specifying preallocated resources. These resources
 * are allocated to the specified jobid (whether set by the allocator
 * or not) on the first request. However, subsequent requests for
 * allocation do NOT use these parameters - the parameters are "unset"
 * after initial use. This is done to prevent subsequent allocation
 * requests from unintentionally overloading the specified resources
 * in cases where the universe is persistent and therefore servicing
 * multiple applications.
 *
 * - MCA parameters specifying the name of the application(s) and the
 * number of processes of each application to be executed. These will
 * usually be taken from the command line options, but could be
 * provided via environmental parameters.
 *
 * - the resources defined in the ORTE_RESOURCE_SEGMENT by the
 * RDS. When an allocation is requested for resources not previously
 * allocated, the RAS will attempt to obtain an allocation that meets
 * the specified requirements. For example, if the user specifies that
 * the application must run on an Intel Itanium 2 resource under the
 * Linux operating system, but doesn't provide the allocation or
 * resource identification, then the allocator can (if possible)
 * search the ORTE_RESOURCE_SEGMENT for resources meeting those
 * specifications and attempt to obtain an allocation from them.
 *
 * The RAS outputs its results into three registry segments:
 *
 * (a) the ORTE_NODE_STATUS_SEGMENT. The segment consists of a
 * registry container for each node that has been allocated to a job -
 * for proper operation, each container MUST be described by the
 * following set of tokens:
 *
 * - nodename: a unique name assigned to each node, usually obtained
 * from the preallocated information in the environmental variables or
 * the resource manager for the specified compute resource (e.g.,
 * LSF). For those cases where specific nodenames are not provided,
 * the allocator can use the info provided by the RDS to attempt to
 * determine the nodenames (e.g., if the RDS learned that the nodes
 * are named q0-q1024 and we obtain an allocation of 100 nodes
 * beginning at node 512, then the RAS can derive the nodenames from
 * this information).
 *
 * For each node, the RAS stores the following information on the segment:
 *
 * - number of cpus allocated from this node to the user. This will
 * normally be the number of cpus/node as obtained from the data
 * provided by the RDS, but could differ in some systems.
 *
 * - the jobids that are utilizing this node. In systems that allow
 * overloading of processes onto nodes, there may be multiple jobs
 * sharing a given node.
 *
 * - the status of the node (up, down, rebooting, etc.). This
 * information is provided and updated by the state-of-health (SOH)
 * monitoring subsystem.
 *
 * (b) the ORTE_JOB_SEGMENT. The RAS preallocates this segment,
 * initializing one container for each process plus one container to
 * store information that spans the job. This latter container houses
 * information such as the application names, number of processes per
 * application, process context (including argv and enviro arrays),
 * and i/o forwarding info. The RAS does NOT establish or fill any of
 * the individual process info containers - rather, it preallocates
 * the storage for those containers and places some of the job-wide
 * information into that container. This info includes:
 *
 * - application names and number of processes per application
 *
 * - process context
 *
 * The remainder of the information in that container will be supplied
 * by other subsystems.
 *
 * (c) the ORTE_RESOURCE_SEGMENT. The RAS adds information to this
 * segment to indicate consumption of an available resource. In
 * particular, the RAS updates fields in the respective compute
 * resource to indicate the portion of that resource that has been
 * allocated and therefore can be presumed consumed. This includes
 * info on the number of nodes and cpus allocated to existing jobs -
 * these numbers are updated by the RAS when resources are deallocated
 * at the completion of a job.
 *
 * The information provided by the RAS is consumed by the resource
 * mapper subsystem (RMAPS) that defines which process is executed
 * upon which node/cpu, the process launch subsystem (PLS) that
 * actually launches each process, and others.
 *
 * Because the RAS operates as a multi-component framework (i.e.,
 * multiple components may be simultaneously instantiated), the RAS
 * functions should NOT be called directly. Instead, they should be
 * accessed via the ORTE resource manager (RMGR) subsystem.
 *
 */

#ifndef ORTE_MCA_RAS_H
#define ORTE_MCA_RAS_H

#include "orte_config.h"
#include "orte/constants.h"
#include "orte/types.h"

#include "opal/mca/mca.h"
#include "opal/class/opal_list.h"

#include "ras_types.h"

BEGIN_C_DECLS

/* define the API functions */
typedef int (*orte_ras_base_API_allocate_fn_t)(orte_job_t *jdata);

/* global structure for accessing RAS API's */
typedef struct {
    orte_ras_base_API_allocate_fn_t allocate;
} orte_ras_t;

ORTE_DECLSPEC extern orte_ras_t orte_ras;

/*
 * ras module functions - these are not accessible to the outside world,
 * but are defined here by convention
 */

/**
 * Allocate resources to a job.
 */
typedef int (*orte_ras_base_module_allocate_fn_t)(opal_list_t *nodes);

/**
 * Cleanup module resources.
 */
typedef int (*orte_ras_base_module_finalize_fn_t)(void);

/**
 * ras module
 */
struct orte_ras_base_module_2_0_0_t {
    /** Allocation function pointer */
    orte_ras_base_module_allocate_fn_t allocate;
    /** Finalization function pointer */
    orte_ras_base_module_finalize_fn_t finalize;
};
/** Convenience typedef */
typedef struct orte_ras_base_module_2_0_0_t orte_ras_base_module_2_0_0_t;
/** Convenience typedef */
typedef orte_ras_base_module_2_0_0_t orte_ras_base_module_t;

/*
 * ras component
 */

/**
 * Component init / selection
 * ras component
 */
struct orte_ras_base_component_2_0_0_t {
    /** Base MCA structure */
    mca_base_component_t base_version;
    /** Base MCA data */
    mca_base_component_data_t base_data;
};
/** Convenience typedef */
typedef struct orte_ras_base_component_2_0_0_t orte_ras_base_component_2_0_0_t;
/** Convenience typedef */
typedef orte_ras_base_component_2_0_0_t orte_ras_base_component_t;


/**
 * Macro for use in components that are of type ras
 */
#define ORTE_RAS_BASE_VERSION_2_0_0 \
    MCA_BASE_VERSION_2_0_0, \
    "ras", 2, 0, 0


END_C_DECLS

#endif