A Generalized Software Reliability Process Simulation Technique and Tool
\f3Robert C. Tausworthe\f1
\f2Jet Propulsion Laboratory
4800 Oak Grove Drive
Pasadena, CA 91109
(818) 306-6284\f1
\f3Michael R. Lyu\f1
\f2AT&T Bell Laboratories
600 Mountain Avenue
Murray Hill, NJ 07974
(908) 582-5366\f1
Key Words: Software Reliability, Simulation, Software Process, Tool.
\f3ABSTRACT\f1
This paper describes the structure and rationale of the generalized
software reliability process and a set of simulation techniques that may
be applied for the purpose of software reliability modeling.
These techniques establish a convenient means for studying a realistic,
end-to-end software life cycle that includes intricate subprocess
interdependencies, multiple defect categories, many factors of influence,
and schedule and resource dependencies, subject to only a few fundamental
assumptions, such as the independence of causes of failures.
The goals of this research are twofold:
first, to generate data that truly satisfy the simplifying assumptions
of various existing models, for the purpose of studying their comparative
merits, and second, to enable these models to extend their merits to a less
idealized, more realistic reliability life cycle.
This simulation technique has been applied to data from a spacecraft project
at the Jet Propulsion Laboratory; results indicate that the simulation
technique may lead to more accurate tracking and more timely
prediction of software reliability than is obtainable from
analytic modeling techniques.
.B
.ce
1. INTRODUCTION
.R
Software reliability has been the subject of wide study
over the past 20 years.
At least 40 different models have been published in the literature so far [1].
The primary focus of these studies has been on proposing, analyzing, and
evaluating the performance of models that assess current reliability and
forecast future operability from observable failure data using statistical
inference techniques. However, none of these
models extends over the entire reliability process; most tend to focus only on
failure observance during testing or operations. Moreover, none of these
reliability models has emerged as ``the best'' predictor in all cases [2].
This may be due to a number of factors, such as oversimplification
of the failure process, the quality of observed data, the lack of
sufficient data to make sound inferences, and/or serious differences
between the proposed model and the true underlying reliability process(es).
It is conceivable that the basic nature of the
failure process(es) may differ among individual software developments.
This paper proposes a general simulation technique that relaxes or removes
many of the usual reliability modeling assumptions, and expands the
reliability process
to encompass the \f2entire\f1 software life cycle.
Typical assumptions in reliability modeling are:
.IP (1)
Testing (or operations) randomly encounters failures.
.IP (2)
Failures in non-overlapping time intervals are independent.
.IP (3)
The test space ``covers'' the use space (i.e., the operational profile).
.IP (4)
All failures are observed when they occur.
.IP (5)
Faults are immediately removed upon failure, or not counted again.
.IP (6)
Execution time is the relevant independent variable.
.LP
Many of these assumptions are controversial and require further
qualification; there is therefore an advantage in relaxing or dismissing them.
In particular, the second assumption above can be weakened to
.IP (2)
Faults produce independent failures.
.LP
and the final four assumptions are not at all necessary to the technique
presented here. The degree of commonality between the test space and the use
space is rarely known, but it can be modeled if needed. Simulation can mimic
the failure to observe an error when it has, in fact, occurred, and,
additionally, mimic any system outage due to
an observed failure [3].
Furthermore, simulation can easily distinguish those
faults that have been removed and those that have not, so multiple failures
from the same unremoved fault can be readily reproduced. Finally, while
execution time is pertinent to some activities in the software life cycle,
it is not appropriate to all (e.g., inspections);
simulation can translate all
model-pertinent times to wall-clock (or calendar) time by appropriate
use of work load, computer utilization, and resource schedules.
This composite process is embodied in a Monte
Carlo simulation tool, \f2SoftRel\f1 [4],
available through NASA's Computer Software Management Information Center
as well as a diskette in [5].
But of what interest is a reliability process simulation tool to the
software practitioner? One powerful way of understanding a pattern
in nature is to recreate it in a simulation or other representative
model. Because reliability is one of the first-cited indicators of
quality, a tool that can reproduce the characteristics of the process
that builds reliability offers a potential for optimization via
tradeoffs involving scheduling, allocation of resources, and usages of
alternative technologies and methodologies. The parameters that
characterize that process become metrics to be managed as means to
achieve prescribed levels of quality.
A simulation tool may vary from simple to complex, depending on the
scope of the process being modeled and the fidelity required by the
user. Most of the analytic models referred to above would require
only a few inputs; the more general tool reported here, however, can
use up to about 70. The former models would report only a few facts
about the unfolding process, but the latter can report up to about 90.
Of course, it is possible for the more complex tool to simulate the
simple models, as well.
The 70 input parameters are spread over 14 activities (described
later) that comprise the reliability process. Thus, each subprocess
only uses 5 of the parameters, on average, some of which quantify
interrelationships among activities. Each of the activity submodels
is thus fairly straightforward. It is not necessary to simulate the
entire process all at once. If data are only available on the failure
and repair portions of the process, then only the parameters that
characterize these activities need to be input to the simulator.
And while giving values to all of the development environment
parameters and producing a project resource allocation schedule may
seem like daunting tasks if they have never been done before, they should become
progressively easier as histories and experience identify the stable and
volatile elements of the projects undertaken by the task membership.
It can be argued that the elements that characterize and control the
process need to be estimated anyway, whether used by the simulator or
not, in order to understand and manage the reliability process
effectively. Parameters such as the expected fault density and the
average defect repair cost are familiar values extracted from prior
project histories.
The remainder of this paper is organized as follows:
Section 2 provides the basis for simulating the software reliability
process; Section 3 briefly describes the structures and interactions of the
reliability simulation package \f2SoftRel\f1; Section 4 presents a case
study in which the implemented simulation technology was applied
to a real-world project. Conclusions and future directions are presented in
Section 5.
.ce
.B
2. SIMULATION BUILDING BLOCKS
.R
.B
2.1 Event Simulation
.R
The very definition of conditional event-rate processes suggests
the rather simple computer simulation illustrated in
the following C language segment [6]:
.nf
.ls 1
.ps 11
.vs 13
.ls 1
.ft C
/* t and dt are assumed set prior to this point */
events = 0;
T = 0.;
while (T < t)
{
    T += dt;
    if (chance(beta(events, T) * dt))
        events++;
}
/* events now holds the number of occurrences in (0, t) */
.ft R
.ls 2
.ps 12
.vs 12
.fi
.sp .5
The $dt$ in such simulations must always be chosen such that the
variations in the failure rate $beta (t)$ (see Theory Sidebar)
over the incremental time intervals $(t,~t~+~dt)$ are
negligible, and such that $beta (t)~dt~<~1$ (so that the instantaneous event
probability does not reach unity).
In the code segment above, \fCchance(x)\fR compares a $[0, 1)$-uniform
\fCrandom()\fR value with \fCx\fR, thus attaining the specified instantaneous
probability function. The form of \fCbeta(events, T)\fR acknowledges that
the event rate function may change over time and may be sensitive to the
number of event occurrences up to the current time.
The computational complexity of this algorithm is $O( beta T / DELTA t)$,
in constant space. The $beta$ component represents the maximum time
required to compute $ beta sub n (t)$. This complexity presents no
problems for computers of even moderate speed today.
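To make the code segment above self-contained, the following minimal C
program supplies the pieces it assumes; the \fCchance()\fR helper is built on
the standard \fCrand()\fR generator, and the Jelinski-Moranda-style rate
function with its parameter values is only an illustrative assumption, not
part of any particular tool.
.nf
.ps 11
.vs 13
.ls 1
.ft C
/* Minimal sketch: event-rate simulation with an illustrative
   Jelinski-Moranda-style rate function.  All names and values
   here are assumptions for illustration only. */
#include <stdio.h>
#include <stdlib.h>

/* returns 1 with probability x, 0 otherwise */
static int chance(double x)
{
    return ((double)rand() / ((double)RAND_MAX + 1.0)) < x;
}

/* illustrative rate: phi * (n0 - events), zero once n0 is exhausted */
static double beta(int events, double T)
{
    double n0 = 25.0, phi = 0.05;
    return (events < (int)n0) ? phi * (n0 - events) : 0.0;
}

int main(void)
{
    double t = 100.0, dt = 0.01, T = 0.;
    int events = 0;

    while (T < t)
    {
        T += dt;
        if (chance(beta(events, T) * dt))
            events++;
    }
    printf("%d events occurred by time %g\\n", events, t);
    return 0;
}
.ft R
.ps 12
.vs 12
.ls 2
.fi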
The above illustration of simulation is simple, and yet very powerful.
For example, some published analytic models treat (or approximate)
the overall reliability growth as an
NHPP in execution time, while others focus on Markoff execution-time
interval statistics. Many of these differ only in the forms of their
rate functions [1].
Some examples are:
.sp .5
.IP 1. 2
The Jelinski-Moranda model [7]
deals with adjacent time-interval subprocesses in which
$beta sub n (t) ~~=~~ phi ~ (n sub 0~-~n)$, where $ n sub 0$ is the (unknown)
number of initial software faults and $phi$ is the per-fault failure rate.
.sp .5
.IP 2. 2
The Goel-Okumoto model [8]
deals with the overall reliability growth process, in which
$beta (t) ~~=~~ n sub 0 ~ phi ~ exp(- phi t)$,
where $n sub 0$ and $phi$ are constant parameters. It can be shown
that this model produces results very much like those of the Jelinski-Moranda
model with $n ~~=~~ n sub 0 (1 ~-~ exp( - phi t ))$.
.sp .5
.IP 3. 2
The Musa-Okumoto model [9]
describes the overall reliability growth process, in which
$beta (t) ~~=~~ beta sub 0 / (1 ~+~ theta t )$, where $beta sub 0$
is the initial failure rate and $theta$ is a rate decay factor. Both
$beta sub 0$ and $theta$ are constant parameters.
.sp .5
.IP 4. 2
The Duane model [10]
is an overall reliability growth model with
$beta (t) ~~=~~ k b t sup { b - 1}$, where $k$ and $b$ are constant
parameters.
.sp .5
.IP 5. 2
The Littlewood-Verrall inverse linear model [11]
is an overall reliability growth model with
$beta (t) ~~= tilde~~ phi ~/~ {sqrt { 1 ~+~ kt}}$,
where $phi$ and $k$ are constant parameters.
.sp .5
.IP 6. 2
The Yamada delayed S-shape model [12]
is yet another overall reliability growth model, with
$beta (t) ~~=~~ phi gamma t~ exp(1 ~-~ gamma t)$,
where $phi$ (the maximum failure rate) and $gamma$ are constant parameters.
.LP
.pn +2
Simulating the reliability process underlying these models is
straightforward.
Figure 1 illustrates the results obtained from simulating the above six
models. In each of the six simulation diagrams, the rate function
($beta$) and its associated parameters are listed. The parameters are
chosen so that roughly 25 software faults initially remain
in the system.
The value of 25 was chosen to emphasize the variability of the failure
processes at low fault counts; at higher fault concentrations, the
decreasing $sigma / n bar$ produces smaller deviations.
<< Insert Figure 1 About Here >>
Figure 1: Simulation Results Based on Six Software Reliability Models
A number of simulations were conducted for each model
to generate the occurrences of failures versus testing time.
Each diagram shows the mean of the simulation results (the line marked
"m"), the curves one standard deviation above and below the mean (the lines
marked "m+s" and "m-s"), and the standard deviation itself along the
time line (the line marked "s" at the bottom). These simulations neither
validate nor invalidate
whether a particular model fits an actual project's data. They merely
illustrate the ease with which the characteristics of such processes can
be comparatively analyzed.
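As an indication of how such comparison curves can be generated, the
following self-contained sketch replicates the event-rate loop of
Subsection 2.1 many times and accumulates the mean and standard deviation
of the event count at sampled times; the Musa-Okumoto rate function and
all numeric values here are illustrative assumptions only.
.nf
.ps 11
.vs 13
.ls 1
.ft C
/* Sketch: replicating the Subsection 2.1 loop to obtain the mean ("m")
   and standard-deviation ("m+s", "m-s", "s") profiles of the kind
   plotted in Figure 1.  All values are illustrative assumptions. */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#define RUNS  100     /* number of independent realizations */
#define STEPS 1000    /* time steps per realization         */

static int chance(double x)
{
    return ((double)rand() / ((double)RAND_MAX + 1.0)) < x;
}

/* Musa-Okumoto rate: beta(t) = beta0 / (1 + theta * t) */
static double beta_mo(double t)
{
    double beta0 = 1.0, theta = 0.15;
    return beta0 / (1.0 + theta * t);
}

int main(void)
{
    static double sum[STEPS], sumsq[STEPS];   /* zero-initialized */
    double dt = 0.1;
    int r, i;

    for (r = 0; r < RUNS; r++) {
        int events = 0;
        for (i = 0; i < STEPS; i++) {
            if (chance(beta_mo(i * dt) * dt))
                events++;
            sum[i]   += events;
            sumsq[i] += (double)events * events;
        }
    }
    for (i = 0; i < STEPS; i += 100) {        /* sample the profiles */
        double m = sum[i] / RUNS;
        double v = sumsq[i] / RUNS - m * m;
        printf("t=%6.1f  m=%6.2f  s=%5.2f\\n", i * dt, m, sqrt(v > 0. ? v : 0.));
    }
    return 0;
}
.ft R
.ps 12
.vs 12
.ls 2
.fi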
.B
2.2 Poisson Process Simulation
.R
The NHPP is also easily simulated when the hazard function
$lambda (t sub a , ~t sub b )$ is known
in closed form. The program for counting the overall number of NHPP events
that will occur over a given time interval is merely
.ft C
.nf
.ps 11
.vs 13
.ls 1
#define produce(x) random_poisson(x)
events = produce(lambda(ta, tb));
.ps 12
.vs 12
.ls 2
.fi
.ft R
where \fCrandom_poisson(x)\fR is a subprogram that produces a
Poisson-distributed random value when passed the parameter \fCx\fR.
An algorithm for generating Poisson random numbers may be found in [13].
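For completeness, one such generator is sketched below in the classical
product-of-uniforms style (essentially the method given in [13]); it is
adequate for the moderate means that arise here, and the helper names are
assumptions made for this illustration.
.nf
.ps 11
.vs 13
.ls 1
.ft C
/* Sketch of a Poisson random-number generator: multiply uniform
   variates until the product falls below exp(-x).  Helper names
   are illustrative assumptions. */
#include <math.h>
#include <stdlib.h>

static double uniform01(void)
{
    return (double)rand() / ((double)RAND_MAX + 1.0);
}

/* returns a Poisson-distributed count with mean x */
int random_poisson(double x)
{
    double limit = exp(-x), product = uniform01();
    int count = 0;

    while (product > limit) {
        count++;
        product *= uniform01();
    }
    return count;
}
.ft R
.ps 12
.vs 12
.ls 2
.fi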
The time profile of an NHPP may be simulated by slicing the
$(0, ~t)$ interval into $dt$ time slots, recording the behavior in each slot,
and progressively accumulating the details to obtain the overall event
count profile, as in the following algorithm:
.nf
.ps 11
.vs 13
.ls 1
.ft C
t = 0.;
while (t < t_max)
{
    n = produce(lambda(t, t + dt));   /* n is the fine structure */
    events += n;
    t += dt;
}
.fi
.ps 12
.vs 12
.ls 2
.ft R
The form of the cumulative rate function \fClambda(t, t + dt)\fR
may be extended to include a dependence on \fCevents\fR,
thereby causing the algorithm above to approximate a non-homogeneous
Markoff event-count process with
increasing fidelity as \fCdt\fR becomes sufficiently small that multiple events
per \fCdt\fR interval become rare. As mentioned above, however, the
behavior of
such simulations may be indistinguishable, even at larger \fCdt\fR, on the
basis of single realizations of the event process. This hybrid form
can speed up the simulation by removing the necessity of slicing time
into extremely small intervals.
This modified form of the simulation algorithm is called the
\f2piecewise-Poisson approximation\f1 of the Markoff event-count process.
.B
2.3 Multiple Categories of Events
.R
If the set of events {$epsilon sub i ~:~ i~=~1,~ ... ~, ~n$} that were classed
together above is now partitioned into categorized subsets according to
some given differentiation criteria (for example, faults distinguished
as being \f2critical\f1, \f2major\f1, or \f2minor\f1), then the partitioning
of events into categories likewise partitions their rate functions into
corresponding categories, which may be indexed by integers.
.\"and equivalently, the bracketed indices of the
.\"rate functions into sets of integers.
When an event occurs, the algorithm in the Theory Sidebar produces the
index of a rate function. Finding this index among the categorized subsets of
integers relates the event to the distinguished category of occurrences.
The behavior of multiple categories of events is thus easily simulated
by changing from a single event counter, \fCevents\fR, to an array of
event counters, \fCevents[ ]\fR, and altering the program as follows:
.ft C
.ps 11
.vs 13
.ls 1
.nf
i = event_index(n, t);      /* which rate function fired          */
c = event_category(n, i);   /* map that index to its category     */
events[c]++;                /* count the event in its category    */
.fi
.ft R
.ps 12
.vs 12
.ls 2
The overall event classification scheme is thus encapsulated within a single
\fCevent_category()\fR function.
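The \fCevent_index()\fR and \fCevent_category()\fR calls above are placeholders;
one hypothetical way to realize them, assuming the firing rate function is
chosen in proportion to its share of the composite rate of (Eq. 4), is
sketched below with simplified signatures and an illustrative three-class
rate table.
.nf
.ps 11
.vs 13
.ls 1
.ft C
/* Hypothetical sketch of category selection: when the composite process
   fires, pick rate function i with probability beta[i] / beta_total,
   then map i to its category.  The rate table, category map, and
   signatures are illustrative assumptions, not SoftRel internals. */
#include <stdlib.h>

#define N_RATES 3

/* per-class rates; constants here purely for illustration */
static double rate[N_RATES] = { 0.02, 0.01, 0.005 };

/* category of each rate function: 0 = critical, 1 = major, 2 = minor */
static int category_of[N_RATES] = { 0, 1, 2 };

static double uniform01(void)
{
    return (double)rand() / ((double)RAND_MAX + 1.0);
}

/* pick which rate function produced the event */
int event_index(void)
{
    double total = 0.0, u, cum = 0.0;
    int i;

    for (i = 0; i < N_RATES; i++)
        total += rate[i];
    u = uniform01() * total;
    for (i = 0; i < N_RATES - 1; i++) {
        cum += rate[i];
        if (u < cum)
            break;
    }
    return i;
}

int event_category(int i)
{
    return category_of[i];
}
.ft R
.ps 12
.vs 12
.ls 2
.fi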
.B
2.4 Other Event Processes
.R
In the software life cycle, it is often the case that, if an event of one
type occurs, then there is a uniform probability $p < 1$ that another event of
a different type will be triggered. (For example, suppose that each time a
unit of code is generated, there is a probability $p$ that a fault is created.)
If there are $n$ events of the first type, then the $k$ events of the second
type are governed by the binomial distribution function, which is also
easily simulated [14].
.\"Moreover, when $n$ itself is a Poisson random variable with parameter
.\"$lambda$, the distribution of $k$ is also Poisson, with
.\"parameter $p lambda$. Thus,
.\"occurrences of events of the second type may be simulated without
.\"actually counting events of the first type by using
.\"the \fCproduce()\fR function with parameter $p lambda$.
.\"
.ft C
.ps 11
.vs 13
.ls 1
.nf
#define select(n, p) random_binomial(n, p)
. . .
n = produce(lambda(t, t + dt));
k = select(n, p);
.fi
.ft R
.ps 12
.vs 12
.ls 2
Finally, when there is an ultimate number of events $N$ that a Poisson process
may reach before it is terminated, and $N$ is specified in advance, then
the growth of \fCevents\fR over time must be stopped after the $N$th
occurrence. Such \f2goal-limited processes\f1 are also easily
simulated.
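A hypothetical realization of the \fCrandom_binomial()\fR routine behind
\fCselect()\fR, together with a simple goal-limiting guard, is sketched below;
the direct Bernoulli-trial method is adequate for the modest counts per
\fCdt\fR interval assumed here, and the helper names are assumptions.
.nf
.ps 11
.vs 13
.ls 1
.ft C
/* Sketch of the select() primitive: k successes out of n trials, each
   with probability p, plus a goal-limited increment that never lets a
   cumulative count exceed a preset total N.  Names are assumptions. */
#include <stdlib.h>

static double uniform01(void)
{
    return (double)rand() / ((double)RAND_MAX + 1.0);
}

/* k ~ Binomial(n, p) by direct Bernoulli trials */
int random_binomial(int n, double p)
{
    int i, k = 0;

    for (i = 0; i < n; i++)
        if (uniform01() < p)
            k++;
    return k;
}

/* goal-limited increment: stop counting once the goal N is reached */
int limited_increment(int events, int increment, int N)
{
    return (events + increment > N) ? N : events + increment;
}
.ft R
.ps 12
.vs 12
.ls 2
.fi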
.B
2.5 General Event-Rate Processes
.R
The simulation method of this paper is more general than is required for
mere production of Markoff processes and NHPPs. Since the algorithm of
Subsection 2.1 springs directly from the definition of a conditional
event-rate process, the method is capable of simulating all event-rate
random processes.
It thus is possible to simulate life cycle activities that may have
event count dependencies between non-overlapping time intervals and
rate functions that depend on variable schedules and other
irregularities over time. Whenever the rate functions produce homogeneous
Markoff processes in a piecewise fashion, the event processes simulated
during each of these segments follow the piecewise-Poisson approximation.
The programs presented above are thus capable of simulating a much more
general and realistic reliability process than has been
hypothesized by any analytic model known to the authors.
Programs such as those shown above are typical of
methods traditionally used to analyze stochastic processes over a
variety of input conditions. From a programming perspective, then,
very little sophistication is required to simulate a reliability
process. Insight, care, and validation are required, however, in
modeling the intricate system-dynamic interrelationships
among the various rate functions that characterize that process.
.ce
.B
3. STRUCTURE OF THE SIMULATION TOOL
.R
.B
3.1 Overall Simulation Context
.R
The techniques described in the previous Section are embodied in a software
reliability process simulation package, called \fISoftRel\fR.
\fISoftRel\fR simulates the entire software reliability life cycle, including
the effects of interrelationships among activities. For example,
\fISoftRel\fR provides for an increased likelihood of faults injected
into code as the
result of missing or defective requirements specifications. \fISoftRel\fR
also acknowledges that testing requires the preparation
and consumption of test cases, and that repairs must follow identification
and isolation. \fISoftRel\fR further requires that human and computer
resources be scheduled for all activities.
The \fISoftRel\fR package is a prototype, currently configured to simulate
processes having constant event rates per causal unit. The authors do not
advocate that constant-rate processes necessarily model software reliability,
nor do
they endorse the prototype as a model ready for industrial use. Rather,
it is regarded as a framework for experimentation, for generating data
typical of analytic model assumptions for comparison with actual collected
project data, and for inference of project characteristics from comparisons.
Other event-rate functions will be accommodated in later versions by
changing current constant rates and other parameters to
properly defined functions indicated by project histories.
The current input to \f2SoftRel\f1 consists of a single file that specifies
the \fCdt\fR time slice, about 70 traits of the software project and
its reliability process, and a list of activity, schedule, and resource
allocations. Internally, these form a data structure called the \fCmodel\fR.
Also internally, the set of status monitors at any given time is stored in
a data structure called \fCfacts\fR, which records the overall clock
time, the time and resources consumed by each activity (42 measures in
total), and a snapshot of 48 measures of project status.
The output from \fISoftRel\fR is a
single file containing the series of \fCfacts\fR produced at each \fCdt\fR
interval of time.
.\"The input and output parameters of \fISoftRel\fR
.\"are listed in the Parameter-List Sidebar.
\f2SoftRel\f1 simulates two types of failure events, namely,
defects in specification documents and faults in code.
Figure 2 shows the execution context of \f2SoftRel\f1.
.B
3.2 The Major Components of the Simulator
.R
\f2SoftRel\f1 is initialized by setting
sizes of items for construction, integration, and
inspection. These could have been designed just to equal the goal values
given in the \fCmodel\fR, but the \fCmodel\fR values are considered only
approximate. Sizes are set to Poisson random values, with the \fCmodel\fR
input values as means.
.ps 11
.vs 11
.ls 1
.PS < simu.pic
.ls 2
.ps 12
.vs 12
.ce
.ft H
Figure 2: \fISoftRel\fH Execution Context
.ft R
In a typical software engineering life cycle, several interrelated
software reliability subprocesses are taking place concurrently.
The activities in these subprocesses are characterized by 14
major components in the simulator, with appropriate
staffing and resource levels devoted to each activity:
.IP (1)
\f2Document Construction\f1:
Document generation and integration are assumed to be piecewise-Poisson
approximations with constant mean rates per workday specified in the
\fCmodel\fR, not to exceed the goal values. Defects are assumed injected at
a constant probability per documentation unit.
At each injection of a defect, the document hazard increases according to
the defect detection characteristic, which is discussed further
under Document Inspection below.
.IP (2)
\f2Document Integration\f1:
Document integration consists of acquisition of reusable documentation,
deletion of unwanted portions, addition of new material, and minor changes.
Each of these subactivities is assumed to be a goal-limited piecewise-Poisson
approximation of a type similar to the construction process described above.
Defects are created as a result of each subactivity. Documentation
is integrated at a constant mean rate per workday, and defects are injected at
a constant probability per documentation unit. Hazard increases at each
defect according to the defect detection characteristic assumed.
The total current documentation units consist of new, reused minus
deleted, and added units; mere changes are deemed not to alter the total
volume of documentation.
.IP (3)
\f2Document Inspection\f1:
Document inspection is a goal-limited piecewise-Poisson approximation
of a type similar to
document construction; both new and integrated reused documentation are
assumed to be inspected at the same rate, and with the same efficiency.
Documentation is inspected
at a mean constant rate per workday.
Inspected units are allocated among new documents and reused documents in
proportion to the relative amounts of documentation in these two categories.
Defects detected during inspections may not
exceed those injected; the discovery of defects
is characterized as a goal-limited binomial process.
The defect discovery rate is assumed to be
proportional to the current accumulated document hazard and
the inspection efficiency (a sketch of one such goal-limited binomial step
appears after this list).
.IP (4)
\f2Document Correction\f1:
Defect corrections are produced at a rate determined by the staff level and
the attempted-fix rate given in the \fCmodel\fR; actual corrections take place
according to the defect-fix adequacy, not to exceed the actual number of
defects discovered (a goal-limited binomial situation).
Attempted fixes can also inject new defects and
can change the overall amount of documentation via the numbers of
documentation units deleted, added, and changed. True corrections
decrease the document hazard, while the injection of new defects increases
it.
.IP (5)
\f2Code Construction\f1:
Production of code follows the same formulation as does document construction.
However, the average pace at which faults are created is influenced not only
by the usual
fault density that may occur as a normal consequence of
coding, but also by the density of undiscovered defects in
documentation, and by the amount of missing documentation.
Each fault injected increases the code hazard. But whereas document defects
are only found by inspection, code faults may be found by both inspection
and testing, and at different rates.
.IP (6)
\f2Code Integration\f1:
Simulation of code integration is comparable in structure to document
integration, except that code units replace document units and coding
rates replace documentation rates. The fault injection
rate is of the same form as that for code construction, above.
Each fault increases the code hazard.
.IP (7)
\f2Code Inspection\f1:
Code inspection mirrors the document inspection process.
The number of faults discovered will not exceed the total
number of as-yet undiscovered faults. The fault discovery rate is
assumed to be proportional to the current accumulated fault hazard
and the inspection efficiency. Since previously discovered faults
may not yet have been removed at the time of discovery, the number of
newly discovered faults is
assumed to be in proportion to the number of as-yet undiscovered faults.
.IP (8)
\f2Code Correction\f1:
Code correction simulation follows the same algorithm given for document
correction, translated to code units. Fault hazard is reduced upon correction
of a fault, and increased if any new faults are injected by the correction
process. Documentation changes are produced at assumed constant mean
rates per attempted correction.
.IP (9)
\f2Test Preparation\f1:
Test preparation consists merely of producing a number of test cases in
each \fCdt\fR slot in proportion to the test preparation rate, which is
a constant mean number of test cases per workday.
.IP (10)
\f2Testing\f1:
The testing activity simulation has two parts: If a test outage is in
effect, the outage time indicator decrements and the time and effort
increment. If an outage is not in effect, failures occur
at the \fCmodel\fRed rate; the number observed is computed as a binomial
process that is regulated by the probability of observation.
The failure \fCrate\fR function returns a value proportional to the
current hazard level. The function additionally consumes
computer resources and test cases, the latter at
a mean constant rate.
.IP (11)
\f2Fault Identification\f1:
The total number of failures analyzed may not exceed the number of failures
observed. Failures are analyzed at a mean constant rate per workday.
The identification of faults is limited
in number to those still remaining in the system. The isolation process
is regulated by the fraction of faults remaining undiscovered,
the adequacy of the analysis process, and the probability of faithful
isolation.
.IP (12)
\f2Fault Repair\f1:
The number of attempted repairs may not exceed the number of faults identified
by inspections and testing, less those corrected after inspection,
plus those identified for rework by validation
and retesting. Of those attempted, a \fCselect\fR number will really be
repaired, while the rest will mistakenly be reported as repaired. Repairs
are assumed here to be made on faults identified for rework first. A
\fCselect\fR number of new
faults may be created by the attempt, and code units may be altered (deleted,
added, or changed). Attempted repairs take place at a mean constant rate
per workday.
.IP (13)
\f2Validation of Repairs\f1:
The validation of attempted repairs takes place at an assumed mean
constant rate per workday.
The number of repairs validated may not exceed the number of attempted
repairs. The number of faulty repairs detected is a \fCselect\fR number
determined by the probability that validation will recognize an unrepaired
fault when one exists and the probability that unrepaired faults are among
those attempted repairs being validated (the repair adequacy);
the detected bad fixes cannot exceed the actual number of mis-repaired faults.
Those detected are designated for rework and removed from the
unrepaired, undiscovered fault count.
.IP (14)
\f2Retesting\f1:
Retesting takes place at a mean constant number of retests per workday
and consumes computer resources at the scheduled rate per day.
No new test cases are
generated (or consumed), since the original test cases are assumed
available for regression.
Retesting is assumed to encounter only those failures due to unrepaired faults.
.LP
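To make the goal-limited binomial mechanics referenced in activity (3)
concrete, the following hypothetical fragment sketches one \fCdt\fR step of a
discovery activity assembled from the Section 2 primitives; every name,
signature, and bookkeeping choice here is an illustrative assumption and is
not drawn from the \fISoftRel\fR source.
.nf
.ps 11
.vs 13
.ls 1
.ft C
/* Hypothetical one-step sketch of a goal-limited binomial discovery
   activity (cf. document inspection): units are inspected at a mean
   daily rate, and defects are discovered in proportion to accumulated
   hazard and inspection efficiency, never exceeding the defects still
   present.  All names and values are illustrative assumptions. */

extern int random_poisson(double mean);      /* the produce() primitive */
extern int random_binomial(int n, double p); /* the select()  primitive */

void inspect_step(double dt,
                  double units_per_day,  /* mean inspection pace        */
                  double efficiency,     /* fraction of hazard noticed  */
                  double hazard,         /* accumulated document hazard */
                  double *units_left,    /* units not yet inspected     */
                  int    *defects_left,  /* injected, undiscovered      */
                  int    *discovered)    /* running discovery count     */
{
    int inspected, found;
    double p;

    /* units inspected this step, limited by what remains */
    inspected = random_poisson(units_per_day * dt);
    if (inspected > (int)*units_left)
        inspected = (int)*units_left;
    *units_left -= inspected;

    /* discovery probability regulated by hazard and efficiency */
    p = efficiency * hazard * dt;
    if (p > 1.0)
        p = 1.0;

    /* goal-limited binomial draw: cannot find more than remain */
    found = random_binomial(*defects_left, p);
    *defects_left -= found;
    *discovered   += found;
}
.ft R
.ps 12
.vs 12
.ls 2
.fi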
.B
3.3 Simulator Input and Output
.R
It is beyond the scope of this paper to describe each of the 70 input
\fCmodel\fR parameters and the 90 output \fCfacts\fR parameters.
Interested readers will find these described more fully in [4].
The input file additionally contains
a list of staffing and computer resource packets,
each of which allocates resources to specified activities and time slots.
Slot times may overlap or leave gaps, at the discretion of the user.
Such schedules are the natural outcomes of development process planning
and are of fundamental importance in shaping the reliability process.
At least 14 schedule packets are needed to allocate resources and
time slots to each of the 14 assumed reliability process activities.
More packets may appear when an activity repeats or has a non-constant
resource allocation profile.
Output values consist of all product, work, CPU, resource, fault, failure,
and outage values. These are time-tagged in the form of a \fCfacts\fR data
structure and written to the output file at each \fCdt\fR
time interval for later scrutiny (e.g., plotting, trending, and
\fCmodel\fR readjustments) by other application programs, such as
spreadsheets.
The reliability process embodied in the prototype is meant to be fairly
comprehensive with respect to what really transpires during a software
development. The simulator therefore requires parameters relating to
the ways in which people and processes interact. The large number of
parameters in the simulator might, at first, seem to present an overwhelming,
impractical barrier to modeling, but it must be remembered that the
true reliability process is even more complex. It was felt that the
number of parameters used was the least that would be capable of reproducing
the realism hoped for. Reducing the number of parameters might either
reduce the fidelity of the simulation or the generality of the reliability
process model. This belief may change after sufficient experimentation
has taken place, whereupon selective alterations or combinations
of the parameters may be indicated. In any case, these parameters could be
independently estimated and continuously refined with ease.
If projects do not have sufficient data about past projects to give values
to certain parameters, then sensitivity analyses using \fISoftRel\fR can
indicate which are the most influential and thereby where a metrics effort may
prove most useful in reliability management. Alternatively, users may
simplify the model to focus only on one or two activities at a time by
making some of the parameters inactive. This may be done by assigning
typical or default values (usually 0 or 1) to them, thereby reducing the
number of measured parameters to only those that are deemed
pertinent and realistic within the project context.
.B
.ce
4. APPLICATIONS: A PROJECT CASE STUDY
.R
This Section describes the application of \fISoftRel\fR to a real-world project.
Project data and parameters from a subsystem of the Galileo Flight
Project at the Jet Propulsion Laboratory were collected as a case study for
evaluating the simulation technique. The remainder of this Section describes
the project, applies the simulation technique, and compares the results with
those obtained from several traditional reliability models.
.B
4.1 Project Description
.R
Galileo is an outer-planet spacecraft project that began at the start of
fiscal year 1977 as a mission originally entitled ``Jupiter Orbiter
and Probe,'' or JOP. Unlike previous outer solar system missions, the Galileo
orbiter was intended to remain in Jovian orbit for an extended interval of
time. This would allow observations of variations in planetary and
satellite features over time to augment the information obtained by
single-observation opportunities afforded by previous fly-by missions.
Galileo was launched in October of 1989, and will reach the Jovian
system in 1995.
There are two major on-board flight computers in the Galileo spacecraft:
The Attitude and Articulation Control Subsystem (AACS), and
the Command and Data System (CDS). A significant portion of each of these
systems is embodied in software. This case study focuses on the CDS
software reliability profile.
The CDS flight software is characterized as real-time embedded software,
written in 17,000 lines of assembly language code (16.5K lines new,
500 lines reused), with about 1400 pages of documentation
(1300 pages new, 100 pages reused), produced over a period of approximately
1500 days (300 weeks).
.\"1620 days (interpreted as 240 weeks at
.\"5 days per week before testing, and 60 weeks at 7 days per week during
.\"testing).
This project went through several design reviews and code inspections,
performed structured analysis and design,
and recorded and tracked failures during its software testing phase.
.B
4.2 Parameter Estimations and Simulation Results
.R
This Subsection presents a simulation of an end-to-end development project
based on data from the Galileo CDS project.
Some of the CDS project parameters were taken from project records;
other values were estimated by personnel within the project; the remaining
values, for which no data were immediately available, were chosen by the
authors as probably typical of this project's behavior.
Believed-typical values were adopted, for example, for
parameters related to the injection of faults in the correction and repair
processes. None of the \fCmodel\fR input parameters was set to zero.
Thus, even though few verifiable \fCmodel\fR parameters were available
outside the software testing phase, an entire plausible hypothetical model was
nevertheless formed in order to illustrate simulation of an end-to-end
reliability process. For lack of better development life cycle data,
all CDS activities other than testing (e.g., construction, inspection, and
anomaly removal) were presumed to take place serially, merely to observe
what their simulated behaviors would be. This overall study also presumed
that each activity took place without resource and schedule variations,
in order to view typical Markoff reliability behavior.
.\"The \fCmodel\fR parameters that were used are available
.\"from the authors upon request.
Figures 3 through 6 show the simulated documentation, code, defect, and fault
profiles of the software, sampled every 10 days. Of particular note
are the behaviors of the documentation, code, injected defects, and
injected faults (precisely the quantities for which no project data exist).
Because the numbers of units are comparatively large, the relative
irregularity levels are low, as predicted from (Eq. 3).
Figure 3 shows that the volume of documentation units (\fCDU_t\fR) did reach
its goal, but in this case, only about 63% of the documentation was actually
inspected (\fCDI_t\fR), even though the \fCmodel\fR placed a goal of 95%
on inspection. This is an instance where inadequate resources were allocated
to the inspection process. More resources would have been required to
reach the goal. The effects of correcting defects on page count are not
visible. The second rise in documentation is due to the integration of the
reused 100 pages.
Figure 4 similarly shows that the volume of code units (\fCCU_t\fR)
did reach its goal and that the 90% inspection goal was met as well.
The effects of correcting and repairing faults on code size, however, are
again not visible.
The injection, detection, and removal of defects (\fCE_d, D, d\fR), shown in
Figure 5, are a little noisier than documentation and code production,
but not much. All the detected defects were corrected (\fCD\fR = \fCd\fR),
but a sizable number of defects were inserted during
the correction period (days 520-580). Finally, more than 100 defects were
left in the documents.
The fault activity is shown in Figure 6. It exhibits the noisiest
behavior of all, but is still fairly regular. The initial rise in injected
faults (\fCE_f\fR) is due to construction; the second rise, due to integration,
is not visible; the third, a sharp rise again, is due to the imperfect fault
correction process; and the final gradual rise is due to the imperfect
fault repair process. By the end of the
.\" 1620-day
1500-day project, about 7 faults per kiloline of code (\fCe\fR)
had been found in inspections and corrected (\fCh\fR),
and about 22 faults per kiloline of
code (\fCf\fR) had been uncovered by testing and removed (\fCr\fR);
the fault density at
delivery was about 0.2 faults per kiloline of code.
.ls 1
.F+
figure Figure2.plot height 3.9i
.F-
.ft H
.ce
Figure 3: CDS Simulated Document Construction, Integration, and Inspection
.ft R
.F+
figure Figure3.plot height 3.9i
.F-
.ft H
.ce
Figure 4: CDS Simulated Code Construction, Integration, and Inspection
.ft R
.bp
\&
.F+
figure Figure4.plot height 3.9i
.F-
.ft H
.ce
Figure 5: CDS Simulated Defect Discovery and Correction
.ft R
.F+
figure Figure5.plot height 3.9i
.F-
.ft H
.ce
Figure 6: CDS Simulated Fault Injection, Discovery, Correction, and Repair
.ft R
.ls 2
.ps 12
.vs 12
Although the final fault discovery count is considered to be accurate,
the time profile of the simulation results does not appear to be as irregular
as the actual project data. It seems likely, then, that the actual fault
discovery process is not homogeneous, either.
On the basis of this case study, it appears that the
simulation of all reliability subprocesses will require the use of
non-homogeneous event-rate models that reflect irregular
workloads and schedules of life cycle activities.
An example of this is given in the next Subsection.
.B
4.3 Comparisons with Other Software Reliability Models
.R
To simulate the details of Galileo CDS testing activity,
its testing phase
was separated into five subactivities with constant staffing, but having
irregular CPU and schedule allocations, as shown in Table 1.
These schedule parameters were obtained as those necessary to fit the
simulator output to project data. The fit appears adequate to
describe the underlying nature of the random failure process.
.sp .5
.ps 11
.vs 13
.ls 1
.TS
center, box, tab(#);
c c l l l c
l n n n n n.
activity#accumulated failures#begin week#end week#staffing#CPU
_
1 (functional test) #90 #0 #5 #2.0 #0.4
2 (feature test) #150 #5 #13 #2.0 #0.4
3 (operation test1) #300 #13 #23 #2.0 #1.2
4 (operation test2) #325 #23 #33 #2.0 #1.0
5 (operation test3) #341 #33 #40 #2.0 #2.0
.TE
.ps 12
.vs 12
.ls 2
.fi
.ft H
.ce
Table 1: Schedule of the CDS Testing
.ft R
Figure 7 shows the field data and the results obtained from
the piecewise-homogeneous simulation process, as well as those from
three other models: the Jelinski-Moranda (JM), Musa-Okumoto (MO), and
Littlewood-Verrall (LV) models. For better visibility of process granularity,
data is shown in the form of failures per week, rather than cumulatively.
The JM, MO, and LV statistics were calculated to be "one-week-ahead"
predictions, in which all the failure data up to a given week were
used to predict the number of failures for the next week.
.bp
\&
.sp
.ls 1
.F+
figure all.plot height 3.9i
.F-
.ls 2
.ce
.ft H
Figure 7: Various Predictions for CDS Failures per Week Data
.ft R
As shown in the figure, the simulation technique produced a very
good forecast that could have been used for tracking the reliability
status during the entire testing phase.
The \f2Sum of Square Errors\f1 (SSE) for the simulation results in
Figure 7 is 24.5. As a comparison with the analytical software
reliability model results, the SSEs for JM, MO, and LV are
38.8, 43.3, and 99.7, respectively.
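For reference, the SSE values quoted above are simply the sums of squared
differences between the observed and one-week-ahead predicted
failures-per-week series; a minimal sketch, with hypothetical array names,
is:
.nf
.ps 11
.vs 13
.ls 1
.ft C
/* Minimal sketch of the Sum of Square Errors (SSE) comparison quoted
   above: observed[] and predicted[] hold failures-per-week series of
   equal length.  Array and function names are hypothetical. */
double sum_square_errors(const double *observed,
                         const double *predicted,
                         int weeks)
{
    double sse = 0.0;
    int w;

    for (w = 0; w < weeks; w++) {
        double e = observed[w] - predicted[w];
        sse += e * e;
    }
    return sse;
}
.ft R
.ps 12
.vs 12
.ls 2
.fi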
We conjecture that the reliability
forecast could have been accurately simulated prior to the start of testing,
had actual schedule and resource plans been used \f2a priori\f1.
The other models above were inadequate to predict even one week ahead,
particularly the LV model.
.ce
.B
5. CONCLUSION
.R
Reliability modelers seldom have the luxury of several realizations
of the same failure process to test their hypotheses concerning the nature
of a system's reliability. Nor are they ever provided with data that
faithfully match the assumed natures of their models. Nor are they able to
probe into the underlying error and failure mechanisms in a controlled way.
Rather, they are faced not only with the problem of guessing the forms and
particulars of the underlying error and failure random processes from the
scant, uncertain data they possess, but also with the problem of best
forecasting future failures from this single data set.
The assumptions of the simulation approach are certainly less restrictive
than those underlying analytic models. The simulation approach
solves software reliability prediction problems by producing data conforming
precisely to the software failure assumptions. Simulation enables
investigation of questions such as,
"How does a project's observed data compare with that emanating from a
NHPP having the following characteristics? ..." and "Which analytic prediction
model is the best under the following assumptions? ..." We believe
that the \f2SoftRel\f1
tool and its offspring will offer significant potential to researchers and
practitioners in answering such questions, in evaluating
sensitivities of predictions to various error and failure modeling
assumptions, and in forecasting software project status profiles,
such as time-lines of work products and the progress of testing, fault
isolation, repair, validation, and retest efforts.
Simulation of a real-world project has reinforced our confidence in
the validity of the approach.
We believe that homogeneous Markoff event-count
models that uniformly consume resources do not
adequately model the statistical failure profile of an
actual project. The non-homogeneous variable-resource-schedule
event-rate simulation
model was capable of producing very good early forecasts of reliability growth
that could prove very useful for process status assessment.
.\"
.\"The authors expect further collaborations between government agencies
.\"and industry will continue to refine the reliability simulation technique
.\"and lead to a better understanding of the reliability process and to
.\"improvements in the \f2SoftRel\f1 genre of tools.
.B
Acknowledgements
.R
We wish to thank Prof. Yi-Bing Lin of National Chiao Tung University
and Dr. Yu-Yun Ho of Bellcore for their valuable inputs.
Portions of the research described in this paper were carried out by
the Jet Propulsion Laboratory, California Institute of Technology, under a
contract with the National Aeronautics and Space Administration.
.\".nr PS 11
.\".nr VS 11
.ta 1i
.nr TA 1i
.sp
.ce 3
.ls 1
.ps 12
.vs 14
******************
Theory Sidebar
******************
\(bu Framework of the Discrete Event Simulation
The fundamental assumption of reliability process simulation is that every
stochastic event is the result of an underlying instantaneous
conditional event-rate random process [14].
A conditional event-rate process is one for which the probability that
an event occurs in the interval $(t, ~t ~+ ~dt)$, given that it had
not occurred prior to time $t$, is equal to $beta ( t ) ~ dt$ for some
function $beta ( t )$.
The statistical behavior of this process is well-known [15]:
The probability
that an event $epsilon$ will have occurred prior to a given time $t$ is
given by the expression
.EQ (Eq.1)
Prob "{ " epsilon ~ occurs~in ~ (0,~ t) " }" ~=~ P(t)
~=~ 1 ~-~ exp left ( ~-~int sub " 0" sup t ~beta ( tau ) ~d tau right )
~=~ 1 ~-~ e sup {- lambda (0, ~t)}
.EN
When the events of interest are failures, $beta (t)$ is often referred to as
the process \f2hazard rate\f1 and $lambda (0, ~t)$ is the \f2total hazard\f1.
If $lambda (0, ~t)$ is known in closed form, the event probability can be
analyzed as a function of time. But if many related events are intricately
combined in $beta (t)$, the likelihood of a closed-form solution for event
statistics dims considerably. The expressions to be solved can easily become
so convoluted that calculation of results requires a computer programmed
with comparatively complex algorithms.
Of special interest here are discrete event-count processes that merely
record the occurrences of rate-controlled events over time. The function
$beta sub n (t)$ denotes the conditional occurrence rate, given that the
$n$th event has already occurred by the time $t$. The integral of
$beta sub n (t)$ is $lambda sub n (0,~t)$. These processes are
termed non-homogeneous when $beta sub n (t)$ depends explicitly on $t$.
The probability $P sub n (t)$ that $n$ events occur in $(0,t)$ is much
more difficult to express than (Eq.1), and does not concern us here.
One important event-rate process is the discrete Markoff process.
A Markoff process is said to be homogeneous when its rate function is
sensitive only to time differences, rather than to absolute time values.
The notation $beta sub n (t)$, in these cases, signifies that $t$ is
measured from the occurrence time $t sub n$ of the $n$th event.
When the hazard rate $beta sub n (t)$ of a Markoff event-count process is
independent of $n$, then one may readily verify that the general
event count behavior is a non-homogeneous Poisson process (NHPP) whose mean
and variance are given by
.EQ (Eq.2)
n bar mark ~~=~~ lambda (0, ~t) ~~;~~~~~
sigma sup 2 ~~=~~ lambda (0, ~t)
.EN
.EQ (Eq.3)
{sigma} over {n bar} lineup ~~=~~ 1 ~/~ {sqrt{lambda (0, ~t)}}
~=~ 1 ~/~ {sqrt{n bar}}
.EN
The homogeneous (constant event rate) Poisson process is described by
$lambda ~=~ beta t$. Homogeneous Poisson process statistics thus only apply
to the homogeneous Markoff event-count process when the Markoff
$beta sub n (t) ~=~ beta $ is constant.
One may note from (Eq. 3) that as $n bar$ increases, the percentage deviation
of the process decreases. In fact, any event process with independence among
events in non-overlapping time intervals will exhibit relative fluctuations
that behave as $O(1 / sqrt{n bar})$, a quantity that gets increasingly
smaller for larger $n bar$.
This trend signifies that Poisson and Markoff processes involving
large numbers of event occurrences will tend to become percentage-wise
relatively regular. If physical processes appear to be very irregular,
then it will not be possible to simulate them using independent-increment
assumptions with regular rate functions.
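For illustration, (Eq. 3) gives a relative deviation of $1 / sqrt {10000}$,
or about 1 percent, for a process expected to produce 10,000 events, but
$1 / sqrt {25}$, or 20 percent, for one expected to produce only about 25
events; this is why the 25-fault simulations of Figure 1 appear
comparatively noisy.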
There is a sense in which the NHPP form is inappropriate for describing the
overall software reliability profile. Reliability of software grows only
as faults are discovered and repaired, and these events occur only at
a finite number of times during the life cycle. The true hazard rate
presumably changes discontinuously at these times, whereas the NHPP rate
changes continuously.
In any case, the event-count Markoff model of software reliability is
more general than
the NHPP form, in that there is no assumption that its cumulative rate
$lambda$ is independent of $n$ or $t sub n$.
\(bu Multiple Event Processes
Conditional event-rate processes are also characterized by the property that
the occurrences of several independent classes of
events, $epsilon sub 1 ,..., epsilon sub f$, with rate functions
$beta sub n sup {[1]} (t) ,..., beta sub n sup {[f]} (t)$,
respectively, together behave as if $f$ algorithms of the single-event
variety were running simultaneously, each with its own separate
rate function, \fCbeta[i](n, t)\fR, controlling the $n$th occurrence of
event $epsilon sub i$ at time $t$. That is, the event occurrence
process is equivalent to a single event-rate process governed by its composite
rate function,
.EQ (Eq.4)
beta sub n (0,~ t) ~=~ sum from {i=1} to f {beta sub n sup {[i]} (0, ~t)}.
.EN
When event occurrences in non-overlapping intervals are independent,
each $( t sub a, ~t sub b )$ interval is governed by a non-homogeneous
Markoff process with rate $beta sub n (t , ~t sub n )$.
.EQ (Eq.5)
beta sub n (t , ~t sub n ) ~=~ sum from {i=1} to f beta
sub {n sub i} sup {[i]} (t , ~t sub n sub i )
.EN
When a new event $epsilon sub i$ is added (or deleted) to (or from) the
distinguished class of events, $beta sub n (t , ~t sub n )$ readjusts
to include (or exclude) the corresponding
$beta sub {n sub i} sup {[i]} (t , ~t sub n sub i )$
function and the simulation proceeds. This characteristic provides a simple
and straightforward method to simulate the effects of fault and defect
injections and removals.
References:
[1] J.D. Musa, A. Iannino, and K. Okumoto,
Software Reliability \- Measurement, Prediction, Application,
McGraw-Hill Book Company, New York, New York, 1987.
[2] M.R. Lyu and A. Nikora,
"Using Software Reliability Models More Effectively,"
IEEE Software, July 1992, pp. 43-52.
[3] A. von Mayrhauser, Y.K. Malaiya, J. Keables, and P.K. Srimani,
"On the Need for Simulation for Better Characterization of Software
Reliability,"
Proceedings of the 1993 International Symposium on Software Reliability
Engineering, Denver, Colorado, November 1993.
[4] R. Tausworthe,
"A General Software Reliability Process Simulation Technique,"
Jet Propulsion Laboratory, Technical Report 91-7, Pasadena, California,
March 1991.
[5] M.R. Lyu (ed.),
Handbook of Software Reliability Engineering,
McGraw-Hill and IEEE Computer Society Press, New York, 1996.
[6] W. Kreutzer,
System Simulation: Programming Styles and Languages,
International Computer Science Series, Addison-Wesley, Menlo Park,
California, 1986.
[7] Z. Jelinski and P.B. Moranda,
"Software Reliability Research,"
Statistical Computer Performance Evaluation,
W. Freiberger (ed.), Academic Press, New York, 1972, pp. 465-484.
[8] A.L. Goel and K. Okumoto,
"Time-Dependent Error-Detection Rate Model for Software Reliability
and Other Performance Measures,"
IEEE Transactions on Reliability, vol. R-28, 1979, pp. 206-211.
[9] J.D. Musa and K. Okumoto,
"A Logarithmic Poisson Execution Time Model for Software Reliability
Measurement,"
Proceedings of Seventh International Conference on Software Engineering,
Orlando, Florida, 1984, pp. 230-238.
[10] J.T. Duane,
"Learning Curve Approach to Reliability Monitoring,"
IEEE Transactions on Aerospace, vol. AS-2, 1964, pp. 563-566.
[11] B. Littlewood and J.L. Verrall,
"A Bayesian Reliability Growth Model for Computer Software,"
Journal of the Royal Statistical Society, Series C (Applied Statistics), vol. 22, 1973, pp. 332-346.
[12] S. Yamada, M. Ohba, and S. Osaki,
"S-Shaped Reliability Growth Modeling for Software Error Detection,"
IEEE Transactions on Reliability, vol. R-32, December 1983, pp. 475-478.
[13] D.E. Knuth,
The Art of Computer Programming, Vol. 2: Seminumerical Algorithms,
Addison-Wesley, Reading, Massachusetts, 1970.
[14] N. Roberts, et al.,
Introduction to Computer Simulation,
Addison-Wesley, Reading, Massachusetts, 1983.
[15] A. Papoulis,
Probability, Random Variables, and Stochastic Processes,
McGraw-Hill, New York, 1965.