Introduction
Safety, reliability and life-cycle costs of technical systems are directly
determined by efficiency in fault diagnosis throughout the life cycle.
Within the BRIDGE project the aim is to significantly improve diagnosis
efficiency in operations and reduce the efforts required in supporting
actions. Improved efficiency in operational diagnosis comprises a drastic
increase of automation for diagnosis of occurring failures and problems,
significant increase in coverage of failures and problems by tuning of
symptoms and additional tests, and ensuring guaranteed response times.
This report gives a brief synthesis of the background to the project, which
comes from three independent developments: case-based reasoning, and in
particular the use of fault networks for fault diagnosis; state-based
real-time expert systems; and the Q-methodology, and in particular the
LIMITS-CASE tool. The report
has been kept short with many references to the articles and theses produced.
All articles and theses are available for the EU and the Reviewers upon
request.
Problem Description
The safe and efficient operation and maintenance of modern technical systems
involve several on-line and off-line diagnostic tasks. On-line tasks have
to respect strict real-time requirements. Each diagnostic task has specific
goals, decision strategy, input and output information, but all concern
the same technical (sub) system and should therefore comprise similar domain
knowledge. Ideally, the knowledge in each task is consistent with every
other task, but as systems are usually developed separately, tremendous
additional efforts are required to obtain and maintain some degree of consistency.
The lack of consistency and integration between operational and maintenance
diagnostic systems is a major reason for inefficiency in exploitation.
Large technical applications
The applications considered in the BRIDGE project are large technical systems
for which the total number of symptoms, tests, actions and faults is in
the order of tens of thousands (see e.g. Raaphorst et al., 1995). As the
relations are too numerous and complex and the acquisition of a correct
model would be too costly, in practice it becomes impossible to develop
an accurate causal model (see Netten and Vingerhoeds, 1994). In reality,
therefore, such models are developed for critical components and systems
only. Modelling the interactions with other systems and the environment,
however, remains a major problem. Maintenance of these explicit models,
e.g. for specific customer application updates and modifications, is very
costly as well.
Explicit formulation of diagnosis systems
For modern fault-diagnosis systems, faults have to be specified explicitly
in terms of their resulting symptoms and test results (facts) as well as
the relations between those facts. These facts can be explicitly arranged
according to the specified relations in the form of a fault-tree, frequently
constructed from individual sub-trees for symptoms and sub-systems. Due
to the complexity of modern technical systems, together with the number
of on-board systems they include, the typical size of their diagnostic
problems has increased to such an extent that for fault-tree-based diagnostic
systems the following problems can be identified:
-
It is almost impossible for human experts to explicitly define all facts
and their relations consistently.
-
It is also almost impossible for human experts to generate, verify and
maintain a complete and consistent fault-tree of this size. Due to the
inherent structure of a fault-tree, duplications of parts of branches are
very difficult to detect, while any modification has to be specified explicitly
in every related sub-tree.
-
The objectives, the series of tests that can be performed, and the actions
that can be taken may be different for every diagnostic task. In practice,
for each diagnostic task a separate system has to be developed.
-
Different diagnosis strategies ask for a different ordering of the additional
tests. The more decisive tests should be placed higher up in the tree.
For each diagnostic task, only tests that can be performed in the given
situation should be invoked.
-
During diagnosis, not all known symptoms and test results are directly
used in a classic fault-tree; they remain unused until they appear in a
fault-tree node. Much valuable time is lost because not all the available
information is directly used to evaluate parts of sub-trees.
-
Searching for not yet known information requires advanced tree-search techniques
and search time.
-
The behaviour and reliability of system components cannot always be precisely
specified for real-world applications. The occurrence of symptoms and the
outcome of tests cannot be assigned with 100 % probability for each fault.
In practice, a human expert should estimate these.
Case-Based Reasoning
The use of expert system technology offers several advantages for use in
on-line fault diagnosis systems, thereby incorporating the knowledge and
experience of manufacturers and users. The size of the diagnosis problem
is such that explicit formulation of the fault-tree is very complicated.
The knowledge, however, is implicitly available in the description of faults
in terms of corresponding symptom-codes, results of performed tests and
repair actions. Case-based reasoning (CBR, see Aamodt and Plaza, 1994)
can be used to facilitate the automatic generation, consistency checking
and maintenance of the fault-tree (see Netten and Vingerhoeds, 1994). CBR
allows reducing the problem formulation to a definition of fault-cases
only. The syntax of a fault-case is simple and straightforward. For each
individual fault, a case is defined, consisting of failures, results of
additional tests and repair actions. CBR reasons over implicit relations
in cases and provides the required knowledge for development of a fault
diagnosis system.
A new problem is diagnosed by remembering similar fault
cases. Additional tests should be performed to refine the diagnosis. Relevant
tests are identified from the most similar cases, and a preference order
can be determined from the information gain each test is expected to
provide. The cases are reused by suggesting
their actions in the priority order of their similarities. Advantages for
applying case-based diagnosis include concentration of fault and failure
knowledge in one case-base, reduction of development and maintenance costs
of specialised diagnosis systems, and effort reduction of acquisition and
maintenance of the knowledge base. A major problem is encountered with
respect to the processing speed of CBR systems. Off-the-shelf CBR tools
do not meet the real-time requirements. Some of the existing tools structure
the case-base in tree structures, which for fault diagnosis have some serious
disadvantages with respect to the search process and time-to-diagnosis.
The approach presented here for fault-tree generation and on-line fault diagnosis
makes use of case-based reasoning techniques, tailored in such a way that
the response time for each diagnostic system is minimised. Matching and
retrieving the cases from a database would be computationally too time-consuming
for on-line diagnosis. To avoid extensive database queries, a fault-network
is first built off-line from the cases, and only this network is used during
the on-line diagnosis process itself. A textual case description of the
symptoms, tests and results is only necessary to inform the user during
development or diagnosis. This information should therefore be contained
in databases for the man-machine interfaces. Within the actual on-line
diagnostic systems, the cases, symptoms, test results and actions are only
referenced by indices. The fault network developed resembles the Rete network
(see Forgy, 1982). Some additional node-types are added to this structure to account
for uncertain information of fault descriptions. The top layer of the network
consists of all input nodes for the symptoms and tests. In the next layers,
nodes are built for each combination of symptoms and results, in such a
way that each combination is built only once in the network. The bottom
layer consists of nodes with reference to the individual faults. Building
such a fault network has several major advantages over other diagnostic
systems:
-
The structure of the network is smaller. Each test result and each
symptom appear only once in the network as input nodes.
-
Efficiency of the data-driven network. A high efficiency is realised
by making optimal use of the fact that the results of a new test only lead
to limited modifications in the network. In every cycle, the current status
of all nodes is maintained, and only modifications of the inputs are propagated
through the network.
-
Simultaneous diagnosis of multiple symptoms. The relations of sub-trees
for specific symptoms in the former fault-trees are now merged into one
network, allowing for the simultaneous treatment of multiple symptoms.
-
Diagnostic strategy embedded in the network. From the objectives
for a particular diagnosis problem, general rules for the diagnostic strategy
can be derived. The strategy should be used to determine the inference
process. In an off-line procedure, steps can be assigned for each diagnostic
task separately, according to the rules of the specific strategy. For on-line
diagnosis, the appropriate strategy is maintained by simply following the
threads through the activated nodes.
-
Non-procedural order. The nodes on the threads are not addressed
in a procedural order, as any input is immediately propagated to any node
in the network.
-
Computer memory can be reduced. For on-line diagnosis, only the
network structure is loaded into memory as a set of pointers between the
nodes.
-
Reduced search time. The response time for diagnosis is reduced
in several ways with respect to fault-trees; the structure to be searched
is smaller and all cases satisfying a relation are treated in one operation
within one node. The diagnostic process itself is not hindered by time-consuming
string operations.
-
Consistency checking. Faults with identical descriptions are found
in the same end-nodes of the network and can be easily detected. Input
nodes that have no further links downwards either represent default answers
to questions, or indicate that faults are not completely specified in the
case-base.
-
Incorporation of new experience. New experiences of maintenance
crews and operators can be incorporated as new faults in the case-base
as they become available.
The initial developments, reported by Netten and Vingerhoeds (1994),
used the network on-line as the search medium. Initial measurements were
reported to show satisfactory behaviour.
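The network idea described above can be illustrated with a minimal Python sketch. It is not the BRIDGE implementation: the class name FaultNetwork, the fact identifiers and the prefix-sharing scheme are assumptions made for illustration, and uncertainty handling is left out. Each combination of facts is built once, node states are kept between updates, and a new fact is propagated only along the links it affects.

```python
from collections import defaultdict


class FaultNetwork:
    """Simplified fault network: shared combination nodes, incremental updates."""

    def __init__(self, cases):
        # cases: mapping fault id -> iterable of fact ids (symptoms, test results)
        self.nodes = {()}                  # each node is a sorted tuple of facts
        self.children = defaultdict(list)  # node -> nodes extending it by one fact
        self.by_fact = defaultdict(list)   # fact -> nodes whose last fact it is
        self.faults = defaultdict(list)    # end node -> faults described by it
        for fault, facts in cases.items():
            node = ()
            for fact in sorted(set(facts)):
                child = node + (fact,)
                if child not in self.nodes:  # build each combination only once
                    self.nodes.add(child)
                    self.children[node].append(child)
                    self.by_fact[fact].append(child)
                node = child
            self.faults[node].append(fault)
        self.known = set()    # facts asserted so far
        self.active = {()}    # nodes whose facts are all known

    def assert_fact(self, fact):
        """Propagate one new fact; return the faults that become fully matched."""
        if fact in self.known:
            return []
        self.known.add(fact)
        matched = []
        # Only nodes ending in this fact, with an already active parent, can change.
        stack = [n for n in self.by_fact.get(fact, []) if n[:-1] in self.active]
        while stack:
            node = stack.pop()
            self.active.add(node)
            matched.extend(self.faults.get(node, []))
            for child in self.children.get(node, []):
                if child[-1] in self.known:
                    stack.append(child)
        return matched

    def duplicates(self):
        """Faults sharing an end node have identical descriptions."""
        return [sorted(f) for f in self.faults.values() if len(f) > 1]


# Hypothetical example data: F1 and F3 have identical descriptions.
EXAMPLE_NETWORK_CASES = {
    "F1": ["s1", "t1=ok"],
    "F2": ["s1", "t2=fail"],
    "F3": ["t1=ok", "s1"],
}
```

In this sketch, asserting "t1=ok" before "s1" matches nothing yet; once "s1" arrives, both F1 and F3 are reported together, and duplicates() flags them as candidates for a consistency check.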
Real-Time Expert Systems
Recently, there has been considerable interest in the use of expert systems
in real-time applications (see Jones and Rodd, 1993). Resulting from this
interest, a generation of expert system tools, or shells, designed specifically
for real-time applications, has been developed. However, these tools suffer
from a number of defects. Most notably, the current generation of tools
is not suitable for hard real-time applications, where the response-time
must be guaranteed. To overcome this problem, an alternative approach to
the development of expert systems for hard real-time applications has been
developed (see Jones, 1995). A state-based architecture is adopted, as
opposed to the event-based approach used by existing real-time expert system
shells. In a state-based system only one process is active at any time.
Hence, every process has exclusive access to any global data and no process
will be pre-empted. This eliminates the problems associated with controlling
access to global data, such as the blackboard of an expert system. The
use of a state-based architecture overcomes some of the problems of predicting
the execution times of expert systems. However, it does not solve the problem
of predicting the execution times of the inference process. To do this, a
rule-set compiler has been developed, which converts an expert system rule-set
into simple non-recursive procedural functions with an execution time that
is largely unaffected by the input data. This allows guaranteed worst-case
execution times for the expert system to be determined. In this way, searching
within an expert system is largely performed off-line. On-line,
a relatively simple search process remains. The rule-set compiler generates,
from a given knowledge base, C or Fortran language functions with a known
worst-case execution time, that are logically equivalent to the knowledge
base.
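The principle of compiling a rule-set into non-recursive, straight-line code can be sketched as follows. This is an illustration only: the compiler described above emits C or Fortran, whereas this sketch emits Python source, and Python itself gives no hard timing guarantees. The function name compile_rules and the rule format (each conclusion maps to a list of alternative conjunctions of facts) are assumptions. The non-short-circuiting operators `&` and `|` are used so that every rule is evaluated on every call, whatever the input.

```python
from graphlib import TopologicalSorter


def compile_rules(rules):
    """Turn an acyclic rule-set into a straight-line function.

    rules: mapping conclusion -> list of non-empty conjunctions (lists of facts).
    Returns (function, generated source, number of rule statements).
    """
    # Each conclusion depends on every fact mentioned in its conjunctions.
    deps = {c: {a for conj in alts for a in conj} for c, alts in rules.items()}
    # Order rules so that every antecedent is computed before it is used;
    # TopologicalSorter raises CycleError for a recursive rule-set.
    order = [n for n in TopologicalSorter(deps).static_order() if n in rules]
    lines = ["def infer(facts):", "    v = dict(facts)"]
    for c in order:
        expr = " | ".join(
            "(" + " & ".join(f"v[{a!r}]" for a in conj) + ")" for conj in rules[c]
        )
        lines.append(f"    v[{c!r}] = {expr}")
    lines.append("    return v")
    source = "\n".join(lines)
    namespace = {}
    exec(source, namespace)  # compile the generated straight-line function
    return namespace["infer"], source, len(order)


# Hypothetical rule-set; all input facts must be supplied as booleans.
EXAMPLE_RULES = {
    "pump_fault": [["low_pressure", "motor_on"]],
    "alarm": [["pump_fault"], ["overheat"]],
}
```

The generated function contains exactly one assignment per rule and no loops or recursion, so the number of operations per call is fixed at compile time. That fixed count is the property a worst-case execution time analysis relies on.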
The Q-Methodology
Good software engineering tools should allow for verification of the required
real-time behaviour before implementation. Unfortunately, few of the
available tools guarantee that the design will conform to the pre-set
real-time requirements. Modelling techniques exist which try
to model a real-time application, but they unfortunately cannot completely
model hard real-time problems. Most of these modelling techniques lack
the capability to handle truly asynchronous message transfer. Some other
techniques exist, but are so complicated to use that they are abandoned.
One real-time methodology that can cope with the requirements of designing
hard real-time systems was developed by Motus and Rodd (1994), building on
an original approach by Quirk. Now known as the Q-methodology, it allows the
specification and verification of all required timing characteristics during
the design of real-time systems. In the Q-methodology, the temporal information
is modelled via so-called processes and channels. By connecting the processes
with the channels, data and control may be transferred between different
processes and an order of execution may be provided. Each process has its
own time-set, which may be shared with other processes if these processes
are required to run synchronously. A synchronous channel between two
processes starts both processes synchronously. Two other channel
types in the Q-methodology are the semi-synchronous channel and the asynchronous
channel. The semi-synchronous channel, comparable to the only channel type
a Petri net has, starts the following process when the previous process
has produced its data and finished its task. The asynchronous channel
may be compared to a data channel where information is available for the
consumer processes, without triggering the consumer process to start. The
time-sets of these two processes are not related to each other.
The Q-model makes it possible to incorporate and verify timing constraints
over the whole life cycle, from specification to maintenance. It
combines analytical (formal) and simulational (informal) approaches for
verifying time correctness. Formal analysis of a system described in the
Q-model is performed in three stages:
-
Analysis of separate elements (parameters of a process and of a channel)
-
Analysis of pairs of interacting processes (separately for each channel
type)
-
Behavioural analysis of a group of interacting processes (deadlock, performance,
synchronisation precision)
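As an informal illustration of the three channel types described earlier in this section (not of the formal analysis itself), the following Python sketch encodes only when each channel type triggers the consumer process. The names Channel and consumer_start, and the use of plain numbers for time, are assumptions made for this example and are not part of the Q-model notation.

```python
from enum import Enum


class Channel(Enum):
    SYNCHRONOUS = "synchronous"
    SEMI_SYNCHRONOUS = "semi-synchronous"
    ASYNCHRONOUS = "asynchronous"


def consumer_start(channel, producer_start, producer_duration):
    """Return the time at which the channel starts the consumer process,
    or None when the channel does not trigger the consumer at all."""
    if channel is Channel.SYNCHRONOUS:
        # Both processes are started together.
        return producer_start
    if channel is Channel.SEMI_SYNCHRONOUS:
        # The consumer starts once the producer has produced its data and finished.
        return producer_start + producer_duration
    # Asynchronous: data is merely made available; the consumer runs on its
    # own, unrelated time-set.
    return None
```

For a producer starting at time 10 and running for 3 time units, the consumer would start at 10 over a synchronous channel and at 13 over a semi-synchronous one, while an asynchronous channel would not trigger it.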
The Q-methodology makes it possible to follow the satisfaction of the primary time
constraints through the life-cycle stages (requirement specification, algorithm
specification, logical design, physical design, and implementation). It
also supports the introduction and proof of secondary time constraints,
implied by the primary constraints. The Q-model, like many other computational
models, is based on heuristics, which have been filtered pragmatically.
It has been demonstrated that the Q-model is a superset of ordinary Petri
nets, and that the Q-model can be mapped into a weak second-order predicate
calculus.
Modelling a real-time application with the Q-methodology
concerns the dynamic and functional behaviour of the whole application.
For a complete model, it is necessary to model the static information as
well. An object-oriented methodology is a good complement to the Q-methodology,
providing the static base of the complete model. In particular, the Object Modelling
Technique (OMT) of Rumbaugh et al. (1991) offers good possibilities.
To target the real-time software engineering market, a combination of OMT
and the Q-methodology was developed within the framework of a research
and development project supported by the European Union. The LIMITS-CASE
tool allows the specification and design of hard real-time applications
and the analysis of the real-time characteristics. First, static information
is specified in the form of object models, where the characteristics of
each class are provided in the form of attributes and operations belonging
to that specific class. For an intuitive approach, it is advisable to continue
with the creation of state-diagrams of the dynamic classes (see e.g. Zijderveld
and Vingerhoeds, 1996b). After a satisfactory design has been made, it
is possible to continue with the creation of the Q-diagrams. These diagrams
specify all information concerning the timing performance of the
processes and the interactions between them. When the complete model,
or an integral part, of the real-time application is finished, it is possible
to animate the behaviour of the application to provide a first simulation
of the whole process. With this animation it is also possible to run
scenarios in which a fault is introduced on purpose. The behaviour of
the system in reaction to this fault can then be animated, which may be of
interest to the designer.
The BRIDGE Approach
Within the BRIDGE project, the approach described above is extended into
a two-phase approach to address the problems identified earlier. In an off-line
process, the case-base is built, analysed and modified for a specific diagnosis
task, and specific real-time diagnosis systems are generated from the case-base
for each operational mode. On-line diagnosis is performed by this real-time
system, while the original case-base is not accessed. The Case-Based Reasoning
cycle presented by Aamodt and Plaza (1994) is now executed in two loops:
in the off-line process, the cases are retrieved, analysed, revised and
retained, while the actual reuse of fault cases and actions occurs in the
on-line diagnosis loop.
Off-line development of the fault-base
The application is represented by a case-base of (analytical) fault cases
and more general domain knowledge about the features. Every diagnosis task
is defined separately by its strategy and operational conditions, including
the technical (sub) systems, alarm level(s), operational phase(s), operator
level(s), maximum available efforts and time for executing tests and actions,
and performance requirements. Real-world problems are recorded during on-line
diagnosis in a history case-base.
The strategy and operational conditions
of a diagnosis task determine which relevant cases and features are retrieved
and organised in a network. This network is used for the analysis and revision
of the fault-base. The network structure enables the identification of
the most critical situations that can be foreseen during on-line diagnosis.
Unresolved problems and occurred critical situations from the history case-base
are analysed. Each critical situation is treated separately. Similar cases
are retrieved and analysed for correctness and for their sensitivity to the
critical situation. Cases that do not satisfy these conditions will
require a revision of the network structure, which could also involve a
modification of cases or some features in the case-base. The coverage and
real-time requirements (see further) are only analysed for a network satisfying
the correctness and sensitivity requirements, but could also lead to revisions
and case-base modifications to improve the coverage or timing behaviour.
It should be noted that cases are revised and retained only
after validation and authorisation by a human expert; the analysis itself only
provides suggestions for modifications. When the network is validated for
all requirements, a real-time diagnosis system can be generated from the
network structure.
On-line fault diagnosis
A new problem is matched to the faults and actions present in the diagnosis
system. If the matching is sufficient, the appropriate actions are suggested
to the user. Otherwise additional tests are suggested for which the results
provide relevant new information. The on-line process does not have any
direct access to the original case-base.
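The on-line loop described above (match, then either suggest actions or suggest a further test) can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the BRIDGE kernel: the similarity measure, the threshold, the entropy-based test ranking and the names similarity, next_test and diagnose are all hypothetical, every unobserved feature is treated as a test that could be performed, and certainty factors are ignored.

```python
import math
from collections import Counter


def similarity(case_facts, observed):
    """Fraction of a case's facts confirmed so far; 0.0 if any fact is contradicted."""
    if any(f in observed and observed[f] != v for f, v in case_facts.items()):
        return 0.0
    matched = sum(1 for f in case_facts if f in observed)
    return matched / len(case_facts) if case_facts else 0.0


def next_test(cases, candidates, observed):
    """Pick the unobserved feature whose expected results split the candidates best."""
    best, best_entropy = None, 0.0
    tests = sorted({f for c in candidates for f in cases[c]["facts"] if f not in observed})
    for test in tests:
        counts = Counter(cases[c]["facts"].get(test) for c in candidates)
        total = sum(counts.values())
        entropy = -sum(k / total * math.log2(k / total) for k in counts.values())
        if entropy > best_entropy:
            best, best_entropy = test, entropy
    return best


def diagnose(cases, observed, threshold=1.0):
    """Return ("actions", [...]) when the best match is sufficient,
    otherwise ("test", name) with the most informative remaining test."""
    scores = {c: similarity(d["facts"], observed) for c, d in cases.items()}
    ranked = sorted((c for c in scores if scores[c] > 0), key=lambda c: -scores[c])
    if ranked and scores[ranked[0]] >= threshold:
        return ("actions", cases[ranked[0]]["actions"])
    return ("test", next_test(cases, ranked, observed))


# Hypothetical fault cases for illustration.
EXAMPLE_CASES = {
    "F1": {"facts": {"s1": True, "t1": "ok"}, "actions": ["replace valve"]},
    "F2": {"facts": {"s1": True, "t1": "fail"}, "actions": ["reset controller"]},
    "F3": {"facts": {"s2": True, "t1": "ok"}, "actions": ["check wiring"]},
}
```

With only symptom s1 observed, F1 and F2 remain equally similar, so the sketch suggests test t1, the one feature that separates them. Once t1 is reported as "fail", F2 matches completely and its action is suggested.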
Guaranteeing Real-Time Behaviour
Once a suitable subset for the diagnosis system has been selected, and
the analyses of correctness, sensitivity and coverage have been performed satisfactorily,
the real-time characteristics of the envisaged system have to be assessed.
This means that the response time of the on-line diagnosis system must
match the requirements of the process at hand. There is a need for formal
analysis of the resulting on-line system that will be derived from the
network structure. It is therefore necessary to assess the overall real-time
behaviour. The Q-methodology described above was further developed and combined
with the Object Modelling Technique (OMT) within the framework of the LIMITS
project, supported by the European Union. The LIMITS-CASE tool allows the
specification and design of hard real-time applications and the analysis
of the real-time characteristics. Parts of the LIMITS tool, in particular
related to the verification of the real-time behaviour, can now be used
for the assessment of the real-time behaviour. When the real-time behaviour
matches the requirements, the case-compiler can effectively be applied
to generate the C-functions (see e.g. Vingerhoeds et al., 1995b).
BRIDGE Toolset
The complete system under development consists of two main parts,
called the "operational diagnostic tool" and the "support function tool".
The support function tool allows administrators to administer the fault-base,
i.e. entering new cases, modifying existing cases, specifying possible
tests, actions and potential test results, etc. In addition, recorded problems
and unresolved failures can be analysed. The tool generates the network,
which allows for evaluation of coverage and sensitivity of symptoms, tests
and test results using special purpose analysis routines. When a consistent
network has been obtained, an analysis for the expected real-time characteristics
can be performed, and when the performance is within specified time limits,
the support function tool will generate the real-time kernel for on-line
use. The tool also generates a condensed history from the global history
in the data store for on-line retrieval. The operational diagnostic tool
gives non-administrative users access to the real-time
kernel and the condensed history, depending on the operational mode
and governed by the operator level. The tool records a history of all related
data of problems for future evaluation in the support function tool. The
actual fault diagnosis is performed by the real-time kernel, which monitors
the technical system in real time and suggests additional tests, based
on presented symptoms, previous test results and interaction with the operator.
Based on this the fault will be identified and corrective actions will
be suggested.
The case-base is made up of cases, each described by symptoms,
tests, test results and actions. Each case represents a single fault. A
certainty factor is specified for the occurrence of the fault, together with
an alarm level. Symptoms are defined with information
about alarm levels, operational status and textual explanations, and a certainty
factor for the occurrence of the symptom for the given case. Tests are
defined with possible results, the time and effort required to perform the
test, and the operational modes during which the test can be performed. As with
symptoms, possible test results are specified by a certainty factor for
the occurrence of the result for the given case, or vice-versa.
References
-
Aamodt, A, and E. Plaza (1994). Case-Based Reasoning:
Foundational Issues, Methodological Variations, and System Approaches.
AI Communications, 7, nr. 1, pp. 39-52, March.
-
Forgy, C.L. (1982). Rete: A fast Algorithm for the
Many pattern/Many Object Pattern Match Problem. Artificial Intelligence,
19, pp. 17-37.
-
Jones, A.V. (1995). "An Approach to the Design of Expert Systems for Hard Real-Time
Applications", PhD thesis, University of Wales Swansea.
-
Jones, A.V., and M.G. Rodd (1993). Problems with Expert Systems in Real-time Control.
Engineering Applications of Artificial Intelligence, 7, nr.
3, pp. 499-506.
-
Jones, A.V., and M. G. Rodd (1994).
An approach to the design of expert systems for hard real-time control.
IFAC Workshop on safety, reliability and applications of emerging intelligent
control technologies, Hong Kong, 12-14 December, pp. 30-35.
-
Jones A.V., Vingerhoeds R.A., Rodd M.G. (1995).
"Real-Time Expert Systems for flight control.", IFAC Artificial Intelligence
in Real-Time Control, AIRTC'95. Slovenia, November 29-1 December.
-
Motus, L., Rodd, M.G., Timing Analysis of Real-Time Software,
Pergamon Press, Oxford, 1994.
-
Netten B.D., and R.A. Vingerhoeds
(1994). Automatic Fault Tree Generation. IFAC Workshop on safety, reliability
and applications of emerging intelligent control technologies, Hong Kong,
12-14 December, pp. 182-187.
-
Netten B.D., Vingerhoeds R.A. (1995a). Automatic
Fault-Tree Generation: A Generic Approach for Fault Diagnosis Systems,
TRAIL PhD congress, Multidisciplinary visions on TRAnsport, Infrastructure
and Logistics, May 30, Rotterdam.
-
Netten B.D., and R.A. Vingerhoeds
(1995b). Large-Scale Fault Diagnosis for On-Board Train Systems, in: Case-Based
Reasoning, Research and Development, (eds.) Veloso M., Aamodt A., Lecture
Notes in Artificial Intelligence 1010, Springer Verlag, Berlin, pp 67-76.
-
Raaphorst A., Netten B.D., Vingerhoeds R.A., Automated
Fault Tree Generation for Operational Fault Diagnosis, RAILink'95,
IEE Int. Conf. Electric Railways in a United Europe, Amsterdam, March 27-30,
1995, pp. 173-177.
-
Rodd, M.G. (1995). Safe AI - Is
this possible? Engineering Applications of Artificial Intelligence,
8, nr. 3, pp. 243-250.
-
Rumbaugh J., Blaha M., Premerlani W., Eddy F., Lorensen
W., "Object-Oriented Modelling and Design", Prentice-Hall, 1991.
-
Vingerhoeds, R.A., Janssens, P., Netten, B.D., Aznar
Fernández-Montesinos, M. (1995a). "Enhancing off- and on-line monitoring
and fault diagnosis", Control Engineering Practice, Vol. 3, Nr. 11, pp.
1515-1528.
-
Vingerhoeds, R.A., B.D. Netten and L. Boullart (1992).
Artificial Intelligence in Process Control Applications. Communications
and Cognition in Artificial Intelligence, 9, pp. 161-173.
-
Vingerhoeds, R.A., Netten, B.D., Jones, A.V., Rodd,
M.G. (1995b). Enhancing On-Line Fault Diagnosis using Real-Time Expert
Systems, in proc: ESTEC Workshop Artificial Intelligence and Knowledge
Based Systems for Space, 10-11 October.
-
Zijderveld P.D., Vingerhoeds R.A. (1996a). "Design
Methodology for Real-Time systems", proc. Advanced School for Computing
and Imaging conference, June 5-7 1996, Lommel, Belgium.
-
Zijderveld P.D., Vingerhoeds R.A. (1996b). "Ensuring
Real-Time performance of Expert Systems", proc. IEEE Workshop, AI in Aerospace
Applications, November 16, 1996, Toulouse, France.
Updated by Rob Vingerhoeds (rob@kgs.twi.tudelft.nl)
Last Update 06-11-1997.