7.4 System Safety and Reliability
1. Fundamental Concepts
1.1 System Safety
Definition: The application of engineering and management principles to achieve an acceptable level of risk throughout a system's lifecycle.
Objective: To identify, evaluate, and eliminate or control hazards before they result in accidents or failures.
Key Principle: Safety should be designed into systems, not added as an afterthought.
1.2 Reliability
Definition: The probability that a system or component will perform its required function under stated conditions for a specified period of time.
Mathematical Definition: $R(t) = P(T > t)$, where $R(t)$ is the reliability at time $t$ and $T$ is the time to failure.
Reliability vs Safety:
Reliability: Focuses on functional performance.
Safety: Focuses on prevention of accidents/harm.
A system can be reliable but unsafe, or safe but unreliable.
1.3 Key Terms
Hazard: A condition with potential to cause harm (e.g., high voltage, toxic chemical).
Risk: Combination of probability and severity of harm from a hazard. $\text{Risk} = \text{Probability} \times \text{Severity}$
Failure: Termination of a system's ability to perform required function.
Fault: Abnormal condition that may cause failure.
Error: Human mistake that creates faults.
2. Failure Modes
2.1 Classification of Failure Modes
By Timing:
Early Failures: Occur during initial operation (infant mortality).
Random Failures: Constant failure rate during useful life.
Wear-out Failures: Increasing failure rate as components age.
By Effect:
Catastrophic: Complete, sudden loss of function.
Degraded: Gradual deterioration of performance.
Intermittent: Sporadic failures.
By Cause:
Hardware Failures: Physical component breakdown.
Software Failures: Bugs, logic errors.
Human Error: Operator mistakes, maintenance errors.
Environmental: Temperature, humidity, vibration.
2.2 Failure Mode and Effects Analysis (FMEA)
Purpose: Systematic method to identify potential failures and their consequences.
Process Steps:
Identify components/functions.
List potential failure modes.
Determine effects of each failure.
Identify causes of failures.
Determine current controls.
Calculate Risk Priority Number (RPN).
Risk Priority Number (RPN): $\text{RPN} = \text{Severity} \times \text{Occurrence} \times \text{Detection}$
Severity: Impact of failure (1-10).
Occurrence: Likelihood that the failure occurs (1-10).
Detection: Likelihood that the failure escapes detection before its effect occurs (1-10; a higher score means harder to detect).
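The RPN calculation above can be sketched in a few lines. This is a minimal illustration, not a full FMEA worksheet: the `FailureMode` class, the failure mode names, and all scores are hypothetical, and the detection score follows the common convention that a higher value means the failure is harder to detect.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    name: str
    severity: int    # 1-10, impact of the failure
    occurrence: int  # 1-10, likelihood of occurrence
    detection: int   # 1-10, higher means harder to detect (assumed convention)

    def rpn(self) -> int:
        """Risk Priority Number = Severity x Occurrence x Detection."""
        for score in (self.severity, self.occurrence, self.detection):
            if not 1 <= score <= 10:
                raise ValueError("scores must be in the range 1-10")
        return self.severity * self.occurrence * self.detection


# Hypothetical failure modes, for illustration only.
modes = [
    FailureMode("seal leak", severity=7, occurrence=4, detection=3),
    FailureMode("sensor drift", severity=5, occurrence=6, detection=8),
    FailureMode("connector corrosion", severity=4, occurrence=3, detection=2),
]

# Rank failure modes from highest to lowest RPN.
ranked = sorted(modes, key=lambda m: m.rpn(), reverse=True)
for m in ranked:
    print(f"{m.name}: RPN = {m.rpn()}")
```

Sorting by RPN gives a first-pass priority list. Because the three scores are ordinal, the ranking is best read as a screening aid rather than a precise measure of risk.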
2.3 Common Failure Mechanisms
Mechanical:
Fatigue cracking.
Wear and abrasion.
Corrosion and erosion.
Creep deformation.
Electrical:
Short circuits.
Open circuits.
Insulation breakdown.
Electronic:
Thermal cycling damage.
Electromigration.
Electrostatic discharge.
Software:
Logic errors.
Memory leaks.
Race conditions.
3. Mean Time Between Failures (MTBF)
3.1 Definition and Calculation
MTBF Definition: Average time between consecutive failures of a repairable system. $\text{MTBF} = \frac{\text{Total Operational Time}}{\text{Number of Failures}}$
For Constant Failure Rate:
Assuming exponential distribution: $R(t) = e^{-\lambda t}$
Where $\lambda$ is failure rate (failures per unit time).
$\text{MTBF} = \frac{1}{\lambda}$
Units: Typically expressed in hours.
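As a rough sketch of the two formulas above, the snippet below estimates MTBF from field data and converts it to a failure rate under the constant-failure-rate assumption. The function names and the fleet figures (10 units, 1000 hours each, 4 failures) are hypothetical.

```python
def mtbf_from_data(total_operating_hours: float, failures: int) -> float:
    """Point estimate: total operational time divided by number of failures."""
    if failures <= 0:
        raise ValueError("at least one failure is needed for a point estimate")
    return total_operating_hours / failures


def failure_rate(mtbf_hours: float) -> float:
    """Failure rate lambda = 1 / MTBF (valid only for a constant failure rate)."""
    return 1.0 / mtbf_hours


# Hypothetical fleet: 10 units, 1000 operating hours each, 4 failures in total.
mtbf = mtbf_from_data(10 * 1000.0, 4)
lam = failure_rate(mtbf)
print(f"MTBF = {mtbf:.0f} h, lambda = {lam:.6f} failures/h")
```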
3.2 Applications and Interpretation
Reliability Prediction:
Probability of no failure in time t: $R(t) = e^{-t/\text{MTBF}}$
For $t = \text{MTBF}$: $R(\text{MTBF}) = e^{-1} \approx 0.368$ (36.8% reliability)
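The reliability prediction above is easy to check numerically. The sketch below assumes the exponential (constant failure rate) model. The MTBF of 2500 hours and the mission times are hypothetical values chosen for illustration.

```python
import math


def reliability(t_hours: float, mtbf_hours: float) -> float:
    """R(t) = exp(-t / MTBF) under the constant-failure-rate assumption."""
    return math.exp(-t_hours / mtbf_hours)


mtbf = 2500.0  # hypothetical MTBF in hours
for t in (250.0, 1000.0, 2500.0):
    print(f"R({t:.0f} h) = {reliability(t, mtbf):.3f}")
```

The last line of output corresponds to t = MTBF and gives about 0.368: only about 37% of units are expected to survive to the MTBF without a failure.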
Maintenance Planning:
Determines inspection intervals.
Guides spare parts inventory.
System Comparison:
Higher MTBF indicates more reliable system.
Used in procurement specifications.
3.3 Related Metrics
Mean Time To Failure (MTTF):
For non-repairable items.
Average time until first failure.
Mean Time To Repair (MTTR):
Average time to restore system after failure.
Availability: $\text{Availability} = \frac{\text{MTBF}}{\text{MTBF} + \text{MTTR}}$
Percentage of time system is operational.
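A short sketch of the availability formula, returning a fraction between 0 and 1 (multiply by 100 for a percentage). The MTBF and MTTR figures are hypothetical, and the downtime estimate assumes continuous operation over 8760 hours per year.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical system: MTBF of 2500 h and MTTR of 8 h.
a = availability(2500.0, 8.0)
downtime_hours_per_year = (1.0 - a) * 8760.0  # assumes round-the-clock operation
print(f"Availability = {a:.4%}, expected downtime = {downtime_hours_per_year:.1f} h/year")
```

The formula shows that availability can be improved either by making failures rarer (higher MTBF) or by making repairs faster (lower MTTR).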
3.4 Limitations of MTBF
Assumes Constant Failure Rate: Not valid during early-failure or wear-out periods.
Doesn't Consider Failure Severity: Treats all failures equally.
Population Statistic: Not a guarantee for individual units.
Data Quality Dependent: Requires accurate failure data.
4. Fault Tree Analysis (FTA)
4.1 Overview
Purpose: Top-down, deductive analysis method.
Approach: Starts with undesired top event, works downward to identify root causes.
Visual Representation: Boolean logic tree with gates and events.
4.2 Symbols and Notation
Events:
Top Event: Undesired system failure.
Basic Event: Lowest level, requires no further development.
Intermediate Event: Result of logic gate combination.
Logic Gates:
AND Gate: Output occurs if ALL inputs occur.
OR Gate: Output occurs if ANY input occurs.
Transfer Symbols: Connect different parts of large trees.
4.3 Analysis Process
Define System: Boundaries, assumptions, success/failure criteria.
Identify Top Event: Specific system failure to analyze.
Construct Tree: Develop logic relationships downward.
Evaluate Tree: Qualitative and quantitative analysis.
Interpret Results: Identify critical paths, recommend improvements.
4.4 Quantitative Analysis
Probability Calculation:
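AND Gate: $P_{\text{output}} = \prod_i P_{\text{input},i}$ (assuming independent inputs)
OR Gate: $P_{\text{output}} = 1 - \prod_i (1 - P_{\text{input},i})$ (assuming independent inputs)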
Cut Sets: Minimal combinations of basic events causing top event.
Importance Measures:
Birnbaum Importance: Sensitivity of top event to component.
Criticality Importance: Accounts for component reliability.
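The gate formulas and the Birnbaum measure can be illustrated on a small tree. The sketch below assumes independent basic events and uses a hypothetical top event, "loss of cooling", that occurs if power is lost OR both pumps fail. All probabilities are made up. Birnbaum importance is computed here as the difference in top-event probability with the component failed versus working.

```python
from math import prod


def and_gate(probs):
    """Output occurs only if ALL independent inputs occur."""
    return prod(probs)


def or_gate(probs):
    """Output occurs if ANY independent input occurs."""
    return 1.0 - prod(1.0 - p for p in probs)


def top_event(p_power: float, p_pump_a: float, p_pump_b: float) -> float:
    """Hypothetical tree: TOP = power_loss OR (pump_a_fails AND pump_b_fails)."""
    return or_gate([p_power, and_gate([p_pump_a, p_pump_b])])


# Hypothetical basic-event probabilities.
p_power, p_pump_a, p_pump_b = 0.001, 0.01, 0.02

p_both = and_gate([p_pump_a, p_pump_b])
p_top = top_event(p_power, p_pump_a, p_pump_b)

# Birnbaum importance of the power supply: P(top | failed) - P(top | working).
birnbaum_power = (top_event(1.0, p_pump_a, p_pump_b)
                  - top_event(0.0, p_pump_a, p_pump_b))

print(f"P(both pumps fail) = {p_both:.6f}")
print(f"P(top event)       = {p_top:.7f}")
print(f"Birnbaum (power)   = {birnbaum_power:.4f}")
```

The minimal cut sets of this tree are {power loss} and {pump A fails, pump B fails}. The single-event cut set dominates the top-event probability, which is why the power supply shows a Birnbaum importance close to 1.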
4.5 Applications and Benefits
Design Evaluation: Identify weak points in system design.
Risk Assessment: Quantify probability of hazardous events.
Root Cause Analysis: Systematic investigation of failures.
Safety Verification: Demonstrate compliance with safety requirements.
5. Event Tree Analysis (ETA)
5.1 Overview
Purpose: Forward-looking, inductive analysis method.
Approach: Starts with initiating event, works forward through possible outcomes.
Visual Representation: Tree branching at decision points.
5.2 Analysis Process
Identify Initiating Event: Starting point for analysis.
Define Safety Functions: Systems responding to initiating event.
Construct Tree: Branch for success/failure of each function.
Assign Probabilities: To each branch point.
Calculate Outcome Probabilities: Multiply along paths.
5.3 Quantitative Analysis
Path Probability: Product of probabilities along path. $P_{\text{path}} = P_{\text{IE}} \times \prod P_{\text{branch}}$
Outcome Probability: Sum of probabilities for all paths leading to same outcome.
Risk Calculation: Combine probability with consequence severity.
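The path and outcome calculations can be sketched by enumerating every branch combination. The example below is hypothetical throughout: an initiating event with probability 0.01, two safety functions ("detection" and "suppression") with made-up success probabilities, and an invented rule for classifying outcomes. It also assumes the branch probabilities are independent of earlier branches, which a real event tree may not satisfy.

```python
from collections import defaultdict
from itertools import product

P_INITIATING = 0.01  # hypothetical probability of the initiating event
SUCCESS_PROB = {"detection": 0.95, "suppression": 0.90}  # hypothetical values


def classify(successes) -> str:
    """Hypothetical outcome rule based on how many safety functions worked."""
    working = sum(successes)
    if working == len(successes):
        return "controlled"
    if working == 0:
        return "severe damage"
    return "limited damage"


outcome_prob = defaultdict(float)
# Each path is one success/failure combination across the safety functions.
for successes in product([True, False], repeat=len(SUCCESS_PROB)):
    p_path = P_INITIATING
    for ok, p_success in zip(successes, SUCCESS_PROB.values()):
        p_path *= p_success if ok else (1.0 - p_success)
    # Paths leading to the same outcome are summed.
    outcome_prob[classify(successes)] += p_path

for outcome, p in outcome_prob.items():
    print(f"{outcome}: {p:.5f}")
```

Two different paths lead to "limited damage", so their probabilities are added. The outcome probabilities sum to the initiating event probability, which is a useful sanity check on any event tree.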
5.4 Applications
Accident Sequence Analysis: How initiating events escalate.
Safety System Effectiveness: Evaluate protective systems.
Emergency Planning: Identify critical response sequences.
Regulatory Compliance: Demonstrate risk control measures.
5.5 FTA vs ETA Comparison
Direction: Top-down, deductive (FTA); forward-looking, inductive (ETA).
Starting Point: Undesired top event (FTA); initiating event (ETA).
Focus: Causes of a specific failure (FTA); consequences of an initiating event (ETA).
Logic: Boolean AND/OR gates (FTA); sequential branching (ETA).
Best For: Root cause analysis and design weaknesses (FTA); accident progression and safety system evaluation (ETA).
6. Hazard Analysis
6.1 Types of Hazard Analysis
Preliminary Hazard Analysis (PHA):
Early in design phase.
Identifies potential hazards.
Recommends design changes.
System Hazard Analysis (SHA):
Detailed analysis of system interactions.
Identifies interface hazards.
Operating and Support Hazard Analysis (O&SHA, abbreviated this way to avoid confusion with the OSHA regulatory agency):
Focuses on operational phase.
Human-machine interface hazards.
Job Hazard Analysis (JHA):
Task-specific analysis.
Identifies step-by-step hazards.
6.2 Hazard Identification Techniques
Checklists: Standardized lists of potential hazards.
What-If Analysis: Brainstorming potential abnormal situations.
Hazard and Operability Study (HAZOP):
Systematic examination of deviations from design intent.
Uses guide words (NO, MORE, LESS, etc.).
Failure Modes, Effects, and Criticality Analysis (FMECA):
Extension of FMEA with criticality ranking.
6.3 Risk Assessment Matrix
Severity Categories:
Catastrophic
Critical
Marginal
Negligible
Probability Levels:
Frequent
Probable
Occasional
Remote
Improbable
Risk Ranking: Combine severity and probability.
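One way to combine the two scales into a risk ranking is a simple lookup on their positions. The sketch below is only an illustration: the multiplicative score, the thresholds, and the rank labels ("High", "Serious", "Medium", "Low") are assumptions and do not reproduce any particular standard's matrix.

```python
# Categories ordered from least to most severe / least to most likely.
SEVERITY = ["Negligible", "Marginal", "Critical", "Catastrophic"]
PROBABILITY = ["Improbable", "Remote", "Occasional", "Probable", "Frequent"]


def risk_rank(severity: str, probability: str) -> str:
    """Map a severity/probability pair to a rank (hypothetical thresholds)."""
    score = (SEVERITY.index(severity) + 1) * (PROBABILITY.index(probability) + 1)
    if score >= 12:
        return "High"
    if score >= 6:
        return "Serious"
    if score >= 3:
        return "Medium"
    return "Low"


print(risk_rank("Catastrophic", "Frequent"))
print(risk_rank("Critical", "Occasional"))
print(risk_rank("Negligible", "Improbable"))
```

In practice, organizations usually define each cell of the matrix explicitly rather than deriving it from a product, so that high-severity, low-probability hazards are not ranked too low.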
6.4 Risk Control Hierarchy
Elimination: Remove hazard completely.
Substitution: Replace with less hazardous alternative.
Engineering Controls: Physical changes (guards, ventilation).
Administrative Controls: Procedures, training, supervision.
Personal Protective Equipment (PPE): Last line of defense.
7. Reliability Engineering Principles
7.1 Design for Reliability
Derating: Operate components below rated capacity.
Redundancy:
Active: Multiple components operate simultaneously.
Standby: Backup activates when primary fails.
k-out-of-n: System works if k of n components work.
Fault Tolerance: Continue operation despite component failures.
Fail-Safe Design: System fails to safe state.
7.2 Reliability Modeling
Series Systems:
All components must work for system success.
$R_{\text{system}} = \prod_i R_i$
System reliability ≤ weakest component.
Parallel Systems:
System works if any component works.
$R_{\text{system}} = 1 - \prod_i (1 - R_i)$
Combined Systems: Series-parallel combinations.
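The series and parallel formulas, together with the k-out-of-n case from the redundancy discussion, can be sketched as follows. The code assumes independent component failures and, for k-out-of-n, identical components. The component reliabilities and the pump/controller arrangement are hypothetical.

```python
from math import comb, prod


def series(reliabilities):
    """All components must work."""
    return prod(reliabilities)


def parallel(reliabilities):
    """System works if at least one component works."""
    return 1.0 - prod(1.0 - r for r in reliabilities)


def k_out_of_n(k: int, n: int, r: float) -> float:
    """System works if at least k of n identical components work (binomial sum)."""
    return sum(comb(n, i) * r**i * (1.0 - r) ** (n - i) for i in range(k, n + 1))


# Hypothetical combined system: two redundant pumps (0.90 each)
# in series with a single controller (0.95).
pumps = parallel([0.90, 0.90])
system = series([pumps, 0.95])

print(f"Redundant pumps: {pumps:.4f}")
print(f"Whole system:    {system:.4f}")
print(f"2-out-of-3 at R = 0.90: {k_out_of_n(2, 3, 0.90):.4f}")
```

The redundant pair reaches 0.99, but the series controller pulls the whole system down to about 0.94, illustrating that a series system is never more reliable than its least reliable element.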
7.3 Reliability Testing
Life Testing: Operate until failure.
Accelerated Life Testing: Apply stress to hasten failures.
Environmental Stress Screening: Eliminate early failures.
Burn-in Testing: Operate before delivery to remove infant mortality.
7.4 Reliability Data Analysis
Weibull Analysis: Models various failure patterns.
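Shape parameter (β) indicates failure pattern: β < 1 for early failures (decreasing failure rate), β = 1 for random failures (constant failure rate), β > 1 for wear-out (increasing failure rate).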
Exponential Distribution: Constant failure rate assumption.
Normal Distribution: Often used to model wear-out failures.
Data Collection: Failure times, operating conditions, maintenance records.
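To show how the Weibull shape parameter relates to the failure patterns listed in Section 2.1, the sketch below evaluates the two-parameter Weibull reliability and hazard (failure rate) functions. The scale parameter `eta` (characteristic life) and the numeric values are assumptions introduced for illustration.

```python
import math


def weibull_reliability(t: float, beta: float, eta: float) -> float:
    """R(t) = exp(-(t / eta)^beta) for the two-parameter Weibull model."""
    return math.exp(-((t / eta) ** beta))


def weibull_hazard(t: float, beta: float, eta: float) -> float:
    """Failure rate h(t) = (beta / eta) * (t / eta)^(beta - 1), for t > 0."""
    return (beta / eta) * (t / eta) ** (beta - 1.0)


eta = 1000.0  # hypothetical characteristic life in hours
for beta, label in ((0.5, "early failures"), (1.0, "random failures"), (2.0, "wear-out")):
    h_early = weibull_hazard(500.0, beta, eta)
    h_late = weibull_hazard(1500.0, beta, eta)
    print(f"beta = {beta}: h(500 h) = {h_early:.5f}, h(1500 h) = {h_late:.5f} ({label})")
```

With β = 1 the hazard is constant and the model reduces to the exponential distribution. With β below 1 the hazard falls over time, and with β above 1 it rises.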
8. Safety Instrumented Systems (SIS)
8.1 Safety Integrity Levels (SIL)
Definition: Quantitative target for safety system performance.
SIL Levels 1-4: Increasing reliability requirements.
Average Probability of Failure on Demand (PFD), for low-demand mode of operation:
SIL 1: 0.1 to 0.01
SIL 2: 0.01 to 0.001
SIL 3: 0.001 to 0.0001
SIL 4: 0.0001 to 0.00001
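The PFD bands above can be turned into a simple lookup. This is a sketch only: the function name is hypothetical, and treating each band as including its lower bound and excluding its upper bound is an assumption about how the boundaries are handled.

```python
from typing import Optional

# (SIL, lower bound inclusive, upper bound exclusive), taken from the bands above.
SIL_BANDS = [
    (4, 1e-5, 1e-4),
    (3, 1e-4, 1e-3),
    (2, 1e-3, 1e-2),
    (1, 1e-2, 1e-1),
]


def sil_from_pfd(pfd_avg: float) -> Optional[int]:
    """Return the SIL whose band contains pfd_avg, or None if outside all bands."""
    for sil, low, high in SIL_BANDS:
        if low <= pfd_avg < high:
            return sil
    return None


for pfd in (0.05, 0.005, 5e-4, 5e-5, 0.5):
    print(f"PFD = {pfd}: SIL {sil_from_pfd(pfd)}")
```

A PFD of 0.5 falls outside every band, so the lookup returns `None`: such a function would not meet even SIL 1.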
8.2 Safety Lifecycle
Hazard and Risk Assessment
Allocation of Safety Functions
Safety Requirements Specification
Design and Engineering
Installation and Commissioning
Operation and Maintenance
Modification and Decommissioning
9. Implementation Considerations
9.1 Organizational Factors
Safety Culture: Management commitment, employee involvement.
Competence: Training and qualification of personnel.
Documentation: Procedures, records, analysis reports.
Continuous Improvement: Learn from incidents and near-misses.
9.2 Cost-Benefit Analysis
Direct Costs: Analysis time, implementation of controls.
Benefits: Reduced accidents, lower insurance, regulatory compliance.
Return on Investment: Often significant for safety improvements.
9.3 Regulatory Framework
Standards and Codes: ISO 31000 (Risk management), IEC 61508 (Functional safety).
Industry Specific: Process industry, aerospace, automotive.
Legal Requirements: Occupational safety, environmental protection.
10. Summary of Key Points
Proactive Approach: Identify and mitigate risks before incidents occur.
Systematic Methods: Use structured techniques (FTA, ETA, FMEA).
Quantitative Analysis: Support decisions with data and probabilities.
Lifecycle Perspective: Consider all phases from design to decommissioning.
Integrated Approach: Combine safety, reliability, and maintainability.
Continuous Process: Regular review and update of analyses.
Documentation: Maintain records for verification and improvement.
Effective system safety and reliability engineering requires a balanced approach combining technical analysis with organizational commitment and continuous improvement culture.