A practical, step-by-step guide to designing world-class, high availability systems using both classical and DFSS reliability techniques
Whether designing telecom, aerospace, automotive, medical, financial, or public safety systems, every engineer aims for the utmost reliability and availability in the systems they design. But between the dream of world-class performance and reality falls the shadow of complexities that can bedevil even the most rigorous design process. While there is an array of robust predictive engineering tools, there has been no single-source guide to understanding and using them . . . until now.
Offering a case-based approach to designing, predicting, and deploying world-class high-availability systems from the ground up, this book brings together the best classical and DFSS reliability techniques. Although it focuses on technical aspects, this guide considers the business and market constraints that require that systems be designed right the first time.
Written in plain English and following a step-by-step ‘cookbook’ format, Designing High Availability Systems:
- Shows how to integrate an array of design/analysis tools, including Six Sigma, Failure Analysis, and Reliability Analysis
- Features many real-life examples and case studies describing predictive design methods, tradeoffs, risk priorities, ‘what-if’ scenarios, and more
- Delivers numerous high-impact takeaways that you can apply to your current projects immediately
- Provides access to MATLAB programs for simulating the problem sets presented, along with PowerPoint slides to assist in outlining the problem-solving process
Designing High Availability Systems is an indispensable working resource for system engineers, software/hardware architects, and project teams working in all industries.
Table of Contents
Preface xiii
List of Abbreviations xvii
1. Introduction 1
2. Initial Considerations for Reliability Design 3
2.1 The Challenge 3
2.2 Initial Data Collection 3
2.3 Where Do We Get MTBF Information? 5
2.4 MTTR and Identifying Failures 6
2.5 Summary 7
3. A Game of Dice: An Introduction to Probability 8
3.1 Introduction 8
3.2 A Game of Dice 10
3.3 Mutually Exclusive and Independent Events 10
3.4 Dice Paradox Problem and Conditional Probability 15
3.5 Flip a Coin 21
3.6 Dice Paradox Revisited 23
3.7 Probabilities for Multiple Dice Throws 24
3.8 Conditional Probability Revisited 27
3.9 Summary 29
4. Discrete Random Variables 30
4.1 Introduction 30
4.2 Random Variables 31
4.3 Discrete Probability Distributions 33
4.4 Bernoulli Distribution 34
4.5 Geometric Distribution 35
4.6 Binomial Coefficients 38
4.7 Binomial Distribution 40
4.8 Poisson Distribution 43
4.9 Negative Binomial Random Variable 48
4.10 Summary 50
5. Continuous Random Variables 51
5.1 Introduction 51
5.2 Uniform Random Variables 52
5.3 Exponential Random Variables 53
5.4 Weibull Random Variables 54
5.5 Gamma Random Variables 55
5.6 Chi-Square Random Variables 59
5.7 Normal Random Variables 59
5.8 Relationship between Random Variables 60
5.9 Summary 61
6. Random Processes 62
6.1 Introduction 62
6.2 Markov Process 63
6.3 Poisson Process 63
6.4 Deriving the Poisson Distribution 64
6.5 Poisson Interarrival Times 69
6.6 Summary 71
7. Modeling and Reliability Basics 72
7.1 Introduction 72
7.2 Modeling 75
7.3 Failure Probability and Failure Density 77
7.4 Unreliability, F(t) 78
7.5 Reliability, R(t) 79
7.6 MTTF 79
7.7 MTBF 79
7.8 Repairable System 80
7.9 Nonrepairable System 80
7.10 MTTR 80
7.11 Failure Rate 81
7.12 Maintainability 81
7.13 Operability 81
7.14 Availability 82
7.15 Unavailability 84
7.16 Five 9s Availability 85
7.17 Downtime 85
7.18 Constant Failure Rate Model 85
7.19 Conditional Failure Rate 88
7.20 Bayes’s Theorem 94
7.21 Reliability Block Diagrams 98
7.22 Summary 107
8. Discrete-Time Markov Analysis 110
8.1 Introduction 110
8.2 Markov Process Defined 112
8.3 Dynamic Modeling 116
8.4 Discrete Time Markov Chains 116
8.5 Absorbing Markov Chains 123
8.6 Nonrepairable Reliability Models 129
8.7 Summary 140
9. Continuous-Time Markov Systems 141
9.1 Introduction 141
9.2 Continuous-Time Markov Processes 141
9.3 Two-State Derivation 143
9.4 Steps to Create a Markov Reliability Model 147
9.5 Asymptotic Behavior (Steady-State Behavior) 148
9.6 Limitations of Markov Modeling 154
9.7 Markov Reward Models 154
9.8 Summary 155
10. Markov Analysis: Nonrepairable Systems 156
10.1 Introduction 156
10.2 One Component, No Repair 156
10.3 Nonrepairable Systems: Parallel System with No Repair 165
10.4 Series System with No Repair: Two Identical Components 172
10.5 Parallel System with Partial Repair: Identical Components 176
10.6 Parallel System with No Repair: Nonidentical Components 183
10.7 Summary 192
11. Markov Analysis: Repairable Systems 193
11.1 Repairable Systems 193
11.2 One Component with Repair 194
11.3 Parallel System with Repair: Identical Component Failure and Repair Rates 204
11.4 Parallel System with Repair: Different Failure and Repair Rates 217
11.5 Summary 239
12. Analyzing Confidence Levels 240
12.1 Introduction 240
12.2 pdf of a Squared Normal Random Variable 240
12.3 pdf of the Sum of Two Random Variables 243
12.4 pdf of the Sum of Two Gamma Random Variables 245
12.5 pdf of the Sum of n Gamma Random Variables 246
12.6 Goodness-of-Fit Test Using Chi-Square 249
12.7 Confidence Levels 257
12.8 Summary 264
13. Estimating Reliability Parameters 266
13.1 Introduction 266
13.2 Bayes’ Estimation 268
13.3 Example of Estimating Hardware MTBF 273
13.4 Estimating Software MTBF 273
13.5 Revising Initial MTBF Estimates and Tradeoffs 274
13.6 Summary 277
14. Six Sigma Tools for Predictive Engineering 278
14.1 Introduction 278
14.2 Gathering Voice of Customer (VOC) 279
14.3 Processing Voice of Customer 281
14.4 Kano Analysis 282
14.5 Analysis of Technical Risks 284
14.6 Quality Function Deployment (QFD) or House of Quality 284
14.7 Program Level Transparency of Critical Parameters 287
14.8 Mapping DFSS Techniques to Critical Parameters 287
14.9 Critical Parameter Management (CPM) 287
14.10 First Principles Modeling 289
14.11 Design of Experiments (DOE) 289
14.12 Design Failure Modes and Effects Analysis (DFMEA) 289
14.13 Fault Tree Analysis 290
14.14 Pugh Matrix 290
14.15 Monte Carlo Simulation 291
14.16 Commercial DFSS Tools 291
14.17 Mathematical Prediction of System Capability instead of “Gut Feel” 293
14.18 Visualizing System Behavior Early in the Life Cycle 297
14.19 Critical Parameter Scorecard 297
14.20 Applying DFSS in Third-Party Intensive Programs 298
14.21 Summary 300
15. Design Failure Modes and Effects Analysis 302
15.1 Introduction 302
15.2 What Is Design Failure Modes and Effects Analysis (DFMEA)? 302
15.3 Definitions 303
15.4 Business Case for DFMEA 303
15.5 Why Conduct DFMEA? 305
15.6 When to Perform DFMEA 305
15.7 Applicability of DFMEA 306
15.8 DFMEA Template 306
15.9 DFMEA Life Cycle 312
15.10 The DFMEA Team 324
15.11 DFMEA Advantages and Disadvantages 327
15.12 Limitations of DFMEA 328
15.13 DFMEAs, FTAs, and Reliability Analysis 328
15.14 Summary 330
16. Fault Tree Analysis 331
16.1 What Is Fault Tree Analysis? 331
16.2 Events 332
16.3 Logic Gates 333
16.4 Creating a Fault Tree 335
16.5 Fault Tree Limitations 339
16.6 Summary 339
17. Monte Carlo Simulation Models 340
17.1 Introduction 340
17.2 System Behavior over Mission Time 344
17.3 Reliability Parameter Analysis 344
17.4 A Worked Example 348
17.5 Component and System Failure Times Using Monte Carlo Simulations 359
17.6 Limitations of Using Nontime-Based Monte Carlo Simulations 361
17.7 Summary 365
18. Updating Reliability Estimates: Case Study 367
18.1 Introduction 367
18.2 Overview of the Base Station Controller—Data Only (BSC-DO) System 367
18.3 Downtime Calculation 368
18.4 Calculating Availability from Field Data Only 371
18.5 Assumptions Behind Using the Chi-Square Methodology 372
18.6 Fault Tree Updates from Field Data 372
18.7 Summary 376
19. Fault Management Architectures 377
19.1 Introduction 377
19.2 Faults, Errors, and Failures 378
19.3 Fault Management Design 381
19.4 Repair versus Recovery 382
19.5 Design Considerations for Reliability Modeling 383
19.6 Architecture Techniques to Improve Availability 383
19.7 Redundancy Schemes 384
19.8 Summary 395
20. Application of DFMEA to Real-Life Example 397
20.1 Introduction 397
20.2 Cage Failover Architecture Description 397
20.3 Cage Failover DFMEA Example 399
20.4 DFMEA Scorecard 401
20.5 Lessons Learned 402
20.6 Summary 403
21. Application of FTA to Real-Life Example 404
21.1 Introduction 404
21.2 Calculating Availability Using Fault Tree Analysis 404
21.3 Building the Basic Events 405
21.4 Building the Fault Tree 406
21.5 Steps for Creating and Estimating the Availability Using FTA 408
21.6 Summary 416
22. Complex High Availability System Analysis 420
22.1 Introduction 420
22.2 Markov Analysis of the Hardware Components 420
22.3 Building a Fault Tree from the Hardware Markov Model 427
22.4 Markov Analysis of the Software Components 427
22.5 Markov Analysis of the Combined Hardware and Software Components 433
22.6 Techniques for Simplifying Markov Analysis 437
22.7 Summary 446
References 447
Index 450
About the Authors
ZACHARY TAYLOR is a Systems Architect at Nokia Solutions & Networks with over thirty years’ experience designing high availability and mission-critical systems at GE, Lockheed Martin, and Motorola. He has a Master’s in Electrical Engineering.
SUBRAMANYAM RANGANATHAN is a DFSS Master Black Belt at Nokia Solutions & Networks with over twenty years’ experience in the high-tech industry, including at Motorola. He has a Master’s in Electrical Engineering and an MBA from the Kellogg School of Management.