Creating a Reliable Future

10/18/21 Pratt School of Engineering

Kishor Trivedi educates practicing engineers about the best hardware-software reliability assessment tools available, and creates some new ones of his own

Kishor Trivedi has published four books on his favorite computing topic—reliability—and last year, he was honored with the IEEE Reliability Society Lifetime Achievement Award, in recognition of sustained contributions to the methods, tools and education in the reliability assessment of hardware-software systems.

Yet the most prized memory of his career isn’t of receiving an award or seeing his name on the dust jacket of a textbook fresh off the press. It’s of a handshake that happened in Hawaii in 1988.

“When I was a young professor in computer science, I was given a probability course to teach, but there was no corresponding textbook,” said Trivedi, now the Hudson Distinguished Professor of Electrical and Computer Engineering at Duke. “I started looking through all the existing literature and collecting lots of examples from computing that would fit. I started to give these notes to the students, then computerized them around 1977.”  

Soon, he was putting together the first textbook on the subject of probability for computer and reliability engineers. 

At the time, Trivedi recalled, print quality was not very good. To produce an “alpha” (α) denoting a variable, a printer would print an “o” flanked by a “<.” The textbook’s reviewer, Dr. Richard Hamming, the famed mathematician who developed the Hamming code and the Hamming window, griped about the low-quality print job but nonetheless received the book’s ideas favorably.

Probability and Statistics with Reliability, Queuing and Computer Science Applications was published in 1982. Around 1988, Trivedi traveled to the Hawaii International Conference on System Sciences, where he spotted a tall man dressed in white: Hamming. Trivedi introduced himself, and Hamming greeted him with a handshake, saying, “You’re the fellow who wrote that book!”

“It was like meeting a rock star,” said Trivedi, who still smiles when he recounts the meeting.

Over the course of his career, Trivedi has remained starry-eyed about the subject of reliability in general. And the topic is more relevant than ever, as software has made its way from traditional computing into nearly every facet of modern life. 

Accurate Predictions of Performance Depend on Realistic Projections of Reliability 

“The size of software in systems over time resembles a hockey stick,” said Trivedi. The amount of software in systems such as those used in aerospace was small at first, and growth curved gently upward like the blade of a hockey stick. In present-day systems, software has taken off like a rocket. Now, financial and medical records are software-dependent, as is critical infrastructure including power, water and fuel.

Software’s exponential growth isn’t itself a problem, said Trivedi—but the people evaluating the software’s trustworthiness tend to overestimate how well it will perform, and that could lead to a software-driven crisis in the near future.

Often, he explained, evaluators use overly simple techniques that assume the perfect independence of each of the system’s components. “In fact,” Trivedi explained, “the system is composed of interdependent components, and dependency generally brings down the level of actual reliability.” 
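
To make the point concrete, here is a minimal sketch of how an independence assumption can overstate the reliability of a redundant pair. The model and the names `parallel_independent`, `parallel_common_cause` and `beta` are illustrative assumptions, not taken from Trivedi's books or tools: a fraction `beta` of each unit's failure probability is attributed to a shared shock that disables every unit at once, while each unit's individual reliability stays the same.

```python
def parallel_independent(r: float, n: int = 2) -> float:
    """Reliability of n redundant units, assuming failures are fully independent."""
    return 1.0 - (1.0 - r) ** n


def parallel_common_cause(r: float, beta: float, n: int = 2) -> float:
    """Reliability of the same n units when they share a common failure cause.

    Hypothetical model: each unit fails if either a shared shock occurs
    (probability c = beta * (1 - r), which takes out every unit at once) or
    its own private failure occurs. The private failure probability is chosen
    so that each unit's individual reliability is still exactly r.
    """
    if not 0.0 < r <= 1.0 or not 0.0 <= beta <= 1.0:
        raise ValueError("need 0 < r <= 1 and 0 <= beta <= 1")
    c = beta * (1.0 - r)              # probability of the shared shock
    p_private = 1.0 - r / (1.0 - c)   # keeps the per-unit reliability at r
    # The system survives only if there is no shared shock and at least
    # one unit avoids its private failure.
    return (1.0 - c) * (1.0 - p_private ** n)


if __name__ == "__main__":
    r = 0.99  # illustrative per-unit reliability
    print("independent assumption:", parallel_independent(r))
    print("with 10% common cause: ", parallel_common_cause(r, beta=0.1))
```

With `beta` set to 0 the two functions agree. Any positive `beta` pulls the pair's reliability below the independent estimate, even though each unit looks exactly as reliable on its own.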

Overestimating a program’s reliability, unsurprisingly, can lead to inflated confidence in a system’s ability to perform as it should, and so for nearly 50 years Trivedi has been on a mission to educate practicing engineers about more robust reliability assessment techniques and tools, and to create some of his own.

“Practicing engineers tend to use simple reliability block diagrams and fault tree models, which depend on unrealistic assumptions about independence,” said Trivedi. To better capture the extent of dependence among system components, more advanced models like Markov chains are required, but Trivedi said there are significant obstacles to deploying them. “Understanding of these models generally is poor, and commercially available tools offer a shiny user interface but do not have a lot going on beneath the hood,” said Trivedi.
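
As a rough illustration of the kind of dependence a Markov chain can capture and a block diagram cannot, consider two redundant units that share a single repair crew. The scenario, the function names and the rates `lam` (failure) and `mu` (repair) are assumptions made for this sketch, not an example from the article. If both units are down, one must wait for the crew, so the units' downtimes are no longer independent.

```python
def availability_independent(lam: float, mu: float) -> float:
    """Two-unit parallel system, each unit repaired independently (own crew)."""
    unit_unavailability = lam / (lam + mu)
    return 1.0 - unit_unavailability ** 2


def availability_shared_repair(lam: float, mu: float) -> float:
    """Same system modeled as a three-state Markov chain with one repair crew.

    States count failed units (0, 1, 2). Transition rates:
      0 -> 1 at 2*lam, 1 -> 2 at lam, 1 -> 0 at mu, 2 -> 1 at mu.
    For this birth-death chain the steady-state probabilities are
    proportional to 1, 2*rho and 2*rho**2, where rho = lam / mu.
    """
    rho = lam / mu
    weights = [1.0, 2.0 * rho, 2.0 * rho ** 2]
    # The system is down only in the state where both units have failed.
    return 1.0 - weights[2] / sum(weights)


if __name__ == "__main__":
    lam, mu = 0.001, 0.1  # illustrative failure and repair rates per hour
    print("independent repair:", availability_independent(lam, mu))
    print("shared repair crew:", availability_shared_repair(lam, mu))
```

For small `lam / mu`, the shared-crew model predicts close to twice the downtime of the independent calculation, a gap that a plain block diagram would not reveal.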

Trivedi’s approach is multi-level, combining the efficiency and simplicity of reliability block diagrams and fault trees with the advanced capability of Markov models. This was the topic of his 2017 book and the central theme of his SHARPE tool, which counts Boeing, Honda, GE and numerous universities among its 1,000-plus users. It was also the focus of a four-week intensive study program for engineers at the Air Force Ground Based Strategic Deterrent, or GBSD, a land-based intercontinental ballistic missile system that is being developed for readiness in 2023. Needless to say, there’s no margin for error with this system, so Trivedi led the course to introduce the sophisticated probabilistic modeling techniques and tools that would enable DoD hardware and software engineers to most accurately gauge its reliability.
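
The multi-level idea can be sketched in a few lines, with the caveat that this is not SHARPE's interface and every name and parameter below is hypothetical. A small Markov submodel handles the part of the system where dependence matters, here a non-repairable redundant pair whose switchover succeeds only with probability `coverage`, and its result is passed up to a simple series block diagram.

```python
import math


def pair_reliability(t: float, lam: float, coverage: float) -> float:
    """Lower level: Markov model of a non-repairable redundant pair.

    States: both units up, one unit up, system failed. When the first unit
    fails (rate 2*lam), the switchover succeeds with probability `coverage`;
    otherwise the whole pair goes down. Solving the chain gives
    R(t) = exp(-2*lam*t) + 2*coverage*(exp(-lam*t) - exp(-2*lam*t)).
    """
    both_up = math.exp(-2.0 * lam * t)
    one_up = 2.0 * coverage * (math.exp(-lam * t) - math.exp(-2.0 * lam * t))
    return both_up + one_up


def series_reliability(block_reliabilities: list) -> float:
    """Upper level: series block diagram (every block must work).

    Multiplying reliabilities assumes the blocks fail independently of one
    another; dependence is confined to the submodels below.
    """
    return math.prod(block_reliabilities)


def system_reliability(t: float, lam_pair: float, coverage: float,
                       lam_controller: float) -> float:
    """Redundant pair (Markov submodel) in series with a single controller."""
    pair = pair_reliability(t, lam_pair, coverage)
    controller = math.exp(-lam_controller * t)
    return series_reliability([pair, controller])


if __name__ == "__main__":
    # Illustrative numbers only: 1,000-hour mission, 90% switchover coverage.
    print(system_reliability(1000.0, lam_pair=1e-3, coverage=0.9,
                             lam_controller=1e-5))
```

The top level stays as simple as a block diagram, and only the pair, where the independence assumption breaks down, pays the cost of a state-based model.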

Class photo of students in Department of Defense class

The first two weeks covered reliability modeling methods and applications to real-world problems, primarily based on his 2017 book. In the second two weeks, the focus shifted to making software more reliable. For this, Trivedi invited 21 other experts, each with their own specific focus, to contribute to the comprehensive crash course.

The participants were largely unfamiliar with the material and methods Trivedi and his colleagues covered, so Trivedi said he feels encouraged knowing the whole gamut was professionally recorded for posterity, and that future GBSD engineers will also be able to tap into that wealth of knowledge.  

“Software reliability is going to continue to be a problem,” warned Trivedi, “and we need to continue to put our heads together to look at all the different methods available to us and combine them in new ways to get the best and most reliable software possible.”
