Gremlin can now automatically find common reliability issues

August 30, 2023 ndowd

Gremlin, the reliability testing startup best known for its chaos engineering tools, today announced the launch of its Detected Risks feature. With this, Gremlin can now automatically identify high-priority reliability issues like misconfiguration or bad default values in Kubernetes-based services and then categorize them by the severity of the risk they present. The service will also suggest potential fixes.

“Reliability continues to grow in importance,” said Kolton Andrus, CTO and founder of Gremlin. “Our digital infrastructure is as important as our physical infrastructure. Government, healthcare, transportation, communication and finance all rely on this digital foundation, and it has risks. Fortunately, many of these risks are simple to mitigate — if they are known. That is why we are excited to announce our new Detected Risks. We have worked hard to quickly expose serious issues within our customers’ systems, risks that they can then mitigate to qualitatively improve the posture of their systems.”

Image Credits: Gremlin

Whereas Gremlin’s chaos engineering tools look for unusual situations that can push a company’s infrastructure to its limits, Detected Risks uses a set of pre-configured tests, with 20 more coming later this year. These tests check for common issues that can affect how reliable and resilient a company’s infrastructure really is. Detected Risks works without users having to run chaos engineering experiments or reliability tests.

For the most part, these tests are straightforward and encapsulate best practices, such as checking that a deployment is configured to run in multiple availability zones for redundancy. That may seem like common sense, but in looking at the thousands of deployments that its customers run, Gremlin found that 26% had no redundancy and 80% of deployments did not have two redundancies. The company notes that the system also looks for common Kubernetes misconfigurations, such as those that could affect autoscaling.
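
To make the availability-zone check concrete, here is a minimal sketch of what such a test might look like. It is an illustration only and not Gremlin's implementation: the function names, the `min_zones` threshold, and the input shape (deployment name mapped to the zone of each running replica) are all assumptions made for this example.

```python
# Illustrative sketch only; not Gremlin's implementation.
# Assumed input: a mapping of deployment name -> list of zone names,
# one entry per running replica (for example, taken from each node's
# topology.kubernetes.io/zone label in a Kubernetes cluster).


def distinct_zones(replica_zones):
    """Return the set of non-empty zone names the replicas run in."""
    return {zone for zone in replica_zones if zone}


def lacks_zone_redundancy(replica_zones, min_zones=2):
    """True when the replicas span fewer than min_zones availability zones."""
    return len(distinct_zones(replica_zones)) < min_zones


def flag_deployments(deployments, min_zones=2):
    """Return the sorted names of deployments that fail the zone check."""
    return sorted(
        name
        for name, zones in deployments.items()
        if lacks_zone_redundancy(zones, min_zones)
    )
```

Under these assumptions, a deployment whose replicas all sit in one zone (or that has no running replicas at all) would be flagged, while one spread across two or more zones would pass. Raising `min_zones` to 3 would express a stricter policy.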

“Our industry has many bright SREs working hard to personally mitigate these issues, but that approach doesn’t scale,” said Andrus. “We are solving this problem by building something easy to use that provides valuable insight across thousands of real-world applications. Providing engineering leadership with visibility into existing risks helps them prioritize and accomplish this important work so that they can continue to protect the customer experience and build high-quality software.”
