Meet Lanting Chiang, a Software Engineer whose role has evolved to embrace Site Reliability Engineering (SRE). In this candid conversation, she shares her journey into the world of infrastructure, her experiences with on-call rotations, and practical advice for those navigating the challenges of on-call duties.
When I first started, our on-call rotation was pretty intense. It was just constant noise - pages during the day, pages during the night, plus managing a service desk where people would tag you for all sorts of things. But we've made real progress. By focusing on reliability and emphasizing testing, we've managed to significantly reduce the noise level of our alerts.
These days, things are usually fine during normal hours. We still face some challenges at night, particularly during low-traffic periods and scale-downs. Our CD pipeline can be a bit temperamental too, sometimes paging during waking hours. It's better than it used to be, but there's always room for improvement.
I'll be honest - being on-call is hard. Even when nothing's happening, I find myself stressed just anticipating that something might go wrong. I've had to develop strategies to cope with this:
I do breathing exercises and yoga, especially during my on-call weeks. One of the trickiest parts is just getting outside. It's really hard to go out and get fresh air and sunlight because I feel stressed if I'm without my laptop, and I definitely don't want to go on a walk carrying it around! I've learned to be intentional about even just stepping out my front door. I've found that there's usually a quiet period between 9 AM and 11 AM EST, so I try to do my grocery shopping during that window.
About a year and a half ago, we had an interesting incident that really stuck with me. We were using an internal authentication webhook for our Kubernetes clusters - something we'd adapted from an open-source GitHub project. What we didn't realize was that the OAuth client was owned by the original code author. We had never really questioned who owned it or where it was hosted; it just worked, so we left it alone.
Then one day, the OAuth client suddenly got deleted, and no one could authenticate to our Kubernetes clusters. It took us quite a while to figure out what was wrong because none of us were really familiar with the authentication flow. It was one of those "it just worked until it didn't" situations that taught us the importance of understanding every component in our system, even the borrowed ones.
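The lesson we took away was to actively verify external dependencies instead of trusting that they'll keep working. As a rough illustration only (the interview doesn't describe our actual setup, and the issuer URL, client ID, and secret below are placeholders, not the real ones from the incident), a small periodic check like this would have told us the OAuth client had disappeared before anyone tried to authenticate to a cluster:

```python
# Hypothetical sketch: periodically confirm that the OAuth client our
# cluster authentication depends on still exists and can issue tokens.
# All endpoint names and credentials below are illustrative placeholders.
import sys
import requests

TOKEN_URL = "https://auth.example.internal/oauth/token"  # placeholder token endpoint
CLIENT_ID = "k8s-auth-webhook"                           # placeholder client ID
CLIENT_SECRET = "REDACTED"                               # load from a secret store in practice


def oauth_client_is_healthy() -> bool:
    """Return True if the token endpoint still recognizes our client."""
    resp = requests.post(
        TOKEN_URL,
        data={
            "grant_type": "client_credentials",
            "client_id": CLIEN_ID if False else CLIENT_ID,  # keep a single source of truth
            "client_secret": CLIENT_SECRET,
        },
        timeout=10,
    )
    # A deleted or unknown client typically comes back as a 400/401
    # "invalid_client" error instead of a token.
    return resp.status_code == 200 and "access_token" in resp.json()


if __name__ == "__main__":
    if not oauth_client_is_healthy():
        # In practice, wire this into alerting rather than just exiting non-zero.
        print("OAuth client check failed: cluster auth may be broken", file=sys.stderr)
        sys.exit(1)
    print("OAuth client check passed")
```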
When I'm responding to incidents, I rely on a few key tools:
If you're just starting your on-call journey, here's what I've learned:
The biggest thing is not to get too emotionally involved. Your brain can start doing this thing where it snowballs to the worst cases - "Oh, this alert goes off. Oh, if I don't fix it. Oh, we're probably gonna be down. Oh, I'm gonna get fired." It just irrationally spins out of control. Try to focus on the problem at hand and remember that it's okay if you don't know how to fix something. There's always someone to escalate to, and you should never be afraid to do that.
I'll admit, keeping up with the fast pace of software development is challenging. I subscribe to newsletters like ByteByteGo and browse through interesting blog posts that my teammates share in our Slack channels. It's not perfect, but it helps me stay informed about what's happening in the industry.
One thing I've noticed is that people outside of tech often don't understand what being on-call really means. It's hard to explain why you need to carry your laptop everywhere, or why your phone might go off in the middle of the night. They just see someone in tech carrying their laptop around like it's a security blanket!
Being on-call is demanding, but it's also taught me a lot about systems, incident response, and most importantly, about taking care of myself while maintaining high availability. If there's one thing I want you to take away from my experience, it's that finding the right balance between readiness and self-care is crucial for surviving and thriving in an on-call role.