How to turn chaos into clarity with Investigation Docs as an engineer
Guest post by Karthik Subramanian, Software Engineer at Rippling, ex-Pinterest
Hi fellow High Growth Engineer, Jordan here 👋
Today’s article features Karthik Subramanian, software engineer at Rippling and author of Karthik’s Newsletter. I’ve been impressed with how actionable Karthik’s writing is. I’ve shared his article on how to succeed as an intern with practically every intern I’ve met 😄.
Today’s topic, however, is how to drive incident investigations and create clarity out of chaos. Strap in, this is a good one.
Without further ado, I’ll pass the mic 🎤 to Karthik 👏
“Hey Karthik, we've noticed a sharp decline in our core metrics across the board over the past few days. Any idea what's going on?”
Alarm bells 🚨 fired off in my head. I felt my stomach twist into knots. A critical data pipeline for downstream operations had suddenly stopped working. The pressure was on to figure out the cause and fix it.
Initially, I felt confident, running queries and reviewing dashboards like a detective sifting through clues. But as each new piece of data seemed to contradict my previous theories, that confidence quickly became frustration. I was trapped in a loop, chasing my tail while the clock ticked away mercilessly.
By 2 AM, I was surrounded by a virtual war room of Slack threads and scattered query notebook cells, with no clear answers. It was chaos.
But what if there was a tool that could not only help solve your current crisis, but also supercharge your career growth?
Enter the investigation doc—a simple, yet powerful weapon in your problem-solving arsenal.
“Ugh, another document to write?” you may say, but trust me, this could be the key to transforming how you tackle complex issues and earn the trust of your team.
In this article, we’ll explore:
Why investigation docs are game-changers for debugging
How investigation docs led to improving a production system
Ready to turn chaos into clarity? Let’s dive in and discover how a little documentation can make you the go-to problem solver on your team.
🤔 Why Write an Investigation Doc?
As software engineers, we often encounter challenges ranging from tricky bugs to full-blown production alerts that affect stakeholders. In these moments, writing an investigation doc provides the following benefits:
Clarity in Writing: You’re forced to summarize and articulate the issue in a structured format.
Easy Distribution: A well-written doc can be easily shared with team members, stakeholders, and even future you, ensuring everyone is on the same page.
Demonstrates Ownership: By documenting your problem-solving process, you show end-to-end ownership of issues. This builds trust within your team and organization and prepares you for more complex and ambiguous projects in your career.
Learning Tool: Investigation docs serve as excellent learning resources for the team, helping prevent similar issues in the future and speeding up resolution times.
Investigation Docs act as a catalyst for your growth!
📝 What It Looks Like
The components of an investigation doc usually include the following:
TLDR: A brief summary of the issue and resolution for quick reference.
Background/Context: Outline the relevant context for the system or process in question. This section can also cover any projects that were ongoing for that system at the time.
Detection: How was the issue detected? Include a timeline if relevant.
Investigation Steps: Detail the steps taken to identify the root cause. Be specific about queries run, dashboards checked, and thought processes.
Resolution: Describe the steps taken to resolve the issue once the root cause was identified.
Learnings: Reflect on what the team learned from the investigation and how to prevent similar issues or speed up future investigations.
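If you keep your docs in plain text, the component list above can be sketched as a small skeleton generator. This is only an illustration: the section names come from the list, while `InvestigationDoc`, `render`, and `missing_sections` are hypothetical names made up for this sketch, and the sample text is a placeholder.

```python
from dataclasses import dataclass, field

# Section names follow the component list above. Everything else in
# this sketch (class and method names) is hypothetical.
SECTIONS = [
    "TLDR",
    "Background/Context",
    "Detection",
    "Investigation Steps",
    "Resolution",
    "Learnings",
]


@dataclass
class InvestigationDoc:
    title: str
    # Maps a section name to the text written so far for it.
    sections: dict[str, str] = field(default_factory=dict)

    def render(self) -> str:
        """Return the doc as plain text, marking unwritten sections as TODO."""
        lines = [self.title, ""]
        for name in SECTIONS:
            lines.append(name)
            lines.append(self.sections.get(name, "TODO"))
            lines.append("")
        return "\n".join(lines)

    def missing_sections(self) -> list[str]:
        """List the sections that still need to be filled in."""
        return [name for name in SECTIONS if not self.sections.get(name)]


# Placeholder usage: start a doc, fill in one section, see what is left.
doc = InvestigationDoc(title="Investigation: <short description of the issue>")
doc.sections["Detection"] = "<how the issue was noticed, with a timeline>"
print(doc.render())
print("Still to write:", doc.missing_sections())
```

Starting from a skeleton like this makes the empty sections visible, which is a useful nudge to fill in the TLDR and Learnings once the fire is out.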
Let’s break down each section with an example! 🎬
💡Example: The Case of the Missing Donation Receipts
We’ll examine this example doc. It would be difficult to analyze the full document in this post, so I’ll highlight the two most important sections: the “TLDR” and the “Investigation Process.”
In the TLDR, your goals are:
Describe the issue so anyone can understand
Summarize the root cause
Highlight the business impact and how you’ll resolve it
In the “Investigation Process” section, your goals are to:
Keep track of what you’ve tried so you can constantly move forward
Record how you solved the issue so others can handle similar issues in the future
You can see below that I keep track of each step I took, which helps me avoid being drowned in different theories and keep a clear step-by-step thought process.
To see each section analyzed, head to this part of the example doc.
How and When I Use the Investigation Doc
In a scenario similar to the one above, I was drowning in a flurry of Slack threads and scattered query notebook cells, with no clear path forward.
I wasted valuable time retesting the same hypotheses and lacked a systematic approach to zero in on the root cause. The ideas and rabbit holes were all in my head!
Investigation Docs turned confusion into a solution!
They forced me to step back, clearly document the issue at hand, and list out the investigation steps I’d already taken. This process helped me organize my thoughts and quickly identify new hypotheses to test. Eventually, I was able to pinpoint the root cause.
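The habit of listing each step before trying the next one can be sketched as a tiny log that refuses to record the same hypothesis twice. This is a minimal sketch, not a tool I actually used: `Step`, `InvestigationLog`, and the sample hypotheses are all made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    hypothesis: str  # what you suspected
    evidence: str    # the query, dashboard, or log you checked
    outcome: str     # "confirmed", "ruled out", or "inconclusive"


@dataclass
class InvestigationLog:
    steps: list[Step] = field(default_factory=list)

    def already_tested(self, hypothesis: str) -> bool:
        """True if this hypothesis was recorded before (case-insensitive)."""
        key = hypothesis.strip().lower()
        return any(s.hypothesis.strip().lower() == key for s in self.steps)

    def record(self, hypothesis: str, evidence: str, outcome: str) -> bool:
        """Add a step; return False instead if it would repeat earlier work."""
        if self.already_tested(hypothesis):
            return False
        self.steps.append(Step(hypothesis, evidence, outcome))
        return True

    def open_questions(self) -> list[str]:
        """Hypotheses that still need more evidence."""
        return [s.hypothesis for s in self.steps if s.outcome == "inconclusive"]


# Hypothetical entries, for illustration only.
log = InvestigationLog()
log.record("Upstream job stopped running", "scheduler dashboard", "ruled out")
log.record("Schema change broke the parser", "sample of failed rows", "inconclusive")

if not log.record("upstream job stopped running", "scheduler dashboard", "ruled out"):
    print("Already tested, move on to a new hypothesis")
print("Still open:", log.open_questions())
```

A plain list in the doc works just as well. The point is that every hypothesis gets written down with its evidence and outcome before you move on.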
When I needed to involve another engineer, I shared the doc instead of sending a giant paragraph of context in Slack. This saved time and quickly aligned us on the problem.
And as a bonus, structuring the problem-solving process using the Investigation Doc framework revealed opportunities to improve the system’s observability. I used it to add new charts to the dashboard, which made the next issue easier to handle!
📤 Sharing and Acting on Your Investigation Doc
Writing the doc is just the beginning. Now that you have an artifact, you can share it to maximize learning, impact, and prevention of future issues.
Share Widely: Present your findings in team meetings or relevant threads/channels.
Invite Feedback: Encourage team members to comment and provide additional insights. This collaborative approach can uncover blind spots and lead to even better solutions.
Create Action Items: Turn your learnings into concrete tasks. In the example investigation doc, some action items might include:
Improve observability of the Receipt Generation step of the donation processing pipeline → Add charts and alerting to the existing dashboard
Improve test coverage of the entire donation flow → Enhance unit tests and end-to-end testing of the system
Uplevel team’s knowledge base of the Receipt Generation system → Set up a lunch-and-learn session, update wikis/documentation
Follow Up: Regularly revisit your action items to ensure they're being implemented. This shows dedication to continuous improvement and helps prevent similar issues in the future.
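To make the follow-up step concrete, here is a minimal sketch that pairs each learning with its task and lists whatever is still open. The three items are the ones from the example above; `ActionItem`, `open_items`, and the `done` flag are hypothetical, and in practice your team's ticket tracker would play this role.

```python
from dataclasses import dataclass


@dataclass
class ActionItem:
    learning: str       # what the investigation revealed
    task: str           # the concrete follow-up work
    done: bool = False


def open_items(items: list[ActionItem]) -> list[ActionItem]:
    """Return the action items that still need follow-up."""
    return [item for item in items if not item.done]


# The three items below are taken from the example list above.
items = [
    ActionItem(
        "Improve observability of the Receipt Generation step",
        "Add charts and alerting to the existing dashboard",
    ),
    ActionItem(
        "Improve test coverage of the entire donation flow",
        "Enhance unit tests and end-to-end testing of the system",
    ),
    ActionItem(
        "Uplevel team's knowledge base of the Receipt Generation system",
        "Set up a lunch-and-learn session, update wikis/documentation",
    ),
]

# Mark one item complete, then list what remains for the next follow-up.
items[0].done = True
for item in open_items(items):
    print(f"OPEN: {item.learning} -> {item.task}")
```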
📖 TL;DR
Investigation docs drive systematic problem solving, knowledge sharing, and career growth. You can use them for issues ranging from small bugs to big production issues.
An investigation doc generally includes the following components:
TLDR
Background/Context
Detection
Investigation Steps
Resolution/Next Steps
By taking the time to document your investigative process, you’re not just solving today’s problem. You’re also setting yourself and your team up for success in tackling future challenges.
To start adding investigation docs to your engineering toolbox, check out this template.
🙏 Thank you to Karthik
Thank you again to Karthik for the highly actionable article on creating structure and clarity in intense situations at work, and for creating a full template and example we can all use. It also reminds me of the section on investigation updates in the popular MECE Framework article. The template you gave provides a structured doc for applying MECE.
Check out his newsletter if you’d like to see more from Karthik.
👏 Shout-outs of the week
The 13 Software Engineering Laws by Anton Zaides: One of the most viral tech articles this year. It’s the best reference on software engineering laws you’ve heard throughout your career, explaining how and why to use them.
How to collaborate cross-functionally by Torsten Walbaum: Decades of experience packed into a single article of actionable advice on working effectively across teams. A must-read.
How to use Cursor AI to build side projects: Check this collaboration post between Gregor, me, and Sidwyn Koh, which shows you the top ways to use Cursor IDE to build quickly.
Thank you for reading and getting us to 90k subscribers! INSANE! I appreciate you 🙏
You can also hit the like ❤️ button at the bottom of this email to support me or share it with a friend to earn referral rewards. It helps me a ton!
Loved this write up on investigation docs.
I follow a similar pattern for post mortems / retros, but this felt like a new idea to use it while I’m investigating and after.
One trick I’ve been using recently: if I need to give the team quick updates while investigating, say during an incident, I’ll take notes with links to Datadog / Sumo in Slack, and then afterwards I’ll ask Claude to create a summary for me to share with others.
I can see giving Claude this template to follow for the summary!