The hottest Incident Management Substack posts right now

Denial of service (DoS) attacks aim to overwhelm a system with traffic, rendering it inaccessible. Robust security operations center capabilities are crucial for detecting and mitigating these attacks effectively.
Microsoft Sentinel offers tools like analytics rules, incident management, and threat intelligence integration for detecting and responding to DoS attacks in real time.
To mitigate DoS attacks, organizations can leverage network traffic monitoring, DDoS protection integration, and incident response playbooks offered by Microsoft Sentinel.

SBOMs, or Software Bills of Materials, list the components of software products. They help organizations know what parts make up their software, which is important for security.
The NSA offers guidelines for managing SBOMs, emphasizing the need for both software suppliers and consumers to take security seriously. Suppliers should be transparent and accountable, while consumers should ensure their suppliers follow good security practices.
Organizations need effective SBOM tools that can manage and analyze software components, detect vulnerabilities, and facilitate easy reporting. These tools should also be user-friendly to help teams work efficiently.

At Netflix, there was a serious concurrency bug causing CPU problems, and they needed a quick solution. They couldn't fix it right away and had to come up with a way to keep their systems running through the weekend.
Instead of manually fixing everything, they created a self-healing system. They randomly killed a few server instances every 15 minutes, replacing them with fresh ones, which allowed the team to relax during the crisis.
This situation taught them that sometimes unconventional solutions are necessary. Prioritizing the team's well-being can be just as important as fixing technical issues.
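The periodic replacement described above can be sketched roughly as follows. This is a minimal illustration, not Netflix's actual tooling: the `terminate` callback, the instance IDs, and the batch size of 3 are all hypothetical, and the sketch assumes something like an autoscaling group launches fresh instances to replace the terminated ones.

```python
import random


def pick_instances_to_replace(instance_ids, count, rng=random):
    """Pick up to `count` distinct instances at random for replacement."""
    count = min(count, len(instance_ids))
    return rng.sample(list(instance_ids), count)


def rotation_round(instance_ids, count, terminate, rng=random):
    """Run one round: terminate a random subset and return the chosen IDs.

    `terminate` is a hypothetical callback. In a real fleet it would call
    the cloud provider, and replacement capacity is assumed to come from
    an autoscaling mechanism outside this sketch.
    """
    victims = pick_instances_to_replace(instance_ids, count, rng)
    for instance_id in victims:
        terminate(instance_id)
    return victims


if __name__ == "__main__":
    fleet = [f"i-{n:04d}" for n in range(20)]  # made-up instance IDs
    terminated = []
    rotation_round(fleet, 3, terminated.append)
    print("terminated this round:", terminated)
    # A scheduler (cron, or a loop with time.sleep(15 * 60)) would repeat
    # this round every 15 minutes.
```

Keeping the batch small relative to the fleet is what makes this safe: only a few instances are out of service at any moment, while every instance is eventually recycled before the bug can exhaust its CPU.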

Microsoft Security Copilot enhances security by seamlessly integrating with Microsoft Purview, simplifying security policies and governance.
The AI capabilities of Microsoft Security Copilot aid in proactive threat detection and response by analyzing data to identify potential risks before they escalate.
Automated compliance and data governance processes are streamlined through the combination of Microsoft Purview's features and Security Copilot's automation, facilitating adherence to regulations.

Receive an email notification each morning with the list of daily Microsoft Sentinel incidents created.
The Logic App provided automates the process of checking and compiling incident details for easy access.
Customize the email notification further by filtering incidents based on severity levels for more targeted updates.
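The severity filtering mentioned above might look something like this in plain Python. It is only a sketch of the filtering logic, not the Logic App itself: the `Incident` class, its field names, and the sample data are assumptions made for illustration, and the severity names are assumed to match the levels Sentinel uses.

```python
from dataclasses import dataclass

# Assumed severity levels, ordered from least to most severe.
SEVERITY_ORDER = {"Informational": 0, "Low": 1, "Medium": 2, "High": 3}


@dataclass
class Incident:
    # Hypothetical shape; a real incident record carries many more fields.
    number: int
    title: str
    severity: str


def filter_by_severity(incidents, minimum="Medium"):
    """Keep incidents at or above `minimum`; unknown severities are dropped."""
    threshold = SEVERITY_ORDER[minimum]
    return [i for i in incidents if SEVERITY_ORDER.get(i.severity, -1) >= threshold]


def build_digest(incidents, minimum="Medium"):
    """Build the plain-text body of a daily digest, most severe first."""
    selected = filter_by_severity(incidents, minimum)
    if not selected:
        return f"No incidents at severity {minimum} or above."
    lines = [f"{len(selected)} incident(s) at severity {minimum} or above:"]
    for inc in sorted(selected, key=lambda i: -SEVERITY_ORDER[i.severity]):
        lines.append(f"- [{inc.severity}] #{inc.number} {inc.title}")
    return "\n".join(lines)


if __name__ == "__main__":
    sample = [
        Incident(1, "Port scan", "Low"),
        Incident(2, "SYN flood", "High"),
        Incident(3, "Odd login", "Medium"),
    ]
    print(build_digest(sample, minimum="Medium"))
```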

Use a dead letter queue (DLQ) to recover from failures and get notified in fault-tolerant systems.
Set up the DLQ pattern with two queues: the original queue for processing messages and the DLQ to capture failed messages for retries.
Implement monitoring for DLQ growth to alert on-call engineers and prevent issues from affecting customers.
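The two-queue setup and the growth alert can be illustrated with an in-memory sketch. Everything here is an assumption made for the example: real systems would use a message broker rather than `deque`, and the retry budget of 3, the alert threshold of 2, and the `example_handler` function are hypothetical values chosen to keep the demo small.

```python
from collections import deque

MAX_ATTEMPTS = 3         # assumed retry budget per message
DLQ_ALERT_THRESHOLD = 2  # assumed DLQ size at which on-call is alerted


def process_queue(queue, dlq, handler, max_attempts=MAX_ATTEMPTS):
    """Drain `queue`; messages that fail `max_attempts` times go to `dlq`.

    Queue entries are (message, attempts_so_far) tuples.
    """
    processed = []
    while queue:
        message, attempts = queue.popleft()
        try:
            handler(message)
            processed.append(message)
        except Exception as exc:
            attempts += 1
            if attempts >= max_attempts:
                # Keep the error alongside the message for later inspection.
                dlq.append((message, repr(exc)))
            else:
                queue.append((message, attempts))
    return processed


def dlq_needs_alert(dlq, threshold=DLQ_ALERT_THRESHOLD):
    """Return True when the DLQ has grown enough to notify on-call."""
    return len(dlq) >= threshold


def example_handler(message):
    """Hypothetical handler that rejects messages starting with 'bad'."""
    if message.startswith("bad"):
        raise ValueError(f"cannot parse {message}")


if __name__ == "__main__":
    queue = deque((m, 0) for m in ["ok-1", "bad-1", "ok-2", "bad-2"])
    dlq = deque()
    done = process_queue(queue, dlq, example_handler)
    print("processed:", done)
    print("dead-lettered:", [m for m, _ in dlq])
    print("alert on-call:", dlq_needs_alert(dlq))
```

Failing messages are moved aside instead of blocking the main queue, so healthy messages keep flowing while the failed ones wait in the DLQ to be inspected and replayed.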

Staff your on-call rotation sufficiently to prevent burnout and ensure a timely response to incidents.
Avoid delegating on-call responsibilities to another team to maintain a tight feedback loop and incentivize problem-solving.
Have everyone on the team participate in the on-call rotation to promote empathy, reliability, and a collective care for system stability.

Having a well-defined escalation policy is crucial for effectively addressing production issues that monitoring may not catch. This policy should outline steps to take when the on-call team cannot resolve an issue.
Creating a team page with essential information like how to ask for help, defining emergencies, and team responsibilities helps guide the decision on escalating an issue and waking up the on-call staff if needed.
In larger organizations, centralizing the escalation process by creating a common document with links to different teams, and using consistent tools for escalations, can streamline and speed up the incident resolution process.
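One way to make an escalation policy concrete is to write it down as ordered tiers with acknowledgement timeouts. The sketch below is illustrative only: the tier names, the wait times, and the `current_contact` helper are hypothetical and do not come from any particular paging tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    contact: str
    wait_minutes: int  # time allowed for an acknowledgement before escalating


# Hypothetical policy; real contacts and timings depend on the team.
POLICY = [
    Tier("primary on-call", 15),
    Tier("secondary on-call", 15),
    Tier("engineering manager", 30),
]


def current_contact(policy, minutes_unacknowledged):
    """Return who should be paged after the given time without an ack."""
    elapsed = 0
    for tier in policy:
        elapsed += tier.wait_minutes
        if minutes_unacknowledged < elapsed:
            return tier.contact
    # Past every timeout: stay with the last tier.
    return policy[-1].contact


if __name__ == "__main__":
    for minutes in (0, 20, 45):
        print(minutes, "min ->", current_contact(POLICY, minutes))
```

Writing the policy as data makes it easy to publish on a team page and to keep consistent across teams.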

Write-ups are essential after incidents to learn and improve. They help document the incident, leading to better post-mortems and prevention strategies.
Creating an effective write-up involves describing the impact, crafting a detailed timeline, and using it to tell a coherent story. Following a specific format makes understanding easier.
Understanding what triggered the incident and identifying fixes and improvements are crucial steps. Focus on blameless analysis, seek contributing factors, and fine-tune prevention strategies.

In incident management, avoid blame and focus on process and organizational factors. Blameless post-mortems are crucial.
Consider power dynamics in post-mortems. Allow a separate group to handle incidents to prevent bias and promote improvement.
Incidents rarely have a single root cause. Embrace a more complex root cause analysis to understand the multifaceted reasons behind failures.