The hottest Monitoring Substack posts right now

And their main takeaways
Category: Top Technology Topics
Certo Modo 0 implied HN points 28 Apr 23
  1. Ensure your on-call rotation is sufficiently staffed to prevent burnout and allow a timely response to incidents.
  2. Avoid delegating on-call responsibilities to another team to maintain a tight feedback loop and incentivize problem-solving.
  3. Have everyone on the team participate in the on-call rotation to promote empathy, reliability, and collective care for system stability.
Certo Modo 0 implied HN points 20 Apr 23
  1. Alerting in incident management notifies the team of production problems so they can respond promptly, with urgency determined by severity level.
  2. When setting up alerting mechanisms, consider categorizing alerts into pages for emergencies, tickets for best effort during business hours, and logs that require no response.
  3. Craft actionable alerts by enriching them with context like graphs, log entries, and links to runbooks. Test new alerts thoroughly before directing them to the on-call team.
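To make the page/ticket/log split above concrete, here is a minimal Python sketch of severity-based alert routing; the alert fields, severities, and URLs are hypothetical illustrations, not taken from the original post.

```python
from dataclasses import dataclass

# Hypothetical severities matching the page / ticket / log split described above.
SEVERITY_ACTIONS = {
    "page": "notify on-call immediately",
    "ticket": "open a ticket for business hours",
    "log": "record only, no response required",
}

@dataclass
class Alert:
    name: str
    severity: str        # "page", "ticket", or "log"
    summary: str
    runbook_url: str     # context that makes the alert actionable
    dashboard_url: str

def route(alert: Alert) -> str:
    """Return a human-readable routing decision for an incoming alert."""
    action = SEVERITY_ACTIONS.get(alert.severity, SEVERITY_ACTIONS["log"])
    return (f"[{alert.severity.upper()}] {alert.name}: {alert.summary}\n"
            f"  action:  {action}\n"
            f"  runbook: {alert.runbook_url}\n"
            f"  graphs:  {alert.dashboard_url}")

# Example: an emergency alert enriched with a (hypothetical) runbook link.
print(route(Alert(
    name="HighErrorRate",
    severity="page",
    summary="5xx rate above 5% for 10 minutes",
    runbook_url="https://wiki.example.com/runbooks/high-error-rate",
    dashboard_url="https://grafana.example.com/d/http-errors",
)))
```

In practice the same enrichment (runbook and dashboard links) would usually live in the alerting rule's annotations so it arrives with the notification itself.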
Certo Modo 0 implied HN points 13 Apr 23
  1. Having a well-defined escalation policy is crucial for effectively addressing production issues that monitoring may not catch. This policy should outline steps to take when the on-call team cannot resolve an issue.
  2. Creating a team page with essential information, such as how to ask for help, what counts as an emergency, and the team's responsibilities, helps guide the decision to escalate an issue and wake the on-call staff if needed.
  3. In larger organizations, centralizing the escalation process by creating a common document with links to different teams, and using consistent tools for escalations, can streamline and speed up the incident resolution process.
Certo Modo 0 implied HN points 10 Apr 23
  1. Monitoring is a crucial aspect of incident management to detect issues quickly and efficiently.
  2. Top-level metrics like Service Level Indicators (SLIs) and operational metrics provide valuable insights into system health.
  3. Data for monitoring can come from time series data, logs, and traces, and visualization tools like Grafana help in analyzing and interpreting this data effectively.
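As a rough illustration of deriving an SLI from time series data, the sketch below evaluates an availability ratio through the Prometheus HTTP query API; the server URL and metric names are assumptions, not details from the post.

```python
import requests

# Hypothetical Prometheus endpoint and metric names; adjust to your environment.
PROMETHEUS_URL = "http://prometheus.example.com:9090"

# An availability SLI: fraction of requests that did not return a 5xx over 30 minutes.
SLI_QUERY = (
    'sum(rate(http_requests_total{code!~"5.."}[30m]))'
    ' / sum(rate(http_requests_total[30m]))'
)

def query_sli() -> float:
    """Evaluate the SLI expression via the Prometheus HTTP API (/api/v1/query)."""
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": SLI_QUERY},
        timeout=10,
    )
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else float("nan")

if __name__ == "__main__":
    availability = query_sli()
    print(f"30m availability SLI: {availability:.4%}")
```

The same ratio can be charted directly in Grafana, which is typically where such SLIs are visualized.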
Certo Modo 0 implied HN points 14 Feb 23
  1. Observability tools provide metrics, dashboards, and notifications without software licensing fees.
  2. Some observability tools focus on cloud-native infrastructure, making setup challenging for non-cloud businesses.
  3. O11y-in-a-box simplifies monitoring by providing Prometheus, Loki, and Grafana for performance, availability, log analysis, and alerting on a single-host system.
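For the Prometheus half of such a single-host stack to have something to scrape, a service needs to expose metrics; below is a minimal sketch using the prometheus_client library, with illustrative metric names.

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Illustrative metric names; Prometheus scrapes them from /metrics on port 8000.
REQUESTS = Counter("app_requests_total", "Total requests handled", ["status"])
LATENCY = Histogram("app_request_duration_seconds", "Request latency in seconds")

@LATENCY.time()
def handle_request() -> None:
    # Simulate work and an occasional failure so the dashboards have something to show.
    time.sleep(random.uniform(0.01, 0.1))
    status = "500" if random.random() < 0.05 else "200"
    REQUESTS.labels(status=status).inc()

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for the local Prometheus to scrape
    while True:
        handle_request()
```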
Joshua Gans' Newsletter 0 implied HN points 12 Apr 21
  1. Even as vaccination rates increase, rapid screening remains crucial for preventing outbreaks, because vaccines and screening complement each other in controlling the spread of Covid-19.
  2. The effectiveness of rapid screening in reducing outbreak risk grows significantly as the percentage of vaccinated individuals rises, underscoring the value of combining vaccination with screening.
  3. Waning immunity after vaccination, especially among older populations, could allow Covid-19 to re-emerge, which calls for a surveillance plan to monitor immunity levels in the vaccinated population and address any possible resurgence.
Joshua Gans' Newsletter 0 implied HN points 21 Aug 20
  1. Testing sewage for the novel coronavirus can help in early detection of outbreaks before they spread widely.
  2. Analyzing sewage can provide valuable information about the presence of infectious diseases in a population, and monitoring waste patterns could lead to new public health insights.
  3. Challenges in using sewage testing for surveillance include factors like rainwater affecting the virus presence, variations in viral material survival, and the need for careful data interpretation.
AnyCable Broadcasts 0 implied HN points 01 Mar 23
  1. The project successfully migrated a critical GPS tracking service from Elixir to AnyCable, enabling real-time features and smoother maintenance.
  2. The team optimized the infrastructure using AWS ECS, Fargate, and CloudFormation, delivering improvements in performance, scalability, and resource management.
  3. AnyCable deployment was streamlined within the project's infrastructure, bringing in monitoring features and helping speed up the CI/CD pipeline.
Redwood Research blog 0 implied HN points 10 Jun 24
  1. Access to powerful AI could significantly simplify computer security by automating monitoring and flagging suspicious activities before they cause harm.
  2. Trust displacement, that is, using AI instead of humans for tasks that would pose security risks if performed by people, can strengthen security measures.
  3. Fine-grained permission management with AI could improve security by efficiently handling complex security policies that humans find cumbersome.
Redwood Research blog 0 implied HN points 07 May 24
  1. Managing catastrophic misuse of powerful AIs requires strategies to ensure they refuse tasks with potential for harm.
  2. Dealing with bioterrorism misuse may involve creating separate API endpoints, stringent user checks, and monitoring for suspicious activities.
  3. Mitigating large-scale cybercrime with AI may involve monitoring, human auditing, and banning users based on suspicious behavior.
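The monitor/audit/ban loop in the last point could look roughly like the sketch below; the scoring function, thresholds, and strike policy are placeholders for illustration, not Redwood's actual proposal.

```python
from collections import defaultdict

# Illustrative thresholds; not from the original post.
AUDIT_THRESHOLD = 0.7    # score above which a request is flagged for human review
STRIKE_LIMIT = 3         # flagged requests before a user is banned

strikes: dict[str, int] = defaultdict(int)
banned: set[str] = set()

def suspicion_score(prompt: str) -> float:
    """Placeholder for a trusted monitor that scores request risk in [0, 1]."""
    risky_terms = ("synthesize", "exploit", "bypass")
    return min(1.0, sum(term in prompt.lower() for term in risky_terms) / len(risky_terms))

def handle(user_id: str, prompt: str) -> str:
    if user_id in banned:
        return "refused: account banned"
    if suspicion_score(prompt) >= AUDIT_THRESHOLD:
        # In a real system this would enqueue the transcript for human auditing.
        strikes[user_id] += 1
        if strikes[user_id] >= STRIKE_LIMIT:
            banned.add(user_id)
            return "refused: banned after repeated suspicious requests"
        return "refused: flagged for human review"
    return "ok: request served"

print(handle("u1", "How do I bypass the exploit mitigations and synthesize a payload?"))
```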
realkinetic 0 implied HN points 19 Feb 24
  1. Transitioning from on-premises to cloud environments requires a shift in monitoring practices, avoiding traditional data center-focused metrics that may not apply well to cloud-native systems.
  2. Select SLIs based on the customer experience, focusing on key metrics like traffic rate, error rate, and latency that directly impact user satisfaction.
  3. Ensure SLIs are user-centric so monitoring proactively tracks and improves the customer experience, rather than being distracted by metrics that do not reflect actual user needs.
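As a small illustration of the three user-facing SLIs named above, the sketch below computes traffic rate, error rate, and p95 latency from a handful of hypothetical request records (in practice these would come from an access log or tracing backend).

```python
from statistics import quantiles

# Hypothetical request records over a one-minute window.
requests = [
    {"status": 200, "latency_ms": 87},
    {"status": 200, "latency_ms": 120},
    {"status": 503, "latency_ms": 45},
    {"status": 200, "latency_ms": 340},
    {"status": 200, "latency_ms": 95},
]
window_seconds = 60

# The three user-facing SLIs, computed over the window.
traffic_rate = len(requests) / window_seconds                            # requests per second
error_rate = sum(r["status"] >= 500 for r in requests) / len(requests)   # fraction of failures
p95_latency = quantiles([r["latency_ms"] for r in requests], n=20)[-1]   # 95th percentile

print(f"traffic: {traffic_rate:.2f} req/s, errors: {error_rate:.1%}, p95: {p95_latency:.0f} ms")
```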
realkinetic 0 implied HN points 08 Sep 20
  1. Identify critical systems before introducing chaos engineering to ensure the most impact on the business.
  2. Focus on testing critical components first, particularly those dealing with state, before moving on to less critical systems.
  3. Chaos engineering is an iterative process that should be performed in non-production environments first, with an aim towards ultimately testing in production.
realkinetic 0 implied HN points 06 Jul 20
  1. Chaos testing helps understand how systems react to failure and ensures adequate monitoring for resilience.
  2. The goals of chaos testing include aligning system behavior with expectations and identifying gaps in monitoring and response capabilities.
  3. Performing chaos engineering involves defining steady-state metrics, forming hypotheses, running experiments, and adapting based on findings.
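The steady-state/hypothesis/experiment loop described above might be sketched as follows; the metric read and the fault injection are placeholders you would wire up to your own monitoring and tooling.

```python
import random

def error_rate() -> float:
    """Placeholder for reading the steady-state metric from your monitoring system."""
    return random.uniform(0.0, 0.01)

def inject_failure() -> None:
    """Placeholder for the actual fault injection (e.g. terminating an instance)."""
    print("injecting failure: terminating one instance of the target service")

def run_experiment(tolerance: float = 0.02) -> None:
    # 1. Define steady state: the error rate observed before any fault is injected.
    baseline = error_rate()
    # 2. Hypothesis: injecting the failure keeps the error rate within `tolerance` of baseline.
    inject_failure()
    observed = error_rate()
    # 3. Compare the observation against the hypothesis and record the finding.
    if observed - baseline <= tolerance:
        print(f"hypothesis held: {baseline:.3%} -> {observed:.3%}")
    else:
        print(f"hypothesis failed: {baseline:.3%} -> {observed:.3%}; investigate monitoring gaps")

run_experiment()
```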
realkinetic 0 implied HN points 03 Oct 19
  1. In microservice architectures, the conversation shifts from traditional monitoring to observability due to the complexity of multiple services interacting dynamically.
  2. In static monolithic architectures, monitoring is more straightforward with a single runtime and centralized telemetry.
  3. Observability offers deeper insight into system behavior by letting engineers explore unanticipated questions after the fact, providing more context and finer granularity than traditional monitoring.
realkinetic 0 implied HN points 29 Jan 19
  1. Google Stackdriver provides free uptime checks for monitoring service availability and response latency across regions.
  2. Implementing Stackdriver uptime checks with Cloud Identity-Aware Proxy can be challenging due to authentication requirements.
  3. A workaround solution involves using Google Cloud Functions as a proxy to authenticate Stackdriver uptime checks for IAP-protected resources.
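A minimal sketch of that proxy idea: an HTTP Cloud Function obtains an OIDC ID token for the IAP audience and forwards the check to the protected backend. The client ID and target URL below are hypothetical, and the ID-token helper shown is a modern google-auth convenience rather than necessarily what the original post used.

```python
import requests
from google.auth.transport.requests import Request
from google.oauth2 import id_token

# Hypothetical values: the OAuth client ID of the IAP-protected backend and its health URL.
IAP_CLIENT_ID = "1234567890-example.apps.googleusercontent.com"
TARGET_URL = "https://internal.example.com/healthz"

def uptime_proxy(request):
    """HTTP Cloud Function: authenticates to IAP on behalf of the uptime check."""
    # Obtain an OIDC ID token for the IAP audience using the function's service account.
    token = id_token.fetch_id_token(Request(), IAP_CLIENT_ID)
    resp = requests.get(TARGET_URL, headers={"Authorization": f"Bearer {token}"}, timeout=10)
    # Relay the backend's status so the uptime check sees failures as failures.
    return (resp.text, resp.status_code)
```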
realkinetic 0 implied HN points 12 Sep 18
  1. Systems are now more distributed and dynamic due to the rise of cloud and containers, requiring new tools and practices to support them.
  2. Observability in modern cloud-native environments means gathering structured logs, metrics, traces, and events to enable granular insight and more effective debugging.
  3. Building an observability pipeline decouples data collection from ingestion into downstream systems, allowing tools to be added or replaced without major disruption.
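A toy version of such a pipeline, decoupling event emission from the sinks that ingest the events, could look like the sketch below; the sinks shown are stand-ins for whatever backends you actually ship to.

```python
import json
import time
from typing import Callable

# A sink is any callable that accepts a structured event; adding or replacing sinks
# does not require changing the code that emits events.
Sink = Callable[[dict], None]

class Pipeline:
    def __init__(self) -> None:
        self._sinks: list[Sink] = []

    def add_sink(self, sink: Sink) -> None:
        self._sinks.append(sink)

    def emit(self, **event) -> None:
        event.setdefault("ts", time.time())
        for sink in self._sinks:
            sink(event)

# Illustrative sinks: stdout for local debugging, plus a stand-in for a log backend.
pipeline = Pipeline()
pipeline.add_sink(lambda e: print(json.dumps(e)))
pipeline.add_sink(lambda e: None)  # e.g. replace with a shipper to Loki or Elasticsearch

pipeline.emit(event="order_created", order_id=42, latency_ms=118, level="info")
```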