Certo Modo

Certo Modo is a newsletter that covers DevOps and SRE best practices, focusing on strategies for efficient systems reliability, automation, and team collaboration. It offers insights into emotional intelligence, Ansible, cloud tools, CI/CD pipelines, incident management, and observability, catering to both smaller organizations and larger companies.

DevOps/SRE Best Practices • Emotional Intelligence in Tech • Ansible Usage and Security • Lean SRE Implementation • Shell Scripting • Demonstrating Value in DevOps/SRE • Continuous Integration and Continuous Deployment • Incident Management • Cloud Infrastructure • Team Collaboration and Communication • System Monitoring and Observability • Interview and Career Advancement in Tech

The hottest Substack posts of Certo Modo

And their main takeaways

Lean SRE

19 implied HN points • 18 Mar 24

Smaller organizations and startups can benefit from implementing Site Reliability Engineering (SRE) practices, leading to reduced operational costs and time savings.
Implementing SRE practices in smaller companies may differ in approach from larger organizations, but can still yield significant benefits.
Starting an SRE program at a larger company can be achieved by beginning with just one software engineering team.

Emotional Intelligence

39 implied HN points • 09 Feb 23

🏥 Health & Wellness Emotional Intelligence Stress management Decision-making

Emotional intelligence is crucial in the DevOps/SRE space for managing emotions, reasoning, and decision-making.
Recognize and manage 'amygdala hijacks' in stressful situations at work to maintain clear thinking and avoid reactive behavior.
Understanding emotions as information from others is important in social settings for effective communication and decision-making at work.

Ansible Tips and Tricks

19 implied HN points • 03 Oct 23

🕹 Technology Security Automation Performance optimization

Organize your Ansible files by following a recommended directory structure. This helps keep things structured and manageable as your project grows.
Avoid putting secrets like credentials directly into variable files. Use Ansible Vault to encrypt sensitive information, maintaining security.
Utilize tools like Ansible-Lint for verifying playbook syntax, and the --check option in ansible-playbook for 'dry-runs' to catch errors before affecting production.

In Defense of Shell Scripts

1 HN point • 26 Feb 24

🕹 Technology Programming Development Tools

Consider using shell scripts when CLI tools are available but APIs are not; scripting the existing tools is then the more efficient route.
For quick prototypes, opt for a shell script solution to validate ideas swiftly before committing to a more complex programming language.
When developing CLI tools, prioritize speed and consider using compiled languages like Golang or Rust for efficiency.

How to Show Your Value In DevOps/SRE

0 implied HN points • 14 Dec 23

🕹 Technology DevOps Business impact Operational Efficiency Communication Strategies

Focus on demonstrating the impact of your work to the business in terms of time and money saved/made compared to what you are being paid.
Communicate the importance of your work to your peers and stakeholders by adding value propositions to your tasks, measuring impact, and tracking significant wins with supporting metrics.
Consistently delivering impactful work, improving organizational perception, and effective communication can lead to growth opportunities such as team expansion, promotions, and better job offers.

Building DroneCI Pipelines

0 implied HN points • 14 Nov 23

🕹 Technology CI/CD DevOps Software Development Monitoring Containers

Each pipeline step in DroneCI can use different container images, allowing for versatile tasks like testing across multiple platforms.
Base64-encoding secrets in DroneCI is a useful technique for handling sensitive information like SSH keys. Base64 is an encoding, not encryption, so the encoded value must still be stored as a secret.
Monitoring DroneCI pipelines can be enhanced by utilizing Prometheus to track status, duration, and using a Push Gateway to export build metrics.
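
As a minimal sketch of the Base64 technique in the takeaways above, the snippet below turns a multi-line value into a single line and decodes it back. The key text is a made-up placeholder and the function names are hypothetical; none of this is taken from the original post.

```python
import base64


def encode_secret(value: str) -> str:
    """Return a single-line Base64 string for a (possibly multi-line) secret."""
    return base64.b64encode(value.encode("utf-8")).decode("ascii")


def decode_secret(encoded: str) -> str:
    """Reverse encode_secret; anyone holding the encoded text can do this."""
    return base64.b64decode(encoded.encode("ascii")).decode("utf-8")


# Placeholder text standing in for a multi-line key (not a real key).
fake_key = "-----BEGIN FAKE KEY-----\nabc123\n-----END FAKE KEY-----\n"
encoded = encode_secret(fake_key)
```

The encoded form contains no line breaks, which makes it easier to paste into a single-line secret field, and a pipeline step can decode it back into the original file contents.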

Running DroneCI

0 implied HN points • 30 Oct 23

🕹 Technology CI/CD Continuous Integration CLI Software Engineering

Drone CI is a self-service Continuous Integration platform that simplifies CI pipeline configuration by using a .drone.yml file in the project's root directory.
Make sure the Drone service is public-facing with port 443 accessible, and use a read-through cache with Docker to ensure resilience to Docker Hub outages.
Add yourself as an administrator to new Drone installations to have full access to API features, set up cron interval as needed, and utilize the CLI tool for more advanced capabilities.

It's Time to Stop Using Jenkins

0 implied HN points • 19 Sep 23

🕹 Technology CI/CD Automation Security Alternatives

Consider alternatives to Jenkins for new software projects because of its limitations and the complexity of its plugins.
Evaluate your R&D department's expertise and resources to manage Jenkins installation and perform updates.
Assess the security risks and maintenance requirements of Jenkins installations to prevent potential breaches.

Automate Production in Three Steps!

0 implied HN points • 07 Sep 23

🕹 Technology Automation Documentation Tools Features Production

Automating manual tasks is crucial for growth, as manual work can consume time needed for innovation and advancement.
Runbooks, as documented step-by-step instructions, are key to delegating work, reducing single points of failure, and ensuring consistency in task execution.
Converting manual scripts into full-fledged software services allows for instant and automated task completion, improving efficiency and reducing human involvement.

Video: Beating Big Tech Coding Interviews

0 implied HN points • 24 Aug 23

🕹 Technology Programming Interviews

The post discusses tactics for excelling in coding interviews at big tech companies.
The video presentation offers valuable insights on preparing for challenging parts of the interview process, particularly focusing on technical aspects.
The talk was delivered at a Vegas Programmers Meetup and acts as a complement to a previous post about obtaining an SRE role.

Cloud Lessons: Launching a K3S Cluster

0 implied HN points • 04 Aug 23

🕹 Technology Cloud Computing Infrastructure as Code Kubernetes

Starting a series to explore cloud-native tools like Kubernetes can be an exciting and beneficial learning experience.
Setting up a K3S cluster on a cloud provider using Terraform for infrastructure and Ansible for configuration can be cost-effective and efficient.
Linux systems knowledge, including reading logs, writing scripts, and basic networking, is essential when encountering roadblocks during server setup and configuration.

Kanban Quickstart

0 implied HN points • 28 Jul 23

💼 Business Management Productivity Metrics Process Improvement Teamwork

Kanban is an efficient process for organizing team work, especially for interrupt-driven teams like Site Reliability Engineering, Operations, IT, or Customer Support.
Implementing Kanban provides visibility, measurability, and easy reprioritization of work, helping build credibility and trust with stakeholders.
Key elements of effectively managing Kanban include tracking all work items, creating clear work state columns, limiting work-in-progress, and using metrics to drive continuous improvement.
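
Two of the elements listed above, work-in-progress limits and metrics, can be sketched in a few lines. This is an illustration only: the column names, limits, ticket IDs, and dates are all hypothetical, and average cycle time is just one metric a team might choose.

```python
from datetime import date

# Hypothetical WIP limits per column; real limits depend on the team.
WIP_LIMITS = {"In Progress": 3, "Review": 2}


def columns_over_limit(board: dict[str, list[str]]) -> list[str]:
    """Return the names of columns holding more items than their WIP limit."""
    return [
        column
        for column, items in board.items()
        if column in WIP_LIMITS and len(items) > WIP_LIMITS[column]
    ]


def average_cycle_time_days(items: list[tuple[date, date]]) -> float:
    """Mean number of days between start and finish for completed items."""
    durations = [(done - started).days for started, done in items]
    return sum(durations) / len(durations)


board = {
    "Backlog": ["T-7", "T-8"],
    "In Progress": ["T-1", "T-2", "T-3", "T-4"],
    "Review": ["T-5"],
    "Done": ["T-6"],
}
completed = [
    (date(2023, 7, 3), date(2023, 7, 5)),
    (date(2023, 7, 4), date(2023, 7, 10)),
]
```

Here `columns_over_limit(board)` flags "In Progress", which holds four items against a limit of three, and the two completed items average four days from start to finish.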

Podcast Appearance: Day Two Cloud

0 implied HN points • 20 Jul 23

🕹 Technology Podcasts Cloud Computing Tech skills

The podcast episode discusses the role of an SRE and how it differs from a system administrator.
It highlights the importance of communication and persuasion skills for an SRE.
The conversation also delves into the relationship between SRE and DevOps, interview processes, and skills requirements.

Podcast Appearance: Practical Operations

0 implied HN points • 28 Jun 23

🎙 Podcasts Technology Business DevOps Consulting

Amin Astaneh was a guest on the Practical Operations podcast, discussing DevOps transformations and Site Reliability Engineering.
The podcast focuses on real-world use cases and solutions to common problems in systems and operations.
The hosts of the podcast are highly knowledgeable and entertaining to listen to.

Parallel Distributed Shell

0 implied HN points • 22 Jun 23

🕹 Technology Operations Incident Response

The parallel distributed shell is a CLI tool helpful for troubleshooting in large-scale systems.
It allows engineers to simultaneously run commands on multiple hosts, offering a 'break-glass' solution.
Various options exist like dsh, pdsh, hyper-shell, and AWS Run Command for managing infrastructure when traditional methods fail.
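
The core idea, running one command on many hosts at once, can be sketched with the Python standard library. This is not how dsh or pdsh are implemented; the function names are hypothetical, and the default runner assumes an `ssh` client with key-based access to each host.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def ssh_runner(host: str, command: str) -> str:
    """Run a command on one host over SSH (assumes key-based access is set up)."""
    result = subprocess.run(
        ["ssh", host, command], capture_output=True, text=True, timeout=30
    )
    return result.stdout.strip()


def run_on_hosts(
    hosts: list[str],
    command: str,
    runner: Callable[[str, str], str] = ssh_runner,
    max_workers: int = 10,
) -> dict[str, str]:
    """Run the same command on every host concurrently; return output per host."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so outputs line up with hosts.
        outputs = pool.map(lambda host: runner(host, command), hosts)
        return dict(zip(hosts, outputs))
```

Passing the runner as a parameter keeps the fan-out logic testable without real servers, for example `run_on_hosts(["web1", "web2"], "uptime", runner=fake_runner)` with any stand-in function.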

Cross-Functional Collaboration

0 implied HN points • 16 Jun 23

💼 Business Collaboration Leadership Communication Change Management

Valuable work in tech is often a collaborative effort involving different teams and perspectives.
To foster cross-functional collaboration, communicate the big picture clearly through mechanisms like RFCs and understand and align with the incentives of others.
Breaking down silos, enabling effective communication, and celebrating contributions are key to successful collaboration between diverse teams.

System Call Tracing

0 implied HN points • 09 Jun 23

🕹 Technology Performance Linux Tools Troubleshooting

System calls are how programs interact with the operating system to request and manage resources like memory and files.
System call tracing allows real-time observation of running processes to understand resource usage and behavior.
Tracing tools like strace and perf can help diagnose issues in production systems but come with a performance impact, requiring caution in usage.

How to Get an SRE Role

0 implied HN points • 01 Jun 23

🕹 Technology Engineering Skills Interviews Systems Linux

To excel in an SRE role, focus on developing important character traits like emotional intelligence, resilience, and assertiveness to stand out as a candidate.
Coding skills are essential for an SRE position; expect to be tested on tasks like file I/O, data structures, and program efficiency, so practice coding and explaining your solutions.
Systems knowledge and experience are crucial; be prepared to discuss Linux internals, troubleshooting tools, and system administration basics in interviews to showcase your expertise.

Running Post-Mortems

0 implied HN points • 23 May 23

🕹 Technology Development Meetings Incidents Moderation Presentations

Post-mortems are crucial for teams to learn from failure and improve systems.
Scheduling recurring post-mortem meetings helps in consistent learning and fostering a culture of continuous improvement.
Selecting a capable moderator and presenter, preparing for the meeting, facilitating discussions, and following up on action items are key responsibilities for successful post-mortems.

Incident Write-ups

0 implied HN points • 12 May 23

🕹 Technology Incident Management Data Analysis Software Engineering

Write-ups are essential after incidents to learn and improve. They help document the incident, leading to better post-mortems and prevention strategies.
Creating an effective write-up involves describing the impact, crafting a detailed timeline, and using it to tell a coherent story. Following a specific format makes understanding easier.
Understanding what triggered the incident and identifying fixes and improvements are crucial steps. Focus on blameless analysis, seek contributing factors, and fine-tune prevention strategies.

Slight Reliability Ep 70: Meta SRE

0 implied HN points • 09 Oct 23

🕹 Technology Podcasts

The post discusses the author's experience in Meta's Production Engineering and a story about almost causing a server room fire early in his career.
The content is part of the Slight Reliability Podcast episode 70 on Meta SRE.
The author, Amin Astaneh, shares insights and anecdotes related to reliability engineering and server management.

Incident Management: On-Call

0 implied HN points • 28 Apr 23

🕹 Technology Incident Management Alerting Monitoring

Ensure your on-call rotation is sufficiently staffed to prevent burnout and ensure a timely response to incidents.
Avoid delegating on-call responsibilities to another team to maintain a tight feedback loop and incentivize problem-solving.
Have everyone on the team participate in the on-call rotation to promote empathy, reliability, and a collective care for system stability.

Incident Management: Alerting

0 implied HN points • 20 Apr 23

🕹 Technology Monitoring Alerting Incident Response Metrics Automation

Alerting in incident management notifies the team to respond to production problems promptly based on severity levels.
When setting up alerting mechanisms, consider categorizing alerts into pages for emergencies, tickets for best effort during business hours, and logs that require no response.
Craft actionable alerts by enriching them with context like graphs, log entries, and links to runbooks. Test new alerts thoroughly before directing them to the on-call team.
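
The three categories above (pages, tickets, logs) and the idea of enriching alerts with context can be sketched as follows. The severity labels, field names, and URLs are hypothetical placeholders, not part of any particular alerting tool.

```python
from enum import Enum


class Route(Enum):
    PAGE = "page"      # emergency: notify the on-call engineer immediately
    TICKET = "ticket"  # best effort during business hours
    LOG = "log"        # recorded only, no response required


def route_alert(severity: str) -> Route:
    """Map a severity label to a route. The labels are hypothetical."""
    if severity == "critical":
        return Route.PAGE
    if severity == "warning":
        return Route.TICKET
    return Route.LOG


def enrich(alert: dict, runbook_url: str, graph_url: str) -> dict:
    """Return a copy of the alert with context links attached."""
    return {**alert, "runbook": runbook_url, "graph": graph_url}


alert = enrich(
    {"name": "DiskFull", "severity": "critical"},
    runbook_url="https://example.com/runbooks/disk-full",
    graph_url="https://example.com/graphs/disk-usage",
)
```

With this mapping, `route_alert(alert["severity"])` sends the enriched alert to the page route, while an unrecognized severity falls through to the log route so that nobody is woken up by an unclassified alert.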

Incident Management: Escalation Policy

0 implied HN points • 13 Apr 23

🕹 Technology Incident Management Monitoring

Having a well-defined escalation policy is crucial for effectively addressing production issues that monitoring may not catch. This policy should outline steps to take when the on-call team cannot resolve an issue.
Creating a team page with essential information like how to ask for help, defining emergencies, and team responsibilities helps guide the decision on escalating an issue and waking up the on-call staff if needed.
In larger organizations, centralizing the escalation process by creating a common document with links to different teams, and using consistent tools for escalations, can streamline and speed up the incident resolution process.

Incident Management: Monitoring

0 implied HN points • 10 Apr 23

🕹 Technology Monitoring Metrics Data Sources SaaS

Monitoring is a crucial aspect of incident management to detect issues quickly and efficiently.
Top-level metrics like Service Level Indicators (SLIs) and operational metrics provide valuable insights into system health.
Data for monitoring can come from time series data, logs, and traces, and visualization tools like Grafana help in analyzing and interpreting this data effectively.

Running Successful Engagements

0 implied HN points • 28 Mar 23

🕹 Technology Team Collaboration Documentation Feedback

Identify and maintain a relationship with the team's point of contact to ensure clear communication and accountability.
Prior to starting an engagement, conduct initial discovery to understand the team's operational needs and potential risks.
Create a clear engagement document outlining goals, expectations, and metrics for success, ensuring alignment with the team's objectives.

SRE Essentials

0 implied HN points • 06 Mar 23

🕹 Technology Engineering Operations Software Automation Reliability

Site Reliability Engineering (SRE) teams drive higher operational maturity, remove sources of toil, and improve service reliability.
Establishing strong SRE practices involves shared operational responsibility, measuring customer success, using error budgets to prioritize work, and learning from failures in a blameless manner.
Properly staffing on-call rotations and ensuring humane work-life balance are essential for SRE team success.
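
To make the error budget idea above concrete, here is a minimal sketch that assumes a request-based SLO; the function name and the numbers in the usage note are illustrative and do not come from the original post.

```python
def error_budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Fraction of the error budget left for the window (negative if overspent).

    slo_target is the success-rate objective, e.g. 0.999 for 99.9%.
    """
    allowed_failures = (1.0 - slo_target) * total
    if allowed_failures == 0:
        raise ValueError("a 100% target or zero traffic leaves no error budget")
    return 1.0 - failed / allowed_failures
```

For example, with a 99.9% target and 1,000,000 requests in the window, about 1,000 failures are allowed; 250 failures leave roughly three quarters of the budget. A team might use a shrinking or negative value as a signal to prioritize reliability work over new features.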

Production Readiness Review

0 implied HN points • 20 Feb 23

🕹 Technology Development Productivity Processes Reliability

Product launches are crucial and can make or break a business, depending on how well they are received by customers.
The Production Readiness Review (PRR) is a valuable process that ensures a team is fully prepared to offer a product to paying customers by evaluating operational responsibilities.
The PRR process involves creating a standardized questionnaire, delegating questions to team members, presenting and discussing findings, providing feedback, and making a go/no-go decision based on known risks before launching the product.

Observability in a Box

0 implied HN points • 14 Feb 23

🕹 Technology Monitoring Infrastructure Software Tools Alerting

Observability tools provide metrics, dashboards, and notifications without software licensing fees.
Some observability tools focus on cloud-native infrastructure, making setup challenging for non-cloud businesses.
O11y-in-a-box simplifies monitoring by providing Prometheus, Loki, and Grafana for performance, availability, log analysis, and alerting on a single-host system.

On-Call Retrospectives

0 implied HN points • 01 Feb 23

🕹 Technology Operations Improvement Team Collaboration Metrics Automation

On-call retrospectives help teams stay connected to the real operational challenges they face and provide insights on how to enhance the on-call experience.
Holding weekly meetings where team members share metrics, experiences, and discuss improvements can lead to a more efficient and enjoyable on-call rotation.
Taking notes during the retrospective and translating them into actionable tasks for improvement can result in smoother on-call shifts and increased team productivity.

Blameless Postmortems

0 implied HN points • 27 Jan 23

🕹 Technology Software Development Incident Response Engineering Culture Automation Technology Tools

Conduct blameless postmortems to learn from failures and improve system reliability.
Create an environment of psychological safety for teams to openly discuss incident factors.
Moderate postmortems by setting a blameless tone, avoiding 'human error' as a root cause, and probing underlying failures.

Podcast Appearance: All Things Ops

0 implied HN points • 18 Aug 23

🎙 Podcasts Interview Technology

A discussion about what makes the perfect Site Reliability Engineer
Insight into the reasons for and benefits of a DevOps transformation
Highlighting important tools for modern Site Reliability Engineering

On-Call Stories: Flying Blind

0 implied HN points • 20 Jul 23

🕹 Technology Incidents Alerting Operations

Facing unexpected incidents and downtime is a common challenge in tech operations.
Proactively solving issues during stressful on-call duties can lead to innovative solutions.
Implementing customized alerting systems can greatly improve incident response and team efficiency.

SRE Engagement Models

0 implied HN points • 20 Mar 23

🕹 Technology SRE Reliability Software Engineering Operations

SREs engage with software engineering organizations in different ways to help achieve goals.
Engagement models include consulting, embedded, and infra team, each with unique benefits and challenges.
Implementing SRE involves balancing tradeoffs based on challenges, budget, and organizational needs.

Hidden Benefits of SLOs

0 implied HN points • 27 Feb 23

🕹 Technology Tech

Service Level Objectives (SLOs) reveal important customer experience metrics beyond just uptime, such as latency and error rate.
Developing SLOs fosters cross-functional collaboration within an organization, breaking down silos and promoting a unified approach to reliability.
Implementing SLOs can lead to investments in improved observability, enhancing infrastructure management and monitoring for long-term operational benefits.

Slight Reliability Ep 82: CI/CD

0 implied HN points • 13 Feb 24

🕹 Technology Software Podcast

The post discusses the basics of CI/CD, including change management, running a Change Advisory Board, testing in production, and managing test/deploy infrastructure.
It highlights the importance of understanding and implementing CI/CD processes for efficient software development and deployment.
The post provides insights into how to effectively manage and optimize the test and deployment infrastructure for software projects.

Why Adopt DevOps & SRE?

0 implied HN points • 05 May 23

🕹 Technology DevOps SRE Software Development

In software development, the goal is to make money by increasing subscriptions and shipping code quickly while minimizing operational costs.
DevOps and Site Reliability Engineering (SRE) help increase code delivery by enabling frequent deployments and short lead times for bug fixes.
DevOps and SRE also help reduce infrastructure costs through techniques like capacity planning and identifying resource bottlenecks to optimize performance.

Podcast Appearance: Slight Reliability

0 implied HN points • 13 Jul 23

🎙 Podcasts Interview Tech

Amin Astaneh appeared on the Slight Reliability podcast to discuss site reliability engineering (SRE).
The podcast covered topics like making ops work visible, measuring toil, and implementing SLOs.
Host Stephen Townshend brought valuable SRE experience to the conversation, making the content engaging and informative.

Video: SRE, Demystified

0 implied HN points • 05 Jun 23

🕹 Technology Education

The video is about demystifying Site Reliability Engineering (SRE) and provides a comprehensive introduction to SRE practices.
The talk was presented at the Boston DevOps Meetup by Amin Astaneh.
It serves as a valuable resource for individuals looking to understand SRE concepts and start their own teams.

Welcome!

0 implied HN points • 28 Apr 23

📰 News

The post welcomes readers to Certo Modo's Substack newsletter.
Readers can easily subscribe to receive weekly posts from the newsletter.
The newsletter content is also available on the Certo Modo website.