Certo Modo

Certo Modo is a newsletter that covers DevOps and SRE best practices, focusing on strategies for efficient systems reliability, automation, and team collaboration. It offers insights into emotional intelligence, Ansible, cloud tools, CI/CD pipelines, incident management, and observability, catering to both smaller organizations and larger companies.

DevOps/SRE Best Practices; Emotional Intelligence in Tech; Ansible Usage and Security; Lean SRE Implementation; Shell Scripting; Demonstrating Value in DevOps/SRE; Continuous Integration and Continuous Deployment; Incident Management; Cloud Infrastructure; Team Collaboration and Communication; System Monitoring and Observability; Interview and Career Advancement in Tech

The hottest Substack posts of Certo Modo

And their main takeaways
19 implied HN points 18 Mar 24
  1. Smaller organizations and startups can benefit from implementing Site Reliability Engineering (SRE) practices, leading to reduced operational costs and time savings.
  2. Implementing SRE practices in smaller companies may differ in approach from larger organizations, but can still yield significant benefits.
  3. Starting an SRE program at a larger company can be achieved by beginning with just one software engineering team.
19 implied HN points 03 Oct 23
  1. Organize your Ansible files by following a recommended directory structure. This helps keep things structured and manageable as your project grows.
  2. Avoid putting secrets like credentials directly into variable files. Use Ansible Vault to encrypt sensitive information, maintaining security.
  3. Utilize tools like Ansible-Lint for verifying playbook syntax, and the --check option in ansible-playbook for 'dry-runs' to catch errors before affecting production.
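The advice about keeping secrets out of plain variable files can be checked mechanically. The following is a minimal sketch, not something taken from the post: it assumes secrets live in files matching the hypothetical pattern `vault*.yml`, and it relies on the fact that files encrypted with `ansible-vault encrypt` begin with a `$ANSIBLE_VAULT;` header line.

```python
from pathlib import Path

# Files encrypted with `ansible-vault encrypt` start with a header such as
# "$ANSIBLE_VAULT;1.1;AES256", so checking the prefix is enough for this sketch.
VAULT_MARKER = "$ANSIBLE_VAULT;"


def is_vault_encrypted(path: Path) -> bool:
    """Return True if the first line of the file carries the vault header."""
    with path.open("r", encoding="utf-8", errors="replace") as handle:
        return handle.readline().startswith(VAULT_MARKER)


def unencrypted_secret_files(root: Path, pattern: str = "**/vault*.yml") -> list[Path]:
    """List files matching `pattern` under `root` that are NOT vault-encrypted.

    The default pattern is an assumed naming convention; adjust it to
    whatever your repository actually uses for secret files.
    """
    return sorted(p for p in root.glob(pattern) if not is_vault_encrypted(p))
```

A check like this could run in CI alongside ansible-lint and fail the build when `unencrypted_secret_files(Path("."))` returns anything.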
39 implied HN points 09 Feb 23
  1. Emotional intelligence is crucial in the DevOps/SRE space for managing emotions, reasoning, and decision-making.
  2. Recognize and manage 'amygdala hijacks' in stressful situations at work to maintain clear thinking and avoid reactive behavior.
  3. Understanding emotions as information from others is important in social settings for effective communication and decision-making at work.
1 HN point 26 Feb 24
  1. Shell scripts are an efficient choice when the functionality you need is exposed through CLI tools but not through APIs.
  2. For quick prototypes, opt for a shell script solution to validate ideas swiftly before committing to a more complex programming language.
  3. When developing CLI tools, prioritize speed and consider using compiled languages like Golang or Rust for efficiency.
0 implied HN points 14 Dec 23
  1. Focus on demonstrating the impact of your work to the business in terms of time and money saved/made compared to what you are being paid.
  2. Communicate the importance of your work to your peers and stakeholders by adding value propositions to your tasks, measuring impact, and tracking significant wins with supporting metrics.
  3. Consistently delivering impactful work, improving organizational perception, and effective communication can lead to growth opportunities such as team expansion, promotions, and better job offers.
0 implied HN points 14 Nov 23
  1. Each pipeline step in DroneCI can use different container images, allowing for versatile tasks like testing across multiple platforms.
  2. Base64-encoding secrets in DroneCI is a useful technique for handling sensitive multi-line values like SSH keys. Base64 is an encoding rather than encryption, so the encoded value must still be stored as a secret.
  3. Monitoring DroneCI pipelines can be enhanced by using Prometheus to track build status and duration, with a Pushgateway exporting the build metrics.
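The Base64 technique mentioned above can be illustrated with a short sketch. It assumes the goal is to turn a multi-line value such as a private key into a single-line string that survives being pasted into a secret field, and to decode it again inside a pipeline step. The function names and the sample key text are hypothetical placeholders.

```python
import base64


def encode_secret(raw: bytes) -> str:
    """Encode raw bytes as a single-line ASCII string (no embedded newlines)."""
    return base64.b64encode(raw).decode("ascii")


def decode_secret(encoded: str) -> bytes:
    """Reverse encode_secret, recovering the original bytes exactly."""
    return base64.b64decode(encoded)


if __name__ == "__main__":
    # Placeholder text standing in for a real key; never commit real keys.
    fake_key = b"-----BEGIN KEY-----\nline one\nline two\n-----END KEY-----\n"
    encoded = encode_secret(fake_key)
    assert "\n" not in encoded
    assert decode_secret(encoded) == fake_key
```

Anyone who can read the encoded string can decode it, so it offers no confidentiality by itself.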
0 implied HN points 30 Oct 23
  1. Drone CI is a self-service Continuous Integration platform that simplifies CI pipeline configuration by using a .drone.yml file in the project's root directory.
  2. Make sure the Drone service is public-facing with port 443 accessible, and use a read-through cache with Docker to ensure resilience to Docker Hub outages.
  3. Add yourself as an administrator to new Drone installations to have full access to API features, set up cron interval as needed, and utilize the CLI tool for more advanced capabilities.
0 implied HN points 19 Sep 23
  1. Consider alternatives to Jenkins for new software projects because of the complexity its plugins introduce.
  2. Evaluate your R&D department's expertise and resources to manage Jenkins installation and perform updates.
  3. Assess the security risks and maintenance requirements of Jenkins installations to prevent potential breaches.
0 implied HN points 07 Sep 23
  1. Automating manual tasks is crucial for growth, as manual work can consume time needed for innovation and advancement.
  2. Runbooks, as documented step-by-step instructions, are key to delegating work, reducing single points of failure, and ensuring consistency in task execution.
  3. Converting manual scripts into full-fledged software services allows for instant and automated task completion, improving efficiency and reducing human involvement.
0 implied HN points 24 Aug 23
  1. The post discusses tactics for excelling in coding interviews at big tech companies.
  2. The video presentation offers valuable insights on preparing for challenging parts of the interview process, particularly focusing on technical aspects.
  3. The talk was delivered at a Vegas Programmers Meetup and acts as a complement to a previous post about obtaining an SRE role.
0 implied HN points 04 Aug 23
  1. Starting a series to explore cloud-native tools like Kubernetes can be an exciting and beneficial learning experience.
  2. Setting up a K3s cluster on a cloud provider using Terraform for infrastructure and Ansible for configuration can be cost-effective and efficient.
  3. Linux systems knowledge, including reading logs, writing scripts, and basic networking, is essential when encountering roadblocks during server setup and configuration.
0 implied HN points 28 Jul 23
  1. Kanban is an efficient process for organizing team work, especially for interrupt-driven teams like Site Reliability Engineering, Operations, IT, or Customer Support.
  2. Implementing Kanban provides visibility, measurability, and easy reprioritization of work, helping build credibility and trust with stakeholders.
  3. Key elements of effectively managing Kanban include tracking all work items, creating clear work state columns, limiting work-in-progress, and using metrics to drive continuous improvement.
0 implied HN points 20 Jul 23
  1. The podcast episode discusses the role of an SRE and how it differs from a system administrator.
  2. It highlights the importance of communication and persuasion skills for an SRE.
  3. The conversation also delves into the relationship between SRE and DevOps, interview processes, and skills requirements.
0 implied HN points 28 Jun 23
  1. Amin Astaneh was a guest on the Practical Operations podcast, discussing DevOps transformations and Site Reliability Engineering.
  2. The podcast focuses on real-world use cases and solutions to common problems in systems and operations.
  3. The hosts of the podcast are highly knowledgeable and entertaining to listen to.
0 implied HN points 22 Jun 23
  1. The parallel distributed shell is a CLI tool helpful for troubleshooting in large-scale systems.
  2. It allows engineers to simultaneously run commands on multiple hosts, offering a 'break-glass' solution.
  3. Various options exist like dsh, pdsh, hyper-shell, and AWS Run Command for managing infrastructure when traditional methods fail.
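The core idea behind these tools, running one command on many hosts at once and collecting the results, can be sketched in a few lines. This is an illustrative toy, not how dsh or pdsh are implemented: it assumes a local `ssh` client with key-based authentication, and the function names are hypothetical.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable

Result = tuple[int, str]  # (exit code, combined output)


def run_over_ssh(host: str, command: str, timeout: float = 10.0) -> Result:
    """Run `command` on `host` via the local ssh client.

    BatchMode=yes makes ssh fail instead of prompting for a password.
    """
    proc = subprocess.run(
        ["ssh", "-o", "BatchMode=yes", host, command],
        capture_output=True, text=True, timeout=timeout,
    )
    return proc.returncode, proc.stdout + proc.stderr


def _guarded(runner: Callable[[str, str], Result], host: str, command: str) -> Result:
    """Convert exceptions into a result so one bad host cannot abort the run."""
    try:
        return runner(host, command)
    except Exception as exc:
        return -1, f"{type(exc).__name__}: {exc}"


def run_everywhere(
    hosts: Iterable[str],
    command: str,
    runner: Callable[[str, str], Result] = run_over_ssh,
    max_workers: int = 16,
) -> dict[str, Result]:
    """Run `command` on every host concurrently and return results per host."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {h: pool.submit(_guarded, runner, h, command) for h in hosts}
        return {h: f.result() for h, f in futures.items()}
```

Passing the runner as a parameter keeps the fan-out logic testable without network access; a call such as `run_everywhere(["web1", "web2"], "uptime")` would use real SSH.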
0 implied HN points 16 Jun 23
  1. Valuable work in tech is often a collaborative effort involving different teams and perspectives.
  2. To foster cross-functional collaboration, communicate the big picture clearly through mechanisms like RFCs and understand and align with the incentives of others.
  3. Breaking down silos, enabling effective communication, and celebrating contributions are key to successful collaboration between diverse teams.
0 implied HN points 09 Jun 23
  1. System calls are how programs interact with the operating system to request and manage resources like memory and files.
  2. System call tracing allows real-time observation of running processes to understand resource usage and behavior.
  3. Tracing tools like strace and perf can help diagnose issues in production systems but come with a performance impact, requiring caution in usage.
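To make the idea of system calls concrete, the sketch below uses Python's `os` module, whose low-level functions are thin wrappers around the operating system's file calls. The function name and the scratch file are hypothetical; the exact system call names a tracer reports vary by platform.

```python
import os
import tempfile


def write_and_read_back(path: str, payload: bytes) -> bytes:
    """Write payload to path and read it back using low-level descriptors."""
    # Ask the kernel for a file descriptor (open/openat on Linux).
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.write(fd, payload)  # write: hand bytes to the kernel
    finally:
        os.close(fd)  # close: release the descriptor

    fd = os.open(path, os.O_RDONLY)
    try:
        return os.read(fd, len(payload))  # read: ask the kernel for bytes
    finally:
        os.close(fd)


if __name__ == "__main__":
    scratch = os.path.join(tempfile.mkdtemp(), "demo.txt")
    print(write_and_read_back(scratch, b"hello syscalls\n"))
```

On Linux, running a script like this under `strace -e trace=openat,write,read,close` should surface the calls made on the scratch file among the interpreter's own startup activity.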
0 implied HN points 01 Jun 23
  1. To excel in an SRE role, focus on developing important character traits like emotional intelligence, resilience, and assertiveness to stand out as a candidate.
  2. Coding skills are essential for an SRE position; expect to be tested on tasks like file I/O, data structures, and program efficiency, so practice coding and explaining your solutions.
  3. Systems knowledge and experience are crucial; be prepared to discuss Linux internals, troubleshooting tools, and system administration basics in interviews to showcase your expertise.
0 implied HN points 23 May 23
  1. Post-mortems are crucial for teams to learn from failure and improve systems.
  2. Scheduling recurring post-mortem meetings helps in consistent learning and fostering a culture of continuous improvement.
  3. Selecting a capable moderator and presenter, preparing for the meeting, facilitating discussions, and following up on action items are key responsibilities for successful post-mortems.
0 implied HN points 12 May 23
  1. Write-ups are essential after incidents to learn and improve. They help document the incident, leading to better post-mortems and prevention strategies.
  2. Creating an effective write-up involves describing the impact, crafting a detailed timeline, and using it to tell a coherent story. Following a specific format makes understanding easier.
  3. Understanding what triggered the incident and identifying fixes and improvements are crucial steps. Focus on blameless analysis, seek contributing factors, and fine-tune prevention strategies.
0 implied HN points 09 Oct 23
  1. The post discusses the author's experience in Meta's Production Engineering organization and a story about almost causing a server room fire early in the author's career.
  2. The content is part of the Slight Reliability Podcast episode 70 on Meta SRE.
  3. The author, Amin Astaneh, shares insights and anecdotes related to reliability engineering and server management.
0 implied HN points 28 Apr 23
  1. Ensure your on-call rotation is sufficiently staffed to prevent burnout and ensure a timely response to incidents.
  2. Avoid delegating on-call responsibilities to another team to maintain a tight feedback loop and incentivize problem-solving.
  3. Have everyone on the team participate in the on-call rotation to promote empathy, reliability, and a collective care for system stability.
0 implied HN points 20 Apr 23
  1. Alerting in incident management notifies the team to respond to production problems promptly based on severity levels.
  2. When setting up alerting mechanisms, consider categorizing alerts into pages for emergencies, tickets for best effort during business hours, and logs that require no response.
  3. Craft actionable alerts by enriching them with context like graphs, log entries, and links to runbooks. Test new alerts thoroughly before directing them to the on-call team.
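The page/ticket/log split described above can be expressed as a small routing function. This is a sketch under assumed criteria: the two boolean fields and the required context keys are hypothetical, and a real team would define its own rules.

```python
from dataclasses import dataclass, field
from enum import Enum


class Route(Enum):
    PAGE = "page"      # emergency: notify the on-call engineer now
    TICKET = "ticket"  # best effort during business hours
    LOG = "log"        # recorded only, no response required


@dataclass
class Alert:
    name: str
    customer_impacting: bool  # assumed criterion for an emergency
    needs_human: bool         # assumed criterion for any follow-up at all
    context: dict[str, str] = field(default_factory=dict)


def route(alert: Alert) -> Route:
    """Pick a destination for the alert using the assumed criteria."""
    if alert.customer_impacting and alert.needs_human:
        return Route.PAGE
    if alert.needs_human:
        return Route.TICKET
    return Route.LOG


def is_actionable(alert: Alert, required=("graph_url", "runbook_url")) -> bool:
    """True when the alert carries every piece of context a responder needs."""
    return all(alert.context.get(key) for key in required)
```

A gate such as `route(alert) is Route.PAGE and not is_actionable(alert)` could be used to hold back new alerts from the on-call team until they have been enriched and tested.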
0 implied HN points 13 Apr 23
  1. Having a well-defined escalation policy is crucial for effectively addressing production issues that monitoring may not catch. This policy should outline steps to take when the on-call team cannot resolve an issue.
  2. Creating a team page with essential information like how to ask for help, defining emergencies, and team responsibilities helps guide the decision on escalating an issue and waking up the on-call staff if needed.
  3. In larger organizations, centralizing the escalation process by creating a common document with links to different teams, and using consistent tools for escalations, can streamline and speed up the incident resolution process.
0 implied HN points 10 Apr 23
  1. Monitoring is a crucial aspect of incident management to detect issues quickly and efficiently.
  2. Top-level metrics like Service Level Indicators (SLIs) and operational metrics provide valuable insights into system health.
  3. Data for monitoring can come from time series data, logs, and traces, and visualization tools like Grafana help in analyzing and interpreting this data effectively.
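As a worked illustration of a Service Level Indicator, the sketch below computes two common ones as simple ratios: the share of successful requests and the share of requests served within a latency threshold. The function names and the choice to treat an empty window as fully healthy are assumptions made for this example.

```python
from typing import Sequence


def availability_sli(good_requests: int, total_requests: int) -> float:
    """Fraction of requests that succeeded; an empty window counts as 1.0."""
    if total_requests == 0:
        return 1.0
    return good_requests / total_requests


def latency_sli(latencies_ms: Sequence[float], threshold_ms: float) -> float:
    """Fraction of requests served at or under threshold_ms."""
    if not latencies_ms:
        return 1.0
    fast = sum(1 for value in latencies_ms if value <= threshold_ms)
    return fast / len(latencies_ms)


if __name__ == "__main__":
    # 9,990 successes out of 10,000 requests is 99.9% availability.
    print(availability_sli(9_990, 10_000))
    # Three of four requests finished within 300 ms.
    print(latency_sli([100, 200, 300, 400], threshold_ms=300))
```

In practice these counts would come from the time series, logs, or traces mentioned above rather than from hard-coded numbers.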
0 implied HN points 28 Mar 23
  1. Identify and maintain a relationship with the team's point of contact to ensure clear communication and accountability.
  2. Prior to starting an engagement, conduct initial discovery to understand the team's operational needs and potential risks.
  3. Create a clear engagement document outlining goals, expectations, and metrics for success, ensuring alignment with the team's objectives.
0 implied HN points 06 Mar 23
  1. Site Reliability Engineering (SRE) teams drive higher operational maturity, remove sources of toil, and improve service reliability.
  2. Establishing strong SRE practices involves shared operational responsibility, measuring customer success, using error budgets to prioritize work, and learning from failures in a blameless manner.
  3. Properly staffing on-call rotations and ensuring humane work-life balance are essential for SRE team success.
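The error budget idea can be made concrete with a little arithmetic: the budget is whatever the reliability target leaves over. The sketch below is illustrative only; the function names, the 30-day window, and the request-based accounting are assumptions rather than details from the post.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of full downtime the target allows over the window."""
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the request-based error budget still unspent.

    A negative value means the budget is overspent.
    """
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures == 0:
        # A 100% target (or no traffic) leaves no budget to spend.
        return 0.0
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    # A 99.9% target over 30 days allows roughly 43.2 minutes of downtime.
    print(round(error_budget_minutes(0.999), 1))
    # 250 failures against about 1,000 allowed leaves about 75% of the budget.
    print(round(budget_remaining(0.999, 1_000_000, 250), 2))
```

A team might use the remaining fraction to decide between shipping features (budget left) and reliability work (budget nearly or fully spent).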
0 implied HN points 20 Feb 23
  1. Product launches are crucial and can make or break a business, depending on how well they are received by customers.
  2. The Production Readiness Review (PRR) is a valuable process that ensures a team is fully prepared to offer a product to paying customers by evaluating operational responsibilities.
  3. The PRR process involves creating a standardized questionnaire, delegating questions to team members, presenting and discussing findings, providing feedback, and making a go/no-go decision based on known risks before launching the product.
0 implied HN points 14 Feb 23
  1. Open-source observability tools provide metrics, dashboards, and notifications without software licensing fees.
  2. Some observability tools focus on cloud-native infrastructure, making setup challenging for non-cloud businesses.
  3. O11y-in-a-box simplifies monitoring by providing Prometheus, Loki, and Grafana for performance, availability, log analysis, and alerting on a single-host system.
0 implied HN points 01 Feb 23
  1. On-call retrospectives help teams stay connected to the real operational challenges they face and provide insights on how to enhance the on-call experience.
  2. Holding weekly meetings where team members share metrics, experiences, and discuss improvements can lead to a more efficient and enjoyable on-call rotation.
  3. Taking notes during the retrospective and translating them into actionable tasks for improvement can result in smoother on-call shifts and increased team productivity.
0 implied HN points 20 Jul 23
  1. Facing unexpected incidents and downtime is a common challenge in tech operations.
  2. Proactively solving issues during stressful on-call duties can lead to innovative solutions.
  3. Implementing customized alerting systems can greatly improve incident response and team efficiency.
0 implied HN points 20 Mar 23
  1. SREs engage with software engineering organizations in different ways to help achieve goals.
  2. Engagement models include consulting, embedded, and infra team, each with unique benefits and challenges.
  3. Implementing SRE involves balancing tradeoffs based on challenges, budget, and organizational needs.
0 implied HN points 27 Feb 23
  1. Service Level Objectives (SLOs) reveal important customer experience metrics beyond just uptime, such as latency and error rate.
  2. Developing SLOs fosters cross-functional collaboration within an organization, breaking down silos and promoting a unified approach to reliability.
  3. Implementing SLOs can lead to investments in improved observability, enhancing infrastructure management and monitoring for long-term operational benefits.
0 implied HN points 13 Feb 24
  1. The post discusses the basics of CI/CD, including change management, running a Change Advisory Board, testing in production, and managing test/deploy infrastructure.
  2. It highlights the importance of understanding and implementing CI/CD processes for efficient software development and deployment.
  3. The post provides insights into how to effectively manage and optimize the test and deployment infrastructure for software projects.
0 implied HN points 05 May 23
  1. In software development, the goal is to make money by increasing subscriptions and shipping code quickly while minimizing operational costs.
  2. DevOps and Site Reliability Engineering (SRE) help increase code delivery by enabling frequent deployments and short lead times for bug fixes.
  3. DevOps and SRE also help reduce infrastructure costs through techniques like capacity planning and identifying resource bottlenecks to optimize performance.
0 implied HN points 13 Jul 23
  1. Amin Astaneh appeared on the Slight Reliability podcast to discuss site reliability engineering (SRE).
  2. The podcast covered topics like making ops work visible, measuring toil, and implementing SLOs.
  3. Host Stephen Townshend brought valuable SRE experience to the conversation, making the content engaging and informative.
0 implied HN points 05 Jun 23
  1. The video is about demystifying Site Reliability Engineering (SRE) and provides a comprehensive introduction to SRE practices.
  2. The talk was presented at the Boston DevOps Meetup by Amin Astaneh.
  3. It serves as a valuable resource for individuals looking to understand SRE concepts and start their own teams.
0 implied HN points 28 Apr 23
  1. The post welcomes readers to Certo Modo's Substack newsletter.
  2. Readers can easily subscribe to receive weekly posts from the newsletter.
  3. The newsletter content is also available on the Certo Modo website.