Principal Site Reliability Engineer

Sigma Software
Full-time
Remote
Brazil
Technology & Development

Company Description

Are you ready to lead infrastructure strategy for a cutting‑edge AI‑driven SaaS platform? We are looking for a Principal Site Reliability Engineer with a proven track record in scaling, optimizing, and securing cloud‑based systems. This senior role offers the opportunity to shape the reliability and performance of a platform used by finance teams worldwide.

In this role, you will be part of a dynamic engineering environment where your expertise will directly influence product stability and growth. You will work with advanced cloud technologies, automation tools, and AI-driven solutions, contributing to projects that push the boundaries of innovation.

If you are ready to take on strategic responsibility and make a tangible impact, apply now and join us in building the future of reliable, scalable systems.

CUSTOMER

Sigma Software is partnering with a fast‑growing AI‑driven SaaS platform serving finance and accounting teams in high‑growth businesses. The platform automates critical workflows, from billing and collections to revenue recognition and reporting, helping customers stay compliant and accelerate cash flow. Leveraging advanced AI, it reduces manual work, increases operational efficiency, and supports scalability for customers worldwide.

PROJECT

The project focuses on building and scaling an AI-powered SaaS solution for finance automation. It integrates advanced machine learning models with robust cloud infrastructure to deliver secure, compliant, and high‑performance services. The engineering culture emphasizes automation, resilience, and operational excellence.

Job Description

  • Define and lead infrastructure and reliability strategy across the platform
  • Design scalable, resilient systems in collaboration with engineering teams
  • Optimize build, testing, and deployment processes for speed and stability
  • Establish and uphold best practices for CI/CD, monitoring, and observability
  • Lead incident response and drive continuous improvement post‑incident
  • Automate workflows to reduce operational toil and risk
  • Mentor engineers and foster a culture of operational excellence
  • Make strategic build‑vs‑buy decisions balancing speed, quality, and sustainability

Qualifications

  • At least 8 years of experience in Site Reliability Engineering or DevOps roles, including 2+ years in a Principal or Lead position
  • Proven experience in infrastructure modernization and scaling initiatives for high‑growth environments
  • Strong proficiency in Python
  • Deep expertise in cloud platforms and container orchestration tools such as AWS ECS and EKS
  • Solid experience in CI/CD pipeline design and optimization using tools like GitHub Actions and Buildkite
  • Proficiency in infrastructure‑as‑code tools such as Terraform
  • Strong knowledge of monitoring, observability, and performance optimization practices
  • Upper-Intermediate level of spoken and written English

WOULD BE A PLUS

  • Experience with monorepos (Turborepo, pnpm)
  • Familiarity with modern TypeScript tools (SWC, Biome, Oxc)
  • Knowledge of NestJS, Next.js, and testing frameworks (Jest, Vitest)

Additional Information

PERSONAL PROFILE

  • Excellent leadership, communication, and decision‑making abilities
  • Ability to work independently and make pragmatic build‑vs‑buy decisions in fast‑paced environments