Sr. Site Reliability Engineer
JOB SUMMARY
At Pantomath, we are building the autopilot for the data-driven enterprise. Data teams today are buried under operational toil — battling broken pipelines, schema drift, and silent quality failures that cost hours of manual debugging and erode customer trust.
We are building the Data Operations Center (DOC) to automate the entire lifecycle of data reliability. Our platform doesn't just monitor — it remediates. We are turning hours of manual troubleshooting into seconds of autonomous, self-healing recovery across the entire data stack.
The Sr. Site Reliability Engineer is a senior technical leader responsible for the availability, security, performance, and scalability of Pantomath's platform. This role goes beyond infrastructure upkeep — you will architect the foundation that makes autonomous remediation possible at scale. You will own our cloud environment end-to-end, drive platform strategy alongside engineering leadership, and set the standard for reliability excellence across the organization. Ideal candidates are deeply technical, self-directed, and energized by building production-grade systems that simply don't fail.
THE OPPORTUNITY
This is a senior individual contributor role on the engineering team, based in the Bay Area. You will partner directly with the VP of Engineering to shape our infrastructure roadmap, accelerate developer velocity, and build the resilient platform backbone that powers autonomous data operations at scale. This is a zero-to-one opportunity to define what enterprise-grade reliability looks like at a high-growth AI startup.
YOUR IMPACT
Own the Platform
Design, build, and maintain Pantomath's cloud infrastructure on AWS (EC2, EKS, IAM, ALB, RDS, S3) using Infrastructure as Code principles (Terraform, CDK).
Architect and evolve CI/CD pipelines (GitHub Actions, Nx) that enable development teams to ship with speed, confidence, and consistency.
Lead the incident response lifecycle — own runbooks, drive resolution, and conduct blameless postmortems that harden the platform for the future.
Manage business-as-usual (BAU) operations, including backups, credential rotation, log retention, and system administration, with operational discipline.
Engineer for Reliability and Security
Apply zero trust and least privilege design patterns to authorization, authentication, networking, and runtime threat detection across the platform.
Partner with leadership to maintain SOC 2-compliant infrastructure practices and proactively close security gaps before they become incidents.
Implement and manage robust observability tooling (Datadog, CloudWatch, Prometheus) — define standards for logging, metrics, and alerting that give every team real-time platform visibility.
Support agent observability for connector services central to Pantomath's autonomous remediation engine.
Drive Efficiency and Scale
Establish cost dashboards, conduct bi-weekly reviews, and implement right-sizing, idle shutdown, and shared infrastructure patterns that meaningfully reduce cloud spend.
Lead migration to shared ALB patterns and optimize EKS autoscaling to support rapid customer and product growth.
Contribute to multi-region readiness strategy and proactively address AWS service limits and scalability bottlenecks before they impact customers.
Reduce friction for developers — automate manual provisioning, clean up IaC repositories, and streamline dev and staging environments so engineers can move fast.
Shape the Engineering Culture
Champion DevOps and SRE best practices within an Agile/Scrum framework across multiple engineering pods.
Drive the infrastructure roadmap and platform strategy in close partnership with the VP of Engineering and company leadership.
Contribute to system architecture discussions and mentor engineers across the organization on reliability and operational excellence.
WHAT WE ARE LOOKING FOR
Education and Experience
Bachelor's degree in Computer Science, Information Systems, or a related field, or equivalent practical experience.
5+ years of experience in Site Reliability, Platform Engineering, DevOps, or Cloud Engineering — ideally in a high-growth startup environment.
Demonstrated track record of owning platform initiatives end-to-end, from design through production operation.
Proven experience operating within an Agile/Scrum development methodology.
Required Skills and Competencies
Deep AWS expertise across core services (EC2, EKS, IAM, ALB, RDS, S3) and strong hands-on experience with Terraform or comparable IaC tools.
Solid CI/CD knowledge, preferably with GitHub Actions, and the ability to build pipelines that accelerate engineering without sacrificing safety.
Proficiency with observability tooling (Datadog, Prometheus, CloudWatch) and the judgment to define meaningful alerting standards across a distributed platform.
Strong command of security best practices — least privilege, secret management, zero trust networking, and runtime threat detection.
Proficiency in at least one scripting language (Python, Bash) for automation, tooling, and infrastructure management.
Proficient in leveraging AI coding assistants and committed to evolving SDLC workflows to maximize the impact of AI-driven development.
Excellent problem-solving, communication, and cross-functional collaboration skills.
Preferred Skills and Competencies
Experience designing and operating multi-region AWS architectures at scale.
Prior work in a SOC 2-compliant environment with direct involvement in audit readiness.
Track record of measurably reducing cloud spend through architectural and operational improvements.
Familiarity with container networking, ALB/NGINX routing, and EKS tuning.
Experience supporting data infrastructure or AI/ML workloads in production environments.
PHYSICAL AND WORK ENVIRONMENT REQUIREMENTS
This role is primarily performed in an office or remote work setting and requires the ability to:
Sit for extended periods while working on a computer.
Occasionally stand, walk, reach, stoop, or bend during the course of work.
Communicate clearly and effectively via video conferencing, phone, and email.
Occasionally lift and move items varying in weight and size (e.g., office equipment, packages, marketing materials).
Occasionally travel, including air and ground transportation and overnight stays, if required.
EQUAL OPPORTUNITY & ACCOMMODATIONS
Pantomath is proud to be an Equal Opportunity Employer. Employment decisions are made without regard to legally protected characteristics and are based on qualifications, merit, and business needs. We are committed to providing reasonable accommodations to qualified individuals — whether during the application and interview process or throughout employment. To request an accommodation, please contact Human Resources.
FLSA COMPLIANCE STATEMENT
This position is classified as Exempt under the Fair Labor Standards Act (FLSA), meaning it is not eligible for overtime compensation. This classification is based on responsibilities involving advanced knowledge in a field of science or learning, the consistent exercise of discretion and independent judgment, and compensation on a salary basis meeting applicable thresholds.
- Department: Engineering
- Locations: Remote, San Francisco Bay Area, CA
- Remote status: Fully Remote
- Employment type: Full-time
About Pantomath
Pantomath is an automated data operations platform for data reliability. It addresses a common problem: manual, time-consuming processes and reliance on tribal knowledge often slow down problem resolution.
By automating data monitoring and impact analysis, Pantomath streamlines operations and improves data confidence, quality, and reliability. Enterprises using Pantomath significantly reduce mean time to acknowledgement, mean time to root cause, and mean time to resolution of all data issues across their entire data ecosystem.