Principal Site Reliability Engineer (SRE)
Organization: INFINITE CHOICE LLC
Location: United States
About the Role

We're seeking an exceptional Principal Site Reliability Engineer to architect, design, and build our SRE foundation from the ground up at InfiniteChoice. This is a rare greenfield opportunity to establish SRE practices, develop custom tooling, and create the reliability culture that will support our platform serving millions of users and billions in transaction volume.

As our Principal SRE, you'll combine deep technical expertise with strategic vision to build world-class monitoring, observability, and automation systems. You'll have the autonomy to define our SRE processes, select technologies, and create the framework that ensures our systems are reliable, scalable, and performant.

Location: Remote - US based

What You Will Do

SRE Foundation & Process Development
- Build SRE practices from scratch - define SLIs, SLOs, error budgets, and reliability metrics
- Establish incident response procedures, on-call rotations, and post-mortem processes
- Create reliability engineering standards and best practices across all engineering teams
- Develop disaster recovery and business continuity strategies
- Design and implement capacity planning and performance optimization frameworks

Architecture & Tool Development
- Drive architecture decisions for comprehensive application and infrastructure monitoring solutions
- Design and develop custom SRE tools for automated monitoring, alerting, and remediation
- Build observability platforms that provide deep insights into system performance and user experience
- Create automation frameworks for deployment, scaling, and incident response
- Architect logging, metrics, and tracing systems for distributed microservices environments

Google Cloud Infrastructure Excellence
- Leverage Google Cloud Platform services to build resilient, scalable infrastructure
- Implement cloud-native monitoring using Stackdriver, Cloud Monitoring, and Cloud Logging
- Design auto-scaling and self-healing systems using GKE, Cloud Functions, and managed services
- Optimize cloud costs while maintaining high availability and performance standards
- Establish security and compliance frameworks within GCP environments

Innovation & Continuous Improvement
- Research and implement cutting-edge SRE tools and methodologies
- Leverage AI and machine learning for predictive analytics, anomaly detection, and automated remediation
- Create dashboards and reporting systems that provide actionable insights to engineering and business teams
- Establish feedback loops for continuous improvement of reliability and performance
- Stay current with industry best practices and emerging technologies in the SRE space

What You Must Have

SRE & Infrastructure Expertise
- 12+ years of experience in Site Reliability Engineering or Infrastructure Engineering
- 5+ years in lead SRE roles building and scaling SRE teams and processes
- Proven track record designing and implementing monitoring and observability solutions at scale
- Deep understanding of distributed systems, microservices architectures, and cloud-native patterns
- Experience with infrastructure as code, configuration management, and deployment automation

Google Cloud Platform Proficiency
- Hands-on experience with Google Cloud Platform is required
- Expertise with the GCP monitoring and observability stack (Cloud Monitoring, Cloud Logging, Cloud Trace)
- Experience with GKE, Compute Engine, Cloud Functions, and other core GCP services
- Knowledge of GCP networking, security, and compliance capabilities
- Understanding of GCP cost optimization and resource management

Technical Skills
- Strong programming skills in Python, Go, Java, or similar languages
- Experience with monitoring tools (Prometheus, Grafana, Datadog, New Relic, or similar)
- Proficiency with containerization (Docker, Kubernetes) and orchestration platforms
- Knowledge of CI/CD pipelines, automated testing, and deployment strategies
- Understanding of database performance tuning and optimization (SQL and NoSQL)

AI & Automation
- Familiarity with AI-driven development tools and methodologies is a huge plus
- Experience with machine learning for operations (AIOps), anomaly detection, or predictive analytics
- Knowledge of automated incident response and self-healing systems
- Understanding of AI/ML tools for log analysis, pattern recognition, and intelligent alerting

Problem-Solving & Mindset
- Strong analytical and troubleshooting skills for complex distributed systems
- Experience with high-pressure incident response and crisis management
- Detail-oriented, with a commitment to operational excellence and continuous improvement
- Comfortable with ambiguity and building processes in a fast-growing environment
- Passion for reliability, automation, and engineering best practices
- Demonstrated experience building SRE programs and processes from the ground up is a HUGE plus

Education
- Bachelor's degree in Computer Science, Engineering, or equivalent professional experience
- Industry certifications preferred (Google Cloud Professional, SRE, or related certifications)

What We Offer
- Ground-floor opportunity to build SRE practices and culture from scratch
- Full autonomy to define processes, select technologies, and establish best practices
- Direct impact on platform reliability serving millions of users
- Opportunity to create a lasting engineering culture and operational excellence
- Remote-first culture with in-person meetings in Dallas, TX as needed
- Collaborative environment with smart, passionate engineers and cross-functional teams
- Access to cutting-edge technologies and AI-driven development tools
- Competitive compensation, equity participation, and comprehensive benefits

Ready to Build World-Class Reliability?

Join us in creating the SRE foundation that will power InfiniteChoice's next phase of growth. If you're passionate about reliability engineering, love building systems from scratch, and want to establish the operational excellence that scales with our business, we'd love to hear from you.

About InfiniteChoice

InfiniteChoice was founded to help people find the experiences they want simply and effortlessly. We leverage a new type of business model and platform that uniquely applies automation and technology to solve the challenges of scale and complexity in experience discovery.

Existing business and marketing technologies can no longer handle the demands of connecting millions of consumers with vast inventories of experiences across a fragmented, global marketplace of people, partners, and providers.

Our mission is to disrupt this status quo by creating seamless connections between consumers and experiences. We're just at the beginning of this journey, but our approach is working: we've helped over 275 million visitors connect to millions of experiences, generating over $2 billion in revenue for our brands and partners.