Senior Software Engineer - SRE

at Roku

Bengaluru, India

Teamwork makes the stream work.

Roku is changing how the world watches TV

Roku is the #1 TV streaming platform in the U.S., Canada, and Mexico, and we've set our sights on powering every television in the world. Roku pioneered streaming to the TV. Our mission is to be the TV streaming platform that connects the entire TV ecosystem. We connect consumers to the content they love, enable content publishers to build and monetize large audiences, and provide advertisers unique capabilities to engage consumers.

From your first day at Roku, you'll make a valuable - and valued - contribution. We're a fast-growing public company where no one is a bystander. We offer you the opportunity to delight millions of TV streamers around the world while gaining meaningful experience across a variety of disciplines.

About the team

The Platform Infrastructure team ensures that all Roku systems run smoothly. These systems support more than 100M users and billions in transaction revenue per year. We are a group of highly skilled infrastructure and software engineers who help build and operate systems at internet scale, including Platform (Kubernetes, Istio, Envoy, operators, and more) and Observability (OSS/CNCF-supported observability projects). We engage with multiple teams to achieve company-impacting results.

About the Role:

We are seeking a talented and experienced Senior Software Engineer in Site Reliability Engineering (SRE) to join our dynamic team. The ideal candidate will have a strong background in SRE practices, cloud infrastructure management, and automation. If you have a consistent track record of architecting and building large-scale systems, enjoy solving intriguing system challenges at internet scale, are innovative at heart, and balance learning, organizing, and building with a drive to make an impact, this role might be a great fit for you!

What you’ll be doing:

Design & Infrastructure
- Contribute to postmortem culture by facilitating comprehensive, blameless post-incident reviews that identify root causes, contributing factors, and actionable remediation items. Track incident trends to identify systemic issues and prioritize reliability improvements.
- Implement chaos engineering practices to proactively identify failure modes, validate system resilience, and build confidence in recovery procedures. Conduct game days and disaster recovery exercises.
SRE Process & Principles Implementation
- Deploy and evolve SRE practices across the organization by establishing core SRE principles, frameworks, and methodologies. Define and implement service reliability practices, including Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Error Budgets, to balance innovation velocity with system reliability.
- Manage Error Budgets as a mechanism for making data-driven decisions about feature velocity vs. reliability. Track, report, and enforce error budget policies, facilitating conversations between engineering and product teams about risk tolerance and release decisions.
Reliability Engineering & Infrastructure
- Reduce toil through automation by identifying repetitive operational work and systematically eliminating it through infrastructure-as-code, automation frameworks, and intelligent tooling. Measure and track toil reduction efforts, aiming to keep toil below 50% of team time.
- Implement capacity planning processes that ensure systems have adequate headroom to meet SLOs during peak traffic, unexpected load spikes, and degraded states. Develop predictive models and automated scaling mechanisms.
Observability, Monitoring & Reporting
- Build comprehensive observability systems that provide deep visibility into service health, performance, and user experience. Implement monitoring strategies based on the Four Golden Signals (latency, traffic, errors, saturation) and USE/RED methodologies.
- Create SRE dashboards and reporting mechanisms that provide real-time visibility into SLO compliance, error budget consumption, and system reliability metrics. Develop executive-level reporting on reliability trends, incident impact, and improvement initiatives.
- Establish alerting strategies that are actionable, symptom-based, and aligned with SLOs. Reduce alert fatigue by tuning thresholds and eliminating noise while ensuring critical issues trigger appropriate responses.
Collaboration and Leadership
- Partner with development teams to implement reliability from the design phase using SRE principles. Conduct design reviews focused on failure modes, scalability, observability, and operational concerns. Guide teams in building services that meet SLO requirements.
- Collaborate through code reviews and design reviews, ensuring infrastructure-as-code, automation scripts, and reliability improvements follow best practices, are well-documented, and maintain high-quality standards.
- Manage project priorities using error budgets as a decision-making framework. Leverage agile methodologies while ensuring reliability work gets appropriate prioritization alongside feature development.
Operational Excellence & Continuous Improvement
- Identify and eliminate performance bottlenecks through detailed analysis of metrics, traces, and profiles. Optimize system resources, tune configurations, and implement auto-scaling to ensure SLO compliance during varying load conditions.
- Drive continuous improvement through SRE feedback loops by analyzing SLO violations, incident trends, and toil metrics to identify systemic improvements. Champion the reliability roadmap and advocate for technical debt reduction.
- Maintain a culture of documentation and knowledge sharing by creating comprehensive runbooks, operational guides, system architecture documentation, and disaster recovery procedures. Ensure operational knowledge is distributed across the team.
- Track and report on SRE metrics, including SLO compliance rates, error budget consumption, mean time to detection (MTTD), mean time to resolution (MTTR), toil percentage, and reliability improvement velocity.
On-call & Reliability
- Participate in a 12x7 on-call rotation and be available to work with global teams in the event of critical outages.

We’re excited if you have:

Preferably 8+ years of experience in DevOps/SRE roles, with demonstrated expertise in implementing SRE principles, SLO/SLI frameworks, and error budget policies in production environments.
Deep experience with observability and monitoring platforms such as Prometheus, Grafana, Datadog, New Relic, or equivalent, including experience building custom dashboards, alerts, and SLO-based monitoring.
Strong background in incident management, including experience as an Incident Commander, conducting blameless postmortems, and implementing systematic reliability improvements based on incident learnings.
Strong understanding of distributed systems and reliability engineering, including failure modes, fault tolerance patterns, circuit breakers, bulkheads, rate limiting, and graceful degradation strategies.
Experience with several of the following: Kubernetes, Docker, ECS, and service mesh technologies such as Istio, Envoy, Linkerd, or Solo.
Experience in cloud-focused software development, preferably in Go, Python, or other object-oriented programming languages.
Experience with Infrastructure as Code (IaC) tools such as Terraform, Ansible, or CloudFormation.
Experience with CI/CD automation, including GitLab pipelines and other related tools.
Strong hands-on experience with cloud platforms such as AWS, GCP, or Azure.
Proven track record of implementing scalable, high-performance infrastructure solutions in fast-paced, dynamic environments.
Demonstrated ability to communicate clearly with both technical and non-technical project stakeholders and to work effectively in a cross-functional team environment.
Self-driven and detail-oriented with the ability to understand complex distributed systems and identify reliability risks proactively.
Certifications in relevant technologies, such as Certified Kubernetes Administrator (CKA), AWS Certified DevOps Engineer, or Certified Information Systems Security Professional (CISSP), are preferred.
BS degree in Computer Science or equivalent.

Our Hybrid Work Approach

Roku fosters an inclusive and collaborative environment where teams work in the office Monday through Thursday. Fridays are flexible for remote work, except for employees whose roles require them to be in the office five days a week or who are based in offices with a five-day in-office policy.

Benefits

Roku is committed to offering a diverse range of benefits as part of our compensation package to support our employees and their families. Our comprehensive benefits include global access to mental health and financial wellness support and resources. Local benefits include statutory and voluntary benefits, which may include healthcare (medical, dental, and vision), life, accident, disability, commuter, and retirement options (401(k)/pension). Employees are supported in taking time off, in accordance with local leave policies, to meet their evolving work and life needs. Not every benefit is available in all locations or for every role. For details specific to your location, please consult with your recruiter.

Accommodations

Roku welcomes applicants of all backgrounds and provides reasonable accommodations and adjustments in accordance with applicable law. If you require reasonable accommodation at any point in the hiring process, please direct your inquiries to EmployeeRelations@Roku.com.

The Roku Culture

Roku is a great place for people who want to work in a fast-paced environment where everyone is focused on the company's success rather than their own. We try to surround ourselves with people who are great at their jobs, who are easy to work with, and who keep their egos in check. We appreciate a sense of humor. We believe a smaller number of very talented folks can do more for less cost than a larger number of less talented teams. We're independent thinkers with big ideas who act boldly, move fast, and accomplish extraordinary things through collaboration and trust. In short, at Roku you'll be part of a company that's changing how the world watches TV.

We have a unique culture that we are proud of. We think of ourselves primarily as problem-solvers, which itself is a two-part idea. We come up with the solution, but the solution isn't real until it is built and delivered to the customer. That penchant for action gives us a pragmatic approach to innovation, one that has served us well since 2002. 

To learn more about Roku, our global footprint, and how we've grown, visit https://www.weareroku.com/factsheet.

By providing your information, you acknowledge that you want Roku to contact you about job roles, that you have read Roku's Applicant Privacy Notice, and understand that Roku will use your information as described in that notice. If you do not wish to receive any communications from Roku regarding this role or similar roles in the future, you may unsubscribe at any time by emailing WorkforcePrivacy@Roku.com.