Manager, Site Reliability Operations
- Location SUNNYVALE, CA
- Career Area Software Development and Engineering
- Job Function Software Development and Engineering
- Requisition 1025705BR
What you'll do at
You're right for the job if you are comfortable handling major incident response and leading a technical team of engineers to resolve incidents and restore service across complex distributed architectures. You will work cross-functionally with a variety of teams and be a core contributor to every significant engineering service or solution that we deliver to our stakeholders. You'll excel if you have enthusiasm for digging deep and a flair for sharp technical communication, prioritization and organization. You will work directly with our SRE and DevOps teams to manage our next-generation “always up” cloud-based e-commerce platform.
- Deep understanding of incident management processes and procedures.
- Focus on internal and external customer requirements (SLAs and KPIs).
- Demonstrate advanced understanding of business processes being supported by assigned system(s)
- Develop clear tactical and strategic goals for the SRO related to function, capabilities and capacities.
- Recommend improvements to situational awareness and to alerting on potential business impacts, whether they stem from internal or external influences.
- Responsible for immediate coordinated response of critical incidents to reduce impact and increase availability.
- Responsible for leadership and communications between the business customer and technology teams.
- Identify and recommend processes or system enhancements for the SRO.
- Leads the resolution of high-complexity incidents as required.
- Manages the analysis, communication and resolution of incidents.
- Manages others in researching and recommending alternative actions for incident resolution.
- Analyze trends to proactively prevent incidents and to provide historical summary reports.
- Mentor and grow talent within your team to build a best-in-class SRO function.
- Calm under pressure while orchestrating major incident response for mission-critical systems.
- Function as part of a global SRO management team to deliver continuous improvement.
- Excellent communication and stakeholder management skills.
- Technically strong within infrastructure or software engineering.
- Ability to assess system impact and formulate accurate problem statements to distribute across the management and technical communities.
Additional responsibilities may include:
- Develop a deep understanding of the various services and applications that come together to deliver Walmart e-commerce products
- Monitor and discover failures/issues in a timely fashion and work with engineers to identify root cause and fix issues
- Perform root-cause analysis of complex problems involving multiple parties, networks, hardware, software and cloud technologies.
- Strong focus on collecting metrics and drawing inferences from them.
- Identify and drive the automation of systems that maintain system health.
- Drives standardization and service-focused instrumentation to resolve break/fix scenarios, engaging broader teams where necessary. Contributes to command-and-control activities focused on restoration of complex outages. May work independently or as part of a team on more complex projects. Provides mentoring and guidance to more junior team members.
- Networking responsibilities: Understands and performs packet captures with tcpdump, snoop, and other network sniffers. Understands and applies knowledge of most protocols (TCP/IP, HTTP, UDP, etc.).
- Application Technologies: Provides recommendations and advice to the team and/or department in the areas of web services, OS, and storage, including being an active liaison to Development, QA and the Business.
- Analyzes systems and makes recommendations to prevent possible incidents using knowledge of complex and company-wide systems.
- Lead end-to-end audit of monitors and alarms based on subsystem knowledge.
- Utilizes time management and project management skills to lead the resolution of incidents in a timely and organized manner, effectively communicating necessary information. May consult directly with developers or third-party vendors; provides subject matter expertise.
- Experience in leading and troubleshooting service-impacting incidents across large-scale enterprise systems.
- Methodical and systematic problem solving approach, combined with a solid awareness of ownership, initiative and drive.
- Experience directing and leading a team in high-pressure situations while delivering clear and concise communication to partners and stakeholders.
- Experience with command-and-control tools in a production environment.
- Networking knowledge and understanding of network concepts, such as different protocols (TCP/IP, UDP, ICMP, etc.), MAC addresses, IP packets, DNS, OSI layers, and load balancing.
- Understanding of Unix/Linux systems from kernel to shell and beyond, taking in system libraries, file systems, and client-server protocols along the way. Experience administering Linux systems in a production environment.
- Programming experience in one or more of the following languages: Go, Java, Python, Ruby, Shell
- Bachelor's Degree in Computer Science or a related field, or relevant work experience
- Experience with cloud technologies such as AWS, Azure, and OpenStack.
- Experience with enterprise monitoring solutions like AppDynamics, New Relic, Prometheus, Graphite, Nagios, Sensu, Splunk, Grafana and Graylog.
About Walmart Labs
Hello, Silicon Valley
You don’t have to choose between your career and your lifestyle in Silicon Valley. Here, you can have both.
Discover Silicon Valley
Filoli Gardens, Woodside
View an art exhibit, take a nature hike, explore the historic Filoli House, or take a class at this gorgeous 654-acre property.
Get your art fix at this internationally recognized collection of over 30,000 works of modern and contemporary art.
Computer History Museum
Large-scale exhibits, an acclaimed speaker series, docent-led tours and an award-winning education program bring computer history to life.
Hike or jog throughout the year on terrain dedicated to academic programs, environmental restoration and habitat conservation.
Golden Gate Park, SF
Events, attractions, meadows, lakes, and a Japanese Tea Garden provide for a true escape, without leaving the city.
The Tech Museum
This family-friendly interactive science and technology center in San Jose provides a glimpse into the most inventive place on Earth — Silicon Valley.
Santana Row - San Jose
Stylish boutiques, world-class shopping, and delectable cuisine = a San Jose shopping trifecta.
Pacifica State Beach
Learn to surf or visit the “World’s Most Scenic Taco Bell” at this 0.75-mile-long, crescent-shaped escape, a symbol of successful habitat restoration.
Golden Gate Cemetery
This national cemetery comprises 161 acres dedicated to all the members of the armed forces who served our country.
All the benefits you need for you and your family
- 100% coverage for in-network preventive care
- Retirement Plan
- Vision Plans
- Dental Plans
- Exclusive Discounts