Site Reliability Operations Engineer
- Location SUNNYVALE, CA
- Career Area Information Technology
- Job Function Information Technology
- Requisition 1181080BR
What you'll do
You're right for the job if you are comfortable contributing to major incident response in a technical team of engineers laser-focused on restoring service across complex distributed architectures. You'll excel if you have enthusiasm for digging deep and a flair for sharp technical communication, prioritization and organization. You will work directly with our SRE, Engineering and DevOps teams to support our next-generation “always up” cloud-based e-commerce platform.
The SRC Site Reliability Operations Engineer is responsible for proactively monitoring, detecting and resolving site issues before they impact customers and availability. Technically, you will understand the full end-to-end stack and use this knowledge to detect errors and failures and take corrective action to mitigate them. During a major incident, you will draw on your technical skills and knowledge to triage, differentiating between symptom and cause, to help resolve service-impacting issues. Your ability to continuously challenge yourself and develop a strong network within your peer group will see you excel in this role. Our goal is to protect the customer experience and deliver outstanding levels of availability. To do so, you will need strong skills in the following areas:
- Understanding of incident management processes and procedures.
- Calm under pressure when participating in major incident response.
- Technical understanding of core infrastructure, cloud services, platforms and micro-services.
- Ability to understand and capture key data from logs.
- Ability to understand traffic flows and key dependencies between services.
- Ability to effectively triage – be able to detect and determine symptom vs cause.
- Detect and quantify impact.
- Analyze trends to proactively prevent incidents.
- Focus on immediate restoration vs root cause.
- Research and recommend alternative actions for incident resolution – Develop procedures and documentation to support this.
- Create and maintain procedural documentation.
- Identify and drive continuous improvement efforts to reduce waste (eliminate, automate or streamline).
- Absorb knowledge and understand complex distributed systems - ability to share and impart this knowledge into your peer group and beyond.
- Build tools to improve visibility, proactively detect issues and restore system availability.
- Develop automation and self-healing with DevOps, Engineering and SRE partners.
- Strong focus on collecting and inferring metrics.
- Clear communication skills.
- Ability to contribute to multiple incidents at any given time.
- Analyze systems and make recommendations to prevent possible problems. Take the lead on issue resolution activities using knowledge of complex, company-wide systems.
- Scripting and software development to automate and help enhance existing solutions.
Additional responsibilities may include:
- Actively provide data for and participate in root cause analysis.
- Adhere to SRC onboarding process when accepting new systems into service.
- Share knowledge globally between SRC teams.
- Analyze systems and make recommendations to prevent possible incidents.
- Strive for continuous improvement and make recommendations based on SRC process.
- Other duties and responsibilities as assigned.
- Bachelor's Degree in Computer Science or a related field, or relevant work experience.
- Strong and demonstrable incident management skills with relevant experience in an enterprise organization.
- Experience and exposure working in a 24/7 operations support environment.
- Methodical and systematic problem-solving approach, combined with a solid sense of ownership, initiative and drive.
- Experience investigating, analyzing and troubleshooting large scale enterprise systems.
- Networking knowledge and understanding of network concepts, such as different protocols (TCP/IP, UDP, ICMP, etc.), MAC addresses, IP packets, DNS, OSI layers, and load balancing.
- Programming experience in one or more of the following languages: Go, Java, Python, Ruby, Shell.
- Experience administering Unix/Linux in a production environment.
- Understanding of Unix/Linux systems from kernel to shell and beyond, taking in system libraries, file systems, and client-server protocols along the way.
- Experience working with and developing enterprise monitoring/tooling solutions like Grafana, Kibana, Splunk, Graphite, Nagios, New Relic, Graylog and HPOM.
- Working knowledge of one or more cloud technologies such as AWS, Azure, or OpenStack.
About Walmart Labs
Hello, Silicon Valley
You don’t have to choose between your career and your lifestyle in Silicon Valley. Here, you can have both.
Discover Silicon Valley
Filoli Gardens, Woodside
View an art exhibit, take a nature hike, explore the historic Filoli House, or take a class at this gorgeous 654-acre property.
Get your art fix at this internationally recognized collection of over 30,000 works of modern and contemporary art.
Computer History Museum
Large-scale exhibits, an acclaimed speaker series, docent-led tours and an award-winning education program bring computer history to life.
Hike or jog throughout the year on terrain dedicated to academic programs, environmental restoration and habitat conservation.
Golden Gate Park, SF
Events, attractions, meadows, lakes, and a Japanese Tea Garden provide for a true escape, without leaving the city.
The Tech Museum
This family-friendly interactive science and technology center in San Jose provides a glimpse into the most inventive place on Earth — Silicon Valley.
Santana Row - San Jose
Stylish boutiques, world-class shopping, and delectable cuisine = a San Jose shopping trifecta.
Pacifica State Beach
Learn to surf or visit the “World’s Most Scenic Taco Bell” at this 0.75-mile-long, crescent-shaped escape, a symbol of successful habitat restoration.
Golden Gate Cemetery
This national cemetery comprises 161 acres dedicated to all the members of the armed forces who served our country.
All the benefits you need for you and your family
- 100% coverage for in-network preventive care
- Retirement Plan
- Vision Plans
- Dental Plans
- Exclusive Discounts