Principal Site Reliability Engineer
- Location SUNNYVALE, CA
- Career Area Software Development and Engineering
- Job Function Software Development and Engineering
- Requisition 1029669BR
What you'll do
You're right for the job if you're comfortable with deep technical topics in Linux, networking, and distributed architectures. You will work cross-functionally with a variety of teams and be a core contributor to every significant engineering service or solution that we deliver to our stakeholders. You'll excel if you have enthusiasm for digging deep and a flair for sharp technical communication, prioritization, and organization. You will work directly with our Software Engineering teams to build our next-generation “always up” cloud-based e-commerce platform.
Site Reliability Engineers are hybrid systems and software engineers who take ownership of reliability, scalability, automation, and other issues related to the uptime and availability of Walmart’s e-commerce platform. Our goal is to build, scale, and guard the systems that delight our customers.
o Engender reliability and availability starting with metrics and measurements
o Enable scaling by providing tools, developing training and/or augmenting processes
o Build tools and automation to prevent recurrence of problems in mission-critical products and services.
- Augment existing instrumentation to build a cohesive picture of the characteristics of our systems with special attention to points of failure.
- Participate in capacity planning, demand forecasting, software performance analysis and system tuning.
- Develop a deep understanding of the various services and applications that come together to deliver Walmart e-commerce products
- Design new monitoring tools and smart alerts that help discover failures and issues in a timely fashion, and work with engineers to identify root causes and fix issues.
- Influence, design and create new architectures, standards and methods for large-scale enterprise systems.
- Perform root-cause analysis of complex problems involving multiple parties, networks, hardware, and software that relate to scaling and performance.
- Participate in on-call rotation.
- Secure the system from issues, be they real, perceived or notional
- Maintain a strong focus on collecting metrics and drawing inferences from them
- Experience with configuration management tools such as Ansible, Saltstack, Chef and Puppet
- Build and drive the automation systems that maintain system health
- Eliminate single points of failure and regularly test disaster recovery and high availability (HA).
- 12+ years in a software development, DevOps, or SRE role.
- Experience in designing, investigating, analyzing and troubleshooting large-scale enterprise systems.
- Methodical and systematic problem-solving approach, combined with a strong sense of ownership, initiative, and drive.
- Fluency in running services at scale; in-depth understanding of Unix system internals and networking.
- Networking knowledge and in-depth understanding of network concepts, such as protocols (TCP/IP, UDP, ICMP, etc.), MAC addresses, IP packets, DNS, OSI layers, and load balancing.
- Understanding of Unix/Linux systems from kernel to shell and beyond, taking in system libraries, file systems, and client-server protocols along the way. Experience administering Linux systems in a production environment.
- Programming experience in one or more of the following languages: Go, Java, Python, Ruby, Shell
- Bachelor's Degree in Computer Science or a related field, or relevant work experience
- Experience with distributed version control like Git or similar
- Experience with IaaS and PaaS providers such as AWS, Azure, and OpenStack
- Experience with enterprise monitoring solutions like AppDynamics, New Relic, Prometheus, Graphite, Nagios, Sensu and Splunk
- Familiarity with continuous integration/deployment processes and tools such as Jenkins, Maven, Nexus, etc.
- Operating System Internals
- Networking and Networking Internals
About Walmart Labs
Hello, Silicon Valley
You don’t have to choose between your career and your lifestyle in Silicon Valley. Here, you can have both.
Discover Silicon Valley
Filoli Gardens, Woodside
View an art exhibit, take a nature hike, explore the historic Filoli House, or take a class at this gorgeous 654–acre property.
Get your art fix at this internationally recognized collection of over 30,000 works of modern and contemporary art.
Computer History Museum
Large-scale exhibits, an acclaimed speaker series, docent-led tours and an award-winning education program bring computer history to life.
Hike or jog throughout the year on terrain dedicated to academic programs, environmental restoration and habitat conservation.
Golden Gate Park, SF
Events, attractions, meadows, lakes, and a Japanese Tea Garden provide for a true escape, without leaving the city.
The Tech Museum
This family-friendly interactive science and technology center in San Jose provides a glimpse into the most inventive place on Earth — Silicon Valley.
Santana Row - San Jose
Stylish boutiques, world-class shopping, and delectable cuisine = a San Jose shopping trifecta.
Pacifica State Beach
Learn to surf or visit the “World’s Most Scenic Taco Bell” at this 0.75-mile-long, crescent-shaped escape, a symbol of successful habitat restoration.
Golden Gate Cemetery
This national cemetery comprises 161 acres dedicated to all the members of the armed forces who served our country.
All the benefits you need for you and your family
- 100% coverage for in-network preventive care
- Retirement Plan
- Vision Plans
- Dental Plans
- Exclusive Discounts