NVIDIA is hiring a
Senior DevOps Engineer
NVIDIA is looking for a world-class engineer to join its multifaceted Infrastructure, Planning and Processes organization as a Senior DevOps and SRE Engineer. You will be part of a fast-paced team that develops and maintains sophisticated build and test environments for a multitude of hardware platforms (both NVIDIA GPUs and Tegra processors) and various operating systems (Windows/Linux). The team works with other business units within NVIDIA Software, such as Graphics Processors, Mobile Processors, Deep Learning, Artificial Intelligence, Robotics and Driverless Cars, to meet their infrastructure and systems needs.
What you'll be doing:
Monitoring and supporting critical, high-performance, large-scale services running on a farm of 10,000+ hosts.
Ensure more than 95% availability for the build and test farms.
Participate in triaging and resolving complex build and test infrastructure issues.
Collaborate with our other engineering teams to expose any defects and constraints.
Collaborate with software development teams to deliver reliable, robust, and high-performance underlying infrastructure.
Perform root cause analysis and implement corrective actions for persistent, user-impacting issues.
Implementing high availability infrastructure and disaster recovery solutions.
Perform large-scale deployments across multiple Kubernetes and ESXi clusters to support CI/CD pipelines for NVIDIA products.
Design and implement monitoring solutions to gain more insight into application and system health. Implement critical metrics using various analytics methods and dashboards.
Craft and develop tools needed for automating workflows.
Take part in prototyping, crafting, and developing cloud infrastructure for NVIDIA.
What we need to see:
Strong background in Terraform for orchestrating VMs on ESXi and KVM hypervisors.
Proficient with configuration management tools like Chef and Puppet, and with source code management and binary repository systems like GitLab, GitHub, and Artifactory.
Experience creating/managing repositories for hosting binaries/libraries at scale.
Strong background in Linux and Windows Administration.
Experience with tools like QEMU, Packer, and WinPE is a plus.
Strong background with GitLab, Jenkins and/or other CI/CD systems.
Proficient with Kubernetes administration, Docker, and virtualization. Knowledge of standard methodologies related to security.
Proficient with data analytics/visualization & monitoring tools like Kibana, Grafana, Splunk, Zabbix, Prometheus and/or similar systems.
Solid programming background in Python, shell, and/or similar scripting languages.
Experience maintaining cloud infrastructure and highly available production environments.
Strong background in Docker, containerization, and managing large-scale container/pod deployments for Kubernetes clusters.
Excellent debugging, problem-solving, and analytical skills.
Strong understanding of architectural requirements and development processes involved in building reliable, robust, scalable data products and pipelines.
Demonstrable experience working with large-scale enterprise production systems.
5+ years of proven experience.
Bachelor’s or Master’s degree in Computer Science, Software Engineering, or equivalent experience.
Ways to stand out from the crowd:
Solid understanding of containerization, microservices architecture, and Kubernetes (K8s).
Knowledge of Java-based applications.
Thrives in a multi-tasking environment with constantly evolving priorities.
Ability to break complex problems into simple subproblems and reuse available solutions to solve most of them. Ability to design simple systems that work efficiently without needing much support.
Prior experience on a large-scale operations team.
Outstanding interpersonal skills and the ability to communicate with all levels of management.