NVIDIA is hiring a
Senior Site Reliability Engineer, NeMo Services
NVIDIA is the leading artificial intelligence computing company, paving the way with innovations in generative AI, conversational AI, supercomputing, gaming and visualization. NVIDIA gives research institutions, cloud providers, large companies and start-ups the power and flexibility to develop and deploy breakthrough artificial intelligence systems.
As a Site Reliability Engineer, you will join an enthusiastic and dedicated site reliability engineering team working at the forefront of the latest science and technology trends. Working together with the NeMo development team, you will build and run large-scale, fault-tolerant systems and services able to run in any cloud. Are you passionate about infrastructure and looking for complex, meaningful problems? Are you ready to run the next generation of cloud services, and to design and code innovative solutions that address the needs of a whole organization? Then we are excited to have a motivated person like you!
What You Will Be Doing:
The NeMo Service team is responsible for building and deploying Generative AI services, including large language models and BioNeMo, our drug discovery cloud service. You will apply engineering leadership and deep knowledge of infrastructure and software development at scale to own the operation, adoption, and evolution of these services. You will lead by example, mentor the site reliability engineering and engineering teams, and establish credibility through quality technical execution, including hands-on contributions to code and automation to keep things running smoothly.
Design, implement and support large-scale Kubernetes clusters with monitoring, logging and alerting
Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation and refinement
Support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity management and launch reviews
Maintain services once they are live by measuring and monitoring availability, latency and overall system health
Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity
Practice sustainable incident response and blameless postmortems
Be part of an on-call rotation to support production systems
What We Need To See:
BS degree in Computer Science or a related technical field involving coding, physics or mathematics, or equivalent experience
Minimum of 3 years of relevant experience
Excellent interpersonal and written communication skills
Experience with algorithms, data structures, complexity analysis and software design
Experience in one or more of the following: Golang, Python, Node, C++, CUDA
Outstanding teammate who can collaborate and influence in a multifaceted environment
Ways To Stand Out From The Crowd:
Interest in crafting, analyzing and fixing large-scale distributed systems
Systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive
Ability to debug and optimize code and automate routine tasks
Experience in using or running large private and public cloud systems based on Kubernetes, OpenStack and Docker
You will also be eligible for equity and benefits.