Posted Aug 2

Nvidia is hiring a
Senior Site Reliability Engineer, NeMo Services

US, CA, Santa Clara • US, Remote
Full time

NVIDIA is the leading artificial intelligence computing company and is paving the way with innovations in generative AI, conversational AI, supercomputing, gaming and visualization. NVIDIA gives research institutions, cloud providers, large companies and start-ups the power and flexibility to develop and deploy breakthrough artificial intelligence systems.

As a Site Reliability Engineer, you will join an enthusiastic and dedicated site reliability engineering team working at the forefront of the latest science and technology trends. Working together with the NeMo development team, you will build and run large-scale, fault-tolerant systems and services able to run in any cloud. Are you passionate about infrastructure and looking for complex, meaningful problems to solve? Are you ready to run the next generation of cloud services and to design and code innovative solutions that address the needs of a whole organization? Then we are excited to have a motivated person like you!

What You Will Be Doing:

The NeMo Service team is responsible for building and deploying Generative AI services, including large language models and BioNeMo - our drug discovery cloud service. You will apply engineering leadership and deep knowledge of infrastructure and software development at scale to own the operation, adoption, and evolution of these services. You will lead by example, mentor the site reliability engineering and engineering teams, and establish credibility through quality technical execution, including hands-on contributions to code and automation to keep things running smoothly.

  • Design, implement and support large-scale Kubernetes clusters with monitoring, logging and alerting

  • Engage in and improve the whole lifecycle of services—from inception and design, through deployment, operation and refinement

  • Support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity management and launch reviews

  • Maintain services once they are live by measuring and monitoring availability, latency and overall system health

  • Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity

  • Practice sustainable incident response and blameless postmortems

  • Be part of an on-call rotation to support production systems

What We Need To See:

  • BS degree in Computer Science or a related technical field involving coding, physics or mathematics, or equivalent experience

  • Minimum of 3 years of relevant experience

  • Excellent interpersonal and written communication skills

  • Experience with algorithms, data structures, complexity analysis and software design

  • Experience in one or more of the following: Golang, Python, Node, C++, CUDA

  • Outstanding teammate who can collaborate and influence in a multifaceted environment

Ways To Stand Out From The Crowd:

  • Interest in crafting, analyzing and fixing large-scale distributed systems

  • Systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive

  • Ability to debug and optimize code and automate routine tasks

  • Experience in using or running large private and public cloud systems based on Kubernetes, OpenStack and Docker

The base salary range is $144,000 - $270,250. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.
