NVIDIA is hiring a
Senior HPC Telemetry Infrastructure Engineer
NVIDIA has been continually redefining computer graphics, PC gaming, and accelerated computing for more than 25 years. It’s an outstanding legacy of innovation that’s fueled by great technology—and amazing people. Today, we’re tapping into the unlimited potential of AI to define the next era of computing. An era in which our GPU acts as the brains of computers, robots, and self-driving cars that can understand the world. Doing what’s never been done before takes vision, innovation, and the world’s best talent. As an NVIDIAN, you will be immersed in a diverse, encouraging environment where everyone is inspired to do their best work. Come join the team and see how you can make a lasting impact on the world!
We are looking for an outstanding telemetry infrastructure engineer to architect, deploy, and optimize the performance of datacenter systems and applications. Be a key contributor to the most exciting computing hardware and software, driving the latest breakthroughs in artificial intelligence and GPU computing. Provide insights on at-scale system design for collecting, visualizing, and acting on a wide variety of data. You will work with the latest accelerated computing and deep learning software and hardware platforms, and with many scientific researchers, developers, and customers, to craft state-of-the-art monitoring pipelines that enable key insights. You will interact with HPC, OS, GPU compute, and networking specialists to envision, develop, and bring up large-scale systems.
What you’ll be doing:
Provide engineering solutions that enable large-scale data storage and visualization for GPU computing products and software stacks, maintain technical relationships with internal and external engineering teams, and assist system architects and machine learning/deep learning engineers in building creative solutions based on NVIDIA technology. Be an internal reference for telemetry concepts and methodologies among the NVIDIA technical community. Test, evaluate, and benchmark new technologies and products, and work with vendors, partners, and peers to improve functionality and optimize performance.
What we need to see:
5+ years of experience designing and running monitoring systems in large datacenter/AI/HPC solutions.
Proven understanding of databases, dashboards, and alerting systems.
Experience using and installing Linux-based server platforms.
Python/Bash/Golang programming/scripting experience.
Experience working with engineering or academic research community supporting HPC or deep learning.
Strong teamwork and both verbal and written communication skills.
Ability to multitask efficiently in a very dynamic environment!
Action driven with strong analytical and troubleshooting skills.
Desire to be involved in multiple diverse and innovative projects.
BS in Engineering, Mathematics, Physics, or Computer Science (or equivalent experience). MS or PhD desirable.
Ways to stand out from the crowd:
Background with multiple monitoring stacks, such as Prometheus+Grafana, Elasticsearch+Kibana, Splunk, or Zabbix. Familiarity with newer and emerging monitoring products.
Experience deploying containerized services.
Experience designing, implementing, and running data backup systems.
Demonstrated work with open-source software: building, debugging, patching, and contributing code.
Experience tuning memory, storage, and networking settings for performance on Linux systems.
Exposure to scheduling and resource management systems.
The base salary range is $144,000 - $270,250. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. You will also be eligible for equity and benefits.
We have some of the most forward-thinking and hardworking people in the world working for us and, due to unprecedented growth, our world-class engineering teams are growing fast. If you are creative, curious, and motivated, with a real passion for technology, we want to hear from you!