NVIDIA offers $1,000 bounty to fix RTX 5090 reset bug

CloudRift, a cloud GPU service provider, has offered a $1,000 (approximately 148,000 yen) bounty to resolve the issue, claiming that there is a reproducible bug in NVIDIA's GPUs, the
Bug Bounty: NVidia Reset Bug | CloudRift Blog
https://www.cloudrift.ai/blog/bug-bounty-nvidia-reset-bug

Nvidia RTX 5090 reset bug prompts $1,000 reward for a fix — cards become completely unresponsive and require a reboot after virtualization reset bug, also impacts RTX PRO 6000 | Tom's Hardware
According to CloudRift, the Blackwell-equipped RTX Series 2 products have a bug that causes them to become unresponsive when used in a virtual machine.
This bug occurs after passing a GPU to a virtual machine using KVM and VFIO. When the guest OS shuts down or the GPU is reassigned, the host issues a PCIe Function Level Reset (FLR), a standard procedure for cleaning up pass-through devices. However, the GPU does not return to a normal state and becomes unresponsive. It is not ready 65,535 milliseconds after the FLR, and the process is abandoned. According to CloudRift, the only way to recover is to power cycle the GPU.

CloudRift points out that this issue is specific to the RTX 5090 and RTX PRO 6000, as it does not occur with older generation models such as the RTX 4090. Threads on the Proxmox forum and the Level1Techs community suggest that general users and early buyers of the RTX 5090 are also experiencing the same issue. NVIDIA has not officially acknowledged this issue, and no workarounds exist.
CloudRift has offered a $1,000 reward to anyone who can provide a valid mitigation or fix, and has asked for help in finding a solution to the problem.
Related Posts: