Add support for `rsqrt()` and `rcp()`

On CUDA there are two versions of `rsqrt()`:
  - [`rsqrt()`](https://docs.nvidia.com/cuda/cuda-c-programming-guide/#standard-functions)
  - [`__frsqrt_rn()`](https://docs.nvidia.com/cuda/cuda-math-api/cuda_math_api/group__CUDA__MATH__INTRINSIC__SINGLE.html#_CPPv411__frsqrt_rnf)

`rsqrt()` is accurate to within 2 ULPs in single precision, while `__frsqrt_rn()` is a single-precision intrinsic that rounds to nearest, so its result should be correctly rounded.
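
For reviewers who want to check the accuracy difference, here is a minimal host-side C++ sketch of how the ULP error of a reciprocal square root could be measured. All helper names (`ordered_bits`, `ulp_distance`, `rsqrt_reference`, `rsqrt_candidate`, `max_ulp_error`) are hypothetical and not part of this change. It assumes that a double-precision evaluation rounded once to `float` is a good enough stand-in for the correctly rounded result, and `rsqrt_candidate` is only a placeholder for the device call under test (`rsqrtf()` or `__frsqrt_rn()`).

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical helper: map a float's bit pattern onto an ordered integer line
// so that adjacent representable floats differ by exactly 1 (and +0 == -0).
inline std::int64_t ordered_bits(float f) {
    std::int32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    return bits < 0 ? static_cast<std::int64_t>(INT32_MIN) - bits
                    : static_cast<std::int64_t>(bits);
}

// Distance between two finite floats, in units in the last place.
inline std::int64_t ulp_distance(float a, float b) {
    const std::int64_t d = ordered_bits(a) - ordered_bits(b);
    return d < 0 ? -d : d;
}

// Reference value: evaluate in double, then round once to float.
// Assumption: rare double-rounding cases are ignored in this sketch.
inline float rsqrt_reference(float x) {
    return static_cast<float>(1.0 / std::sqrt(static_cast<double>(x)));
}

// Placeholder for the function under test. In device code this would be
// rsqrtf(x) or __frsqrt_rn(x); here it is a plain host computation.
inline float rsqrt_candidate(float x) {
    return 1.0f / std::sqrt(x);
}

// Largest ULP distance between candidate and reference over n inputs.
inline std::int64_t max_ulp_error(const float* xs, int n) {
    std::int64_t worst = 0;
    for (int i = 0; i < n; ++i) {
        const std::int64_t d =
            ulp_distance(rsqrt_candidate(xs[i]), rsqrt_reference(xs[i]));
        if (d > worst) worst = d;
    }
    return worst;
}
```

To compare the two CUDA variants, one could compute the candidate values in a kernel, copy them back to the host, and feed them through `ulp_distance` against `rsqrt_reference`.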