CUDA and/or OpenCL usage for DCT and other functions?

The current execution time is too slow for any practical use case. Why don't we simply plug in the functions already available in NVIDIA's NPP library? That wouldn't mean porting anything to CUDA, just calling the existing functions. Is anyone working on a GPU port?