The current execution time is horrible for any practical use case. I was wondering: why don't we simply plug in the functions already available in NVIDIA's NPP library? That wouldn't mean porting anything to CUDA ourselves, just calling existing functions. Is anyone working on a GPU port?