For image sizes that are not a multiple of the tile size (16) in each direction, my GPU code is probably making out-of-bounds memory accesses, which would explain why the CPU and GPU results do not match. Help me resolve this.
I will provide the code. If you are an expert, you should be able to solve it in 15 minutes.
Budget: INR 1200 only
Deadline: 1 hour
I have been working on image processing algorithm development in C/C++ for the last 8 years and with CUDA for the last 4 years. Can we discuss this over chat? I can debug it ASAP.
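Without seeing the actual code, the usual cause of this symptom is that the grid is rounded up to cover the whole image but the kernel never checks whether a thread falls outside it. Below is a minimal sketch of the common fix, assuming a row-major 8-bit image with no padding. The kernel `invert`, the helper `div_up`, and the 37x21 test size are hypothetical and stand in for the real code.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int TILE = 16;  // tile size mentioned in the post

// Ceiling division: number of tiles needed to cover n pixels.
__host__ __device__ constexpr int div_up(int n, int d) { return (n + d - 1) / d; }

// Hypothetical kernel (inverts an 8-bit image); stands in for the real one.
__global__ void invert(const unsigned char* in, unsigned char* out,
                       int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    // Threads in the partial tiles on the right and bottom edges
    // lie outside the image and must not touch memory.
    if (x >= width || y >= height) return;

    int idx = y * width + x;  // assumes row-major layout, pitch == width
    out[idx] = static_cast<unsigned char>(255 - in[idx]);
}

int main()
{
    const int width = 37, height = 21;  // deliberately not multiples of 16
    const size_t n = static_cast<size_t>(width) * height;

    std::vector<unsigned char> h_in(n), h_gpu(n);
    for (size_t i = 0; i < n; ++i) h_in[i] = static_cast<unsigned char>(i % 256);

    unsigned char *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n);
    cudaMalloc(&d_out, n);
    cudaMemcpy(d_in, h_in.data(), n, cudaMemcpyHostToDevice);

    // Round the grid UP so the edge pixels are covered by partial tiles.
    dim3 block(TILE, TILE);
    dim3 grid(div_up(width, TILE), div_up(height, TILE));
    invert<<<grid, block>>>(d_in, d_out, width, height);

    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaMemcpy(h_gpu.data(), d_out, n, cudaMemcpyDeviceToHost);

    // Compare against a CPU reference.
    size_t mismatches = 0;
    for (size_t i = 0; i < n; ++i)
        if (h_gpu[i] != static_cast<unsigned char>(255 - h_in[i])) ++mismatches;
    std::printf("mismatches: %zu\n", mismatches);

    cudaFree(d_in);
    cudaFree(d_out);
    return mismatches == 0 ? 0 : 1;
}
```

If the real kernel uses shared-memory tiles with `__syncthreads()`, an early `return` before the barrier is not safe. In that case the guard should instead wrap only the global loads and stores, so every thread in the block still reaches the barrier. Running the program under `compute-sanitizer` is a quick way to confirm whether any out-of-bounds accesses remain.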