2011246c17
- add cudaDeviceSynchronize() after every kernel launch
- fix small address bug in cudaMemcpy when a host array is used
- in parallel test cases, replace fixed thread count with a variable
- rework shared memory kernel