My code (mostly) works and produces reasonable results for certain total numbers and batch sizes. But at other batch sizes I occasionally get memory issues etc.
Additionally, adding more threads generally slows the whole process down significantly.
Anything I'm doing wrong?
If the random number generation is parallelized, you should see a speedup for a moderate number of MPI processes. At higher process counts, and definitely once you need more than one node, you can expect some slowdown.
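As for the memory issues that show up only at some batch sizes: the code isn't shown, so this is only a guess, but a common cause is a total that doesn't divide evenly by the number of processes or by the batch size, so that some buffer is read or written past its end. A minimal sketch of splitting the work so that every rank and every batch gets an explicit length is below. The names `rank_share` and `batch_sizes` and the numbers in the demo are hypothetical, and the MPI calls themselves are left out.

```python
def rank_share(total, n_ranks, rank):
    """Number of items assigned to `rank` when `total` is split as evenly as possible."""
    base, extra = divmod(total, n_ranks)
    # The first `extra` ranks each take one additional item.
    return base + (1 if rank < extra else 0)


def batch_sizes(count, batch_size):
    """Yield batch lengths that cover `count` items; the last batch may be shorter."""
    full, rest = divmod(count, batch_size)
    for _ in range(full):
        yield batch_size
    if rest:
        yield rest


if __name__ == "__main__":
    # Hypothetical values, chosen so that nothing divides evenly.
    total, n_ranks, batch_size = 1000, 3, 128
    for rank in range(n_ranks):
        share = rank_share(total, n_ranks, rank)
        print(rank, share, list(batch_sizes(share, batch_size)))
```

If each process sizes its buffers from these lengths rather than assuming every batch is full, an uneven total or batch size should no longer run off the end of an array.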