Simply adding a dimension to the variables we want to replicate (to emulate thread-local storage) could cause a stack overflow for large block sizes. Is malloc the only solution? In most cases we could also spawn more CPU threads, each handling a smaller block, but that is not a general-purpose solution. The latter approach might need a "safeness" translation tag.
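To make the trade-off concrete, here is a minimal sketch in C, assuming the translated kernel runs one block as a loop over emulated threads. All names (`kernel_body`, `run_block_stack`, `run_block_heap`, `SMALL_BLOCK`) are hypothetical and only illustrate the idea; they are not taken from the actual translator. The variable `tmp` stands for a variable that was thread-local in the original kernel and has gained one extra dimension indexed by thread ID.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical translated kernel body. `tmp` is the replicated variable:
 * one slot per emulated thread, indexed by the added dimension `tid`. */
static void kernel_body(float *tmp, const float *in, float *out,
                        size_t block_size)
{
    assert(tmp != NULL && in != NULL && out != NULL);

    /* Phase 1: every emulated thread writes its private copy. */
    for (size_t tid = 0; tid < block_size; ++tid)
        tmp[tid] = in[tid] * 2.0f;

    /* Phase 2 (after what was a barrier): each thread reads its copy back. */
    for (size_t tid = 0; tid < block_size; ++tid)
        out[tid] = tmp[tid] + 1.0f;
}

/* Stack variant: the replicated variable lives on the stack, so it is only
 * usable when the block size is small. SMALL_BLOCK is an assumed limit. */
#define SMALL_BLOCK 64

int run_block_stack(const float *in, float *out, size_t block_size)
{
    float tmp[SMALL_BLOCK];

    if (block_size > SMALL_BLOCK)
        return -1; /* refuse rather than risk overflowing the stack */
    kernel_body(tmp, in, out, block_size);
    return 0;
}

/* Heap variant: the replicated variable is malloc'ed, so its size is
 * limited by available memory rather than by the stack size. */
int run_block_heap(const float *in, float *out, size_t block_size)
{
    float *tmp = malloc(block_size * sizeof *tmp);

    if (tmp == NULL)
        return -1;
    kernel_body(tmp, in, out, block_size);
    free(tmp);
    return 0;
}
```

The stack variant avoids the allocation cost but has to reject large blocks, which is why it would need some kind of safety check or tag. The heap variant works for any block size that fits in memory.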