- Rule #1: No random writes to non-local memory
- Chunk the data and redistribute it, so that each core then sorts or otherwise works only on its local data.
- Rule #2: Only perform sequential reads on non-local memory
- This allows the hardware prefetcher to hide remote access latency.
- Rule #3: No core should ever wait for another
- Avoid fine-grained latching and synchronization barriers.
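
The three rules can be illustrated with a minimal sketch of a range-partitioned sort. This is not a NUMA implementation: Python threads give no control over memory placement, so the per-worker lists merely stand in for node-local memory. The function names (`scatter_local`, `gather_and_sort`, `partitioned_sort`), the integer-key assumption, and the equal-width range partitioning are all hypothetical choices made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def scatter_local(chunk, num_workers, lo, hi):
    """Phase 1: scan this worker's own chunk and append each key to a
    private per-target buffer. Every write lands in a buffer owned by
    this worker, so nothing is written randomly to remote memory (Rule #1)."""
    span = hi - lo + 1
    buffers = [[] for _ in range(num_workers)]
    for key in chunk:
        # Equal-width range partitioning (assumes integer keys).
        target = (key - lo) * num_workers // span
        buffers[target].append(key)
    return buffers


def gather_and_sort(worker_id, all_buffers):
    """Phase 2: read the buffers destined for this worker front to back
    (sequential reads of remote data, Rule #2), then sort locally."""
    local = []
    for buffers in all_buffers:
        local.extend(buffers[worker_id])
    local.sort()
    return local


def partitioned_sort(data, num_workers=4):
    """Sort integer keys using per-worker chunks and no locks (Rule #3)."""
    if not data:
        return []
    lo, hi = min(data), max(data)
    step = -(-len(data) // num_workers)  # ceiling division
    chunks = [data[i:i + step] for i in range(0, len(data), step)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        all_buffers = list(
            pool.map(lambda c: scatter_local(c, num_workers, lo, hi), chunks)
        )
        runs = list(
            pool.map(lambda w: gather_and_sort(w, all_buffers), range(num_workers))
        )
    # Range partitions are ordered, so concatenating the sorted runs
    # yields a globally sorted result without a merge step.
    return [key for run in runs for key in run]
```

Workers share no latches. The only coordination in this sketch is the coarse boundary between the scatter and sort phases, which is a simplification; a real system may avoid even that.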