Why does running a parallel job on multicore (dual- or quad-core) machines not perform as well as running the same job on an equivalent number of single-core machines?




The failure to achieve linear scaling on multicore machines is related to the machine configuration, not the OS.

The degree of speedup depends on which CPUs/nodes are being used, for example:

2-socket/2-core over 4 machines?
1 socket/4 core over 2 machines?


The type of configuration will directly affect performance. 11.0 should generally be faster than 10.0.

The same behavior can be observed on Windows as well as Linux. For some chip/bus speed combinations you may or may not get linear scaling.

It is possible to get good scale-up with only 2 cores when the chip speed and bus speed are closely matched. With more cores, performance starts to degrade because the pipe into memory saturates more easily as more cores try to pull data from main memory.

Increase the core clock speed and the problem gets worse because each core is trying to pull the data through faster.

Add more cores AND increase the core clock speed and it's even worse yet.

The hardware manufacturers try to compensate by increasing the bus speed, which can help a bit. However, suppose the bus speed is increased by 30% while the potential increase in bandwidth demand is a factor of "CPU speed increase * core count increase". For example, going from 2 to 3 GHz (a factor of 1.5) and from 2 to 4 cores (a factor of 2) creates a factor of 3 additional demand on memory bandwidth.

In the above example, to maintain linear scaling, the bus/memory speed would also need to increase by a factor of 3. So far, there is no memory available that can feed data at 4 GHz.
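The arithmetic in this example can be checked with a short script. This is only a rough sketch: the function name bandwidth_demand_factor is hypothetical, and it assumes that bandwidth demand grows in direct proportion to both clock speed and core count, which is the simplification used in the example above.

```python
def bandwidth_demand_factor(old_ghz, new_ghz, old_cores, new_cores):
    """Potential growth in memory bandwidth demand.

    Assumes demand scales linearly with clock speed and with core count
    (a simplification taken from the example, not a measured model).
    """
    return (new_ghz / old_ghz) * (new_cores / old_cores)


# Values from the example: 2 GHz -> 3 GHz and 2 cores -> 4 cores.
demand = bandwidth_demand_factor(2.0, 3.0, 2, 4)

# The bus speed is increased by 30%.
bus_gain = 1.30

# How far demand outruns the available bandwidth after the bus upgrade.
shortfall = demand / bus_gain

print(f"demand factor: {demand}")
print(f"bus gain:      {bus_gain}")
print(f"shortfall:     {shortfall:.2f}x")
```

A shortfall above 1 means the cores can ask for more data than the bus can deliver, so linear scaling is not possible under these assumptions.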

For multicore CPUs, this is going to be an issue for some time.


Performance can be improved by doing the following:
************************************************

1. For multidomain meshes with GGI, try running 11.0 with Multidomain Option = Coupled. Set this on the partitioner advanced controls tab in the Solver Manager or use the "-part-coupled" command line flag.

2. Also try running this case with `run with control volume sector weighting` activated. This is an expert parameter: 'part cvs weighting = t'.

3. You can also manually lock processes to specific CPUs after the job has started by using the Task Manager (not applicable for batch runs). This can help keep the OS from moving processes between cores.
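Item 3 describes the Task Manager on Windows. On Linux, a similar effect can be sketched with Python's standard os.sched_setaffinity call. This is an illustration only, not part of the solver: the helper name pin_to_cpus is hypothetical, and the process ID and CPU numbers would have to be chosen for the actual machine and job.

```python
import os


def pin_to_cpus(pid, cpus):
    """Restrict process `pid` (0 means the calling process) to the given CPU ids.

    Returns the resulting affinity set, or None on platforms where
    os.sched_setaffinity is not available (for example Windows or macOS).
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    os.sched_setaffinity(pid, cpus)
    return os.sched_getaffinity(pid)


if __name__ == "__main__":
    # Hypothetical usage: pin this script to the lowest-numbered CPU it may use.
    if hasattr(os, "sched_getaffinity"):
        allowed = os.sched_getaffinity(0)
        print(pin_to_cpus(0, {min(allowed)}))
    else:
        print("CPU affinity control is not available on this platform")
```

To pin a running solver process instead of the script itself, pass that process's ID as pid; this requires permission to change the affinity of that process.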




