Locality-Flexible and Cancelable Tasks for the APGAS Library

Since today's HPC clusters typically consists of multicore machines, modern parallel programs should simultaneously exploit both node-internal shared-memory and inter-node distributed-memory parallelism.

This work proposes a novel hybrid work stealing scheme, which maps tasks dynamically and transparently to any resource in the overall system, so that the workload is balanced over both nodes and cores. The scheme combines a lifeline-based variant of distributed task pools with Java’s Fork/Join framework for node-internal load balancing.

We implemented our scheme by extending the APGAS library for Java, which adds asynchronous to the Partitioned Global Address Space (PGAS) model. Therefore, tasks can be spawned dynamically on user-defined places, synchronous or asynchronous, i.e., the parent task either waits, or does not wait, for the termination of a child.

With our new extension, programmers can now submit locality-flexible and cancelable tasks through a new asyncAny construct. finishAsyncAny blocks until all spawned asyncAny-tasks within are finished. mergeAsyncAny adds a passed task result to the worker result. reduceAsyncAny reduces all partial results to a global result. cancelAllAsyncAny cancels all unprocessed asyncAny-tasks and prevents new tasks.

Our scheme (and implementation) fits all usage scenarios that match our following task model:

  • Tasks are free of side-effects.
  • Task processing may generate a task result and new tasks.
  • Task results must be reducible using a commutative and associative operator.
  • Each worker maintains a partial result.
  • A global result is computed from the partial results by reduction.

In our implementation, each place runs one manager deamon thread, which monitors the local pool and tries to steal tasks, if the pool becomes empty. All incoming and outgoing steals are logged. Stealing is always performed immediately, but in a coordinated way: a thief tries to pull half of the unprocessed tasks out of the victim’s internal pool. When a place runs out of tasks, the steal log is sent to place~0 and the place goes inactive. With the help of this information place~0 realizes the global termination detection of the surrounding finishAsyncAny. An inactive manager is reactivated when at least one asyncAny-task is inserted into its local pool.

We have carried out performance measurements with six benchmarks: Unbalanced Tree Search (UTS), NQueens, Betweenness Centrality (BC), Travel Salesman Problem (TSP), Pi and Matrix Multiplication (MatMul).

Experiments were run on at Goethe-HLR in Frankfurt, Germany. Each node is equipped with two 20-core Xeon Skylake Gold 6148 CPUs (a total of 40 cores per node), and 192 GB of main memory. We allocated up to 128 nodes a single island, and started on each 40 workers.

Comparing our asyncAny implementations running on 128 nodes (5120 workers) to the base running on 1 node (40 workers), UTS achieves a speedup of 100, NQueens of 120, BC of 129, TSP of 65, Pi of 107 and MatMul of 109. Good intra-place speedups and a low overhead for cancellation-bookkeeping were reported in previously publications. The following table shows our measured processing times in seconds: 

Nodes (Worker)

UTS

NQueens

BC

TSP

Pi

MatMul

1 (40)

3118.42

3109.94

4694.96

1145.78

4115.45

4756.84

2 (80)

1590.91

1708.84

2210.75

645.56

2131.20

2414.77

4 (160)

857.91

805.80

1168.04

361.55

1069.64

1179.07

8 (320)

450.06

399.82

464.18

189.89

533.24

594.96

16 (640)

217.81

194.25

258.08

110.53

267.55

299.66

32 (1280)

109.51

99.91

133.70

51.00

135.96

152.83

64 (2560)

55.89

50.53

68.21

27.91

71.19

78.83

128 (5120)

30.94

25.87

36.38

17.47

38.31

43.36

The experiments were executed on Goethe-HLR Cluster.

Future work may consider fault tolerant asyncAny-tasks to tolerate permanent place crashes. Moreover malleability could support adding and removing places at runtime.

Related papers

J. Posner, C. Fohry: Hybrid Work Stealing of Locality-Flexible and Cancelable Tasks for the APGAS Library. The Journal of Supercomputing, Vol. 47, No. 4, 2018, pp. 1435–1448.

J. Posner, C. Fohry: A Combination of Intra- and Inter-Place Work Stealing for the APGAS Library. Proc. Int. Conf. on Parallel Processing and Applied Mathematics, Workshop on Language-Based Parallel Programming Models, 2018, Springer LNCS 10778, pp. 234–243.

Contact

Jonas Posner
University of Kassel, Germany