A Generic Reusable Java Framework for Fault-Tolerant Parallelization with the Task Pool Pattern

Work stealing is used in task-pool based parallel programs, and serves to balance the load of irregular applications. It can be implemented in either a cooperative or a coordinated way. Unfortunately, scaling a work-stealing application to a large number of cluster-nodes increases the probability of failures.

We developed a generic reusable framework for fault-tolerant parallelization with the task pool pattern. It is written in Java and thus addresses many potential programmers. Users of this framework can focus on coding sequential tasks for their problem, while respecting some framework contracts.

First, we implemented the two work-stealing approaches, cooperative and coordinated, for lifeline-based global load balancing, which is the algorithm used by X10's GLB framework. All implementations deploy the APGAS library, which brings the parallel programming concepts of X10 to Java. Our cooperative variant resembles the original GLB framework. The coordinated variant enables concurrent access to local task pools by using a split queue data structure. In experiments, both APGAS variants had similar execution times, without a clear winner. However, both outperform the original GLB framework for X10, when compiled with Managed X10.

Afterwards, we extended the cooperative variant by a fault-tolerance scheme, which utilizes Hazelcast’s distributed and fault-tolerant IMap. Our fault-tolerance scheme uses two system-wide maps, in whichit stores, e.g., backups of local task pools. Framework users mayconfigure the number of backup copies to control how manysimultaneous failures are tolerated. The algorithm is correct in the sense that the computed result is the same as in non-failure case, or the program aborts with an error message. In experiments, we observed an overhead of at most 35% in comparison to the non-fault-tolerant base variant.

Our frameworks are probably among the first programs ever written with the APGAS library by users from outside the development team. The combination of APGAS and Hazelcast's IMap can be recommended. Compared to X10, APGAS offers distinct advantages for the programmer, for example comfortable auto completion and debugging.

In the future, I plan to implement various extensions of the scheme within the APGAS runtime.

Poster: A Generic Reusable Java Framework for Fault-Tolerant Parallelization with the Task Pool Pattern