In this chapter, we discussed two scheduling methods for DAG-structured computations on general-purpose multicore and manycore processors: collaborative scheduling and hierarchical scheduling. Both methods consist of a set of plug-and-play components, and various heuristics can be used for optimization within each component. For example, randomization techniques could be used for the Allocate module [15].
Collaborative scheduling is designed for general-purpose multicore processors. In this method, we developed lock-free local task lists and weight counters that differ from traditional lock-free data structures. The lock-free mechanism reduces the scheduling overhead caused by synchronization and contention across the cores. Hierarchical scheduling is designed for homogeneous manycore processors. In this method, we divided the threads into groups, each having a manager that performs scheduling at the group level and several workers that self-schedule the tasks assigned by the manager. A supermanager dynamically adjusts the group size so that the scheduler can adapt to the input task dependency graph.
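To make the role of the local task lists and weight counters concrete, the following is a minimal single-threaded sketch of how ready tasks of a DAG might be allocated to per-thread local lists. It is an illustration under stated assumptions, not the scheduler described in this chapter: the function name `allocate_dag`, the least-loaded allocation rule, and the input format are hypothetical, and neither the lock-free mechanism nor task execution is modeled, so the counters only accumulate.

```python
from collections import deque


def allocate_dag(successors, weights, num_threads):
    """Allocate DAG tasks to per-thread local task lists (illustrative only).

    successors:  dict mapping each task to the tasks that depend on it
    weights:     dict mapping each task to its estimated weight
    num_threads: number of local task lists
    """
    # Dependency degree: number of unfinished predecessors of each task.
    indegree = {task: 0 for task in weights}
    for succs in successors.values():
        for s in succs:
            indegree[s] += 1

    # Tasks with no predecessors are ready immediately.
    ready = deque(sorted(t for t, d in indegree.items() if d == 0))

    local_lists = [[] for _ in range(num_threads)]
    weight_counters = [0] * num_threads

    while ready:
        task = ready.popleft()
        # Assumed heuristic: place the task on the list with the smallest
        # weight counter (ties go to the lowest thread index).
        target = min(range(num_threads), key=weight_counters.__getitem__)
        local_lists[target].append(task)
        weight_counters[target] += weights[task]

        # Treat the task as completed and release its successors.
        for s in successors.get(task, ()):
            indegree[s] -= 1
            if indegree[s] == 0:
                ready.append(s)

    return local_lists, weight_counters


# Usage: a diamond-shaped DAG a -> {b, c} -> d on two threads.
successors = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
weights = {"a": 3, "b": 2, "c": 2, "d": 1}
lists, counters = allocate_dag(successors, weights, num_threads=2)
```

In this sketch the choice of target list is the only place where an allocation heuristic appears, so a different rule, such as a randomized one, could be substituted at that line without touching the rest of the loop.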
In the future, we plan to reduce the number of variables used by the lock-free data structure and to explore heuristics for each component of the proposed scheduler. For example, for the task fetch module, we plan to interleave computationally intensive tasks with memory-access-intensive tasks on threads assigned to the same core to improve overall performance. We also plan to study data layout for high-throughput processors so as to use the data cache of the UltraSPARC processors efficiently, since the L2 cache is no more than 4 MB and is shared by up to 64 hardware threads.
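One simple way to picture the planned interleaving heuristic is sketched below. It assumes that each task already carries a label marking it as compute-bound or memory-bound; the function name `interleave_by_kind`, the labels, and the strict alternation policy are hypothetical choices made for illustration, not a description of an implemented fetch module.

```python
from itertools import zip_longest


def interleave_by_kind(tasks):
    """Return a fetch order that alternates compute- and memory-bound tasks.

    tasks: list of (name, kind) pairs, where kind is "compute" or "memory".
    """
    compute = [name for name, kind in tasks if kind == "compute"]
    memory = [name for name, kind in tasks if kind == "memory"]

    order = []
    # Pair tasks of the two kinds; zip_longest pads the shorter list with
    # None, so leftover tasks of the longer kind are appended at the end.
    for c, m in zip_longest(compute, memory):
        if c is not None:
            order.append(c)
        if m is not None:
            order.append(m)
    return order


# Usage: three compute-bound and two memory-bound tasks.
tasks = [
    ("t1", "compute"),
    ("t2", "compute"),
    ("t3", "memory"),
    ("t4", "compute"),
    ("t5", "memory"),
]
order = interleave_by_kind(tasks)
```

The intent of such an ordering would be that hardware threads sharing a core are less likely to compete for the same resource at the same time; whether it improves performance in practice is exactly the question left for future study.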
[1] I. Ahmad, Yu-Kwong Kwok, and Min-You Wu. Analysis, evaluation, and comparison of algorithms for scheduling task graphs on parallel processors. In Proceedings of the 1996 International Symposium on Parallel Architectures, Algorithms and Networks, page 207, 1996.
[2] I. Ahmad, S. Ranka, and S.U. Khan. Using game theory for scheduling tasks on multi-core processors for simultaneous optimization of performance and energy. In IEEE International Symposium on Parallel and Distributed Processing (IPDPS), pages 1–6, April 2008.
[3] S.R. Alam, R.F. Barrett, J.A. Kuehn, P.C. Roth, and J.S. Vetter. Characterization of scientific workloads on systems with multicore processors. In IEEE International Symposium on Workload Characterization, pages 225–236, 2006.
[4] James H. Anderson and Srikanth Ramamurthy. Lock-free and practical doubly linked list-based deques using single-word compare-and-swap. In Proceedings of the 8th International Conference on Principles of Distributed Systems, pages 240–255, 2005.
[5] Olivier Beaumont, Larry Carter, Jeanne Ferrante, Arnaud Legrand, Loris Marchal, and Yves Robert. Centralized versus distributed schedulers for multiple bag-of-task applications. In International Parallel and Distributed Processing Symposium (IPDPS), pages 1–10, 2006.
[6] R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. Cilk: An efficient multithreaded runtime system. Technical report, Cambridge, 1996.
[7] Charm++ programming system. http://charm.cs.uiuc.edu/research/charm/.
[8] Huey-Ling Chen and Chung-Ta King. Eager scheduling with lazy retry in multiprocessors. Future Gener. Comput. Syst., 17(3):215–226, 2000.
[9] E. G. Coffman. Computer and Job-Shop Scheduling Theory. John Wiley and Sons, New York, NY, 1976.
[10] Guojing Cong and David A. Bader. Designing irregular parallel algorithms with mutual exclusion and lock-free protocols. Journal of Parallel and Distributed Computing, 66:854–866, 2006.
[11] M. De Vuyst, R. Kumar, and D.M. Tullsen. Exploiting unbalanced thread scheduling for energy and performance on a CMP of SMT processors. In IEEE International Symposium on Parallel and Distributed Processing (IPDPS), pages 1–6, 2006.
[12] Michael R. Garey and David S. Johnson. Computers and Intractability; A Guide to the Theory of NP-Completeness. W. H. Freeman & Co., New York, NY, USA, 1990.
[13] Intel Threading Building Blocks. http://www.threadingbuildingblocks.org/.
[14] Tommi S. Jaakkola and Michael I. Jordan. Variational probabilistic inference and the QMR-DT network. Journal of Artificial Intelligence Research, 10(1):291–322, 1999.
[15] Yu-Kwong Kwok and Ishfaq Ahmad. Static scheduling algorithms for allocating directed task graphs to multiprocessors. ACM Comput. Surv., 31(4):406–471, 1999.
[16] Jason Liu, David M. Nicol, and King Tan. Lock-free scheduling of logical processes in parallel simulation. In Proceedings of the 2000 Parallel and Distributed Simulation Conference, pages 22–31, 2001.
[17] OpenMP Application Programming Interface. http://www.openmp.org/.
[18] Christos Papadimitriou and Mihalis Yannakakis. Towards an architecture-independent analysis of parallel algorithms. In Proceedings of the Twentieth Annual ACM Symposium on Theory of Computing, pages 510–513, 1988.
[19] Denis Sheahan. Developing and tuning applications on UltraSPARC T1 chip multithreading systems. Technical report, 2007.
[20] M. S. Squillante and R. D. Nelson. Analysis of task migration in shared memory multiprocessor scheduling. In ACM Conference on the Measurement and Modeling of Computer Systems, pages 143–155, 1991.
[21] Xinan Tang and Guang R. Gao. Automatically partitioning threads for multithreaded architectures. J. Parallel Distrib. Comput., 58(2):159–189, 1999.
[22] Yinglong Xia, Xiaojun Feng, and Viktor K. Prasanna. Parallel evidence propagation on multicore processors. In The 10th International Conference on Parallel Computing Technologies, pages 377–391, 2009.
[23] Yinglong Xia and Viktor K. Prasanna. Scalable node-level computation kernels for parallel exact inference. IEEE Trans. Comput., 59(1):103–115, 2010.
[24] Henan Zhao and R. Sakellariou. Scheduling multiple DAGs onto heterogeneous systems. In IEEE International Symposium on Parallel and Distributed Processing (IPDPS), pages 1–12, 2006.