Thread Assignment in Multicore/Multithreaded Processors: A Statistical Approach

The introduction of multicore/multithreaded processors, comprising a large number of hardware contexts (virtual CPUs)
that share resources at multiple levels, has made process scheduling, in particular assignment of running threads to available hardware
contexts, an important aspect of system performance. However, assigning the threads of applications running on state-of-the-art
processors is an NP-complete problem. Over the years, numerous studies have proposed heuristic-based algorithms for thread
assignment. Since the thread assignment problem is intractable, it is in general impossible to know the performance of the optimal
assignment, so the room for improvement of a given algorithm is also unknown. It is therefore hard to decide whether to invest more
effort and time to improve an algorithm that may already be close to optimal. In this paper, we present a statistical approach to the
thread assignment problem. First, we present a method that predicts the performance of the optimal thread assignment, based on the
observed performance of each thread assignment in a random sample. The method is based on Extreme Value Theory (EVT), a
branch of statistics that analyses extreme deviations from the population mean. We also propose sample pruning, a method that
significantly reduces the time required to apply the statistical method by reducing the number of candidate solutions that need to be
measured. Finally, we show that, if no suitable heuristic-based algorithm is available, a sample of several thousand random thread
assignments is enough to obtain, with high confidence, an assignment with performance close to optimal. The presented approach is
architecture and application independent, and it can be used to address the thread assignment problem in various domains. It is
especially well suited for systems in which the workload seldom changes. One example is network systems, which typically provide a
fixed set of services that are known in advance and whose applications apply a similar processing algorithm to each packet
in the system. We validate our methods with an industrial case study of a set of multithreaded network applications on an
UltraSPARC T2 processor. This article is an extension of our previous work [44], which was published in the Proceedings of the 17th
International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-2012).
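
The claim that a few thousand random assignments suffice can be made plausible with a simple calculation. The sketch below is not the EVT-based method of the paper. It assumes only that assignments are drawn independently and uniformly at random (with replacement), in which case the probability that at least one of n samples falls in the best fraction p of all assignments is 1 - (1 - p)^n. The function names are hypothetical. Being in the best p by rank does not by itself bound the performance gap to the optimum; estimating that gap is what the statistical method addresses.

```python
import math


def prob_hit_top_fraction(n_samples: int, top_fraction: float) -> float:
    """Probability that at least one of n_samples independent, uniformly
    random assignments lies in the best `top_fraction` of all assignments."""
    return 1.0 - (1.0 - top_fraction) ** n_samples


def samples_needed(top_fraction: float, confidence: float) -> int:
    """Smallest n such that prob_hit_top_fraction(n, top_fraction) >= confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - top_fraction))


if __name__ == "__main__":
    # Illustrative targets only; these values are not taken from the paper.
    for p in (0.01, 0.001):
        n = samples_needed(p, 0.99)
        print(f"top {p:.1%} with 99% confidence: {n} random assignments")
```

Under these assumptions, the required sample size depends only on p and the confidence level, not on the total number of possible assignments, which is why random sampling remains practical even when the search space is far too large to enumerate.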