Sergey Abramov1, Alexei
Adamovich1, Alexander Inyukhin2, Alexander
Moskovsky1,
Vladimir Roganov1, Elena
Shevchuk1, Yuri Shevchuk1, and Alexander
Vodomerov2
1 Program System Institute, Russian Academy of
Sciences, Pereslavl-Zalessky, 152020, Russia, Yaroslavl Region. +7
08535 98 064 (phone&fax) abram@botik.ru
2 Moscow, 119192,
Michurinsky prosp., 1, Institute of Mechanics of MSU,
Russia
Abstract. The paper is dedicated to an open T-system (OpenTS) - a programming system that supports automatic parallelization of computations for high-performance and distributed applications. In this paper, we describe the system architecture and input programming language as well as system's distinctive features. The paper focuses on the achievements of the last two years of development, including support of distributed, meta-cluster computations.
Open T-System (Open TS) is a recent dynamic program parallelization technology for high-performance and distributed applications. It originates from functional and metaprogramming technologies [1, 2] and tries to achieve maximum performance of single/multi-processors, supercomputers, clusters and meta-clusters. Another goal was the development of easy-to-use tools for parallel programming, with high learning curve and easy legacy code support. With initial implementations of T-system dated back to nineties and end of eighties of the last century, Open TS is a third generation of the T-system [3]. The Open TS approach allows addressing in a uniform way parallel computing problem for mutli-core processors, SMP systems, computational clusters and distributed systems. As well, Open TS facilitates parallel applications with non-uniform parallelism grains or parallelism grains defined at runtime.
The Open TS design utilizes many concepts of parallel computing. First of all, it devises high-level parallelizing approach, while many of them currently exist [4]. Secondly it utilizes an extension of C++ language to express parallelism, while many other extensions of C and C++ for parallel computing were developed [5]. Thirdly, the concept of T-system is based upon functional programming approach [1], that make it very similar to parallel implementations of functional languages [6]. At last, Open TS runtime implementation utilizes Distributed Shared Memory (DSM) [7], mutli-tier architecture [8] and C++ template- based design [9]. Here we note separately only small fraction of all works in this field, not comprehensive but representative, as we hope:
While many parallel programming techniques, like Unified Parallel C [16] and CxC [17] are not covered in our comparison, Open TS distinctions will be virtually the same.
Unlike many tools for parallel programming, T-System does not try to change the usual programming model too much. Native input language is a transparent attributebased extension of C++; however, other T-dialects of programming languages are in the development stage: T-FORTRAN, T-REFAL. Only two new notions are really important for programming: T-function and T-value. T-values are extensions of basic C values with non-ready value, read access to a non-ready value stops execution of a T-function, unless C-value is provided during computation. T-functions are pure Cfunctions forming functional model at the top level of program structure. However, imperative C exists inside T-functions enabling potential for low-level optimization. Support for object oriented-model is forthcoming.
An important feature of Open TS is a separation of the computation code from the scheduling code. In Open TS, the programmer is enabled to develop complex strategies for dynamic parallelization without affecting the computational code itself.
Parallel execution is based on a completely conflict-free data-flow model, and the "macro-scheduling" algorithm distributes computational tasks (active T-functions) over all available computing resources on the fly. Thus, latency hiding should enable very high computational power utilization. Moreover, heterogeneous (e.g. different CPU speeds) computational clusters can be efficiently loaded with that approach.
Special hardware such as application-specific accelerators and processors can be also considered as specific computational resources, it is dynamically loaded in the same way.
(1) Opens supports the usage of more than one million of threads even in one usual processor - this was shown practically, this was used in real applications. |
Millions of threads1 can work in a cooperative and conflict-free way enabling latency hiding: any time non-ready T-value is reached, T-System switches rapidly to another ready-to-compute task. In this way, T-System avoids blocking computation in many cases when communication infrastructure permits. In brief, T-System may be a good candidate to fill up the gap between fast recent CPUs and latency-restricted communications.
T-application is a self-contained, dynamically linked executable. In a nutshell, it recognizes the execution environment and automatically loads a corresponding communication driver on the fly. The execution environment may be one of the following.
Thus, T-applications don't need to be recompiled or re-linked for all possible communication flavors. This is important in many cases, especially in meta-clusters with heterogeneous MPI implementations.
Open T-System runtime has a microkernel-based design. Microkernel, or TSuperstructure, is a central part of the runtime. It contains all essential entities that a typical program needs to be run on. T-Superstructure has a "snowman" architecture of three tiers: `S' ("super-memory" and "super-threads"), `M' (mobile objects and references) and `T' (T-values, variables, references, functions). Being compact in size (less than 5 000 lines in about 100 C++ classes), it suits for various extensions: enhanced task schedulers, memory allocation schemes, custom thread systems, and so on. A special class 'Feature' is used to register extension plug-ins, which are typically dynamically linked at the startup stage. The microkernel can be easily ported to `almost pure' hardware, because it is almost self-contained. C++ [cross] compiler only is required for such porting. However, since C++ templates are used extensively, a modern C++ compiler is required.
Fast context switch is a special feature of Open TS, which is very important for efficient T-applications. Since T-applications are known to create millions of simultaneous threads, fast switching is key important. Today, the T-context switch is 10 times faster than the fastest standard thread library switch.
A "Supermemory", or special kind of distributed shared memory, is located outside of program data and used to manage T-values. Novel communication technologies such as hyper-transport can be directly incorporated into the "Supermemory" layer to avoid an unnecessary MPI overhead. Super memory is utilized in six different ways:
The fault-tolerance support has been implemented with the help of LAM MPI BLCR checkpoint system [18]. It is integrated with the T-System runtime, thus making fault-tolerant computing easier.
Since the T-system originates from the functional programming model, it is possible to implement the fault-tolerance on the base of re-computing of T-functions. This work is forthcoming.
Two approaches are followed to develop compilers for T++ programs.
The first, "converter", approach utilizes OpenC++ [19] parser to translate a T++ program to a C++ program using Open TS runtime library calls. Advantage of that approach is that the best-of-breed C++ compiler can be used, with the best processorspecific optimization available. The drawback is some C++ syntax features that are not supported seamlessly due to Open C++ limitations.
An alternate, "compiler", approach is based on an open-source GNU C++ compiler. An extra front-end language for T++ has been implemented, it has a smooth and comprehensive support of all C++ language features. However, if the GNU C compiler optimization is not on a par with the other compilers of the target platform, a performance loss might happen.
First of all, T++ is a transparent attribute-based dialect of C++. The T++ code can be trivially mapped to the sequential C++ program by masking T-attributes on the preprocessor stage. To start, the T++ code may be developed and debugged without T-System.
Then, the `t++' compiler may be used to obtain T-executables which should be able to normally run on the unicomputer. Thus, the second stage of the development process is to check whether everything works correctly on the unicomputer - this involves usual testing and debugging for the traditional (one-processor) case.
Furthermore, the same executable may be run on the `cluster emulation'. The simplest way to do this is to use LAM on various Linux systems: the command
mpirun n0,0,0,0 <t-executable>
will emulate the 4-node cluster. Some tuning can be done at this stage. Finally, run T-executable on the desired target platform.
T-System has a number of built-in profiling, tracing and debugging facilities.
First of all, debugging is facilitated by several modes of compilation: "optimized", "normal" and "debug". The "optimized" mode uses the runtime version with heavy optimization. The runtime of the "normal" mode is simplified as compared to the "optimized" version. If an application is compiled in the "normal" mode and a problem persists, it should be attributed to the application itself - not the runtime - with high degree of confidence. Moreover, the "debug" mode generates a large amount of debug output, which helps programmers to understand the current situation in T-runtime and applications. This output can be filtered with the help of regular expressions.
A full-fledged Trace facility has also been implemented for T-applications. When the program is finished, some statistical data is printed (see figures below). It includes minimal/medium/maximal (depending on computational nodes) values of the following parameters: used CPU time, communication time, idle time. This hot profiling information may be very useful for the tuning of applications. Communication message logs can be called in order to understand which communication traffic occurred during the program execution. A T-function call graph can also be obtained.
If the program crashes, some information (including program call stack with source line numbers) is printed. Optionally, the debugger is started at the same time, which may be very convenient for a rapid problem discovery.
Finally, a special heartbeat logic is used to discover broken program/communication state. If heartbeat timeout is reached without any data exchange, then all T-processes will exit automatically.
The example program is the calculation the Fibonacci number. Since it is not very hard computationally, it is a good test for the runtime system, and it illustrates well the simplicity of T++ programming.
The only T-function is the "fib" function which recursively calls itself. Since the result of "fib" is a non-ready value, explicit casting to int is necessary for the program to run correctly. The casting results in the "main" thread wait until the result of "fib" is ready. "fib" recursively calls itself creating a tree, while the tree branches can be computed in parallel.
Compiling the program is possible with either t++ or tg++.
t+ -o fib0 fib0.tcc
The process of the program execution is illustrated in Fig. 1 (running on single processor) and Fig. 2 (running on four-cluster nodes). You may see some speedup demonstrated by "fib". The example has been a mere illustration that doesn't reflect the real quality of T-system, benchmarking results will be published elsewhere.
The T++ language is a semantically and syntactically "seamless" extension of C++. The language constructions are enumerated below with short descriptions following them:
tfun - a function attribute which should be
placed just before the function declaration. Now, the function
cannot represent a class method but must be an ordinary C function.
A function with the "tfun" attribute is named
"T-function".
tval - a variable type attribute
which enables variables to contain a non-ready value. The variable
can be cast to the "original" C++-type variable, which makes the
thread of execution suspend until the value becomes
ready.
tptr - a T++ analogue of C++ pointers
which can hold references to a non-ready
value.
tout - a function parameter attribute
used to specify parameters whose values are produced by the
function. This is a T++ analog of the "by-reference" parameter
passing in C++.
tct - an explicit T-context
specification. This keyword is used for specification of additional
attributes of T-entities.
tdrop - a T++
-specific macro which makes a variable value ready. It may be very
helpful in optimization when it's necessary to make non-ready values
ready before the producer function finishes.
The detailed performance study of Open TS runtime is out of the current paper scope and will be published elsewhere. However, overall runtime performance and quality is good enough to stimulate many groups outside of Program Systems Institute to develop their own applications with Open TS (see below). Best speedup achieved with image-processing application is approximately 60% of linear speedup on 32-processor computational cluster with Scalable Coherent Interface (SCI) interconnect.
Approximately a dozen of applications have been developed with the help of Tsystem. Some of them are the following:
Open T-System is being developed in the Program System Institute of the Russian Academy of Sciences (PSI RAS) as a key technology in the SKIF Super-Computing project. The system support can be obtained via e-mail: opents@botik.ru (developers' conference).
We are also working on various application-oriented T-libraries. Such libraries are represented as the T++ code (working also in pure C++) and may be used without any knowledge of T++ or even parallel programming at all. Using the C++ inheritance mechanism, an application programmer just needs to define several applicationspecific methods - virtual functions - to obtain a complete highly-parallel computational component for a custom high-performance application. Other development areas if macro-scheduling schemas for meta-clusters and other distributed systems.
This work is supported by joint "SKIF" supercomputing project of Russia and Belarus and basic research grant from Russian Academy of Science program "Highperformance computing systems on new principles of computational process organization" and basic research program of Presidium of Russian Academy of Science "Development of basics for implementation of distributed scientific informational-computational environment on GRID technologies".