In the previous chapter, HPL was presented as a way to simplify the programming of heterogeneous systems in single-device environments. In this chapter we extend this tool to exploit multiple devices while keeping its characteristics of minimum user effort and maximum performance. In a first step we extend HPL with a totally general data coherency scheme for the data structures it manages, as well as a mechanism to make assignments between these structures so that they can be easily copied. The implementation is efficient, as it not only requires the minimum number of transfers, but it also applies the most efficient mechanisms to perform these transfers. This latter characteristic implies a dynamic adaptation capability in our library, as different transfer mechanisms suit different systems better. In a second stage the host API is improved with three mechanisms to reduce the programming effort of applications that exploit several heterogeneous devices. The first idea is the ability to use subarrays, that is, regions of arrays, that can be used in one or several devices without becoming separate data structures, so that it is always possible to keep a view of the underlying full array while automatically keeping the data consistent. The second idea is a set of mechanisms to split kernels for their execution in several devices, giving rise to what we call subkernels. Our experiments show that these mechanisms can largely improve programmability, their overhead being minimal. The last idea is an alternative to our initial subkernel proposal that is more flexible and makes it easy to explore and choose the best workload distribution when a computation is accelerated using devices with different capabilities. This proposal,
Scalability is one of the most important features in exascale computing. Most of these systems are heterogeneous, and it therefore becomes necessary to develop models and metrics that take this heterogeneity into account. This paper presents a new expression of the isoefficiency function called H-isoefficiency. This function can be applied to both homogeneous and heterogeneous systems and allows the scalability of a parallel system to be analyzed. Then, as an example, a theoretical a priori analysis of the scalability of Floyd's algorithm is presented. Finally, a model evaluation that demonstrates the correlation between the theoretical analysis and the experimental results is shown.
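As background, the classical isoefficiency function (the homogeneous notion that H-isoefficiency generalizes; the H-isoefficiency expression itself is not reproduced here) relates problem size $W$, processor count $p$ and total parallel overhead $T_o(W,p)$:

```latex
E = \frac{1}{1 + T_o(W,p)/W}
\qquad\Longrightarrow\qquad
W = \frac{E}{1-E}\, T_o(W,p)
```

A system is considered scalable when $W$ can grow with $p$ at the rate dictated by this relation while keeping the efficiency $E$ fixed.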
There are many proposals to simplify the programming and management of accelerator devices, and of hybrid programming mixing accelerators and CPU cores, as we state in Sect. 2.3. However, in many cases, portability compromises the efficiency on different devices. Depending on the proposal, some details concerning the coordination of different types of devices are still tackled by the programmer, such as computation partition and balance, data mapping and locality, or data movement coordination across different memory hierarchies. In this chapter we introduce the Multi-Controller (MCtrl), an abstract entity implemented in a library, that coordinates the management of heterogeneous devices, including accelerators with different capabilities and sets of CPU cores. It helps the programmer to handle the computation partition, mapping, and transparent execution of complex tasks in such hybrid and heterogeneous environments, independently of the target devices selected at run-time. Our proposal allows the exploitation of simple generic kernels that can fit any device; of very specialized kernels defined and optimized by the programmer for each architecture; and even of wrappers to call third-party predefined libraries (e.g., cuBLAS). This allows native or vendor-specific programming models to be exploited in a highly efficient way. The most appropriate kernels for each target device are automatically selected by the entity during program execution. Our proposal improves on state-of-the-art solutions, simplifying the data partition, mapping, and transparent deployment of both simple generic kernels, portable across different device types, and specialized implementations defined and optimized using specific native or vendor programming models (such as CUDA for NVIDIA's GPUs, or OpenMP for CPU cores).
This work describes a novel way to select the best computing node out of a pool of available, potentially heterogeneous computing nodes for the execution of computational tasks. This is a very basic and difficult problem in computer science, and computing centres have tried to get around it by using only homogeneous compute clusters. Usually this fails because, like any technical equipment, clusters get extended, adapted or repaired over time, and you end up with a heterogeneous configuration.
OpenACC is a directive-based standard generalized for multiple types of massively parallel processors such as GPUs, multicores or manycore accelerators. OpenACC directives can be used in C/C++ or Fortran programs in order to extract parallelism, the underlying compilers being responsible for finding particular optimizations compatible with both the language and the architecture. This feature is its main advantage and, at the same time, its main drawback, since programmers must blindly trust the optimization capabilities of the compiler, which are sometimes based more on theoretical generalities than on real device-specific properties. OmpSs is another directive-based proposal to program heterogeneous systems, created as an effort to extend OpenMP by supporting complex dependency patterns, heterogeneity and data movement for task parallelism. Concurrently with the development of these new directive-based approaches to heterogeneous systems programming, such capabilities were steadily added to the OpenMP standard itself. Thus, version 4.0 introduced a whole set of directives that allowed users to distribute loop iterations among device threads, to pack those threads in groups called teams, or to manipulate device-owned data structures. Released in 2015, OpenMP 4.5 is the latest version of the standard, and it improves those device memory management capabilities by complementing directives with explicit routines to allocate, deallocate and map structures to device memory, as well as to perform data transfers.
Abstract During the last decade, parallel processing architectures have become a powerful tool to deal with massively parallel problems that require High Performance Computing (HPC). The latest trend in HPC is the use of heterogeneous environments that combine different computational processing devices, such as CPU cores and GPUs (Graphics Processing Units). Maximizing the performance of any GPU parallel implementation of an algorithm requires in-depth knowledge of the underlying GPU architecture, becoming a tedious manual effort only suited for experienced programmers. In this paper, we present TuCCompi, a multi-layer abstract model that simplifies programming on heterogeneous systems including hardware accelerators, by hiding the details of synchronization, deployment, and tuning. TuCCompi chooses optimal values for its configuration parameters using a kernel characterization provided by the programmer. This model is very useful for tackling problems characterized by independent, high computational-load tasks, such as embarrassingly parallel problems. We have evaluated TuCCompi in different real-world heterogeneous environments, using the All-Pair Shortest-Path problem as a case study.
CIM (Common Information Model) is an information model that describes the information of an enterprise distributed system. The standard is maintained by an enterprise consortium, the DMTF (Distributed Management Task Force). CIM provides an object-oriented model for representing the managed elements. The specification extensively uses inheritance mechanisms to provide a modular and extensible characterization of them. The modular structure of CIM is composed of a common core that defines the basic elements, and a set of extensions. Each extension provides additional details about one aspect of the system (e.g., databases, application servers, policy definition or hardware devices). Extensions can either extend the base model, or further refine the concepts presented by other extensions. On the one hand, this approach provides a rich characterization of every managed element, with a specific class governing its attributes and management operations. On the other hand, the standard is so extensive that supporting tools only implement specific subsets of it, and the modeling effort required to cover additional elements complicates its adoption in very heterogeneous systems.
One of the most important modules in a distribution management system (DMS) is the fault location engine, which requires an effective algorithm to identify the fault line channel from the data provided by different intelligent electronic devices (IEDs). The accuracy of this process enables smart features such as intelligent network monitoring and efficient isolation. This paper proposes the algorithmic core of a fault locator software, based on undirected Dijkstra, implemented in MATLAB, co-simulated on three standard feeder systems with distributed generation using DSSim-PC, and integrated in LabVIEW as a whole framework of a fault location manager. Furthermore, a first prototype of a test bench is proposed and implemented as an academic platform for research objectives and Advanced Distributed Automation (ADA) purposes.
Cederman et al. propose two STM systems focused on the conflicts occurring among different work-groups. They do not consider potential interactions between single work-items. Both implementations follow a lazy-lazy scheme: lazy conflict detection and lazy version management. Their first implementation, called blocking, uses a set of irrevocable locks to protect memory positions accessed within the transaction (that is, only transactions able to get all the required locks are able to commit, and they do so serially). They discuss that the correctness of this implementation is highly dependent on the fairness of the GPU scheduler: work-groups waiting for a lock can repeatedly swap out the work-group holding the lock and, if the latter is not scheduled again, the program may deadlock. Their second proposal is obstruction-free. In this approach, committing transactions try to acquire the locks that protect the locations to be updated and, at the same time, announce the values to be written. Conflicting transactions (i.e., those that try to acquire the same locks) then have two options. If a conflicting transaction was able to get all the locks but has not yet updated memory, the value to be written can be forwarded from one transaction to the other. Otherwise, if the conflicting transaction did not get all the locks or the values in memory mismatch, the transaction is aborted. In this implementation transactions either commit or abort without waiting for other transactions to finish their commit. On average, the obstruction-free algorithm yields better performance and fewer aborted transactions than the blocking implementation.
We extend and merge Mahut's and Newell's car-following models to include heterogeneous desired speeds of vehicles, which is relevant for properly mimicking platoon formation and evolution. The proposed model offers a complete description of traffic dynamics over a given highway stretch, where delays occur at the end. Illustrative numerical examples are conducted with several model specifications, showing scattering of the fundamental diagrams, the "capacity drop" phenomenon, and stop-and-go waves related to "phantom jams"; the model therefore exhibits traffic hysteresis as well.
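For reference, Newell's simplified car-following rule, on which such extensions build, can be written with a vehicle-specific desired speed $v_i$ (the exact merged formulation of the paper is not reproduced here):

```latex
x_i(t+\tau_i) \;=\; \min\bigl(\, x_i(t) + v_i\,\tau_i,\;\; x_{i-1}(t) - d_i \,\bigr)
```

where $\tau_i$ is the wave travel (reaction) time and $d_i$ the jam spacing of vehicle $i$; the first argument is the free-flow bound driven by the heterogeneous desired speed, the second the congested (following) bound imposed by the leader $i-1$.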
There are several proposals intended to facilitate the programmability of these architectures: autotuning, directives, automatic compilers, or accelerated libraries. Autotuning is a very interesting option for applications whose execution time, memory usage or energy consumption can vary depending on a set of parameters and their execution environment. The autotuner determines the best parameter combination to maximize a user-defined metric. Nevertheless, this technique requires writing code in a parametrized way to accommodate the various performance tuning parameters. Another approach is the use of directives such as OpenACC or hiCUDA. Most approaches of this kind require GPU expertise. Furthermore, the code is not easily readable and there are also some limitations; for example, the programmer cannot use CUDA intrinsic functions within the accelerator region. Automatic compilers, such as Par4All and Bones, are another interesting option that automatically generates code for GPUs, saving programmers time and effort. However, these approaches sometimes rely on the user's knowledge for tuning applications. In addition, some systematic code translations, applied without a previous analysis of the problem, can lead to reduced performance. Finally, the use of accelerated libraries tuned for each architecture version, such as SkePU, MAGMA or SkelCL, can enable applications to fully exploit the power of current heterogeneous parallel systems. Due to the fast evolution of the GPU market, each GPU architecture varies greatly in design from one generation to the next, and the parameters which influence performance also change and must be re-adjusted.
2. BERNOULLI'S THEOREM, THE CLASSICAL HOMOGENEOUS FIGURES AND OURS
Before going on with our heterogeneous model, we wish to discuss Bernoulli's theorem as is commonly employed for obtaining the classical homogeneous figures (Lyttleton 1953; Dryden 1956), particularly the Dedekind ellipsoids which, along with our models, are static. The steady-state equations of motion for a self-gravitating fluid are
Today, every aspect of our lives is influenced by networked computer systems. The newly minted domain of cyberspace is so pervasive that the US Department of Defense has put cyberspace on par with land, sea, and air as a war-fighting domain. The Internet, which provides transportation to all types of information including complex real-time multi-media data, is the universal network of millions of interconnected computer systems, organized as a network of thousands of distinct smaller networks. However, systems in cyberspace are constantly faced with cyber threats every day. Incident handlers across the globe are faced with compromised systems running some set of malicious programs, providing some kind of unintended service to an intruder who has taken control of someone else's computers. In 2015, the number of zero-day vulnerabilities discovered by Symantec more than doubled to 54, a 125 percent increase from the year before. Since threats cannot be eliminated entirely, the strategy to secure cyberspace is to reduce vulnerabilities to those threats before they can be exploited to damage cyber systems. Therefore, it is necessary to control this increasing risk tendency.
In Mr. Trichet's words: "the key lesson we would draw from our experience is the danger of relying on a single tool, methodology or paradigm. Policy-makers need to have input from various theoretical perspectives and from a range of empirical approaches. ...we need to develop complementary tools to improve the robustness of our overall framework. ...In this context, we would very much welcome inspiration from other disciplines: physics, engineering, psychology, biology. Bringing experts from these fields together with economists and central bankers is potentially very creative and valuable. Scientists have developed sophisticated tools for analysing complex dynamic systems in a rigorous way. These models have proved helpful in understanding many important but complex phenomena: epidemics, weather patterns, crowd psychology, magnetic fields. Such tools have been applied by market practitioners to portfolio management decisions, on occasion with some success".
This checklist might usefully be completed by those who drafted the plan and/or by employees in the government itself. However, it also is important to have independent reviewers. Those involved in drawing up the plan might have personal or political interests or might be too closely involved with the plan to see anomalies or provide critical input. Ideally, an independent multidisciplinary team should be convened to conduct an evaluation. A team also is advantageous because no single person is likely to have all the relevant information required, and debate is crucial for arriving at an optimal plan for the country. Furthermore, when relevant interest groups have been involved in the process of developing the plan and/or in its evaluation, which leads to changes being made to the plan, it is likely that the plan will be implemented more effectively. It is useful to include consumer organizations, family organizations, service providers, professional organizations and nongovernmental organizations, as well as representatives of other government departments affected by the mental health plan. Finally, although the checklist should be interpreted in terms of the document that outlines the mental health plan, it is important to have, or be familiar with, other relevant and related documentation. Often, items are not covered in the plan because they are comprehensively covered elsewhere. For example, plans for health information systems or human resources might include mental health and are therefore deliberately not repeated in the mental health plan. This explanation should then be noted in the relevant section.
This work has presented an architecture to integrate a KNX bus, a robot and different assistive and monitoring devices into an AmI architecture for emergency management. The main advantage of this heterogeneous approach is that the developer can choose the most appropriate device for each task, rather than focusing on a single standard or settling for whatever (limited) information commercial interfaces offer via the usual web interface. The gateway developed is based on the EIBD daemon and is able to understand all the non-programming commands. Our translation tool provides full access to the KNX bus and shows how it can interact with any other device.
The remaining parts of the article are organized as follows. In Sec. 2 we provide basic definitions (including central notions such as that of institution, entailment system, and structure building operations), as well as Borzyszkowski's calculus. In Sec. 3 we develop one of the main contributions of this article by analyzing the calculus proposed by Borzyszkowski and discussing its possibilities and limitations. We show a modified version of Borzyszkowski's calculus that is complete and requires weaker conditions, thus providing a complete calculus for many logics ubiquitous in software modeling. In Sec. 4.1 we present Borzyszkowski's calculus for heterogeneous structured specifications, and discuss its limitations. In Sec. 4.2 we extend our calculus in order to deal with structured heterogeneous specifications related via institution representations. Finally, in Sec. 5, we draw some conclusions.
OMP2MPI is a tool proposed to facilitate the portability of OpenMP source code to MPI. I showed how it effectively and automatically translates OpenMP source code to MPI because it is able to go outside the node, thus allowing the program to exploit non-shared-memory architectures such as clusters or NoC-based MPSoCs. This automatic task is very useful because the programmer can keep working with the OpenMP model, which is easily readable, and simply compile through the OMP2MPI compiler to obtain the advantages that the MPI model offers (e.g., speedup and scalability). The readability of the generated code is acceptable, so further optimizations can be done by experts to improve performance results. The experimental results obtained on the Polyhedral benchmark in Figures 4.11, 4.12, 4.13, 4.14, 4.15, and 4.16 are promising, especially when considering it is an effortless version. They also show better scalability than the original OpenMP code. The speedup figures for 64 cores are higher than 60× in some cases, and also higher than those of the original OpenMP code. These results show again that, as mentioned in the introduction, OpenMP does not always perform better than MPI in shared memory systems.