
There have been many attempts at parallelising ray tracing and, eventually, high-fidelity rendering (Crockett, 1997; Chalmers et al., 2002), the first of which were targeted specifically at large supercomputers (Wald et al., 2009). Data-parallel approaches that subdivide problems in image-space, where the entire ray tracing tree of a pixel is processed by a single processing element, are preferred for world data models. Woodwark (1984) introduced a hierarchical method for image-space

4. Parallel and Distributed Rendering 69

subdivision that improved workload distribution across the processing elements at the cost of some computational efficiency. Instead of assigning a single contiguous area of a picture to each processing element, a number of pixels on screen are assigned using a recursive subdivision scheme, where each processor effectively works at a lower resolution of the actual image, thus achieving more evenly spread workloads between processors. Plunkett & Bailey (1985) proposed a vectorised ray tracing algorithm for a pipelined vector computer which used image-space subdivision to generate a number of ray-object intersection queries. These queries were then queued for processing by the vector processor, achieving very fine-grained parallelism. The initial query set was generated for primary rays; for each ray processed, secondary rays, such as shadow or reflection rays, were generated, queued and eventually processed. The results showed that the vectorised algorithm achieved a substantial speed-up of at least an order of magnitude over the scalar implementation. Gaudet et al. (1988) proposed a scheme for reducing the complexity of the interconnection network via the use of adaptive broadcasting. Specifically, they argued against both replicating scene data at each processing element and using a global memory, the first due to excessive replication of data and the second due to the resulting communication overhead and memory contention, and proposed to adaptively broadcast data to all processors instead. Their results showed that system efficiency declines quickly with an increasing number of processing elements due to the higher latency induced by larger broadcasts.
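The load-spreading effect of giving each processor scattered pixels rather than one contiguous area can be illustrated with a small sketch. It uses a plain modulo interleaving as a stand-in, which is an assumption and not Woodwark's hierarchical scheme itself; the function names and the cost model (an expensive top half of the image) are hypothetical.

```python
from collections import Counter

def contiguous_owner(x, y, height, n_procs):
    """Assign horizontal bands of the image to processors."""
    band = -(-height // n_procs)  # ceiling division
    return y // band

def interleaved_owner(x, y, k):
    """Assign pixel (x, y) to one of k*k processors by its position in a k x k tile.

    Each processor then sees every k-th pixel in x and y, i.e. a
    lower-resolution version of the whole image.
    """
    return (y % k) * k + (x % k)

def load_per_processor(owner, cost, width, height):
    """Sum the per-pixel cost owned by each processor."""
    load = Counter()
    for y in range(height):
        for x in range(width):
            load[owner(x, y)] += cost(x, y)
    return load

def cost(x, y):
    # Hypothetical cost model: the top half of the image is 10x more expensive.
    return 10 if y < 4 else 1

W = H = 8
K = 2  # K*K = 4 processors
bands = load_per_processor(lambda x, y: contiguous_owner(x, y, H, K * K), cost, W, H)
tiles = load_per_processor(lambda x, y: interleaved_owner(x, y, K), cost, W, H)
print(sorted(bands.values()), sorted(tiles.values()))
```

With this cost model the contiguous bands are heavily imbalanced, while the interleaved assignment gives every processor the same share of expensive and cheap pixels.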

The object-space partitioning approach entrusts each processing element with the monitoring of a cell or volume in a spatial-partitioning structure. The processing element is responsible for testing all rays that enter the assigned cell against all objects that have surfaces intersecting the cell. Rays that travel from one cell to another assigned to a different processor must be propagated across processors. The propagation of rays across multiple processors, as well as the redundant testing of rays against objects which straddle multiple cells, may cause excessive overhead. Another problem of object-space approaches is the workload imbalance caused by a non-uniform distribution of rays and objects among the processors (Lin & Slater, 1991). Dippe & Swensen (1984) proposed an object-space subdivision algorithm where the shape of each subregion was adaptively controlled to maintain a roughly uniform distribution of computation load. The subregions were bounded by tetrahedra, forming a general cube, and subject to fixed connectivities. Transfers of load between subregions occurred when the workload of a region was higher than that of its neighbours and were carried out by moving the vertices of the region's bounding volume; for simplicity, only one corner at a time was moved. The computational effort, or difficulty, involved in shifting the load from one subregion to another was taken into consideration when choosing how to carry out the redistribution. Nemoto & Omachi (1986) argued that transferring load among subregions by moving the corners of a general cube affects the eight subregions sharing the vertex, making it difficult to select a corner and to choose a direction and magnitude of movement. Moreover, boundary-intersection calculations for general cubes, as well as the determination of which objects in a subregion should be moved during redistribution, are expensive operations and pose a significant overhead. Thus, they proposed a space subdivision algorithm where subregions are orthogonal parallelepipeds consisting of unit cubes aligned to the coordinate axes of the containing space. Each processing element was assigned to one subregion and thus communicated with six neighbours. Redistribution occurred by moving, or sliding, the boundary surface between two subregions by one unit, transferring load from the shrinking subregion to the growing one.
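The sliding-boundary idea can be sketched in one dimension. This is a simplified illustration under stated assumptions, not the algorithm of Nemoto & Omachi (1986): space is reduced to a row of unit slabs with a known load each, and the rule that a boundary only moves when the load difference exceeds the load of the transferred unit is added here to stop the boundary oscillating. All names are hypothetical.

```python
def region_loads(boundaries, unit_load):
    """Load of each region, where region i covers units [b[i], b[i+1])."""
    return [sum(unit_load[a:b]) for a, b in zip(boundaries, boundaries[1:])]

def slide_once(boundaries, unit_load):
    """Slide each internal boundary by at most one unit towards the heavier region."""
    b = list(boundaries)
    for i in range(1, len(b) - 1):
        left = sum(unit_load[b[i - 1]:b[i]])
        right = sum(unit_load[b[i]:b[i + 1]])
        # Shrink the heavier region by one unit, but never to zero width and
        # only if the move does not overshoot (assumed damping rule).
        if left > right and b[i] - b[i - 1] > 1 and left - right > unit_load[b[i] - 1]:
            b[i] -= 1
        elif right > left and b[i + 1] - b[i] > 1 and right - left > unit_load[b[i]]:
            b[i] += 1
    return b

def balance(boundaries, unit_load, max_passes=100):
    """Repeat single-unit slides until no boundary moves."""
    boundaries = list(boundaries)
    for _ in range(max_passes):
        new = slide_once(boundaries, unit_load)
        if new == boundaries:
            break
        boundaries = new
    return boundaries

unit_load = [8, 8, 8, 8, 1, 1, 1, 1]   # hypothetical per-unit ray/object load
start = [0, 4, 8]                      # two regions of four units each
final = balance(start, unit_load)
print(region_loads(start, unit_load), "->", region_loads(final, unit_load))
```

In three dimensions each subregion would have six such boundary faces, one per neighbour, and the same one-unit slide would apply to each face.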

Salmon & Goldsmith (1989) proposed a hierarchical subdivision of space using rectangular extents; the upper levels of the resulting tree (termed forest by the authors) were replicated at each processing element, while the lower levels pointed to the subtrees making up the remaining part of the hierarchy and the associated object database, stored at different processors. Each processor also controlled a subset of pixels. A primary ray was initially traced through a pixel and the forest at the originating processor. If the traversal led to a subtree located at a different processor, the ray was forwarded to the concerned unit; otherwise it was computed locally. Scherson & Caspary (1988) augmented the work of Salmon & Goldsmith (1989) with dynamic load balancing. The traversal and ray-object intersection calculations were decoupled from the other bounding-volume calculations, which could run on any processor due to the ubiquitously available forest, so that the two could be handled by different processes. The load balancing technique adopted was that of shifting the traversal of the forest to idle processors when a specific processor was busy computing ray-object intersections. Although Scherson & Caspary (1988) solved the problem of load balancing, their solution was still subject to network congestion caused by the large number of messages exchanged. Priol & Bouatouch (1989) underscored the degradation in performance experienced by previous distributed algorithms due to an increase in boundary ray intersections and message traffic as the number of processing elements increased. They also noted that the large number of messages may drive some processors into deadlock. Thus, they proposed a static load balancing strategy using image sub-sampling, which is carried out prior to the synthesis phase. The ray tracing of the sub-sampled image guides the subdivision of the scene by means of 3D space partitioning. To avoid congestion of the communications network, messages representing light rays transitioning from one processor to another are aggregated and replaced by a light volume in the form of a pyramid. Notwithstanding, the results show that the efficiency of the algorithm rapidly decreases to 30% when the number of processors is increased to 64. This stems from the increase in the number of ray-object intersections; subdivided regions sharing the same objects perform repeated intersection calculations when rays move from region to region. Pitot (1993) proposed another static load distribution strategy where the scene is spatially partitioned into a number of small regular cells, independent of the number of processors. The 3D grid formed from partitioning the scene is then mapped onto a 3D torus, with multiple scattered cells possibly mapping to a single processing element. The subdivision process divides space at two levels; the first level, metavoxels, is distributed among the processors. The metavoxels are then divided into voxels that are not distributed. The algorithm was shown to be efficient when synthesising images with complex ray-trees but still suffered from a high communication cost and the rigidity of regular subdivision.
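One way to picture how a regular 3D grid of cells can be folded onto a smaller 3D torus of processors is a per-axis modulo mapping. The source states only that multiple scattered cells may map to a single processing element, so the modulo rule, the grid and torus sizes, and the function names below are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product

def cell_to_processor(cell, torus_dims):
    """Map a grid cell (i, j, k) to a torus node by wrapping each coordinate."""
    return tuple(c % d for c, d in zip(cell, torus_dims))

def torus_neighbours(proc, torus_dims):
    """The six neighbours of a torus node, with wrap-around on every axis."""
    result = []
    for axis in range(3):
        for step in (-1, 1):
            n = list(proc)
            n[axis] = (n[axis] + step) % torus_dims[axis]
            result.append(tuple(n))
    return result

GRID = (8, 8, 8)    # hypothetical number of cells per axis
TORUS = (4, 4, 4)   # hypothetical number of processors per axis

cells_per_proc = Counter(
    cell_to_processor(cell, TORUS) for cell in product(*(range(n) for n in GRID))
)
print(cell_to_processor((5, 2, 7), TORUS), set(cells_per_proc.values()))
```

Under this mapping, a ray stepping from a cell to an adjacent cell always moves to a neighbouring torus node, and each processor owns cells scattered throughout the scene rather than one contiguous block.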

A number of hybrid approaches based on a combination of various degrees of image and object-space partitioning were also put forward. Green & Paddon (1990) compiled a minimum set of design criteria for the development of a flexible and efficient general-purpose multiprocessor solution for ray tracing. They observed that systems exploiting coherence in object-space were either designed specifically for a particular architecture or, as in the majority of cases, the architectures were designed to complement the algorithms used. They proposed a hybrid image-space approach where task granularity depends on the size of the image regions used and work is demand-driven and thus automatically load balanced. A local cache employing a direct mapping scheme is held at each processor and keeps a partial view of the object database. The cache is divided into two sets, a statically allocated one (the resident set) and one based on a dynamic mechanism. The resident set holds the objects that are expected to be referenced most during the generation of an image. The estimate is computed via the generation of a low-resolution image, similarly to Salmon & Goldsmith (1989). The results show that an important factor affecting the efficiency of the system was the ratio of dynamic to static storage, and that a larger dynamic section is required as the memory size decreases. Badouel & Priol (1992) used a dynamic demand-driven image-space partitioning algorithm; the image is initially partitioned into a number of regions equal to the number of processors. When a processor completes its work, it sends a request for more work to another node which is currently busy. Experimental results showed that a 3×3 pixel work item yielded a good balance of communication activity and computation. Furthermore, Badouel & Priol (1992) employed an object-based VSM system (see §4.2.1) to provide their system with the ability to render scenes whose databases are larger than the available physical memory. Reisman et al. (2000) presented a scheme for ray tracing images at a fixed frame rate by using progressive rendering on distributed systems.
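Why demand-driven assignment balances load automatically can be seen in a small simulation. The sketch below uses a single central queue of image regions, which is a simplification and not the scheme of Badouel & Priol (1992), where idle processors request work from busy nodes; the tile costs and function names are hypothetical.

```python
import heapq
from collections import deque

def static_split(tile_costs, n_workers):
    """Finish time per worker when tiles are divided up front into equal-sized chunks."""
    chunk = -(-len(tile_costs) // n_workers)  # ceiling division
    return [sum(tile_costs[i * chunk:(i + 1) * chunk]) for i in range(n_workers)]

def demand_driven(tile_costs, n_workers):
    """Finish time per worker when each worker asks for the next tile once it is idle."""
    queue = deque(tile_costs)
    idle_at = [(0, w) for w in range(n_workers)]  # (time the worker becomes idle, id)
    heapq.heapify(idle_at)
    finish = [0] * n_workers
    while queue:
        t, w = heapq.heappop(idle_at)   # the earliest idle worker requests work
        t += queue.popleft()            # and renders the next tile in the queue
        finish[w] = t
        heapq.heappush(idle_at, (t, w))
    return finish

tile_costs = [9, 9, 9, 9, 1, 1, 1, 1]   # hypothetical render cost of each tile
static_times = static_split(tile_costs, 2)
demand_times = demand_driven(tile_costs, 2)
print(static_times, demand_times)
```

The simulation ignores the cost of each work request; in a real system that cost is what makes the choice of work-item size, such as the 3×3 pixel item reported above, a trade-off between communication and balance.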

Notwithstanding the partitioning approaches employed, in their survey of load balancing strategies for parallel ray tracing, Heirich & Arvo (1998) conclude that static load balancing strategies result in unacceptably high load imbalances, and thus are non-optimal for use in time-constrained parallel ray tracing on a large number of computers.

4.3.1 Irradiance Cache

Strategies for parallelising the irradiance cache have been proposed both for shared memory and distributed systems. Shared memory approaches can benefit from the use of a single cache that is concurrently updated by multiple threads or processes in the system. This essentially makes the irradiance cache a shared data structure; while this helps to avoid work duplication across processors, access to the cache must be controlled to prevent simultaneous access by multiple threads from leaving the data structure in an inconsistent state. Straightforward approaches employ lock-based mechanisms and paradigms, such as readers-writers, which provide mutually exclusive access to the cache. Such an access pattern may create lock contention between the processors, reducing the scalability of the solution. Debattista et al. (2011) proposed a wait-free version of the parallel irradiance cache for shared memory systems which avoided the traditional locking approach.
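The straightforward lock-based approach can be sketched as follows. This is a minimal illustration under stated assumptions: a single mutex stands in for a readers-writers lock, sample reuse is reduced to a fixed-radius test instead of the usual weighted interpolation, and `expensive_irradiance` is a hypothetical placeholder for the hemisphere sampling. It does not show the wait-free scheme of Debattista et al. (2011).

```python
import threading

class LockedIrradianceCache:
    """A shared cache whose every access is serialised by one lock."""

    def __init__(self, radius):
        self._lock = threading.Lock()
        self._samples = []          # list of (position, irradiance)
        self._radius2 = radius * radius

    def lookup(self, p):
        """Return a cached value within the reuse radius of p, or None on a miss."""
        with self._lock:
            for q, e in self._samples:
                if sum((a - b) ** 2 for a, b in zip(p, q)) <= self._radius2:
                    return e
        return None

    def insert(self, p, e):
        with self._lock:
            self._samples.append((p, e))

def expensive_irradiance(p):
    # Hypothetical stand-in for the costly indirect diffuse computation.
    return 1.0

def shade(cache, p):
    """Reuse a cached sample if possible, otherwise compute and store one."""
    e = cache.lookup(p)
    if e is None:
        e = expensive_irradiance(p)
        cache.insert(p, e)
    return e

cache = LockedIrradianceCache(radius=0.5)
points = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (5.0, 0.0, 0.0)]
results = []
results_lock = threading.Lock()

def worker():
    for p in points:
        e = shade(cache, p)
        with results_lock:
            results.append(e)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(results))
```

Every lookup takes the same lock as every insert, so threads queue behind one another even when they only read; this is the contention that limits scalability. Two threads may also miss on the same point and both compute a sample, which is harmless here but duplicates work.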

In distributed systems, a parallel irradiance cache must strike a balance between cache misses and communication overhead. The standard Radiance distribution (Ward, 1994; Larson et al., 1998) uses the Network File System (NFS) to provide shared access to the irradiance cache in a distributed environment. Contention was dependent on the efficiency of the lock manager used. Koholka et al. (1999) shared irradiance sample batches between worker processes after every 50 calculated samples using the Message Passing Interface (MPI). Robertson et al. (1999) proposed a master-worker model where, for a predetermined batch size, each worker calculates irradiance samples and stores them at the master; the worker then gathers samples computed by other workers from the master according to some threshold. Debattista et al. (2006) used a component-based approach to separate the computation of indirect diffuse lighting from the rest of the rendering, dedicating a set of nodes to its computation.
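Batching is how these schemes trade cache misses against communication: samples are kept locally and only exchanged once enough have accumulated. The sketch below shows the bookkeeping only, assuming a batch size of 50 as in the description of Koholka et al. (1999); the `send` callable is a hypothetical stand-in for a network send and no MPI calls are made.

```python
class BatchingWorker:
    """Collects irradiance samples locally and shares them in fixed-size batches."""

    def __init__(self, batch_size, send):
        self.batch_size = batch_size
        self.send = send        # hypothetical callable standing in for a network send
        self.local = []         # every sample this worker knows about
        self.pending = []       # own samples not yet shared

    def add_sample(self, sample):
        """Record a newly computed sample and flush once a full batch is pending."""
        self.local.append(sample)
        self.pending.append(sample)
        if len(self.pending) >= self.batch_size:
            self.send(list(self.pending))
            self.pending.clear()

    def receive(self, batch):
        """Merge a batch computed by another worker into the local cache."""
        self.local.extend(batch)

sent = []                                   # batches that would go over the network
worker = BatchingWorker(batch_size=50, send=sent.append)
peer = BatchingWorker(batch_size=50, send=lambda batch: None)

for i in range(120):
    worker.add_sample(("sample", i))        # placeholder sample payload
for batch in sent:
    peer.receive(batch)
print(len(sent), len(worker.pending), len(peer.local))
```

A larger batch size means fewer messages but a longer period in which other workers may recompute samples they could have reused.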
