FASE DE TRATAMIENTO: - Las TIC en la nueva oficina judicial

In general, most graphics architectures assume that textures are stored as a finite sequence of texels arranged in a two dimensional grid. In this case, for a given texture of dimensions(m, n), there is specialized hardware for looking up the proper address in memory for given a two dimensional texel coordinate(x, y)∈_Zm×Zn. For the simplest representation, where the texels are placed in row-major

order, the addressA(x,y)for a given location of a texel of sizebbytes will be

A₍_x,y₎=b(my+x)

. For over a decade, GPUs have taken advantage of the fact that texture access isspatially coherent. In other words, if a texel is requested at location(x, y), it is likely texels at locations(x±1, y±1)will be requested soon as well, such as during texture filtering. For this reason, when textures are loaded into the GPU, they are laid out in a manner such that accessing a texel at one location will bring additional texels into the L2 and L1 memory caches. One such layout that preserves spatial coherence is the Z-order curve, where the texels are ordered as shown in Figure 7.1. The addressA(x,y)computation for this method is performed by interleaving the bitwise representations ofxandy. While difficult to describe mathematically, the hardware for this operation is surprisingly simple to implement.

The addressing scheme for compressed textures is designed to take the format’s block size into account. For example, DXT compresses4×4blocks down to64bits, or8bytes. Hence, for a naive raster-order layout of DXT blocks, the address is computed using the formula

A(x,y)= 8 _j y 4 km+ 3 4 + jx 4 k

Figure 7.1: A Z-order curve at different resolutions. Each increased resolution follows a similar structure from the previous resolution. Image courtesy of David Eppstein.

Once fetched, the eight bytes corresponding to this texel are passed through the hardware DXT decoder and the local(xmod4, ymod4)texel is returned. Sincepmod2kand₂pk can be performed efficiently

in hardware by bitwise shifts and masks, these calculations are usually very fast. The cache locality is preserved in two ways, first by storing blocks themselves in a Z-order curve, and second by storing the full decompressed block in L1 cache. In the case of DXT, since a single address corresponds to sixteen texels, storing the entire uncompressed block in L1 cache is usually good for performance.

In Chapter 4.3.3 we showed that by allowing a two level addressing scheme, we can effectively reduce the amount of data needed for storing a compressed texture. In particular, most novel compressed texture architectures, such as the binary trees proposed by Inada and McCool (2006), take a unique and format-specific approach to texel addressing. In some situations, such as streaming of real-time assets, CPU-GPU transfer is significantly more expensive than the data access. Each individual program must specify their own method for accessing and reading texture data in accordance to the fixed-function addressing scheme currently imposed by graphics hardware.

With texture programs, the application has explicit control over how to access memory. Certain textures that may be split into constituent parts may allow for better compression of each part individually than for the texture as a whole. For such textures, some parts may be compressed to such a degree as to cover large blocks of texels, amortizing the memory cost across the entire rendering task. Instead of

assuming a data layout or offering a fixed number of available addressing units, texture programs would offer a small but targeted set of instructions for computing one or more memory addresses. The data stored at these memory locations would then be used to construct a texel value to return to the remainder of the texture pipeline. As a preliminary implementation, the existing SIMD compute units in the GPU can be used to generate these texel values, as described in Figure 7.2.

7.1.1 Caching Benefits of Texture Programs

Allowing programmable addressing for a single texture access provides the programmer many benefits. First, abstracting the addressing into texture programs allows the developer to maintain transparent texture access for the rest of the GPU. In other words, the texture access would be defined on a per-texture basis for all other shader programs (vertex shader, fragment shader, etc), reducing code duplication. Second, by restricting the set of operations that can be performed by this addressing unit, the architecture can be optimized for integer operations, boosting performance. Additionally, the developer’s choice in layout could benefit the total memory costs of the entire application’s rendering task with that texture. For example, a binary tree representation that describes when to look-up texels and when not to could fit almost entirely in cache, allowing significant performance and energy improvements to rendering (Inada and McCool, 2006).

One of the largest benefits of the data layouts for fixed function addressing schemes is the cache coherence of multiple texel accesses. In general, the latency for each texture fetch is expected to be amortized over the cost of subsequent accesses. For this reason, the data layout is constructed in a way that matches the expected access pattern for displaying images. However, this only assumes a spatial coherence of texel accesses and has no notion of spatial coherence of image data. In effect, an image containing random noise will have the same access patterns as any game texture or natural image. In order to improve the cache properties of texture data, texture units can take advantage of intra-texture redundancies by incurring a few additional memory fetches that themselves become amortized over large number of nearby texel accesses. For example, the organization of three-channel texture data can be split into different data streams and collected later for rendering. One such example is to store the R, G, and B (or transformed Y, U, and V) planes of transformed data separately. Then, applications do not have to re-interleave the data prior to uploading them to the GPU and can rather load the planes separately into memory, possibly at different levels of compression. The texture program would request addresses for

Compute Memory

...

Texture Request Global Memory Constant Global Memory

...

Select Mip Level Find Samples

...

_Filter Samples

Fetch Data for Each Sample (optionally decode) Compute Memory

...

Texture Request Global Memory Constant Global Memory

...

Select Mip Level Find Samples

...

_Filter Samples

Run texel program for each sample

Figure 7.2: (left) The current GPU programming model with respect to texture accesses. (right) The proposed GPU programming model where texel values are generated by running small programs as part of the texture pipeline. Each GPU program (compute) that runs may make one or more memory requests (outlined) from different memory types usually handled in parallel. During each memory request, GPU programs are preempted to allow other parallel work to execute. As a result, the proposed architecture changes would be to allow an additional program to compute the texel samples needed as a part of each texture access. These programs would be free to make their own memory requests in order to compute the final pixel values. Each of the other existing caching mechanisms would remain in place, allowing the programmer to make a choice between optimizing caching behavior and compressed texture size. the three separate channels and combine them into the final texel value. Certain applications that only require two of the three channels, such as compactly packed normal maps, would not have to pay the cache penalty of fetching all three. Although the first texel requested will incur three times the latency from separate memory accesses, the subsequent texel fetches will find all of the required data already loaded into cache. Waveren and Casta˜no (2007) show that even using a fragment shader to reconstruct YCoCg encoded DXT5 textures often times provides benefits.

Another advantage that texture programs may have on the caching properties of textures is that we can use a VQ-style dictionary compression for different texture features. In that case, large textures with areas of both low and high detail can be specified with less data. The VQ will replicate the low-detail

areas in the dictionary and many more of the dictionary entries will reside in cache for any given memory address. The resulting application will usually hit the cache for the less detailed areas and miss the cache in the high detail regions, but if a majority of the texture is low detail, the amortized cost over the entire texture may be much smaller. These kinds of cache savings would be very good for managing battery performance with current mobile GPUs.

In document Las TIC en la nueva oficina judicial (página 102-109)