In order to create a realistic 3D AR experience, the rendered 3D objects need to have the correct appearance in accordance with the viewer’s position and orientation. A key challenge in MARBLE is to accurately estimate the viewer’s position and viewing angle, which is discussed in Section 5.4. This section describes how MARBLE prepares necessary information to enable proper rendering of captured 3D content on a viewer’s device.
Beacons must provide visual features of the scene (as seen from each beacon) so that these features can be matched against the viewer’s own view to determine where the viewer is looking. I describe what these features are and how they are generated, compressed, and stored inside beacons.
Figure 5.4: The process of feature filtering: hundreds of ORB features are extracted from a reference camera image in (a), and the feature entropy θ is computed for each. The resulting values are weighted by the 2D Gaussian function in (b). The four highest-scoring features are shown in a zoomed-in region in (c). They are successfully matched with features extracted from a different image of the same scene, even though some of the objects have been moved, as shown in (d).
5.3.1 Visual Features
MARBLE extracts visual features from images taken by the reference cameras deployed in the environment. I use ORB features (Rublee et al., 2011), which are faster to compute than the commonly used scale-invariant feature transform (SIFT) (Lowe, 1999) and SURF (Bay et al., 2006) features. Each ORB feature describes a small local region in the image. ORB features are robust to the rotation, scaling, and translation caused by changes in viewing angle. This makes them well suited to registering two images taken from slightly different viewing positions and angles. Once I find matches between ORB feature locations on an image from a reference camera and another image from the viewer’s device, I can estimate the viewer’s
relative rotation and translation to the reference camera. I deploy reference cameras around the environment so that they cover the whole scene. In practice, multiple reference cameras are not needed: a single camera can be placed at different locations in turn to generate the visual features to be stored in beacons. The storage requirement in bytes, B_f, for n ORB features is:
\[ B_f = n \left( k + \left\lceil \frac{w}{4} \right\rceil \right) \tag{5.1} \]
where k is the length of a descriptor in bytes and w is the bit length used to describe the horizontal/vertical offset of a feature’s location in the image. For ORB features k is 32, and in my case w is 10.
An indoor photo taken by a smartphone usually contains hundreds of ORB features, but a BLE 5.0 beacon can broadcast only seven features in one broadcasting period. This capacity mismatch motivates the design of a visual feature selection algorithm that keeps only the four features (among hundreds) needed by the homography algorithm. This results in a storage requirement of 4 × (32 + 3) = 140 bytes, which fits in a single BLE 5.0 packet.
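As a quick check of Eq. 5.1, the following Python sketch reproduces the arithmetic above. The function name `feature_storage_bytes` is illustrative and not part of MARBLE; the comment about why the location term is w/4 bytes is my reading of the formula (two w-bit offsets, eight bits per byte), not something stated in the text.

```python
import math


def feature_storage_bytes(n: int, k: int = 32, w: int = 10) -> int:
    """Bytes needed to store n features according to Eq. 5.1.

    k: descriptor length in bytes (32 for ORB).
    w: bit length of one horizontal or vertical offset (10 here).
    """
    # Assumed interpretation: two w-bit offsets occupy 2*w bits,
    # i.e. w/4 bytes, rounded up to a whole byte.
    location_bytes = math.ceil(w / 4)
    return n * (k + location_bytes)


# Four features, as selected for the homography step.
print(feature_storage_bytes(4))  # 4 * (32 + 3) = 140
```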
5.3.2 Selecting Unique and Useful Features
During feature selection, I consider two factors: 1) the uniqueness of the descriptor, and 2) the chance of finding this feature in the second view during matching.
To determine the uniqueness of the descriptor, I compute feature entropy (Cao et al., 2014):
\[ \theta_D = -\sum_{i} P(d_i) \log_2 P(d_i) \tag{5.2} \]
Here, the feature descriptor D is a vector {d_1, ..., d_n}, and P(d_i) is the probability of d_i in a feature. P(d_i) is estimated from all the features extracted from the reference cameras.
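A minimal sketch of the entropy computation in Eq. 5.2 follows. It rests on two assumptions that the text does not spell out: each d_i is taken to be one byte of the descriptor, and P(d_i) is estimated as the empirical frequency of that byte value pooled over all descriptors. The function names are hypothetical.

```python
import math
from collections import Counter


def estimate_probabilities(descriptors):
    """Empirical probability of each byte value over all descriptors."""
    counts = Counter(b for d in descriptors for b in d)
    total = sum(counts.values())
    return {value: c / total for value, c in counts.items()}


def feature_entropy(descriptor, prob):
    """theta_D = -sum_i P(d_i) * log2 P(d_i) over the descriptor's elements."""
    return -sum(prob[b] * math.log2(prob[b]) for b in descriptor)


# Toy 4-byte descriptors; a real ORB descriptor has 32 bytes.
descriptors = [bytes([0, 0, 0, 0]), bytes([0, 1, 2, 3])]
prob = estimate_probabilities(descriptors)
for d in descriptors:
    print(list(d), feature_entropy(d, prob))
```

In practice the descriptors would come from the reference camera images, and the resulting θ_D values would feed the weighting step described next.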
In my application, the difference in viewing angles between a viewer and a reference camera is usually within 45 degrees. Within this range, there is a high chance of having overlapping regions between the two views. Due
to the co-planar constraint of the homography method, I select feature points that share a plane. I observe that feature points in the upper half of the image from the first view are more likely to be co-planar, since they tend to be extracted from the far background (the wall).
In summary, if w and h denote the width and height of the reference camera image, I prefer feature points near the upper-half center (w/2, h/4) of the frame. I implement this by multiplying the θ_D computed for each feature descriptor by the value, at that feature’s location, of a 2D Gaussian function centered on (w/2, h/4); the product is the final feature weight. This is illustrated in Fig. 5.4 (a) and Fig. 5.4 (b).
Finally, I select the n features with the highest weights among all candidates and store them in the beacon. The entire process is illustrated in Fig. 5.4.
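The weighting and selection steps can be sketched as follows. The Gaussian spreads (`sigma_x`, `sigma_y`) are not given in the text, so the defaults below (w/4 and h/8) are placeholders chosen only for illustration, as are the function names and the (x, y, entropy) tuple layout.

```python
import math


def gaussian_weight(x, y, w, h, sigma_x, sigma_y):
    """2D Gaussian centered on (w/2, h/4), with peak value 1."""
    dx = x - w / 2
    dy = y - h / 4
    return math.exp(-(dx * dx / (2 * sigma_x ** 2) + dy * dy / (2 * sigma_y ** 2)))


def select_features(features, w, h, n=4, sigma_x=None, sigma_y=None):
    """Return the n features with the highest entropy-times-Gaussian weight.

    features: list of (x, y, entropy) tuples, where entropy is theta_D.
    """
    # Assumed spreads; the source does not specify them.
    sigma_x = w / 4 if sigma_x is None else sigma_x
    sigma_y = h / 8 if sigma_y is None else sigma_y
    scored = [
        (theta * gaussian_weight(x, y, w, h, sigma_x, sigma_y), (x, y, theta))
        for (x, y, theta) in features
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [feature for _, feature in scored[:n]]


# A feature at the preferred center outranks a higher-entropy one in a corner.
candidates = [(320, 120, 1.0), (0, 480, 1.5), (330, 130, 0.9)]
print(select_features(candidates, w=640, h=480, n=2))
```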
5.3.3 Storing Camera Properties
Besides visual features, I also need the intrinsic parameters of each reference camera to accurately estimate the viewer’s location and viewing angle in the rendering phase. The camera intrinsic parameters describe the focal length, aspect ratio, and principal point. These parameters, along with the camera’s location and orientation, are measured once and stored inside a beacon.
5.3.4 Generating AR Content
MARBLE is designed to support animated 3D content, i.e., a time series of changing 3D content. This section describes MARBLE’s 3D content representation.
MARBLE’s 3D content consists of one or more 3D objects. A 3D object in digital systems is commonly represented by a set of surface meshes or a skeleton model of its internal structure, plus a distance function to its surface (Siddiqi and Pizer, 2008). Both representations require shape-related data points to be densely sampled and stored. Typically, hundreds of 3D points and their connection information are used to describe the surface of a non-trivial 3D object, i.e., one that is not simply a cube or a sphere. Since I am targeting animated 3D content, the storage requirement increases further
by an order of magnitude. In total, this may exceed 1 MB, even after applying state-of-the-art compression techniques. This data size is far beyond MARBLE’s storage limit.
To address this problem, I design my data retrieval component to make use of prior knowledge about the captured objects. For example, when the captured object is a human body, my system divides the body into several major components such as two hands, a torso, and a head. 3D representations of these parts are pre-loaded in the viewer’s app. In the AR content generation phase, the 3D position and orientation of these major components in the environment are detected and stored. This requires less than 100 bytes of storage. In the rendering phase, the viewer’s device receives the components’ 3D position and orientation information and combines it with the pre-stored 3D models of the components to render the full 3D object. Using eight BLE 5.0 beacons, MARBLE can transmit 112 frames of human gestures in one broadcast packet.
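To make the size argument concrete, here is a hypothetical encoding of one pose frame. The four-part list follows the example above, but the byte layout (six little-endian float32 values per part: x, y, z, roll, pitch, yaw) is purely an assumption for illustration; MARBLE’s actual wire format is not described here and is likely more compact.

```python
import struct

# Body parts named in the text; the order is an arbitrary choice.
PARTS = ("left_hand", "right_hand", "torso", "head")

# Assumed layout per part: x, y, z, roll, pitch, yaw as float32.
PART_FORMAT = "<6f"
PART_SIZE = struct.calcsize(PART_FORMAT)  # 24 bytes


def pack_frame(poses):
    """Serialize {part: (x, y, z, roll, pitch, yaw)} into bytes."""
    return b"".join(struct.pack(PART_FORMAT, *poses[part]) for part in PARTS)


def unpack_frame(payload):
    """Inverse of pack_frame: bytes back to {part: 6-tuple of floats}."""
    return {
        part: struct.unpack_from(PART_FORMAT, payload, index * PART_SIZE)
        for index, part in enumerate(PARTS)
    }


frame = {part: (0.5, 1.0, -2.0, 0.0, 0.25, 0.0) for part in PARTS}
payload = pack_frame(frame)
print(len(payload))  # 4 parts * 24 bytes = 96
```

Under this layout a frame takes 96 bytes, which is consistent with the "less than 100 bytes" figure; the viewer’s app would pair each unpacked pose with its pre-loaded part model.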
In addition to captured data, MARBLE has the flexibility to store and broadcast synthesized virtual 3D object data, which can either be pre-written into BLE beacons or, if Internet connectivity is available, downloaded from the cloud. This enables MARBLE to render virtual objects with more detail, or objects that never appeared in the scene.