
into a toolglass-like device [BSP+93]. Furthermore, it allows users to bridge large distances by moving the entire device (i.e., positioning the reference frame with the non-dominant hand) and to perform fine-grained manipulation through touch input with the dominant hand [EB09].

Figure 6.1: Coarse and fine-grained bimanual interaction. (a) Moving the device spans the reference frame. (b) A finger can then interact in this reference frame.

Bimanual interactions can be carried out in two ways. First, both hands can work independently, for example when accessing multiple items in order to pile them at a predefined location. Second, the hands can depend on each other. The prime example is writing on a sheet of paper. Guiard found that the non-dominant hand continuously repositions the paper, thereby spanning the reference frame for the dominant hand [Gui87]. The area within which participants wrote was much smaller than the paper page. Our interaction style falls into the second category (i.e., one hand influences the other). In a comparative study, Kabbash et al. found that these dependent techniques outperform the independent ones [KBS94] since they require fewer motor operations. Guimbretière et al. confirmed these results and found merging command selection and direct manipulation to be the most important factor [GMW05]. Buxton et al. further found that single-handed input, as used in chapter 5, performs worse than bimanual input for selection, positioning, and navigation tasks [BM86]. Latulipe et al. extended this set of operations with manipulation tasks for multi-parameter functions such as image corrections [LKC05]. Hilliges et al. present a bimanual approach for tabletops [HBB07]. Similar to our system, the non-dominant hand spans the reference frame (here: a period of time) while the dominant hand can precisely interact with content within the reference frame. They found that such a design provides benefits for browsing, organizing and sharing digital media collections. Considering these results, we can assume that our bimanual, touch-based approach will outperform single-handed interaction.

6.2 Enabling Real-time Interactions

Although already a promising candidate for interacting with external displays, Shoot & Copy (see chapter 5) had two limitations. First, the user’s action (i.e., selecting an item by taking a picture) and the system’s reaction (i.e., showing a list of options regarding the selected content) happened sequentially. To provide direct feedback on the mobile device without the need to explicitly select an item first, the spatial relationship between the external screen and the mobile device needs to be sensed continuously. The second limitation is the need to center the item of interest in the mobile device’s viewfinder (unless the crosshair can be shifted explicitly). To overcome this, we chose a touch-enabled mobile device, which allows for a more flexible input technique. In this section, we discuss the necessary real-time tracking (see section 6.2.1), followed by a description of bringing touch to displays in the environment (see section 6.2.2).

6.2.1 Real-time Identification of Target Displays

Figure 6.2: Feature point tracing to allow for real-time image processing: (a) The first frame is processed fully to calculate the correct transformation. (b) The corner points for the final transformation on the target display. (c) These points are found in the next frame. (d) One of the four feature points cannot be detected.


In theory, the detection process described in chapter 5 can be used for real-time identification and tracking as well. This would result in a higher processor load, which is especially harmful when multiple users interact at the same time. The recognition time of 180 ms (or more, depending on the number of displays and content items) does not allow for true real-time interaction. According to Azuma, augmented reality systems are considered real-time if the delay is 100 ms or less [Azu97]. With this in mind, the presented approach, already too slow for real-time interaction, does not scale and hence limits the flexibility of the environment. Nevertheless, this registration is needed to allow for real-time feedback through the display. Thus, we needed to optimize the algorithm’s performance. One solution is to reuse information gathered in the previous frame, which in turn may reduce the steps needed during a single detection cycle.

[Figure 6.3 flowchart. Nodes: Wait for next video frame; Run full image processing; Display detected? (Yes/No); Detect feature points; Enough feature points found? (Yes/No); Update feature points; Calculate Transformation Matrix.]

Figure 6.3: The image recognition process for real-time feedback. The shaded areas denote the processes with high computational requirements. These can be left out as long as feature points are available and can further be detected in the new frame.

During the detection process of a display (and its content), we obtain the corner points of each identified polygon. Four of these points (describing the largest possible quadrangle) are used to calculate the final transformation, so it is sufficient to detect only these points. As our camera (i.e., the camera of the mobile device) does not remain still during the interaction, these points have to be identified repeatedly to maintain a correct transformation between both image planes. We can assume that the mobile device will not move far between two subsequent frames. Hence, the points detected in one frame are likely to still be visible in the next frame, although slightly shifted. If these points can be found in subsequent frames, the image processing can be reduced considerably, which in turn saves time and processor load. Removing the Hough Transformation in particular would decrease the detection time significantly. Since the system still has the points’ coordinates in the coordinate system of the external display, it can calculate the transformation as described earlier. This approach requires both high accuracy in tracing the points and a clear association between points in the previous frame and those in the current one. If a point in the current frame is associated with the wrong point in the previous frame, the calculated transformation will be incorrect, leading to unexpected responses. The system also needs a fallback solution if one of the points cannot be found.
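The transformation computed from four point pairs can be illustrated with a small sketch. It assumes that the transformation between the two image planes is a planar projective mapping (a 3x3 homography), which the text does not state explicitly. The function names, the example coordinates and the 1920x1080 display size are hypothetical and serve illustration only.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Matrix = List[List[float]]


def solve_linear(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a * x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def homography(src: Sequence[Point], dst: Sequence[Point]) -> Matrix:
    """Return the 3x3 matrix (bottom-right entry fixed to 1) mapping src onto dst."""
    a: List[List[float]] = []
    b: List[float] = []
    for (x, y), (u, v) in zip(src, dst):
        # Each correspondence contributes two linear equations in 8 unknowns.
        a.append([x, y, 1.0, 0.0, 0.0, 0.0, -u * x, -u * y])
        b.append(u)
        a.append([0.0, 0.0, 0.0, x, y, 1.0, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def apply_homography(h: Matrix, p: Point) -> Point:
    """Map a point from the camera image into display coordinates."""
    x, y = p
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)


# Hypothetical data: corner points as seen by the camera and on the display.
CAMERA_CORNERS = [(100.0, 80.0), (520.0, 120.0), (480.0, 400.0), (60.0, 340.0)]
DISPLAY_CORNERS = [(0.0, 0.0), (1920.0, 0.0), (1920.0, 1080.0), (0.0, 1080.0)]
H = homography(CAMERA_CORNERS, DISPLAY_CORNERS)
```

Because the display-side coordinates stay fixed, only the camera-side points have to be located again in each frame before the matrix is recomputed. This is also why a wrong association between points corrupts the whole mapping.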

Similarly to the optical flow analysis used in chapter 4 for the motion detection of the mobile device, we employ this method to find points in successive frames [BJB09]. In the mentioned system, we chose to track so-called feature points that are detected in a single frame and traced in subsequent ones [ST94]. These points differ significantly from others in the image and can therefore be identified more easily in successive images. We cannot guarantee that the identified feature points are corner points in the image. Most likely, these points lie inside of images, as the frequency (i.e., their set of surrounding features) is extremely high compared to others. As we have to match points detected in the video image with points on the external display, we have to know the exact location of each point in screen coordinates. This is necessary to calculate the transformation once the points have been detected. The feature points therefore need to be predefined by the system to ensure that only points with known coordinates are tracked. Using the Lucas-Kanade Optical Flow Method, we use the detected corner points (from a frame analyzed with full image processing) as feature points [LK81]. Their implementation uses a coarse-to-fine approach in scale-space (i.e., a pyramid). This means that the larger surroundings of pixels are matched first, followed by (if still appropriate) a matching of closely neighboring pixels. As our feature points lie directly on the corners of a polygon, this method yields good results because it takes the item’s visual content into account (see figure 6.2).
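To make the tracing step more concrete, the following sketch shows a single Lucas-Kanade estimation step on one window. It is a simplified illustration and not the implementation used in the system: it omits the pyramid and the iterative refinement, and it uses a synthetic Gaussian blob instead of camera images. All names and parameters are hypothetical.

```python
import math
from typing import List, Tuple

Image = List[List[float]]


def blob(size: int, cx: float, cy: float, sigma: float = 4.0) -> Image:
    """Synthetic grayscale image with one smooth Gaussian blob at (cx, cy)."""
    return [[math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
             for x in range(size)] for y in range(size)]


def lucas_kanade_step(prev: Image, curr: Image,
                      px: int, py: int, radius: int) -> Tuple[float, float]:
    """Estimate how far the window around (px, py) moved from prev to curr."""
    sxx = sxy = syy = sxt = syt = 0.0
    for y in range(py - radius, py + radius + 1):
        for x in range(px - radius, px + radius + 1):
            ix = (prev[y][x + 1] - prev[y][x - 1]) / 2.0  # spatial gradient in x
            iy = (prev[y + 1][x] - prev[y - 1][x]) / 2.0  # spatial gradient in y
            it = curr[y][x] - prev[y][x]                  # temporal difference
            sxx += ix * ix
            sxy += ix * iy
            syy += iy * iy
            sxt += ix * it
            syt += iy * it
    det = sxx * syy - sxy * sxy
    if abs(det) < 1e-12:
        raise ValueError("window has too little texture to be tracked")
    # Solve [[sxx, sxy], [sxy, syy]] * (dx, dy) = (-sxt, -syt).
    dx = (-syy * sxt + sxy * syt) / det
    dy = (sxy * sxt - sxx * syt) / det
    return dx, dy


# Hypothetical test pattern: the blob moves by (0.4, -0.3) pixels between frames.
PREV = blob(31, 15.0, 15.0)
CURR = blob(31, 15.4, 14.7)
DX, DY = lucas_kanade_step(PREV, CURR, px=15, py=15, radius=10)
```

For this smooth pattern the estimate should lie close to the true shift of (0.4, -0.3). A single step like this only holds for small displacements, which is one reason for the coarse-to-fine pyramid: larger motions are first estimated on downscaled images and then refined.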

By using this method, we can reduce the processing time significantly, at least for subsequent frames. The method detects these points in less than 20 milliseconds. Including the subsequent matrix calculation, the entire process for each frame with predefined points (i.e., identified in the first frame) takes 25 milliseconds at most. We found the approach suitable for real-time interaction, as this is well below the 100 ms threshold mentioned above. As mentioned before, the success of this method depends on whether these points can be found in the frame. Two main reasons limit the chance of detecting them: first, the video frame may contain motion blur, which may cause the method to fail; second, the device may have moved so far that not all points are still within the video frame. In both cases, the system is unable to find all points. One obvious solution is to reprocess the frame fully as described in chapter 5, which causes a slight delay during the interaction. An alternative is to use four other points in the image. These points then do not describe the largest possible quadrangle, which in turn decreases the accuracy. Our solution (as shown in figure 6.3) uses the first approach, running a full image analysis whenever the system does not detect all points in a subsequent frame. To limit the processing time, especially in environments with multiple screens, we first try the display (and its content) that was detected during the last successful cycle. If matching is still not possible, the system compares the given frame with all other displays until it finds a target display. If no display has been found, the system assumes that the mobile device is currently not pointed towards a screen. If a target display has been found, the system stores the detected corner points and processes each subsequent frame using optical flow analysis until a full rerun is needed (i.e., not all four points were detected in a frame). With this, the average detection time per frame is reduced significantly. This, of course, depends on the movement of the device as well as the quality of the images produced by the device’s camera.
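The per-frame control flow described above can be summarized in a short sketch. The two image-processing steps are passed in as callables, since their internals are outside the scope of this illustration. The class name, the method names and the callable signatures are assumptions and not part of the original system.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Point = Tuple[float, float]


class DisplayTracker:
    """Cheap point tracing first, full image analysis only as a fallback."""

    def __init__(self,
                 displays: Sequence[str],
                 full_detect: Callable[[object, str], Optional[Sequence[Point]]],
                 trace_points: Callable[[object, Sequence[Point]], Sequence[Point]]):
        self.displays = list(displays)    # identifiers of all known displays
        self.full_detect = full_detect    # (frame, display) -> four corners or None
        self.trace_points = trace_points  # (frame, points) -> points found again
        self.current: Optional[str] = None  # display of the last successful cycle
        self.points: List[Point] = []

    def process(self, frame: object) -> Optional[str]:
        """Return the display the device points at, or None if there is none."""
        # Fast path: trace the four stored points with optical flow.
        if self.current is not None:
            found = self.trace_points(frame, self.points)
            if len(found) == 4:
                self.points = list(found)
                return self.current
        # Slow path: full analysis, trying the last known display first.
        order = [d for d in self.displays if d == self.current]
        order += [d for d in self.displays if d != self.current]
        for display in order:
            corners = self.full_detect(frame, display)
            if corners is not None:
                self.current, self.points = display, list(corners)
                return display
        # No display matched: the device is not pointed towards a screen.
        self.current, self.points = None, []
        return None
```

With this structure, the expensive analysis only runs on the first frame, after a tracing failure, or while no display is in view, which is what reduces the average detection time per frame.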


6.2.2 Bringing Touch to Displays in the Environment

Since we need to process images in real-time, we chose to switch from Bluetooth to the faster and more reliable wireless LAN 802.11g. While the connection procedure is still similar to the Bluetooth version (i.e., discover the environment manager followed by a connection), it is much faster. Thus, this step can be integrated into the application launch as it does not consume much time. Discovering the environment manager is done using IP multicast. The response of the environment manager then includes its IP address and port. Displays in the environment connect using the exact same procedure. The communication with the environment manager is then done via TCP to allow for a reliable point-to-point connection. Similarly to the first prototype, the underlying connection is already established before the user interacts with the system. As the process is hidden in the application launch, users are not aware of being connected to the system already. A different approach would be to use the images coming from the live video as discovery messages and subsequently establish a point-to-point connection to the display. This in turn would decrease the detection performance significantly.

Once the connection is established, the mobile device starts to stream the live video content frame-by-frame to the environment manager. The centralized instance then analyzes the image using the previously described procedure (i.e., full image processing versus point tracing). If a display has been found, visual feedback is given to the user in the form of luminous virtual LEDs. The information needed to transform input points from the local device into the coordinate system of the current target display (i.e., the transformation matrix) is known to the environment manager at all times. From a technical point of view, the managing instance tracks the mobile device based on the delivered camera stream, which does not require additional equipment in the environment. When a touch event occurs, the centralized instance transforms it with the current transformation and sends it to the target display. The input is then performed on the external display. As users see the content through their viewfinder, they perceive the content to be manipulated directly on their mobile device. Users can thus interact with displays in the environment using touch input as if they were touching them directly.
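The forwarding of a touch event can be sketched as follows. The sketch assumes that the environment manager keeps one 3x3 transformation matrix per tracked device and applies it as a projective mapping. The class, its methods, the identifiers and the example matrix are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Matrix = List[List[float]]
Point = Tuple[float, float]


def transform(h: Matrix, p: Point) -> Point:
    """Map a touch point from device coordinates into display coordinates."""
    x, y = p
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)


class EnvironmentManager:
    """Holds the current transformation per device and forwards touch input."""

    def __init__(self, send: Callable[[str, Point], None]):
        self.send = send  # hypothetical transport towards a display
        self.tracking: Dict[str, Tuple[str, Matrix]] = {}

    def update_tracking(self, device_id: str, display_id: str, h: Matrix) -> None:
        """Called after every analyzed video frame that contains a display."""
        self.tracking[device_id] = (display_id, h)

    def on_touch(self, device_id: str, touch: Point) -> bool:
        """Forward a touch; return False if the device sees no display."""
        entry = self.tracking.get(device_id)
        if entry is None:
            return False
        display_id, h = entry
        self.send(display_id, transform(h, touch))
        return True


# Hypothetical usage: a pure scale-and-offset mapping for readability.
sent: List[Tuple[str, Point]] = []
manager = EnvironmentManager(lambda display, point: sent.append((display, point)))
manager.update_tracking("phone-1", "wall", [[4.0, 0.0, 100.0],
                                            [0.0, 4.0, 50.0],
                                            [0.0, 0.0, 1.0]])
manager.on_touch("phone-1", (10.0, 20.0))
```

Keeping the matrix on the manager side means the mobile device only has to report raw touch coordinates, which matches the centralized design described in this section.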

Assuming a one-to-one relationship between users and their mobile devices, we can further associate touches with persons. Especially for research regarding interactive surfaces, this seems to be a rather limiting factor. However, there are situations which require this association. For example, some items may only be accessed by certain users but should not perform any action if touched by others. Nearly all systems (with the exception of the DiamondTouch [DL01]) suffer from the missing association between finger and user. Users located around such an interactive surface have equal rights. Our prototype easily provides this association by attributing all touch events coming from a single mobile device to one person. The concept of ownership (and the resulting access rights) of items can now be used. This means that users can only interact with an item (e.g., moving it from the target display to the mobile device) if they have the access rights to do so. Naturally, different levels of access rights can also be used.
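A minimal sketch of such an ownership check is shown below. It assumes a fixed mapping from device identifiers to users and treats an item without a list of allowed users as accessible to everyone. Both choices, as well as all names, are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class Item:
    name: str
    # Assumption: an empty set means the item is accessible to every user.
    allowed_users: Set[str] = field(default_factory=set)


class AccessControl:
    """Attributes each touch to the owner of the device it came from."""

    def __init__(self, device_owner: Dict[str, str]):
        self.device_owner = dict(device_owner)  # device id -> user name

    def may_interact(self, device_id: str, item: Item) -> bool:
        user = self.device_owner.get(device_id)
        if user is None:
            return False  # unknown devices are never granted access
        return not item.allowed_users or user in item.allowed_users


# Hypothetical usage with two users and one restricted item.
acl = AccessControl({"phone-1": "alice", "phone-2": "bob"})
report = Item("report.pdf", allowed_users={"alice"})
poster = Item("poster.png")
```

Different levels of access rights could be modeled by replacing the set of users with a mapping from users to permitted operations.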
