Instead of interacting with virtual content on the personal device, live video streams can be used. Users can directly interact with objects shown in such streams according to the paradigm of Direct Manipulation [Shn83]. Two different setups can deliver these live images. In the first, cameras are stationary with respect to the environment and deliver a perspective that differs from the user’s. In the second, the personal device uses its own camera, which provides a view more closely aligned with the user’s perspective. When used on a stationary display, both setups deliver stable images. Today’s mobile devices usually feature a camera as well. The mobility of such devices is an advantage and a disadvantage at the same time. While mobility allows for higher flexibility in terms of potential targets, the resulting camera images suffer from instability. Furthermore, allowing users (and thus the camera) to move around requires tracking the device continuously. In this section we review both static and dynamic setups. We first describe techniques that make use of external cameras (see section 2.5.1). Subsequently, we focus on the usage of cameras built into mobile devices (see section 2.5.2). We pay close attention to possible tracking solutions for detecting the device’s position and (if necessary) its orientation.
2.5.1 Fixed Camera Setups
Stationary cameras offer a fixed view of a predefined portion of the environment. Tani et al. use this setup to allow workers in an industrial factory to manipulate distant machinery through computers in a control room [TYT+92]. In this Hyperplant, users can click and drag on the live video image of a machine shown on a regular desktop screen (see figure 2.13a). Actuators built into the machines respond to these actions: if users drag a slider in the video image, it moves in the real world (and thus in the video image) to give immediate feedback. The three-dimensional representation of the environment is created by hand a priori; it is hard to change and cannot be adapted to dynamic machinery (e.g., moving robots).
Fixed cameras can further be used to command (i.e., give directions to) movable machines such as robots. In Sketch and Run, users can draw strokes within live video to control cleaning robots [SHII09]. This system relies on cameras mounted on the ceiling which observe the entire room. Lines drawn in the orthographic view correspond to paths which are then followed by robots (see figure 2.13b). In contrast to Tani’s approach, movable machines are tracked with
2.5 Interacting through Live Video 35
markers. The only information known upfront is the position of each camera with respect to the room’s coordinate system. The robots’ speed was limited by the detection speed of the markers. Furthermore, the detection might fail when robots move from one camera’s viewport to another. Although the interaction takes place on a portable device, the live video images are stable because the cameras are fixed. The system still allows users to move around and control the cleaning robots from any place within the room. As described before, however, installed cameras reduce the flexibility of the environment.
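The mapping from a stroke drawn in an overhead camera view to a path in room coordinates can be illustrated with a minimal sketch. It is not the implementation used in Sketch and Run; it assumes an ideal orthographic ceiling camera whose image origin, scale and rotation with respect to the room are known. The function names `pixel_to_room` and `stroke_to_path` and all parameters are hypothetical.

```python
import math

def pixel_to_room(px, py, cam_origin, metres_per_pixel, yaw_rad=0.0):
    """Map an image pixel to floor coordinates for an overhead camera.

    Assumptions (not from the cited work): the camera looks straight down,
    cam_origin is the room position of image pixel (0, 0), and yaw_rad is
    the rotation of the image axes relative to the room axes.
    """
    # Scale pixel offsets to metres.
    dx = px * metres_per_pixel
    dy = py * metres_per_pixel
    # Rotate into the room's axes, then translate by the image origin.
    c, s = math.cos(yaw_rad), math.sin(yaw_rad)
    return (cam_origin[0] + c * dx - s * dy,
            cam_origin[1] + s * dx + c * dy)

def stroke_to_path(stroke, cam_origin, metres_per_pixel, yaw_rad=0.0):
    """Convert a drawn stroke (list of pixels) into room-space waypoints."""
    return [pixel_to_room(x, y, cam_origin, metres_per_pixel, yaw_rad)
            for x, y in stroke]

# Example: a horizontal stroke of 100 pixels at 1 cm per pixel.
path = stroke_to_path([(0, 0), (100, 0)], cam_origin=(1.0, 2.0),
                      metres_per_pixel=0.01)
```

Because only the camera pose is needed, such a mapping is consistent with the observation that the camera positions are the only information known upfront.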
Interacting on External Displays through Live Video:
Chiu et al. use video cameras installed in a conference room to facilitate meetings [CKRW99]. In their scenario, attendees can annotate live video images from a presentation source. Liao et al. extend this metaphor to connect virtual items to physical locations [LLK+03]. A video representation of the room allows users to annotate slides. These presentation slides can also be dragged between displays. The video representation of the room further allows dragging digital information onto physical objects. For example, users can drag presentation slides onto the printer in the room to obtain a physical copy. The entire interaction happens within the video image. This system relies on fixed cameras similar to Tani’s approach. The corresponding virtual representation of the room needs to be created upfront. Hence, this system only allows annotating stationary displays and does not offer a solution for portable computers.
Figure 2.13: Interacting through video from static cameras: a) Hyperplant allows users to operate machines shown in live video from a control room [TYT+92]. b) In Sketch and Run, users draw paths to maneuver cleaning robots [SHII09]. c) CRISTAL allows the control of home appliances in live video shown on a tabletop [SHS+09].
With the increasing number of digital tabletop systems, users can further interact in live video images using multi-touch input (see figure 2.13c). Seifried et al. present CRISTAL, a system showing live video images on a large tabletop [SHS+09]. In a living-room setting, users can control everyday appliances, such as light sources, an audio device, a TV, digital picture frames, or robotic vacuum cleaners (cf. [SHII09]). The authors compared a perspective camera position (i.e., in the room’s corner) to an orthographic placement (i.e., in the middle of the room’s ceiling). They found that a majority of participants preferred the latter setup. Again, the room geometry needs to be known to allow direct manipulation of objects.
2.5.2 See-Through Devices
Today, mobile devices are equipped with cameras and can be used for augmented reality applications in conjunction with large canvases (e.g., a paper-based map). The mobile device can then provide a more detailed view of the map. Schöning et al. demonstrate how a camera-equipped mobile device can be used as such a lens [SKM06]. To track the device, two-dimensional visual markers augment the large map. These markers occlude valuable map space and add information that is unnecessary to the user. Rohs et al. reduce the visual clutter by using dot markers, a two-dimensional grid of black dots surrounded by white rings [RSKH07]. The system’s performance is nearly unchanged, whereas the map’s readability increases significantly (see figure 2.14). Wagner et al. present a toolkit that allows different tracking solutions for mobile phones [WLS08]. The authors were able to reduce computational complexity to allow image processing on mobile devices at interactive frame rates. Rohs et al. present a model of target acquisition when using mobile devices as magic lenses [RO08]. This model divides the interaction into two consecutive phases: coarse acquisition followed by fine-control pointing.
Figure 2.14: Pointing with the mobile device’s camera: Top shows the dot markers used for map navigation [RSKH07]. Bottom shows the Point & Shoot procedure to select content on a distant screen [BRS05].
Recognizing Fiducial Markers through Live Video:
Fiducial markers can also be combined with digital canvases such as large displays. Ballagas et al. present Point & Shoot, a system that allows the selection of virtual objects on a remote display through the mobile device’s viewfinder [BRS05]. First, users aim at the item of interest using a crosshair shown in the live video image. When they press the joystick (to indicate a selection), a grid of visual markers is temporarily superimposed on the content to identify the targeted item. After finding the item, the markers are removed and the item is highlighted to
indicate a successful selection (see figure 2.14). The markers can further be used to encode the Bluetooth address of the display they are shown on, allowing for a non-modal connection procedure. Their system does not allow for continuous interaction, as this would require at least one marker to be permanently visible to the camera. Furthermore, multiple simultaneous users would increase the time markers are shown on the screen, even if they only select items in a discrete fashion. Alternatively, their system could show some markers permanently to avoid the aforementioned flashing of markers.
Madhavapeddy et al. use camera-equipped mobile devices in combination with markers to continuously control content on remote displays [MSSU04]. The markers used (known as SpotCodes) allow the device’s position and orientation to be detected continuously (see figure 2.15a). Their envisioned system enables users to zoom into certain parts of a world map. As soon as they have selected a country, a list of airports is shown on the mobile device. Rotating the mobile device relative to a marker can further be translated into manipulating a slider. The virtual slider can be shown on either the external display or the mobile device, depending on the scenario. Since the markers are shown virtually, this system allows for a flexible environment. Their displayed size, however, needs to be adjusted to allow for distant interaction.
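As an illustration of how a detected device rotation could drive a slider, the following minimal sketch maps an angle to a normalized slider position. It is not taken from [MSSU04]; the function name `rotation_to_slider` and the assumed usable rotation range of -90 to +90 degrees are hypothetical.

```python
def rotation_to_slider(angle_deg, min_angle=-90.0, max_angle=90.0):
    """Map a device rotation angle (degrees) to a slider position in [0, 1].

    Assumption: the tracker reports the device's rotation relative to the
    marker, and only [min_angle, max_angle] is used for the slider.
    """
    # Clamp so that rotating past the assumed range pins the slider.
    clamped = max(min_angle, min(max_angle, angle_deg))
    # Linear interpolation over the assumed range.
    return (clamped - min_angle) / (max_angle - min_angle)

# Holding the device upright places the slider in the middle.
middle = rotation_to_slider(0.0)
```

A linear mapping is only one possible design choice; a system could equally use relative rotation changes or a non-linear transfer function.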
Figure 2.15: Virtual content shown in live video on mobile devices: a) The device’s position and orientation are used to interact with content (through markers) [MSSU04]. b) Dynamic markers are used to obtain the transformation between mobile device and remote display in 3D [PJO09]. c) The iCam allows content to be placed at three-dimensional positions [PRA06].
Besides measuring position (two-dimensionally) and the device’s orientation, the distance to an external display can lead to new interaction techniques. Pears et al. present a system that identifies the spatial relationship between the mobile device and the external display it is pointed at [PJO09]. The display shows a simplified marker that allows the calculation of a homography (i.e., the transformation between the mobile device’s local coordinate system and the target display’s coordinate system). The crosshair shown on the mobile device is replicated on the external screen (see figure 2.15b). Users are able to select, move and rotate images similarly to previous systems. As the system is also able to detect the distance to the external display, users can scale images by moving away from or closer to the screen. While this system is of great importance for our approach, the drawbacks of markers (or pointers) shown on the remote display still remain, limiting the potential for multiple simultaneous users.
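To make the role of the homography concrete, the following minimal sketch estimates a homography from four point correspondences and uses it to map the crosshair from the camera image to display coordinates. It is an illustration under stated assumptions, not the method of [PJO09]: the four marker corners are assumed to be already detected in the camera image, and the helper names `solve_linear`, `homography` and `apply_h` are hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def homography(src, dst):
    """Estimate a 3x3 homography (last entry fixed to 1) from four pairs."""
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        # u = (h0 x + h1 y + h2) / (h6 x + h7 y + 1), likewise for v.
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b)
    return [h[0:3], h[3:6], h[6:8] + [1.0]]

def apply_h(H, p):
    """Map point p through H, including the projective division."""
    x, y = p
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / w,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / w)

# Assumed example data: marker corners as seen in the camera image and
# their known positions on a 1920x1080 display.
seen = [(100, 80), (220, 80), (220, 160), (100, 160)]
shown = [(0, 0), (1920, 0), (1920, 1080), (0, 1080)]
H = homography(seen, shown)
crosshair_on_display = apply_h(H, (160, 120))  # centre of the seen marker
```

In this example the crosshair at the centre of the observed marker maps to approximately the centre of the display. The apparent size of the marker in the camera image shrinks as the device moves away, which is one way a distance-dependent scaling gesture could be derived.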
Interacting through Live Video in Three Dimensions:
Physical objects and their locations can be augmented with markers as well. Rekimoto et al. present the NaviCam, a palmtop device that tracks itself with respect to the environment using a camera and color-coded markers [RN95]. Digital information is superimposed on the physical world seen in live video depending on the direction in which the device is pointed. One way to reduce the visual clutter caused by fiducial markers (both virtual and physical) is to let the environment locate the device. Patel et al. present an infrastructure to detect the position of a mobile device in three dimensions [PRA06]. Their system further determines the rotation around all axes, allowing for six degrees of freedom. When users place local information into the environment, a laser is used to determine the distance to the object. The absolute position of the digital information can then be calculated using the relative information obtained with the laser and the absolute positioning information given by the location sensors (see figure 2.15c). While this system is accurate, it is not robust against dynamic changes in the environment (e.g., moving a book on a shelf).
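The combination of absolute device pose and relative laser distance can be sketched as follows. This is a minimal illustration, not the implementation of [PRA06]; it assumes a room coordinate system with the z-axis pointing up, yaw measured around z and pitch measured above the horizontal plane. The function name `annotation_position` is hypothetical.

```python
import math

def annotation_position(device_pos, yaw_rad, pitch_rad, laser_distance):
    """Absolute position of a point hit by the device's laser.

    Assumptions: device_pos is (x, y, z) in room coordinates, yaw and pitch
    describe the pointing direction, and the laser is aligned with that
    direction. Roll does not change the pointing ray and is ignored.
    """
    # Unit vector along the pointing direction.
    dx = math.cos(pitch_rad) * math.cos(yaw_rad)
    dy = math.cos(pitch_rad) * math.sin(yaw_rad)
    dz = math.sin(pitch_rad)
    x, y, z = device_pos
    # Walk the measured distance along the ray from the device position.
    return (x + laser_distance * dx,
            y + laser_distance * dy,
            z + laser_distance * dz)

# Device held 1.5 m above the floor, pointing along the x-axis at an
# object 2 m away.
target = annotation_position((1.0, 2.0, 1.5), 0.0, 0.0, 2.0)
```

Since the stored position is absolute, the annotation stays where it was placed even if the object it referred to is later moved, which matches the robustness limitation noted above.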