Capturing the size and demographics of the audience in front of a public display has been a focus both in commercial contexts and in research. Commercial tools mostly use visual computing techniques to count the number of people in the vicinity of a display and refer to the field as “Anonymous Video Analytics” (AVA) [Int18;Ham+09;Fraa]. The name originates from the fact that the sensing and processing of the video feed is performed on or close to the visual sensor instead of in the cloud, and that only the outcomes of the computation are stored instead of the entire video feed. With this approach, the same viewer cannot be recognised again, and unique-viewer metrics therefore cannot be computed [Sla11]. However, “protecting viewer privacy by design” [Cav11] is a clear advantage of AVA and therefore an important step toward deploying systems that help understand the audience without revealing personally identifiable information or other comprehensive insights about individuals. In particular, the extraction of demographic information about the audience and retail customers is one of the main selling points of commercial state-of-the-art products. A set of metrics can be derived from AVA: “potential audience” (anyone in the vicinity of the display) and “useful audience” or impressions (those who actually glanced at the display) [Sla10b]. Demographic information from commercial signage analytics products that use video analytics includes an estimate of the viewers’ age, gender and, in some cases, a form of attention and mood measure [Int18;Fraa;See;Qui16;IBM13;NEC13] and ethnicity [GH15]. The overall goal of signage analytics is to help advertisers and content creators quantify the “effectiveness of dollars they spent” [Sla10a] and provide support for targeted advertising on public displays [Far+14;Sla10a].
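The distinction between “potential audience” and “useful audience” can be illustrated with a minimal sketch. The record layout (`Detection`), the function name `audience_metrics` and the derived `glance_ratio` are assumptions made for illustration only; they are not taken from any of the cited products.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Detection:
    """One anonymous detection event (hypothetical record layout)."""
    in_vicinity: bool  # person was detected near the display
    glanced: bool      # person's face was oriented toward the display


def audience_metrics(detections: Iterable[Detection]) -> dict:
    """Count potential audience and useful audience (impressions)."""
    potential = 0
    useful = 0
    for d in detections:
        if d.in_vicinity:
            potential += 1
            if d.glanced:
                useful += 1
    # glance_ratio is an illustrative derived value, not a cited metric.
    ratio = useful / potential if potential else 0.0
    return {
        "potential_audience": potential,
        "useful_audience": useful,
        "glance_ratio": ratio,
    }
```

Because each record carries no identity, the same person passing twice is counted twice, which mirrors the limitation regarding unique viewers noted above.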
Intel’s Anonymous Video Analytics (AVA) [Int18] is a commercial application for processing video analytics feeds in real time with the processing component located on the sensing device. Intel uses a face classification algorithm that identifies people whose heads are facing the camera (and thus also the display), and generates counts and aggregations of the number of visitors of a display. The audience counts are based on the total number of viewers and returning viewers; the application is not able to identify unique viewers. However, AVA does provide insights into the average dwell time that viewers spend looking at the screen. In addition to the typical audience, impression and dwell time counts, Intel’s AVA features a set of metrics that provide transparent insights into the accuracy of the measurements. For example, the impression count error describes the accuracy of the current viewer count by comparing the viewer count provided by AVA with ground truth data (e.g. collected through manual counting), giving insights into the accuracy of the face classification algorithm [San+11]. The system further collects a set of basic demographics: rough age estimates (child, young adult, adult and senior) and gender. Such metrics are included in audience analytics reports as an additional dimension and enable the user of the reports to aggregate viewer counts and dwell times by demographics.
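The comparison of a reported viewer count against manually collected ground truth can be sketched as follows. The exact formula used by Intel’s AVA is not given here, so the relative absolute error below is an assumption, and the function name `impression_count_error` is hypothetical.

```python
def impression_count_error(reported: int, ground_truth: int) -> float:
    """Relative absolute error of a reported viewer count.

    Assumption: error = |reported - ground_truth| / ground_truth.
    The definition used by the commercial product may differ.
    """
    if ground_truth <= 0:
        raise ValueError("ground_truth must be a positive count")
    return abs(reported - ground_truth) / ground_truth
```

For instance, a reported count of 90 against a manual count of 100 yields an error of 0.1 under this definition.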
Researchers from Fraunhofer developed a video analytics system that is comparable with Intel’s AVA: Fraunhofer SHORE [Frab] is a facial video analytics engine underpinning the
2.3 Data Capture 20
Anonymous Video Analytics for Retail and Digital Signage (“AVARD”) product [Fraa]. Similar to Intel, the face recognition is performed on or close to the sensor, and the researchers claim that the algorithm is robust enough to work in differing lighting conditions, even when utilising consumer-grade video cameras. While digital signage is one application domain, the software is also advertised for use in retail, health and other application areas. SHORE [Frab] goes one step further than just retrieving basic demographic information and is additionally capable of determining the emotional state of viewers purely through video analytics techniques. The emotional state is categorised in terms of four facial expressions: happy, sad, surprised, and angry. Fraunhofer SHORE additionally provides a fine-grained age estimation and returns an actual age instead of an age range, including a deviation metric to communicate the accuracy of the estimated age. This product is the foundation and visual analytics engine for Fraunhofer’s Anonymous Video Analytics for Retail and Digital Signage [Fraa] software suite, which provides an example use case for SHORE and the supported metrics and reports specifically in the digital signage domain.
The detection of emotional states and moods is not a unique feature. Other commercial visual analytics products that use a similar approach for face classification have been developed specifically for the digital signage domain. Quividi [Qui16] provides visual analytics tools that are capable of detecting a number of metrics in common with Fraunhofer SHORE: “opportunities to see” the content (i.e. counting people who walked by the display but did not glance at the screen), number of viewers, dwell and attention times, gender and age estimates, attention states and moods “from very unhappy to very happy” [Qui16]. As a unique feature, Quividi additionally supports the detection of facial attributes including facial hair, glasses and sunglasses [Qui16]. Specifically designed for kiosk analytics, Meridian uses video analytics to detect and count “potential users” and actual users, and supports retrieval of the collected data in real time, e.g. for use in interactive and adaptive display content [Sla16]. The classification, however, is limited to age and gender, though it could be extended with other video analytics products and enriched with interaction logs captured directly through the interactive kiosk software. SCALA Advanced Analytics [Sca] even allows a range of sensors and actuators to be plugged in; these can be individually programmed and can dynamically change their behaviour based on audience presence and viewer counts. For example, displays could change the content displayed based on an approaching audience or interactions in proximity to the display [Sca].
A broader approach is used by IBM Intelligent Video Analytics [Ham+09;IBM13] and NEC’s FieldAnalyst [NEC13] software. While the previous products focused mainly on face recognition classifiers and required the camera to be mounted on the screen, IBM focuses on analysing video feeds from CCTV cameras [Ham+09]. Similarly, the FieldAnalyst software captures faces and people from video streams and is capable of measuring a basic set of demographics (age, gender, distance to the screen, and viewing time), the number of viewers of a display and, additionally, the number of entrances and exits in a space without the need to place the camera at the display [NEC13]. Similar to Fraunhofer AVARD [Fraa], FieldAnalyst is designed for the digital signage and retail domain and provides ways for “target analysis”
and “non-buyer” analysis, helping display owners and content providers to understand which user groups are engaging with displays. Seemetrix returns a similar set of metrics, including a rough age and gender classification of viewers [See]. Reports are extended by an attention measure per viewer, which is calculated from the total duration a viewer has spent looking at (or in the direction of) the display, i.e. the duration in which the viewer has been “attentive” [See].
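A per-viewer attention measure of this kind can be sketched as the sum of the intervals in which a viewer was classified as looking toward the display. The interval representation (start and end times in seconds, non-overlapping) and the function name `attention_time` are assumptions for illustration; the sketch does not describe how Seemetrix computes its measure internally.

```python
from typing import Iterable, Tuple


def attention_time(intervals: Iterable[Tuple[float, float]]) -> float:
    """Total attentive duration in seconds for one viewer.

    Each interval is a hypothetical (start, end) pair, in seconds,
    during which the viewer faced the display. Intervals are assumed
    not to overlap.
    """
    total = 0.0
    for start, end in intervals:
        if end < start:
            raise ValueError("interval end precedes its start")
        total += end - start
    return total
```

A viewer who looked at the display from 0.0 s to 2.5 s and again from 4.0 s to 5.0 s would be assigned an attention time of 3.5 s.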
While commercial products focus on providing an end-to-end system for capturing and reporting information about the audience, one of the main focuses in the development and use of visual analytics tools is to answer questions about user behaviour and to support the tracking of individuals [LCK13]. Of course, systems that support audience tracking in the context of pervasive displays are also capable of generating audience numbers.
Examples of specific signage analytics work include the analysis of pedestrian traffic around a public display performed by Williamson and Williamson [WW14]. The authors placed a video camera on the display and used visual computing techniques to both count and track people walking in the area surrounding the display deployment. While the focus of the system was to track people, the same approach could be used to simply count the number of people who are in the immediate vicinity of the display and produce an audience count measure. Using a depth camera mounted on the display and facing the audience as a source, Tomitsch et al. deployed a public display to conduct a study of the level of care and attention viewers pay to content shown on the displays [Tom+14]. The video stream was recorded as part of the deployment, and the authors were able to use it to gain a better understanding of the audience (including the number of people) and their behaviour in front of the display. In a similar approach, Farinella et al. equipped a public display with cameras and developed a system that supports the identification and recognition of returning viewers at a public display based on biometric features [Far+14]. Parra, Klerkx, and Duval used visual computing techniques to automatically generate an audience count of people passing by an in-the-wild deployment at Brussels’ largest train station [PKD14].
In addition to simple audience counts, systems based on visual analytics are often also capable of capturing the dwell time of users in proximity to the display, and their view times of the display and content [RS13]. More recently, Elhart et al. published the “Audience Monitor”, a toolkit specifically designed to count the number of people approaching a display and to measure their dwell time [Elh+17]. Gillian et al. developed “Gestures Everywhere”, a system that is able to track an individual across multiple displays by combining a number of sensing technologies such as Bluetooth Low Energy beacons and video cameras [Gil+14]. In addition to providing context-aware content to the viewer, the system supports the tracking of individuals across multiple locations and serves as a basis for the generation of analytical insights such as audience counts.
Whilst the presented work typically requires video cameras mounted at the display and facing the audience, other video analytics tools utilise surveillance cameras that capture a broader view of the vicinity of the display. IBM Intelligent Video Analytics uses such an approach, which makes it possible to search for specific faces; extracting and filtering for detailed demographics, however, is not possible [IBM13]. Note that, for the purpose of conducting and analysing research experiments, demographic information about an audience has often been collected manually through observations, e.g. in [Alt+11a] and [PTK18].
The use of video analytics techniques to capture audience numbers and demographics can potentially pose a privacy risk to individuals present in front of the display. Previous work, however, has developed approaches that address the potential privacy risks. For example, Intel AVA [Cav11] conducts the analysis of the video feed close to the sensor and reports the generated numbers (e.g. the number of viewers engaging with the display) instead of the video feed. Similar approaches are taken with the concept of “Edge Analytics”, in which computations are performed at the edge of the cloud, close to the sensor, both for performance reasons and for privacy preservation [Sat+15].
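The principle of analysing frames close to the sensor and reporting only aggregate numbers can be illustrated with a minimal sketch. The class `EdgeAudienceCounter`, its fields and the injected `detect_faces` callable are all hypothetical; the sketch stands in for, and does not reproduce, the processing pipeline of Intel AVA or any edge analytics platform.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class EdgeAudienceCounter:
    """Keeps running aggregates only; frames are never stored."""
    frames_processed: int = 0
    frames_with_viewers: int = 0
    peak_viewers: int = 0

    def process(
        self,
        frame: object,
        detect_faces: Callable[[object], Sequence[object]],
    ) -> None:
        # detect_faces is an assumed, caller-supplied face detector.
        viewers = len(detect_faces(frame))
        self.frames_processed += 1
        if viewers > 0:
            self.frames_with_viewers += 1
        self.peak_viewers = max(self.peak_viewers, viewers)
        # The frame and the detections are discarded when this returns.

    def report(self) -> dict:
        """Aggregate numbers that would leave the sensing device."""
        return {
            "frames_processed": self.frames_processed,
            "frames_with_viewers": self.frames_with_viewers,
            "peak_viewers": self.peak_viewers,
        }
```

Only the dictionary returned by `report` would be transmitted in such a design, so neither images nor per-person records leave the device.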