Chapter IV. Results
3. Improvement actions to overcome the limitations on achieving quality
C1.1.1 - Acoustic Characteristics - Voice Authentication (VA) is a well-researched area, and it is possible to authenticate a user based on a voice profile. However, as this chapter has shown, VA must be considered specifically in the context of PVAs rather than general ASR systems. Research needs to address how to apply VA in current PVAs and how to protect against spoofing attacks in this context.
C1.1.2 - 2nd Factor Authentication - We believe that work using a 2nd factor for authentication is promising, in particular when this information can also be derived from the acoustic channel, which makes such solutions very practical. Although some work describing 2nd factor authentication exists, research has not looked into bypassing such methods. This is a crucial step to ensure that such methods are robust against attacks.
C1.2.1 - Hardware Non-Linearity - Research has shown that such attacks are very feasible. However, there has been little work on defence mechanisms against this type of attack. Furthermore, the attacks (and the few reported defence mechanisms) rely on sophisticated hardware. We would like to see whether the requirement for additional equipment can be overcome.
C.1.2.2 - Obfuscated Commands - Feasible attacks are described. However, this form of attack is not very convincing, as the noise will still be noticed by users. For this line of work, more detailed studies on the perception of audio samples are required to fully determine the feasibility of these attack types.
C.1.2.3 - Adversarial Commands - This line of work has produced very sophisticated attacks. Complex sentences can be embedded in audio samples, hidden from users. Psychoacoustic masking is increasingly used, and over-the-air attacks that take room characteristics into account are feasible. However, most attacks still consider white-box scenarios, in which the internal structure of the ASR is known. Work should consider black-box scenarios and investigate how to craft hidden commands that are effective on different ASRs. This field would benefit from a standardised evaluation environment to make attacks comparable. Research here should target the latest attention/transformer-based end-to-end ASR. Finally, defences against such powerful attacks deserve more research attention.
C.2.1 - Privacy Preservation - There is little work on privacy preservation in general, and more work in this area is required. Work has looked at transforming speech signals such that user-specific features (voice profile or paralinguistic information) cannot be extracted. Although such methods are effective, it is not clear how they can be integrated with existing systems and how a user would exercise control.
C.2.2 - Consent Management - Only one work has so far investigated how users can provide consent. Some work on DoS has been carried out as a mechanism for revoking consent. We believe that this is an important area for users and that more research in this domain is required.
C.3.1 Skill Market - The chapter has shown that the operation of the skill markets represents an attack surface. The mapping between voice commands and actions can be exploited by an attacker. Transcriptions of speech are subject to errors, which can also be exploited. However, a full-scale systematic misinterpretation analysis is yet to be completed, followed by work proposing suitable defence mechanisms.
Specifically, for the misinterpretation error, more advanced skill name censorship algorithms can be designed only once more studies into the misinterpretation errors of current COTS skill markets have been carried out. For attacks exploiting implicit capabilities of skills, such as skill switching, defences such as static algorithms that try to detect malicious skills at the back end may fail: the implementation details of a registered skill are unknown and only the interface is registered at the market, which means that a malicious skill can evolve at any time. Therefore, as mentioned in Dangerous Skills’19, dynamic analysis is promising, as it tries to invoke malicious actions from potentially adversarial skills. Such skills can be discovered once suspicious actions are spotted.
C.3.2 Jamming - Jamming of PVAs via the acoustic channel is feasible. Noise can be added to prevent a PVA from functioning. Existing work does not use sophisticated jamming methods (e.g. inaudible jamming, or jamming that prevents detection and localisation). Also, jamming has so far aimed to block a signal entirely; however, it might also be possible to add highly targeted interference that introduces more subtle ASR transcription errors. Defence methods to detect jamming, or to design PVAs that are resilient to jamming, are missing.
C.4.1 Passive Sensing - The chapter has shown that the acoustic channel can provide a rich set of information in addition to speech. The acoustic channel has been extensively used to infer user interaction patterns with devices (mainly interaction with phones). It has also been shown that a wide variety of other user behaviour (walking, eating, sitting) can be inferred. However, a detailed analysis of what information can be extracted via a PVA is missing. Also, no defence mechanisms against the use of a PVA as an acoustic sensor have been reported.
C.4.2 Active Sensing - The work in this category is similar to the line of work on passive sensing. However, as active signal generation is used, more detailed information can be obtained. It has not yet been investigated in detail how active sensing can be carried out on smart-speaker-type PVAs; work so far has focused on phone-based PVAs. Specifically, how an active sensing signal can be hidden or embedded in expected audio signals (hidden sensing) has not been studied. For example, sound (voice, music, ...) emitted from a smart speaker could be designed such that it also functions well as an active sensing signal. How to detect or defend against an active acoustic sensing signal has not been explored yet.
Chapter 4
Adversarial Command Detection Using Parallel Speech Recognition Systems
4.1 A Defence Method against Adversarial Commands Targeting ASR
A PVA can be integrated as a function in other devices such as smartphones or TVs, or it may be implemented as a dedicated device, referred to as a smart speaker. We use PVAs to interact with infrastructures such as our smart home and with services such as e-mail and news.
There are a number of PVA security and privacy concerns, and research has investigated a large variety of attacks on these systems. One prominent example is the so-called adversarial attack introduced in Section 2.4.3; related work is presented in Section 3.2.1-B3. The aim of such an attack is to supply a specially crafted voice signal, referred to as an adversarial command, which the PVA interprets differently than humans do. For example, the supplied adversarial command may be interpreted by humans as “Alexa, tell me what the weather is like” while the ASR of the PVA interprets this signal as “Alexa, open the front door”. An adversarial command is created by adding small perturbations to an audio recording until the PVA’s ASR recognises the attacker's intended command instead of the command contained in the original audio recording. If the perturbations are small and added carefully, a human will not notice the modification of the audio signal, while the ASR algorithms recognise different words.
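The construction described above is commonly written as a constrained optimisation problem. The formulation below is a generic sketch and not necessarily the exact objective used in the cited works; all symbols are assumptions introduced here for illustration: x is the original audio recording, δ the added perturbation, f the transcription function of the targeted ASR, t the attacker's intended command, and the norm a stand-in for whatever measure of audibility an attack uses.

```latex
\begin{aligned}
\min_{\delta} \quad & \lVert \delta \rVert \\
\text{subject to} \quad & f(x + \delta) = t
\end{aligned}
```

Keeping the norm of δ small corresponds to the requirement that a human does not notice the modification, while the constraint expresses that the ASR recognises the attacker's command.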
How to create adversarial commands has been studied in detail (see Section 2.4.3): different methods exist to generate them [11, 130, 69], and studies have demonstrated their effectiveness [67, 11, 24, 156, 130, 69, 118, 152, 108, 129, 120, 28]. Less effort has been put into devising defence methods against this serious form of attack [153].
In this chapter we describe a novel defence method against adversarial commands. Our method makes use of a second ASR component within the PVA, which we call the protection ASR, that analyses the supplied voice sample in parallel with the main ASR. The transcription output of the protection ASR is compared with the transcription output of the main ASR, and only if both outputs match closely enough is the transcription accepted and the command executed. The protection ASR uses different training data, or even an entirely different ASR architecture, compared to the main ASR. Thus, it is infeasible for an attacker to add unnoticeable perturbations to the original audio such that two entirely different ASRs are tricked into producing the same transcriptions.
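The comparison step can be illustrated with a minimal Python sketch. It assumes that both ASRs return plain-text transcriptions and that "close enough" is measured by a normalised word-level edit distance; the similarity measure, the function names, and the threshold of 0.7 are illustrative assumptions and are not taken from the method described in this chapter.

```python
def word_edit_distance(ref, hyp):
    """Levenshtein distance between two lists of words."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def transcription_similarity(main_text, protection_text):
    """Similarity in [0, 1]; 1.0 means identical word sequences."""
    main_words = main_text.lower().split()
    protection_words = protection_text.lower().split()
    longest = max(len(main_words), len(protection_words))
    if longest == 0:
        return 1.0
    distance = word_edit_distance(main_words, protection_words)
    return 1.0 - distance / longest


def accept_command(main_text, protection_text, threshold=0.7):
    """Accept the main ASR output only if both transcriptions agree.

    The threshold is a hypothetical value chosen for illustration.
    """
    return transcription_similarity(main_text, protection_text) >= threshold


# Hypothetical outputs of the main ASR and the protection ASR.
heard_by_protection = "alexa tell me what the weather is like"
print(accept_command("alexa tell me what the weather is like", heard_by_protection))
print(accept_command("alexa open the front door", heard_by_protection))
```

In this sketch a small transcription difference between the two ASRs is tolerated, whereas an adversarial command that fools only the main ASR yields two very different transcriptions and is rejected.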
In this chapter we demonstrate the feasibility of using a protection ASR to prevent a PVA from processing adversarial commands. We use the Kaldi ASR as the main PVA ASR and demonstrate that an effective protection ASR can be constructed based on either PocketSphinx [65] or Kaldi.
The protection ASR does not have to produce the same transcription quality as the main ASR. Its speech recognition must only be sufficiently accurate to provide protection; transcription accuracy is delivered by the main ASR. Thus, the protection ASR can be simpler and can be based on much less training data. Consequently, the protection ASR can be implemented with modest resource requirements and can be re-trained frequently. Frequent re-training adds complexity for a potential attacker who tries to craft an adversarial command targeting the main and the protection ASR jointly, because the structure of the protection ASR becomes a moving target.
The main contributions of this chapter are:
• Adversarial Command Detection (ACD): We describe a novel protection mechanism against adversarial commands using parallel ASR systems.
• Demonstration of ACD: We demonstrate the effectiveness of ACD using 20 adversarial commands and show that our ACD using PocketSphinx and Kaldi can detect all adversarial commands. We also show that the ACD can be configured so that false positives do not impede normal PVA operation while adversarial command detection sensitivity is still maintained.
• ACD Complexity: We show that the protection ASR can be significantly less complex than the main ASR in terms of architecture and training data. Thus, frequent retraining of the protection ASR is feasible, providing ACD as a moving target defence.
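The trade-off between false positives and detection sensitivity mentioned in the contributions can be sketched as follows. The sketch assumes that each voice sample has already been reduced to a similarity score between the two transcriptions and that a command is accepted when its score reaches the threshold. The scores are made-up values for illustration only and are not results from this chapter; the function names are hypothetical.

```python
def acd_rates(benign_scores, adversarial_scores, threshold):
    """Return (false_positive_rate, detection_rate) for one threshold.

    A sample is flagged as adversarial when its similarity score is
    below the threshold.
    """
    if not benign_scores or not adversarial_scores:
        raise ValueError("both score lists must be non-empty")
    false_positives = sum(1 for s in benign_scores if s < threshold)
    detected = sum(1 for s in adversarial_scores if s < threshold)
    return (false_positives / len(benign_scores),
            detected / len(adversarial_scores))


def strictest_threshold_without_false_positives(benign_scores):
    """Highest threshold that still accepts every benign sample."""
    return min(benign_scores)


# Made-up similarity scores, for illustration only.
benign = [0.90, 0.80, 1.00, 0.75]
adversarial = [0.10, 0.30, 0.00, 0.60]

threshold = strictest_threshold_without_false_positives(benign)
false_positive_rate, detection_rate = acd_rates(benign, adversarial, threshold)
print(threshold, false_positive_rate, detection_rate)
```

Lowering the threshold makes the ACD more tolerant of disagreement between the two ASRs, which reduces false positives at the cost of detection sensitivity; raising it has the opposite effect.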
4.2 Preliminaries