
The main contributions described in this chapter of the thesis are the following:

• The performance of the DUET algorithm has been evaluated in a variety of scenarios, including linear and binaural anechoic mixtures, echoic mixtures, and mixtures of speech with other types of sources such as noise and music. The results demonstrate the need for more advanced clustering techniques in such situations.

• A novel source separation algorithm that combines the mean shift clustering technique with the basis of DUET has been proposed. The clustering step in DUET, which is based on a weighted histogram, is replaced by a generalized version of the mean shift algorithm. A weighted-Gaussian kernel mean shift vector has been derived and included in an iterative process to cluster the two-dimensional feature space composed of the level and time differences between the two microphones. The proposed WG-MS algorithm has been tested in different scenarios: linear and binaural anechoic speech mixtures, echoic speech mixtures with different reverberation coefficients, and anechoic mixtures of speech with noise and of speech with music. The WG-MS algorithm has been compared to the original DUET algorithm and to a modification thereof that introduces the k-means algorithm in the clustering step. The analysis of the results demonstrates that the WG-MS algorithm clearly outperforms both the original DUET and its k-means modification.

• The WG-MS algorithm, which was originally proposed for two microphones, has been extended to any number of microphones and any array geometry. This is possible because the mean shift algorithm can cluster a feature space of any dimension. A new ML source estimator that considers any number of microphones has been derived. Several experiments varying the number of microphones support the suitability of the method, which is particularly robust in the case of echoic mixtures.

• A novel speech source enumeration algorithm has been proposed. The algorithm is based on information theoretic criteria and on the estimation of the source delays between the signals received by two microphones. The algorithm has obtained very good results and has shown good robustness in the enumeration of anechoic mixtures of up to 5 speech sources. Additionally, the potential of the algorithm to enumerate sources in echoic mixtures has been demonstrated.
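As an illustration of the clustering step named in the second contribution, the following is a minimal sketch of a mean shift iteration with a weighted Gaussian kernel. It is not the WG-MS implementation of this thesis: the function name, the `bandwidth`, `tol` and `max_iter` parameters, and the toy feature points are assumptions made for the example, and the weights stand in for whatever per-point weighting the algorithm actually uses.

```python
import math


def weighted_gaussian_mean_shift(points, weights, start,
                                 bandwidth=1.0, tol=1e-6, max_iter=200):
    """Move `start` towards a mode of a weighted Gaussian kernel density.

    points  : list of feature vectors (e.g. level and time differences)
    weights : one non-negative weight per point (assumed given)
    start   : initial position of the iteration
    """
    y = list(start)
    for _ in range(max_iter):
        num = [0.0] * len(y)
        den = 0.0
        for x, w in zip(points, weights):
            # Squared distance between the current position and the point.
            d2 = sum((a - b) ** 2 for a, b in zip(x, y))
            # Weighted Gaussian kernel value for this point.
            k = w * math.exp(-d2 / (2.0 * bandwidth ** 2))
            den += k
            for j, a in enumerate(x):
                num[j] += k * a
        if den == 0.0:
            break  # every kernel value underflowed; nothing to move towards
        new_y = [n / den for n in num]
        # Length of the mean shift vector for this iteration.
        shift = math.sqrt(sum((a - b) ** 2 for a, b in zip(new_y, y)))
        y = new_y
        if shift < tol:
            break
    return y


# Toy two-dimensional feature space with two well-separated groups
# (hypothetical values, for illustration only).
toy_points = [(-0.1, 0.0), (0.1, 0.0), (0.0, 0.1), (0.0, -0.1),
              (4.9, 5.0), (5.1, 5.0), (5.0, 5.1), (5.0, 4.9)]
toy_weights = [1.0] * len(toy_points)

mode_a = weighted_gaussian_mean_shift(toy_points, toy_weights, (0.5, 0.5), bandwidth=0.5)
mode_b = weighted_gaussian_mean_shift(toy_points, toy_weights, (4.5, 4.5), bandwidth=0.5)
```

Starting the iteration from several positions and grouping the positions that converge to the same mode yields the clusters. Unlike a histogram or k-means, this does not require fixing bin sizes or the number of clusters in advance, although the bandwidth still has to be chosen.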

The contributions of this chapter have led to the publications [Ayllón et al., 2010], [Ayllón et al., 2011a], [Ayllón et al., 2012a], [Ayllón et al., 2012b] and [Ayllón et al., 2013d].

Single-channel speech enhancement for monaural hearing aids

4.1 Introduction

This chapter tackles the problem of single-channel speech enhancement and its application to monaural hearing aids, where the main goal is to improve the intelligibility of speech in noise. Single-channel speech enhancement can follow two different approaches: noise reduction and source separation. A comprehensive review of single-channel speech enhancement algorithms has been carried out in sections 1.3.1.1 and 1.3.2.1. However, single-channel source separation algorithms inspired by CASA are either too complex or too limited in performance to be applicable in hearing aids. These algorithms typically involve complex operations for feature extraction, segregation and grouping, which makes a real-time implementation difficult. Nevertheless, the time-frequency masking approach inspired by CASA can be useful in hearing aids, as long as the mask computation is relatively simple.

The main problem with single-channel noise reduction algorithms is that they are commonly designed to improve speech quality rather than speech intelligibility, which is the ultimate goal for hearing-impaired people. The correct approach is to design the algorithms to optimize an objective measure correlated with speech intelligibility rather than with speech quality. It has been demonstrated in [Ma et al., 2009] that the fwSNRseg and the PESQ are two objective measures highly correlated with speech intelligibility. The alternative, which originated in the field of CASA, is time-frequency masking. This approach applies a gain function or mask to the time-frequency representation of a corrupted speech signal, removing the portions of the signal that are considered noise and letting the remaining signal pass through unaltered. The mask may be either a binary mask (i.e. it takes values of 0 or 1) or a soft mask (i.e. it takes continuous values between 0 and 1). The ideal binary mask (IBM) commonly defined in CASA systems [Hu and Wang, 2001, 2004] takes a value of zero or one at each time-frequency point by comparing the local SNR against a threshold, which is usually set to 0 dB. It is demonstrated in [Loizou and Kim, 2011] that the IBM maximizes the articulation index (AI), a metric highly correlated with speech intelligibility [Kryter, 1962]. Consequently, the use of the IBM for noise reduction also entails an improvement in speech intelligibility. Unfortunately, computing the IBM requires access to the clean speech and noise signals, which are not available in practice.
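The mask definitions above can be made concrete with a minimal sketch. It assumes oracle access to the speech and noise power at each time-frequency point, which is exactly the information that is unavailable in practice. The function names, the `eps` guard, and the strict comparison against the threshold are assumptions made for the example. The Wiener-like ratio is shown only as one common instance of a soft mask, not as the mask used in this chapter.

```python
import math


def local_snr_db(speech_power, noise_power, eps=1e-12):
    """Local SNR in dB at one time-frequency point (eps avoids log of zero)."""
    return 10.0 * math.log10((speech_power + eps) / (noise_power + eps))


def ideal_binary_mask(speech_power, noise_power, threshold_db=0.0):
    """Return 1 where the local SNR exceeds the threshold and 0 elsewhere.

    speech_power, noise_power : 2-D lists indexed as [frame][frequency bin]
    """
    return [[1 if local_snr_db(s, n) > threshold_db else 0
             for s, n in zip(s_row, n_row)]
            for s_row, n_row in zip(speech_power, noise_power)]


def wiener_like_soft_mask(speech_power, noise_power, eps=1e-12):
    """Illustrative soft mask: ratio of speech power to total power, in [0, 1]."""
    return [[s / (s + n + eps) for s, n in zip(s_row, n_row)]
            for s_row, n_row in zip(speech_power, noise_power)]


def apply_mask(mixture, mask):
    """Multiply each time-frequency value of the mixture by its mask gain."""
    return [[x * m for x, m in zip(x_row, m_row)]
            for x_row, m_row in zip(mixture, mask)]


# Toy powers for two frames and two frequency bins (hypothetical values).
speech = [[4.0, 1.0], [0.5, 9.0]]
noise = [[1.0, 4.0], [2.0, 1.0]]

ibm = ideal_binary_mask(speech, noise)
soft = wiener_like_soft_mask(speech, noise)
```

With a 0 dB threshold, the binary mask keeps exactly the points where the speech power exceeds the noise power, whereas the soft mask attenuates each point in proportion to its estimated speech content.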

The design of a speech enhancement algorithm based on time-frequency masking consists in estimating the IBM from the corrupted observations of the signal. The CASA approach performs this estimation using features inspired by the human auditory system (pitch, amplitude and frequency modulation, onset/offset, etc.). However, it is conceptually and computationally simpler to use machine learning techniques to classify each time-frequency point as speech-dominated or noise-dominated.

In this chapter, a time-frequency masking algorithm is proposed for single-channel speech enhancement in monaural hearing aids. The algorithm is designed bearing in mind the reduced computational resources available in state-of-the-art commercial hearing aids. The system uses a soft mask and is designed to maximize the output PESQ, which is an objective measure correlated with intelligibility.