The Wiener filter extracts a signal from a signal-plus-noise mixture by filtering (in matrix form or via another model). The core of Wiener filtering is computing this filter, i.e., solving the Wiener-Hopf equations.
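As a minimal illustration, the discrete-time Wiener-Hopf normal equations R_x w = r_xd can be solved directly for an FIR filter. The sketch below assumes access to a clean reference for estimating the correlations, and all function names are mine:

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def wiener_fir(x, d, order=32):
    """Solve the discrete Wiener-Hopf normal equations R_x w = r_xd.

    R_x: Toeplitz autocorrelation matrix of the noisy input x;
    r_xd: cross-correlation of x with the desired clean signal d.
    """
    n = len(x)
    # Biased correlation estimates at lags 0..order-1.
    r_xx = np.array([np.dot(x[:n - k], x[k:]) / n for k in range(order)])
    r_xd = np.array([np.dot(d[k:], x[:n - k]) / n for k in range(order)])
    # solve_toeplitz exploits the Toeplitz structure (Levinson recursion).
    return solve_toeplitz(r_xx, r_xd)

# Toy usage: a clean sinusoid buried in white noise.
t = np.arange(16000) / 16000.0
clean = np.sin(2 * np.pi * 440 * t)
noisy = clean + 0.5 * np.random.default_rng(0).standard_normal(t.size)
w = wiener_fir(noisy, clean, order=64)
enhanced = np.convolve(noisy, w)[: noisy.size]  # filtered estimate of clean
```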
SEGAN
SEGAN models the signal directly in the time domain and performs enhancement through end-to-end training.
The G network performs the enhancement. Its inputs are the noisy speech signal x together with the latent representation z, and its output is the enhanced version x' = G(z, x).
A secondary component is added to the loss of G to minimize the distance between its generations and the clean examples; the L1 norm is chosen for this distance, as it has proven effective in the image manipulation domain.
The least-squares GAN (LSGAN) approach replaces the cross-entropy loss with a least-squares function using binary coding (1 for real, 0 for fake).
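A PyTorch sketch of how these pieces fit together in the generator objective; the discriminator wiring and the λ weighting are illustrative assumptions (SEGAN reports λ = 100):

```python
import torch
import torch.nn.functional as F

def lsgan_d_loss(d_real, d_fake):
    """LSGAN discriminator loss with binary coding (1 real, 0 fake)."""
    return 0.5 * torch.mean((d_real - 1.0) ** 2) + 0.5 * torch.mean(d_fake ** 2)

def segan_g_loss(d_fake, enhanced, clean, lambda_l1=100.0):
    """LSGAN generator term (D should score the enhancement as 'real')
    plus the secondary L1 component pulling G's output towards the
    clean example."""
    adversarial = 0.5 * torch.mean((d_fake - 1.0) ** 2)
    l1 = F.l1_loss(enhanced, clean)
    return adversarial + lambda_l1 * l1
```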
MetricGAN
Modify the GAN loss function: the discriminator's labels are changed from the former discrete real/fake coding (0, 1) to a continuous value, namely the (normalized) metric to be optimized (e.g., PESQ, whose -0.5 to 4.5 range is normalized to 0-1 and used as the ground truth for the discriminator).
A new term is added to the D loss: minimizing the difference between noisy speech and clean speech.
When training D, previously generated speech is added to the input data (a replay buffer), with history_portion set to 0.2; G and D are trained alternately as in Algorithm 1.
Because frequency-domain signals carry phase, spectral magnitudes cannot simply be added: although noisy = clean + noise in the time domain, the ratio of clean to noisy spectral amplitude is not necessarily below 1, and the relationship between the signal and noise spectra differs across frequency bands. MetricGAN+ therefore introduces a learnable sigmoid function, so the post-processing (mask estimation) parameter differs per frequency bin. Note that "learnable" here does not mean trainable: it means the parameter varies with frequency; the alpha for each frequency bin is obtained from the whole dataset through pre-processing/analysis. Figure 4 of the paper gives the alpha values; alpha is akin to the mean and std values used for ImageNet normalization.
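A sketch of such a frequency-dependent sigmoid for mask estimation; the 1.2 ceiling (letting the mask exceed 1, matching the amplitude-ratio argument above) and the tensor shapes are assumptions, and alpha is treated as a precomputed per-bin vector per the note above:

```python
import torch

def frequency_dependent_mask(logits, alpha):
    """Per-frequency-bin sigmoid for mask estimation.

    logits: (batch, freq, time) raw network outputs.
    alpha:  (freq,) slope per frequency bin, precomputed from the dataset.
    The 1.2 factor lets the mask exceed 1.0 where the clean amplitude
    can exceed the noisy amplitude.
    """
    return 1.2 * torch.sigmoid(alpha.view(1, -1, 1) * logits)

# Usage: enhanced magnitude = mask * noisy magnitude.
# enhanced_mag = frequency_dependent_mask(net_out, alpha) * noisy_mag
```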
UNetGAN
Built on the GAN framework, with the generator replaced by a U-Net that uses dilated convolutions (a sketch follows after the next point).
Experiments were run on -20 dB data, but only with objective evaluation and no subjective evaluation. Even though the objective PESQ score improves, it is unclear whether the desired denoising is achieved; after denoising, noise may still dominate the signal.
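For illustration, a minimal dilated 1-D convolution block of the kind that could sit inside such a U-Net generator; channel counts and the dilation schedule are assumptions, not UNetGAN's exact configuration:

```python
import torch.nn as nn

class DilatedConvBlock(nn.Module):
    """Stack of dilated 1-D convolutions with 'same'-length outputs."""
    def __init__(self, channels=64, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.layers = nn.Sequential(*[
            nn.Sequential(
                nn.Conv1d(channels, channels, kernel_size=3,
                          dilation=d, padding=d),
                nn.PReLU(),
            )
            for d in dilations
        ])

    def forward(self, x):  # x: (batch, channels, time)
        return self.layers(x)
```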
WaveNet
Network
The intuition behind this configuration is two-fold.
First, exponentially increasing the dilation factor results in exponential receptive-field growth with depth. A 1, 2, 4, ..., 512 block can be seen as a 1×1024 convolution with a receptive field of size 1024.
Second, stacking these blocks further increases the model capacity and the receptive field size.
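A quick check of this receptive-field arithmetic (the function name is mine):

```python
def receptive_field(dilations, kernel_size=2, stacks=1):
    """Each dilated layer adds (kernel_size - 1) * dilation samples."""
    rf = 1
    for _ in range(stacks):
        for d in dilations:
            rf += (kernel_size - 1) * d
    return rf

# One 1, 2, 4, ..., 512 block with kernel size 2 covers 1024 samples:
print(receptive_field([2 ** i for i in range(10)]))            # 1024
# Stacking three such blocks grows it further:
print(receptive_field([2 ** i for i in range(10)], stacks=3))  # 3070
```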
Main features
μ-law companding transformation
Quantizing 16-bit integers (65,536 possible values) down to 256 possible values via f(x_t) = sign(x_t) · ln(1 + μ|x_t|) / ln(1 + μ), where μ = 255 and −1 < x_t < 1.
This non-linear quantization produces a significantly better reconstruction than a simple linear quantization scheme.
For speech, we found that the reconstructed signal after quantization sounded very similar to the original.
Preprocess the dataset by quantizing samples from 16 bits down to 8 bits (a bit-depth reduction, not sample-rate downsampling).
At each time step, the final softmax outputs a distribution over 256 values; this is a classification problem, analogous to next-token prediction in GPT.
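A minimal NumPy sketch of the μ-law companding and 256-way quantization described above (the encode/decode helper names are mine):

```python
import numpy as np

def mu_law_encode(x, mu=255):
    """Compand x in [-1, 1], then quantize to mu + 1 = 256 classes."""
    fx = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return ((fx + 1) / 2 * mu + 0.5).astype(np.int64)  # class ids 0..255

def mu_law_decode(q, mu=255):
    """Invert the quantization and the companding."""
    fx = 2 * (q.astype(np.float64) / mu) - 1
    return np.sign(fx) * ((1 + mu) ** np.abs(fx) - 1) / mu

# Round trip: waveform -> 256 classes -> waveform.
x = np.linspace(-1.0, 1.0, 5)
print(mu_law_decode(mu_law_encode(x)))
```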
Gated activation units
Output: z = tanh(W_{f,k} ∗ x) ⊙ σ(W_{g,k} ∗ x), where ∗ denotes convolution, ⊙ element-wise multiplication, σ(·) the sigmoid function, k the layer index, and f and g the filter and gate, respectively.
In initial experiments, this non-linearity was observed to work significantly better than the rectified linear activation function for modeling audio signals.
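A PyTorch sketch of the gated unit with causal dilated convolutions; channel sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class GatedActivation(nn.Module):
    """z = tanh(W_f * x) ⊙ sigmoid(W_g * x) with causal dilated convs."""
    def __init__(self, channels=64, dilation=1, kernel_size=2):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation  # left-pad for causality
        self.filter = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.gate = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):  # x: (batch, channels, time)
        x = nn.functional.pad(x, (self.pad, 0))
        return torch.tanh(self.filter(x)) * torch.sigmoid(self.gate(x))
```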
Conditional WaveNet
global conditioning
Condition: e.g., a single latent representation h that influences the output distribution across all time steps (such as a speaker embedding).
Output: z = tanh(W_{f,k} ∗ x + V_{f,k}^T h) ⊙ σ(W_{g,k} ∗ x + V_{g,k}^T h), where V_{f,k} is a learnable linear projection and V_{f,k}^T h is broadcast over the time dimension.
local conditioning
Condition: e.g., a second timeseries h_t, possibly with a lower sampling frequency than the audio signal.
Output: z = tanh(W_{f,k} ∗ x + V_{f,k} ∗ y) ⊙ σ(W_{g,k} ∗ x + V_{g,k} ∗ y), where y = f(h) is the condition upsampled to the audio resolution (e.g., with a transposed convolution) and V_{f,k} ∗ y is a 1×1 convolution.
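A combined sketch of local conditioning: the low-rate condition h is upsampled to the audio rate with a transposed convolution (y = f(h)) and injected through 1×1 convolutions. All layer sizes, and the non-causal padding, are assumptions:

```python
import torch
import torch.nn as nn

class LocallyConditionedGate(nn.Module):
    """z = tanh(W_f*x + V_f*y) ⊙ sigmoid(W_g*x + V_g*y), y = upsample(h)."""
    def __init__(self, channels=64, cond_channels=80, upsample=256):
        super().__init__()
        self.upsample = nn.ConvTranspose1d(cond_channels, cond_channels,
                                           kernel_size=upsample, stride=upsample)
        self.conv_f = nn.Conv1d(channels, channels, 2, padding=1)
        self.conv_g = nn.Conv1d(channels, channels, 2, padding=1)
        self.cond_f = nn.Conv1d(cond_channels, channels, 1)  # 1x1 convs
        self.cond_g = nn.Conv1d(cond_channels, channels, 1)

    def forward(self, x, h):  # x: (B, C, T), h: (B, C_h, T // upsample)
        y = self.upsample(h)[..., : x.size(-1)]
        f = self.conv_f(x)[..., : x.size(-1)] + self.cond_f(y)
        g = self.conv_g(x)[..., : x.size(-1)] + self.cond_g(y)
        return torch.tanh(f) * torch.sigmoid(g)
```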
Complementary approach to increase the receptive field: Context stacks
A context stack processes a longer part of the audio signal and locally conditions the model on that part.
SEFLOW
To date the only work based entirely on normalizing flows for speech enhancement:
The flow block / mapping function is built with a DNN architecture similar to WaveNet's.
μ-law preprocessing is applied to obtain a non-linear input, as in WaveNet.
In contrast to WaveGlow, both the input and the condition of the training and sampling processes in SEFLOW are time-domain signals. Since both signals have the same dimension, no upsampling layer is needed.
The NF part aims to transform the observed variable vector (call it x) into a transformed variable vector z with a known distribution, e.g., a Gaussian; x and z have the same dimension.
The VAE part then aims to find the low-dimensional latent variable vector h for the high-dimensional transformed variable vector z, which ultimately represents the high-dimensional observed variable vector x.
Given the NF-VAE, a lower bound on the log-likelihood of x is obtained by combining the VAE loss with the normalizing-flow loss, which enables end-to-end training.
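With the symbols above (x observed, z = f_NF(x) transformed, h the VAE latent; the exact notation is mine), the combination can be sketched as follows: the change of variables gives log p(x) = log p(z) + log|det ∂z/∂x|, and bounding log p(z) by the VAE's ELBO yields

```latex
\log p(x) \;\ge\;
  \underbrace{\mathbb{E}_{q(h \mid z)}\big[\log p(z \mid h)\big]
              - \mathrm{KL}\big(q(h \mid z)\,\|\,p(h)\big)}_{\text{VAE ELBO on } z}
  \;+\;
  \underbrace{\log \left| \det \frac{\partial z}{\partial x} \right|}_{\text{NF change of variables}},
\qquad z = f_{\mathrm{NF}}(x).
```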
We adopt a multi-task-learning strategy to incorporate speaker-aware feature extraction into speech enhancement.
In essence this is one large network that fuses multiple modules to meet different requirements: e.g., multi-head self-attention (MHSA) / BLSTM for long-term dependencies, and CNNs for feature extraction.
Self-adaptation is achieved via speaker classification: the added classification task makes the network learn to distinguish different speakers during training. At the same time, model adaptation to the target speaker improves accuracy, and the model can be applied to unknown speakers without any auxiliary guidance signal at test time.
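A minimal sketch of such a multi-task objective; the L1 enhancement loss and the 0.1 balance factor are assumptions:

```python
import torch.nn.functional as F

def multitask_loss(enhanced, clean, speaker_logits, speaker_ids, weight=0.1):
    """Enhancement regression loss plus an auxiliary speaker-classification
    loss, so the shared encoder learns speaker-discriminative features."""
    enhancement = F.l1_loss(enhanced, clean)
    speaker = F.cross_entropy(speaker_logits, speaker_ids)
    return enhancement + weight * speaker
```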
Sounds of Silence can be viewed as a single model with three submodules, each serving a different purpose, but the first submodule (silent interval detection) contributes no loss. (The authors also tried adding a loss for this first module, Sec. 3.2 & 3.3, but found the results got worse; they attribute this to the natural emergence of silent intervals.)
Sounds of Silence can therefore be viewed as a two-stage regression network, noise estimation plus noise removal, trained by optimizing the sum of the absolute (L1) regression losses.
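A sketch of that two-stage objective (the helper name is mine):

```python
import torch.nn.functional as F

def two_stage_loss(noise_est, noise_true, clean_est, clean_true):
    """Sum of absolute (L1) regression losses for the two stages:
    noise estimation and noise removal. The silent-interval detector
    contributes no loss term."""
    return F.l1_loss(noise_est, noise_true) + F.l1_loss(clean_est, clean_true)
```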
The proposed model is composed of an encoder, a two-stage transformer module (TSTM), a masking module and a decoder.
The encoder maps the input noisy speech into a feature representation.
The TSTM exploits four stacked two-stage transformer blocks to efficiently extract local and global information from the encoder output, stage by stage.
The masking module creates a mask which will be multiplied with the encoder output. Finally, the decoder uses the masked encoder feature to reconstruct the enhanced speech.
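A high-level sketch of this data flow; every submodule here is a placeholder, not the paper's exact layers:

```python
import torch
import torch.nn as nn

class MaskedEncDec(nn.Module):
    """encoder -> TSTM -> mask (applied to encoder output) -> decoder."""
    def __init__(self, encoder, tstm, mask_net, decoder):
        super().__init__()
        self.encoder, self.tstm, self.mask_net, self.decoder = \
            encoder, tstm, mask_net, decoder

    def forward(self, noisy):
        feat = self.encoder(noisy)             # feature representation
        ctx = self.tstm(feat)                  # local + global information
        mask = torch.sigmoid(self.mask_net(ctx))
        return self.decoder(mask * feat)       # masked features -> speech
```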