This HW is a speaker identification task: given a segment of speech, the model must classify which speaker it belongs to. It is again a classification task, and the dataset is VoxCeleb, with the specific details as follows:
The course objective is to learn how to use Transformers, which combine the advantage of RNNs (attending to the entire sequence) with that of CNNs (parallel computation).
The basic framework of the model used in this HW is as follows:
The specific task requirements are as follows:
Hints#
Simple#
Note that the link in the provided code for downloading the dataset using wget is no longer valid. You can download it from the official link.
The data used in this experiment is the Mel spectrogram obtained after preprocessing the original waveform.
Since different audio files have different lengths, during training each input utterance is cut into fixed-length segments for the model to learn from.
That is, a fixed-length window is taken from each mel spectrogram.
Note that in the above process, utterances shorter than the chosen segment length are not cropped; they are padded later during batching. A sketch of this cropping is shown below:
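A minimal sketch of this cropping, in the style of the sample code's dataset (`segment_len` is the fixed segment length; the function name is illustrative):

```python
import random
import torch

def crop_segment(mel, segment_len=128):
    """Randomly crop a fixed-length segment from a mel spectrogram of shape (length, 40)."""
    if len(mel) > segment_len:
        # Long utterances: take a random window of segment_len frames.
        start = random.randint(0, len(mel) - segment_len)
        mel = mel[start:start + segment_len]
    # Short utterances are returned as-is and padded later in collate_batch.
    return torch.FloatTensor(mel)
```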
The requirements for Simple are as follows; you only need to run its sample code directly.
The results are as follows:
Medium#
The requirements for Medium are as follows:
First, I modified the hidden-layer dimension in pred_layer. Although the results on the training and validation sets were good, the Kaggle score only improved by about 0.2.
Based on this, I made several changes.
However, I found that train_acc reached 1.0 after more than 40,000 steps, while the final val_acc was only 0.85 and the Kaggle score only around 0.66, so I adjusted the model architecture again.
After increasing nhead and num_layers to 4, val_acc rose to around 0.86 and the Kaggle score improved to around 0.7, leaving only a little fine-tuning to go.
On top of this, I added a dropout layer (p=0.1) after the ReLU in pred_layer; val_acc improved to around 0.87 and the Kaggle score to around 0.71, passing Medium. The resulting configuration is sketched below.
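A sketch of the Medium-level changes to the sample Classifier (the exact hidden dimension I used for pred_layer is not spelled out above, so the `4 * d_model` width here is illustrative):

```python
import torch.nn as nn

d_model, n_spks, dropout = 80, 600, 0.1

# Transformer encoder with nhead=4 and num_layers=4 (the sample code uses fewer).
encoder_layer = nn.TransformerEncoderLayer(
    d_model=d_model, dim_feedforward=4 * d_model, nhead=4, dropout=dropout
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=4)

# Prediction head with an enlarged hidden layer and dropout (p=0.1) after the ReLU.
pred_layer = nn.Sequential(
    nn.Linear(d_model, 4 * d_model),
    nn.ReLU(),
    nn.Dropout(0.1),
    nn.Linear(4 * d_model, n_spks),
)
```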
Strong#
The Strong requirement is to construct a Conformer:
Using a Transformer alone makes it difficult to fully exploit the local features of speech signals, while using a CNN alone struggles to capture global dependencies efficiently. It is therefore important to combine the advantages of Transformers and CNNs into an architecture that models both local and global features of speech, and Conformer is exactly such an architecture. Implementing it by hand would be cumbersome, so instead of reimplementing the structure from the paper myself, I used the Conformer provided by torchaudio, with the following specific changes:
Change the encoder to:

```python
self.encoder = models.Conformer(
    input_dim=d_model,
    num_heads=4,
    ffn_dim=4 * d_model,
    num_layers=6,
    depthwise_conv_kernel_size=31,
    dropout=dropout,
)
```
Transformers and their derivatives (such as Conformer) rely on key_padding_mask, a self-attention masking mechanism that is essential for handling variable-length sequence inputs. We therefore need to pass the length of each sequence into forward, which means the batch-collation functions must be modified:
- Training and validation batch:
```python
import torch
from torch.nn.utils.rnn import pad_sequence


# Process features within a batch.
def collate_batch(batch):
    """Collate a batch of data, additionally returning the original length of each utterance."""
    mel, speaker = zip(*batch)
    lengths = torch.FloatTensor([m.size(0) for m in mel])
    # Because we train the model batch by batch, we need to pad the features in the
    # same batch so that their lengths are the same.
    mel = pad_sequence(mel, batch_first=True, padding_value=-20)  # pad with log 10^(-20), a very small value
    # mel: (batch size, length, 40)
    return mel, lengths, torch.FloatTensor(speaker).long()
```
- Test batch:
```python
import torch


def inference_collate_batch(batch):
    """Collate a batch of test data, additionally returning the original length of each utterance."""
    feat_paths, mels = zip(*batch)
    lengths = torch.FloatTensor([m.size(0) for m in mels])
    return feat_paths, lengths, torch.stack(mels)
```
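For reference, torchaudio's Conformer builds the padding mask from these lengths internally; conceptually the mask is constructed roughly like this (a sketch for intuition, not the library's exact code):

```python
import torch

def lengths_to_padding_mask(lengths: torch.Tensor) -> torch.Tensor:
    """True where a time step is padding; shape (batch size, max length)."""
    max_len = int(lengths.max().item())
    return torch.arange(max_len, device=lengths.device)[None, :] >= lengths[:, None]

# Example: two utterances of length 3 and 5 in a batch padded to length 5.
mask = lengths_to_padding_mask(torch.tensor([3, 5]))
# tensor([[False, False, False,  True,  True],
#         [False, False, False, False, False]])
```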
Note that you also need to modify the place in model_fn where the batch is unpacked accordingly, as sketched below.
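A sketch of the corresponding change, following the structure of the sample code's model_fn (the variable names mirror the sample code; treat this as an outline rather than a drop-in replacement):

```python
import torch

def model_fn(batch, model, criterion, device):
    """Forward a batch through the model; the batch now also carries the utterance lengths."""
    mels, lengths, labels = batch
    mels = mels.to(device)
    lengths = lengths.to(device)
    labels = labels.to(device)

    # The model now takes the lengths so the encoder can mask padded frames.
    outs = model(mels, lengths)
    loss = criterion(outs, labels)

    # Get the speaker id with the highest probability.
    preds = outs.argmax(1)
    accuracy = torch.mean((preds == labels).float())
    return loss, accuracy
```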
Since the model is larger, the learning rate should generally be lowered a bit; here it is set to 5e-4. Because the learning rate is lower, training is extended to 140,000 steps and warmup is increased to 2,000 steps. The training results show that val_acc improved to around 0.92:
The results on Kaggle are as follows:
Strong was passed, though only by a slight margin.
Boss#
The Boss requirements are as follows:
I used open-source implementations of self-attention pooling and additive margin softmax (AM-Softmax) and adapted them to fit the model. The changes to the model are as follows:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchaudio.models as models


class Classifier(nn.Module):
    def __init__(self, d_model=300, n_spks=600, dropout=0.15):
        super().__init__()
        # Project the dimension of features from that of input into d_model.
        self.prenet = nn.Linear(40, d_model)
        # Change Transformer to Conformer (https://arxiv.org/abs/2005.08100).
        # self.encoder_layer = nn.TransformerEncoderLayer(
        #     d_model=d_model, dim_feedforward=4 * d_model, nhead=4, dropout=dropout
        # )
        # self.encoder = nn.TransformerEncoder(self.encoder_layer, num_layers=4)
        self.encoder = models.Conformer(
            input_dim=d_model,
            num_heads=6,
            ffn_dim=4 * d_model,
            num_layers=8,
            depthwise_conv_kernel_size=31,
            dropout=dropout,
        )
        # Attention-based pooling over the time dimension instead of mean pooling.
        self.pooling = SelfAttentionPooling(d_model)
        # Project the features from d_model up to the embedding dimension used by the loss.
        self.pred_layer = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.BatchNorm1d(4 * d_model),
            nn.ReLU(),
        )
        # AM-Softmax produces the per-speaker logits and, when labels are given, the loss.
        self.loss = AdMSoftmaxLoss(embedding_dim=4 * d_model, no_classes=n_spks, scale=1, margin=0.4)

    def forward(self, mels, lengths, labels=None):
        """
        args:
            mels: (batch size, length, 40)
            lengths: (batch size,) number of valid frames in each utterance
        return:
            logits: (batch size, n_spks)
            err: scalar loss if labels are given, otherwise None
        """
        # out: (batch size, length, d_model)
        out = self.prenet(mels)
        # Unlike nn.TransformerEncoder, torchaudio's Conformer expects
        # (batch size, length, d_model) plus the lengths, so no permute is needed.
        out, _ = self.encoder(out, lengths)
        # Self-attention pooling over time instead of mean pooling.
        # stats = out.mean(dim=1)
        stats = self.pooling(out)
        # out: (batch size, 4 * d_model)
        out = self.pred_layer(stats)
        logits, err = self.loss(out, labels)
        return logits, err
```
SelfAttentionPooling is:
```python
import torch
from torch import nn
import torch.nn.functional as F


class SelfAttentionPooling(nn.Module):
    """
    Implementation of SelfAttentionPooling
    Original Paper: Self-Attention Encoding and Pooling for Speaker Recognition
    https://arxiv.org/pdf/2008.01077v1.pdf
    """
    def __init__(self, input_dim):
        super(SelfAttentionPooling, self).__init__()
        # One scalar attention score per time step.
        self.W = nn.Linear(input_dim, 1)

    def forward(self, batch_rep):
        """
        input:
            batch_rep : size (N, T, H), N: batch size, T: sequence length, H: hidden dimension
        attention_weight:
            att_w : size (N, T, 1)
        return:
            utter_rep: size (N, H)
        """
        # Softmax over the time dimension gives the attention weights.
        att_w = F.softmax(self.W(batch_rep).squeeze(-1), dim=-1).unsqueeze(-1)
        # Weighted sum over time yields the utterance-level representation.
        utter_rep = torch.sum(batch_rep * att_w, dim=1)
        return utter_rep
```
AdMSoftmaxLoss is:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdMSoftmaxLoss(nn.Module):
    def __init__(self, embedding_dim, no_classes, scale=30.0, margin=0.4):
        '''
        Additive Margin Softmax Loss

        Attributes
        ----------
        embedding_dim : int
            Dimension of the embedding vector
        no_classes : int
            Number of classes to be embedded
        scale : float
            Global scale factor
        margin : float
            Size of additive margin
        '''
        super(AdMSoftmaxLoss, self).__init__()
        self.scale = scale
        self.margin = margin
        self.embedding_dim = embedding_dim
        self.no_classes = no_classes
        # One (norm-constrained) weight vector per class, acting as the class center.
        self.embedding = nn.Embedding(no_classes, embedding_dim, max_norm=1)
        self.loss = nn.CrossEntropyLoss()

    def forward(self, x, labels=None):
        '''
        Input shape (N, embedding_dim)
        '''
        n, m = x.shape
        assert m == self.embedding_dim
        if labels is not None:
            assert n == len(labels)
            assert torch.min(labels) >= 0
            assert torch.max(labels) < self.no_classes

        # Cosine similarity between normalized embeddings and class weights.
        x = F.normalize(x, dim=1)
        w = self.embedding.weight
        cos_theta = torch.matmul(w, x.T).T
        # Subtract the additive margin from the target-class cosine.
        psi = cos_theta - self.margin

        logits = None
        err = None
        if labels is not None:
            onehot = F.one_hot(labels, self.no_classes)
            logits = self.scale * torch.where(onehot == 1, psi, cos_theta)
            err = self.loss(logits, labels)
        else:
            logits = cos_theta
        return logits, err
```
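At inference time there are no labels, so the loss module simply returns the (unscaled) cosine logits, and the prediction is the argmax over speakers. Roughly (where `mels` and `lengths` come from a collated test batch):

```python
# Inference sketch: with labels=None, forward() returns the cosine logits and err=None.
model.eval()
with torch.no_grad():
    logits, _ = model(mels, lengths)   # logits: (batch size, n_spks)
    preds = logits.argmax(dim=-1)      # predicted speaker ids
```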
After training for 350,000 steps, with an initial lr of 1e-3 and warmup over 5,000 steps, the final result still fell slightly short. Lowering the initial lr or combining with an ensemble might be enough to pass Boss; I did not tune it further.
The results are as follows:
I observed that train_acc reached 1.0 quite early, so applying some data augmentation might improve the results; one possibility is sketched below.
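For example, SpecAugment-style time/frequency masking on the mel inputs could be applied during training. An untested sketch using torchaudio.transforms (the function name and mask sizes are illustrative):

```python
import torch
import torchaudio.transforms as T

# SpecAugment-style masking for a batch of log-mel features of shape (batch, length, 40).
freq_mask = T.FrequencyMasking(freq_mask_param=8)
time_mask = T.TimeMasking(time_mask_param=20)

def augment(mels: torch.Tensor) -> torch.Tensor:
    # torchaudio's masking transforms expect (..., freq, time), so transpose first.
    mels = mels.transpose(1, 2)
    mels = freq_mask(mels)
    mels = time_mask(mels)
    return mels.transpose(1, 2)
```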