This HW is a speaker identification task: given a segment of audio, the model must identify which speaker it belongs to. It is again a classification task, and the dataset used is VoxCeleb, with the details as follows:
The course objective is to learn how to use Transformers, which combine the advantage of RNNs (taking the entire sequence into account) with that of CNNs (parallel processing).
The basic framework of the model used in this HW is as follows:
The specific task requirements are as follows:
Hints#
Simple#
Note that the link provided in the code for downloading the dataset using wget is no longer available. You can download it from the official link.
The data used in this experiment is the Mel spectrogram obtained after preprocessing the original waveform.
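Just for intuition, 40-dimensional log-mel features like these could be computed roughly as follows (illustrative only: the dataset already ships the preprocessed features, and the exact preprocessing parameters are my assumption):

import torch
import torchaudio

def waveform_to_logmel(wav_path: str) -> torch.Tensor:
    """Illustrative only: the HW dataset already provides preprocessed 40-dim log-mel features."""
    waveform, sample_rate = torchaudio.load(wav_path)        # (channel, samples)
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sample_rate, n_fft=400, hop_length=160, n_mels=40
    )(waveform)                                              # (channel, 40, frames)
    log_mel = torch.log10(mel + 1e-20)                       # log compression
    return log_mel.squeeze(0).transpose(0, 1)                # (frames, 40)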
Since different audio clips have different lengths, each input clip is cut into fixed-length segments during training for the model to learn from.
That is:
Note that in this process, clips shorter than the chosen segment length are padded:
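For intuition, the cropping step inside the dataset's __getitem__ looks roughly like this (a minimal sketch; segment_len=128 is an assumption following the sample code, and the padding of short clips happens later in collate_batch):

import random
import torch

def crop_segment(mel: torch.Tensor, segment_len: int = 128) -> torch.Tensor:
    """Randomly crop a fixed-length segment from a (length, 40) mel spectrogram."""
    if len(mel) > segment_len:
        # Pick a random starting frame and take segment_len consecutive frames.
        start = random.randint(0, len(mel) - segment_len)
        mel = mel[start:start + segment_len]
    # Clips shorter than segment_len are returned as-is and padded in collate_batch.
    return mel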
The requirements for Simple are as follows; you only need to run its sample code directly.
The results are as follows:
Medium#
The requirements for Medium are as follows:
First, I modified the hidden-layer dimension in pred_layer. Although the results on the train set and validation set looked good, the Kaggle score only improved by about 0.2.
Based on this, I made several changes.
However, I found that train_acc reached 1 after more than 40,000 steps, while the final val_acc was only 0.85 and the Kaggle score only about 0.66, so I tried adjusting the model architecture again.
After increasing nhead and num_layers to 4, val_acc rose to about 0.86 and the Kaggle score improved to about 0.7, so only a little more tuning was needed.
On this basis, I added a dropout layer (0.1) after the ReLU in pred_layer; val_acc finally improved to about 0.87 and the Kaggle score rose to about 0.71, passing Medium.
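Putting these changes together, the Medium-level Classifier looks roughly like this (a sketch only: d_model=80 follows the sample code, and the widened hidden layer of pred_layer is an assumption, since I did not record the exact value):

import torch.nn as nn

class Classifier(nn.Module):
    def __init__(self, d_model=80, n_spks=600, dropout=0.1):
        super().__init__()
        self.prenet = nn.Linear(40, d_model)
        # nhead and num_layers both increased to 4.
        self.encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, dim_feedforward=4 * d_model, nhead=4, dropout=dropout
        )
        self.encoder = nn.TransformerEncoder(self.encoder_layer, num_layers=4)
        # Prediction head with a wider hidden layer and dropout after ReLU.
        self.pred_layer = nn.Sequential(
            nn.Linear(d_model, 2 * d_model),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(2 * d_model, n_spks),
        )

    def forward(self, mels):
        # mels: (batch, length, 40) -> (batch, length, d_model)
        out = self.prenet(mels)
        # nn.TransformerEncoder expects (length, batch, d_model) by default.
        out = out.permute(1, 0, 2)
        out = self.encoder(out)
        out = out.transpose(0, 1)
        # Mean pooling over the time dimension.
        stats = out.mean(dim=1)
        # (batch, n_spks)
        return self.pred_layer(stats)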
Strong#
The Strong requirement is to build a Conformer:
Using only Transformers makes it difficult to fully exploit the local features in speech signals, while using only CNNs makes it hard to capture global dependencies efficiently. It is therefore crucial to combine the advantages of Transformers and CNNs in an architecture that can model both the local and global features of speech, and Conformer is exactly such an architecture. Implementing it by hand would be too cumbersome, so instead of re-implementing the structure myself I used the Conformer from torchaudio, with the following modifications:
Change the encoder to
self.encoder = models.Conformer(input_dim=d_model, num_heads=4, ffn_dim=4*d_model, num_layers=6, depthwise_conv_kernel_size=31, dropout=dropout)
Transformers and their derivative models (such as Conformer) use key_padding_mask, a self-attention masking mechanism that is critical for handling variable-length sequence inputs. This means the length of each sequence must be passed in during forward, so the way the batch is assembled needs to be modified:
- Training and validation batch:
def collate_batch(batch):
    # Process features within a batch.
    """Collate a batch of data."""
    mel, speaker = zip(*batch)
    lengths = torch.FloatTensor([m.size(0) for m in mel])
    # Because we train the model batch by batch, we need to pad the features
    # in the same batch to make their lengths the same.
    mel = pad_sequence(mel, batch_first=True, padding_value=-20)  # pad log 10^(-20), which is a very small value.
    # mel: (batch size, length, 40)
    return mel, lengths, torch.FloatTensor(speaker).long()
- Test batch:
def inference_collate_batch(batch):
    """Collate a batch of data."""
    feat_paths, mels = zip(*batch)
    lengths = torch.FloatTensor([m.size(0) for m in mels])
    return feat_paths, lengths, torch.stack(mels)
Note that you also need to modify the corresponding part of model_fn where the batch is unpacked.
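For reference, a minimal sketch of that change, assuming model_fn keeps the sample code's (batch, model, criterion, device) signature and that the model's forward is called as model(mels, lengths):

import torch

def model_fn(batch, model, criterion, device):
    """Forward a batch through the model, now unpacking lengths as well."""
    mels, lengths, labels = batch
    mels = mels.to(device)
    lengths = lengths.to(device)
    labels = labels.to(device)

    # The encoder uses lengths to build its key_padding_mask.
    outs = model(mels, lengths)
    loss = criterion(outs, labels)

    # Get the speaker id with the highest probability.
    preds = outs.argmax(1)
    accuracy = torch.mean((preds == labels).float())
    return loss, accuracy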
Since the model has grown larger, the learning rate should generally be lowered a bit; here it is set to 5e-4. Because the learning rate is lower, training was extended to 140,000 steps, with warmup increased to 2,000 steps. The training results show that val_acc improved to about 0.92:
The result on Kaggle is as follows:
Successfully passed Strong, with a small margin to spare.
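For reference, a sketch of how the optimizer and schedule settings above might be wired up, assuming the sample code's AdamW optimizer and cosine warmup schedule (the transformers helper here is an equivalent stand-in for the schedule function defined in the sample code):

import torch
from transformers import get_cosine_schedule_with_warmup

def build_optimizer(model: torch.nn.Module):
    # Learning rate lowered to 5e-4 for the larger Conformer model.
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)
    # 2,000 warmup steps out of 140,000 total training steps.
    scheduler = get_cosine_schedule_with_warmup(
        optimizer, num_warmup_steps=2000, num_training_steps=140000
    )
    return optimizer, scheduler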
Boss#
The Boss requirements are as follows:
I used open-source implementations of self-attention pooling and additive margin softmax (AM-Softmax) and adapted them to fit into the model. The modified model is as follows:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchaudio.models as models


class Classifier(nn.Module):
    def __init__(self, d_model=300, n_spks=600, dropout=0.15):
        super().__init__()
        # Project the dimension of features from that of input into d_model.
        self.prenet = nn.Linear(40, d_model)
        # Change Transformer to Conformer: https://arxiv.org/abs/2005.08100
        # self.encoder_layer = nn.TransformerEncoderLayer(
        #     d_model=d_model, dim_feedforward=4*d_model, nhead=4, dropout=dropout
        # )
        # self.encoder = nn.TransformerEncoder(self.encoder_layer, num_layers=4)
        self.encoder = models.Conformer(
            input_dim=d_model,
            num_heads=6,
            ffn_dim=4 * d_model,
            num_layers=8,
            depthwise_conv_kernel_size=31,
            dropout=dropout,
        )
        # Replace mean pooling with self-attention pooling.
        self.pooling = SelfAttentionPooling(d_model)
        # Project the features from d_model up to the embedding dimension.
        self.pred_layer = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.BatchNorm1d(4 * d_model),
            nn.ReLU(),
        )
        # AM-Softmax produces both the logits and the loss.
        self.loss = AdMSoftmaxLoss(embedding_dim=4 * d_model, no_classes=n_spks, scale=1, margin=0.4)

    def forward(self, mels, lengths, labels=None):
        """
        args:
            mels: (batch size, length, 40)
            lengths: (batch size,) valid frame count of each mel sequence
            labels: (batch size,) speaker ids, or None at inference time
        return:
            logits: (batch size, n_spks)
            err: AM-Softmax loss, or None when labels is None
        """
        # out: (batch size, length, d_model)
        out = self.prenet(mels)
        # torchaudio's Conformer is batch-first and uses lengths to build its
        # key_padding_mask, so the permutes required by nn.TransformerEncoder
        # are no longer needed.
        out, _ = self.encoder(out, lengths)
        # Self-attention pooling over the time dimension instead of mean pooling.
        # stats = out.mean(dim=1)
        stats = self.pooling(out)
        # out: (batch size, 4 * d_model)
        out = self.pred_layer(stats)
        # logits: (batch size, n_spks)
        logits, err = self.loss(out, labels)
        return logits, err
SelfAttentionPooling is:
import torch
from torch import nn


class SelfAttentionPooling(nn.Module):
    """
    Implementation of SelfAttentionPooling
    Original Paper: Self-Attention Encoding and Pooling for Speaker Recognition
    https://arxiv.org/pdf/2008.01077v1.pdf
    """
    def __init__(self, input_dim):
        super(SelfAttentionPooling, self).__init__()
        self.W = nn.Linear(input_dim, 1)

    def forward(self, batch_rep):
        """
        input:
            batch_rep : size (N, T, H), N: batch size, T: sequence length, H: hidden dimension
        attention_weight:
            att_w : size (N, T, 1)
        return:
            utter_rep: size (N, H)
        """
        # Attention weights over the time dimension (dim=1 made explicit to avoid the implicit-dim warning).
        att_w = nn.functional.softmax(self.W(batch_rep).squeeze(-1), dim=1).unsqueeze(-1)
        # Weighted sum of the frame representations.
        utter_rep = torch.sum(batch_rep * att_w, dim=1)
        return utter_rep
AdMSoftmaxLoss is:
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdMSoftmaxLoss(nn.Module):
    def __init__(self, embedding_dim, no_classes, scale=30.0, margin=0.4):
        '''
        Additive Margin Softmax Loss

        Attributes
        ----------
        embedding_dim : int
            Dimension of the embedding vector
        no_classes : int
            Number of classes to be embedded
        scale : float
            Global scale factor
        margin : float
            Size of additive margin
        '''
        super(AdMSoftmaxLoss, self).__init__()
        self.scale = scale
        self.margin = margin
        self.embedding_dim = embedding_dim
        self.no_classes = no_classes
        # Class centers, constrained to the unit ball via max_norm.
        self.embedding = nn.Embedding(no_classes, embedding_dim, max_norm=1)
        self.loss = nn.CrossEntropyLoss()

    def forward(self, x, labels=None):
        '''
        Input shape (N, embedding_dim)
        '''
        n, m = x.shape
        assert m == self.embedding_dim
        if labels is not None:
            assert n == len(labels)
            assert torch.min(labels) >= 0
            assert torch.max(labels) < self.no_classes

        # Cosine similarity between the normalized embeddings and the class centers.
        x = F.normalize(x, dim=1)
        w = self.embedding.weight
        cos_theta = torch.matmul(w, x.T).T
        # Apply the additive margin to the target-class cosine.
        psi = cos_theta - self.margin

        logits = None
        err = None
        if labels is not None:
            onehot = F.one_hot(labels, self.no_classes)
            logits = self.scale * torch.where(onehot == 1, psi, cos_theta)
            err = self.loss(logits, labels)
        else:
            logits = cos_theta
        return logits, err
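Because the Classifier now returns (logits, err) with the loss computed internally by AdMSoftmaxLoss, the training step has to take the loss from the model's output rather than from an external criterion. A minimal sketch of the corresponding model_fn change (the details are an assumption, following the same structure as before):

import torch

def model_fn(batch, model, criterion, device):
    """Forward a batch; the loss is now computed inside the model by AdMSoftmaxLoss."""
    mels, lengths, labels = batch
    mels = mels.to(device)
    lengths = lengths.to(device)
    labels = labels.to(device)

    # The Classifier returns (logits, loss); the external criterion is unused here.
    logits, loss = model(mels, lengths, labels)

    preds = logits.argmax(1)
    accuracy = torch.mean((preds == labels).float())
    return loss, accuracy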
After training for 350,000 steps, with an initial lr of 1e-3 and 5,000 warmup steps, the final result still fell a bit short. Lowering the initial lr or adding an ensemble might be enough to pass Boss, but since it was only slightly off, I did not tune it further.
The results are as follows:
I observed that train_acc reached 1 quite early on, so I suspect some data augmentation would further improve the results.
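For instance, one common augmentation for mel-spectrogram inputs is SpecAugment-style frequency/time masking; a hypothetical sketch (not something I tried here) using torchaudio's transforms:

import torch
import torchaudio.transforms as T

# Hypothetical augmentation: mask random frequency bands and time spans of a log-mel segment.
freq_mask = T.FrequencyMasking(freq_mask_param=8)
time_mask = T.TimeMasking(time_mask_param=10)

def augment(mel: torch.Tensor) -> torch.Tensor:
    # torchaudio's masking transforms expect (..., freq, time),
    # while the dataset stores (time, freq), so transpose around them.
    spec = mel.transpose(0, 1).unsqueeze(0)   # (1, 40, length)
    spec = time_mask(freq_mask(spec))
    return spec.squeeze(0).transpose(0, 1)    # back to (length, 40)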