ML2022Spring HW2

Introduction#

HW1 was based on data from the Delphi group at CMU for COVID-19 case prediction: given survey results from the past 5 days for a given US state, the model predicts the percentage of new positive test cases on the 5th day. That was a regression task.

HW2 is frame-by-frame phoneme classification on MFCC features pre-extracted from the raw audio of LibriSpeech (the train-clean-100 subset). This is a classification task.

Since I have not studied speech recognition-related knowledge, I first conducted a brief study of the background knowledge.

Background#

Phoneme#

image

A simple explanation

MFCC Features#

Audio signals are recorded as waveforms. As preprocessing, this task extracts MFCC features from the raw audio signal. Why?

A good answer is:

Audio classification tasks refer to categorizing audio signals based on their content. When performing audio classification, it is necessary to extract useful features from the audio signals for better classification.
Currently, most audio classification tasks use MFCC features for feature extraction. MFCC features are audio features based on human auditory characteristics, effectively simulating the human ear's ability to recognize audio signals. MFCC features mainly consist of information such as frequency and energy, capturing frequency and energy changes in audio signals, thus providing effective feature information for audio classification tasks.
Additionally, MFCC features have good robustness, achieving better performance in different audio environments. They can adapt to various audio signal sources, including human voices, instrument sounds, etc., and can suppress noise interference for audio classification tasks.

Regarding the understanding of MFCC, a particularly good explanation is available for reference. Since this is not the main content of this homework, a simple understanding is sufficient.
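
For a concrete picture of where these features come from, here is a minimal extraction sketch using torchaudio (the homework already provides the features pre-extracted, so this is purely illustrative; the file path and frame parameters below are placeholders I chose):

```python
import torch
import torchaudio

# Load one utterance and compute 13 MFCCs per 25 ms frame with a 10 ms hop
# (n_fft=400 and hop_length=160, assuming 16 kHz audio as in LibriSpeech),
# then append delta and delta-delta coefficients to obtain a 39-dim feature
# per frame.
waveform, sr = torchaudio.load("example.flac")        # (1, num_samples)
mfcc = torchaudio.transforms.MFCC(
    sample_rate=sr,
    n_mfcc=13,
    melkwargs={"n_fft": 400, "hop_length": 160, "n_mels": 40},
)(waveform)                                           # (1, 13, T)
delta = torchaudio.functional.compute_deltas(mfcc)
delta2 = torchaudio.functional.compute_deltas(delta)
feats = torch.cat([mfcc, delta, delta2], dim=1)       # (1, 39, T)
```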

Note that because MFCC extraction splits the signal into frames (each frame is a 39-dim MFCC feature covering only 25 milliseconds of speech), a single frame is unlikely to cover a complete phoneme.

  • Typically, a phoneme spans multiple frames.
  • Neighboring frames are therefore concatenated for training.

Thus, the teaching assistant gives an example: take the five frames before and after a given frame, concatenate them into 11 frames in total, and use them to predict the phoneme of the central frame, as shown in the figure.

image
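
Below is a rough sketch of this kind of concatenation. It is my own illustrative helper rather than the sample code's exact implementation; it assumes the features of one utterance are stored as a (T, 39) tensor and pads the boundaries by repeating the edge frames:

```python
import torch

def concat_frames(feats: torch.Tensor, n_context: int = 5) -> torch.Tensor:
    """Concatenate each frame with its n_context neighbors on each side.

    feats: (T, 39) MFCC features of one utterance.
    Returns: (T, 39 * (2 * n_context + 1)) concatenated features,
    with the original frame sitting in the middle of each window.
    """
    T, dim = feats.shape
    # Repeat the first/last frame so boundary frames get a full context.
    left = feats[:1].repeat(n_context, 1)
    right = feats[-1:].repeat(n_context, 1)
    padded = torch.cat([left, feats, right], dim=0)    # (T + 2*n_context, dim)
    # Stack the 2*n_context + 1 shifted views and flatten them per frame.
    windows = [padded[i : i + T] for i in range(2 * n_context + 1)]
    return torch.cat(windows, dim=1)
```

With n_context = 5 this gives the 11-frame, 429-dim input from the example above; with n_context = 10 it gives the 21-frame, 819-dim input used for the medium baseline below.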

Dataset and Data Format#

image

image

image

Requirements & Hints#

image

Tasks on Kaggle#

Simple#

According to the requirements, simply running the sample code as-is is enough.

Results on the training and validation set:

image

Results on the test set:

image

Medium#

Since the sample code already optimizes with AdamW, which combines RMSProp-style adaptive learning rates, momentum, and weight decay, there is no need to worry about the optimizer for now. Therefore, the main levers are how many frames to concatenate and how to design the hidden layers.
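
For reference, a minimal sketch of that optimizer setup (the tiny stand-in model and the learning-rate / weight-decay values are placeholders, not the sample code's exact settings):

```python
import torch
import torch.nn as nn

# AdamW = Adam-style adaptive step sizes (RMSProp + momentum) with
# decoupled weight decay. The linear layer below just stands in for the
# phoneme classifier so the snippet runs on its own.
model = nn.Linear(39 * 21, 41)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
```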

Set hidden_layers to 3 and hidden_dim to 1024, concatenating 21 frames. The results on the training and validation set are:

image

Results on the test set:

image

Why this design: concatenating a larger number of frames lets each time step see more of its context, which is crucial for sequential tasks like speech recognition, and increasing hidden_layers and hidden_dim gives the model more capacity. This was enough to pass the medium baseline.
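
A minimal sketch of a classifier with these settings, assuming a 21-frame (39 × 21 = 819-dim) input and 41 phoneme classes; this is my own simplified illustration of the shape, not the sample code verbatim:

```python
import torch.nn as nn

concat_nframes = 21                 # frames concatenated per training sample
input_dim = 39 * concat_nframes     # 819
hidden_dim = 1024
num_classes = 41                    # phoneme classes in this task

# Plain feed-forward classifier with 3 hidden layers of width 1024.
model = nn.Sequential(
    nn.Linear(input_dim, hidden_dim), nn.ReLU(),
    nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
    nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
    nn.Linear(hidden_dim, num_classes),
)
```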

Strong#

First, increase hidden_layers to 6 so the model can reach a high train_acc, and add regularization at the same time.

image

Some posts suggest placing ReLU before batch normalization, but I don't think that is right. Many network architectures do the opposite, and there are concrete reasons to put batch normalization before ReLU:

  • The purpose of BN is to standardize the input (mean of 0, variance of 1), while ReLU is a nonlinear activation function. If BN is placed before ReLU, it can ensure a more stable data distribution for ReLU, avoiding gradient vanishing/explosion issues.
  • ReLU's truncation of negative values (outputting 0) may lead to some information loss. If BN is performed first, negative values may be retained (through shifting and scaling), allowing for more flexible ReLU activation.
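
A minimal sketch of a hidden block in this order, i.e. Linear → BatchNorm → ReLU → Dropout (the name BasicBlock follows the sample code, but this exact composition and the dropout rate are my own illustration):

```python
import torch.nn as nn

class BasicBlock(nn.Module):
    """One hidden block: Linear -> BatchNorm -> ReLU -> Dropout."""

    def __init__(self, input_dim: int, output_dim: int, p_drop: float = 0.25):
        super().__init__()
        self.block = nn.Sequential(
            nn.Linear(input_dim, output_dim),
            nn.BatchNorm1d(output_dim),   # normalize before the nonlinearity
            nn.ReLU(),
            nn.Dropout(p_drop),           # regularization
        )

    def forward(self, x):
        return self.block(x)
```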

Specific parameters are:

image

Since num_epoch is fairly large, I increased batch_size to 2048 to speed up training. The underlying trade-off is discussed in Professor Li Hongyi's class:

image
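
The change itself is only the DataLoader argument. A self-contained toy example (the random tensors stand in for the real concatenated features and labels):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Random stand-in data: 10000 samples of 21 concatenated 39-dim frames,
# with labels in [0, 41). A larger batch lets the GPU process more samples
# per step, so each epoch finishes faster.
train_set = TensorDataset(torch.randn(10_000, 39 * 21),
                          torch.randint(0, 41, (10_000,)))
train_loader = DataLoader(train_set, batch_size=2048, shuffle=True)
```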

Results on the training and validation set are:

image

Results on the test set are:

image

Just managed to pass the strong_baseline.

Boss#

Since the boss baseline requires introducing RNNs and other architectures, and my current plan is to get a broad overview before going deep into any one direction, I am setting it aside for now and will revisit it when time permits.

Report Questions#

image

question1#

Only modifying the model structure while keeping other parameters unchanged, the results are:

Narrower and deeper:

image

Wider and shallower:

image

It can be seen that the wider and shallower model performs slightly better with a shorter training time, although the difference is not significant.
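
For illustration, both shapes can be built with one small helper; the widths and depths below are placeholder examples, not the exact configurations from my runs (those are in the screenshots above):

```python
import torch.nn as nn

def mlp(input_dim: int, hidden_dim: int, hidden_layers: int, num_classes: int = 41):
    """Plain ReLU MLP with the given width and depth."""
    layers = [nn.Linear(input_dim, hidden_dim), nn.ReLU()]
    for _ in range(hidden_layers - 1):
        layers += [nn.Linear(hidden_dim, hidden_dim), nn.ReLU()]
    layers.append(nn.Linear(hidden_dim, num_classes))
    return nn.Sequential(*layers)

input_dim = 39 * 21                      # 21 concatenated frames (illustrative)
narrow_deep = mlp(input_dim, 256, 6)     # narrower and deeper (example sizes)
wide_shallow = mlp(input_dim, 1024, 2)   # wider and shallower (example sizes)
```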

question2#

With dropout layers added to the BasicBlock at rates of 0.25, 0.5, and 0.75, and everything else unchanged, the results are:

Dropout rate = 0.25:

image

Dropout rate = 0.5:

image

Dropout rate = 0.75:

image

As the dropout rate increases, the best val_acc decreases. Let's check with a lower dropout rate:

Dropout rate = 0.2:

image

Dropout rate = 0.1:

image

It seems that as the dropout rate increases, val_acc decreases. I suspect this is because the train_acc of the simple sample code is already quite low, i.e., the model is underfitting (high bias). Adding dropout then makes things worse: train_acc drops further, and the reduction in the generalization gap is not enough to raise val_acc. With an appropriate dropout rate and a few more training epochs, the results should improve.
