I just started learning Professor Li Hongyi's deep learning course and I'm recording the process of completing the experiments.
I will try to follow the hints from the course assignments and avoid using tricky techniques, aiming to use standard methods.
The metrics to be achieved are:
- private
- public
Simple
Just run the sample code and submit the results.
Medium
Feature selection
Looking at the training data, you can see that the 0th column is just the sample id, which carries no useful information, so it simply needs to be filtered out.
# Keep every column except column 0 (the id)
feat_idx = list(range(raw_x_train.shape[1]))
feat_idx = feat_idx[1:]
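As a minimal sketch of how the index list is then applied (the array names follow the HW1 sample code and are my assumption, not necessarily the exact script):

x_train = raw_x_train[:, feat_idx]   # drop the id column from the training features
x_valid = raw_x_valid[:, feat_idx]
x_test = raw_x_test[:, feat_idx]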
Result
Surprisingly, this already passes even the strong baseline.
Strong
Since the Medium approach already cleared the strong baseline, there is no need for special adjustments to the model architecture or optimizer; I will keep using the Medium setup.
Boss
I don't want to use too many tricky techniques, so I only made the following adjustments:
- Batch size reduced from 256 to 128 (a suitably small batch size adds a little noise to the gradient estimates, somewhat like annealing, which gives the optimizer a chance to settle into a solution that generalizes better).
- The model architecture was changed to use LeakyReLU activations and dropout (a sketch of the full model class follows after this list). For a walkthrough of common activation functions, see: Understanding Activation Functions (Sigmoid/ReLU/LeakyReLU/PReLU/ELU) - Zhihu

self.layers = nn.Sequential(
    nn.Linear(input_dim, 32),
    nn.LeakyReLU(),
    nn.Linear(32, 64),
    nn.LeakyReLU(),
    nn.Dropout(0.1),   # randomly zero 10% of activations to curb overfitting
    nn.Linear(64, 1)
)
- The optimizer is SGD with momentum and weight decay:

optimizer = torch.optim.SGD(model.parameters(),
                            lr=config['learning_rate'],
                            momentum=0.9,
                            weight_decay=config['weight_decay'])
- Feature selection module, based on sklearn's SelectKBest (how the selected indices feed back into training is sketched after this list):

import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression

features = pd.read_csv('./covid.train.csv')
x_data, y_data = features.iloc[:, 0:117], features.iloc[:, 117]

# Try to choose your k best features
k = 24
selector = SelectKBest(score_func=f_regression, k=k)
result = selector.fit(x_data, y_data)

# result.scores_ holds a score for each feature.
# np.argsort sorts indices by ascending score; reverse it to get descending order.
idx = np.argsort(result.scores_)[::-1]
print(f'Top {k} Best feature score ')
print(result.scores_[idx[:k]])

print(f'\nTop {k} Best feature index ')
print(idx[:k])

print(f'\nTop {k} Best feature name')
print(x_data.columns[idx[:k]])

selected_idx = list(np.sort(idx[:k]))
print(selected_idx)
print(x_data.columns[selected_idx])

With SelectKBest, the scores of the top 24 features are much higher than those of the remaining features, so I set k = 24.
- Data preprocessing: min-max normalization, refer to: How to Understand Normalization? - Zhihu

# Normalization: scale each feature to [0, 1] using the training-set min and max
x_min, x_max = x_train.min(axis=0), x_train.max(axis=0)
x_train = (x_train - x_min) / (x_max - x_min)
x_valid = (x_valid - x_min) / (x_max - x_min)
x_test = (x_test - x_min) / (x_max - x_min)
- Parameter settings (the early-stopping behaviour that 'early_stop' controls is sketched after this list):

config = {
    'seed': 5201314,        # Your seed number, you can pick your lucky number. :)
    'select_all': False,    # Whether to use all features.
    'valid_ratio': 0.2,     # validation_size = train_size * valid_ratio
    'n_epochs': 3000,       # Number of epochs.
    'batch_size': 128,
    'learning_rate': 1e-4,
    'weight_decay': 1e-4,
    'early_stop': 600,      # Stop training if the model has not improved for this many consecutive epochs.
    'save_path': './models/model.ckpt'  # Your model will be saved here.
}
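For completeness, here is a minimal sketch of how the new layer stack sits inside a full model class. The class name, the forward() structure, and the squeeze(1) call follow the HW1 sample code rather than my exact script, so treat them as assumptions:

import torch.nn as nn

class My_Model(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        # Same stack as in the bullet above: two hidden layers with LeakyReLU, plus dropout
        self.layers = nn.Sequential(
            nn.Linear(input_dim, 32),
            nn.LeakyReLU(),
            nn.Linear(32, 64),
            nn.LeakyReLU(),
            nn.Dropout(0.1),
            nn.Linear(64, 1)
        )

    def forward(self, x):
        x = self.layers(x)    # shape (B, 1)
        return x.squeeze(1)   # shape (B,), matching the regression targets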
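To connect the SelectKBest result to training, here is a sketch of an HW1-style select_feat helper; the signature mirrors the sample code, and selected_idx is assumed to be the list computed by the feature-selection snippet above:

def select_feat(train_data, valid_data, test_data, select_all=True):
    # Last column is the target; everything before it is a feature.
    y_train, y_valid = train_data[:, -1], valid_data[:, -1]
    raw_x_train, raw_x_valid, raw_x_test = train_data[:, :-1], valid_data[:, :-1], test_data

    if select_all:
        feat_idx = list(range(raw_x_train.shape[1]))
    else:
        feat_idx = selected_idx  # the k = 24 columns picked by SelectKBest

    return raw_x_train[:, feat_idx], raw_x_valid[:, feat_idx], raw_x_test[:, feat_idx], y_train, y_valid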
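Finally, a minimal sketch of the early-stopping loop that 'n_epochs', 'early_stop', and 'save_path' control. The structure follows the HW1 sample trainer; train_one_epoch and evaluate are hypothetical helpers used only to keep the sketch short:

import math
import torch

best_loss, early_stop_count = math.inf, 0
for epoch in range(config['n_epochs']):
    train_one_epoch(model, train_loader, optimizer)   # hypothetical helper: one pass over the training set
    valid_loss = evaluate(model, valid_loader)        # hypothetical helper: mean MSE on the validation set

    if valid_loss < best_loss:
        best_loss = valid_loss
        torch.save(model.state_dict(), config['save_path'])  # keep the best checkpoint
        early_stop_count = 0
    else:
        early_stop_count += 1

    # Stop once validation loss has not improved for 'early_stop' consecutive epochs
    if early_stop_count >= config['early_stop']:
        print('Model is not improving, stop training.')
        break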
In the end, the score on the private set still falls a bit short, but I've already spent too much time on this, so I'll leave it here for now. I hope I've picked up some standard methods along the way.
To achieve the boss baseline, you can refer to Machine Learning Artisan - CSDN Blog