ML2022 Spring Homework 1

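I have just started Prof. Hung-yi Lee's deep learning course; this post records how I worked through the assignment.
The approach sticks to the hints in the assignment as far as possible: no tricky techniques, just general-purpose methods.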

Baseline scores to reach on each leaderboard:

private
public

Simple#

Just run the sample code as is.

Submission result

Medium#

Feature selection
Looking at the training data, column 0 is the id, which carries no predictive information, so simply drop it.

feat_idx = list(range(raw_x_train.shape[1]))
feat_idx = feat_idx[1:]
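To make the effect of those two lines concrete, here is a minimal sketch on a toy array. The array values are made up for illustration; only the name `raw_x_train` and the idea of dropping column 0 come from the post.

```python
import numpy as np

# Toy stand-in for raw_x_train: column 0 is the sample id, the rest are features.
raw_x_train = np.array([
    [0, 0.5, 1.2, 3.1],
    [1, 0.7, 1.0, 2.9],
    [2, 0.2, 1.5, 3.3],
])

# Keep every column except column 0 (the id).
feat_idx = list(range(1, raw_x_train.shape[1]))
x_train = raw_x_train[:, feat_idx]

print(feat_idx)        # [1, 2, 3]
print(x_train.shape)   # (3, 3)
```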

Result

Surprisingly, this already beats the strong baseline as well.

Strong#

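Since the Medium approach already beats the Strong baseline, I did not tune the model architecture or optimizer separately here and simply reused the Medium setup.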

Boss#

Still avoiding overly tricky techniques, I made only the following adjustments:

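Batch size reduced from 256 to 128 (a suitably small batch size effectively adds a little noise to the loss, or more precisely to its gradient estimate; this is somewhat like annealing and gives training a chance to converge to a region that generalizes better).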
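The noise argument can be checked numerically. The sketch below is not part of the assignment code: it uses a made-up 1-D regression problem and a hypothetical helper `grad_std` to measure how much the mini-batch gradient fluctuates for batch sizes 256 and 128. Under these assumptions the spread should grow by roughly a factor of sqrt(2) when the batch size is halved.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D regression data: y = 3x + noise (made up for illustration).
n = 10_000
x = rng.normal(size=n)
y = 3.0 * x + rng.normal(scale=1.0, size=n)

w = 0.0  # current weight; the gradient is evaluated at this point

def minibatch_grad(batch_size):
    """Gradient of the mean squared error w.r.t. w on one random mini-batch."""
    idx = rng.choice(n, size=batch_size, replace=False)
    residual = w * x[idx] - y[idx]
    return np.mean(2.0 * residual * x[idx])

def grad_std(batch_size, n_trials=2000):
    """Spread of the mini-batch gradient over many random batches."""
    return np.std([minibatch_grad(batch_size) for _ in range(n_trials)])

std_256 = grad_std(256)
std_128 = grad_std(128)
print(f"std of gradient, batch 256: {std_256:.3f}")
print(f"std of gradient, batch 128: {std_128:.3f}")
print(f"ratio: {std_128 / std_256:.2f}")
```

Whether the extra noise actually helps generalization on this dataset is an empirical question; the sketch only shows that the noise is there.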

Model architecture: added LeakyReLU and Dropout. For background on the activation functions, see the Zhihu article 一文搞懂激活函數 (Sigmoid/ReLU/LeakyReLU/PReLU/ELU).

self.layers = nn.Sequential(
    nn.Linear(input_dim, 32),
    nn.LeakyReLU(),
    nn.Linear(32, 64),
    nn.LeakyReLU(),
    nn.Dropout(0.1),
    nn.Linear(64, 1)
)
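For context, the snippet above is only the layer definition. A minimal sketch of how it might sit inside a full module is shown below; the class name `BossModel`, the `forward` method, and the random input are assumptions for illustration, with `input_dim=24` chosen to match the 24 selected features.

```python
import torch
import torch.nn as nn

class BossModel(nn.Module):  # hypothetical class name
    def __init__(self, input_dim):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(input_dim, 32),
            nn.LeakyReLU(),
            nn.Linear(32, 64),
            nn.LeakyReLU(),
            nn.Dropout(0.1),
            nn.Linear(64, 1),
        )

    def forward(self, x):
        # (batch, 1) -> (batch,) so the output lines up with a 1-D target.
        return self.layers(x).squeeze(1)

model = BossModel(input_dim=24)
model.eval()  # dropout is disabled in eval mode
with torch.no_grad():
    out = model(torch.randn(8, 24))  # random stand-in for a batch of 8 samples
print(out.shape)
```

Remember to call `model.train()` during training and `model.eval()` during validation and testing, since Dropout behaves differently in the two modes.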

Optimizer:

optimizer = torch.optim.SGD(model.parameters(), lr=config['learning_rate'], momentum=0.9, weight_decay=config['weight_decay'])
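To see what `momentum` and `weight_decay` contribute, the sketch below runs two SGD steps on a single parameter with loss p² and compares the result against a hand-written replica of the update rule as I understand it from the PyTorch documentation (default `dampening=0`, no Nesterov). The hyperparameter values here are chosen for easy arithmetic and are not the ones used in the post.

```python
import torch

lr, momentum, weight_decay = 0.1, 0.9, 0.1  # illustrative values only

p = torch.nn.Parameter(torch.tensor([1.0]))
opt = torch.optim.SGD([p], lr=lr, momentum=momentum, weight_decay=weight_decay)

# Manual replica of the same rule on a plain Python float.
p_manual, buf = 1.0, None

for _ in range(2):
    # PyTorch step
    opt.zero_grad()
    loss = (p ** 2).sum()
    loss.backward()
    opt.step()

    # Manual step: weight decay is added to the gradient first,
    # then the momentum buffer accumulates it.
    grad = 2.0 * p_manual + weight_decay * p_manual
    buf = grad if buf is None else momentum * buf + grad
    p_manual = p_manual - lr * buf

print(p.item(), p_manual)
```

In this formulation weight decay is an L2 penalty folded into the gradient, so it also passes through the momentum buffer.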

Feature selection module

import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

features = pd.read_csv('./covid.train.csv')
x_data, y_data = features.iloc[:, 0:117], features.iloc[:, 117]

# Choose how many of the best-scoring features to keep
k = 24
selector = SelectKBest(score_func=f_regression, k=k)
result = selector.fit(x_data, y_data)

# result.scores_ holds one score per feature.
# np.argsort returns indices in ascending score order; reverse it to get descending order.
idx = np.argsort(result.scores_)[::-1]
print(f'Top {k} Best feature score ')
print(result.scores_[idx[:k]])

print(f'\nTop {k} Best feature index ')
print(idx[:k])

print(f'\nTop {k} Best feature name')
print(x_data.columns[idx[:k]])

selected_idx = list(np.sort(idx[:k]))
print(selected_idx)
print(x_data.columns[selected_idx])

With SelectKBest, the top 24 features score far higher than all the remaining ones, so k = 24 is used.
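One way to locate such a gap without eyeballing the printed scores is to sort them and look for the largest ratio between neighbours. The sketch below is an assumption about how this could be automated, not what was done in the post; the `scores` array is hypothetical and stands in for `result.scores_`.

```python
import numpy as np

# Hypothetical scores: four strong features, then a sharp drop.
scores = np.array([950.0, 12.0, 880.0, 8.0, 910.0, 15.0, 5.0, 870.0])

order = np.argsort(scores)[::-1]   # feature indices, best score first
sorted_scores = scores[order]

# Ratio between each score and the next lower one; the biggest ratio marks the gap.
drop = sorted_scores[:-1] / sorted_scores[1:]
k = int(np.argmax(drop)) + 1

selected_idx = sorted(order[:k].tolist())
print(k)             # 4
print(selected_idx)  # [0, 2, 4, 7]
```

This heuristic assumes all scores are positive and that there is a single dominant gap; it is worth checking the sorted scores by eye as well.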

Data preprocessing: min-max normalization. Reference: the Zhihu question 如何理解歸一化（normalization）?

# Normalization
x_min, x_max = x_train.min(axis=0), x_train.max(axis=0)
x_train = (x_train - x_min) / (x_max - x_min)
x_valid = (x_valid - x_min) / (x_max - x_min)
x_test = (x_test - x_min) / (x_max - x_min)
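Two details of this normalization are easy to miss: the statistics come from the training set only, so validation and test values can land outside [0, 1], and a constant column would cause a division by zero. The sketch below illustrates both on made-up arrays; the `scale` guard is an addition for illustration and is not in the original code.

```python
import numpy as np

# Made-up data; the third column is constant in the training set.
x_train = np.array([[1.0, 10.0, 5.0],
                    [3.0, 20.0, 5.0],
                    [2.0, 40.0, 5.0]])
x_valid = np.array([[4.0, 25.0, 5.0]])

# Statistics are computed on the training set only.
x_min, x_max = x_train.min(axis=0), x_train.max(axis=0)

scale = x_max - x_min
scale[scale == 0] = 1.0  # guard: a constant column would otherwise divide by zero

x_train_n = (x_train - x_min) / scale
x_valid_n = (x_valid - x_min) / scale

print(x_train_n)
print(x_valid_n)  # [[1.5 0.5 0. ]]  (first value exceeds 1: it lies outside the training range)
```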

Hyperparameter settings

config = {
    'seed': 5201314,      # Your seed number, you can pick your lucky number. :)
    'select_all': False,   # Whether to use all features.
    'valid_ratio': 0.2,   # validation_size = train_size * valid_ratio
    'n_epochs': 3000,     # Number of epochs.
    'batch_size': 128,
    'learning_rate': 1e-4,
    'weight_decay': 1e-4,
    'early_stop': 600,    # If model has not improved for this many consecutive epochs, stop training.
    'save_path': './models/model.ckpt'  # Your model will be saved here.
}
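The `early_stop` entry is a patience counter: training stops once the validation loss has failed to improve for that many consecutive epochs. A minimal sketch of that logic follows, with a hypothetical function `train_with_early_stop` and a made-up list of validation losses replacing the real training loop.

```python
def train_with_early_stop(valid_losses, n_epochs, early_stop):
    """Return (epochs actually run, best validation loss seen)."""
    best_loss = float('inf')
    stale = 0  # consecutive epochs without improvement
    for epoch in range(n_epochs):
        loss = valid_losses[epoch]  # stand-in for one epoch of train + validate
        if loss < best_loss:
            best_loss = loss
            stale = 0  # a real loop would save the checkpoint here
        else:
            stale += 1
        if stale >= early_stop:
            return epoch + 1, best_loss
    return n_epochs, best_loss

# Made-up validation losses; patience of 3 instead of 600 to keep the example short.
losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.9, 0.6]
epochs_run, best = train_with_early_stop(losses, n_epochs=len(losses), early_stop=3)
print(epochs_run, best)  # 6 0.7
```

Note that training stops before the later 0.6 is ever reached, which is why the post pairs a small learning rate with a generous patience of 600.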

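In the end the private score still falls slightly short of the boss baseline, but this has already taken too long, so I will leave it here for now. Hopefully the general-purpose methods picked up along the way carry over.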

To actually reach the boss baseline, see the CSDN blog of 機器學習手藝人.