Introduction#
This homework is an image classification task on the food11 dataset. The specific task requirements are as follows:
The baselines to reach and the corresponding hints are as follows:
Simple#
For this baseline you only need to run the sample code as provided; the results are as follows:
Medium#
Perform Training Augmentation and extend the training time (with a larger n_epoch).
The specific operations for Training Augmentation are as follows:
train_tfm = transforms.Compose([
    # Resize the image into a fixed shape (height = width = 128)
    transforms.Resize((128, 128)),
    # You may add some transforms here.
    transforms.RandomChoice([
        transforms.RandomRotation((-30, 30)),
        transforms.RandomHorizontalFlip(p=0.5),
        transforms.RandomVerticalFlip(p=0.5),
        transforms.ColorJitter(brightness=(0.5, 1.5), contrast=(0.5, 1.5), saturation=(0.5, 1.5), hue=(-0.25, 0.25)),
        transforms.RandomInvert(p=0.5),
        transforms.RandomAffine(degrees=(-30, 30), translate=(0.1, 0.1), scale=(0.8, 1.2), shear=(-30, 30)),
        transforms.Grayscale(num_output_channels=3),
    ]),
    # ToTensor() should be the last one of the transforms.
    transforms.ToTensor(),
])
Each transform is described below; a quick sanity check of the pipeline follows the list.
- RandomRotation((-30,30)) randomly rotates the image. (-30, 30): the range of rotation angles (randomly selected between -30 degrees and +30 degrees).
- RandomHorizontalFlip(p=0.5) horizontally flips the image with a probability of 50%. p=0.5: execution probability (0.5 means 50%).
- RandomVerticalFlip(p=0.5) vertically flips the image with a probability of 50%. The meaning of p is the same as above.
- ColorJitter(brightness=(0.5,1.5), contrast=(0.5, 1.5), saturation=(0.5,1.5), hue=(-0.25, 0.25)) randomly adjusts color properties (brightness, contrast, saturation, hue). brightness=(0.5, 1.5): brightness scaling range (0.5x to 1.5x). contrast=(0.5, 1.5): contrast adjustment range (0.5x to 1.5x). saturation=(0.5, 1.5): saturation adjustment range (0.5x to 1.5x). hue=(-0.25, 0.25): hue shift range (-0.25 to +0.25, corresponding to -90 degrees to +90 degrees on the hue circle).
- RandomInvert(p=0.5) inverts colors with a probability of 50% (color inversion, such as black to white, red to cyan). The meaning of p is the same as above.
- RandomAffine(degrees=(-30,30), translate=(0.1, 0.1), scale=(0.8, 1.2), shear=(-30, 30)) performs random affine transformations (rotation, translation, scaling, shearing). degrees=(-30, 30): the range of rotation angles. translate=(0.1, 0.1): maximum translation ratio in horizontal and vertical directions (10% of the image size). scale=(0.8, 1.2): scaling range (0.8x to 1.2x). shear=(-30, 30): the range of shear angles (-30 degrees to +30 degrees). Shearing is a linear geometric transformation that simulates the effect of "slant deformation" by tilting a part of the image.
- Grayscale(num_output_channels=3) converts the image to grayscale but retains 3 channels (RGB format, with the same value in each channel). num_output_channels=3: the number of output channels (3 means generating a 3-channel grayscale image, compatible with model input). Converting a color image (RGB) to grayscale essentially merges the brightness information of the three channels through weighted averaging to generate a single-channel image.
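Below is that sanity check: a minimal sketch (not part of the homework code; the randomly generated image is purely illustrative) that applies train_tfm to a dummy PIL image and prints the resulting tensor shape and dtype:

# Minimal sanity check of train_tfm (illustrative only; assumes the transforms above are defined).
from PIL import Image
import numpy as np

dummy = Image.fromarray(np.random.randint(0, 256, (300, 400, 3), dtype=np.uint8))
augmented = train_tfm(dummy)
print(augmented.shape, augmented.dtype)  # expected: torch.Size([3, 128, 128]) torch.float32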
The training time is extended to 90 epochs, and the results are as follows:
It is worth noting that GPU utilization turned out to be very low. The bottleneck was most likely data loading: the augmentation transforms plus disk I/O made each batch slow to produce, so the DataLoader's parallelism needed to be increased.
Here, 12 worker processes were used for the training loader (the right number depends on how heavy the image augmentation is), with persistent_workers=True to avoid the overhead of repeatedly creating and destroying worker processes; 8 workers were used for testing. After these changes, the time per training epoch dropped substantially. This was run locally on an RTX 4070 Laptop GPU.
The batch size was also increased to 128.
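The following is a minimal sketch of these DataLoader settings (variable names such as train_set and test_set follow the sample code and are assumptions here; the values mirror the description above):

from torch.utils.data import DataLoader

batch_size = 128  # increased from the sample code's default

train_loader = DataLoader(
    train_set,                # FoodDataset built with train_tfm
    batch_size=batch_size,
    shuffle=True,
    num_workers=12,           # parallel workers for augmentation and disk I/O
    pin_memory=True,          # speeds up host-to-GPU transfer
    persistent_workers=True,  # keep workers alive across epochs
)

test_loader = DataLoader(
    test_set,                 # FoodDataset built with test_tfm
    batch_size=batch_size,
    shuffle=False,
    num_workers=8,
    pin_memory=True,
    persistent_workers=True,
)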
Strong#
First, the model structure was modified. I studied ResNet-18 and ResNet-34 and made slight modifications based on ResNet-34:
class BasicBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, stride, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, 1, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        self.downsample = None
        if stride != 1 or in_channels != out_channels:
            self.downsample = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride, 0),
                nn.BatchNorm2d(out_channels)
            )

    def forward(self, x):
        identity = x
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)
        out = self.conv2(out)
        out = self.bn2(out)
        if self.downsample is not None:
            identity = self.downsample(identity)
        out += identity
        out = self.relu(out)
        return out


class Classifier(nn.Module):
    def __init__(self):
        super(Classifier, self).__init__()
        # torch.nn.Conv2d(in_channels, out_channels, kernel_size, stride, padding)
        # torch.nn.MaxPool2d(kernel_size, stride, padding)
        # input dimension [3, 128, 128]
        # Initial convolution layer
        self.conv1 = nn.Conv2d(3, 64, kernel_size=5, stride=2, padding=2, bias=False)
        self.bn1 = nn.BatchNorm2d(64)
        self.relu = nn.ReLU(inplace=True)
        self.block1 = BasicBlock(64, 64, 1)     # [64, 64, 64]
        self.block2 = BasicBlock(64, 64)        # [64, 64, 64]
        self.block3 = BasicBlock(64, 64)        # [64, 64, 64]
        self.block4 = BasicBlock(64, 128, 2)    # [128, 32, 32]
        self.block5 = BasicBlock(128, 128)      # [128, 32, 32]
        self.block6 = BasicBlock(128, 128)      # [128, 32, 32]
        self.block7 = BasicBlock(128, 128)      # [128, 32, 32]
        self.block8 = BasicBlock(128, 256, 2)   # [256, 16, 16]
        self.block9 = BasicBlock(256, 256)      # [256, 16, 16]
        self.block10 = BasicBlock(256, 256)     # [256, 16, 16]
        self.block11 = BasicBlock(256, 256)     # [256, 16, 16]
        self.block12 = BasicBlock(256, 256)     # [256, 16, 16]
        self.block13 = BasicBlock(256, 256)     # [256, 16, 16]
        self.block14 = BasicBlock(256, 512, 2)  # [512, 8, 8]
        self.block15 = BasicBlock(512, 512)     # [512, 8, 8]
        self.block16 = BasicBlock(512, 512)     # [512, 8, 8]
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(512, 11)

    def forward(self, x):
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)
        out = self.block1(out)
        out = self.block2(out)
        out = self.block3(out)
        out = self.block4(out)
        out = self.block5(out)
        out = self.block6(out)
        out = self.block7(out)
        out = self.block8(out)
        out = self.block9(out)
        out = self.block10(out)
        out = self.block11(out)
        out = self.block12(out)
        out = self.block13(out)
        out = self.block14(out)
        out = self.block15(out)
        out = self.block16(out)
        out = self.avgpool(out)
        out = out.view(out.size()[0], -1)
        return self.fc(out)
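A quick dummy forward pass (a sketch for illustration, not part of the training script) is an easy way to confirm that a 128x128 input maps to 11 logits and that the parameter count is roughly at ResNet-34 scale:

import torch

# Verify output shape and rough parameter count with a dummy batch (illustrative only).
model = Classifier()
x = torch.randn(2, 3, 128, 128)   # a dummy batch of two 128x128 RGB images
print(model(x).shape)             # expected: torch.Size([2, 11])
print(sum(p.numel() for p in model.parameters()) / 1e6, "M parameters")  # roughly ResNet-34 scale (~21M)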
Then cross-validation and ensembling were performed.
For cross-validation, five-fold cross-validation was used; the original training and validation sets were merged and then split into the five folds:
# "cuda" only when GPUs are available.
device = "cuda" if torch.cuda.is_available() else "cpu"
# The number of training epochs and patience.
n_epochs = 200
patience = 50 # If no improvement in 'patience' epochs, early stop
import numpy as np
from sklearn.model_selection import KFold
from torch.utils.tensorboard import SummaryWriter
import datetime
# Initialize 5-fold cross-validation
n_folds = 5
kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)
# Load the complete training set (for cross-validation)
train_set = FoodDataset(os.path.join(_dataset_dir, "training"), tfm=train_tfm)
valid_set = FoodDataset(os.path.join(_dataset_dir, "validation"), tfm=train_tfm)
# Combine datasets
combined_files = train_set.files + valid_set.files
full_dataset = FoodDataset(path="", tfm=train_tfm, files=combined_files)
oof_preds = np.zeros(len(full_dataset)) # Store OOF prediction results
oof_labels = np.zeros(len(full_dataset)) # Store true labels
# Store all base models (for later ensemble)
base_models = []
timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
log_dir = f"runs/food_classification_{timestamp}"
writer = SummaryWriter()
for fold, (train_idx, val_idx) in enumerate(kf.split(train_set)):
print(f"\n====== Fold {fold+1}/{n_folds} ======")
# Split training and validation subsets
train_subset = Subset(train_set, train_idx)
val_subset = Subset(train_set, val_idx)
# DataLoader
train_loader = DataLoader(
train_subset,
batch_size=batch_size,
shuffle=True,
num_workers=12,
pin_memory=True,
persistent_workers=True
)
val_loader = DataLoader(
val_subset,
batch_size=batch_size,
shuffle=False,
num_workers=8,
pin_memory=True,
persistent_workers=True
)
# Independently initialize model and optimizer for each fold
model = Classifier().to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.0003, weight_decay=1e-5)
criterion = nn.CrossEntropyLoss()
# Early stopping related variables (independent for each fold)
fold_best_acc = 0
stale = 0
# Training loop (maintaining original logic)
for epoch in range(n_epochs):
# ---------- Training ----------
model.train()
train_loss, train_accs = [], []
for batch in tqdm(train_loader, desc=f"Epoch {epoch+1}"):
imgs, labels = batch
imgs, labels = imgs.to(device), labels.to(device)
logits = model(imgs)
loss = criterion(logits, labels)
optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=10)
optimizer.step()
acc = (logits.argmax(dim=-1) == labels).float().mean()
train_loss.append(loss.item())
train_accs.append(acc.item())
# Print training information
avg_loss = np.mean(train_loss)
avg_acc = np.mean(train_accs)
# Write to TensorBoard
writer.add_scalar(f'Fold_{fold}/Train/Loss', avg_loss, epoch)
writer.add_scalar(f'Fold_{fold}/Train/Accuracy', avg_acc, epoch)
print(f"[ Train | {epoch+1:03d}/{n_epochs:03d} ] loss = {avg_loss:.5f}, acc = {avg_acc:.5f}")
# ---------- Validation ----------
model.eval()
val_loss, val_accs, val_preds = [], [], []
val_labels = [] # Accumulate all validation batch labels
for batch in tqdm(val_loader, desc="Validating"):
imgs, labels = batch
imgs = imgs.to(device)
labels_np = labels.numpy()
val_labels.extend(labels_np) # Accumulate labels
with torch.no_grad():
logits = model(imgs)
preds = logits.argmax(dim=-1).cpu().numpy()
loss = criterion(logits, labels.to(device))
val_loss.append(loss.item())
val_accs.append((preds == labels_np).mean())
val_preds.extend(preds)
# Record OOF predictions and labels
oof_preds[val_idx] = np.array(val_preds)
oof_labels[val_idx] = np.array(val_labels)
# Print validation information
avg_val_loss = np.mean(val_loss)
avg_val_acc = np.mean(val_accs)
# Write to TensorBoard
writer.add_scalar(f'Fold_{fold}/Val/Loss', avg_val_loss, epoch)
writer.add_scalar(f'Fold_{fold}/Val/Accuracy', avg_val_acc, epoch)
print(f"[ Valid | {epoch+1:03d}/{n_epochs:03d} ] loss = {avg_val_loss:.5f}, acc = {avg_val_acc:.5f}")
# Early stopping logic (independent for each fold)
if avg_val_acc > fold_best_acc:
print(f"Fold {fold} best model at epoch {epoch}")
torch.save(model.state_dict(), f"fold{fold}_best.ckpt")
fold_best_acc = avg_val_acc
stale = 0
else:
stale += 1
if stale > patience:
print(f"Early stopping at epoch {epoch}")
break
# Save the current fold's model
base_models.append(model)
# Close the TensorBoard writer
writer.close()
# ---------- Post-processing ----------
# Calculate OOF accuracy
oof_acc = (oof_preds == oof_labels).mean()
print(f"\n[OOF Accuracy] {oof_acc:.4f}")
After saving five base models, ensemble predictions were used in the test section:
# Ensemble prediction (soft voting method)
all_preds = []
for model in base_models:
    model.eval()
    fold_preds = []
    for data, _ in test_loader:  # Keep consistent with the original test_loader
        with torch.no_grad():
            logits = model(data.to(device))
        # Save each model's raw logits instead of taking argmax directly
        fold_preds.append(logits.cpu().numpy())
    # Combine all batch prediction results of the current model
    fold_preds = np.concatenate(fold_preds, axis=0)
    all_preds.append(fold_preds)

# Soft voting: average the logits of all models and take argmax
all_preds = np.stack(all_preds)                     # shape: (n_models, n_samples, n_classes)
prediction = all_preds.mean(axis=0).argmax(axis=1)  # shape: (n_samples,)
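Averaging raw logits already works as soft voting, but a common variant averages softmax probabilities so that each model contributes on a comparable scale. A minimal alternative for the averaging step, assuming all_preds holds the stacked logits from above:

# Alternative soft voting: average softmax probabilities instead of raw logits.
exp = np.exp(all_preds - all_preds.max(axis=-1, keepdims=True))  # numerically stable softmax
probs = exp / exp.sum(axis=-1, keepdims=True)                    # shape: (n_models, n_samples, n_classes)
prediction = probs.mean(axis=0).argmax(axis=1)                   # shape: (n_samples,)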
The final results are as follows:
The result is already very close to the boss baseline.
Due to time constraints, I did not attempt the boss baseline; I will come back to it when I have time.
The two report problems concern data augmentation and the design of the residual network, both of which were already discussed while working through the medium and strong baselines, so I will not elaborate on them further.