DOG EMOTION RECOGNITION IN IMAGES USING FINE-TUNED VISION TRANSFORMERS
Keywords: vision transformer, self-attention, fine-tuning, data augmentation

Abstract
Recognizing canine emotions has practical value for veterinary practice, welfare monitoring, and safer human–dog interaction. This work investigates transformer-based image classification for dog emotion recognition and presents a complete pipeline that fine-tunes a ViT-B/16 backbone on a four-class dataset (angry, happy, relaxed, sad). Images are standardized to 224×224 and normalized with ImageNet statistics, with stochastic augmentation (flips, rotations, color jitter, brightness/contrast, and small affine shifts) to improve robustness. A new 4-way classification head is trained with differential learning rates on top of a pretrained ImageNet-21k encoder and optimized with AdamW, class-weighted cross-entropy, warm-up, cosine annealing, early stopping, and checkpointing. Post-processing includes confidence thresholding and optional temporal smoothing for video scenarios. On the held-out test set, the fine-tuned ViT achieves 82.6% accuracy, outperforming a fine-tuned ResNet-50 (75.4%) and a ViT trained from scratch (68.9%). Per-class analysis shows the highest discrimination for "Happy," while "Sad" and "Relaxed" are most frequently confused due to subtle visual overlap. These findings indicate that global self-attention in ViTs captures nuanced cues (e.g., ear position and mouth tension) better than convolutional baselines, and that transfer learning is critical under limited labeled data. The study highlights remaining challenges in cross-breed generalization, viewpoint and lighting variation, and label subjectivity, and points toward multimodal extensions and temporally aware models for further gains.
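The warm-up plus cosine-annealing schedule mentioned in the abstract can be sketched as a plain step-to-learning-rate function; the specific step counts and base rate below are illustrative, not the paper's actual hyperparameters:

```python
import math

def lr_at_step(step, base_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warm-up followed by cosine annealing.

    Step counts and rates are illustrative; the paper's exact values
    are not specified in the abstract.
    """
    if step < warmup_steps:
        # Linear ramp from (base_lr / warmup_steps) up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Cosine decay from base_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

In a PyTorch training loop this would typically be wired in via a `LambdaLR` scheduler or the built-in `CosineAnnealingLR` after a manual warm-up phase.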
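The class-weighted cross-entropy mentioned above counteracts class imbalance by up-weighting rare classes. The abstract does not state the exact weighting scheme; inverse-frequency weighting, shown below, is one common choice:

```python
def inverse_frequency_weights(counts):
    """Per-class weights for weighted cross-entropy.

    Rarer classes receive larger weights; weights are normalized so
    they average to 1. This inverse-frequency scheme is an assumption,
    not necessarily the paper's exact formula.
    """
    total = sum(counts)
    raw = [total / c for c in counts]        # inverse class frequency
    mean = sum(raw) / len(raw)
    return [w / mean for w in raw]           # normalize to mean 1
```

The resulting list can be passed to a loss such as `torch.nn.CrossEntropyLoss(weight=...)` after conversion to a tensor.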
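The post-processing step (confidence thresholding with optional temporal smoothing for video) can be illustrated as a sliding-window average over per-frame class probabilities; the window size and threshold below are placeholder values, not the paper's settings:

```python
def smooth_and_threshold(frame_probs, labels, window=5, threshold=0.6):
    """Temporally smooth per-frame probabilities, then threshold.

    frame_probs: list of per-frame probability vectors (one per class).
    Emits a label only when the smoothed top-class confidence clears
    the threshold; otherwise reports "uncertain". Window and threshold
    are illustrative defaults.
    """
    n_classes = len(labels)
    decisions = []
    for i in range(len(frame_probs)):
        # Trailing window ending at frame i (shorter near the start).
        chunk = frame_probs[max(0, i - window + 1): i + 1]
        avg = [sum(p[c] for p in chunk) / len(chunk) for c in range(n_classes)]
        best = max(range(n_classes), key=lambda c: avg[c])
        decisions.append(labels[best] if avg[best] >= threshold else "uncertain")
    return decisions
```

For single images the same thresholding applies with a window of one frame.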



