SEMANTIC CLUSTERING AND MULTI-MODEL INTEGRATION FOR EFFICIENT AUDIO CAPTIONING
Keywords: audio captioning, large language models, Whisper-small, CoNeTTE

Abstract
Audio captioning models bridge the gap between acoustic information and human language, making digital audio content more accessible and searchable across diverse applications. Such systems are increasingly vital in assistive technologies, content management, and surveillance, where automated understanding of soundscapes is required. This research introduces a hybrid captioning methodology that boosts the capabilities of resource-efficient models without substantially increasing their computational footprint. The proposed approach combines the strengths of two lightweight audio captioning systems (Whisper-small and CoNeTTE) through a multi-stage pipeline: initial caption generation, semantic phrase extraction, clustering of related concepts, selection of optimal phrases, and coherent text assembly. This technique produces richer, more detailed captions by combining the strongest elements of each model's output. Evaluation on the Clotho dataset revealed significant performance improvements, with the hybrid system surpassing both individual models by substantial margins across all evaluation metrics: on average, 28.4% over Whisper-small and 34.3% over CoNeTTE. Particularly large gains were observed in METEOR (48.4%) and SPICE (40.4%), highlighting the hybrid system's superior semantic accuracy and closer alignment with human-generated descriptions. These findings support our initial hypothesis that different architectures capture complementary aspects of audio content, with Whisper-small excelling in precision and CoNeTTE in semantic comprehension. Future work includes expanding the framework with additional specialized models and refining the semantic clustering with adaptive thresholds.
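The five-stage pipeline summarized above (caption generation, phrase extraction, clustering, selection, assembly) can be sketched in miniature as follows. This is an illustrative assumption, not the paper's actual implementation: the conjunction-based phrase splitting, Jaccard word-overlap similarity, greedy clustering, and longest-phrase selection are simple stand-ins for the real model outputs and semantic similarity measure.

```python
import re

def extract_phrases(caption):
    # Split a caption into candidate phrases on commas and common conjunctions
    # (a crude stand-in for the semantic phrase extraction stage).
    parts = re.split(r",| and | while ", caption.lower())
    return [p.strip() for p in parts if p.strip()]

def jaccard(a, b):
    # Word-overlap similarity between two phrases (stand-in for a learned
    # semantic similarity).
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def cluster_phrases(phrases, threshold=0.3):
    # Greedy clustering: a phrase joins the first cluster whose representative
    # (first member) is similar enough; otherwise it starts a new cluster.
    clusters = []
    for p in phrases:
        for c in clusters:
            if jaccard(p, c[0]) >= threshold:
                c.append(p)
                break
        else:
            clusters.append([p])
    return clusters

def merge_captions(caption_a, caption_b, threshold=0.3):
    # Pipeline: extract phrases from both model outputs, cluster related
    # concepts, keep the most detailed (here: longest) phrase per cluster,
    # then assemble the merged caption.
    phrases = extract_phrases(caption_a) + extract_phrases(caption_b)
    clusters = cluster_phrases(phrases, threshold)
    best = [max(c, key=len) for c in clusters]
    return ", ".join(best)

# Hypothetical outputs from the two models for the same clip:
whisper_out = "rain falls on a roof and thunder rumbles"
conette_out = "heavy rain falling on a metal roof, distant thunder"
print(merge_captions(whisper_out, conette_out))
```

With these toy inputs, the two rain phrases and the two thunder phrases each collapse into one cluster, and the merged caption keeps the more detailed phrase from each; the fixed threshold here is exactly the parameter that the adaptive-threshold extension mentioned above would replace.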



