ARMUS: A HIGH-QUALITY MULTISPEAKER ARMENIAN SPEECH CORPUS FOR SPEECH SYNTHESIS
Keywords:
Text-to-Speech (TTS), Armenian language, speech dataset, audio segmentation, silence detection, speech synthesis
Abstract
The development of Text-to-Speech (TTS) systems requires high-quality speech datasets, which are particularly scarce for under-resourced languages like Armenian. This paper presents the development and implementation of an automated system for creating speech datasets from audiobooks and their corresponding text files, designed specifically for Armenian TTS applications. The system employs intelligent audio segmentation based on silence detection, text alignment mechanisms, and automated quality assessment protocols. Using this automated approach, a comprehensive Armenian speech dataset containing 14,182 audio segments was created with a total duration of 75,597.79 seconds (approximately 21 hours), sourced from professional audiobook recordings. The dataset includes recordings from two male speakers and covers 14,078 unique sentences containing 137,716 words with 30,466 unique vocabulary items. Audio files are standardized to a 22,050 Hz sampling rate, 16-bit depth, and mono format to ensure consistency.
Quantitative analysis reveals that segment durations follow a natural distribution concentrated between 2 and 6 seconds, with an average duration of 5.33 seconds per segment. Phoneme distribution analysis demonstrates comprehensive coverage of the Armenian phonological system, following expected linguistic patterns. Quality assessment shows signal-to-noise ratios exceeding 35 dB across all segments, with 94.3% of randomly sampled segments meeting predefined quality criteria. The created dataset significantly exceeds existing Armenian speech resources in both volume and quality, providing a valuable foundation for Armenian TTS system development and other speech processing applications.
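To make the silence-based segmentation step concrete, the following is a minimal sketch of one common approach, an energy threshold applied to fixed-length frames. It is not the system's actual implementation: the function names (`frame_rms_db`, `split_on_silence`), the 20 ms frame length, the -40 dB threshold, and the 300 ms minimum pause are all illustrative assumptions, as the abstract does not state these values. Samples are assumed to be floats in the range [-1, 1].

```python
import math


def frame_rms_db(frame):
    """RMS level of one frame in dB relative to full scale (samples in [-1, 1])."""
    if not frame:
        return -math.inf
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    return 20.0 * math.log10(rms) if rms > 0 else -math.inf


def split_on_silence(samples, sample_rate=22050, frame_ms=20,
                     silence_db=-40.0, min_silence_ms=300):
    """Return (start_sample, end_sample) pairs for speech regions.

    A region ends where at least `min_silence_ms` of consecutive frames
    fall below `silence_db`. All thresholds are hypothetical defaults.
    """
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    min_silent_frames = max(1, min_silence_ms // frame_ms)
    n_frames = len(samples) // frame_len  # a trailing partial frame is ignored

    # Classify every frame as silent or not.
    silent = [
        frame_rms_db(samples[i * frame_len:(i + 1) * frame_len]) < silence_db
        for i in range(n_frames)
    ]

    segments = []
    seg_start = None  # index of the first frame of the open speech segment
    silent_run = 0    # number of consecutive silent frames seen so far
    for i, is_silent in enumerate(silent):
        if is_silent:
            silent_run += 1
            if seg_start is not None and silent_run >= min_silent_frames:
                # The pause is long enough: close the segment where it began.
                seg_end = i - silent_run + 1
                segments.append((seg_start * frame_len, seg_end * frame_len))
                seg_start = None
        else:
            if seg_start is None:
                seg_start = i
            silent_run = 0  # short pauses stay inside the current segment

    if seg_start is not None:
        # Close the last segment, trimming any short trailing silence.
        seg_end = n_frames - silent_run
        segments.append((seg_start * frame_len, seg_end * frame_len))
    return segments
```

In this sketch, pauses shorter than the minimum stay inside a segment, so a brief hesitation within a sentence does not split it. The returned sample offsets could then be used to cut the audio and to pair each piece with its sentence during text alignment.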



