<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width" />
<title>IqraEval.2 Challenge Interspeech 2026</title>
<style>
:root {
--navy-blue: #001f4d;
--coral: #ff6f61;
--light-gray: #f5f7fa;
--text-dark: #222;
}
body {
font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
background-color: var(--light-gray);
color: var(--text-dark);
margin: 20px;
line-height: 1.6;
}
h1, h2, h3 {
color: var(--navy-blue);
font-weight: 700;
margin-top: 1.2em;
}
h1 {
text-align: center;
font-size: 2.8rem;
margin-bottom: 0.3em;
}
h2 {
border-bottom: 3px solid var(--coral);
padding-bottom: 0.3em;
}
h3 {
color: var(--coral);
margin-top: 1em;
}
p, ul, pre, ol {
max-width: 900px;
margin: 0.8em auto;
}
ul, ol { padding-left: 1.2em; }
ul li, ol li { margin: 0.4em 0; }
code {
background-color: #eef4f8;
color: var(--navy-blue);
padding: 2px 6px;
border-radius: 4px;
font-family: Consolas, monospace;
font-size: 0.9em;
}
pre {
background-color: #eef4f8;
padding: 1em;
border-radius: 8px;
overflow-x: auto;
font-size: 0.95em;
}
a {
color: var(--coral);
text-decoration: none;
}
a:hover { text-decoration: underline; }
.card {
max-width: 960px;
background: white;
margin: 0 auto 40px;
padding: 2em 2.5em;
box-shadow: 0 4px 14px rgba(0,0,0,0.1);
border-radius: 12px;
}
img {
display: block;
margin: 20px auto;
max-width: 100%;
height: auto;
border-radius: 8px;
box-shadow: 0 4px 8px rgba(0,31,77,0.15);
}
.centered p {
text-align: center;
font-style: italic;
color: var(--navy-blue);
margin-top: 0.4em;
}
.highlight {
color: var(--coral);
font-weight: 700;
}
/* nested lists in paragraphs */
p > ul { margin-top: 0.3em; }
</style>
</head>
<body>
<div class="card">
<h1>IqraEval.2 Challenge Interspeech 2026</h1>
<img src="IqraEval.png" alt="Interspeech 2026 Challenge Logo" />
<h2>Overview</h2>
<p>
The <strong>IqraEval.2 Challenge at Interspeech 2026</strong> is a shared task aimed at advancing the <strong>automatic assessment of Modern Standard Arabic (MSA) pronunciation</strong> through computational methods that detect and diagnose pronunciation errors. The focus on MSA provides a standardized, well-defined context for evaluating Arabic pronunciation.
</p>
<p>
Participants will develop systems capable of detecting mispronunciations (e.g., substitution, deletion, or insertion of phonemes).
</p>
<h2>Timeline</h2>
<ul>
<li><strong>1 December 2025</strong>: Registration opens</li>
<li><strong>15 December 2025</strong>: Release of training data, evaluation set, Arabic phoneme set, and phonemiser</li>
<li><strong>15 February 2026</strong>: Registration closes; leaderboard frozen</li>
<li><strong>17 February 2026</strong>: Results announced</li>
<li><strong>25 February 2026</strong>: Challenge paper submission deadline</li>
</ul>
<h2>Task Description: MSA Mispronunciation Detection System</h2>
<p>
Design a model to detect and provide detailed feedback on mispronunciations in MSA speech. Users read vowelized sentences; the model predicts the spoken phoneme sequence and flags deviations. Evaluation is on the <strong>MSA-Test</strong> dataset with human‐annotated errors.
</p>
<div class="centered">
<img src="task.png" alt="System Overview" />
<p>Figure: Overview of the Mispronunciation Detection Workflow</p>
</div>
<h3>1. Read the Sentence</h3>
<p>
The system displays a <strong>Reference Sentence</strong> together with its <strong>Reference Phoneme Sequence</strong>.
</p>
<p><strong>Example:</strong></p>
<ul>
<li><strong>Arabic:</strong> يَتَحَدَّثُ النَّاسُ اللُّغَةَ الْعَرَبِيَّةَ</li>
<li>
<strong>Phoneme:</strong>
<code>&lt; y a t a H a d d a v u n n aa s u l l u g h a t a l E a r a b i y y a t a</code>
</li>
</ul>
<h3>2. Save Recording</h3>
<p>
The user reads the sentence aloud; the system captures and stores the audio waveform.
</p>
<h3>3. Mispronunciation Detection</h3>
<p>
The model predicts the phoneme sequence that was actually spoken; deviations from the reference indicate mispronunciations.
</p>
<p><strong>Example of Mispronunciation:</strong></p>
<ul>
<li><strong>Reference:</strong> <code>&lt; y a t a H a d d a v u n n aa s u l l u g h a t a l E a r a b i y y a t a</code></li>
<li><strong>Predicted:</strong> <code>&lt; y a t a H a d d a <span class="highlight">s</span> u n n aa s u l l u g h a t <span class="highlight">u</span> l E a r a b i y y a t a</code></li>
</ul>
<p>
Here, <code>v</code>→<code>s</code> is a phoneme substitution, a common pronunciation error; the predicted sequence also contains a second substitution, <code>a</code>→<code>u</code>, in the vowel before <code>l E</code> (both are highlighted above).
</p>
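<p>
To make this comparison concrete, the reference and predicted sequences can be aligned and their differences listed. The snippet below is a minimal sketch using Python's <code>difflib</code> on the example above (omitting the leading <code>&lt;</code> symbol); it is illustrative only and is not the official scoring code.
</p>
<pre>
# Minimal sketch: align reference vs. predicted phonemes with difflib
# (illustrative only; the official evaluation uses its own alignment).
from difflib import SequenceMatcher

ref = "y a t a H a d d a v u n n aa s u l l u g h a t a l E a r a b i y y a t a".split()
hyp = "y a t a H a d d a s u n n aa s u l l u g h a t u l E a r a b i y y a t a".split()

for op, i1, i2, j1, j2 in SequenceMatcher(a=ref, b=hyp).get_opcodes():
    if op != "equal":
        print(op, ref[i1:i2], "->", hyp[j1:j2])
# Output:
#   replace ['v'] -> ['s']
#   replace ['a'] -> ['u']
</pre>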
<h2>Phoneme Set Description</h2>
<p>
The phoneme set employed in this work derives from a specialized phonetizer developed specifically for vowelized Modern Standard Arabic (MSA). It encompasses a comprehensive inventory of phonemes designed to capture essential phonetic and prosodic features, including stress, pausing, intonation, emphaticness, and gemination. Notably, gemination—the lengthening of consonant sounds—is explicitly represented by duplicating the consonant symbol (e.g., <code>/b/</code> becomes <code>/bb/</code>). This approach ensures a detailed yet practical representation of speech sounds, which is critical for accurate mispronunciation detection.
</p>
<p>
To phonemize additional datasets or custom text using this standard, we provide the open-source tool at the <a href="https://github.com/Iqra-Eval/MSA_phonetiser">MSA Phonetizer Repository</a>. <strong>Important:</strong> This phonetizer requires the input Arabic text to be <strong>fully diacritized</strong> to ensure accurate phonetic transcription. For further details on the symbols used, please refer to the <a href="https://huggingface.co/spaces/IqraEval/ArabicPhoneme">Phoneme Inventory</a>.
</p>
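<p>
Because the phonetizer only works on fully diacritized input, it can be useful to check that a sentence actually carries short-vowel marks before phonemizing it. The snippet below is a rough heuristic sketch and not part of the official toolchain; the Unicode ranges and the threshold are assumptions.
</p>
<pre>
# Rough heuristic sketch: estimate whether an Arabic sentence carries tashkeel
# (fatha, damma, kasra, sukun, shadda, tanween). Threshold is an assumption.
import re

TASHKEEL = re.compile(r"[\u064B-\u0652]")        # Arabic diacritic marks
ARABIC_LETTER = re.compile(r"[\u0621-\u064A]")   # base Arabic letters

def looks_diacritized(text, min_ratio=0.3):
    letters = len(ARABIC_LETTER.findall(text))
    marks = len(TASHKEEL.findall(text))
    return letters > 0 and marks / letters >= min_ratio

print(looks_diacritized("يَتَحَدَّثُ النَّاسُ اللُّغَةَ الْعَرَبِيَّةَ"))  # True
</pre>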
<h2>Training Data Overview</h2>
<p>
To ensure robustness, the training data combines pseudo-labeled native speech, synthetic mispronunciations, and real recorded errors.
</p>
<h3>1. Native Speech (Pseudo-Labeled)</h3>
<p>
<strong>Dataset:</strong> <code>IqraEval/Iqra_train</code><br>
<strong>Volume:</strong> ~79 hours (Train) + 3.4 hours (Dev)<br>
This dataset consists of recordings from native MSA speakers. Because these speakers are assumed to pronounce the text correctly, the subset is treated as error-free ("golden") data and the canonical phoneme sequence serves as its pseudo-label; a minimal loading sketch follows the column list below.
</p>
<p><strong>Columns:</strong></p>
<ul>
<li><code>audio</code>: The speech waveform.</li>
<li><code>sentence</code>: The original raw text.</li>
<li><code>tashkeel_sentence</code>: Fully diacritized text, generated using an internal SOTA diacritizer (assumed correct).</li>
<li><code>phoneme_ref</code>: The reference canonical phoneme sequence.</li>
<li><code>phoneme_mis</code>: The realized phoneme sequence.
<br><em>Note: Since no errors are present, this is identical to <code>phoneme_ref</code>.</em>
</li>
</ul>
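<p>
A minimal loading sketch using the Hugging Face <code>datasets</code> library is shown below; the split names are assumptions and should be checked against the dataset card.
</p>
<pre>
# Minimal sketch: load the native-speech training set.
# Split names ("train", "dev") are assumptions; check the dataset card.
from datasets import load_dataset

iqra = load_dataset("IqraEval/Iqra_train")
print(iqra)                                # available splits and sizes
sample = iqra["train"][0]
print(sample["tashkeel_sentence"])         # fully diacritized text
print(sample["phoneme_ref"])               # canonical phoneme sequence
</pre>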
<h3>2. Synthetic Mispronunciations (TTS)</h3>
<p>
<strong>Dataset:</strong> <code>IqraEval/Iqra_TTS</code><br>
<strong>Volume:</strong> ~80 hours<br>
To compensate for the lack of errors in the native set, we generated a synthetic dataset using various trained TTS systems. Mispronunciations were deliberately introduced into the input text before audio generation. A sketch for deriving per-phoneme training labels from this set follows the column list below.
</p>
<p><strong>Columns:</strong></p>
<ul>
<li><code>audio</code>: The synthesized waveform.</li>
<li><code>sentence_ref</code>: The original correct text.</li>
<li><code>sentence_mis</code>: The text containing deliberate errors.</li>
<li><code>phoneme_ref</code>: The canonical phoneme sequence of the correct text.</li>
<li><code>phoneme_aug</code>: The phoneme sequence corresponding to the synthesized mispronunciation.</li>
<li><code>tashkeel_sentence</code>: The fully diacritized version of the reference text.</li>
</ul>
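<p>
Since every TTS utterance carries both the canonical and the mispronounced phoneme sequences, per-phoneme error labels for training a detector can be derived by aligning the two columns. The sketch below is one possible way to do this (using <code>difflib</code>, as in the earlier example); it is not an official labeling script.
</p>
<pre>
# Minimal sketch: derive per-position error tags from phoneme_ref vs. phoneme_aug.
# The alignment choice (difflib) is an assumption, not the official recipe.
from difflib import SequenceMatcher

def error_tags(phoneme_ref, phoneme_aug):
    ref, aug = phoneme_ref.split(), phoneme_aug.split()
    tags = []
    for op, i1, i2, j1, j2 in SequenceMatcher(a=ref, b=aug).get_opcodes():
        if op == "equal":
            tags += [("correct", p, p) for p in ref[i1:i2]]
        elif op == "replace":  # unequal spans are truncated by zip in this sketch
            tags += [("substitution", r, a) for r, a in zip(ref[i1:i2], aug[j1:j2])]
        elif op == "delete":
            tags += [("deletion", p, None) for p in ref[i1:i2]]
        else:  # insert
            tags += [("insertion", None, p) for p in aug[j1:j2]]
    return tags

print(error_tags("b a b", "b aa b"))
# [('correct', 'b', 'b'), ('substitution', 'a', 'aa'), ('correct', 'b', 'b')]
</pre>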
<h3>3. Real Mispronunciations (Interspeech 2026)</h3>
<p>
<strong>Dataset:</strong> <code>IqraEval/Iqra_Extra_IS26</code><br>
<strong>Volume:</strong> ~2 hours<br>
Moving beyond synthetic data, this subset contains real recordings of human mispronunciations collected specifically for Interspeech 2026.
</p>
<p><strong>Columns:</strong></p>
<ul>
<li><code>audio</code>: The speech waveform.</li>
<li><code>sentence</code>: The original text.</li>
<li><code>phoneme_ref</code>: The target canonical phoneme sequence.</li>
<li><code>phoneme_mis</code>: The actual realized phonemes containing human errors.</li>
</ul>
<hr>
<h2>Evaluation Dataset</h2>
<p>
<strong>Dataset:</strong> <code>IqraEval/QuranMB.v2</code><br>
Currently, only the audio files are released for this evaluation set. It serves as a benchmark for detecting mispronunciations in a distinct domain.
</p>
<div style="background-color: #f0f4f8; padding: 15px; border-left: 5px solid #0056b3; margin-top: 20px;">
<strong>Important Note on Data Leakage:</strong><br>
Strict measures were taken to ensure experimental integrity. We have verified that there is <strong>no overlap in speakers or content</strong> (sentences) between the training datasets (<code>Iqra_train</code>, <code>Iqra_TTS</code>, <code>Iqra_Extra_IS26</code>) and the evaluation datasets.
</div>
<h2>Submission Details (Draft)</h2>
<p>
Submit a UTF-8 CSV named <code>teamID_submission.csv</code> with two columns:
</p>
<ul>
<li><strong>ID:</strong> audio filename (no extension)</li>
<li><strong>Labels:</strong> predicted phoneme sequence (space-separated)</li>
</ul>
<pre>ID,Labels
0000_0001, y a t a H a d d a ...
0000_0002, m a a n a n s a ...
...
</pre>
<p>
<strong>Note:</strong> submit a single plain CSV file (no archives) without extra spaces.
</p>
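<p>
As a sanity check of the format, the sketch below writes predictions into the expected two-column CSV. The <code>predict_phonemes</code> function is a placeholder for your own model, and the local audio layout is an assumption.
</p>
<pre>
# Minimal sketch: write the two-column submission CSV (ID, Labels).
# predict_phonemes() is a placeholder; the audio folder layout is assumed.
import csv
from pathlib import Path

def predict_phonemes(wav_path):
    # placeholder: return the predicted phonemes as one space-separated string
    raise NotImplementedError

rows = []
for wav in sorted(Path("eval_audio").glob("*.wav")):
    rows.append((wav.stem, predict_phonemes(wav)))   # ID = filename, no extension

with open("teamID_submission.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["ID", "Labels"])
    writer.writerows(rows)
</pre>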
<h2>Evaluation Criteria</h2>
<div style="background-color: #f0f8ff; border-left: 5px solid #007bff; padding: 15px; margin-bottom: 20px;">
<h3 style="margin-top: 0; color: #007bff;">🏆 Primary Metric</h3>
<p style="margin-bottom: 0;">
The Leaderboard is ranked primarily by the <strong>Phoneme-level F1-score</strong>.
While other metrics (FRR, FAR, DER) are computed for analysis, <strong>F1</strong> determines the final standing.
</p>
</div>
<p>
We use a hierarchical evaluation strategy (detection + diagnostic) based on the
<a href="https://arxiv.org/pdf/2310.13974" target="_blank">MDD Overview</a> framework.
</p>
<h3>1. Input Definitions</h3>
<ul>
<li><strong>What is said:</strong> The annotated phoneme sequence (what the speaker actually produced).</li>
<li><strong>What is predicted:</strong> The phoneme sequence output by your model.</li>
<li><strong>What should have been said:</strong> The canonical reference (target) sequence.</li>
</ul>
<h3>2. Confusion Matrix Components</h3>
<p>From the inputs above, we compute the following counts; a short computation sketch covering these counts and the derived rates appears at the end of this section:</p>
<table style="width: 100%; border-collapse: collapse; margin-bottom: 20px;">
<tr style="background-color: #f9f9f9; border-bottom: 1px solid #ddd;">
<td style="padding: 8px;"><strong>TA (True Accept)</strong></td>
<td style="padding: 8px;">Correct phonemes properly accepted.</td>
</tr>
<tr style="border-bottom: 1px solid #ddd;">
<td style="padding: 8px;"><strong>TR (True Reject)</strong></td>
<td style="padding: 8px;">Mispronunciations correctly detected.</td>
</tr>
<tr style="background-color: #f9f9f9; border-bottom: 1px solid #ddd;">
<td style="padding: 8px;"><strong>FR (False Reject)</strong></td>
<td style="padding: 8px;">Correct phonemes incorrectly flagged as errors.</td>
</tr>
<tr>
<td style="padding: 8px;"><strong>FA (False Accept)</strong></td>
<td style="padding: 8px;">Mispronunciations missed (labeled as correct).</td>
</tr>
</table>
<h3>3. Calculated Metrics</h3>
<h4>Detection Metrics (Leaderboard Ranking)</h4>
<ul>
<li><strong>Precision:</strong> TR / (TR + FR)</li>
<li><strong>Recall:</strong> TR / (TR + FA)</li>
<li><strong>F1-Score:</strong> 2 · (Precision · Recall) / (Precision + Recall)</li>
</ul>
<h4>Diagnostic Rates (Auxiliary)</h4>
<ul>
<li><strong>FRR (False Reject Rate):</strong> FR / (TA + FR)</li>
<li><strong>FAR (False Accept Rate):</strong> FA / (FA + TR)</li>
<li><strong>DER (Diagnostic Error Rate):</strong> DE / (CD + DE), where, among the correctly detected mispronunciations (TR), CD counts those whose error is diagnosed correctly and DE counts those diagnosed incorrectly.</li>
</ul>
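<p>
The sketch below shows how these counts and rates can be computed once the three sequences have been aligned position by position; the alignment step itself is omitted, and the official scoring script may differ in detail.
</p>
<pre>
# Minimal sketch: compute TA/TR/FR/FA and the derived rates from three
# position-aligned sequences. Not the official scoring script.
def mdd_metrics(reference, annotated, predicted):
    ta = tr = fr = fa = cd = de = 0
    for ref, ann, pred in zip(reference, annotated, predicted):
        if ann == ref:              # phoneme was pronounced correctly
            if pred == ref:
                ta += 1             # true accept
            else:
                fr += 1             # false reject
        else:                       # phoneme was mispronounced
            if pred == ref:
                fa += 1             # false accept (error missed)
            else:
                tr += 1             # true reject (error detected)
                if pred == ann:
                    cd += 1         # correct diagnosis
                else:
                    de += 1         # diagnosis error
    precision = tr / (tr + fr) if tr + fr else 0.0
    recall = tr / (tr + fa) if tr + fa else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "Precision": precision, "Recall": recall, "F1": f1,
        "FRR": fr / (ta + fr) if ta + fr else 0.0,
        "FAR": fa / (fa + tr) if fa + tr else 0.0,
        "DER": de / (cd + de) if cd + de else 0.0,
    }
</pre>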
<h2>Suggested Research Directions</h2>
<ol>
<li>
<strong>Advanced Mispronunciation Detection Models</strong><br>
Apply state-of-the-art self-supervised models (e.g., wav2vec 2.0, HuBERT), preferring variants that are pre-trained or fine-tuned on Arabic speech. These models can then be fine-tuned on the MSA datasets above to improve phoneme-level accuracy; a minimal setup sketch follows this list.
</li>
<li>
<strong>Data Augmentation Strategies</strong><br>
Create synthetic mispronunciation examples using pipelines like
<a href="https://arxiv.org/abs/2211.00923" target="_blank">SpeechBlender</a>.
Augmenting limited Arabic speech data helps mitigate data scarcity and improves model robustness.
</li>
<li>
<strong>Analysis of Common Mispronunciation Patterns</strong><br>
Perform statistical analysis on the MSA-Test dataset to identify prevalent errors (e.g., substituting similar phonemes, swapping vowels).
These insights can drive targeted training and tailored feedback rules.
</li>
</ol>
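<p>
As a starting point for the first direction above, the sketch below sets up a self-supervised encoder with a CTC head for phoneme recognition using the Hugging Face <code>transformers</code> library. The checkpoint name is only an example, the phoneme vocabulary size is a placeholder, and the tokenizer and training loop are omitted.
</p>
<pre>
# Minimal sketch: wav2vec 2.0 encoder with a CTC head for phoneme recognition.
# Checkpoint and vocabulary size are placeholders; training loop omitted.
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2FeatureExtractor

NUM_PHONEMES = 60  # placeholder: size of the challenge phoneme inventory + blank

model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-xls-r-300m",   # multilingual SSL checkpoint (example)
    vocab_size=NUM_PHONEMES,
    ctc_loss_reduction="mean",
)
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-xls-r-300m")

# Dummy forward pass on one second of 16 kHz audio
inputs = extractor(torch.zeros(16000).numpy(), sampling_rate=16000, return_tensors="pt")
logits = model(inputs.input_values).logits   # (batch, frames, NUM_PHONEMES)
print(logits.shape)
</pre>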
<h2>Registration</h2>
<p>
Teams and individual participants must register to gain access to the test set. Please complete the registration form using the link below:
</p>
<p>
<a href="https://docs.google.com/forms/d/e/1FAIpQLSdDyEP7vzJnpvthiEK6WPws2vpuI_yqbzOzEVqHKs0wdDY_Lg/viewform?usp=header" target="_blank">Registration Form</a>
</p>
<p>
Registration opens on December 1, 2025.
</p>
<h2>Future Updates</h2>
<p>
Further details on the open-set leaderboard submission will be posted on the shared task website (December 15, 2025). Stay tuned!
</p>
<h2>Contact and Support</h2>
<p>
For inquiries and support, reach out to the task coordinators.
</p>
<h2>References</h2>
<ul>
<li>El Kheir Y. et al., “SpeechBlender: Speech Augmentation Framework for Mispronunciation Data Generation,” arXiv:2211.00923, 2022.</li>
<li>Aly S. A. et al., “ASMDD: Arabic Speech Mispronunciation Detection Dataset,” arXiv:2111.01136, 2021.</li>
<li>Moustafa A. &amp; Aly S. A., “Efficient Voice Identification Using Wav2Vec2.0 and HuBERT…,” arXiv:2111.06331, 2021.</li>
<li>El Kheir Y. et al., “Automatic Pronunciation Assessment – A Review,” arXiv:2310.13974, 2023.</li>
</ul>
</div>
</body>
</html>