Update index.html
<p>
To phonemize additional datasets or custom text using this standard, we provide the open-source tool at the <a href="https://github.com/Iqra-Eval/MSA_phonetiser">MSA Phonetizer Repository</a>. <strong>Important:</strong> This phonetizer requires the input Arabic text to be <strong>fully diacritized</strong> to ensure accurate phonetic transcription. For further details on the symbols used, please refer to the <a href="https://huggingface.co/spaces/IqraEval/ArabicPhoneme">Phoneme Inventory</a>.
</p>
<!--
<h2>Training Dataset: Description</h2>
<p>
Hosted on Hugging Face:
…
98 sentences × 18 speakers ≈ 2 h, with deliberate errors and human annotations.
<code>load_dataset("IqraEval/open_testset")</code>
</p>
-->

<h2>Training Data Overview</h2>
<p>
To ensure robustness, our training strategy combines native speech (pseudo-labeled), synthetic mispronunciations, and real recorded errors.
</p>

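<p>Once the three subsets described below are downloaded, they can be pooled into a single training list with each row tagged by its origin. This is a minimal stdlib sketch, not part of the released tooling; the plain-dict row format and the <code>source</code> tag are illustrative assumptions.</p>

```python
def pool_training_data(native, synthetic, real):
    """Pool the three training subsets, tagging each row with its origin.

    Each argument is a list of row dicts (illustrative format); the
    returned rows gain a "source" key so losses can be weighted per subset.
    """
    pooled = []
    for source, rows in (("native", native), ("tts", synthetic), ("real", real)):
        for row in rows:
            pooled.append({**row, "source": source})
    return pooled
```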
<h3>1. Native Speech (Pseudo-Labeled)</h3>
<p>
<strong>Dataset:</strong> <code>IqraEval/Iqra_train</code><br>
<strong>Volume:</strong> ~79 hours (Train) + 3.4 hours (Dev)<br>
This dataset consists of recordings from native MSA speakers. Because these speakers are assumed to pronounce the text correctly, this subset is treated as "golden" data and its transcripts serve as pseudo-labels.
</p>
<p><strong>Columns:</strong></p>
<ul>
<li><code>audio</code>: The speech waveform.</li>
<li><code>sentence</code>: The original raw text.</li>
<li><code>tashkeel_sentence</code>: Fully diacritized text, generated with an internal state-of-the-art diacritizer (assumed correct).</li>
<li><code>phoneme_ref</code>: The reference canonical phoneme sequence.</li>
<li><code>phoneme_mis</code>: The realized phoneme sequence.
<br><em>Note: Since no errors are present, this is identical to <code>phoneme_ref</code>.</em>
</li>
</ul>

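<p>A quick way to sanity-check rows from this subset is to verify the expected schema and the pseudo-label property (realized phonemes equal the reference). A minimal sketch; the function name and the plain-dict row format are illustrative, not part of the dataset tooling.</p>

```python
# Expected columns for the native (pseudo-labeled) subset, as listed above.
REQUIRED_COLUMNS = {"audio", "sentence", "tashkeel_sentence",
                    "phoneme_ref", "phoneme_mis"}

def is_golden(row):
    """True if the row has the expected columns and phoneme_mis == phoneme_ref."""
    missing = REQUIRED_COLUMNS - row.keys()
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return row["phoneme_mis"] == row["phoneme_ref"]
```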
<h3>2. Synthetic Mispronunciations (TTS)</h3>
<p>
<strong>Dataset:</strong> <code>IqraEval/Iqra_TTS</code><br>
<strong>Volume:</strong> ~80 hours<br>
To compensate for the lack of errors in the native set, we generated a synthetic dataset using various trained TTS systems. Mispronunciations were deliberately introduced into the input text before audio generation.
</p>
<p><strong>Columns:</strong></p>
<ul>
<li><code>audio</code>: The synthesized waveform.</li>
<li><code>sentence_ref</code>: The original correct text.</li>
<li><code>sentence_mis</code>: The text containing deliberate errors.</li>
<li><code>phoneme_ref</code>: The canonical phoneme sequence of the correct text.</li>
<li><code>phoneme_aug</code>: The phoneme sequence corresponding to the synthesized mispronunciation.</li>
<li><code>tashkeel_sentence</code>: The fully diacritized version of the reference text.</li>
</ul>

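<p>The generation pipeline itself is not released here, but the core idea of corrupting the phoneme sequence before synthesis can be sketched as follows. The confusion pairs, the symbols, and the substitution rate are purely illustrative assumptions, not the error map used to build <code>Iqra_TTS</code>.</p>

```python
import random

# Illustrative confusion pairs only (NOT the official error map):
# learners often reduce emphatic consonants to their plain counterparts.
CONFUSIONS = {"S": "s", "T": "t", "D": "d", "Z": "z"}

def inject_errors(phonemes, rate=0.3, seed=0):
    """Return a copy of `phonemes` with some confusable symbols substituted.

    The input plays the role of phoneme_ref; the output, of phoneme_aug.
    A seeded RNG keeps the corruption reproducible.
    """
    rng = random.Random(seed)
    return [CONFUSIONS[p] if p in CONFUSIONS and rng.random() < rate else p
            for p in phonemes]
```

Because substitutions never insert or delete symbols, the output stays position-aligned with the reference, which makes the error labels trivial to derive.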
<h3>3. Real Mispronunciations (Interspeech 2026)</h3>
<p>
<strong>Dataset:</strong> <code>IqraEval/Iqra_Extra_IS26</code><br>
<strong>Volume:</strong> ~2 hours<br>
Moving beyond synthetic data, this subset contains real recordings of human mispronunciations, collected specifically for Interspeech 2026.
</p>
<p><strong>Columns:</strong></p>
<ul>
<li><code>audio</code>: The speech waveform.</li>
<li><code>sentence</code>: The original text.</li>
<li><code>phoneme_ref</code>: The target canonical phoneme sequence.</li>
<li><code>phoneme_mis</code>: The actual realized phonemes, containing human errors.</li>
</ul>

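<p>Because this subset provides both <code>phoneme_ref</code> and <code>phoneme_mis</code>, the error positions can be recovered by aligning the two sequences. A minimal sketch using the stdlib <code>difflib</code> aligner; production MDD systems typically use a phonetically weighted alignment instead.</p>

```python
from difflib import SequenceMatcher

def error_ops(phoneme_ref, phoneme_mis):
    """Return (op, ref_segment, realized_segment) for each divergence.

    `op` is "replace" (substitution), "delete" (phoneme omitted by the
    speaker), or "insert" (phoneme added by the speaker).
    """
    sm = SequenceMatcher(a=phoneme_ref, b=phoneme_mis, autojunk=False)
    return [(op, phoneme_ref[i1:i2], phoneme_mis[j1:j2])
            for op, i1, i2, j1, j2 in sm.get_opcodes()
            if op != "equal"]
```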
<hr>
<h2>Evaluation Dataset</h2>
<p>
<strong>Dataset:</strong> <code>IqraEval/QuranMB.v2</code><br>
Currently, only the audio files are released for this evaluation set. It serves as a benchmark for detecting mispronunciations in a distinct domain.
</p>

<div style="background-color: #f0f4f8; padding: 15px; border-left: 5px solid #0056b3; margin-top: 20px;">
<strong>Important Note on Data Leakage:</strong><br>
Strict measures were taken to ensure experimental integrity. We have verified that there is <strong>no overlap in speakers or content</strong> (sentences) between the training datasets (<code>Iqra_train</code>, <code>Iqra_TTS</code>, <code>Iqra_Extra_IS26</code>) and the evaluation datasets.
</div>

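<p>The no-overlap guarantee is also easy to re-check locally. A minimal sketch; the <code>speaker</code> and <code>sentence</code> keys are assumptions about how rows are represented (a speaker ID is not among the released columns), and a real check would normalize the text before comparing.</p>

```python
def check_no_leakage(train_rows, eval_rows):
    """Return (shared_speakers, shared_sentences); both empty means no leakage."""
    shared_speakers = ({r["speaker"] for r in train_rows}
                       & {r["speaker"] for r in eval_rows})
    shared_sentences = ({r["sentence"] for r in train_rows}
                        & {r["sentence"] for r in eval_rows})
    return shared_speakers, shared_sentences
```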
<h2>Submission Details (Draft)</h2>
<p>
Submit a UTF-8 CSV named <code>teamID_submission.csv</code> with two columns: