01Yassine committed on
Commit f103e2b · verified · 1 Parent(s): 05c07b8

Update index.html

Files changed (1): index.html (+66 -1)

index.html CHANGED
@@ -167,7 +167,7 @@
   <p>
   To phonemize additional datasets or custom text using this standard, we provide the open-source tool at the <a href="https://github.com/Iqra-Eval/MSA_phonetiser">MSA Phonetizer Repository</a>. <strong>Important:</strong> This phonetizer requires the input Arabic text to be <strong>fully diacritized</strong> to ensure accurate phonetic transcription. For further details on the symbols used, please refer to the <a href="https://huggingface.co/spaces/IqraEval/ArabicPhoneme">Phoneme Inventory</a>.
   </p>
-
+  <!--
   <h2>Training Dataset: Description</h2>
   <p>
   Hosted on Hugging Face:
@@ -201,8 +201,73 @@
   98 sentences × 18 speakers ≈ 2 h, with deliberate errors and human annotations.
   <code>load_dataset("IqraEval/open_testset")</code>
   </p>
+  -->
+
+  <h2>Training Data Overview</h2>
+  <p>
+  To ensure robustness, our training strategy uses a mix of native speech (pseudo-labeled), synthetic mispronunciations, and real recorded errors.
+  </p>
+
+  <h3>1. Native Speech (Pseudo-Labeled)</h3>
+  <p>
+  <strong>Dataset:</strong> <code>IqraEval/Iqra_train</code><br>
+  <strong>Volume:</strong> ~79 hours (Train) + 3.4 hours (Dev)<br>
+  This dataset consists of recordings from native MSA speakers. As these speakers are assumed to pronounce the text correctly, this subset is treated as "Golden" data using pseudo-labels.
+  </p>
+  <p><strong>Columns:</strong></p>
+  <ul>
+  <li><code>audio</code>: The speech waveform.</li>
+  <li><code>sentence</code>: The original raw text.</li>
+  <li><code>tashkeel_sentence</code>: Fully diacritized text, generated using an internal SOTA diacritizer (assumed correct).</li>
+  <li><code>phoneme_ref</code>: The reference canonical phoneme sequence.</li>
+  <li><code>phoneme_mis</code>: The realized phoneme sequence.
+  <br><em>Note: Since no errors are present, this is identical to <code>phoneme_ref</code>.</em>
+  </li>
+  </ul>
+
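The pseudo-labeling assumption above (that <code>phoneme_mis</code> is identical to <code>phoneme_ref</code> for native speech) is easy to check programmatically. The sketch below is our own illustration, not part of the challenge tooling; the sample row is an invented placeholder that only mimics the documented column schema.

```python
# Sketch: validate a row against the Iqra_train schema described above.
# The sample row is an invented placeholder, not a real corpus entry.
REQUIRED_COLUMNS = {"audio", "sentence", "tashkeel_sentence",
                    "phoneme_ref", "phoneme_mis"}

def check_native_row(row):
    """A native-speech row must carry all documented columns; under
    pseudo-labeling, its realized phonemes equal the canonical reference."""
    missing = REQUIRED_COLUMNS - row.keys()
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return row["phoneme_ref"] == row["phoneme_mis"]

sample = {
    "audio": b"",                # placeholder for the waveform
    "sentence": "...",           # raw text
    "tashkeel_sentence": "...",  # fully diacritized text
    "phoneme_ref": "q a a l a",  # invented phoneme string
    "phoneme_mis": "q a a l a",  # identical by construction
}
print(check_native_row(sample))  # → True
```

Against the real data, the same check would run over `load_dataset("IqraEval/Iqra_train", split="train")`.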
+  <h3>2. Synthetic Mispronunciations (TTS)</h3>
+  <p>
+  <strong>Dataset:</strong> <code>IqraEval/Iqra_TTS</code><br>
+  <strong>Volume:</strong> ~80 hours<br>
+  To compensate for the lack of errors in the native set, we generated a synthetic dataset using various trained TTS systems. Mispronunciations were deliberately introduced into the input text before audio generation.
+  </p>
+  <p><strong>Columns:</strong></p>
+  <ul>
+  <li><code>audio</code>: The synthesized waveform.</li>
+  <li><code>sentence_ref</code>: The original correct text.</li>
+  <li><code>sentence_mis</code>: The text containing deliberate errors.</li>
+  <li><code>phoneme_ref</code>: The canonical phoneme sequence of the correct text.</li>
+  <li><code>phoneme_aug</code>: The phoneme sequence corresponding to the synthesized mispronunciation.</li>
+  <li><code>tashkeel_sentence</code>: The fully diacritized version of the reference text.</li>
+  </ul>
+
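A detection system ultimately compares <code>phoneme_ref</code> against a realized sequence such as <code>phoneme_aug</code>. As an illustration (not part of the released tooling), the standard-library <code>difflib</code> can localize where an error was injected; the phoneme strings below are invented placeholders, not real corpus entries.

```python
from difflib import SequenceMatcher

def phoneme_edits(ref, hyp):
    """List (op, ref_span, hyp_span) tuples where the realized phoneme
    sequence diverges from the canonical reference."""
    ref_toks, hyp_toks = ref.split(), hyp.split()
    sm = SequenceMatcher(a=ref_toks, b=hyp_toks, autojunk=False)
    return [(op, ref_toks[i1:i2], hyp_toks[j1:j2])
            for op, i1, i2, j1, j2 in sm.get_opcodes()
            if op != "equal"]

# Invented example: /q/ realized as /k/, a single substitution.
print(phoneme_edits("q a l a", "k a l a"))  # → [('replace', ['q'], ['k'])]
```

The same alignment applies unchanged to <code>phoneme_mis</code> in the real-error subsets.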
+  <h3>3. Real Mispronunciations (Interspeech 2026)</h3>
+  <p>
+  <strong>Dataset:</strong> <code>IqraEval/Iqra_Extra_IS26</code><br>
+  <strong>Volume:</strong> ~2 hours<br>
+  Moving beyond synthetic data, this subset contains real recordings of human mispronunciations collected specifically for Interspeech 2026.
+  </p>
+  <p><strong>Columns:</strong></p>
+  <ul>
+  <li><code>audio</code>: The speech waveform.</li>
+  <li><code>sentence</code>: The original text.</li>
+  <li><code>phoneme_ref</code>: The target canonical phoneme sequence.</li>
+  <li><code>phoneme_mis</code>: The actual realized phonemes containing human errors.</li>
+  </ul>
+
+  <hr>
 
+  <h2>Evaluation Dataset</h2>
+  <p>
+  <strong>Dataset:</strong> <code>IqraEval/QuranMB.v2</code><br>
+  Currently, only the audio files are released for this evaluation set. It serves as a benchmark for detecting mispronunciations in a distinct domain.
+  </p>
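Since only the audio of the evaluation set is released, phoneme predictions are produced locally by each participant. A minimal sketch of that loop, where the `recognizer` callable and the `utt_id` field are hypothetical stand-ins for a participant's own model and utterance identifiers (neither is a challenge API):

```python
def predict_all(rows, recognizer):
    """Map each utterance ID to the recognizer's phoneme-sequence output.
    `utt_id` and `recognizer` are hypothetical stand-ins, not challenge APIs."""
    return {row["utt_id"]: recognizer(row["audio"]) for row in rows}

# Placeholder rows and a dummy model standing in for a real recognizer.
rows = [{"utt_id": "utt_001", "audio": b"\x00\x01"},
        {"utt_id": "utt_002", "audio": b"\x02\x03"}]
preds = predict_all(rows, recognizer=lambda audio: "q a l a")
print(preds)  # → {'utt_001': 'q a l a', 'utt_002': 'q a l a'}
```

With the real set, the rows would come from `load_dataset("IqraEval/QuranMB.v2")` instead of the placeholders above.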
 
+  <div style="background-color: #f0f4f8; padding: 15px; border-left: 5px solid #0056b3; margin-top: 20px;">
+  <strong>Important Note on Data Leakage:</strong><br>
+  Strict measures were taken to ensure experimental integrity. We have verified that there is <strong>no overlap in speakers or content</strong> (sentences) between the training datasets (<code>Iqra_train</code>, <code>Iqra_TTS</code>, <code>Iqra_Extra_IS26</code>) and the evaluation datasets.
+  </div>
+
   <h2>Submission Details (Draft)</h2>
   <p>
   Submit a UTF-8 CSV named <code>teamID_submission.csv</code> with two columns: