01Yassine committed on
Commit f103e2b · verified · 1 Parent(s): 05c07b8

Update index.html

Files changed (1): index.html (+66 -1)

index.html CHANGED
@@ -167,7 +167,7 @@
   <p>
   To phonemize additional datasets or custom text using this standard, we provide the open-source tool at the <a href="https://github.com/Iqra-Eval/MSA_phonetiser">MSA Phonetizer Repository</a>. <strong>Important:</strong> This phonetizer requires the input Arabic text to be <strong>fully diacritized</strong> to ensure accurate phonetic transcription. For further details on the symbols used, please refer to the <a href="https://huggingface.co/spaces/IqraEval/ArabicPhoneme">Phoneme Inventory</a>.
   </p>
-
+  <!--
   <h2>Training Dataset: Description</h2>
   <p>
   Hosted on Hugging Face:
@@ -201,8 +201,73 @@
   98 sentences × 18 speakers ≈ 2 h, with deliberate errors and human annotations.
   <code>load_dataset("IqraEval/open_testset")</code>
   </p>
+  -->
+
+  <h2>Training Data Overview</h2>
+  <p>
+  To ensure robustness, our training strategy uses a mix of native speech (pseudo-labeled), synthetic mispronunciations, and real recorded errors.
+  </p>
+
+  <h3>1. Native Speech (Pseudo-Labeled)</h3>
+  <p>
+  <strong>Dataset:</strong> <code>IqraEval/Iqra_train</code><br>
+  <strong>Volume:</strong> ~79 hours (Train) + 3.4 hours (Dev)<br>
+  This dataset consists of recordings from native MSA speakers. As these speakers are assumed to pronounce the text correctly, this subset is treated as "Golden" data using pseudo-labels.
+  </p>
+  <p><strong>Columns:</strong></p>
+  <ul>
+  <li><code>audio</code>: The speech waveform.</li>
+  <li><code>sentence</code>: The original raw text.</li>
+  <li><code>tashkeel_sentence</code>: Fully diacritized text, generated using an internal SOTA diacritizer (assumed correct).</li>
+  <li><code>phoneme_ref</code>: The reference canonical phoneme sequence.</li>
+  <li><code>phoneme_mis</code>: The realized phoneme sequence.
+  <br><em>Note: Since no errors are present, this is identical to <code>phoneme_ref</code>.</em>
+  </li>
+  </ul>
+
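The pseudo-labeling assumption above (that <code>phoneme_mis</code> is identical to <code>phoneme_ref</code> for native speech) is easy to check programmatically. The sketch below is our own illustration, not part of the challenge tooling; the sample row is an invented placeholder that only mimics the documented column schema.

```python
# Sketch: validate a row against the Iqra_train schema described above.
# The sample row is an invented placeholder, not a real corpus entry.
REQUIRED_COLUMNS = {"audio", "sentence", "tashkeel_sentence",
                    "phoneme_ref", "phoneme_mis"}

def check_native_row(row):
    """A native-speech row must carry all documented columns; under
    pseudo-labeling, its realized phonemes equal the canonical reference."""
    missing = REQUIRED_COLUMNS - row.keys()
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return row["phoneme_ref"] == row["phoneme_mis"]

sample = {
    "audio": b"",                # placeholder for the waveform
    "sentence": "...",           # raw text
    "tashkeel_sentence": "...",  # fully diacritized text
    "phoneme_ref": "q a a l a",  # invented phoneme string
    "phoneme_mis": "q a a l a",  # identical by construction
}
print(check_native_row(sample))  # → True
```

Against the real data, the same check would run over `load_dataset("IqraEval/Iqra_train", split="train")`.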
+  <h3>2. Synthetic Mispronunciations (TTS)</h3>
+  <p>
+  <strong>Dataset:</strong> <code>IqraEval/Iqra_TTS</code><br>
+  <strong>Volume:</strong> ~80 hours<br>
+  To compensate for the lack of errors in the native set, we generated a synthetic dataset using various trained TTS systems. Mispronunciations were deliberately introduced into the input text before audio generation.
+  </p>
+  <p><strong>Columns:</strong></p>
+  <ul>
+  <li><code>audio</code>: The synthesized waveform.</li>
+  <li><code>sentence_ref</code>: The original correct text.</li>
+  <li><code>sentence_mis</code>: The text containing deliberate errors.</li>
+  <li><code>phoneme_ref</code>: The canonical phoneme sequence of the correct text.</li>
+  <li><code>phoneme_aug</code>: The phoneme sequence corresponding to the synthesized mispronunciation.</li>
+  <li><code>tashkeel_sentence</code>: The fully diacritized version of the reference text.</li>
+  </ul>
+
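A detection system ultimately compares <code>phoneme_ref</code> against a realized sequence such as <code>phoneme_aug</code>. As an illustration (not part of the released tooling), the standard-library <code>difflib</code> can localize where an error was injected; the phoneme strings below are invented placeholders, not real corpus entries.

```python
from difflib import SequenceMatcher

def phoneme_edits(ref, hyp):
    """List (op, ref_span, hyp_span) tuples where the realized phoneme
    sequence diverges from the canonical reference."""
    ref_toks, hyp_toks = ref.split(), hyp.split()
    sm = SequenceMatcher(a=ref_toks, b=hyp_toks, autojunk=False)
    return [(op, ref_toks[i1:i2], hyp_toks[j1:j2])
            for op, i1, i2, j1, j2 in sm.get_opcodes()
            if op != "equal"]

# Invented example: /q/ realized as /k/, a single substitution.
print(phoneme_edits("q a l a", "k a l a"))  # → [('replace', ['q'], ['k'])]
```

The same alignment applies unchanged to <code>phoneme_mis</code> in the real-error subsets.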
+  <h3>3. Real Mispronunciations (Interspeech 2026)</h3>
+  <p>
+  <strong>Dataset:</strong> <code>IqraEval/Iqra_Extra_IS26</code><br>
+  <strong>Volume:</strong> ~2 hours<br>
+  Moving beyond synthetic data, this subset contains real recordings of human mispronunciations collected specifically for Interspeech 2026.
+  </p>
+  <p><strong>Columns:</strong></p>
+  <ul>
+  <li><code>audio</code>: The speech waveform.</li>
+  <li><code>sentence</code>: The original text.</li>
+  <li><code>phoneme_ref</code>: The target canonical phoneme sequence.</li>
+  <li><code>phoneme_mis</code>: The actual realized phonemes containing human errors.</li>
+  </ul>
+
+  <hr>
 
+  <h2>Evaluation Dataset</h2>
+  <p>
+  <strong>Dataset:</strong> <code>IqraEval/QuranMB.v2</code><br>
+  Currently, only the audio files are released for this evaluation set. It serves as a benchmark for detecting mispronunciations in a distinct domain.
+  </p>
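Since only the audio of the evaluation set is released, phoneme predictions are produced locally by each participant. A minimal sketch of that loop, where the `recognizer` callable and the `utt_id` field are hypothetical stand-ins for a participant's own model and utterance identifiers (neither is a challenge API):

```python
def predict_all(rows, recognizer):
    """Map each utterance ID to the recognizer's phoneme-sequence output.
    `utt_id` and `recognizer` are hypothetical stand-ins, not challenge APIs."""
    return {row["utt_id"]: recognizer(row["audio"]) for row in rows}

# Placeholder rows and a dummy model standing in for a real recognizer.
rows = [{"utt_id": "utt_001", "audio": b"\x00\x01"},
        {"utt_id": "utt_002", "audio": b"\x02\x03"}]
preds = predict_all(rows, recognizer=lambda audio: "q a l a")
print(preds)  # → {'utt_001': 'q a l a', 'utt_002': 'q a l a'}
```

With the real set, the rows would come from `load_dataset("IqraEval/QuranMB.v2")` instead of the placeholders above.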
 
+  <div style="background-color: #f0f4f8; padding: 15px; border-left: 5px solid #0056b3; margin-top: 20px;">
+  <strong>Important Note on Data Leakage:</strong><br>
+  Strict measures were taken to ensure experimental integrity. We have verified that there is <strong>no overlap in speakers or content</strong> (sentences) between the training datasets (<code>Iqra_train</code>, <code>Iqra_TTS</code>, <code>Iqra_Extra_IS26</code>) and the evaluation datasets.
+  </div>
+
   <h2>Submission Details (Draft)</h2>
   <p>
   Submit a UTF-8 CSV named <code>teamID_submission.csv</code> with two columns: