<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width" />
<title>IqraEval.2 Challenge Interspeech 2026</title>
<style>
:root {
--navy-blue: #001f4d;
--coral: #ff6f61;
--light-gray: #f5f7fa;
--text-dark: #222;
}
body {
font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
background-color: var(--light-gray);
color: var(--text-dark);
margin: 20px;
line-height: 1.6;
}
h1, h2, h3 {
color: var(--navy-blue);
font-weight: 700;
margin-top: 1.2em;
}
h1 {
text-align: center;
font-size: 2.8rem;
margin-bottom: 0.3em;
}
h2 {
border-bottom: 3px solid var(--coral);
padding-bottom: 0.3em;
}
h3 {
color: var(--coral);
margin-top: 1em;
}
p, ul, pre, ol {
max-width: 900px;
margin: 0.8em auto;
}
ul, ol { padding-left: 1.2em; }
ul li, ol li { margin: 0.4em 0; }
code {
background-color: #eef4f8;
color: var(--navy-blue);
padding: 2px 6px;
border-radius: 4px;
font-family: Consolas, monospace;
font-size: 0.9em;
}
pre {
background-color: #eef4f8;
padding: 1em;
border-radius: 8px;
overflow-x: auto;
font-size: 0.95em;
}
a {
color: var(--coral);
text-decoration: none;
}
a:hover { text-decoration: underline; }
.card {
max-width: 960px;
background: white;
margin: 0 auto 40px;
padding: 2em 2.5em;
box-shadow: 0 4px 14px rgba(0,0,0,0.1);
border-radius: 12px;
}
img {
display: block;
margin: 20px auto;
max-width: 100%;
height: auto;
border-radius: 8px;
box-shadow: 0 4px 8px rgba(0,31,77,0.15);
}
.centered p {
text-align: center;
font-style: italic;
color: var(--navy-blue);
margin-top: 0.4em;
}
.highlight {
color: var(--coral);
font-weight: 700;
}
/* nested lists in paragraphs */
p > ul { margin-top: 0.3em; }
</style>
</head>
<body>
<div class="card">
<h1>IqraEval.2 Challenge Interspeech 2026</h1>
<img src="IqraEval.png" alt="Interspeech 2026 Challenge Logo" />
<h2>Overview</h2>
<p>
The <strong>IqraEval.2 Challenge at Interspeech 2026</strong> is a shared task aimed at advancing the <strong>automatic assessment of Modern Standard Arabic (MSA) pronunciation</strong> through computational methods that detect and diagnose pronunciation errors. The focus on MSA provides a standardized, well-defined context for evaluating Arabic pronunciation.
</p>
<p>
Participants will develop systems capable of detecting mispronunciations (e.g., substitution, deletion, or insertion of phonemes).
</p>
<h2>Timeline</h2>
<ul>
<li><strong>1 December 2025</strong>: Registration opens</li>
<li><strong>15 December 2025</strong>: Release of training data, evaluation set, Arabic phoneme set, and phonemiser</li>
<li><strong>15 February 2026</strong>: Registration closes; leaderboard frozen</li>
<li><strong>17 February 2026</strong>: Results announced</li>
<li><strong>25 February 2026</strong>: Challenge paper submission deadline</li>
</ul>
<h2>Task Description: MSA Mispronunciation Detection System</h2>
<p>
Design a model to detect and provide detailed feedback on mispronunciations in MSA speech. Users read vowelized sentences; the model predicts the spoken phoneme sequence and flags deviations. Evaluation is on the <strong>MSA-Test</strong> dataset with human‐annotated errors.
</p>
<div class="centered">
<img src="task.png" alt="System Overview" />
<p>Figure: Overview of the Mispronunciation Detection Workflow</p>
</div>
<h3>1. Read the Sentence</h3>
<p>
The system displays a <strong>Reference Sentence</strong> together with its <strong>Reference Phoneme Sequence</strong>.
</p>
<p><strong>Example:</strong></p>
<ul>
<li><strong>Arabic:</strong> يَتَحَدَّثُ النَّاسُ اللُّغَةَ الْعَرَبِيَّةَ</li>
<li>
<strong>Phoneme:</strong>
<code>&lt; y a t a H a d d a v u n n aa s u l l u g h a t a l E a r a b i y y a t a</code>
</li>
</ul>
<h3>2. Save Recording</h3>
<p>
The user reads the sentence aloud; the system captures and stores the audio waveform.
</p>
<h3>3. Mispronunciation Detection</h3>
<p>
The model predicts the phoneme sequence that was actually spoken; deviations from the reference indicate mispronunciations.
</p>
<p><strong>Example of Mispronunciation:</strong></p>
<ul>
<li><strong>Reference:</strong> <code>&lt; y a t a H a d d a v u n n aa s u l l u g h a t a l E a r a b i y y a t a</code></li>
<li><strong>Predicted:</strong> <code>&lt; y a t a H a d d a <span class="highlight">s</span> u n n aa s u l l u g h a t <span class="highlight">u</span> l E a r a b i y y a t a</code></li>
</ul>
<p>
Here, <code>v</code>→<code>s</code> is a phoneme substitution, a common pronunciation error; the predicted sequence also contains a second substitution, <code>a</code>→<code>u</code>, in the vowel before <code>l E</code> (both are highlighted above).
</p>
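<p>
To make this comparison concrete, the reference and predicted sequences can be aligned and their differences listed. The snippet below is a minimal sketch using Python's <code>difflib</code> on the example above (omitting the leading <code>&lt;</code> symbol); it is illustrative only and is not the official scoring code.
</p>
<pre>
# Minimal sketch: align reference vs. predicted phonemes with difflib
# (illustrative only; the official evaluation uses its own alignment).
from difflib import SequenceMatcher

ref = "y a t a H a d d a v u n n aa s u l l u g h a t a l E a r a b i y y a t a".split()
hyp = "y a t a H a d d a s u n n aa s u l l u g h a t u l E a r a b i y y a t a".split()

for op, i1, i2, j1, j2 in SequenceMatcher(a=ref, b=hyp).get_opcodes():
    if op != "equal":
        print(op, ref[i1:i2], "->", hyp[j1:j2])
# Output:
#   replace ['v'] -> ['s']
#   replace ['a'] -> ['u']
</pre>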
<h2>Phoneme Set Description</h2>
<p>
The phoneme set employed in this work derives from a specialized phonetizer developed specifically for vowelized Modern Standard Arabic (MSA). It encompasses a comprehensive inventory of phonemes designed to capture essential phonetic and prosodic features, including stress, pausing, intonation, emphaticness, and gemination. Notably, gemination—the lengthening of consonant sounds—is explicitly represented by duplicating the consonant symbol (e.g., <code>/b/</code> becomes <code>/bb/</code>). This approach ensures a detailed yet practical representation of speech sounds, which is critical for accurate mispronunciation detection.
</p>
<p>
To phonemize additional datasets or custom text using this standard, we provide the open-source tool at the <a href="https://github.com/Iqra-Eval/MSA_phonetiser">MSA Phonetizer Repository</a>. <strong>Important:</strong> This phonetizer requires the input Arabic text to be <strong>fully diacritized</strong> to ensure accurate phonetic transcription. For further details on the symbols used, please refer to the <a href="https://huggingface.co/spaces/IqraEval/ArabicPhoneme">Phoneme Inventory</a>.
</p>
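<p>
Because the phonetizer only works on fully diacritized input, it can be useful to check that a sentence actually carries short-vowel marks before phonemizing it. The snippet below is a rough heuristic sketch and not part of the official toolchain; the Unicode ranges and the threshold are assumptions.
</p>
<pre>
# Rough heuristic sketch: estimate whether an Arabic sentence carries tashkeel
# (fatha, damma, kasra, sukun, shadda, tanween). Threshold is an assumption.
import re

TASHKEEL = re.compile(r"[\u064B-\u0652]")        # Arabic diacritic marks
ARABIC_LETTER = re.compile(r"[\u0621-\u064A]")   # base Arabic letters

def looks_diacritized(text, min_ratio=0.3):
    letters = len(ARABIC_LETTER.findall(text))
    marks = len(TASHKEEL.findall(text))
    return letters > 0 and marks / letters >= min_ratio

print(looks_diacritized("يَتَحَدَّثُ النَّاسُ اللُّغَةَ الْعَرَبِيَّةَ"))  # True
</pre>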
<h2>Training Data Overview</h2>
<p>
To ensure robustness, the training data combines pseudo-labeled native speech, synthetic mispronunciations, and real recorded errors.
</p>
<h3>1. Native Speech (Pseudo-Labeled)</h3>
<p>
<strong>Dataset:</strong> <code>IqraEval/Iqra_train</code><br>
<strong>Volume:</strong> ~79 hours (Train) + 3.4 hours (Dev)<br>
This dataset consists of recordings from native MSA speakers. Because these speakers are assumed to pronounce the text correctly, the subset is treated as error-free ("golden") data and the canonical phoneme sequence serves as its pseudo-label; a minimal loading sketch follows the column list below.
</p>
<p><strong>Columns:</strong></p>
<ul>
<li><code>audio</code>: The speech waveform.</li>
<li><code>sentence</code>: The original raw text.</li>
<li><code>tashkeel_sentence</code>: Fully diacritized text, generated using an internal SOTA diacritizer (assumed correct).</li>
<li><code>phoneme_ref</code>: The reference canonical phoneme sequence.</li>
<li><code>phoneme_mis</code>: The realized phoneme sequence.
<br><em>Note: Since no errors are present, this is identical to <code>phoneme_ref</code>.</em>
</li>
</ul>
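<p>
A minimal loading sketch using the Hugging Face <code>datasets</code> library is shown below; the split names are assumptions and should be checked against the dataset card.
</p>
<pre>
# Minimal sketch: load the native-speech training set.
# Split names ("train", "dev") are assumptions; check the dataset card.
from datasets import load_dataset

iqra = load_dataset("IqraEval/Iqra_train")
print(iqra)                                # available splits and sizes
sample = iqra["train"][0]
print(sample["tashkeel_sentence"])         # fully diacritized text
print(sample["phoneme_ref"])               # canonical phoneme sequence
</pre>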
<h3>2. Synthetic Mispronunciations (TTS)</h3>
<p>
<strong>Dataset:</strong> <code>IqraEval/Iqra_TTS</code><br>
<strong>Volume:</strong> ~80 hours<br>
To compensate for the lack of errors in the native set, we generated a synthetic dataset using various trained TTS systems. Mispronunciations were deliberately introduced into the input text before audio generation. A sketch for deriving per-phoneme training labels from this set follows the column list below.
</p>
<p><strong>Columns:</strong></p>
<ul>
<li><code>audio</code>: The synthesized waveform.</li>
<li><code>sentence_ref</code>: The original correct text.</li>
<li><code>sentence_mis</code>: The text containing deliberate errors.</li>
<li><code>phoneme_ref</code>: The canonical phoneme sequence of the correct text.</li>
<li><code>phoneme_aug</code>: The phoneme sequence corresponding to the synthesized mispronunciation.</li>
<li><code>tashkeel_sentence</code>: The fully diacritized version of the reference text.</li>
</ul>
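<p>
Since every TTS utterance carries both the canonical and the mispronounced phoneme sequences, per-phoneme error labels for training a detector can be derived by aligning the two columns. The sketch below is one possible way to do this (using <code>difflib</code>, as in the earlier example); it is not an official labeling script.
</p>
<pre>
# Minimal sketch: derive per-position error tags from phoneme_ref vs. phoneme_aug.
# The alignment choice (difflib) is an assumption, not the official recipe.
from difflib import SequenceMatcher

def error_tags(phoneme_ref, phoneme_aug):
    ref, aug = phoneme_ref.split(), phoneme_aug.split()
    tags = []
    for op, i1, i2, j1, j2 in SequenceMatcher(a=ref, b=aug).get_opcodes():
        if op == "equal":
            tags += [("correct", p, p) for p in ref[i1:i2]]
        elif op == "replace":  # unequal spans are truncated by zip in this sketch
            tags += [("substitution", r, a) for r, a in zip(ref[i1:i2], aug[j1:j2])]
        elif op == "delete":
            tags += [("deletion", p, None) for p in ref[i1:i2]]
        else:  # insert
            tags += [("insertion", None, p) for p in aug[j1:j2]]
    return tags

print(error_tags("b a b", "b aa b"))
# [('correct', 'b', 'b'), ('substitution', 'a', 'aa'), ('correct', 'b', 'b')]
</pre>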
<h3>3. Real Mispronunciations (Interspeech 2026)</h3>
<p>
<strong>Dataset:</strong> <code>IqraEval/Iqra_Extra_IS26</code><br>
<strong>Volume:</strong> ~2 hours<br>
Moving beyond synthetic data, this subset contains real recordings of human mispronunciations collected specifically for Interspeech 2026.
</p>
<p><strong>Columns:</strong></p>
<ul>
<li><code>audio</code>: The speech waveform.</li>
<li><code>sentence</code>: The original text.</li>
<li><code>phoneme_ref</code>: The target canonical phoneme sequence.</li>
<li><code>phoneme_mis</code>: The actual realized phonemes containing human errors.</li>
</ul>
<hr>
<h2>Evaluation Dataset</h2>
<p>
<strong>Dataset:</strong> <code>IqraEval/QuranMB.v2</code><br>
Currently, only the audio files are released for this evaluation set. It serves as a benchmark for detecting mispronunciations in a distinct domain.
</p>
<div style="background-color: #f0f4f8; padding: 15px; border-left: 5px solid #0056b3; margin-top: 20px;">
<strong>Important Note on Data Leakage:</strong><br>
Strict measures were taken to ensure experimental integrity. We have verified that there is <strong>no overlap in speakers or content</strong> (sentences) between the training datasets (<code>Iqra_train</code>, <code>Iqra_TTS</code>, <code>Iqra_Extra_IS26</code>) and the evaluation datasets.
</div>
<h2>Submission Details (Draft)</h2>
<p>
Submit a UTF-8 CSV named <code>teamID_submission.csv</code> with two columns:
</p>
<ul>
<li><strong>ID:</strong> audio filename (no extension)</li>
<li><strong>Labels:</strong> predicted phoneme sequence (space-separated)</li>
</ul>
<pre>ID,Labels
0000_0001, y a t a H a d d a ...
0000_0002, m a a n a n s a ...
...
</pre>
<p>
<strong>Note:</strong> submit a single plain CSV file (no archives) without extra spaces.
</p>
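<p>
As a sanity check of the format, the sketch below writes predictions into the expected two-column CSV. The <code>predict_phonemes</code> function is a placeholder for your own model, and the local audio layout is an assumption.
</p>
<pre>
# Minimal sketch: write the two-column submission CSV (ID, Labels).
# predict_phonemes() is a placeholder; the audio folder layout is assumed.
import csv
from pathlib import Path

def predict_phonemes(wav_path):
    # placeholder: return the predicted phonemes as one space-separated string
    raise NotImplementedError

rows = []
for wav in sorted(Path("eval_audio").glob("*.wav")):
    rows.append((wav.stem, predict_phonemes(wav)))   # ID = filename, no extension

with open("teamID_submission.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["ID", "Labels"])
    writer.writerows(rows)
</pre>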
<h2>Evaluation Criteria</h2>
<div style="background-color: #f0f8ff; border-left: 5px solid #007bff; padding: 15px; margin-bottom: 20px;">
<h3 style="margin-top: 0; color: #007bff;">🏆 Primary Metric</h3>
<p style="margin-bottom: 0;">
The Leaderboard is ranked primarily by the <strong>Phoneme-level F1-score</strong>.
While other metrics (FRR, FAR, DER) are computed for analysis, <strong>F1</strong> determines the final standing.
</p>
</div>
<p>
We use a hierarchical evaluation strategy (detection + diagnostic) based on the
<a href="https://arxiv.org/pdf/2310.13974" target="_blank">MDD Overview</a> framework.
</p>
<h3>1. Input Definitions</h3>
<ul>
<li><strong>What is said:</strong> The annotated phoneme sequence (what the speaker actually produced).</li>
<li><strong>What is predicted:</strong> The phoneme sequence output by your model.</li>
<li><strong>What should have been said:</strong> The canonical reference (target) sequence.</li>
</ul>
<h3>2. Confusion Matrix Components</h3>
<p>From the inputs above, we compute the following counts; a short computation sketch covering these counts and the derived rates appears at the end of this section:</p>
<table style="width: 100%; border-collapse: collapse; margin-bottom: 20px;">
<tr style="background-color: #f9f9f9; border-bottom: 1px solid #ddd;">
<td style="padding: 8px;"><strong>TA (True Accept)</strong></td>
<td style="padding: 8px;">Correct phonemes properly accepted.</td>
</tr>
<tr style="border-bottom: 1px solid #ddd;">
<td style="padding: 8px;"><strong>TR (True Reject)</strong></td>
<td style="padding: 8px;">Mispronunciations correctly detected.</td>
</tr>
<tr style="background-color: #f9f9f9; border-bottom: 1px solid #ddd;">
<td style="padding: 8px;"><strong>FR (False Reject)</strong></td>
<td style="padding: 8px;">Correct phonemes incorrectly flagged as errors.</td>
</tr>
<tr>
<td style="padding: 8px;"><strong>FA (False Accept)</strong></td>
<td style="padding: 8px;">Mispronunciations missed (labeled as correct).</td>
</tr>
</table>
<h3>3. Calculated Metrics</h3>
<h4>Detection Metrics (Leaderboard Ranking)</h4>
<ul>
<li><strong>Precision:</strong> TR / (TR + FR)</li>
<li><strong>Recall:</strong> TR / (TR + FA)</li>
<li><strong>F1-Score:</strong> 2 · (Precision · Recall) / (Precision + Recall)</li>
</ul>
<h4>Diagnostic Rates (Auxiliary)</h4>
<ul>
<li><strong>FRR (False Reject Rate):</strong> FR / (TA + FR)</li>
<li><strong>FAR (False Accept Rate):</strong> FA / (FA + TR)</li>
<li><strong>DER (Diagnostic Error Rate):</strong> DE / (CD + DE), where, among the correctly detected mispronunciations (TR), CD counts those whose error is diagnosed correctly and DE counts those diagnosed incorrectly.</li>
</ul>
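<p>
The sketch below shows how these counts and rates can be computed once the three sequences have been aligned position by position; the alignment step itself is omitted, and the official scoring script may differ in detail.
</p>
<pre>
# Minimal sketch: compute TA/TR/FR/FA and the derived rates from three
# position-aligned sequences. Not the official scoring script.
def mdd_metrics(reference, annotated, predicted):
    ta = tr = fr = fa = cd = de = 0
    for ref, ann, pred in zip(reference, annotated, predicted):
        if ann == ref:              # phoneme was pronounced correctly
            if pred == ref:
                ta += 1             # true accept
            else:
                fr += 1             # false reject
        else:                       # phoneme was mispronounced
            if pred == ref:
                fa += 1             # false accept (error missed)
            else:
                tr += 1             # true reject (error detected)
                if pred == ann:
                    cd += 1         # correct diagnosis
                else:
                    de += 1         # diagnosis error
    precision = tr / (tr + fr) if tr + fr else 0.0
    recall = tr / (tr + fa) if tr + fa else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "Precision": precision, "Recall": recall, "F1": f1,
        "FRR": fr / (ta + fr) if ta + fr else 0.0,
        "FAR": fa / (fa + tr) if fa + tr else 0.0,
        "DER": de / (cd + de) if cd + de else 0.0,
    }
</pre>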
<h2>Suggested Research Directions</h2>
<ol>
<li>
<strong>Advanced Mispronunciation Detection Models</strong><br>
Apply state-of-the-art self-supervised models (e.g., wav2vec 2.0, HuBERT), preferring variants that are pre-trained or fine-tuned on Arabic speech. These models can then be fine-tuned on the MSA datasets above to improve phoneme-level accuracy; a minimal setup sketch follows this list.
</li>
<li>
<strong>Data Augmentation Strategies</strong><br>
Create synthetic mispronunciation examples using pipelines like
<a href="https://arxiv.org/abs/2211.00923" target="_blank">SpeechBlender</a>.
Augmenting limited Arabic speech data helps mitigate data scarcity and improves model robustness.
</li>
<li>
<strong>Analysis of Common Mispronunciation Patterns</strong><br>
Perform statistical analysis on the MSA-Test dataset to identify prevalent errors (e.g., substituting similar phonemes, swapping vowels).
These insights can drive targeted training and tailored feedback rules.
</li>
</ol>
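<p>
As a starting point for the first direction above, the sketch below sets up a self-supervised encoder with a CTC head for phoneme recognition using the Hugging Face <code>transformers</code> library. The checkpoint name is only an example, the phoneme vocabulary size is a placeholder, and the tokenizer and training loop are omitted.
</p>
<pre>
# Minimal sketch: wav2vec 2.0 encoder with a CTC head for phoneme recognition.
# Checkpoint and vocabulary size are placeholders; training loop omitted.
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2FeatureExtractor

NUM_PHONEMES = 60  # placeholder: size of the challenge phoneme inventory + blank

model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-xls-r-300m",   # multilingual SSL checkpoint (example)
    vocab_size=NUM_PHONEMES,
    ctc_loss_reduction="mean",
)
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-xls-r-300m")

# Dummy forward pass on one second of 16 kHz audio
inputs = extractor(torch.zeros(16000).numpy(), sampling_rate=16000, return_tensors="pt")
logits = model(inputs.input_values).logits   # (batch, frames, NUM_PHONEMES)
print(logits.shape)
</pre>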
<h2>Registration</h2>
<p>
Teams and individual participants must register to gain access to the test set. Please complete the registration form using the link below:
</p>
<p>
<a href="https://docs.google.com/forms/d/e/1FAIpQLSdDyEP7vzJnpvthiEK6WPws2vpuI_yqbzOzEVqHKs0wdDY_Lg/viewform?usp=header" target="_blank">Registration Form</a>
</p>
<p>
Registration opens on December 1, 2025.
</p>
<h2>Future Updates</h2>
<p>
Further details on the open-set leaderboard submission will be posted on the shared task website (December 15, 2025). Stay tuned!
</p>
<h2>Contact and Support</h2>
<p>
For inquiries and support, reach out to the task coordinators.
</p>
<h2>References</h2>
<ul>
<li>El Kheir Y. et al., “SpeechBlender: Speech Augmentation Framework for Mispronunciation Data Generation,” arXiv:2211.00923, 2022.</li>
<li>Aly S. A. et al., “ASMDD: Arabic Speech Mispronunciation Detection Dataset,” arXiv:2111.01136, 2021.</li>
<li>Moustafa A. &amp; Aly S. A., “Efficient Voice Identification Using Wav2Vec2.0 and HuBERT…,” arXiv:2111.06331, 2021.</li>
<li>El Kheir Y. et al., “Automatic Pronunciation Assessment – A Review,” arXiv:2310.13974, 2023.</li>
</ul>
</div>
</body>
</html>