An Open Record of Odia Speech Data

JAGA catalogues every publicly available speech dataset for the Odia language and coordinates community-driven collection of dialect recordings across Odisha's 30 districts.

Open source under CC-BY 4.0

We need Odia speakers from every district.

Odia is spoken by over 35 million people, yet no publicly available corpus captures its full dialectal range. The Sambalpuri spoken in Bargarh is not the Ganjami of Berhampur, and neither matches the Baleswari of the northern coast. These varieties are under-documented and at risk of being flattened into a single "Standard Odia" by speech systems trained on limited data.

JAGA is building the first district-level dialect registry for Odia. If you are a native speaker, a linguist, a student, or simply someone who cares about preserving how your region speaks — we want to hear from you.

How it works: fill in the interest form and tell us your district and background. We will follow up with recording guidelines, prompts, and metadata templates tailored to your region.

~2 minutes to fill

Open Interest Form

We match contributors to districts. Any Odia dialect, any background.

What JAGA Does

JAGA (ଜଗା) is what people in Odisha call Lord Jagannath — the one who sees everything. The name, and the circular eyes in the logo, carry that idea: nothing about the language should go unobserved. JAGA is a research initiative focused exclusively on the Odia language. It operates as two things:

  1. A dataset registry — a single, maintained catalogue of every known Odia speech corpus, with verified metadata: size, license, domain, dialect coverage, recording conditions, and known limitations. Researchers working on Odia ASR, TTS, or NLU should not have to rediscover what already exists.
  2. A dialect collection effort — an open initiative to record speech samples across Odisha's dialect zones, starting from districts that have zero representation in existing corpora. The goal is spontaneous, natural speech — not read prompts — annotated with district, speaker metadata, and transcribed in native Odia script.
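A registry entry carrying the metadata fields listed in item 1 could be modelled roughly as follows. This is a minimal sketch: the class name `RegistryEntry` and its field names are assumptions made for illustration, not JAGA's published schema. The example values are taken from the MUCS 2021 entry described further down this page.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class RegistryEntry:
    """Hypothetical record for one corpus in the registry."""
    name: str
    size: str                  # free text, e.g. hours or gigabytes
    license: str
    domains: list[str]
    dialect_coverage: str
    recording_conditions: str
    known_limitations: list[str] = field(default_factory=list)


entry = RegistryEntry(
    name="MUCS 2021 Odia (OpenSLR 103)",
    size="94.54 hrs train / 5.49 hrs test",
    license="Microsoft Research Open Data License",
    domains=["Agriculture", "Healthcare", "Finance"],
    dialect_coverage="Sambalpur, Mayurbhanj, Puri, Koraput",
    recording_conditions="Read speech, 8 kHz, 16-bit",
    known_limitations=["Read prompts only", "4 of 30 districts"],
)

# Serialise to JSON; ensure_ascii=False keeps any Odia text readable.
print(json.dumps(asdict(entry), ensure_ascii=False, indent=2))
```

Keeping `known_limitations` as a list lets each limitation be added or reviewed on its own.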

The project is Odia-first. We intend to extend to other Indic languages in the future, but only after the Odia pipeline — collection, cleaning, analysis, and open release — is established and proven.

Odisha's Dialect Landscape

Odisha's 30 districts span at least six recognized dialect zones. Standard Odia (Mughalbandi) dominates official and literary use, but the western Sambalpuri, southern Ganjami, northern Baleswari, and southwestern Desia varieties each carry distinct phonetics, vocabulary, and verb morphology. Hover over any district for details.


Sources: Census of India (2011), Linguistic Survey of India, CIIL Mysore, Grierson's LSI Vol. V. Boundaries are approximate and may overlap.

Odia Speech Datasets

These are the publicly available Odia speech corpora we have verified and used. Each entry includes provenance, scope, and known limitations. Additional datasets will be added only after independent review.

Project Vaani — Odia Subset

Tags: Speech, Dialect-tagged, Public (Gated)

Vaani is a large-scale Indian speech project by ARTPARK and IISc Bangalore, funded by Google, aiming to collect over 150,000 hours of speech across all Indian districts. The Odia subset contains spontaneous speech recordings from speakers across Odisha, transcribed in native Odia script. English loanwords in utterances are marked in {} brackets within transcripts.
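The `{}` loanword markup can be handled with a short regular expression. The sketch below assumes each marked span sits inside a single, un-nested pair of braces. The function names and the sample sentence are invented for illustration, and nothing is assumed about whether the bracketed words in the real data appear in Latin or Odia script.

```python
import re

# Assumption: one pair of braces per marked span, never nested.
LOANWORD = re.compile(r"\{([^{}]*)\}")


def extract_loanwords(transcript: str) -> list[str]:
    """Return the text of every {...} span, in order of appearance."""
    return [span.strip() for span in LOANWORD.findall(transcript)]


def strip_markers(transcript: str) -> str:
    """Remove the braces but keep the words they enclose."""
    return LOANWORD.sub(lambda m: m.group(1), transcript)


sample = "ମୁଁ {hospital} ଯାଉଛି"  # invented example, not taken from Vaani
print(extract_loanwords(sample))
print(strip_markers(sample))
```

Stripping the markers before training keeps the loanwords in the transcript. Extracting them separately gives a simple way to measure how much code-mixing a split contains.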

JAGA maintains a cleaned extraction of the Odia subset with 21 non-Odia-script utterances removed (Bengali/Assamese/Devanagari script contamination identified via a script-level audit). The cleaned version is hosted on Hugging Face with gated access to comply with the original CC-BY-NC 4.0 license.
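A script-level audit of this kind can be approximated by checking each character's Unicode block. The sketch below is one possible approach, not JAGA's actual audit code; the function names are hypothetical. It relies on the Devanagari block being U+0900 to U+097F and the Bengali block (shared by Bengali and Assamese) being U+0980 to U+09FF. The danda (U+0964) and double danda (U+0965) are skipped because they sit in the Devanagari block but serve as punctuation in Odia text too.

```python
# Unicode block ranges (inclusive). Bengali and Assamese share one block.
FOREIGN_BLOCKS = {
    "Devanagari": (0x0900, 0x097F),
    "Bengali-Assamese": (0x0980, 0x09FF),
}

# Danda and double danda: encoded in the Devanagari block, but used as
# sentence punctuation across Indic scripts, including Odia.
SHARED_PUNCTUATION = {"\u0964", "\u0965"}


def foreign_scripts(text: str) -> set[str]:
    """Return the names of non-Odia Indic blocks found in the text."""
    found = set()
    for ch in text:
        if ch in SHARED_PUNCTUATION:
            continue
        cp = ord(ch)
        for name, (lo, hi) in FOREIGN_BLOCKS.items():
            if lo <= cp <= hi:
                found.add(name)
    return found


def is_clean(text: str) -> bool:
    """True when no Devanagari or Bengali-Assamese characters appear."""
    return not foreign_scripts(text)


print(is_clean("ଓଡ଼ିଆ ଭାଷା\u0964"))
print(foreign_scripts("ଓଡ଼ିଆ বাংলা"))
```

Flagged utterances would still need manual review, since a block check cannot tell a transcription error from a deliberate quotation.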

Task: Automatic Speech Recognition (ASR)
Type: Spontaneous speech with native-script transcriptions
Splits: Train, Validation, Test
Total size: ~55.6 GB
License: CC-BY-NC 4.0 (Non-Commercial)
Dialect coverage: District-level tagging (varies by collection region)

MUCS 2021 Odia — OpenSLR 103

Tags: Speech, Public

This read-speech corpus was created for the MUCS 2021 (Multilingual and Code-Switching ASR) challenge and is also distributed through the Vakyansh project. Speech was collected on-field from four districts representing distinct dialect zones: Sambalpur (North-Western Odia), Mayurbhanj (North-Eastern), Puri (Central / Standard), and Koraput (Southern). Collection focused on three applied domains: agriculture, healthcare, and finance — recorded from farmers, medical staff, and bank employees respectively.

Task: Automatic Speech Recognition (ASR)
Type: Read speech (prompted sentences)
Train: 94.54 hrs / 820 unique sentences
Test: 5.49 hrs / 65 unique sentences (non-overlapping)
Sampling: 8 kHz, 16-bit encoding
Vocabulary: 1,644 unique words
Domains: Agriculture, Healthcare, Finance
Districts: Sambalpur, Mayurbhanj, Puri, Koraput
License: Microsoft Research Open Data License

What is missing

Between these two corpora, only 4 of Odisha's 30 districts have explicit dialect representation in the read-speech data (OpenSLR 103). Vaani covers more ground, but its transcriptions are limited to a subset of the full audio. No corpus provides spontaneous conversational speech with systematic district-level coverage, and scientific, legal, and literary domains remain entirely absent. These are the gaps JAGA intends to address, starting with dialect-tagged spontaneous recordings from the remaining 26 districts.
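The count of 26 remaining districts follows from a simple set difference, sketched below. The district spellings are an assumption: several districts have common variant names (for example Balasore/Baleshwar, Kendujhar/Keonjhar, Subarnapur/Sonepur), so real metadata would need normalising before a comparison like this.

```python
# Odisha's 30 districts (one common English spelling for each).
ODISHA_DISTRICTS = [
    "Angul", "Balangir", "Balasore", "Bargarh", "Bhadrak", "Boudh",
    "Cuttack", "Deogarh", "Dhenkanal", "Gajapati", "Ganjam",
    "Jagatsinghpur", "Jajpur", "Jharsuguda", "Kalahandi", "Kandhamal",
    "Kendrapara", "Kendujhar", "Khordha", "Koraput", "Malkangiri",
    "Mayurbhanj", "Nabarangpur", "Nayagarh", "Nuapada", "Puri",
    "Rayagada", "Sambalpur", "Subarnapur", "Sundargarh",
]

# Districts explicitly represented in OpenSLR 103, as listed above.
COVERED = {"Sambalpur", "Mayurbhanj", "Puri", "Koraput"}

remaining = sorted(set(ODISHA_DISTRICTS) - COVERED)
print(len(remaining), "districts without explicit representation:")
print(", ".join(remaining))
```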

If you know of additional verified Odia speech datasets not listed here, open an issue on GitHub. We will review and add them after independent verification.

Roadmap

JAGA is at an early stage. The following items are planned and will be published as they are completed.

Corpus-level Analysis

Per-dataset statistics: vocabulary size, OOV rates, domain distribution, speaker demographics, and transcript quality metrics.

Coming soon
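As an illustration of two of the planned statistics, vocabulary size and out-of-vocabulary (OOV) rate can be computed as sketched below. This is not the project's analysis code. It assumes whitespace tokenisation and defines the OOV rate at token level, as the share of test tokens never seen in training; a type-level definition is equally common. The function names and the placeholder sentences are invented.

```python
def vocabulary(sentences: list[str]) -> set[str]:
    """Unique whitespace-separated tokens across all sentences."""
    return {tok for s in sentences for tok in s.split()}


def oov_rate(train: list[str], test: list[str]) -> float:
    """Fraction of test tokens that do not occur in the training set."""
    vocab = vocabulary(train)
    test_tokens = [tok for s in test for tok in s.split()]
    if not test_tokens:
        return 0.0
    unseen = sum(1 for tok in test_tokens if tok not in vocab)
    return unseen / len(test_tokens)


# Placeholder tokens standing in for real transcripts.
train = ["a b c", "b c d"]
test = ["a d e", "e f"]

print(len(vocabulary(train)))   # 4 unique training tokens
print(oov_rate(train, test))    # 3 of 5 test tokens are unseen: 0.6
```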

District Coverage Tracking

An interactive overlay on the dialect map showing which districts have recordings, who is contributing, and what remains uncollected.

Coming soon

Collection Protocol

Standardised recording guidelines: device requirements, prompt design for spontaneous speech, metadata schema, and consent forms.

Coming soon