Project Vaani — Odia Subset
Vaani is a large-scale Indian speech project by ARTPARK and IISc Bangalore, funded by Google, aiming to collect
over 150,000 hours of speech across all Indian districts. The Odia subset contains spontaneous speech recordings
from speakers across Odisha, transcribed in native Odia script. English loanwords within an utterance are enclosed in
curly braces ({}) in the transcripts.
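
The brace convention can be handled with a small amount of preprocessing. The following is a minimal sketch, not part of the Vaani tooling: the function names are hypothetical, the sample sentence is made up for illustration, and it assumes braces are never nested. Whether the bracketed text appears in Latin or Odia script is not stated here, so the pattern accepts either.

```python
import re

# Matches one {...} span. Assumption: braces are never nested in a transcript.
LOANWORD = re.compile(r"\{([^{}]*)\}")


def extract_loanwords(transcript: str) -> list[str]:
    """Return the text of every brace-marked span, in order of appearance."""
    return [span.strip() for span in LOANWORD.findall(transcript)]


def strip_markers(transcript: str) -> str:
    """Remove the braces but keep the loanword text in place."""
    return LOANWORD.sub(lambda m: m.group(1), transcript)


# Made-up sample line, for illustration only (not taken from the dataset).
sample = "ମୁଁ {school} କୁ ଯାଉଛି"
loanwords = extract_loanwords(sample)
plain = strip_markers(sample)
```

Keeping the two operations separate lets a training pipeline choose between retaining the markers, dropping only the braces, or analysing loanword frequency on its own.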
JAGA maintains a cleaned extraction of the Odia subset from which 21 utterances transcribed in a non-Odia script (Bengali, Assamese, or Devanagari) were removed after a script-level audit. The cleaned version is hosted on Hugging Face with gated access to comply with the original CC-BY-NC 4.0 license.
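
A script-level audit of this kind can be approximated by checking each transcript's code points against Unicode block ranges. The sketch below is an assumption about how such a check might look, not JAGA's actual audit procedure; `foreign_scripts` and the block table are hypothetical names. Bengali and Assamese share one Unicode block, so they are reported together. The danda characters U+0964 and U+0965 sit in the Devanagari block but are used as punctuation across Indic scripts, including Odia, so they are excluded from the check.

```python
# Unicode block ranges for scripts treated as contamination in Odia transcripts.
FOREIGN_BLOCKS = {
    "Devanagari": range(0x0900, 0x0980),
    "Bengali/Assamese": range(0x0980, 0x0A00),
}

# Danda and double danda: encoded in the Devanagari block but shared by
# many Indic scripts, so they must not count as contamination.
SHARED_PUNCTUATION = {0x0964, 0x0965}


def foreign_scripts(transcript: str) -> set[str]:
    """Return the names of the non-Odia blocks that occur in the transcript."""
    found = set()
    for ch in transcript:
        cp = ord(ch)
        if cp in SHARED_PUNCTUATION:
            continue
        for name, block in FOREIGN_BLOCKS.items():
            if cp in block:
                found.add(name)
    return found


def is_clean(transcript: str) -> bool:
    """True when no character falls in a flagged foreign block."""
    return not foreign_scripts(transcript)
```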
| Property | Value |
|---|---|
| Task | Automatic Speech Recognition (ASR) |
| Type | Spontaneous speech with native-script transcriptions |
| Splits | Train, Validation, Test |
| Total size | ~55.6 GB |
| License | CC-BY-NC 4.0 (Non-Commercial) |
| Dialect coverage | District-level tagging (varies by collection region) |