diff --git a/datasets/ivrit-ai-audio-v2.yaml b/datasets/ivrit-ai-audio-v2.yaml
new file mode 100644
index 000000000..8624d1746
--- /dev/null
+++ b/datasets/ivrit-ai-audio-v2.yaml
@@ -0,0 +1,37 @@
+Name: ivrit-ai Hebrew Audio v2 Dataset
+Description: >
+  The ivrit-ai audio-v2 dataset is a curated collection of Hebrew speech recordings and metadata designed to advance speech recognition and AI research using high-quality, crowd-sourced and/or institutional audio. Contact ivrit.ai for information about its composition and source domains.
+Documentation: https://huggingface.co/datasets/ivrit-ai/audio-v2
+Contact: info@ivrit.ai
+ManagedBy: ivrit.ai
+UpdateFrequency: Updated several times per year
+Tags:
+  - natural language processing
+  - automatic speech recognition
+  - speech processing
+License: >
+  ivrit.ai license (modified CC-BY, permitting use for training AI models only and prohibiting deepfake generation; see https://www.ivrit.ai/en/license-faqs/ for full terms)
+Citation: >
+  If you use this dataset, cite:
+  Marmor, Yanir and Lifshitz, Yair and Snapir, Yoad and Misgav, Kinneret (2025). Building an Accurate Open-Source Hebrew ASR System through Crowdsourcing. Proc. Interspeech 2025, pp. 723–727.
+  "[ivrit-ai Hebrew Audio v2 Dataset] was accessed on [DATE] at registry.opendata.aws/ivrit-ai-audio-v2"
+Resources:
+  - Description: "Hebrew speech audio and aligned metadata in plain text and other formats. Data is available via Hugging Face Datasets. Contact ivrit.ai for bulk/alternative access methods."
+    ARN: ""
+    Region: ""
+    Type: "External Resource"
+    Explore:
+      - "https://huggingface.co/datasets/ivrit-ai/audio-v2"
+DataAtWork:
+  Tutorials:
+    - Title: "Building an Accurate Open-Source Hebrew ASR System through Crowdsourcing"
+      URL: https://www.isca-archive.org/interspeech_2025/marmor25_interspeech.pdf
+      AuthorName: Marmor, Yanir et al.
+  Tools & Applications: []
+  Publications:
+    - Title: "Building an Accurate Open-Source Hebrew ASR System through Crowdsourcing"
+      URL: https://www.isca-archive.org/interspeech_2025/marmor25_interspeech.pdf
+      AuthorName: Marmor, Yanir; Lifshitz, Yair; Snapir, Yoad; Misgav, Kinneret
+ADXCategories:
+  - Language
+  - Speech
diff --git a/datasets/ivrit-ai-crowdtranscribe.yaml b/datasets/ivrit-ai-crowdtranscribe.yaml
new file mode 100644
index 000000000..267942e66
--- /dev/null
+++ b/datasets/ivrit-ai-crowdtranscribe.yaml
@@ -0,0 +1,38 @@
+Name: ivrit-ai Crowd-Transcribe Hebrew Speech Dataset
+Description: >
+  The ivrit-ai Crowd-Transcribe v5 dataset is a comprehensive Hebrew speech dataset contributed and vetted by volunteers, designed to support the development of open-source Hebrew ASR systems and other language technologies. It is available solely for training AI models under the ivrit.ai license, which prohibits all other uses, including deepfake creation. The dataset enables robust Hebrew speech-to-text and downstream research.
+Documentation: https://huggingface.co/datasets/ivrit-ai/crowd-transcribe-v5
+Contact: info@ivrit.ai
+ManagedBy: ivrit.ai
+UpdateFrequency: Updated several times per year
+Tags:
+  - natural language processing
+  - automatic speech recognition
+  - speech processing
+License: >
+  ivrit.ai license (modified CC-BY, permitting use for training AI models only and prohibiting deepfake generation; see https://www.ivrit.ai/en/license-faqs/ for full terms)
+Citation: >
+  If you use this dataset, cite:
+  Marmor, Yanir and Lifshitz, Yair and Snapir, Yoad and Misgav, Kinneret (2025). Building an Accurate Open-Source Hebrew ASR System through Crowdsourcing. Proc. Interspeech 2025, pp. 723–727.
+  "[ivrit-ai Crowd-Transcribe Hebrew Speech Dataset] was accessed on [DATE] at registry.opendata.aws/ivrit-ai-crowdtranscribe"
+Resources:
+  - Description: "Hebrew crowd-sourced transcribed speech audio and aligned metadata in plain text and other formats. Data is available via Hugging Face Datasets. Contact ivrit.ai for bulk/alternative access methods."
+    ARN: ""
+    Region: ""
+    Type: "External Resource"
+    Explore:
+      - "https://huggingface.co/datasets/ivrit-ai/crowd-transcribe-v5"
+DataAtWork:
+  Tutorials:
+    - Title: "Building an Accurate Open-Source Hebrew ASR System through Crowdsourcing"
+      URL: https://www.isca-archive.org/interspeech_2025/marmor25_interspeech.pdf
+      AuthorName: Marmor, Yanir et al.
+  Tools & Applications: []
+  Publications:
+    - Title: "Building an Accurate Open-Source Hebrew ASR System through Crowdsourcing"
+      URL: https://www.isca-archive.org/interspeech_2025/marmor25_interspeech.pdf
+      AuthorName: Marmor, Yanir; Lifshitz, Yair; Snapir, Yoad; Misgav, Kinneret
+
+ADXCategories:
+  - Language
+  - Speech
diff --git a/datasets/ivrit-ai-knesset-plenums.yaml b/datasets/ivrit-ai-knesset-plenums.yaml
new file mode 100644
index 000000000..6de7be746
--- /dev/null
+++ b/datasets/ivrit-ai-knesset-plenums.yaml
@@ -0,0 +1,37 @@
+Name: ivrit-ai Knesset Plenum Transcriptions Dataset
+Description: >
+  The ivrit-ai Knesset Plenum Transcriptions dataset comprises aligned Hebrew speech and transcriptions from Israeli Knesset parliamentary plenary sessions. The dataset supports research on parliamentary speech, political discourse, and automatic speech recognition.
+Documentation: https://huggingface.co/datasets/ivrit-ai/knesset-plenums
+Contact: info@ivrit.ai
+ManagedBy: ivrit.ai
+UpdateFrequency: Updated several times per year
+Tags:
+  - natural language processing
+  - automatic speech recognition
+  - speech processing
+License: >
+  ivrit.ai license (modified CC-BY, permitting use for training AI models only and prohibiting deepfake generation; see https://www.ivrit.ai/en/license-faqs/ for full terms)
+Citation: >
+  If you use this dataset, cite:
+  Marmor, Yanir and Lifshitz, Yair and Snapir, Yoad and Misgav, Kinneret (2025). Building an Accurate Open-Source Hebrew ASR System through Crowdsourcing. Proc. Interspeech 2025, pp. 723–727.
+  "[ivrit-ai Knesset Plenum Transcriptions Dataset] was accessed on [DATE] at registry.opendata.aws/ivrit-ai-knesset-plenums"
+Resources:
+  - Description: "Hebrew Knesset plenum audio and transcriptions, with aligned metadata. Access via Hugging Face Datasets, or contact ivrit.ai for bulk access."
+    ARN: ""
+    Region: ""
+    Type: "External Resource"
+    Explore:
+      - "https://huggingface.co/datasets/ivrit-ai/knesset-plenums"
+DataAtWork:
+  Tutorials:
+    - Title: "Building an Accurate Open-Source Hebrew ASR System through Crowdsourcing"
+      URL: https://www.isca-archive.org/interspeech_2025/marmor25_interspeech.pdf
+      AuthorName: Marmor, Yanir et al.
+  Tools & Applications: []
+  Publications:
+    - Title: "Building an Accurate Open-Source Hebrew ASR System through Crowdsourcing"
+      URL: https://www.isca-archive.org/interspeech_2025/marmor25_interspeech.pdf
+      AuthorName: Marmor, Yanir; Lifshitz, Yair; Snapir, Yoad; Misgav, Kinneret
+ADXCategories:
+  - Language
+  - Speech