bayanat is a simple library for gathering statistics about Arabic text.
from bayanat import Bayanat
dataset = Bayanat(path)

Functions
- get_top_freq_words: retrieves the n most frequent words.
- get_top_freq_chars: retrieves the n most frequent characters.
- get_largest_word: gets the longest word in the corpus.
- get_top_longest_words: gets the longest words in the corpus.
- sample_words_by_char: samples words by character.
- sample_random_sentence: samples a sentence of a given size.
- get_ratio_of_non_arabic: shows the percentage of non-Arabic characters.
- get_ratio_of_english: shows the percentage of English characters.
- get_ratio_of_arabic: shows the percentage of Arabic characters.
- get_stats: prints the number of characters, words, and lines.
- get_size_vocab: gets the number of unique words in the text.
- plot_top_freq_words: plots the n most frequent words as a bar graph.
- plot_top_freq_chars: plots the n most frequent characters as a bar graph.
- plot_embeddings: given some words, plots them using embeddings. This uses the AraVec model.
- plot_word_cloud: plots the word cloud of a given text.
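
To make a few of these statistics concrete, the following is a minimal standard-library sketch of what "most frequent words", "vocabulary size", and "percentage of Arabic characters" could mean. It does not use bayanat and is not its implementation. The function names, the whitespace tokenization, and the choice of the Unicode block U+0600 to U+06FF as "Arabic" are all assumptions made for illustration.

```python
from collections import Counter


def top_freq_words(text, n):
    """Return the n most frequent whitespace-separated words with their counts."""
    return Counter(text.split()).most_common(n)


def vocab_size(text):
    """Return the number of unique whitespace-separated words."""
    return len(set(text.split()))


def ratio_of_arabic(text):
    """Return the percentage of non-whitespace characters in U+0600..U+06FF.

    Assumption: only the basic Arabic Unicode block counts as Arabic here.
    """
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    arabic = sum(1 for c in chars if "\u0600" <= c <= "\u06FF")
    return 100.0 * arabic / len(chars)


sample = "السلام عليكم hello السلام"
print(top_freq_words(sample, 1))
print(vocab_size(sample))
print(ratio_of_arabic("اب ab"))
```

A real corpus would likely need more careful tokenization (punctuation, diacritics, extended Arabic blocks), which this sketch deliberately ignores.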
Run directly on Colab.
This is an open source project, and contributions from the community are welcome.
MIT license.
@misc{bayanat2020,
author = {Zaid Alyafeai and Maged Saeed},
title = {bayanat: Statistics of Arabic Text.},
year = {2020},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/ARBML/bayanat}}
}