Commit 3ffe066

update
1 parent 721161a commit 3ffe066

5 files changed

+13
-21
lines changed

docs/src/index.md

Lines changed: 0 additions & 11 deletions
Original file line number | Diff line number | Diff line change
@@ -25,15 +25,4 @@ julia> Pkg.add("Yunir")
2525
doi = {10.5281/zenodo.6629868},
2626
url = {https://doi.org/10.5281/zenodo.6629868}
2727
}
28-
```
29-
## Outline
30-
```@contents
31-
Pages = [
32-
"man/basic_utilities.md",
33-
"man/orthography.md",
34-
"man/qurantree.md",
35-
"man/api.md",
36-
"man/references.md",
37-
]
38-
Depth = 2
3928
```

docs/src/man/basic_utilities.md

Lines changed: 6 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -2,7 +2,7 @@ Basic Utilities
22
=====
33
In this section, we are going to discuss how to use the APIs for dediacritization, normalization, and transliteration.
44
## Dediacritization
5-
The function to use is `dediac` which works on either Arabic, Buckwalter or custom transliterated characters.
5+
Dediacritization is the process of removing diacritics from an Arabic word. These diacritics are mostly vowels but also include _sukuun_ سُكُون and _shaddah_ شَدّة. The function to use for dediacritization is `dediac`, which works on Arabic, Buckwalter, or custom transliterated characters.
66
```@repl abc
77
using Yunir
88
@transliterator :default
@@ -15,20 +15,23 @@ Or using Buckwalter as follows:
1515
bw_basmala = "bisomi {ll~ahi {lr~aHoma`ni {lr~aHiymi";
1616
dediac(bw_basmala; isarabic=false)
1717
```
18+
Setting `isarabic=false` tells `dediac` that the input, `bw_basmala`, is Buckwalter-encoded. The output is then also returned in Buckwalter form rather than in Arabic script (as in the previous example).
19+
1820
With Julia's broadcasting feature, the above dediacritization can be applied to arrays by simply adding `.` to the name of the function.
1921
```@repl abc
2022
sentence0 = ["بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ",
2123
"إِيَّاكَ نَعْبُدُ وَإِيَّاكَ نَسْتَعِينُ"
2224
]
2325
dediac.(sentence0)
2426
```
27+
As seen above, broadcasting applies the `dediac` function to each element of the vector `sentence0`. Because `sentence0` has two entries, `dediac` is applied to each of them, and two outputs are returned.
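The element-wise behavior of broadcasting can be sketched with Base Julia alone. The helper `strip_marks` below is hypothetical and is not how `dediac` is implemented; it only assumes that the common diacritics occupy the Unicode range U+064B to U+0652, so marks outside that range (such as the dagger alif, U+0670) are left in place.

```julia
# Hypothetical helper, not Yunir's `dediac`: drop code points in the
# Arabic diacritic range U+064B..U+0652 (fathatan through sukun).
strip_marks(s::AbstractString) = filter(c -> !('\u064b' <= c <= '\u0652'), s)

words = ["بِسْمِ", "نَعْبُدُ"]

# The dot applies `strip_marks` to each element and collects the results
# into a new vector of the same length.
stripped = strip_marks.(words)

# Broadcasting over a vector gives the same result as mapping over it.
same_as_map = stripped == map(strip_marks, words)
```

The same dot syntax works for any function, which is why `dediac.(sentence0)` needs no special array method.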
2528
## Normalization
26-
The function to use is `normalize`, which works on either Arabic, Buckwalter or custom transliterated characters. For example, using the `ar_basmala` and `bw_basmala` defined above, the normalized version would be
29+
Arabic script is calligraphic by design. Its free-flowing design makes it flexible enough to form unique ligatures, which may require normalization for consistency's sake when doing natural language processing. To do normalization, the function to use is `normalize`, which works on Arabic, Buckwalter, or custom transliterated characters. For example, using the `ar_basmala` and `bw_basmala` defined above, the normalized version would be
2730
```@repl abc
2831
normalize(ar_basmala)
2932
normalize(bw_basmala; isarabic=false)
3033
```
31-
You can also normalize specific characters, for example:
34+
Again, `isarabic=false` simply disables the Arabic output and encodes the result in Buckwalter instead. You can also normalize specific characters, for example:
3235
```@repl abc
3336
normalize(ar_basmala, :alif_khanjareeya)
3437
normalize(ar_basmala, :hamzat_wasl)

docs/src/man/orthography.md

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -14,7 +14,7 @@ If we want to take the numerals, we need to tokenize it first.
1414
```@repl abc2
1515
arb_token = tokenize(ar_basmala)
1616
```
17-
Next we then parse each of these words as `Orthography`.
17+
Next, we parse each of these words as `Orthography`.
1818
```@repl abc2
1919
arb_parsed1 = parse(Orthography, arb_token[1])
2020
arb_parsed2 = parse.(Orthography, arb_token)
@@ -41,7 +41,7 @@ vocals(arb_parsed2[3])
4141
```
4242

4343
## Simple Encoding
44-
Simple encoding is a worded or spelled out transliteration of the arabic text.
44+
Simple encoding is a worded, or spelled-out, transliteration of an Arabic text.
4545
```@repl abc2
4646
parse(SimpleEncoding, ar_basmala)
4747
```

docs/src/man/rhythmic_analysis.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
Rhythmic Analysis
22
=============
3-
The prevalence of poetry in Arabic literature necessitates scientific tool to study the rhythmic signatures. Unfortunately, there are no resources for such methodology until the recent work of [asaadthesis](@citet). This section will demonstrate the APIs for doing rhythmic analysis based on the methodologies proposed by [asaadthesis](@citet). To do this, there are two types of text that will be studied, and these are pre-Islamic poetry and the Holy Qur'an.
3+
The prevalence of poetry in Arabic literature necessitates a scientific tool for studying its rhythmic signatures. Unfortunately, to the best of the author's knowledge, no such tools exist yet, and until recently there were no resources on statistical methodologies for studying rhythm either. The recent work of [asaadthesis](@citet) provided initial statistical tools, which are now available in Yunir.jl. This section demonstrates the APIs for doing rhythmic analysis based on the methodologies proposed by [asaadthesis](@citet). Two types of text will be studied: Arabic poetry and the Holy Qur'an. For the Holy Qur'an, a comprehensive analysis was done by [asaadthesis](@citet), and readers are encouraged to read it. This section covers the details of how to apply it using Yunir.jl.
44

55
## Arabic Poetry
66
The first data is from a well-known author, [Al-Mutanabbi المتنبّي](https://en.wikipedia.org/wiki/Al-Mutanabbi), who authored several poems, including the one titled [*'Indeed, every woman with a swaying walk'*](https://www.youtube.com/watch?v=9c1IrQwfYFM), which will be the basis for this section.

docs/src/man/text_alignment.md

Lines changed: 4 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -41,7 +41,7 @@ shamela0012129_cln = clean(shamela0012129)
4141
shamela0023790_cln = clean(shamela0023790)
4242
```
4343
!!! tips "Tips"
44-
The `clean` function removes the non-Arabic characters through RegEx or [Regular Expression](https://en.wikipedia.org/wiki/Regular_expression), which is set at the third argument of the function. That is, `clean(shamela0012129)` is actually equivalent to:
44+
The `clean` function removes the non-Arabic characters through RegEx or [Regular Expression](https://en.wikipedia.org/wiki/Regular_expression), which is set at the third parameter of the function. That is, `clean(shamela0012129)` is actually equivalent to:
4545
```julia
4646
clean(shamela0012129; replace_non_ar="", target_regex=r"[A-Za-z0-9\(\|\–\[\«\»\]~\)_@./#&+\—-]*")
4747
```
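The idea of removing characters that match a regular expression can be sketched with Base Julia's `replace`. The function `strip_latin` and its `replacement` keyword are hypothetical stand-ins, not part of Yunir.jl, and the pattern is a deliberately simplified one (Latin letters and digits only) rather than the default shown above.

```julia
# Hypothetical stand-in for `clean`, using only Base Julia.
# Remove runs of Latin letters and digits, then collapse leftover whitespace.
function strip_latin(text::AbstractString; replacement::AbstractString="")
    stripped = replace(text, r"[A-Za-z0-9]+" => replacement)
    return strip(replace(stripped, r"\s+" => " "))
end

mixed = "abc123 بسم الله xyz"
arabic_only = strip_latin(mixed)
println(arabic_only)
```

Passing a different pattern or replacement string, as the `target_regex` and `replace_non_ar` keywords do in the call above, changes which characters are removed and what is left in their place.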
@@ -92,7 +92,7 @@ We can actually extract the encoded version, which is in extended Buckwalter tra
9292
```@repl abc
9393
res1.alignment
9494
```
95-
This is the same with the result above, but this one is the Buckwalter encoded Arabic input.
95+
This is the same as the previous result, but this one is the Buckwalter-encoded Arabic input.
9696

9797
The number on the left side is the index of the first character in the row, whereas the number on the right side is the index of the last character in the row.
9898
### Alignment statistics
@@ -192,7 +192,7 @@ f
192192
```
193193
The figure above is divided into three subplots arranged in rows. You can think of the figure as two input texts displayed in horizontal (i.e., sideways) orientation. In this orientation, the x-axis corresponds to the rows of the texts as they would appear in a book. In this case, we have two books, the reference and the target. Each dot in the reference and target corresponds to a character that has matched. The lines and curves in the middle (colored in red) connect the rows of the texts where the matches happened. Further, the y-axis corresponds to the length of the rows, in this case 60 characters per row. As you can see, the top tick label of the y-axis is 0 and the bottom tick label is 60. This is because Arabic is written right-to-left, so we can think of the 0th tick at the top as the starting index of the first character in both texts, with the row ending at the 60th tick at the bottom.
194194

195-
We added further customization to the plot, readers are encouraged to explore the API.
195+
We added further customization to the plot; readers are encouraged to explore the [API](http://127.0.0.1:5501/docs/build/man/api/).
196196

197197
As for the plot of insertions of characters, we have:
198198
```@example abc
@@ -228,7 +228,7 @@ a[3].xticks = 0:2:unique(xys[2][1])[end]
228228
f
229229
```
230230
## Cost Model
231-
The pairwise alignment above works by minimizing a cost function, which is define by a cost model. It is important that we understand how the cost model is setup so that we can give proper scoring for the mismatches, matches, deletions and insertions. To define a cost model, we use [BioAligments.jl](https://github.com/BioJulia/BioAlignments.jl)'s `CostModel` struct.
231+
The pairwise alignment above works by minimizing a cost function, which is defined by a cost model. It is important that we understand how the cost model is set up so that we can give proper scoring for the mismatches, matches, deletions and insertions. To define a cost model, we use [BioAlignments.jl](https://github.com/BioJulia/BioAlignments.jl)'s `CostModel` struct.
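How the four costs enter the minimization can be sketched with a small dynamic program in Base Julia. The `ToyCosts` struct and `alignment_cost` function are hypothetical illustrations, not the BioAlignments.jl implementation, and they return only the minimal total cost, not the alignment itself.

```julia
# Hypothetical cost container for illustration only; the documentation
# itself uses BioAlignments.jl's `CostModel`.
struct ToyCosts
    match::Int
    mismatch::Int
    insertion::Int
    deletion::Int
end

# Minimal total cost of turning `a` into `b`, by dynamic programming.
function alignment_cost(a::AbstractString, b::AbstractString, c::ToyCosts)
    x, y = collect(a), collect(b)
    m, n = length(x), length(y)
    d = zeros(Int, m + 1, n + 1)
    for i in 1:m
        d[i + 1, 1] = i * c.deletion      # delete every character of `a`
    end
    for j in 1:n
        d[1, j + 1] = j * c.insertion     # insert every character of `b`
    end
    for i in 1:m, j in 1:n
        sub = x[i] == y[j] ? c.match : c.mismatch
        d[i + 1, j + 1] = min(d[i, j] + sub,
                              d[i, j + 1] + c.deletion,
                              d[i + 1, j] + c.insertion)
    end
    return d[m + 1, n + 1]
end

unit = ToyCosts(0, 1, 1, 1)
alignment_cost("kitten", "sitting", unit)  # 3, the classic edit distance
```

Raising one of the costs makes the corresponding operation less attractive to the minimizer, which is the lever a cost model provides for scoring matches, mismatches, insertions and deletions.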
232232

233233
The default cost model is given by
234234
```@setup def
