docs/src/man/basic_utilities.md
+6 -3 (6 additions & 3 deletions)
@@ -2,7 +2,7 @@ Basic Utilities
2
2
=====
3
3
In this section, we are going to discuss how to use the APIs for dediacritization, normalization, and transliteration.
4
4
## Dediacritization
5
-
The function to use is `dediac` which works on either Arabic, Buckwalter or custom transliterated characters.
5
+
Dediacritization is the process of removing diacritics from an Arabic word. These diacritics are mostly vowels but also include _sukuun_ سُكُون and _shaddah_ شَدّة. The function to use for dediacritization is `dediac`, which works on Arabic, Buckwalter, or custom transliterated characters.
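To make the idea concrete independently of Yunir, here is a minimal sketch in plain Julia. It is not how `dediac` is implemented; the helper name `strip_diacritics` is hypothetical, and the sketch assumes that only the basic Arabic diacritics in the Unicode range U+064B to U+0652 (the _tanwiin_ forms, the short vowels, _shaddah_, and _sukuun_) need to be removed.

```julia
# Basic Arabic diacritics occupy the contiguous Unicode range U+064B..U+0652.
# Assumption: marks outside this range (e.g. the dagger alif, U+0670) are ignored.
const ARABIC_DIACRITICS = '\u064B':'\u0652'

# Hypothetical helper, NOT Yunir's `dediac`: keep every character
# that does not fall in the diacritic range.
strip_diacritics(s::AbstractString) = filter(c -> !(c in ARABIC_DIACRITICS), s)

word = "سُكُون"                      # "sukuun", written with two short-vowel marks
bare = strip_diacritics(word)       # the same word with the marks removed
println(length(word), " characters before, ", length(bare), " after")
```

Non-Arabic input passes through unchanged, since none of its characters fall in the range.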
6
6
```@repl abc
7
7
using Yunir
8
8
@transliterator :default
@@ -15,20 +15,23 @@ Or using Buckwalter as follows:
Setting the `isarabic` parameter to `false` indicates that the `dediac` function takes a Buckwalter-encoded input, `bw_basmala`, and returns its output in Buckwalter form as well, rather than in Arabic as in the previous example.
19
+
18
20
With Julia's broadcasting feature, the above dediacritization can be applied to arrays by simply adding `.` to the name of the function.
As seen above, broadcasting allows application of the `dediac` function to the elements of the vector `sentence0`. That is, because there are two entries in the `sentence0` vector, broadcasting applies the `dediac` function to each of them and thus returns two outputs as well.
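The dot syntax is a general Julia feature rather than something specific to `dediac`. The sketch below illustrates it with a hypothetical stand-in function, `tag`, and a made-up two-element vector; neither is part of Yunir. It also shows that a keyword argument, analogous to `isarabic`, is passed unchanged to every call.

```julia
# Hypothetical stand-in for `dediac`: a one-argument function with a keyword.
tag(s::AbstractString; upper::Bool=false) = upper ? uppercase(s) : s

words = ["salaam", "kitaab"]        # made-up vector with two entries

# The dot applies `tag` to each element and collects the results in a vector.
plain = tag.(words)
loud  = tag.(words; upper=true)     # the keyword is forwarded to every call

# Broadcasting over a single vector matches an explicit comprehension.
@assert loud == [tag(w; upper=true) for w in words]
@assert length(plain) == length(words)
```

Two inputs give two outputs, which is the behaviour described above for `sentence0`.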
25
28
## Normalization
26
-
The function to use is `normalize`, which works on either Arabic, Buckwalter or custom transliterated characters. For example, using the `ar_basmala` and `bw_basmala` defined above, the normalized version would be
29
+
Arabic letters are calligraphic by design. Their free-flowing design makes it easy to form unique ligatures, which may require normalization for consistency's sake when doing natural language processing. To normalize, use the `normalize` function, which works on Arabic, Buckwalter, or custom transliterated characters. For example, using the `ar_basmala` and `bw_basmala` defined above, the normalized version would be
27
30
```@repl abc
28
31
normalize(ar_basmala)
29
32
normalize(bw_basmala; isarabic=false)
30
33
```
31
-
You can also normalize specific characters, for example:
34
+
Again, `isarabic=false` simply disables Arabic output and encodes the result in Buckwalter instead. You can also normalize specific characters, for example:
docs/src/man/rhythmic_analysis.md
+1 -1 (1 addition & 1 deletion)
@@ -1,6 +1,6 @@
1
1
Rhythmic Analysis
2
2
=============
3
-
The prevalence of poetry in Arabic literature necessitates scientific tool to study the rhythmic signatures. Unfortunately, there are no resources for such methodology until the recent work of [asaadthesis](@citet). This section will demonstrate the APIs for doing rhythmic analysis based on the methodologies proposed by [asaadthesis](@citet). To do this, there are two types of text that will be studied, and these are pre-Islamic poetry and the Holy Qur'an.
3
+
The prevalence of poetry in Arabic literature necessitates a scientific tool to study its rhythmic signatures. Unfortunately, to the best of the author's knowledge, there were until recently neither tools for this nor resources on statistical methodologies for studying rhythm. The recent work of [asaadthesis](@citet) provided initial statistical tools, which are now available in Yunir.jl. This section will demonstrate the APIs for doing rhythmic analysis based on the methodologies proposed by [asaadthesis](@citet). To do this, two types of text will be studied: Arabic poetry and the Holy Qur'an. For the Holy Qur'an, a comprehensive analysis was done by [asaadthesis](@citet), and readers are encouraged to read it. This section covers how to apply that analysis using Yunir.jl.
4
4
5
5
## Arabic Poetry
6
6
The first data comes from a well-known poet, [Al-Mutanabbi المتنبّي](https://en.wikipedia.org/wiki/Al-Mutanabbi), who authored many poems, including the one titled [*'Indeed, every woman with a swaying walk'*](https://www.youtube.com/watch?v=9c1IrQwfYFM), which will be the basis for this section.
The `clean` function removes the non-Arabic characters through RegEx or [Regular Expression](https://en.wikipedia.org/wiki/Regular_expression), which is set at the third argument of the function. That is, `clean(shamela0012129)` is actually equivalent to:
44
+
The `clean` function removes the non-Arabic characters using a [regular expression](https://en.wikipedia.org/wiki/Regular_expression) (RegEx), which is set through the third parameter of the function. That is, `clean(shamela0012129)` is actually equivalent to:
@@ -92,7 +92,7 @@ We can actually extract the encoded version, which is in extended Buckwalter tra
92
92
```@repl abc
93
93
res1.alignment
94
94
```
95
-
This is the same with the result above, but this one is the Buckwalter encoded Arabic input.
95
+
This is the same as the previous result, but this one is the Buckwalter-encoded Arabic input.
96
96
97
97
The number on the left side is the index of the first character in the row, whereas the number on the right side is the index of the last character in the row.
98
98
### Alignment statistics
@@ -192,7 +192,7 @@ f
192
192
```
193
193
The figure above is divided into three subplots arranged in rows. You can think of the figure as two input texts displayed in horizontal (i.e., sideways) orientation. In this orientation, the x-axis represents the rows of the texts, much like the rows of text in a book. In this case, we have two books, the reference and the target. Each dot in the reference and the target corresponds to a character that has matched. The lines and curves in the middle (colored in red) connect the rows of the two texts where the matches happened. Further, the y-axis corresponds to the length of the rows, in this case 60 characters per row. As you can see, the top tick label of the y-axis is 0 and the bottom tick label is 60. This is because Arabic is written right-to-left, so we can think of the 0th tick at the top as the index of the first character of a row in both texts, and the row ends at the 60th tick at the bottom.
194
194
195
-
We added further customization to the plot, readers are encouraged to explore the API.
195
+
We added further customization to the plot; readers are encouraged to explore the [API](http://127.0.0.1:5501/docs/build/man/api/).
196
196
197
197
As for the plot of insertions of characters, we have:
The pairwise alignment above works by minimizing a cost function, which is define by a cost model. It is important that we understand how the cost model is setup so that we can give proper scoring for the mismatches, matches, deletions and insertions. To define a cost model, we use [BioAligments.jl](https://github.com/BioJulia/BioAlignments.jl)'s `CostModel` struct.
231
+
The pairwise alignment above works by minimizing a cost function, which is defined by a cost model. It is important that we understand how the cost model is set up so that we can give proper scores to mismatches, matches, deletions and insertions. To define a cost model, we use [BioAlignments.jl](https://github.com/BioJulia/BioAlignments.jl)'s `CostModel` struct.
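To see how a cost model shapes the result, the following is a minimal sketch of cost-minimizing alignment in plain Julia. It is not BioAlignments.jl's implementation and does not use its `CostModel` struct: the cost model here is a hypothetical `NamedTuple` whose four fields simply mirror the operations named above, and `alignment_cost` is an illustrative helper that returns only the minimum total cost, not the alignment itself.

```julia
# Hypothetical cost model (NOT BioAlignments.jl's `CostModel`): one score per operation.
unit_costs = (match = 0, mismatch = 1, insertion = 1, deletion = 1)

# Minimum total cost of turning `a` into `b`, computed by dynamic programming.
function alignment_cost(a::AbstractString, b::AbstractString, c)
    x, y = collect(a), collect(b)          # index by character, not by byte
    m, n = length(x), length(y)
    D = zeros(Int, m + 1, n + 1)           # D[i+1, j+1]: cost for x[1:i] vs y[1:j]
    for i in 1:m
        D[i + 1, 1] = i * c.deletion       # delete every character of `a`
    end
    for j in 1:n
        D[1, j + 1] = j * c.insertion      # insert every character of `b`
    end
    for i in 1:m, j in 1:n
        sub = x[i] == y[j] ? c.match : c.mismatch
        D[i + 1, j + 1] = min(D[i, j] + sub,              # match or mismatch
                              D[i, j + 1] + c.deletion,   # drop x[i]
                              D[i + 1, j] + c.insertion)  # add y[j]
    end
    return D[m + 1, n + 1]
end

# With unit costs: one mismatch (b vs d) plus one insertion (e).
@assert alignment_cost("abcd", "adcde", unit_costs) == 2

# A costly mismatch makes a deletion plus an insertion cheaper than substituting.
strict_costs = (match = 0, mismatch = 5, insertion = 1, deletion = 1)
@assert alignment_cost("abcd", "adcde", strict_costs) == 3
```

The same pair of strings gets a different minimum cost under the two models, which is why the scores for mismatches, matches, deletions and insertions need to be chosen deliberately.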