This repository was archived by the owner on Oct 10, 2025. It is now read-only.

Commit 5d91884

tf docs
1 parent 2661cb1 commit 5d91884

3 files changed: 190 additions, 0 deletions

fusus/about/transcriptionl.py

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
"""
.. include:: ../docs/about/transcriptionl.md
"""

fusus/docs/about/transcriptionl.md

Lines changed: 111 additions & 0 deletions
@@ -0,0 +1,111 @@
# Lakhnawi transcription

The Text-Fabric data is derived from the Lakhnawi PDF by reverse engineering.
The PDF is a textual PDF that uses fonts in an unusual way to obtain the desired effects with
ligatures and diacritics.

# Divisions

The text is divided into the following chunks.

## Piece

**Section level 1**

Logical unit, corresponding to the main division of the work: *bezel*.
(The title of the work is: *bezels* of wisdom.)

Some pieces are in fact introductory chapters, and not the *bezels* of the
main work.

**Features**

name | type | description
--- | --- | ---
`n` | int | sequence number of a piece, starting with 1
`np` | int | sequence number of a proper content piece, i.e. a *bezel*
`title` | str | title of a piece

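The difference between `n` and `np` can be illustrated with a minimal sketch. The piece records, the `is_bezel` flag, and the function `number_pieces` are hypothetical and only serve the illustration; the sketch also assumes that `np` starts at 1 and is absent on introductory pieces, which the table above does not state explicitly.

```python
# Hypothetical piece records: (title, is_bezel).
# The titles are placeholders, not taken from the edition.
pieces = [
    ("introduction", False),
    ("first bezel", True),
    ("second bezel", True),
]


def number_pieces(pieces):
    """Assign n to every piece and np only to proper content pieces."""
    numbered = []
    np_counter = 0
    for n, (title, is_bezel) in enumerate(pieces, start=1):
        if is_bezel:
            np_counter += 1
        numbered.append(
            {"n": n, "np": np_counter if is_bezel else None, "title": title}
        )
    return numbered


numbered = number_pieces(pieces)
for piece in numbered:
    print(piece)
```

In this sketch every piece gets an `n`, while only the pieces flagged as bezels get an `np`, so the two counters drift apart as soon as an introductory chapter occurs.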
## Page

**Section level 2**

Physical unit: a printed page.

**Features**

name | type | description
--- | --- | ---
`n` | int | sequence number of a page, starting with 1

## Line

**Section level 3**

Physical unit: a printed line within a page.

**Features**

name | type | description
--- | --- | ---
`n` | int | sequence number of a line, starting with 1

## Column

Logical/physical unit: a column within a line.

Note that the page is not divided into columns.
Some lines are divided into columns, namely in
poems written in hemistichs. See `fusus.lakhnawi.Lakhnawi.columns`.

## Span

Logical/physical unit: a stretch of text with the same writing direction.
Whenever the writing direction reverses, a new span is started.

**Features**

name | type | description
--- | --- | ---
`n` | int | sequence number of a span within a column or line
`dir` | str | writing direction of a span; either `r` or `l`

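To make the span rule concrete, here is a minimal sketch that splits a string into spans whenever the writing direction reverses. It is not the logic used in `fusus`: it assumes that direction can be read off the Unicode bidirectional class of each character, and that neutral characters (spaces, combining marks) stay in the current span. The names `char_dir` and `split_spans` are hypothetical.

```python
import unicodedata


def char_dir(ch):
    """Return 'r', 'l', or None (neutral), based on the Unicode bidi class."""
    bidi = unicodedata.bidirectional(ch)
    if bidi in ("R", "AL"):
        return "r"
    if bidi == "L":
        return "l"
    return None


def split_spans(text, default="r"):
    """Split text into spans; a new span starts when the direction reverses."""
    spans = []
    current_dir = None
    buffer = ""
    for ch in text:
        d = char_dir(ch)
        if d is not None and current_dir is not None and d != current_dir:
            # Direction reversal: close the current span
            spans.append({"n": len(spans) + 1, "dir": current_dir, "text": buffer})
            buffer = ""
        if d is not None:
            current_dir = d
        buffer += ch
    if buffer:
        # Assumption: text without any strong character gets the default direction
        spans.append({"n": len(spans) + 1, "dir": current_dir or default, "text": buffer})
    return spans


spans = split_spans("كتاب abc كتاب")
for span in spans:
    print(span["n"], span["dir"], repr(span["text"]))
```

The sample string mixes Arabic and Latin letters, so the sketch yields three spans whose `dir` values alternate between `r` and `l`.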
## Sentence

Logical unit: a sentence, defined by the full-stop marker.
A new sentence starts after each full-stop marker.

**Features**

name | type | description
--- | --- | ---
`n` | int | sequence number of a sentence

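As an illustration of sentence division, the sketch below groups words into numbered sentences. It assumes that the full-stop marker shows up as `.` in the punctuation that follows a word; the actual marker in the Lakhnawi data may be different. The function `number_sentences` and the sample words are hypothetical.

```python
FULL_STOP = "."  # assumption: the real full-stop marker may be another character


def number_sentences(words):
    """Group (letters, punctuation) pairs into sentences ending at a full stop."""
    sentences = []
    current = []
    for letters, punc in words:
        current.append(letters)
        if FULL_STOP in punc:
            sentences.append({"n": len(sentences) + 1, "words": current})
            current = []
    if current:
        # Trailing words without a closing full stop still form a sentence
        sentences.append({"n": len(sentences) + 1, "words": current})
    return sentences


# Placeholder words, not real data
words = [("w1", " "), ("w2", ". "), ("w3", " "), ("w4", ".")]
sentences = number_sentences(words)
for sentence in sentences:
    print(sentence)
```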
## Word

Logical/physical unit: individual words, insofar as they are separated
by whitespace.

!!! caution "Imperfect whitespace detection"
    We do not guarantee that whitespace has been detected
    perfectly.
    So we do miss word boundaries on the one hand, and we
    have spurious word boundaries on the other hand.

**Features**

name | type | description
--- | --- | ---
`boxl` | int | left x-coordinate of the bounding box of a word
`boxt` | int | top y-coordinate of the bounding box of a word
`boxr` | int | right x-coordinate of the bounding box of a word
`boxb` | int | bottom y-coordinate of the bounding box of a word
`letters` | str | the text of a word in Arabic, Unicode, without punctuation
`lettersn` | str | the text of a word in beta code, Latin + diacritics
`lettersp` | str | the text of a word in beta code, ASCII
`letterst` | str | the text of a word in romanized transcription
`punc` | str | the punctuation and/or space immediately after a word in Arabic, Unicode
`punca` | str | the punctuation and/or space immediately after a word in ASCII

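Because the punctuation features hold what comes immediately after a word, running text can be rebuilt by concatenating a letters feature with the matching punctuation feature. The sketch below shows this with plain dictionaries; the feature values are placeholders rather than real beta code, and `join_words` is a hypothetical helper, not part of `fusus` or Text-Fabric.

```python
# Placeholder word records; the values are not real beta code
words = [
    {"lettersp": "abc", "punca": " "},
    {"lettersp": "def", "punca": ", "},
    {"lettersp": "ghi", "punca": "."},
]


def join_words(words, letters_feature="lettersp", punc_feature="punca"):
    """Rebuild running text from a letters feature and a punctuation feature."""
    return "".join(w[letters_feature] + w[punc_feature] for w in words)


text = join_words(words)
print(text)
```

Pairing `lettersp` with `punca` keeps the result in ASCII; pairing `letters` with `punc` would give the Arabic text in the same way.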

legacy/notebooks/test copy.py

Lines changed: 76 additions & 0 deletions
@@ -0,0 +1,76 @@
# -*- coding: utf-8 -*-
# ---
# jupyter:
#   jupytext:
#     text_representation:
#       extension: .py
#       format_name: light
#       format_version: '1.5'
#       jupytext_version: 1.11.4
#   kernelspec:
#     display_name: Python3.9
#     language: python
#     name: python3
# ---

import re

# Toy example: findall with two alternative groups yields one tuple per match
WORD_RE = re.compile(r"""
    ([x-z]+)
    |
    ([a-d]+)
""", re.X)

string = "ccxzaayzbbzz"

x = WORD_RE.findall(string)
x

# nonLetterRange is not defined anywhere in this file; the value below is an
# assumed placeholder so that the cell runs
nonLetterRange = r"\.,;:!?"
CHUNK_RE = re.compile(fr"[{nonLetterRange}]")

string = "..aa"

match = CHUNK_RE.match(string)
match

# +
PART = r"""!"\#\$%\&\'\(\)\*\+,\-\./:;<=>\?@\[\]\{\}«»ʰʱʲʳʴʵʶʷʸʹʺʻʼʽʾʿˀˁ˂˃˄˅ˆˇˈˉˊˋˌˍˎˏːˑ˒˓˔˕˖˗˘˙˚˛˜˝˞˟ˠˡˢˣˤ˥˦˧˨˩˪˫ˬ˭ˮ˯˰˱˲˳˴˵˶˷˸˹˺˻˼˽˾˿̀́̂̃̄،؛\u061c\u061d؞؟‐‑‒–—―‖‗‘’‚‛“”„‟†‡•‣․ ‥…‧\u2028\u2029‹›⁅⁆⁌⁍﴾﴿"""

# Group 1 matches runs of letters, group 2 runs of non-letters (characters in PART)
WORD_RE = re.compile(f"""
    (
        [^{PART}]+
    )
    |
    (
        [{PART}]+
    )
""", re.X)
# -

string = 'إِلىٰأَكْثَرَ،إِلىٰ'

# +
# Each part is a list [preceding non-letters, letters, following non-letters]
parts = []
first = True

for (letters, nonLetters) in WORD_RE.findall(string):
    print(f"PART {letters=} {nonLetters=}")
    if first:
        parts.append([nonLetters, letters, ""])
        first = False
    elif letters:
        parts.append(["", letters, ""])
    else:
        parts[-1][-1] += nonLetters
if parts:
    # The last part is followed by a space
    parts[-1][-1] += " "

# -

for part in parts:
    print("PART")
    print(f"\t{part[0]=}")
    print(f"\t{part[1]=}")
    print(f"\t{part[2]=}")
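The splitting loop in this notebook can also be written as a small function. The following is a sketch under assumptions, not the code of this commit: it uses a much smaller punctuation set than `PART`, and it attaches leading punctuation to the first word instead of storing it in a part without letters. The names `PUNCT`, `TOKEN_RE`, and `split_words` are hypothetical.

```python
import re

# Assumption: a much smaller punctuation set than PART, for illustration only
PUNCT = re.escape("""!"#$%&'()*+,-./:;<=>?@[]{}«»،؛؟""")

# Group 1: a run of letters, group 2: a run of punctuation
TOKEN_RE = re.compile(f"([^{PUNCT}]+)|([{PUNCT}]+)")


def split_words(text):
    """Return [pre, letters, post] triples; punctuation joins the preceding word."""
    parts = []
    for letters, non_letters in TOKEN_RE.findall(text):
        if letters:
            if parts and not parts[-1][1]:
                # Leading punctuation was stored first: fill in its word
                parts[-1][1] = letters
            else:
                parts.append(["", letters, ""])
        elif parts:
            parts[-1][2] += non_letters
        else:
            # Punctuation before the first word
            parts.append([non_letters, "", ""])
    return parts


for part in split_words("«abc،def.ghi"):
    print(part)
```

Unlike the loop in the notebook, this sketch does not append a trailing space to the last part; whether that is wanted depends on how the parts are consumed.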
