BibleNLP/greek-room

greekroom

greekroom is a suite of tools to support Biblical natural language processing (work in progress).

gr_utilities

gr_utilities is a set of Greek Room utilities.

wb_file_props.py: A CLI Python script to analyze file properties such as script direction and quotations.
usage: wb_file_props.py [-h]
           [-i INPUT_FILENAME]
           [-s INPUT_STRING]
           [-j JSON_OUT_FILENAME]
           [-o HTML_OUT_FILENAME]
           [--lang_code LANG_CODE]
           [--lang_name LANG_NAME]

options:
  -h, --help            show this help message and exit
  -i INPUT_FILENAME, --input_filename INPUT_FILENAME
  -s INPUT_STRING, --input_string INPUT_STRING
  -j JSON_OUT_FILENAME, --json_out_filename JSON_OUT_FILENAME
  -o HTML_OUT_FILENAME, --html_out_filename HTML_OUT_FILENAME
  --lang_code LANG_CODE
  --lang_name LANG_NAME

Notes:

  • Typically, either an INPUT_FILENAME or an INPUT_STRING is provided (but not both).
  • Typically, a JSON_OUT_FILENAME or an HTML_OUT_FILENAME is provided (or both).

Sample calls

wb_file_props.py -h
wb_file_props.py -s """She asked: “Whatʼs a ‘PyPi’?”
He replied: “I don't know.”""" -j test.json
cat test.json

gr_utilities.wb_file_props.script_punct: A Python function to analyze file properties such as script direction and quotations.
import json
try:
    from gr_utilities import wb_file_props
except ImportError:
    from greekroom.gr_utilities import wb_file_props

## Apply script to string
text = """She asked: “Whatʼs a ‘PyPi’?”
He replied: “I don't know.”"""
result_dict = wb_file_props.script_punct(None, text, "eng", "English")
print(result_dict)

## Apply script to file content
# Write text to file
filename = "test.txt"
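# The sample text contains non-ASCII quotation marks, so write it as UTF-8 explicitly
with open(filename, "w", encoding="utf-8") as f_out: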
    f_out.write(text)

# Apply script
result_dict2 = wb_file_props.script_punct(filename)
# Print result as JSON string
print(json.dumps(result_dict2))
# Write result to HTML file
html_output = "test.html"
with open(html_output, "w") as f_html:
    wb_file_props.print_to_html(result_dict2, f_html)

owl

owl is a battery of smaller Bible translation checks.

repeated_words.py: A CLI Python script to check a file for repeated words, e.g. "the the".
usage: repeated_words.py [-h] 
                         [-j JSON] 
                         [-i IN_FILENAME] 
                         [-r REF_FILENAME] 
                         [-o OUT_FILENAME] 
                         [--html HTML] 
                         [--project_name PROJECT_NAME] 
                         [--lang_code LANGUAGE-CODE] 
                         [--lang_name LANG_NAME] 
                         [--message_id MESSAGE_ID]
                         [-d DATA_FILENAMES] 
                         [--verbose]

options:
  -h, --help            show this help message and exit
  -j JSON, --json JSON  input (alternative 1)
  -i IN_FILENAME, --in_filename IN_FILENAME
                        text file (alternative 2)
  -r REF_FILENAME, --ref_filename REF_FILENAME
                        ref file (alt. 2)
  -o OUT_FILENAME, --out_filename OUT_FILENAME
                        output JSON filename
  --html HTML           output HTML filename
  --project_name PROJECT_NAME
                        full name of Bible translation project
  --lang_code LANGUAGE-CODE
                        ISO 639-3, e.g. 'fas' for Persian
  --lang_name LANG_NAME
  --message_id MESSAGE_ID
  -d DATA_FILENAMES, --data_filenames DATA_FILENAMES
  --verbose

Notes:

  • Typically, either a JSON task string (-j) or an IN_FILENAME with a REF_FILENAME (-i, -r) is provided (but not both).
  • Typically, an OUT_FILENAME for JSON output (-o) or an HTML output filename (--html) is provided (or both).

Sample calls

repeated_words.py -h
repeated_words.py -j '{"jsonrpc": "2.0",
 "id": "eng-sample-01",
 "method": "BibleTranslationCheck",
 "params": [{"lang-code": "eng", "lang-name": "English", 
             "project-id": "eng-sample", 
             "project-name": "English Bible",
             "selectors": [{"tool": "GreekRoom", "checks": ["RepeatedWords"]}],
             "check-corpus": [{"snt-id": "GEN 1:1", "text": "In in the beginning ..."},
                              {"snt-id": "JHN 12:24", "text": "Truly truly, I say to you ..."}]}]}' -o test.json
cat test.json

owl.repeated_words.check_mcp: A Python function to check a file for repeated words, e.g. "the the".
import json
try:
    from owl import repeated_words
except ImportError:
    from greekroom.owl import repeated_words

task_s = '''{"jsonrpc": "2.0",
 "id": "eng-sample-01",
 "method": "BibleTranslationCheck",
 "params": [{"lang-code": "eng", "lang-name": "English",
             "project-id": "eng-sample",
             "project-name": "English Bible",
             "selectors": [{"tool": "GreekRoom", "checks": ["RepeatedWords"]}],
             "check-corpus": [{"snt-id": "GEN 1:1", "text": "In in the beginning ..."},
                              {"snt-id": "JHN 12:24", "text": "Truly truly, I say to you ..."}]}]}'''

# load_data_filename() loads legitimate_duplicates.jsonl (see below); call this function only once, even for multiple checks.
data_filename_dict = repeated_words.load_data_filename()
corpus = repeated_words.new_corpus("eng-sample-01")
mcp_d, misc_data_dict, check_corpus_list = repeated_words.check_mcp(task_s, data_filename_dict, corpus)
print(json.dumps(mcp_d))
print(misc_data_dict)
print(check_corpus_list)

# print to HTML file
feedback = repeated_words.get_feedback(mcp_d, 'GreekRoom', 'RepeatedWords')
corpus = repeated_words.update_corpus_if_empty(corpus, check_corpus_list)
repeated_words.write_to_html(feedback, misc_data_dict, corpus, "test.html", "eng", "English", "English Bible")
# result will be in test.html
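
When there are many verses to check, the task string can be generated with json.dumps rather than written by hand, which avoids quoting and escaping mistakes. The following is a minimal sketch: build_repeated_words_task is a hypothetical helper, not part of greekroom, and it only mirrors the field names used in the sample task above.

```python
import json

def build_repeated_words_task(task_id, lang_code, lang_name,
                              project_id, project_name, verses):
    """Hypothetical helper (not part of greekroom): build a task string
    shaped like the sample task above.
    verses is a list of (snt_id, text) pairs."""
    task = {
        "jsonrpc": "2.0",
        "id": task_id,
        "method": "BibleTranslationCheck",
        "params": [{
            "lang-code": lang_code,
            "lang-name": lang_name,
            "project-id": project_id,
            "project-name": project_name,
            "selectors": [{"tool": "GreekRoom", "checks": ["RepeatedWords"]}],
            "check-corpus": [{"snt-id": snt_id, "text": text}
                             for snt_id, text in verses],
        }],
    }
    # ensure_ascii=False keeps non-Latin scripts readable in the task string
    return json.dumps(task, ensure_ascii=False)

task_s = build_repeated_words_task(
    "eng-sample-01", "eng", "English", "eng-sample", "English Bible",
    [("GEN 1:1", "In in the beginning ..."),
     ("JHN 12:24", "Truly truly, I say to you ...")])
print(task_s)
```

The resulting task_s can be passed to repeated_words.check_mcp in the same way as the handwritten string above.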

legitimate_duplicates.jsonl: Data files describing legitimate repeated words.

Samples:

{"lang-code": "eng", "text": "truly, truly"}
{"lang-code": "eng", "text": "her her", "snt-ids": ["HOS 2:17", "EST 2:9", "JDT 10:4"], "context-examples": ["give her her vineyards", "gave her her things for purification"]}
{"lang-code": "grc", "text": "ἀμὴν ἀμὴν", "rom": "amen amen", "gloss": {"eng": "truly truly [I say to you]"}}

{"lang-code": "hin", "text": "जब जब", "rom": "jab jab", "gloss": {"eng": "whenever"}}
{"lang-code": "hin", "text": "कुछ कुछ", "rom": "kuch kuch", "gloss": {"eng": "something, somewhat, some of, part of"}}
{"lang-code": "eng", "text": "they they", "delete": true}

Notes:

  • Files named owl/data/legitimate_duplicates.jsonl are searched for in the directories "greekroom", "$XDG_DATA_HOME", "/usr/share", "$HOME/.local/share".
  • Later entries overwrite prior entries.
  • Entries with "delete": true delete prior entries.
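
The overwrite and delete rules in the notes above can be illustrated with a small sketch. This is not the greekroom implementation: merge_legitimate_duplicates is a hypothetical function, and it assumes that an entry is identified by its ("lang-code", "text") pair, which the notes do not state explicitly.

```python
import json

def merge_legitimate_duplicates(jsonl_sources):
    """Hypothetical sketch (not greekroom code): merge JSONL contents
    in the order given, applying the overwrite and delete rules.
    Assumption: an entry is identified by (lang-code, text)."""
    merged = {}
    for source in jsonl_sources:
        for line in source.splitlines():
            line = line.strip()
            if not line:
                continue  # skip blank lines between entries
            entry = json.loads(line)
            key = (entry["lang-code"], entry["text"])
            if entry.get("delete"):
                merged.pop(key, None)   # "delete": true removes a prior entry
            else:
                merged[key] = entry     # a later entry overwrites a prior one
    return merged

base = '''{"lang-code": "eng", "text": "truly, truly"}
{"lang-code": "eng", "text": "they they"}
{"lang-code": "eng", "text": "her her", "snt-ids": ["HOS 2:17"]}'''

override = '''{"lang-code": "eng", "text": "they they", "delete": true}
{"lang-code": "eng", "text": "her her", "snt-ids": ["HOS 2:17", "EST 2:9"]}'''

merged = merge_legitimate_duplicates([base, override])
print(sorted(text for _, text in merged))
# prints ['her her', 'truly, truly']
```

Here the second source removes "they they" and replaces the "her her" entry with its own version, while "truly, truly" is left unchanged.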

About

Tools for Checking Bible Translations
