diff --git a/examples/harry-potter-corpus/README.md b/examples/harry-potter-corpus/README.md
index 4e3226c..fd8faf3 100644
--- a/examples/harry-potter-corpus/README.md
+++ b/examples/harry-potter-corpus/README.md
@@ -146,7 +146,7 @@ We need to get the corpus into the right shape for the indexing process. PrimeQA
 We want this collection of documents to retain as much continuity of meaning as possible, so it's a good idea to keep paragraphs and avoid separations within sentences. Short pieces (like dialogue lines) can be combined into larger ones, and larger paragraphs can be split by sentence boundaries to keep the word count in check. This corpus has plenty of dialogue lines, and a few paragraphs longer than 180 words. Both of these transformations are done in our preprocessing scripts, alongside the fixing of page-spanning paragraphs.
 
 ```
-(hp-corpus) harry-potter-corpus $ ./process.sh
+(hp-corpus) harry-potter-corpus $ ./process.py Book*.txt > corpus.tsv
 Processing Book1.txt...
 Processing Book2.txt...
 Processing Book3.txt...
@@ -154,7 +154,6 @@
 Processing Book4.txt...
 Processing Book5.txt...
 Processing Book6.txt...
 Processing Book7.txt...
-corpus.tsv successfully generated.
 (hp-corpus) harry-potter-corpus $
 ```
@@ -195,53 +194,21 @@ lowed himself a grin. “Most mysterious. And now, over to Jim McGuffin with t>
 :
 ```
-The inner workings of the `process.sh` script are described below.
+The inner workings of the `process.py` script are described below.
 
 ## Processing steps
 
-`process.sh` is a Bash script that streams the books' contents into a series of pipelined modification steps:
+`process.py` is a Python script that streams the books' contents through a series of pipelined modification steps:
 
-- Using `sed` regular expression replacements, turn page footer occurrences into empty lines (first three steps).
-- With the fourth `sed` command, turn lines containing only spaces into empty lines.
-- `remove_multi_newline.py` collapses three or more contiguous newlines into a single empty line.
-- `remove_single_newline.py` removes single newlines so paragraphs get consolidated into the same line, one per line.
-- `fix_straddling_paragraphs.py` fixes page break-straddling paragraphs checking sentence continuity (one paragraph per line up to this point).
-- `combine_up_to_n_words.py` combines contiguous short paragraphs as long as the result doesn't exceed 180 words.
-- `split_by_sentences_up_to_n_words.py` splits paragraphs longer than 180 words, keeping whole sentences.
-- `to_tsv.py` formats each piece as `tsv` appending `Book<N> Paragraph <M>` as title.
+- Using regular expression matching, it skips page footer occurrences (the first three patterns) and blank lines (the fourth).
+- `fix_straddling_paragraphs` fixes page-break-straddling paragraphs by checking sentence continuity (one paragraph per line up to this point).
+- `combine_up_to_n_words` combines contiguous short paragraphs as long as the result doesn't exceed 180 words.
+- `split_by_sentences_up_to_n_words` splits paragraphs longer than 180 words, keeping whole sentences.
+- `to_tsv` formats each piece as TSV, appending `Book<N> Paragraph <M>` as the title.
 
-The first steps are implemented with simple Bash commands, while the later steps, such as the sentence-aware ones, are implemented as Python scripts with spaCy.
+Each book gets serialized into an intermediate `tsv` representation of one "document" per line, where each document is either an original paragraph, a run of whole sentences from a long paragraph, or a concatenation of short paragraphs or dialogue lines.
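+
+As an illustration (the exact text depends on the source files and the 180-word limit), an intermediate line for Book 1 could look like this, with a tab separating the text from its generated title:
+
+```
+Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say…	Book1 Paragraph 1
+```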
-```
-readonly WORD_QTY="${1:-180}"
-
-for book in Book*.txt; do
-    >&2 echo "Processing ${book}..."
-    cat "${book}" \
-        | sed -E 's/^Page \|\s*[0-9]+ .*$//g' \
-        | sed -E 's/^P a g e.*$//g' \
-        | sed -E 's/P.*Rowling//g' \
-        | sed -E 's/^\s+$//g' \
-        | ./remove_multi_newline.py \
-        | ./remove_single_newline.py \
-        | ./fix_straddling_paragraphs.py \
-        | ./combine_up_to_n_words.py "${WORD_QTY}" \
-        | ./to_tsv.py "${book%.*}" \
-        | cat > "${book%.*}.tsv"
-done
-```
-
-Each book gets into an intermediate `tsv` representation of one "document" per line, being each one of these either an original paragraph, whole sentences of a long paragraph or a concatenation of short paragraphs or dialogue lines.
-
-After processing each book, the seven intermediate representations are concatenated after a `id\ttext\ttitle` header into a single stream, in which each line is preceded by a line number:
-
-```
-cat Book*.tsv \
-    | nl \
-    | sed 's/^ *//g' \
-    | { echo -e 'id\ttext\ttitle'; cat; } \
-    | cat > corpus.tsv
-```
+Each book's representation is concatenated, after an `id\ttext\ttitle` header, into a single stream in which each line is preceded by a line number.
 
-When the process finishes, the `corpus.tsv` file is ready to be used by PrimeQA's indexing feature.
+The process writes its result to `stdout`, so it can easily be redirected to a file, for example `corpus.tsv`. When the process finishes, the file is ready to be used by PrimeQA's indexing feature.
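+
+For instance (shown schematically; real rows carry the full fragment text), the first lines of the generated file follow this shape:
+
+```
+id	text	title
+1	<first text fragment of Book 1>	Book1 Paragraph 1
+2	<second text fragment of Book 1>	Book1 Paragraph 2
+```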
diff --git a/examples/harry-potter-corpus/combine_up_to_n_words.py b/examples/harry-potter-corpus/combine_up_to_n_words.py
deleted file mode 100755
index c4b0a20..0000000
--- a/examples/harry-potter-corpus/combine_up_to_n_words.py
+++ /dev/null
@@ -1,38 +0,0 @@
-#! /usr/bin/env python3
-
-def word_qty(nlp, text: str) -> int:
-    doc = nlp(text)
-    count = 0
-    for i in doc:
-        if not (i.is_space or i.is_punct):
-            count = count + 1
-    return count
-
-if __name__ == "__main__":
-    import sys
-    import spacy
-
-    # Check argument present
-    if len(sys.argv) < 2:
-        print(f"Usage: python combine_up_to_n_words.py <n>", file=sys.stderr)
-        sys.exit(1)
-
-    n = int(sys.argv[1])
-
-    nlp = spacy.load('en_core_web_sm')
-    par = ''
-    count = 0
-    for line in sys.stdin:
-        current_par = line.rstrip('\n')
-        current_count = word_qty(nlp, current_par)
-        if count + current_count <= n:
-            if par != '':
-                current_par = ' ' + current_par
-            par = par + current_par
-            count = count + current_count
-        else:
-            print(par)
-            par = current_par
-            count = current_count
-    print(par)
-
diff --git a/examples/harry-potter-corpus/fix_straddling_paragraphs.py b/examples/harry-potter-corpus/fix_straddling_paragraphs.py
deleted file mode 100755
index d7f990c..0000000
--- a/examples/harry-potter-corpus/fix_straddling_paragraphs.py
+++ /dev/null
@@ -1,30 +0,0 @@
-#! /usr/bin/env python3
-
-def count_sentences(nlp, text: str) -> int:
-    return sum(1 for _ in nlp(replace_stylish_quotes(text)).sents)
-
-def replace_stylish_quotes(text: str) -> str:
-    return text.replace("“", "").replace("”", "").replace("’", "\'")
-
-if __name__ == "__main__":
-    import sys
-    import spacy
-    nlp = spacy.load('en_core_web_sm')
-    accum = ''
-    accum_sentence_count = 0
-    for line in sys.stdin:
-        piece = line.rstrip('\n')
-        sentence_count = count_sentences(nlp, piece)
-        concatenation = accum.rstrip(' ') + ' ' + piece
-        concatenation_sentence_count = count_sentences(nlp, concatenation)
-        if concatenation_sentence_count == accum_sentence_count + sentence_count:
-            # expected boundary between paragraphs: prints last paragraph and resets sentence count:
-            print(accum)
-            accum = piece
-            accum_sentence_count = sentence_count
-        else:
-            # straddling paragraph: accumulates text and sentence count:
-            accum = concatenation
-            accum_sentence_count = concatenation_sentence_count
-    print(accum)
-
diff --git a/examples/harry-potter-corpus/process.py b/examples/harry-potter-corpus/process.py
new file mode 100755
index 0000000..070afcb
--- /dev/null
+++ b/examples/harry-potter-corpus/process.py
@@ -0,0 +1,182 @@
+#! /usr/bin/env python3
+
+from typing import Any, Generator
+import sys
+from io import StringIO
+import itertools
+import re
+import spacy
+from spacy.language import Language
+import csv
+
+def log(stuff: Any):
+    """Logs str of values to stderr."""
+    print(str(stuff), file=sys.stderr)
+
+def lines_from_file(file_name: str) -> Generator[str, None, None]:
+    """Reads a file and yields its lines."""
+    log(f"Processing {file_name}...")
+    for line in open(file_name, "r", encoding="utf-8"):
+        yield line
+
+def strip_newline(lines: Generator[str, None, None]) -> Generator[str, None, None]:
+    """Removes trailing newlines from lines."""
+    for line in lines:
+        yield line.rstrip("\n")
+
+def skip(pattern: str, lines: Generator[str, None, None]) -> Generator[str, None, None]:
+    """Skips lines matching the pattern."""
+    for line in lines:
+        if not re.search(pattern, line):
+            yield line
+
+def fix_straddling_paragraphs(nlp: Language, lines: Generator[str, None, None]) -> Generator[str, None, None]:
+    """Combines lines according to sentence continuity."""
+    def _replace_stylish_quotes(text: str) -> str:
+        """Replaces stylish quotes and apostrophes."""
+        return text.replace("“", "").replace("”", "").replace("’", "\'")
+
+    def _count_sentences(text: str) -> int:
+        """Counts the quantity of sentences in a fragment."""
+        return sum(1 for _ in nlp(_replace_stylish_quotes(text)).sents)
+
+    accum: str = ""
+    accum_sentence_count: int = 0
+    for line in lines:
+        sentence_count = _count_sentences(line)
+        concatenation = accum.rstrip(" ") + " " + line
+        concatenation_sentence_count = _count_sentences(concatenation)
+        if concatenation_sentence_count == accum_sentence_count + sentence_count:
+            # expected boundary between paragraphs: yields last paragraph and resets sentence count:
+            yield accum
+            accum = line
+            accum_sentence_count = sentence_count
+        else:
+            # straddling paragraph: accumulates text and sentence count:
+            accum = concatenation
+            accum_sentence_count = concatenation_sentence_count
+    yield accum
+
+def word_qty(doc) -> int:
+    """Counts words from text fragment."""
+    count: int = 0
+    for i in doc:
+        if not (i.is_space or i.is_punct):
+            count = count + 1
+    return count
+
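+# Illustrative note, not required by the pipeline: with spaCy's default
+# tokenization, punctuation and whitespace tokens are not counted, so for
+# example word_qty(nlp("Hello, world!")) == 2.
+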
+def combine_up_to_n_words(nlp: Language, n: int, lines: Generator[str, None, None]) -> Generator[str, None, None]:
+    """Combines contiguous short paragraphs up to n words."""
+    par: str = ""
+    count: int = 0
+    for line in lines:
+        current_count: int = word_qty(nlp(line))
+        if count + current_count <= n:
+            if par != "":
+                line = " " + line
+            par = par.rstrip(" ") + line
+            count = count + current_count
+        else:
+            yield par
+            par = line
+            count = current_count
+    if par != "":
+        yield par
+        par = ""
+        count = 0
+
+def split_by_sentences_up_to_n_words(nlp: Language, n: int, lines: Generator[str, None, None]) -> Generator[str, None, None]:
+    """Splits paragraphs longer than n words in sentence boundaries."""
+    for line in lines:
+        doc = nlp(line)
+        word_count: int = word_qty(doc)
+        if word_count > n:
+            accum: str = ""
+            word_count_accum: int = 0
+            for sent in doc.sents:
+                sentence = sent.text
+                sentence_word_count = word_qty(sent)
+                if word_count_accum + sentence_word_count < n:
+                    if accum != "":
+                        accum = accum.rstrip() + " "
+                    accum = accum + sentence
+                    word_count_accum = word_count_accum + sentence_word_count
+                else:
+                    yield accum
+                    accum = sentence
+                    word_count_accum = sentence_word_count
+            if accum != "":
+                yield accum
+        else:
+            yield line
+
+def number_lines(lines: Generator[str, None, None]) -> Generator[str, None, None]:
+    """Prepends a tab-separated line number to a text line."""
+    count: int = 1
+    for line in lines:
+        yield str(count) + "\t" + line
+        count = count + 1
+
+def to_tsv(title_prefix: str, lines: Generator[str, None, None]) -> Generator[str, None, None]:
+    """Generates TSV lines from an input generator, appending title and paragraph numbering."""
+    buffer = StringIO()
+    writer = csv.writer(buffer, delimiter='\t', lineterminator='\n')
+
+    def _stringify(data: list[str]) -> str:
+        """Extracts the TSV serialization from the writer as a string."""
+        writer.writerow(data)
+        value: str = buffer.getvalue().strip("\r\n")
+        buffer.seek(0)
+        buffer.truncate(0)
+        return value
+
+    counter: int = 1
+    for line in lines:
+        data: list[str] = [ line, title_prefix + " Paragraph " + str(counter) ]
+        yield _stringify(data)
+        counter = counter + 1
+
+def write(out, lines: Generator[str, None, None]):
+    """Writes lines to output stream, appending a newline."""
+    for line in lines:
+        out.write(line + "\n")
+
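+# Note: the stages above are all lazy generators, so nothing is read from disk
+# until `write` drains the chain assembled below; each book is processed as a
+# stream rather than loaded whole into memory.
+# Usage: ./process.py Book*.txt > corpus.tsv
+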
+if __name__ == "__main__":
+    nlp: Language = spacy.load("en_core_web_sm")
+    WORD_QTY: int = 180
+    files: list[str] = sys.argv[1:]
+    out = sys.stdout
+
+    # Adds the header to the file
+    out.write("id\ttext\ttitle\n")
+
+    all_lines = itertools.chain()
+    for file_name in files:
+        # For each book:
+        # Reads line by line.
+        lines = lines_from_file(file_name)
+        # Removes newline at the end of each line.
+        lines = strip_newline(lines)
+        # Skips page number footer by regex matching.
+        lines = skip(r"^Page \|\s*[0-9]+ .*$", lines)
+        lines = skip(r"^P a g e.*$", lines)
+        lines = skip(r"P.*Rowling", lines)
+        # Skips blank lines.
+        lines = skip(r"^\s+$", lines)
+        # Fixes pagebreak-straddling paragraphs checking sentence continuity (one paragraph per line up to this point).
+        lines = fix_straddling_paragraphs(nlp, lines)
+        # Combines contiguous short paragraphs as long as the result doesn't exceed WORD_QTY words.
+        lines = combine_up_to_n_words(nlp, WORD_QTY, lines)
+        # Splits paragraphs longer than WORD_QTY words, keeping whole sentences.
+        lines = split_by_sentences_up_to_n_words(nlp, WORD_QTY, lines)
+        # Formats each fragment as TSV appending "Book<N> Paragraph <M>" as title.
+        book_name = re.sub(r"\.[a-zA-Z0-9]+$", "", file_name)
+        lines = to_tsv(book_name, lines)
+        # Concatenates book lines
+        all_lines = itertools.chain(all_lines, lines)
+
+    # Prepends line numbers (will be used as ids)
+    all_lines = number_lines(all_lines)
+    write(out, all_lines)
diff --git a/examples/harry-potter-corpus/process.sh b/examples/harry-potter-corpus/process.sh
deleted file mode 100755
index 104849d..0000000
--- a/examples/harry-potter-corpus/process.sh
+++ /dev/null
@@ -1,44 +0,0 @@
-#! /usr/bin/env bash
-
-# For each book:
-# First three seds: Removes page number footer by regex matching.
-# Fourth sed: Remove sole blanks from lines.
-# remove_multi_newline.py: Collapses three or more contiguous newlines into a single empty line (two contiguous newlines).
-# remove_single_newline.py: Removes single newlines so paragraphs get consolidated into the same line, one per line.
-# fix_straddling_paragraphs.py: Fixes pagebreak-straddling paragraphs checking sentence continuity (one paragraph per line up to this point).
-# combine_up_to_n_words.py: Combines contiguous short paragraphs as long as the result doesn't exceed WORD_QTY words (180 by default, if not provided).
-# split_by_sentences_up_to_n_words.py: Splits paragraphs longer than WORD_QTY words, keeping whole sentences.
-# to_tsv.py: Formats each fragment as TSV appending "Book<N> Paragraph <M>" as title.
-
-readonly WORD_QTY="${1:-180}"
-
-for book in Book*.txt; do
-    >&2 echo "Processing ${book}..."
-    cat "${book}" \
-        | sed -E 's/^Page \|\s*[0-9]+ .*$//g' \
-        | sed -E 's/^P a g e.*$//g' \
-        | sed -E 's/P.*Rowling//g' \
-        | sed -E 's/^\s+$//g' \
-        | ./remove_multi_newline.py \
-        | ./remove_single_newline.py \
-        | ./fix_straddling_paragraphs.py \
-        | ./combine_up_to_n_words.py "${WORD_QTY}" \
-        | ./split_by_sentences_up_to_n_words.py "${WORD_QTY}" \
-        | ./to_tsv.py "${book%.*}" \
-        | cat > "${book%.*}.tsv"
-done
-
-# cat: Concatenates book results
-# nl: Prepends line numbers (will be used as ids)
-# sed: Removes nl padding spaces at the beginning of each line
-# echo;cat: Adds the header to the file
-# cat: Concatenates each book's result into a single `corpus.tsv` file.
-
-cat Book*.tsv \
-    | nl \
-    | sed 's/^ *//g' \
-    | { echo -e 'id\ttext\ttitle'; cat; } \
-    | cat > corpus.tsv
-
->&2 echo "corpus.tsv successfully generated."
-
diff --git a/examples/harry-potter-corpus/remove_multi_newline.py b/examples/harry-potter-corpus/remove_multi_newline.py
deleted file mode 100755
index 4212ad6..0000000
--- a/examples/harry-potter-corpus/remove_multi_newline.py
+++ /dev/null
@@ -1,12 +0,0 @@
-#! /usr/bin/env python3
-
-if __name__ == "__main__":
-    from sys import stdin
-    last='dummy'
-    for line in stdin:
-        if line != '\n':
-            print(line, end='')
-        elif last != '\n':
-            print('\n', end='')
-        last = line
-
diff --git a/examples/harry-potter-corpus/remove_single_newline.py b/examples/harry-potter-corpus/remove_single_newline.py
deleted file mode 100755
index c012201..0000000
--- a/examples/harry-potter-corpus/remove_single_newline.py
+++ /dev/null
@@ -1,14 +0,0 @@
-#! /usr/bin/env python3
-
-if __name__ == "__main__":
-    import sys
-    accum=''
-    for line in sys.stdin:
-        if line != '\n':
-            accum = accum + line.rstrip('\n')
-        else:
-            print(accum)
-            accum = ''
-    if accum != '':
-        print(accum)
-
diff --git a/examples/harry-potter-corpus/split_by_sentences_up_to_n_words.py b/examples/harry-potter-corpus/split_by_sentences_up_to_n_words.py
deleted file mode 100755
index ff28a19..0000000
--- a/examples/harry-potter-corpus/split_by_sentences_up_to_n_words.py
+++ /dev/null
@@ -1,45 +0,0 @@
-#! /usr/bin/env python3
-
-def word_qty(doc) -> int:
-    count = 0
-    for i in doc:
-        if not (i.is_space or i.is_punct):
-            count = count + 1
-    return count
-
-if __name__ == "__main__":
-    import sys
-    import spacy
-
-    # Check argument present
-    if len(sys.argv) < 2:
-        print(f"Usage: python split_by_sentences_up_to_n_words.py <n>", file=sys.stderr)
-        sys.exit(1)
-
-    n = int(sys.argv[1])
-
-    nlp = spacy.load('en_core_web_sm')
-    for line in sys.stdin:
-        paragraph = line.rstrip('\n')
-        doc = nlp(paragraph)
-        word_count = word_qty(doc)
-        if word_count > n:
-            accum = ''
-            word_count_accum = 0
-            for sent in doc.sents:
-                sentence = sent.text
-                sentence_word_count = word_qty(sent)
-                if word_count_accum + sentence_word_count < n:
-                    if accum != '':
-                        accum = accum.rstrip() + ' '
-                    accum = accum + sentence
-                    word_count_accum = word_count_accum + sentence_word_count
-                else:
-                    print(accum)
-                    accum = sentence
-                    word_count_accum = sentence_word_count
-            if accum != '':
-                print(accum)
-        else:
-            print(paragraph)
-
diff --git a/examples/harry-potter-corpus/to_tsv.py b/examples/harry-potter-corpus/to_tsv.py
deleted file mode 100755
index a51a6e8..0000000
--- a/examples/harry-potter-corpus/to_tsv.py
+++ /dev/null
@@ -1,21 +0,0 @@
-#! /usr/bin/env python3
-
-if __name__ == "__main__":
-    import sys
-    import csv
-
-    # Check argument present
-    if len(sys.argv) < 2:
-        print(f"Usage: python to_tsv.py <title prefix>", file=sys.stderr)
-        sys.exit(1)
-
-    title_prefix = sys.argv[1]
-
-    doc = csv.writer(sys.stdout, delimiter='\t', lineterminator='\n')
-    counter = 1
-    for line in sys.stdin:
-        line = line.rstrip('\n')
-        data = [ line, title_prefix + " Paragraph " + str(counter) ]
-        doc.writerow(data)
-        counter = counter + 1