
Let me back up a step and ask a few questions for clarification:

  • which model are you loading for nlp?
  • are you only interested in tokenizing your text or are you interested in annotation from additional pipeline components (part-of-speech tags, parses, entities, etc.)?
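If tokenization is all you need, the trained components can be skipped entirely. A minimal sketch, assuming plain English text (the example sentence is illustrative): a blank pipeline contains only the tokenizer, so it avoids the cost of tagging, parsing, and entity recognition.

```python
import spacy

# spacy.blank("en") creates a pipeline with only the tokenizer --
# no trained components, so processing is very fast.
nlp = spacy.blank("en")

doc = nlp("This is a sentence.")
print([token.text for token in doc])
```

If you do need annotations from a trained model, you can instead pass `disable=` to `spacy.load` to switch off the components you don't use.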

General comments:

  • we'd recommend splitting your input text into smaller logical chunks (paragraphs, pages, sections) for processing with spaCy
  • 20 minutes is an extremely long time for a single text no matter what (do you have very slow custom pipeline components? are you running a transformer model on CPU? is it possibly thrashing?)
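The chunking suggestion above can be sketched as follows. This is an illustrative splitter on blank lines (your real text may need a smarter segmenter), combined with `nlp.pipe`, which streams texts in batches and is faster than calling `nlp()` on each chunk in a loop:

```python
import spacy

# A blank pipeline is used here so the sketch runs standalone;
# substitute your own trained pipeline in practice.
nlp = spacy.blank("en")

text = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."

# Split on blank lines into paragraph-sized chunks (illustrative).
chunks = [p.strip() for p in text.split("\n\n") if p.strip()]

# nlp.pipe processes the chunks as a stream in batches.
docs = list(nlp.pipe(chunks))
print(len(docs))
```

Processing many small docs this way keeps memory bounded, whereas a single multi-megabyte text forces spaCy to hold every token and annotation for the whole document in memory at once.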

Answer selected by polm