Skip already "tokenized" string in doc #11587
-
It's a little hard to understand what's going on in your code. What does your pipeline look like, and what order are the components in? Where is […] defined? You normally shouldn't call […]. Based on your sample, it looks like the way you're checking […]. I'm also not sure what you mean by "already tokenized": when you get a Doc, every part of it is already tokenized. Is your hashtag-related code running […]?
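To illustrate the usual pattern the questions above are getting at, here is a minimal sketch of a custom pipeline component that sets a custom extension attribute on tokens. All names (`mark_hashtags`, `is_hashtag`) are hypothetical, chosen for this example; the component and extension registration follow spaCy v3's documented API.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Token

# Hypothetical extension; force=True avoids "already registered" errors on re-run.
Token.set_extension("is_hashtag", default=False, force=True)

@Language.component("mark_hashtags")
def mark_hashtags(doc):
    # The default English tokenizer splits "#london" into "#" + "london",
    # so flag the token that follows a "#" token.
    for i, tok in enumerate(doc):
        if tok.text == "#" and i + 1 < len(doc):
            doc[i + 1]._.is_hashtag = True
    return doc

nlp = spacy.blank("en")
nlp.add_pipe("mark_hashtags")
doc = nlp("visiting #london today")
print([t.text for t in doc if t._.is_hashtag])
```

The important detail is that the extension is registered once with `Token.set_extension` before the pipeline runs, and every component that reads or writes it goes through the `token._.` namespace.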
-
I'm working with code that splits hashtags using multiple word lists: 1) one full of city names, 2) another with specific terms, and 3) two big, common wordlists (Spanish & English), in order to detect city names in hashtags and then geolocate them. The script works OK, but it's really slow, because it looks for lots of substrings in the city-list array after a similar word is found in a wordlist, and each match is then validated with the geonamescache library: https://github.com/yaph/geonamescache
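Not the asker's actual code, but a sketch of the kind of hashtag splitting described above, assuming the word lists are stored as Python `set`s (membership tests on a `set` are O(1), whereas scanning a list for every substring is what typically makes this slow). All names and word lists here are made up for illustration:

```python
# Hypothetical word lists; in the real script these would be the
# city-name list plus the Spanish/English common wordlists.
city_names = {"london", "paris", "madrid"}
common_words = {"visit", "love", "city"}

def split_hashtag(tag, vocab):
    """Greedy longest-match split of a hashtag into known words."""
    tag = tag.lstrip("#").lower()
    words, i = [], 0
    while i < len(tag):
        # Try the longest possible substring first, shrinking until a match.
        for j in range(len(tag), i, -1):
            if tag[i:j] in vocab:
                words.append(tag[i:j])
                i = j
                break
        else:
            return None  # some part of the tag matched nothing
    return words

print(split_hashtag("#visitlondon", city_names | common_words))
```

This prints `['visit', 'london']`. Simply converting the existing arrays to sets before the loop is often enough to remove most of the slowdown, independent of any spaCy-level skipping.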
I'm trying to skip hashtags that have already passed through the tokenizer, or at least those that the language component has already tagged.
I have this code:
It detects the `is_hashtag` attribute, but the `is_geo` check doesn't work and the `parse_tag()` function is always executed.
```python
@Language.component("mention_hashtags_set_extension")
```
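Without the full snippet one can only guess why `is_geo` never takes effect, but a common cause is checking the flag before any component has set it, or forgetting that the flag must be read through the `._.` namespace after being registered with `set_extension`. A minimal sketch of the guard pattern, with hypothetical names (`geo_hashtags`, `is_geo`):

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

# Hypothetical extension; force=True avoids "already registered" errors on re-run.
Doc.set_extension("is_geo", default=False, force=True)

@Language.component("geo_hashtags")
def geo_hashtags(doc):
    # Guard first: if the flag was already set, skip the expensive work.
    if doc._.is_geo:
        return doc
    # ... the expensive parse_tag() / geonamescache lookup would go here ...
    doc._.is_geo = True
    return doc

nlp = spacy.blank("es")
nlp.add_pipe("geo_hashtags")
doc = nlp("#madrid")
print(doc._.is_geo)
```

Note that extension values live on a specific `Doc` object: calling `nlp()` again on new text creates a fresh `Doc` with `is_geo` back at its default, so the flag cannot carry information across separate tweets by itself.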
So, is there a way for spaCy to mark already-processed strings/words (in this case, hashtags), so that I can use this mark as a condition to skip the whole process of splitting the hashtag, looking for substrings in the arrays, and validating with geonamescache, and make the whole thing faster?
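Since each `nlp()` call produces a fresh `Doc`, one way to skip hashtags seen in earlier texts is to cache the expensive splitting/validation function itself, keyed on the hashtag string. This is a generic Python technique, not a spaCy feature; `parse_tag_cached` is a hypothetical stand-in for the real `parse_tag()`:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def parse_tag_cached(tag):
    # Stand-in for the expensive split + geonamescache validation;
    # the real body would go here.
    return tag.lstrip("#").lower()

parse_tag_cached("#London")   # computed on first call
parse_tag_cached("#London")   # served from the cache on repeats
print(parse_tag_cached.cache_info().hits)
```

This prints `1`: the second call for the same hashtag never runs the function body, so repeated hashtags across tweets cost essentially nothing regardless of how the pipeline components are arranged.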