Pandas Dataframe integration with Doc saving #10914
For a school project, I am tasked with calculating similarity scores between search terms and product titles. For this, we use rather big datasets that are stored as CSV files. I want to read the data into pandas DataFrames and map the relevant columns to spaCy Doc objects by applying the `nlp` object from the `en_core_web_lg` model, like this:

```python
def parse_dataframes(dataframes: dict[str, pd.DataFrame], colnames: list[str]) -> dict[str, pd.DataFrame]:
    dfs: dict[str, pd.DataFrame] = {}
    nlp: spacy.Language = spacy.load('en_core_web_lg')
    for df_name, df in dataframes.items():
        for col in colnames:
            df[col] = df[col].apply(nlp)
        dfs.update({df_name: df})
    return dfs
```
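To make the input shapes concrete, this is roughly how `dataframes` and `colnames` are built before calling `parse_dataframes` (the CSV file names and dictionary keys below are placeholders, not the actual ones from the project):

```python
import pandas as pd

# Hypothetical input files and dataframe keys, for illustration only.
dataframes: dict[str, pd.DataFrame] = {
    'train': pd.read_csv('train.csv'),
    'test': pd.read_csv('test.csv'),
}
colnames: list[str] = ['search_term', 'product_title']

parsed_dataframes = parse_dataframes(dataframes, colnames)
```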
This works, but because the dataset is rather huge and I don't want to wait multiple hours every time I change a little bit of code afterwards, I tried to save the parsed data. This can of course not be done by simply writing everything back to a CSV file, as all of spaCy's useful vector info would get lost. Using spaCy's `Doc.to_disk` method, I came up with this:

```python
def store_as_docs(dataframes: dict[str, pd.DataFrame], colnames: list[str]) -> None:
    for df_name, df in dataframes.items():
        if not os.path.exists(df_dir := os.path.join(os.getcwd(), '..', 'spacy_docs', df_name)):
            os.mkdir(df_dir)
        for index, row in df.iterrows():
            for col in colnames:
                doc = row[col]
                if not os.path.exists(col_dir := os.path.join(df_dir, col)):
                    os.mkdir(col_dir)
                doc.to_disk(os.path.join(df_dir, col, str(index)))
```

For instance, say I'm now only interested in a subset of the dataframes. This creates a directory tree under `spacy_docs` with one folder per dataframe, one subfolder per column, and one Doc file per row index.
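To double-check what actually ends up on disk, a small sketch like the following (assuming the same `spacy_docs` location as above) walks the tree and prints it; this is only a convenience for inspection, not part of the pipeline:

```python
import os

# Walk the spacy_docs directory created by store_as_docs and print its layout.
root = os.path.join(os.getcwd(), '..', 'spacy_docs')
for dirpath, dirnames, filenames in os.walk(root):
    rel = os.path.relpath(dirpath, root)
    depth = 0 if rel == '.' else rel.count(os.sep) + 1
    print('    ' * depth + os.path.basename(dirpath) + os.sep)
    for name in sorted(filenames)[:3]:  # only the first few Doc files per column
        print('    ' * (depth + 1) + name)
```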
However, when I try to read the data back into the dataframes (which is a lot faster than the previously mentioned couple of hours), something weird happens with the datatypes. This is my current approach:

```python
def create_doc_dfs(dataframes: dict[str, pd.DataFrame], colnames: list[str]) -> dict[str, pd.DataFrame]:
    dfs: dict[str, pd.DataFrame] = {}
    empty_vocab = spacy.vocab.Vocab()
    for df_name, df in dataframes.items():
        df_path = os.path.join(os.getcwd(), '..', 'spacy_docs', df_name)
        for col_name in colnames:
            docs: list[spacy.tokens.doc.Doc] = []
            col_path = os.path.join(df_path, col_name)
            for i in range(len(df)):
                doc = spacy.tokens.Doc(empty_vocab).from_disk(os.path.join(col_path, str(i)))
                docs.append(doc)
            df[col_name] = docs
        dfs.update({df_name: df})
    return dfs
```

This was tweaked a bit, as directly storing the freshly read-in Docs back into the dataframe didn't work right away. This is how I'm calling the function, and the data I'm printing:
```python
loaded_dataframes: dict[str, pd.DataFrame] = create_doc_dfs(dataframes, colnames)

for name, dataframe in loaded_dataframes.items():
    print(name)
    print(dataframe.head(5))
    print(dataframe.columns)
    print()
    for index, row in dataframe.head(5).iterrows():
        print(type(row['search_term']))
        print(row['search_term'], '<->', row['product_title'], row['search_term'].similarity(row['product_title']))
        print()
```

But this is what happens:
My question is: why does (some of) the (meta)data seem to get lost, and what can I do to preserve it? Thank you in advance!
Replies: 1 comment 1 reply
Word vectors are keyed to the Vocab, so when you use an empty Vocab you lose all that information. For serializing Docs you should use a DocBin and the same pipeline (`nlp` object) you used when creating the Docs; see the serialization guide.
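For reference, here is a minimal sketch of the DocBin approach described above; the column name, file name, and example texts are placeholders:

```python
import pandas as pd
import spacy
from spacy.tokens import DocBin

nlp = spacy.load('en_core_web_lg')

# Placeholder column; in practice this would be one column of the real dataframe.
df = pd.DataFrame({'search_term': ['angle bracket', 'wood screws']})
df['search_term'] = list(nlp.pipe(df['search_term']))

# Save: pack all Docs from the column into one DocBin and write a single file.
doc_bin = DocBin(store_user_data=True)
for doc in df['search_term']:
    doc_bin.add(doc)
doc_bin.to_disk('search_term.spacy')

# Load: pass the same pipeline's vocab so the word vectors are available again.
loaded = DocBin().from_disk('search_term.spacy')
df['search_term'] = list(loaded.get_docs(nlp.vocab))
```

The same point applies to the per-file approach in the question: passing `nlp.vocab` from the loaded `en_core_web_lg` pipeline (rather than an empty `Vocab`) to `Doc(...).from_disk(...)` keeps the restored Docs tied to the vocabulary that actually holds the vectors.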