Pandas Dataframe integration with Doc saving #10914
For a school project, I am tasked with calculating similarity scores between search terms and product titles. For this, we use rather big datasets that are stored as CSV files. I want to read the data into pandas DataFrames and map the relevant columns to spaCy Doc objects by applying the `nlp` object from the `en_core_web_lg` model, like this:

```python
def parse_dataframes(dataframes: dict[str, pd.DataFrame], colnames: list[str]) -> dict[str, pd.DataFrame]:
    dfs: dict[str, pd.DataFrame] = {}
    nlp: spacy.Language = spacy.load('en_core_web_lg')
    for df_name, df in dataframes.items():
        for col in colnames:
            df[col] = df[col].apply(nlp)
        dfs.update({df_name: df})
    return dfs
```
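To make the input shapes concrete, this is roughly how `dataframes` and `colnames` are built before calling `parse_dataframes` (the CSV file names and dictionary keys below are placeholders, not the actual ones from the project):

```python
import pandas as pd

# Hypothetical input files and dataframe keys, for illustration only.
dataframes: dict[str, pd.DataFrame] = {
    'train': pd.read_csv('train.csv'),
    'test': pd.read_csv('test.csv'),
}
colnames: list[str] = ['search_term', 'product_title']

parsed_dataframes = parse_dataframes(dataframes, colnames)
```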
This works, but because the dataset is rather huge and I don't want to wait multiple hours every time I change a little bit of code afterwards, I tried to save the parsed data. This can of course not be done by simply writing everything back to a CSV file, as all of spaCy's useful vector info would get lost. Using spaCy's `Doc.to_disk` method, I came up with this:

```python
def store_as_docs(dataframes: dict[str, pd.DataFrame], colnames: list[str]) -> None:
    for df_name, df in dataframes.items():
        if not os.path.exists(df_dir := os.path.join(os.getcwd(), '..', 'spacy_docs', df_name)):
            os.mkdir(df_dir)
        for index, row in df.iterrows():
            for col in colnames:
                doc = row[col]
                if not os.path.exists(col_dir := os.path.join(df_dir, col)):
                    os.mkdir(col_dir)
                doc.to_disk(os.path.join(df_dir, col, str(index)))
```

For instance, say I'm now only interested in a subset of the dataframes. This creates a directory tree under `spacy_docs` with one folder per dataframe, one subfolder per column, and one Doc file per row index.
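To double-check what actually ends up on disk, a small sketch like the following (assuming the same `spacy_docs` location as above) walks the tree and prints it; this is only a convenience for inspection, not part of the pipeline:

```python
import os

# Walk the spacy_docs directory created by store_as_docs and print its layout.
root = os.path.join(os.getcwd(), '..', 'spacy_docs')
for dirpath, dirnames, filenames in os.walk(root):
    rel = os.path.relpath(dirpath, root)
    depth = 0 if rel == '.' else rel.count(os.sep) + 1
    print('    ' * depth + os.path.basename(dirpath) + os.sep)
    for name in sorted(filenames)[:3]:  # only the first few Doc files per column
        print('    ' * (depth + 1) + name)
```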
However, when I try to read the data back into the dataframes (which is a lot faster than the previously mentioned couple of hours), something weird happens with the datatypes. This is my current approach:

```python
def create_doc_dfs(dataframes: dict[str, pd.DataFrame], colnames: list[str]) -> dict[str, pd.DataFrame]:
    dfs: dict[str, pd.DataFrame] = {}
    empty_vocab = spacy.vocab.Vocab()
    for df_name, df in dataframes.items():
        df_path = os.path.join(os.getcwd(), '..', 'spacy_docs', df_name)
        for col_name in colnames:
            docs: list[spacy.tokens.doc.Doc] = []
            col_path = os.path.join(df_path, col_name)
            for i in range(len(df)):
                doc = spacy.tokens.Doc(empty_vocab).from_disk(os.path.join(col_path, str(i)))
                docs.append(doc)
            df[col_name] = docs
        dfs.update({df_name: df})
    return dfs
```

This was tweaked a bit, as directly storing the freshly read-in Docs back into the dataframe didn't work right away. This is how I'm calling the function, and the data I'm printing:
```python
loaded_dataframes: dict[str, pd.DataFrame] = create_doc_dfs(dataframes, colnames)

for name, dataframe in loaded_dataframes.items():
    print(name)
    print(dataframe.head(5))
    print(dataframe.columns)
    print()
    for index, row in dataframe.head(5).iterrows():
        print(type(row['search_term']))
        print(row['search_term'], '<->', row['product_title'], row['search_term'].similarity(row['product_title']))
        print()
```

But this is what happens:
My question is: why does (some of) the (meta)data seem to get lost, and what can I do to preserve it? Thank you in advance!
Replies: 1 comment 1 reply
Word vectors are keyed to the Vocab, so when you use an empty Vocab you lose all that information. For serializing Docs you should use a DocBin and the same pipeline (`nlp` object) you used when creating the Docs; see the serialization guide.
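For reference, here is a minimal sketch of the DocBin approach described above; the column name, file name, and example texts are placeholders:

```python
import pandas as pd
import spacy
from spacy.tokens import DocBin

nlp = spacy.load('en_core_web_lg')

# Placeholder column; in practice this would be one column of the real dataframe.
df = pd.DataFrame({'search_term': ['angle bracket', 'wood screws']})
df['search_term'] = list(nlp.pipe(df['search_term']))

# Save: pack all Docs from the column into one DocBin and write a single file.
doc_bin = DocBin(store_user_data=True)
for doc in df['search_term']:
    doc_bin.add(doc)
doc_bin.to_disk('search_term.spacy')

# Load: pass the same pipeline's vocab so the word vectors are available again.
loaded = DocBin().from_disk('search_term.spacy')
df['search_term'] = list(loaded.get_docs(nlp.vocab))
```

The same point applies to the per-file approach in the question: passing `nlp.vocab` from the loaded `en_core_web_lg` pipeline (rather than an empty `Vocab`) to `Doc(...).from_disk(...)` keeps the restored Docs tied to the vocabulary that actually holds the vectors.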