Help me segment #13616
I am using the spaCy sentencizer to build a sentence array for a page I downloaded from Confluence. I want a whole table to be treated as a single sentence after running the text through nlp.
Each table is preceded by 'SRT' and followed by 'END', e.g.:
SRT| Name | Wert | Bemerkung |
| --- | --- | --- |
| d_1_SiO2 | 70 [nm] | HCl-Oxid |
| d_2_Nitr | 500[nm] | TS: 13522 |
| d_3_SiO2 | 70 [nm] | HTO |
| d_4_MoSi2 | 100 [nm] | MoSi2, nominelle Dicke PVD |
| d_5_SiO2 | 50 [nm] | HTO |
| d_6_Nitr | 100[nm] | TS: 13522 |END
However, after sentencizing, it looks like this:
SRT|
,| d_1_SiO2 | 70 [nm] |
,HCl-Oxid |
| d_2_Nitr | 500[nm] | TS: 13522 |
| d_3_SiO2 | 70 [nm] |
,HTO |
| d_4_MoSi2
,| 100 [nm] | MoSi2, nominelle Dicke PVD |
| d_5_SiO2 | 50 [nm] | HTO |
| d_6_Nitr
,| 100[nm] | TS:
,13522 |END
As you can see, it has turned the elements inside the table into separate sentences as well.
This is the code that I have used:
import spacy
from spacy.language import Language
from spacy.tokens import Doc

num_sentence_chunk_size = 5
nlp2 = spacy.load('de_core_news_lg')

@Language.component('table_segmentor')
def table_segmentor(doc: Doc) -> Doc:
    # mark the token right after 'SRT|' as a sentence start
    # and the '|END' token as not starting a sentence
    for i, token in enumerate(doc[:-1]):
        if token.text == 'SRT|':
            doc[i + 1].is_sent_start = True
            print('found srt')
        elif token.text == '|END':
            print('found end')
            doc[i].is_sent_start = False
    return doc

nlp2.add_pipe('table_segmentor', before='parser')

# f_text holds the text of the downloaded Confluence page
s_text_spacy = list(nlp2(f_text).sents)
s_t_s = [str(sentence) for sentence in s_text_spacy]
print(len(s_text_spacy))
for item in s_t_s:
    print(',' + item)
It has correctly identified all the starting and ending points of the table, but I can't get it to work properly. Any help would be appreciated.
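One thing I suspect (though I'm not sure): my component only sets is_sent_start on the token right after 'SRT|' and on the '|END' token itself, so the parser is still free to start new sentences on every token in between. Would something like the following be the right direction? It is a rough, untested sketch that marks every token inside the table as not being a sentence start, assuming (as my output suggests) that 'SRT|' and '|END' each come out of the tokenizer as single tokens:

import spacy
from spacy.language import Language
from spacy.tokens import Doc

@Language.component('table_segmentor')
def table_segmentor(doc: Doc) -> Doc:
    in_table = False
    table_start = False  # True only for the first token after 'SRT|'
    for token in doc:
        if token.text == 'SRT|':
            in_table = True
            table_start = True
        elif token.text == '|END':
            token.is_sent_start = False  # keep the END marker inside the table sentence
            in_table = False
        elif in_table:
            # the first table token opens a new sentence, every other one stays inside it
            token.is_sent_start = table_start
            table_start = False
    return doc

nlp2 = spacy.load('de_core_news_lg')
nlp2.add_pipe('table_segmentor', before='parser')

Since the component runs before the parser, I would expect the parser to respect these pre-set boundaries and keep the whole table in one sentence, but I haven't verified this on the full page yet.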