What is the best way to extract some paragraphs from pdf extracted text #10708

info2000 · 2022-04-25T22:33:28Z

I need to extract and structure text from PDFs.
The relevant information is on only a few of the pages in each document.
From those few pages, I need to categorize some spans (initially spans, but that is my question).
These spans range from a single sentence up to 490 tokens (around 6 sentences).

What is the best way to categorize those blocks of text?
Note: the PDF-to-text conversion isn't detecting the paragraphs. That would be my first option: extract the paragraphs, categorize each one, and then run spancat inside them.

Rules are not an option, because each text comes from a different writer.

ljvmiranda921 · 2022-04-26T11:54:21Z

Hi @info2000 ,

If I may ask, what type of documents are you extracting text from? Also, what kind of categories are you looking for? This might be a case for text categorization (textcat) rather than spancat. If you need something more fine-grained, a two-stage solution might help: categorize the texts first, then run a finer-grained spancat on them. Another option is a rules-based approach followed by text categorization.

3 replies

info2000 Apr 26, 2022
Author

They are government publications of new laws,
like this document: https://www.boe.es/boe/dias/2022/01/11/pdfs/BOE-A-2022-417.pdf
The categories or spans begin on page 9.

But the PDF-to-text extraction doesn't preserve the paragraphs, and that is the deal breaker for using text categorization with confidence:

`criterios de evaluación de las solicitudes que son objetivos, públicos y conocidos previamente, y garantiza una amplia participación a sus potenciales destinatarios en su elaboración.
En la elaboración de la presente orden ha emitido informe el Servicio Jurídico y la Intervención Delegada en el Departamento, de acuerdo con lo dispuesto por el artículo 17.1 de la Ley 38/2003, de 17 de noviembre, y el artículo 61.2 del Real Decreto-ley 36/2020, de 30 de diciembre.
En su virtud, dispongo: CAPÍTULO I Disposiciones generales Artículo 1. Objeto y finalidad.

La presente orden tiene por objeto aprobar las bases reguladoras de ayudas, destinadas a impulsar proyectos de redes de actores que desarrollen experiencias turísticas sostenibles, digitales, integradoras y competitivas en España, de conformidad con lo previsto en el apartado segundo de este artículo, así como aprobar la convocatoria de ayudas para el año 2021.
Se entenderá a los efectos de esta orden que una Experiencia Turismo España impulsa proyectos de redes de actores que desarrollen experiencias turísticas sostenibles, digitales, integradoras y competitivas en España, si se desarrolla en todo el territorio nacional o, al menos, en tres comunidades autónomas, y se enmarca en alguna delas siguientes líneas de trabajo incluyendo algunas de las siguientes acciones: a) Línea de trabajo INNOVA: 1.º Propuestas de creación de redes de actores en todo el territorio nacional o, al menos, en tres comunidades autónomas para impulsar el trabajo colaborativo en torno a una Experiencia Turismo España.
2.º Construcción de relatos sobre Experiencias Turismo España.
3.º Desarrollo o mejora del recurso turístico base para la creación de Experiencias Turismo España.
4.º Rediseño de las Experiencias Turismo España hacia modelos verdes y sostenibles.
5.º Apoyo a la transformación digital de las Experiencias Turismo España.
6.º Formación para la sostenibilidad y digitalización de Experiencias Turismo España.
b) Línea de trabajo INTEGRA: 1.º Estudios y propuestas de planes de adaptación de las Experiencia Turismo España a la lógica de economía circular y otras estrategias de incorporación del tejido productivo local.
2.º Planes de adaptación de productos y servicios a lógica de economía circula y de proximidad.
3.º Implantación de buenas prácticas o mejoras que impliquen mayores impactos positivos de Experiencias Turismo España en comunidades locales.
4.º Propuestas para la incorporación de la diversidad de perfiles de turistas a distintas Experiencias Turismo España (Colectivo LGBTIQ+, mayores, diversidad de familias, distintas religiones, discapacidad entre otros…).
5.º Formación y difusión en y para la adaptación a la diversidad social de las Experiencias Turismo España.
BOLETÍN OFICIAL DEL ESTADO Núm. 9 Martes 11 de enero de 2022 Sec. III. Pág. 2539 cve: BOE-A-2022-417 Verificable en https://www.boe.es/`

ljvmiranda921 Apr 27, 2022

I see. One way to solve this is to treat it as two separate problems: (1) splitting and (2) categorizing. You can use off-the-shelf tools like layoutparser, doctr, or anything that uses LayoutLM to split documents into distinct chunks (most of the models they use were trained on scientific publications, so your mileage may vary). After splitting, you can run text categorization on each chunk.
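To make the split-then-categorize idea concrete, here is a minimal sketch. It does not use layoutparser, doctr, or LayoutLM: a plain regex heuristic stands in for the splitting step, and `placeholder_categorizer` is a hypothetical stand-in for a trained text categorization model. The boundary markers are assumptions based on the sample text earlier in the thread and would need tuning for other documents.

```python
import re
from typing import Callable, List, Tuple

# Assumed structural markers for this kind of legal text: chapter headings,
# article headings, lettered items ("a) ") and ordinal items ("1.º ").
BOUNDARY = re.compile(
    r"(?=\bCAPÍTULO\s+[IVXLC]+\b"
    r"|\bArtículo\s+\d+\.\s"
    r"|(?<!\S)[a-z]\)\s"
    r"|(?<!\S)\d+\.º\s)"
)

# Running page header/footer as it appears in the sample extraction.
PAGE_FURNITURE = re.compile(
    r"BOLETÍN OFICIAL DEL ESTADO.*?https://www\.boe\.es/", re.DOTALL
)


def split_chunks(text: str) -> List[str]:
    """Step 1: split extracted text into chunks at the assumed markers."""
    text = PAGE_FURNITURE.sub(" ", text)   # drop page header/footer text
    text = re.sub(r"\s+", " ", text)       # line breaks are unreliable, so flatten them
    parts = BOUNDARY.split(text)           # zero-width split keeps each marker with its chunk
    return [p.strip() for p in parts if p.strip()]


def categorize_chunks(
    chunks: List[str], categorize: Callable[[str], str]
) -> List[Tuple[str, str]]:
    """Step 2: apply any categorizer (e.g. a trained model) to each chunk."""
    return [(chunk, categorize(chunk)) for chunk in chunks]


def placeholder_categorizer(chunk: str) -> str:
    # Hypothetical stand-in only; the labels are meaningless. In practice this
    # would call a trained text categorization model.
    return "SHORT" if len(chunk.split()) <= 6 else "LONG"


SAMPLE = (
    "En su virtud, dispongo: CAPÍTULO I Disposiciones generales "
    "Artículo 1. Objeto y finalidad.\n"
    "La presente orden tiene por objeto aprobar las bases: "
    "a) Línea de trabajo INNOVA: 1.º Propuestas de redes.\n"
    "2.º Construcción de relatos.\n"
    "BOLETÍN OFICIAL DEL ESTADO Núm. 9 Verificable en https://www.boe.es/"
)

for chunk, label in categorize_chunks(split_chunks(SAMPLE), placeholder_categorizer):
    print(f"[{label}] {chunk}")
```

Keeping the splitter and the categorizer as separate functions means either one can be replaced later, for example by a layout model for step 1, without touching the other.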

The splits may not be perfect, which can affect how well the text categorization works, but you could add a manual QA check in the middle to ensure that the data fed into the text categorization model is good enough.
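One possible shape for that QA step, as a rough sketch: automatically flag chunks that look suspicious and send only those to a human reviewer. The 490-token ceiling comes from the original question; the lower bound, the marker strings, and all function names are assumptions made for illustration, and tokens are approximated by whitespace splitting.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

MAX_TOKENS = 490  # longest expected span, per the original question
MIN_TOKENS = 3    # assumed lower bound; very short chunks are often split debris
# Page header/footer strings seen in the sample extraction.
FURNITURE_MARKERS = ("BOLETÍN OFICIAL DEL ESTADO", "cve:")


@dataclass
class QAResult:
    chunk: str
    issues: List[str] = field(default_factory=list)

    @property
    def needs_review(self) -> bool:
        return bool(self.issues)


def check_chunk(chunk: str) -> QAResult:
    """Collect the reasons, if any, why a chunk should be looked at by a person."""
    result = QAResult(chunk)
    n_tokens = len(chunk.split())  # whitespace tokens, an approximation
    if n_tokens > MAX_TOKENS:
        result.issues.append("too_long")
    if n_tokens < MIN_TOKENS:
        result.issues.append("too_short")
    if any(marker in chunk for marker in FURNITURE_MARKERS):
        result.issues.append("page_furniture")
    return result


def triage(chunks: List[str]) -> Tuple[List[str], List[QAResult]]:
    """Split chunks into those ready for categorization and those needing review."""
    accepted: List[str] = []
    review: List[QAResult] = []
    for chunk in chunks:
        result = check_chunk(chunk)
        if result.needs_review:
            review.append(result)
        else:
            accepted.append(chunk)
    return accepted, review


example_chunks = [
    "Artículo 1. Objeto y finalidad.",
    "5.º Formación y difusión. BOLETÍN OFICIAL DEL ESTADO Núm. 9",
    "a)",
    "palabra " * 500,
]
accepted, review = triage(example_chunks)
print(len(accepted), "accepted;", len(review), "flagged for manual review")
for item in review:
    print(item.issues, item.chunk[:40])
```

Only the accepted chunks would go straight to the text categorization model; the flagged ones are fixed or discarded by hand first.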

info2000 Apr 28, 2022
Author

Many thanks, I hadn't come across those tools.
Awesome solution.
