Deprecated Pipeline Examples

The following example pipelines are deprecated in favour of the Simple Pipeline, which can be run via Apache Beam as well as the API Server.

The pipelines documented here can only be run via Apache Beam. However, for complex requirements this might be the right choice.

Grobid Example Pipeline

This pipeline uses Grobid for the actual conversion.

To run the example conversion with the defaults:

python -m sciencebeam_pipelines.examples.grobid_service_pdf_to_xml --input "/path/to/pdfs/*/*.pdf"

That will automatically download and run a Grobid Service instance.

Or specify the Grobid URL and file suffix (in that case the Grobid Service is assumed to be running):

python -m sciencebeam_pipelines.examples.grobid_service_pdf_to_xml --input "/path/to/pdfs/*/*.pdf" \
 --grobid-url http://localhost:8080 --output-suffix .tei-header.xml
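Because the pipeline assumes the service is already running when `--grobid-url` is given, it can help to check that first. The following is a minimal sketch, not part of the project: `isalive_url` and `grobid_is_alive` are hypothetical helper names, and the `/api/isalive` path is an assumption based on Grobid's REST API that should be verified against the Grobid version you run.

```python
import urllib.request


def isalive_url(grobid_url):
    """Build the health-check URL (assumes Grobid exposes /api/isalive)."""
    return grobid_url.rstrip("/") + "/api/isalive"


def grobid_is_alive(grobid_url, timeout=5):
    """Return True if the service answers the health check with HTTP 200."""
    try:
        with urllib.request.urlopen(isalive_url(grobid_url), timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection refused, timeouts and HTTP errors (URLError is an OSError).
        return False


if __name__ == "__main__":
    # Same URL as passed to --grobid-url in the command above.
    print(grobid_is_alive("http://localhost:8080"))
```

If the check returns False, either start the Grobid Service or omit `--grobid-url` so that the pipeline starts one itself.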

Or specify an XSLT transformation, e.g. using grobid-jats.xsl:

python -m sciencebeam_pipelines.examples.grobid_service_pdf_to_xml --input "/path/to/pdfs/*/*.pdf" \
 --xslt-path grobid-jats.xsl

Assuming you have already authenticated with Google's Cloud SDK, you can also work with buckets by specifying the URL:

python -m sciencebeam_pipelines.examples.grobid_service_pdf_to_xml --input "gs://example_bucket/path/to/pdfs/*.pdf"

Extending the Pipeline (deprecated)

You can use the grobid_service_pdf_to_xml.py example as a template and add your own steps.
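As a rough illustration of what adding a step could look like, the sketch below wires one extra transform into an Apache Beam pipeline. It is not taken from `grobid_service_pdf_to_xml.py`: the `add_word_count` step, the `(path, xml_text)` item shape and the `run` function are all hypothetical, and the real example may structure its collections differently.

```python
def add_word_count(item):
    """Hypothetical extra step: annotate a (path, xml_text) pair with a word count."""
    path, xml_text = item
    return path, xml_text, len(xml_text.split())


def run(items):
    """Apply the step in a Beam pipeline (requires `pip install apache-beam`)."""
    # Imported lazily so the step function stays testable without Beam installed.
    import apache_beam as beam

    with beam.Pipeline() as p:
        _ = (
            p
            | "CreateItems" >> beam.Create(items)      # stand-in for the converted files
            | "AddWordCount" >> beam.Map(add_word_count)
            | "Print" >> beam.Map(print)
        )


if __name__ == "__main__":
    run([("a.pdf", "<TEI>some converted text</TEI>")])
```

Keeping each step a plain function that takes and returns ordinary values makes it possible to unit-test the step on its own before adding it to the pipeline.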
