The following example pipelines are deprecated in favour of the Simple Pipeline (which can be run via Apache Beam as well as the API Server).
The pipelines documented here can only be run via Apache Beam. However, for complex requirements they may still be the right choice.
This pipeline runs Grobid, which is used for the actual conversion.
To run the example conversion with the defaults:
python -m sciencebeam_pipelines.examples.grobid_service_pdf_to_xml --input "/path/to/pdfs/*/*.pdf"

That will automatically download and run a Grobid Service instance.
Or specify the Grobid URL and file suffix (in that case the Grobid Service is assumed to be running):
python -m sciencebeam_pipelines.examples.grobid_service_pdf_to_xml --input "/path/to/pdfs/*/*.pdf" \
  --grobid-url http://localhost:8080 --output-suffix .tei-header.xml

Or specify an XSLT transformation, e.g. using grobid-jats.xsl:
python -m sciencebeam_pipelines.examples.grobid_service_pdf_to_xml --input "/path/to/pdfs/*/*.pdf" \
  --xslt-path grobid-jats.xsl

Assuming you have already authenticated with Google's Cloud SDK, you can also work with buckets by specifying the URL:
python -m sciencebeam_pipelines.examples.grobid_service_pdf_to_xml --input "gs://example_bucket/path/to/pdfs/*.pdf"

You can use the grobid_service_pdf_to_xml.py example as a template and add your own steps.
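If you want to script the invocations shown above rather than type them by hand, a small wrapper can assemble the argument list. The following is a minimal sketch, not part of the project: `build_command` is a hypothetical helper name, and it only uses the flags that appear in the commands above (`--input`, `--grobid-url`, `--output-suffix`, `--xslt-path`).

```python
import shlex
import sys
from typing import List, Optional

# Module path taken from the commands documented above.
MODULE = "sciencebeam_pipelines.examples.grobid_service_pdf_to_xml"


def build_command(
    input_pattern: str,
    grobid_url: Optional[str] = None,
    output_suffix: Optional[str] = None,
    xslt_path: Optional[str] = None,
) -> List[str]:
    """Assemble the argument list for the example pipeline.

    Hypothetical helper for illustration; optional flags are only
    added when a value is given, so the pipeline defaults apply otherwise.
    """
    command = [sys.executable, "-m", MODULE, "--input", input_pattern]
    if grobid_url:
        command += ["--grobid-url", grobid_url]
    if output_suffix:
        command += ["--output-suffix", output_suffix]
    if xslt_path:
        command += ["--xslt-path", xslt_path]
    return command


if __name__ == "__main__":
    command = build_command(
        "/path/to/pdfs/*/*.pdf",
        grobid_url="http://localhost:8080",
        output_suffix=".tei-header.xml",
    )
    # Print the shell-quoted command for inspection before running it.
    print(shlex.join(command))
```

To actually start the pipeline, the resulting list could be passed to `subprocess.run(command, check=True)`. Passing a list instead of a shell string means the glob pattern reaches the pipeline unexpanded, just as the quoted pattern does in the commands above.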