xslt
The XSLTProcessor class defined in spark-xml-utils provides methods for transforming a record by applying a stylesheet. The record is assumed to be a string of XML.
The following import is required for the XSLTProcessor.
	import com.elsevier.spark_xml_utils.xslt.XSLTProcessor
The only thing required is the stylesheet to use for the transformation. I typically store the stylesheet (as a string) in an S3 bucket, from which it can be retrieved with sc.textFile. Because sc.textFile returns one element per line, taking collect.head yields the complete stylesheet only if it is stored on a single line. Alternatively, the stylesheet could be defined in the code as a string.
	val stylesheet = sc.textFile("/some-bucket/darin/spark-stylesheets/srctitle.xsl").collect.head
	val proc = XSLTProcessor.getInstance(stylesheet)
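For the in-code alternative, a minimal sketch might look like the following. The stylesheet body, the `srctitle` element name, and the sample record are all made up for illustration, and `version="2.0"` assumes an XSLT 2.0 capable processor underneath. Only `getInstance` and `transform` are taken from the examples on this page.

```scala
import com.elsevier.spark_xml_utils.xslt.XSLTProcessor

// Hypothetical stylesheet, defined inline instead of being read from S3.
// It emits the text of the <srctitle> element (an assumed element name).
val stylesheet =
  """<?xml version="1.0" encoding="UTF-8"?>
    |<xsl:stylesheet version="2.0"
    |    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    |  <xsl:output method="text"/>
    |  <xsl:template match="/">
    |    <xsl:value-of select="//srctitle"/>
    |  </xsl:template>
    |</xsl:stylesheet>""".stripMargin

val proc = XSLTProcessor.getInstance(stylesheet)

// Apply the stylesheet to a single (made-up) record held on the driver.
val result = proc.transform("<doc><srctitle>Example Journal</srctitle></doc>")
```

Defining the stylesheet inline keeps small transformations next to the code that uses them. For larger stylesheets, a separate file is probably easier to maintain.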
The result of a transform operation is the result of applying the stylesheet to the content (a string of XML). The transformation can occur locally on the driver (if you have returned records to the driver) or on the workers. In practice, the transformation will typically occur on the workers, but I will show examples of both.
When transforming locally on the driver, the code would be something like the following. In the example below, local is an Array of (String,String) where the first item is the key and the second item is the string of XML.
	import com.elsevier.spark_xml_utils.xslt.XSLTProcessor
	val stylesheet = sc.textFile("/some-bucket/darin/spark-stylesheets/srctitle.xsl").collect.head
	val proc = XSLTProcessor.getInstance(stylesheet)
	val localSrctitles = local.map(rec => proc.transform(rec._2))
When transforming on the workers, the code would be something like the following. In the example below, xmlKeyPair is an RDD of (String,String) where the first item is the key and the second item is the string of XML.
	import com.elsevier.spark_xml_utils.xslt.XSLTProcessor
	val stylesheet = sc.textFile("/some-bucket/darin/spark-stylesheets/srctitle.xsl").collect.head
	val srctitles = xmlKeyPair.mapPartitions(recsIter => {
                      val proc = XSLTProcessor.getInstance(stylesheet)
                      recsIter.map(rec => proc.transform(rec._2))
                    })
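The worker example above returns only the transformed strings. If the key needs to travel with each result, a sketch along the following lines may work. It assumes the same `xmlKeyPair` RDD is already defined and loads the stylesheet the same way as above. `srctitlesByKey` is a name introduced here for illustration.

```scala
import com.elsevier.spark_xml_utils.xslt.XSLTProcessor

// Assumes xmlKeyPair: RDD[(String, String)] is already defined, as in the example above.
val stylesheet = sc.textFile("/some-bucket/darin/spark-stylesheets/srctitle.xsl").collect.head

val srctitlesByKey = xmlKeyPair.mapPartitions(recsIter => {
  // The processor is created on the worker, once per partition rather than once per record.
  val proc = XSLTProcessor.getInstance(stylesheet)
  // Pair each key with the result of transforming its XML.
  recsIter.map { case (key, xml) => (key, proc.transform(xml)) }
})
```

The result is again an RDD of (String,String), so it can be joined back to other keyed data or saved alongside the original keys.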
If an error is encountered during the transformation, the error is logged and an exception is raised.
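Because a failed transformation raises an exception, a single malformed record can fail the whole Spark task. One possible way to isolate bad records is to wrap each call in `scala.util.Try`. The sketch below assumes the `xmlKeyPair` RDD and the `stylesheet` string from the worker example. The names `attempts`, `transformed`, and `failedKeys` are introduced here for illustration and are not part of spark-xml-utils.

```scala
import scala.util.{Failure, Success, Try}
import com.elsevier.spark_xml_utils.xslt.XSLTProcessor

// Assumes xmlKeyPair: RDD[(String, String)] and stylesheet: String are already defined.
val attempts = xmlKeyPair.mapPartitions(recsIter => {
  val proc = XSLTProcessor.getInstance(stylesheet)
  // Try captures any non-fatal exception instead of letting it fail the task.
  recsIter.map { case (key, xml) => (key, Try(proc.transform(xml))) }
})

// Cache so that the two filters below do not run the transformation twice.
attempts.cache()

// Records that transformed cleanly, still paired with their keys.
val transformed = attempts.collect { case (key, Success(out)) => (key, out) }

// Keys of the records whose transformation raised an exception.
val failedKeys = attempts.collect { case (key, Failure(_)) => key }
```

Here `collect` with a partial function is the RDD transformation that filters and maps in one step, so the results stay distributed. Whether skipping bad records is acceptable depends on the job. Letting the exception propagate, as the library does by default, is the safer choice when every record must be transformed.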
I have successfully used XSLTProcessor from the spark-shell and from notebook environments (such as Databricks and Zeppelin). Depending on the environment, you just need to get the spark-xml-utils.jar installed and available to the driver and workers. For the spark-shell, something like the following would work.
	cd {spark-install-dir}
	./bin/spark-shell --jars lib/uber-spark-xml-utils-1.2.0.jar