Commit 4c6d13e

Author: G Adam Cox
Merge pull request #15 from ibm-watson-data-lab/ibmos2spark_COSSupport_Scala
Scala Cloud Object Storage Support +1
2 parents 15e18a5 + af05266 commit 4c6d13e

2 files changed: +119 -36 lines changed

scala/README.md

Lines changed: 46 additions & 18 deletions
@@ -1,22 +1,22 @@
# ibmos2spark

The package sets Spark Hadoop configurations for connecting to
IBM Bluemix Object Storage and Softlayer Account Object Storage instances. This package uses the new [stocator](https://github.com/SparkTC/stocator) driver, which implements the `swift2d` protocol, and is available
on the latest IBM Apache Spark Service instances (and through IBM Data Science Experience).

Using the `stocator` driver connects your Spark executor nodes directly
to your data in object storage.
This is an optimized, high-performance method to connect Spark to your data. All IBM Apache Spark kernels
are instantiated with the `stocator` driver in the Spark kernel's classpath.
You can also run this locally by installing the [stocator driver](https://github.com/SparkTC/stocator)
and adding it to your local Apache Spark kernel's classpath.


## Installation

This library is cross-built on both Scala 2.10 (for Spark 1.6.0) and Scala 2.11 (for Spark 2.0.0 and greater)

### Releases

#### SBT library dependency

@@ -69,8 +69,8 @@ Data Science Experience](http://datascience.ibm.com), will install the package.

### Snapshots

From time to time, a snapshot version may be released if fixes or new features are added.
The following snippets show how to install snapshot releases.
Replace the version number (`0.0.7`) in the following examples with the version you desire.

##### SBT library dependency
@@ -138,24 +138,52 @@ Add SNAPSHOT repository to pom.xml
## Usage

The usage of this package depends on *from where* your Object Storage instance was created. This package
-is intended to connect to IBM's Object Storage instances obtained from Bluemix or Data Science Experience
-(DSX) or from a separate account on IBM Softlayer. The instructions below show how to connect to
-either type of instance.
+is intended to connect to IBM's Object Storage instances obtained from Bluemix or Data Science Experience
+(DSX) or from a separate account on IBM Softlayer. It also supports IBM Cloud Object Storage (COS).
+The instructions below show how to connect to either type of instance.

The connection setup is essentially the same. But the difference for you is how you deliver the
credentials. If your Object Storage was created with Bluemix/DSX, with a few clicks on the side-tab
within a DSX Jupyter notebook, you can obtain your account credentials in the form of a HashMap object.
If your Object Storage was created with a Softlayer account, each part of the credentials will
be found as text that you can copy and paste into the example code below.
+
+### IBM Cloud Object Storage / Data Science Experience
+
+```scala
+import com.ibm.ibmos2spark.CloudObjectStorage
+
+// The credentials HashMap may be created for you with the
+// "insert to code" link in your DSX notebook.
+
+var credentials = scala.collection.mutable.HashMap[String, String](
+    "endPoint"->"https://identity.open.softlayer.com",
+    "accessKey"->"xx",
+    "secretKey"->"xx"
+)
+var bucketName = "myBucket"
+var objectname = "mydata.csv"
+
+var cos = new CloudObjectStorage(sc, credentials)
+var spark = SparkSession.
+    builder().
+    getOrCreate()
+
+var dfData1 = spark.
+    read.format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat").
+    option("header", "true").
+    option("inferSchema", "true").
+    load(cos.url(bucketName, objectname))
+```
+

### Bluemix / Data Science Experience


```scala
import com.ibm.ibmos2spark.bluemix

// The credentials HashMap may be created for you with the
// "insert to code" link in your DSX notebook.

var credentials = scala.collection.mutable.HashMap[String, String](
    "auth_url"->"https://identity.open.softlayer.com",
@@ -199,7 +227,7 @@ var rdd = sc.textFile(slos.url(container , objectname))
### Package Info

One can use the automatically generated object, `BuildInfo`, to obtain the package version
and other information. This object is automatically generated by the
[`sbt-buildinfo`](https://github.com/sbt/sbt-buildinfo) plugin.

```
@@ -208,9 +236,9 @@ import com.ibm.ibmos2spark.BuildInfo
var buildstring = BuildInfo.toString
var buildbmap = BuildInfo.toMap
var buildjson = BuildInfo.toJson
```

## License

Copyright 2016 IBM Cloud Data Services

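For reference, the `cos.url(bucketName, objectname)` call in the new README snippet builds an `s3d://` URI from the bucket and object names, using the `url` method added to Osconfig.scala below. A minimal sketch of that string construction, reusing the hypothetical `myBucket`/`mydata.csv` values from the README example:

```scala
// Mirrors CloudObjectStorage.url: "s3d://" + bucketName + ".service/" + objectName
val bucketName = "myBucket"   // hypothetical values from the README example
val objectname = "mydata.csv"
val uri = "s3d://" + bucketName + ".service/" + objectname
// uri == "s3d://myBucket.service/mydata.csv"
```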

scala/src/main/scala/Osconfig.scala

Lines changed: 73 additions & 18 deletions
@@ -10,14 +10,14 @@ object urlbuilder{
  }
}

/**
* softlayer class sets up a swift connection between an IBM Spark service
* instance and a Softlayer Object Storage instance.
*
* Constructor arguments

* sparkcontext: a SparkContext object.
*
* name: string that identifies this configuration. You can
*       use any string you like. This allows you to create
*       multiple configurations to different Object Storage accounts.
@@ -26,14 +26,14 @@ object urlbuilder{
* Softlayer Object Store
*/

class softlayer(sc: SparkContext, name: String, auth_url: String,
                tenant: String, username: String, password: String,
                swift2d_driver: String = "com.ibm.stocator.fs.ObjectStoreFileSystem",
                public: Boolean=false){

  val hadoopConf = sc.hadoopConfiguration;
  val prefix = "fs.swift2d.service." + name

  hadoopConf.set("fs.swift2d.impl",swift2d_driver)
  hadoopConf.set(prefix + ".auth.url",auth_url)
@@ -48,13 +48,13 @@ class softlayer(sc: SparkContext, name: String, auth_url: String,
  hadoopConf.setBoolean(prefix + ".location-aware",false)
  hadoopConf.set(prefix + ".password",password)

  def url(container_name: String, object_name:String) : String= {
    return(urlbuilder.swifturl2d(name= name, container_name,object_name))
  }
}

/**
* bluemix class sets up a swift connection between an IBM Spark service
* instance and an Object Storage instance provisioned through IBM Bluemix.
@@ -63,7 +63,7 @@ class softlayer(sc: SparkContext, name: String, auth_url: String,
* sparkcontext: a SparkContext object.

* credentials: a dictionary with the following required keys:
*
*   auth_url

*   project_id (or projectId)
@@ -73,13 +73,13 @@ class softlayer(sc: SparkContext, name: String, auth_url: String,
*   password

*   region
*
* name: string that identifies this configuration. You can
*       use any string you like. This allows you to create
*       multiple configurations to different Object Storage accounts.
*       This is not required at the moment, since credentials['name']
*       is still supported.
*
* When using this from an IBM Spark service instance that
* is configured to connect to particular Bluemix object store
* instances, the values for these credentials can be obtained
@@ -88,9 +88,9 @@ class softlayer(sc: SparkContext, name: String, auth_url: String,
*/

class bluemix(sc: SparkContext, name: String, creds: HashMap[String, String],
              swift2d_driver: String = "com.ibm.stocator.fs.ObjectStoreFileSystem",
              public: Boolean =false){

  def ifexist(credsin: HashMap[String, String], var1: String, var2: String): String = {
    if (credsin.keySet.exists(_ == var1)){
@@ -103,7 +103,7 @@ class bluemix(sc: SparkContext, name: String, creds: HashMap[String, String],
  val username = ifexist(creds, "user_id","userId")
  val tenant = ifexist(creds, "project_id","projectId")

  val hadoopConf = sc.hadoopConfiguration;
  val prefix = "fs.swift2d.service." + name;

@@ -118,10 +118,65 @@ class bluemix(sc: SparkContext, name: String, creds: HashMap[String, String],
  hadoopConf.setBoolean(prefix + ".public",public)
  hadoopConf.set(prefix + ".region",creds("region"))
  hadoopConf.setInt(prefix + ".http.port",8080)

  def url(container_name: String, object_name:String) : String= {
    return(urlbuilder.swifturl2d(name= name, container_name,object_name))
  }
}

+/**
+* CloudObjectStorage class sets up an s3d connection between an IBM Spark service
+* instance and an IBM Cloud Object Storage instance.
+*
+* Constructor arguments:
+*
+* sparkcontext: a SparkContext object.
+*
+* credentials: a dictionary with the following required keys:
+*
+*   endPoint
+*
+*   accessKey
+*
+*   secretKey
+*
+* cosId [optional]: the Cloud Object Storage unique id. It is useful
+*       to keep in the class instance for further checks after the initialization. However,
+*       it is not mandatory for the class instance to work. This value can be retrieved by
+*       calling the getCosId function.
+*
+* bucket_name (projectId in DSX) [optional]: string that identifies the default
+*       bucket name you want to access files from in the COS service instance.
+*       In DSX, bucket_name is the same as projectId. One bucket is
+*       associated with one project.
+*       If this value is not specified, you need to pass it when
+*       you use the url function.
+*
+* Warning: creating a new instance of this class overwrites any existing
+*       Spark Hadoop configs previously set with the same SparkContext instance.
+*/
+class CloudObjectStorage(sc: SparkContext, credentials: HashMap[String, String], cosId: String = "") {

+    // check that all required credentials are available
+    val requiredValues = Array("endPoint", "accessKey", "secretKey")
+    for ( key <- requiredValues ) {
+        if (!credentials.contains(key)) {
+            throw new IllegalArgumentException("Invalid input: missing required input [" + key + "]")
+        }
+    }
+
+    // set config
+    val hadoopConf = sc.hadoopConfiguration
+    val prefix = "fs.s3d.service"
+    hadoopConf.set(prefix + ".endpoint", credentials("endPoint"))
+    hadoopConf.set(prefix + ".access.key", credentials("accessKey"))
+    hadoopConf.set(prefix + ".secret.key", credentials("secretKey"))
+
+    def getCosId() : String = {
+        return cosId
+    }
+
+    def url(bucketName: String, objectName: String) : String = {
+        return "s3d://" + bucketName + ".service/" + objectName
+    }
+}
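
A minimal usage sketch of the new `CloudObjectStorage` class, assuming a live `SparkContext` named `sc` (as in a DSX notebook) and placeholder credential values; the endpoint string below is illustrative only:

```scala
import scala.collection.mutable.HashMap
import com.ibm.ibmos2spark.CloudObjectStorage

// Placeholder credentials; real values come from the COS service credentials
// (e.g. via the "insert to code" link in a DSX notebook).
val credentials = HashMap[String, String](
  "endPoint"  -> "https://<cos-endpoint>",  // placeholder, not a real endpoint
  "accessKey" -> "xx",
  "secretKey" -> "xx"
)

// Constructing the class sets the fs.s3d.service.* Hadoop configs on `sc`.
val cos = new CloudObjectStorage(sc, credentials)

// url() returns an s3d:// URI, e.g. "s3d://myBucket.service/mydata.csv".
val rdd = sc.textFile(cos.url("myBucket", "mydata.csv"))
```

Reading the same object through `spark.read`, as in the README snippet above, works once these configs are set on the context.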
