Commit 4c6d13e

Author: G Adam Cox
Merge pull request #15 from ibm-watson-data-lab/ibmos2spark_COSSupport_Scala
Scala Cloud Object Storage Support +1
2 parents 15e18a5 + af05266 commit 4c6d13e

2 files changed: +119 -36 lines changed

scala/README.md

Lines changed: 46 additions & 18 deletions
@@ -1,22 +1,22 @@
# ibmos2spark

The package sets Spark Hadoop configurations for connecting to
IBM Bluemix Object Storage and Softlayer Account Object Storage instances. This package uses the new [stocator](https://github.com/SparkTC/stocator) driver, which implements the `swift2d` protocol, and is available
on the latest IBM Apache Spark Service instances (and through IBM Data Science Experience).

Using the `stocator` driver connects your Spark executor nodes directly
to your data in object storage.
This is an optimized, high-performance method to connect Spark to your data. All IBM Apache Spark kernels
are instantiated with the `stocator` driver in the Spark kernel's classpath.
You can also run this locally by installing the [stocator driver](https://github.com/SparkTC/stocator)
and adding it to your local Apache Spark kernel's classpath.


## Installation

This library is cross-built on both Scala 2.10 (for Spark 1.6.0) and Scala 2.11 (for Spark 2.0.0 and greater)

### Releases

#### SBT library dependency

@@ -69,8 +69,8 @@ Data Science Experience](http://datascience.ibm.com), will install the package.

### Snapshots

From time to time, a snapshot version may be released if fixes or new features are added.
The following snippets show how to install snapshot releases.
Replace the version number (`0.0.7`) in the following examples with the version you desire.

##### SBT library dependency
@@ -138,24 +138,52 @@ Add SNAPSHOT repository to pom.xml
## Usage

The usage of this package depends on *from where* your Object Storage instance was created. This package
-is intended to connect to IBM's Object Storage instances obtained from Bluemix or Data Science Experience
-(DSX) or from a separate account on IBM Softlayer. The instructions below show how to connect to
-either type of instance.
+is intended to connect to IBM's Object Storage instances obtained from Bluemix or Data Science Experience
+(DSX) or from a separate account on IBM Softlayer. It also supports IBM Cloud Object Storage (COS).
+The instructions below show how to connect to either type of instance.

The connection setup is essentially the same. But the difference for you is how you deliver the
credentials. If your Object Storage was created with Bluemix/DSX, with a few clicks on the side-tab
within a DSX Jupyter notebook, you can obtain your account credentials in the form of a HashMap object.
If your Object Storage was created with a Softlayer account, each part of the credentials will
be found as text that you can copy and paste into the example code below.
+
+### IBM Cloud Object Storage / Data Science Experience
+
+```scala
+import com.ibm.ibmos2spark.CloudObjectStorage
+
+// The credentials HashMap may be created for you with the
+// "insert to code" link in your DSX notebook.
+
+var credentials = scala.collection.mutable.HashMap[String, String](
+    "endPoint"->"https://identity.open.softlayer.com",
+    "accessKey"->"xx",
+    "secretKey"->"xx"
+)
+var bucketName = "myBucket"
+var objectname = "mydata.csv"
+
+var cos = new CloudObjectStorage(sc, credentials)
+var spark = SparkSession.
+    builder().
+    getOrCreate()
+
+var dfData1 = spark.
+    read.format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat").
+    option("header", "true").
+    option("inferSchema", "true").
+    load(cos.url(bucketName, objectname))
+```
+

### Bluemix / Data Science Experience


```scala
import com.ibm.ibmos2spark.bluemix

// The credentials HashMap may be created for you with the
// "insert to code" link in your DSX notebook.

var credentials = scala.collection.mutable.HashMap[String, String](
    "auth_url"->"https://identity.open.softlayer.com",
@@ -199,7 +227,7 @@ var rdd = sc.textFile(slos.url(container , objectname))
### Package Info

One can use the automatically generated object, `BuildInfo`, to obtain the package version
and other information. This object is automatically generated by the
[`sbt-buildinfo`](https://github.com/sbt/sbt-buildinfo) plugin.

```
@@ -208,9 +236,9 @@ import com.ibm.ibmos2spark.BuildInfo
var buildstring = BuildInfo.toString
var buildbmap = BuildInfo.toMap
var buildjson = BuildInfo.toJson
```

## License

Copyright 2016 IBM Cloud Data Services

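For reference, the `cos.url(bucketName, objectname)` call in the new README snippet builds an `s3d://` URI from the bucket and object names, using the `url` method added to Osconfig.scala below. A minimal sketch of that string construction, reusing the hypothetical `myBucket`/`mydata.csv` values from the README example:

```scala
// Mirrors CloudObjectStorage.url: "s3d://" + bucketName + ".service/" + objectName
val bucketName = "myBucket"   // hypothetical values from the README example
val objectname = "mydata.csv"
val uri = "s3d://" + bucketName + ".service/" + objectname
// uri == "s3d://myBucket.service/mydata.csv"
```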

scala/src/main/scala/Osconfig.scala

Lines changed: 73 additions & 18 deletions
@@ -10,14 +10,14 @@ object urlbuilder{
  }
}

/**
* softlayer class sets up a swift connection between an IBM Spark service
* instance and a Softlayer Object Storage instance.
*
* Constructor arguments

* sparkcontext: a SparkContext object.
*
* name: string that identifies this configuration. You can
*       use any string you like. This allows you to create
*       multiple configurations to different Object Storage accounts.
@@ -26,14 +26,14 @@ object urlbuilder{
* Softlayer Object Store
*/

class softlayer(sc: SparkContext, name: String, auth_url: String,
                tenant: String, username: String, password: String,
                swift2d_driver: String = "com.ibm.stocator.fs.ObjectStoreFileSystem",
                public: Boolean=false){

  val hadoopConf = sc.hadoopConfiguration;
  val prefix = "fs.swift2d.service." + name

  hadoopConf.set("fs.swift2d.impl",swift2d_driver)
  hadoopConf.set(prefix + ".auth.url",auth_url)
@@ -48,13 +48,13 @@ class softlayer(sc: SparkContext, name: String, auth_url: String,
  hadoopConf.setBoolean(prefix + ".location-aware",false)
  hadoopConf.set(prefix + ".password",password)

  def url(container_name: String, object_name:String) : String= {
    return(urlbuilder.swifturl2d(name= name, container_name,object_name))
  }
}

/**
* bluemix class sets up a swift connection between an IBM Spark service
* instance and an Object Storage instance provisioned through IBM Bluemix.
@@ -63,7 +63,7 @@ class softlayer(sc: SparkContext, name: String, auth_url: String,
* sparkcontext: a SparkContext object.

* credentials: a dictionary with the following required keys:
*
*   auth_url

*   project_id (or projectId)
@@ -73,13 +73,13 @@ class softlayer(sc: SparkContext, name: String, auth_url: String,
*   password

*   region
*
* name: string that identifies this configuration. You can
*       use any string you like. This allows you to create
*       multiple configurations to different Object Storage accounts.
*       This is not required at the moment, since credentials['name']
*       is still supported.
*
* When using this from an IBM Spark service instance that
* is configured to connect to particular Bluemix object store
* instances, the values for these credentials can be obtained
@@ -88,9 +88,9 @@ class softlayer(sc: SparkContext, name: String, auth_url: String,
*/

class bluemix(sc: SparkContext, name: String, creds: HashMap[String, String],
              swift2d_driver: String = "com.ibm.stocator.fs.ObjectStoreFileSystem",
              public: Boolean =false){

  def ifexist(credsin: HashMap[String, String], var1: String, var2: String): String = {
    if (credsin.keySet.exists(_ == var1)){
@@ -103,7 +103,7 @@ class bluemix(sc: SparkContext, name: String, creds: HashMap[String, String],
  val username = ifexist(creds, "user_id","userId")
  val tenant = ifexist(creds, "project_id","projectId")

  val hadoopConf = sc.hadoopConfiguration;
  val prefix = "fs.swift2d.service." + name;

@@ -118,10 +118,65 @@ class bluemix(sc: SparkContext, name: String, creds: HashMap[String, String],
  hadoopConf.setBoolean(prefix + ".public",public)
  hadoopConf.set(prefix + ".region",creds("region"))
  hadoopConf.setInt(prefix + ".http.port",8080)

  def url(container_name: String, object_name:String) : String= {
    return(urlbuilder.swifturl2d(name= name, container_name,object_name))
  }
}

+/**
+* CloudObjectStorage class sets up an s3d connection between an IBM Spark service
+* instance and an IBM Cloud Object Storage instance.
+*
+* Constructor arguments:
+*
+* sparkcontext: a SparkContext object.
+*
+* credentials: a dictionary with the following required keys:
+*
+*   endPoint
+*
+*   accessKey
+*
+*   secretKey
+*
+* cosId [optional]: the Cloud Object Storage unique id. It is useful
+*       to keep in the class instance for further checks after the initialization. However,
+*       it is not mandatory for the class instance to work. This value can be retrieved by
+*       calling the getCosId function.
+*
+* bucket_name (projectId in DSX) [optional]: string that identifies the default
+*       bucket name you want to access files from in the COS service instance.
+*       In DSX, bucket_name is the same as projectId. One bucket is
+*       associated with one project.
+*       If this value is not specified, you need to pass it when
+*       you use the url function.
+*
+* Warning: creating a new instance of this class overwrites any existing
+*       Spark Hadoop configs previously set with the same SparkContext instance.
+*/
+class CloudObjectStorage(sc: SparkContext, credentials: HashMap[String, String], cosId: String = "") {

+    // check that all required credentials are available
+    val requiredValues = Array("endPoint", "accessKey", "secretKey")
+    for ( key <- requiredValues ) {
+        if (!credentials.contains(key)) {
+            throw new IllegalArgumentException("Invalid input: missing required input [" + key + "]")
+        }
+    }
+
+    // set config
+    val hadoopConf = sc.hadoopConfiguration
+    val prefix = "fs.s3d.service"
+    hadoopConf.set(prefix + ".endpoint", credentials("endPoint"))
+    hadoopConf.set(prefix + ".access.key", credentials("accessKey"))
+    hadoopConf.set(prefix + ".secret.key", credentials("secretKey"))
+
+    def getCosId() : String = {
+        return cosId
+    }
+
+    def url(bucketName: String, objectName: String) : String = {
+        return "s3d://" + bucketName + ".service/" + objectName
+    }
+}
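
A minimal usage sketch of the new `CloudObjectStorage` class, assuming a live `SparkContext` named `sc` (as in a DSX notebook) and placeholder credential values; the endpoint string below is illustrative only:

```scala
import scala.collection.mutable.HashMap
import com.ibm.ibmos2spark.CloudObjectStorage

// Placeholder credentials; real values come from the COS service credentials
// (e.g. via the "insert to code" link in a DSX notebook).
val credentials = HashMap[String, String](
  "endPoint"  -> "https://<cos-endpoint>",  // placeholder, not a real endpoint
  "accessKey" -> "xx",
  "secretKey" -> "xx"
)

// Constructing the class sets the fs.s3d.service.* Hadoop configs on `sc`.
val cos = new CloudObjectStorage(sc, credentials)

// url() returns an s3d:// URI, e.g. "s3d://myBucket.service/mydata.csv".
val rdd = sc.textFile(cos.url("myBucket", "mydata.csv"))
```

Reading the same object through `spark.read`, as in the README snippet above, works once these configs are set on the context.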
