Commit b3c0f31

Merge pull request #14 from ibm-watson-data-lab/PythonCOSSupport
Python Cloud Object Storage Support
2 parents 52ad8e6 + 6329427 commit b3c0f31

File tree

3 files changed: +111 −30 lines changed

- python/README.md
- python/ibmos2spark/__init__.py
- python/ibmos2spark/osconfig.py

python/README.md

Lines changed: 28 additions & 12 deletions
````diff
@@ -1,16 +1,16 @@
 # ibmos2spark
 
-The package sets Spark Hadoop configurations for connecting to 
+The package sets Spark Hadoop configurations for connecting to
 IBM Bluemix Object Storage and Softlayer Account Object Storage instances. This package uses the new [stocator](https://github.com/SparkTC/stocator) driver, which implements the `swift2d` protocol, and is available
-on the latest IBM Apache Spark Service instances (and through IBM Data Science Experience). 
+on the latest IBM Apache Spark Service instances (and through IBM Data Science Experience).
 
 
-Using the `stocator` driver connects your Spark executor nodes directly 
+Using the `stocator` driver connects your Spark executor nodes directly
 to your data in object storage.
 This is an optimized, high-performance method to connect Spark to your data. All IBM Apache Spark kernels
-are instantiated with the `stocator` driver in the Spark kernel's classpath. 
-You can also run this locally by installing the [stocator driver](https://github.com/SparkTC/stocator) 
-and adding it to your local Apache Spark kernel's classpath. 
+are instantiated with the `stocator` driver in the Spark kernel's classpath.
+You can also run this locally by installing the [stocator driver](https://github.com/SparkTC/stocator)
+and adding it to your local Apache Spark kernel's classpath.
 
 ## Installation
 
````
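The hunk above mentions running locally by putting the stocator jar on your Spark classpath. A minimal sketch of one way to do that from PySpark, assuming you have already built or downloaded the stocator jar (the path below is a hypothetical placeholder):

```python
# A minimal sketch of the "run locally" note above; the jar path is a
# hypothetical placeholder -- build or download stocator first.
from pyspark import SparkConf, SparkContext

conf = SparkConf() \
    .setAppName("ibmos2spark-local") \
    .set("spark.jars", "/path/to/stocator-jar-with-dependencies.jar")
sc = SparkContext(conf=conf)  # the ibmos2spark classes can then register
                              # their swift2d/s3d configs on this context
```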

````diff
@@ -21,22 +21,38 @@ pip install --user --upgrade ibmos2spark
 ## Usage
 
 The usage of this package depends on *from where* your Object Storage instance was created. This package
-is intended to connect to IBM's Object Storage instances obtained from Bluemix or Data Science Experience
-(DSX) or from a separate account on IBM Softlayer. The instructions below show how to connect to
-either type of instance.
+is intended to connect to IBM's Object Storage instances (Swift OS). This OS can be obtained from Bluemix or Data Science Experience (DSX) or from a separate account on IBM Softlayer. The package also supports IBM Cloud Object Storage (COS).
+The instructions below show how to connect to either type of instance.
 
 The connection setup is essentially the same. But the difference for you is how you deliver the
 credentials. If your Object Storage was created with Bluemix/DSX, with a few clicks on the side-tab
 within a DSX Jupyter notebook, you can obtain your account credentials in the form of a Python dictionary.
 If your Object Storage was created with a Softlayer account, each part of the credentials will
-be found as text that you can copy and paste into the example code below. 
+be found as text that you can copy and paste into the example code below.
+
+### CloudObjectStorage / Data Science Experience
+```python
+import ibmos2spark
+
+credentials = {
+    'endpoint': 'https://s3-api.objectstorage.softlayer.net/',  # just an example; your url might be different
+    'access_key': '',
+    'secret_key': ''
+}
+
+cos = ibmos2spark.CloudObjectStorage(sc, credentials)  # sc is the SparkContext instance
+
+bucket_name = 'some_bucket_name'
+object_name = 'file1'
+data = sc.textFile(cos.url(object_name, bucket_name))
+```
 
 ### Bluemix / Data Science Experience
 
 ```python
 import ibmos2spark
 
-#To obtain these credentials in IBM Spark, click the "insert to code" 
+#To obtain these credentials in IBM Spark, click the "insert to code"
 #button below your data source found on the panel to the right of your notebook.
 
 credentials = {
````
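The new COS example above loads the object into an RDD with `sc.textFile`. Under the same assumptions (a `cos` instance plus the example bucket and object names), a Spark 2.x DataFrame read through the same `s3d` URL is a one-liner; a hedged sketch, assuming `spark` is an existing SparkSession:

```python
# A hedged follow-on to the COS example above; assumes `spark` is an existing
# SparkSession (Spark 2.x) and that `cos`, bucket_name and object_name are
# defined exactly as in the README snippet.
df = spark.read.text(cos.url(object_name, bucket_name))
df.show(5)  # peek at the first few lines of the object
```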
````diff
@@ -78,7 +94,7 @@ data = sc.textFile(slos.url(container_name, object_name))
 ```
 
 
-## License 
+## License
 
 Copyright 2016 IBM Cloud Data Services
 
````

python/ibmos2spark/__init__.py

Lines changed: 1 addition & 1 deletion
````diff
@@ -16,4 +16,4 @@
 Helper to connect to Softlayer and Bluemix ObjectStore from IBM Spark Service
 """
 from .__info__ import __version__
-from .osconfig import softlayer, bluemix
+from .osconfig import softlayer, bluemix, CloudObjectStorage
````
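With this one-line change the new class becomes importable from the package root. A quick smoke test, assuming the package is installed:

```python
# Quick smoke test of the new top-level export added by this commit.
import ibmos2spark
from ibmos2spark import softlayer, bluemix, CloudObjectStorage

print(ibmos2spark.__version__)  # the version re-exported from .__info__
```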

python/ibmos2spark/osconfig.py

Lines changed: 82 additions & 17 deletions
````diff
@@ -16,7 +16,7 @@
 and generate the swifturl.
 
 """
-
+
 import warnings
 
 def swifturl2d(name, container_name, object_name):
@@ -36,7 +36,7 @@ def __init__(self, sparkcontext, name, auth_url, tenant, username, password=None
         auth_url, tenant, username and password are string credentials for your
         Softlayer Object Store
 
-        Example: 
+        Example:
 
         slos = softlayer(sc, 'mySLOS', 'https://dal05.objectstorage.softlayer.net/auth/v1.0',
                          'IBMOS278685-10','[email protected]', 'password_234234ada')
@@ -49,19 +49,19 @@ def __init__(self, sparkcontext, name, auth_url, tenant, username, password=None
         this class should have failed when attempting to access data with swift.
 
         As of the version 0.0.7 update, support for the old protocol has been removed in
-        favor of the new swift2d/stocator protocol. 
+        favor of the new swift2d/stocator protocol.
 
-        Subsequently, the __init__ for this class has been changed! 
+        Subsequently, the __init__ for this class has been changed!
 
-        However, to support older code that may have been unused since this transition, 
+        However, to support older code that may have been unused since this transition,
         this __init__ function will check the arguments and attempt to determine
         the proper credentials. Specifically, if the <password> is None, then
-        the <tenant> argument will be interpreted as <tenant>:<username> and the 
+        the <tenant> argument will be interpreted as <tenant>:<username> and the
         <username> argument will be interpreted as the <password> value. This is because
-        the <username> for Softlayer keystone 1 authentication is equivalent to <tenant>:<username>. 
+        the <username> for Softlayer keystone 1 authentication is equivalent to <tenant>:<username>.
         For example, typical usernames look like 'IBMOS278685-10:<email>', as shown here
-        http://knowledgelayer.softlayer.com/procedure/how-do-i-access-object-storage-command-line. 
-
+        http://knowledgelayer.softlayer.com/procedure/how-do-i-access-object-storage-command-line.
+
 
         Therefore, this class will attempt to extract tenant, username and password from
         uses such as
@@ -75,22 +75,22 @@ def __init__(self, sparkcontext, name, auth_url, tenant, username, password=None
         '''
         if password is None:
             msg = '''
-            password was set to None! 
+            password was set to None!
             Attempting to interpret tenant = tenant:username and username=password.
             This is an attempt to support older code that may have missed the transition or
             errors using the old swift protocol connection to Softlayer Object Storage accounts.
             If you are seeing this warning, you should separate your tenant and username values,
-            as this support will be deprecated in the near future. 
+            as this support will be deprecated in the near future.
             '''
             warnings.warn(msg, UserWarning)
             password = username
             tenant, username = tenant.split(':')
             warnings.warn('Trying tenant {}, username {} and password {}'.format(tenant, username, password), UserWarning)
-
+
 
         self.name = name
 
-        prefix = "fs.swift2d.service." + name 
+        prefix = "fs.swift2d.service." + name
         hconf = sparkcontext._jsc.hadoopConfiguration()
         hconf.set("fs.swift2d.impl", swift2d_driver)
         hconf.set(prefix + ".auth.url", auth_url)
@@ -100,7 +100,7 @@ def __init__(self, sparkcontext, name, auth_url, tenant, username, password=None
         hconf.set(prefix + ".auth.method", "swiftauth")
         hconf.setInt(prefix + ".http.port", 8080)
         hconf.set(prefix + ".apikey", password)
-        hconf.setBoolean(prefix + ".public", public) 
+        hconf.setBoolean(prefix + ".public", public)
         hconf.set(prefix + ".use.get.auth", "true")
         hconf.setBoolean(prefix + ".location-aware", False)
         hconf.set(prefix + ".password", password)
````
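The fallback described in the docstring above means both of the following calls end up with the same credentials, with the second form emitting the deprecation UserWarning. A hedged sketch, with placeholder credential values and `sc` an existing SparkContext:

```python
# A hedged sketch of the legacy-credentials fallback described above;
# all credential values are placeholders.
import ibmos2spark

auth_url = 'https://dal05.objectstorage.softlayer.net/auth/v1.0'

# new-style call: tenant and username passed separately
slos = ibmos2spark.softlayer(sc, 'mySLOS', auth_url,
                             'IBMOS278685-10', 'user@example.com', 'password_234234ada')

# old-style call: password omitted, so <tenant> is parsed as tenant:username
# and the <username> argument is reused as the password (emits a UserWarning)
slos_old = ibmos2spark.softlayer(sc, 'mySLOS', auth_url,
                                 'IBMOS278685-10:user@example.com', 'password_234234ada')
```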
````diff
@@ -116,7 +116,7 @@ def __init__(self, sparkcontext, credentials, name=None, public=False, swift2d_d
         sparkcontext: a SparkContext object.
 
         credentials: a dictionary with the following required keys:
-
+
         auth_url
         project_id (or projectId)
         user_id (or userId)
@@ -148,12 +148,12 @@ def __init__(self, sparkcontext, credentials, name=None, public=False, swift2d_d
         try:
             user_id = credentials['user_id']
         except KeyError as e:
-            user_id = credentials['userId']
+            user_id = credentials['userId']
 
         try:
             tenant = credentials['project_id']
         except KeyError as e:
-            tenant = credentials['projectId']
+            tenant = credentials['projectId']
 
         prefix = "fs.swift2d.service." + self.name
         hconf = sparkcontext._jsc.hadoopConfiguration()
````
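The try/except pairs above are what let a Bluemix credentials dictionary use either snake_case or camelCase key spellings. A sketch with placeholder values (the remaining required keys are elided here):

```python
# Either spelling works for the keys handled by the try/except above;
# values are placeholders and the other required keys are elided.
creds_snake = {'auth_url': '...', 'project_id': '...', 'user_id': '...'}
creds_camel = {'auth_url': '...', 'projectId': '...', 'userId': '...'}
```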
````diff
@@ -170,3 +170,68 @@ def __init__(self, sparkcontext, credentials, name=None, public=False, swift2d_d
 
     def url(self, container_name, object_name):
         return swifturl2d(self.name, container_name, object_name)
+
+
+class CloudObjectStorage(object):
+
+    def __init__(self, sparkcontext, credentials, cos_id='', bucket_name=''):
+
+        '''
+        sparkcontext: a SparkContext object.
+
+        credentials: a dictionary with the following required keys:
+          * endpoint
+          * access_key
+          * secret_key
+
+        When using this on DSX, credentials and bucket_name can be obtained
+        in DSX notebooks by clicking on the data sources palette, choosing
+        the data source you want to access, and hitting "insert credentials".
+
+        cos_id [optional]: the cloud object storage unique id. It is useful
+        to keep in the class instance for further checks after initialization.
+        However, it is not mandatory for the class instance to work. This
+        value can be retrieved by calling the get_os_id function.
+
+        bucket_name (projectId in DSX) [optional]: string that identifies the
+        default bucket you want to access files from in the COS service
+        instance. In DSX, bucket_name is the same as projectId. One bucket is
+        associated with one project.
+        If this value is not specified, you need to pass it when
+        you use the url function.
+
+        Warning: creating a new instance of this class overwrites any
+        spark hadoop configs previously set on the same spark context
+        instance.
+        '''
+        self.bucket_name = bucket_name
+        self.cos_id = cos_id
+
+        # check that all required credential keys are available
+        credential_key_list = ["endpoint", "access_key", "secret_key"]
+
+        for key in credential_key_list:
+            if key not in credentials:
+                raise ValueError("Invalid input: credentials.{} is required!".format(key))
+
+        # set up the Hadoop config for the s3d (stocator) scheme
+        prefix = "fs.s3d.service"
+        hconf = sparkcontext._jsc.hadoopConfiguration()
+        hconf.set(prefix + ".endpoint", credentials['endpoint'])
+        hconf.set(prefix + ".access.key", credentials['access_key'])
+        hconf.set(prefix + ".secret.key", credentials['secret_key'])
+
+    def get_os_id(self):
+        return self.cos_id
+
+    def url(self, object_name, bucket_name=''):
+        bucket_name_var = ''
+        if bucket_name:
+            bucket_name_var = bucket_name
+        elif self.bucket_name:
+            bucket_name_var = self.bucket_name
+        else:
+            raise ValueError("Invalid input: bucket_name is required!")
+
+        return "s3d://{}.service/{}".format(bucket_name_var, object_name)
````
