Skip to content

Commit bc9db8d

Browse files
committed
Implement access for Google Cloud buckets with Requester Pays set - nextflow-io#1466
Signed-off-by: Emilio Palumbo <[email protected]>
1 parent 9eda533 commit bc9db8d

File tree

9 files changed

+165
-29
lines changed

9 files changed

+165
-29
lines changed

docs/google.rst

Lines changed: 22 additions & 21 deletions
Original file line numberDiff line numberDiff line change
@@ -169,7 +169,7 @@ When completed, shutdown the cluster instances by using the following command::
169169

170170
Replace ``my-cluster`` with the name used in your execution.
171171

172-
Preemptible instances
172+
Preemptible instances
173173
---------------------
174174

175175
An optional parameter allows you to set the instance to be preemptible. Both master and worker instances can be set to
@@ -230,7 +230,7 @@ autoscaler. For example::
230230
}
231231

232232
By doing so it is possible to create a cluster with a single node i.e. the master node. The autoscaler will then
233-
automatically add the missing instances, up to the number defined by the ``minInstances`` attributes.
233+
automatically add the missing instances, up to the number defined by the ``minInstances`` attributes.
234234

235235

236236
Limitation
@@ -301,7 +301,7 @@ Example::
301301
.. warning:: Make sure to specify in the above setting the project ID not the project name.
302302

303303
.. Note:: A container image must be specified to deploy the process execution. You can use a different Docker image for
304-
each process using one or more :ref:`config-process-selectors`.
304+
each process using one or more :ref:`config-process-selectors`.
305305

306306
Process definition
307307
------------------
@@ -384,7 +384,7 @@ specify the local storage for the jobs computed locally::
384384
nextflow run <script or project name> -bucket-dir gs://my-bucket/some/path
385385

386386
.. warning:: The Google Storage path needs to contain at least sub-directory. Don't use only the
387-
bucket name e.g. ``gs://my-bucket``.
387+
bucket name e.g. ``gs://my-bucket``.
388388

389389
Limitation
390390
----------
@@ -470,21 +470,22 @@ Example::
470470

471471
The following configuration options are available:
472472

473-
======================================= =================
474-
Name Description
475-
======================================= =================
476-
google.project The Google Project Id to use for the pipeline execution.
477-
google.region The Google *region* where the computation is executed in Compute Engine VMs. Multiple regions can be provided separating them by a comma. Do not specify if a zone is provided. See `available Compute Engine regions and zones <https://cloud.google.com/compute/docs/regions-zones/>`_
478-
google.zone The Google *zone* where the computation is executed in Compute Engine VMs. Multiple zones can be provided separating them by a comma. Do not specify if a region is provided. See `available Compute Engine regions and zones <https://cloud.google.com/compute/docs/regions-zones/>`_
479-
google.location The Google *location* where the job executions are deployed to Cloud Life Sciences API. See `available Cloud Life Sciences API locations <https://cloud.google.com/life-sciences/docs/concepts/locations>`_ (default: the same as the region or the zone specified).
480-
google.lifeSciences.bootDiskSize Set the size of the virtual machine boot disk e.g `50.GB` (default: none).
481-
google.lifeSciences.copyImage The container image run to copy input and output files. It must include the ``gsutil`` tool (default: ``google/cloud-sdk:alpine``).
482-
google.lifeSciences.debug When ``true`` copies the `/google` debug directory in that task bucket directory (defualt: ``false``)
483-
google.lifeSciences.preemptible When ``true`` enables the usage of *preemptible* virtual machines or ``false`` otherwise (default: ``true``)
484-
google.lifeSciences.usePrivateAddress When ``true`` the VM will NOT be provided with a public IP address, and only contain an internal IP. If this option is enabled, the associated job can only load docker images from Google Container Registry, and the job executable cannot use external services other than Google APIs (default: ``false``). Requires version `20.03.0-edge` or later.
485-
google.lifeSciences.sshDaemon When ``true`` runs SSH daemon in the VM carrying out the job to which it's possible to connect for debugging purposes (default: ``false``).
486-
google.lifeSciences.sshImage The container image used to run the SSH daemon (default: ``gcr.io/cloud-genomics-pipelines/tools``).
487-
======================================= =================
473+
============================================== =================
474+
Name Description
475+
============================================== =================
476+
google.project The Google Project Id to use for the pipeline execution.
477+
google.region The Google *region* where the computation is executed in Compute Engine VMs. Multiple regions can be provided separating them by a comma. Do not specify if a zone is provided. See `available Compute Engine regions and zones <https://cloud.google.com/compute/docs/regions-zones/>`_
478+
google.zone The Google *zone* where the computation is executed in Compute Engine VMs. Multiple zones can be provided separating them by a comma. Do not specify if a region is provided. See `available Compute Engine regions and zones <https://cloud.google.com/compute/docs/regions-zones/>`_
479+
google.location The Google *location* where the job executions are deployed to Cloud Life Sciences API. See `available Cloud Life Sciences API locations <https://cloud.google.com/life-sciences/docs/concepts/locations>`_ (default: the same as the region or the zone specified).
480+
google.enableRequesterPaysBuckets When ``true`` uses the configured Google project id as the billing project for storage access. This is required when accessing data from *reqester pays enabled* buckets. See `Requester Pays on Google Cloud Storage documentation <https://cloud.google.com/storage/docs/requester-pays>`_ (default: ``false``)
481+
google.lifeSciences.bootDiskSize Set the size of the virtual machine boot disk e.g `50.GB` (default: none).
482+
google.lifeSciences.copyImage The container image run to copy input and output files. It must include the ``gsutil`` tool (default: ``google/cloud-sdk:alpine``).
483+
google.lifeSciences.debug When ``true`` copies the `/google` debug directory in that task bucket directory (defualt: ``false``)
484+
google.lifeSciences.preemptible When ``true`` enables the usage of *preemptible* virtual machines or ``false`` otherwise (default: ``true``)
485+
google.lifeSciences.usePrivateAddress When ``true`` the VM will NOT be provided with a public IP address, and only contain an internal IP. If this option is enabled, the associated job can only load docker images from Google Container Registry, and the job executable cannot use external services other than Google APIs (default: ``false``). Requires version `20.03.0-edge` or later.
486+
google.lifeSciences.sshDaemon When ``true`` runs SSH daemon in the VM carrying out the job to which it's possible to connect for debugging purposes (default: ``false``).
487+
google.lifeSciences.sshImage The container image used to run the SSH daemon (default: ``gcr.io/cloud-genomics-pipelines/tools``).
488+
============================================== =================
488489

489490

490491
Process definition
@@ -617,6 +618,6 @@ Troubleshooting
617618

618619
* Enable the optional SSH daemon in the job VM using the option ``google.lifeSciences.sshDaemon = true``
619620

620-
* Make sure you are choosing a `location` where `Cloud Life Sciences API is available <https://cloud.google.com/life-sciences/docs/concepts/locations>`_,
621+
* Make sure you are choosing a `location` where `Cloud Life Sciences API is available <https://cloud.google.com/life-sciences/docs/concepts/locations>`_,
621622
and a `region` or `zone` where `Compute Engine is available <https://cloud.google.com/compute/docs/regions-zones/>`_.
622-
623+

modules/nf-google/src/main/nextflow/cloud/google/lifesciences/GoogleLifeSciencesConfig.groovy

Lines changed: 8 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -21,11 +21,13 @@ import java.nio.file.Path
2121

2222
import groovy.json.JsonSlurper
2323
import groovy.transform.CompileStatic
24+
import groovy.transform.Memoized
2425
import groovy.transform.ToString
2526
import groovy.util.logging.Slf4j
2627
import nextflow.Session
2728
import nextflow.exception.AbortOperationException
2829
import nextflow.util.MemoryUnit
30+
2931
/**
3032
* Helper class wrapping configuration required for Google Pipelines.
3133
*
@@ -55,6 +57,7 @@ class GoogleLifeSciencesConfig {
5557
Integer debugMode
5658
String copyImage
5759
boolean usePrivateAddress
60+
boolean enableRequesterPaysBuckets
5861

5962
@Deprecated
6063
GoogleLifeSciencesConfig(String project, List<String> zone, List<String> region, Path remoteBinDir = null, boolean preemptible = false) {
@@ -71,6 +74,7 @@ class GoogleLifeSciencesConfig {
7174

7275
GoogleLifeSciencesConfig() {}
7376

77+
@Memoized
7478
static GoogleLifeSciencesConfig fromSession(Session session) {
7579
try {
7680
fromSession0(session.config)
@@ -110,6 +114,7 @@ class GoogleLifeSciencesConfig {
110114
final copyImage = config.navigate('google.lifeSciences.copyImage', DEFAULT_COPY_IMAGE) as String
111115
final debugMode = config.navigate('google.lifeSciences.debug', System.getenv('NXF_DEBUG'))
112116
final privateAddr = config.navigate('google.lifeSciences.usePrivateAddress') as boolean
117+
final requesterPays = config.navigate('google.enableRequesterPaysBuckets') as boolean
113118

114119
def zones = (config.navigate("google.zone") as String)?.split(",")?.toList() ?: Collections.<String>emptyList()
115120
def regions = (config.navigate("google.region") as String)?.split(",")?.toList() ?: Collections.<String>emptyList()
@@ -127,7 +132,8 @@ class GoogleLifeSciencesConfig {
127132
copyImage: copyImage,
128133
sshDaemon: sshDaemon,
129134
sshImage: sshImage,
130-
usePrivateAddress: privateAddr )
135+
usePrivateAddress: privateAddr,
136+
enableRequesterPaysBuckets: requesterPays)
131137
}
132138

133139
static private Integer debugMode0(value) {
@@ -172,4 +178,4 @@ class GoogleLifeSciencesConfig {
172178
throw new AbortOperationException("Missing Google credentials file: $credsFilePath")
173179
}
174180
}
175-
}
181+
}

modules/nf-google/src/main/nextflow/cloud/google/lifesciences/GoogleLifeSciencesFileCopyStrategy.groovy

Lines changed: 10 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -58,6 +58,13 @@ class GoogleLifeSciencesFileCopyStrategy extends SimpleFileCopyStrategy {
5858
final createDirectories = []
5959
final stagingCommands = []
6060

61+
final gsutilPrefix = new StringBuilder()
62+
gsutilPrefix.append("gsutil -m -q")
63+
64+
if(config.enableRequesterPaysBuckets) {
65+
gsutilPrefix.append(" -u ${config.project}")
66+
}
67+
6168
for( String stageName : inputFiles.keySet() ) {
6269
final storePath = inputFiles.get(stageName)
6370
final storePathIsDir = storePath.isDirectory()
@@ -72,13 +79,13 @@ class GoogleLifeSciencesFileCopyStrategy extends SimpleFileCopyStrategy {
7279
}
7380

7481
if(storePathIsDir) {
75-
stagingCommands << "gsutil -m -q cp -R $escapedStoreUri/ $localTaskDir".toString()
82+
stagingCommands << "$gsutilPrefix cp -R $escapedStoreUri/ $localTaskDir".toString()
7683
//check if we need to move the directory (gsutil doesn't support renaming directories on copy)
7784
if(parent || !storePath.toString().endsWith(stageName)) {
7885
stagingCommands << "mv $localTaskDir/${Escape.path(storePath.name)} $localTaskDir/$escapedStageName".toString()
7986
}
8087
} else {
81-
stagingCommands << "gsutil -m -q cp $escapedStoreUri $localTaskDir/$escapedStageName".toString()
88+
stagingCommands << "$gsutilPrefix cp $escapedStoreUri $localTaskDir/$escapedStageName".toString()
8289
}
8390
}
8491

@@ -159,4 +166,4 @@ class GoogleLifeSciencesFileCopyStrategy extends SimpleFileCopyStrategy {
159166
return result.toString()
160167
}
161168

162-
}
169+
}

modules/nf-google/src/main/nextflow/cloud/google/util/GsPathFactory.groovy

Lines changed: 23 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -18,8 +18,11 @@ package nextflow.cloud.google.util
1818

1919
import java.nio.file.Path
2020

21+
import com.google.cloud.storage.contrib.nio.CloudStorageConfiguration
2122
import com.google.cloud.storage.contrib.nio.CloudStorageFileSystem
2223
import groovy.transform.CompileStatic
24+
import nextflow.Global
25+
import nextflow.cloud.google.lifesciences.GoogleLifeSciencesConfig
2326
import nextflow.file.FileSystemPathFactory
2427

2528
/**
@@ -30,14 +33,32 @@ import nextflow.file.FileSystemPathFactory
3033
@CompileStatic
3134
class GsPathFactory extends FileSystemPathFactory {
3235

36+
@Lazy
37+
static private final CloudStorageConfiguration storageConfig = {
38+
return getCloudStorageConfig()
39+
} ()
40+
41+
static private CloudStorageConfiguration getCloudStorageConfig() {
42+
def session = Global.getSession() as nextflow.Session
43+
if (!session)
44+
new IllegalStateException("Cannot initialize GsPathFactory: missing session")
45+
46+
final config = GoogleLifeSciencesConfig.fromSession(session)
47+
final builder = CloudStorageConfiguration.builder()
48+
if (config.enableRequesterPaysBuckets) {
49+
builder.userProject(config.project)
50+
}
51+
return builder.build()
52+
}
53+
3354
@Override
3455
protected Path parseUri(String uri) {
3556
if( !uri.startsWith('gs://') )
3657
return null
3758
final str = uri.substring(5)
3859
final p = str.indexOf('/')
3960
return ( p==-1
40-
? CloudStorageFileSystem.forBucket(str).getPath('')
41-
: CloudStorageFileSystem.forBucket(str.substring(0,p)).getPath(str.substring(p)) )
61+
? CloudStorageFileSystem.forBucket(str, storageConfig).getPath('')
62+
: CloudStorageFileSystem.forBucket(str.substring(0,p), storageConfig).getPath(str.substring(p)) )
4263
}
4364
}

modules/nf-google/src/test/nextflow/cloud/google/lifesciences/GoogleLifeSciencesConfigTest.groovy

Lines changed: 13 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -228,4 +228,17 @@ class GoogleLifeSciencesConfigTest extends Specification {
228228
'foo' | 'us-central1'
229229
}
230230

231+
def 'should set requester pays' () {
232+
when:
233+
def config = GoogleLifeSciencesConfig.fromSession0([google:[project:'foo', region:'x', lifeSciences: [:]]])
234+
then:
235+
config.enableRequesterPaysBuckets == false
236+
237+
when:
238+
config = GoogleLifeSciencesConfig.fromSession0([google:[project:'foo', region:'x', enableRequesterPaysBuckets:true]])
239+
then:
240+
config.enableRequesterPaysBuckets == true
241+
242+
}
243+
231244
}

modules/nf-google/src/test/nextflow/cloud/google/lifesciences/GoogleLifeSciencesFileCopyStrategyTest.groovy

Lines changed: 35 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -101,6 +101,41 @@ class GoogleLifeSciencesFileCopyStrategyTest extends GoogleSpecification {
101101
'''.stripIndent()
102102
}
103103

104+
def 'create stage files using Requester Pays' () {
105+
given:
106+
def bean = Mock(TaskBean) {
107+
getWorkDir() >> mockGsPath('gs://my-bucket/work/xx/yy')
108+
}
109+
def handler = Mock(GoogleLifeSciencesTaskHandler) {
110+
getExecutor() >> Mock(GoogleLifeSciencesExecutor) {
111+
getConfig() >> GoogleLifeSciencesConfig.fromSession0([google:[project:'foo', region:'x', enableRequesterPaysBuckets:true]])
112+
}
113+
}
114+
and:
115+
def strategy = new GoogleLifeSciencesFileCopyStrategy(bean, handler)
116+
117+
// file with the same name
118+
when:
119+
def inputs = ['foo.txt': mockGsPath('gs://my-bucket/bar/foo.txt')]
120+
def result = strategy.getStageInputFilesScript(inputs)
121+
then:
122+
result == '''\
123+
echo start | gsutil -q cp -c - gs://my-bucket/work/xx/yy/.command.begin
124+
gsutil -m -q -u foo cp gs://my-bucket/bar/foo.txt /work/xx/yy/foo.txt
125+
'''.stripIndent()
126+
127+
// file a directory name
128+
when:
129+
inputs = ['dir1': mockGsPath('gs://my-bucket/foo/dir1', true)]
130+
result = strategy.getStageInputFilesScript(inputs)
131+
then:
132+
result == '''\
133+
echo start | gsutil -q cp -c - gs://my-bucket/work/xx/yy/.command.begin
134+
gsutil -m -q -u foo cp -R gs://my-bucket/foo/dir1/ /work/xx/yy
135+
'''.stripIndent()
136+
137+
}
138+
104139
def 'should unstage files' () {
105140
given:
106141
def bean = Mock(TaskBean) {

modules/nf-google/src/test/nextflow/cloud/google/util/GsPathPathFactoryTest.groovy

Lines changed: 37 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -18,6 +18,14 @@ package nextflow.cloud.google.util
1818

1919

2020
import spock.lang.Specification
21+
22+
import com.google.cloud.storage.contrib.nio.CloudStorageConfiguration
23+
import com.google.cloud.storage.contrib.nio.CloudStorageFileSystem
24+
import nextflow.Global
25+
import nextflow.Session
26+
import nextflow.cloud.google.lifesciences.GoogleLifeSciencesConfig
27+
import nextflow.config.ConfigBuilder
28+
2129
/**
2230
*
2331
* @author Paolo Di Tommaso <[email protected]>
@@ -26,6 +34,9 @@ class GsPathPathFactoryTest extends Specification {
2634

2735
def 'should create gs path' () {
2836
given:
37+
def sess = new Session()
38+
sess.config = [google:[project:'foo', region:'x']]
39+
Global.session = sess
2940
def factory = new GsPathFactory()
3041

3142
expect:
@@ -39,4 +50,30 @@ class GsPathPathFactoryTest extends Specification {
3950
_ | 'gs://f o o/bar'
4051
_ | 'gs://f_o_o/bar'
4152
}
53+
54+
def 'should use requester pays' () {
55+
given:
56+
def sess = new Session()
57+
sess.config = [google:[project:'foo', region:'x', enableRequesterPaysBuckets:true]]
58+
Global.session = sess
59+
60+
when:
61+
def storageConfig = GsPathFactory.getCloudStorageConfig()
62+
63+
then:
64+
storageConfig.userProject() == 'foo'
65+
}
66+
67+
def 'should not use requester pays' () {
68+
given:
69+
def sess = new Session()
70+
sess.config = [google:[project:'foo', region:'x', lifeSciences: [:]]]
71+
Global.session = sess
72+
73+
when:
74+
def storageConfig = GsPathFactory.getCloudStorageConfig()
75+
76+
then:
77+
storageConfig.userProject() == null
78+
}
4279
}

modules/nf-google/src/test/nextflow/extension/FilesExTest2.groovy

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -22,6 +22,8 @@ import spock.lang.Unroll
2222
import java.nio.file.Path
2323

2424
import com.google.cloud.storage.contrib.nio.CloudStoragePath
25+
import nextflow.Global
26+
import nextflow.Session
2527

2628
/**
2729
*
@@ -33,6 +35,9 @@ class FilesExTest2 extends Specification {
3335
def 'should return uri string for #PATH' () {
3436

3537
when:
38+
def sess = new Session()
39+
sess.config = [google:[project:'foo', region:'x']]
40+
Global.session = sess
3641
def path = PATH as Path
3742
then:
3843
path instanceof CloudStoragePath

modules/nf-google/src/test/nextflow/file/FileHelperGsTest.groovy

Lines changed: 12 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -22,6 +22,9 @@ import java.nio.file.Paths
2222
import com.google.cloud.storage.contrib.nio.CloudStorageFileSystem
2323
import spock.lang.Specification
2424

25+
import nextflow.Global
26+
import nextflow.Session
27+
2528
/**
2629
*
2730
* @author Paolo Di Tommaso <[email protected]>
@@ -30,6 +33,11 @@ class FileHelperGsTest extends Specification {
3033

3134
def 'should parse google storage path' () {
3235

36+
given:
37+
def sess = new Session()
38+
sess.config = [google:[project:'foo', region:'x']]
39+
Global.session = sess
40+
3341
expect:
3442
FileHelper.asPath('file.txt') ==
3543
Paths.get('file.txt')
@@ -44,7 +52,7 @@ class FileHelperGsTest extends Specification {
4452
and:
4553
FileHelper.asPath('gs://foo/b a r.txt') ==
4654
CloudStorageFileSystem.forBucket('foo').getPath('/b a r.txt')
47-
55+
4856
and:
4957
FileHelper.asPath('gs://f o o/bar.txt') ==
5058
CloudStorageFileSystem.forBucket('f o o').getPath('/bar.txt')
@@ -57,6 +65,9 @@ class FileHelperGsTest extends Specification {
5765

5866
def 'should strip ending slash' () {
5967
given:
68+
def sess = new Session()
69+
sess.config = [google:[project:'foo', region:'x']]
70+
Global.session = sess
6071
def nxFolder = Paths.get('/my-bucket/foo')
6172
def nxNested = Paths.get('/my-bucket/foo/bar/')
6273
and:

0 commit comments

Comments
 (0)