Can't read google cloud storage data with pyspark in VSCode jupyter #11050

vikasd22 · 2022-08-04T13:05:21Z

vikasd22
Aug 4, 2022

Hello,

I have seen a peculiar issue in VS Code Jupyter, where I cannot read Google Cloud Storage parquet files with PySpark. It works in JupyterLab in the browser with no problem. For example, if I do this in VS Code Jupyter:

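from pyspark.sql import SparkSession

# pyspark config
spark_config = {
    "spark.sql.execution.arrow.pyspark.enabled": "true",
    "spark.driver.memory": "8g",
    "spark.executor.memory": "8g"
}

# create the session with the config above
builder = SparkSession.builder
for key, value in spark_config.items():
    builder = builder.config(key, value)
spark = builder.getOrCreate()

# reading gcs data
df = spark.read.parquet("gs://ds-source/datalake/ymd=20220301/")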

I get the following error:

An error occurred while calling o78.parquet.
: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "gs"
	at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3281)

Note that the above code works fine as a Python script.
I have already played around with the config and with different versions of the VS Code Jupyter extension. All versions seem to have this issue. I am not sure if it is a bug. This seems to be PySpark specific: pandas doesn't seem to have any problem reading the files.

I have already tried this Stack Overflow solution as well:

https://stackoverflow.com/questions/55595263/how-to-fix-no-filesystem-for-scheme-gs-in-pyspark

I would love it if anyone could shed some light on this.
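For context, "No FileSystem for scheme gs" generally means that Hadoop has no file system implementation registered for gs:// paths in the JVM that the session started, which usually points to the GCS connector jar not being on the classpath. The following is a minimal sketch of registering the connector explicitly, not a confirmed fix for this thread. The jar path is hypothetical, and the helper names gcs_connector_config and build_session are made up for illustration.

```python
def gcs_connector_config(connector_jar):
    """Return Spark settings that register the GCS connector for gs:// paths.

    connector_jar is the local path to a downloaded gcs-connector jar
    (the path used below is a placeholder, not a real location).
    """
    return {
        # Ship the connector jar to the driver and executors.
        "spark.jars": connector_jar,
        # Map the gs:// scheme to the connector's FileSystem classes.
        "spark.hadoop.fs.gs.impl":
            "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
        "spark.hadoop.fs.AbstractFileSystem.gs.impl":
            "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS",
    }


def build_session(extra_config):
    """Create a SparkSession with the given settings applied."""
    # Imported lazily so the config helper works without pyspark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder
    for key, value in extra_config.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Hypothetical jar location; adjust to wherever the connector was downloaded.
config = gcs_connector_config("/path/to/gcs-connector-hadoop3-shaded.jar")
# spark = build_session(config)
```

Note that getOrCreate() returns an already running session if one exists, so settings such as spark.jars only take effect after the notebook kernel is restarted.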

rchiodo · 2022-08-04T18:00:21Z

rchiodo
Aug 4, 2022
Collaborator

My guess is that there's some bash script that needs to run first. You might try closing all instances of VS Code and then launching it from a bash shell. That should get VS Code to inherit the environment of the bash shell.
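One way to test the environment theory is to print the same set of variables from the working Python script and from the failing notebook, then compare the two. This is a rough sketch: the list of variable names is an assumption about what commonly affects how PySpark finds Spark, Hadoop, and credentials, and the function names are made up for illustration.

```python
import os

# Variables that commonly influence how PySpark locates Spark, Hadoop,
# extra jars and credentials. Illustrative list, not exhaustive.
SPARK_ENV_VARS = [
    "SPARK_HOME",
    "SPARK_CONF_DIR",
    "SPARK_DIST_CLASSPATH",
    "HADOOP_HOME",
    "HADOOP_CONF_DIR",
    "PYSPARK_SUBMIT_ARGS",
    "GOOGLE_APPLICATION_CREDENTIALS",
    "JAVA_HOME",
]


def spark_env_snapshot(environ=None):
    """Return the Spark-related variables of the current process (None if unset)."""
    environ = os.environ if environ is None else environ
    return {name: environ.get(name) for name in SPARK_ENV_VARS}


def diff_snapshots(working, failing):
    """Return {name: (working_value, failing_value)} for variables that differ."""
    names = working.keys() | failing.keys()
    return {
        name: (working.get(name), failing.get(name))
        for name in names
        if working.get(name) != failing.get(name)
    }


# Run this in both the script and the notebook cell and compare the output.
for name, value in spark_env_snapshot().items():
    print(f"{name}={value}")
```

Any variable that is set in the script but missing in the notebook is a candidate for whatever the shell startup files provide.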

2 replies

vikasd22 Aug 5, 2022
Author

@rchiodo I use a remote instance and log in there with the Remote-SSH extension. How do you propose I do this? Should I install VS Code on the remote instance?

rchiodo Aug 5, 2022
Collaborator

Installing VS Code on the remote might work.

Setting the environment variables over SSH might work too, but it would probably be rather hard to get the environment right:
https://superuser.com/questions/163167/when-sshing-how-can-i-set-an-environment-variable-on-the-server-that-changes-f

And that presumes my assumption is correct: that some part of the environment isn't set up correctly.
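If the missing piece does turn out to be environment that only a login shell sets up, one possible workaround on the remote machine is to copy that environment into the notebook kernel before the Spark session is created. This is a sketch under that assumption, not a verified fix. The helper names are hypothetical, and it assumes bash and the env command are available on the remote host.

```python
import os
import subprocess


def parse_env_output(text):
    """Parse the output of `env` into a dict.

    Lines without NAME=value form are skipped, so values that span
    multiple lines are not handled correctly.
    """
    env = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep and name.isidentifier():
            env[name] = value
    return env


def login_shell_env(shell="bash"):
    """Return the environment a login shell would have.

    `-l` makes bash read its login startup files (for example
    ~/.bash_profile or ~/.profile); files that are only read by
    interactive shells may not be picked up.
    """
    result = subprocess.run(
        [shell, "-lc", "env"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_env_output(result.stdout)


# In the notebook, before the first SparkSession is created:
# os.environ.update(login_shell_env())
```

This only helps if it runs before the JVM is started, so the kernel needs to be restarted first.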
