EDIT: this helped; the docs may need to be updated:
sc.hadoopConfiguration.set("fs.defaultFS", "s3a://commoncrawl/")
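For anyone landing here, this is roughly how that line fits with the rest of the session from this report. It is a sketch of the workaround rather than a confirmed fix: it reuses only the calls shown in the transcript below, the credentials stay redacted, and I am assuming the order of the configuration calls does not matter as long as they run before loadArchives.

```scala
// Run inside spark-shell; `sc` is the SparkContext the shell provides.
import io.archivesunleashed._, io.archivesunleashed.matchbox._

// Credentials, as in the documented setup (values redacted).
sc.hadoopConfiguration.set("fs.s3a.access.key", "REDACTED")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "REDACTED")

// Workaround: make the bucket the default filesystem so that the path
// is no longer checked against file:///.
sc.hadoopConfiguration.set("fs.defaultFS", "s3a://commoncrawl/")

// Same call as in the failing session.
RecordLoader.loadArchives("s3a://commoncrawl/crawl-data/CC-MAIN-2023-50/segments/1700679103558.93/wat/CC-MAIN-20231211045204-20231211075204-00000.warc.wat.gz", sc)
  .keepValidPages()
  .map(r => ExtractDomain(r.getUrl))
  .countItems()
  .take(10)
```

One side effect to keep in mind: with fs.defaultFS pointing at the bucket, paths without a scheme are also resolved against s3a://commoncrawl/ for the rest of the session, so local files would need an explicit file:// prefix.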
Describe the bug
According to the docs, aut should be able to read data from s3a URLs, but every approach I've tried gives the same result (Wrong FS: ..., expected: file:///).
This specific run is built from docker-aut @ b64c02a343ad02ac36e84a2393ed52d86f0fb4ee, but a standalone Sparkling build does the same thing. I would file the ticket against Sparkling, but your docs actually exist, and no good deed goes unpunished ;)
I've verified that the credentials I'm providing can read this file using aws s3 cp etc.
alex@alex-work-pc:~/dev/docker-aut$ docker run -it sha256:f6a21678154c9603e5e4b3f453fa043083812bc40456469e3059edb7c5a3b36d
24/02/22 00:10:56 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://231b303af8fa:4040
Spark context available as 'sc' (master = local[*], app id = local-1708560659354).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 3.3.1
/_/
Using Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 11.0.16)
Type in expressions to have them evaluated.
Type :help for more information.
scala> import io.archivesunleashed._, io.archivesunleashed.matchbox._
import io.archivesunleashed._
import io.archivesunleashed.matchbox._
scala> sc.hadoopConfiguration.set("fs.s3a.access.key", "REDACTED")
scala> sc.hadoopConfiguration.set("fs.s3a.secret.key", "REDACTED")
scala> RecordLoader.loadArchives("s3a://commoncrawl/crawl-data/CC-MAIN-2023-50/segments/1700679103558.93/wat/CC-MAIN-20231211045204-20231211075204-00000.warc.wat.gz", sc).keepValidPages().map(r => ExtractDomain(r.getUrl)).countItems().take(10)
java.lang.IllegalArgumentException: Wrong FS: s3a://commoncrawl/crawl-data/CC-MAIN-2023-50/segments/1700679103558.93/wat/CC-MAIN-20231211045204-20231211075204-00000.warc.wat.gz, expected: file:///
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:649)
at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:82)
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:606)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:428)
at org.apache.hadoop.fs.Globber.getFileStatus(Globber.java:57)
at org.apache.hadoop.fs.Globber.glob(Globber.java:252)
at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1657)
at org.archive.webservices.sparkling.io.HdfsIO.files(HdfsIO.scala:156)
at org.archive.webservices.sparkling.util.RddUtil$.loadFilesLocality(RddUtil.scala:61)
at io.archivesunleashed.package$RecordLoader$.loadArchives(package.scala:95)
... 51 elided
scala>
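My reading of the trace, which may be wrong: the glob in HdfsIO.files ends up on RawLocalFileSystem, so the path is being checked against the default filesystem (file:///) rather than one chosen from the s3a scheme. The sketch below mimics only the scheme comparison that produces a "Wrong FS" error. The checkPath function here is a hypothetical stand-in written for illustration, not Hadoop's code (the real check also compares authorities), and the path is a shortened prefix of the one above.

```scala
import java.net.URI

// Simplified stand-in for the scheme check behind "Wrong FS".
// `fsUri` plays the role of the filesystem the path is checked against.
def checkPath(fsUri: URI, path: URI): Unit = {
  val pathScheme = path.getScheme
  // Paths without a scheme pass; otherwise the schemes must match.
  if (pathScheme != null && !pathScheme.equalsIgnoreCase(fsUri.getScheme))
    throw new IllegalArgumentException(s"Wrong FS: $path, expected: $fsUri")
}

val s3Path = new URI("s3a://commoncrawl/crawl-data/")

// Checked against the local default filesystem: the schemes differ, so this fails.
val localResult = scala.util.Try(checkPath(new URI("file:///"), s3Path))

// Checked against the bucket as default filesystem: the schemes match, so this passes.
val s3Result = scala.util.Try(checkPath(new URI("s3a://commoncrawl/"), s3Path))
```

If that reading is right, it would explain why setting fs.defaultFS to the bucket helps, and it suggests the proper fix would be to resolve the filesystem from the path itself instead of relying on the default.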
To Reproduce
Steps to reproduce the behavior (e.g.):
- git clone git@github.com:archivesunleashed/docker-aut.git && cd docker-aut
- docker build .
- docker run -it <hash of above>
- import packages and set credentials as documented
- eval
RecordLoader.loadArchives("s3a://commoncrawl/crawl-data/CC-MAIN-2023-50/segments/1700679103558.93/wat/CC-MAIN-20231211045204-20231211075204-00000.warc.wat.gz", sc).keepValidPages().map(r => ExtractDomain(r.getUrl)).countItems().take(10)
Expected behavior
A DataFrame is returned by RecordLoader ;)
Screenshots
If applicable, add screenshots to help explain your problem.
Environment information
- AUT version: HEAD of docker-aut, currently at b64c02a343ad02ac36e84a2393ed52d86f0fb4ee
- OS: Ubuntu 23.10, but Docker
- Java version: 11 (from Dockerfile)
- Apache Spark version: 3.3.1 (from Dockerfile)
- Apache Spark w/aut: (from Dockerfile)
- Apache Spark command used to run AUT: docker run -it sha256:f6a21678154c9603e5e4b3f453fa043083812bc40456469e3059edb7c5a3b36d