s3a URLs don't work as in documentation #556

@acruise

Description

EDIT: this helped; the docs may need to be updated:

sc.hadoopConfiguration.set("fs.defaultFS", "s3a://commoncrawl/")
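
For completeness, here is the full working sequence in one place (same imports, URL, and redacted credentials as the failing run below; the defaultFS line is the only addition):

```scala
import io.archivesunleashed._, io.archivesunleashed.matchbox._

// The workaround: make s3a the default filesystem so path globbing
// resolves against S3 instead of the local file:/// filesystem.
sc.hadoopConfiguration.set("fs.defaultFS", "s3a://commoncrawl/")
sc.hadoopConfiguration.set("fs.s3a.access.key", "REDACTED")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "REDACTED")

RecordLoader.loadArchives("s3a://commoncrawl/crawl-data/CC-MAIN-2023-50/segments/1700679103558.93/wat/CC-MAIN-20231211045204-20231211075204-00000.warc.wat.gz", sc)
  .keepValidPages()
  .map(r => ExtractDomain(r.getUrl))
  .countItems()
  .take(10)
```

(The same setting should also work at launch time via --conf spark.hadoop.fs.defaultFS=s3a://commoncrawl/ if you're invoking spark-shell directly, since Spark copies spark.hadoop.* keys into the Hadoop configuration.)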

Describe the bug
According to the docs, aut should be able to read data from s3a URLs, but every way I've tried it I get the same "Wrong FS" error (full trace below).

This specific run is built from docker-aut @ b64c02a343ad02ac36e84a2393ed52d86f0fb4ee, but a standalone Sparkling build does the same thing. I would file the ticket against Sparkling, but your docs actually exist, and no good deed goes unpunished ;)

I've verified that the credentials I'm providing can read this file using aws s3 cp etc.
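
For triage purposes, I'd also expect a scheme-aware probe from inside the same shell to succeed with these credentials, since it bypasses fs.defaultFS entirely; an untested sketch, with exists() as an illustrative check:

```scala
import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

// Resolve the filesystem from the URI itself rather than fs.defaultFS,
// so Hadoop picks S3AFileSystem for the s3a scheme.
val fs = FileSystem.get(new URI("s3a://commoncrawl/"), sc.hadoopConfiguration)
fs.exists(new Path("s3a://commoncrawl/crawl-data/CC-MAIN-2023-50/segments/1700679103558.93/wat/CC-MAIN-20231211045204-20231211075204-00000.warc.wat.gz"))
```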

alex@alex-work-pc:~/dev/docker-aut$ docker run -it sha256:f6a21678154c9603e5e4b3f453fa043083812bc40456469e3059edb7c5a3b36d
24/02/22 00:10:56 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://231b303af8fa:4040
Spark context available as 'sc' (master = local[*], app id = local-1708560659354).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.3.1
      /_/
         
Using Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 11.0.16)
Type in expressions to have them evaluated.
Type :help for more information.

scala> import io.archivesunleashed._, io.archivesunleashed.matchbox._
import io.archivesunleashed._
import io.archivesunleashed.matchbox._

scala> sc.hadoopConfiguration.set("fs.s3a.access.key", "REDACTED")

scala> sc.hadoopConfiguration.set("fs.s3a.secret.key", "REDACTED")

scala> RecordLoader.loadArchives("s3a://commoncrawl/crawl-data/CC-MAIN-2023-50/segments/1700679103558.93/wat/CC-MAIN-20231211045204-20231211075204-00000.warc.wat.gz", sc).keepValidPages().map(r => ExtractDomain(r.getUrl)).countItems().take(10)
java.lang.IllegalArgumentException: Wrong FS: s3a://commoncrawl/crawl-data/CC-MAIN-2023-50/segments/1700679103558.93/wat/CC-MAIN-20231211045204-20231211075204-00000.warc.wat.gz, expected: file:///
  at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:649)
  at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:82)
  at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:606)
  at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
  at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
  at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:428)
  at org.apache.hadoop.fs.Globber.getFileStatus(Globber.java:57)
  at org.apache.hadoop.fs.Globber.glob(Globber.java:252)
  at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1657)
  at org.archive.webservices.sparkling.io.HdfsIO.files(HdfsIO.scala:156)
  at org.archive.webservices.sparkling.util.RddUtil$.loadFilesLocality(RddUtil.scala:61)
  at io.archivesunleashed.package$RecordLoader$.loadArchives(package.scala:95)
  ... 51 elided

scala> 
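Reading the trace, HdfsIO.files appears to call globStatus on the filesystem obtained from the default configuration (file:/// here) rather than one resolved from the path's own scheme, which is exactly the check that RawLocalFileSystem.checkPath rejects. A sketch of the distinction (my guess at the root cause; I haven't dug through the Sparkling source):

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.util.Try

val p = new Path("s3a://commoncrawl/crawl-data/CC-MAIN-2023-50/segments/1700679103558.93/wat/CC-MAIN-20231211045204-20231211075204-00000.warc.wat.gz")

// What the trace suggests happens: FileSystem.get(conf) returns the
// default (local) filesystem, whose checkPath throws for s3a URLs.
Try(FileSystem.get(sc.hadoopConfiguration).globStatus(p)) // Failure: Wrong FS

// Scheme-aware alternative: resolve the filesystem from the path itself,
// which yields S3AFileSystem regardless of what fs.defaultFS says.
p.getFileSystem(sc.hadoopConfiguration).globStatus(p)
```

That would also explain why setting fs.defaultFS to the bucket makes it work: the default filesystem is then S3A, so checkPath passes.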

To Reproduce
Steps to reproduce the behavior (e.g.):

  1. git clone git@github.com:archivesunleashed/docker-aut.git && cd docker-aut
  2. docker build .
  3. docker run -it <hash of above>
  4. import packages and set credentials as documented
  5. evaluate RecordLoader.loadArchives("s3a://commoncrawl/crawl-data/CC-MAIN-2023-50/segments/1700679103558.93/wat/CC-MAIN-20231211045204-20231211075204-00000.warc.wat.gz", sc).keepValidPages().map(r => ExtractDomain(r.getUrl)).countItems().take(10)

Expected behavior
An RDD of archive records is returned by RecordLoader, and the expression in step 5 yields the top ten domain counts ;)

Environment information

  • AUT version: HEAD of docker-aut, currently at b64c02a343ad02ac36e84a2393ed52d86f0fb4ee
  • OS: Ubuntu 23.10 host (running inside Docker)
  • Java version: 11 (from Dockerfile)
  • Apache Spark version: 3.3.1 (from Dockerfile)
  • Apache Spark w/aut: (from Dockerfile)
  • Apache Spark command used to run AUT: docker run -it sha256:f6a21678154c9603e5e4b3f453fa043083812bc40456469e3059edb7c5a3b36d
