Replies: 15 comments 5 replies
-
I was thinking of setting up an S3-backed static web server and then using `export EDGAR_DATA_URL="https://xyz"` to serve up the data. If there is an easier way, it would be good to know. I am also not sure what the best way is to keep S3 up to date with all the filings.

```go
package main

import (
	"log"

	"github.com/gofiber/fiber/v2"
)

func main() {
	initS3() // Initialize the S3 client (defined elsewhere)
	app := fiber.New()
	// Use a route with a wildcard to catch all files under /static.
	// The wildcard parameter can be accessed via c.Params("*1").
	app.Get("/static/*", s3FileHandler)
	log.Fatal(app.Listen(":3000"))
}
```
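In case it helps anyone reading along, the fiddly part of that handler is mapping the wildcard path to an S3 object key safely. A sketch of that mapping in Python (the function name and prefix are mine, not from any library):

```python
from pathlib import PurePosixPath

def s3_key_for(request_path: str, prefix: str = "/static/") -> str:
    """Map a wildcard route path like '/static/edgar/data.json' to an S3 key.

    Rejects empty keys and '..' segments so a crafted URL cannot escape
    the intended prefix in the bucket.
    """
    if not request_path.startswith(prefix):
        raise ValueError(f"path must start with {prefix!r}")
    key = request_path[len(prefix):]
    if not key or ".." in PurePosixPath(key).parts:
        raise ValueError("invalid object key")
    return key
```

In the Go handler, the wildcard value from `c.Params("*1")` would want the same traversal check before the S3 GetObject call.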
-
What is your primary motivation? Bypassing the SEC's limit of 10 requests per second?
-
Thank you @saul-data for this suggestion and for sharing your implementation approach!

## TL;DR: EdgarTools Already Supports Cloud Storage (Two Ways!)

You actually have two options that work today with existing features.

### Option 1: `EDGAR_DATA_URL` (Your Current Approach) ✅

Your workaround using an S3-backed static server:

```bash
# Point EdgarTools to your S3-backed static server
export EDGAR_DATA_URL="https://your-s3-static-site.example.com"
```

```python
from edgar import get_filings

filings = get_filings()  # Fetches from your S3 endpoint
```

Why this is great for reading: EdgarTools fetches everything through that base URL, so reads come from your endpoint instead of the SEC.

### Option 2: `EDGAR_LOCAL_DATA_DIR` + FUSE Mount 🆕

For writing to cloud storage (like S3), mount the bucket as a local filesystem with a FUSE tool such as `s3fs`. Then configure EdgarTools:

```bash
export EDGAR_LOCAL_DATA_DIR=/mnt/edgar-data
export EDGAR_USE_LOCAL_DATA=1
```

Now downloads land in the bucket:

```python
from edgar.storage import download_filings

# Downloads filings directly to S3 via the FUSE mount
download_filings('2025-01-01:2025-01-31')
```

Why this works: EdgarTools uses standard filesystem operations, so anything mounted as a local directory works as local storage.

### Recommended Hybrid Approach

For best performance, use both methods together:

```bash
# 1. Mount S3 for downloading/writing
s3fs edgar-bucket /mnt/edgar-data
export EDGAR_LOCAL_DATA_DIR=/mnt/edgar-data
export EDGAR_USE_LOCAL_DATA=1

# 2. Download filings (writes to S3 via FUSE)
python -c "from edgar.storage import download_filings; download_filings('2025-01-01:')"

# 3. For reading, use the static website endpoint (faster with CDN caching)
export EDGAR_DATA_URL="https://edgar-bucket.s3-website.amazonaws.com"
```

This gives you FUSE-backed writes for keeping the bucket current, and CDN-cached reads for serving it.

### Keeping S3 Up to Date

For incremental updates, you have several options.

**Option A: Scheduled FUSE Downloads**

```bash
# Cron job or Lambda that runs daily
export EDGAR_LOCAL_DATA_DIR=/mnt/s3-edgar
python -c "from edgar.storage import download_filings; download_filings()"  # Downloads the latest filings
```

**Option B: Local Download + S3 Sync**

```python
# Download locally first (faster), then sync
from edgar.storage import download_filings

download_filings('2025-01-01:')
```

```bash
# Then sync to S3
aws s3 sync ~/.edgar/filings s3://edgar-bucket/filings
```

### Why We're Not Adding Native Cloud SDK Support

We considered adding native boto3/azure-storage-blob support, but your current approaches are actually better.

### Documentation Coming

We've created a tracking issue (Beads: edgartools-5i3) to document these patterns properly. Target: v4.34.0 or v4.35.0 documentation.

### Summary

Your current solution is solid! The FUSE mount option gives you an additional tool for the download/sync workflow.

Questions? Let us know what cloud provider you're using and we can provide more specific guidance.
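One more small piece for the cron option: a daily job needs a date range to pass to `download_filings`. Assuming the open-ended `'2025-01-01:'` range format used in the examples above, the window can be computed like this (the helper name is mine, not part of EdgarTools):

```python
from datetime import date, timedelta

def incremental_range(days_back=1, today=None):
    """Build an open-ended EdgarTools date range such as '2025-01-01:'.

    Covers the last `days_back` days, for a daily catch-up job that
    re-downloads anything filed since the previous run.
    """
    today = today or date.today()
    start = today - timedelta(days=days_back)
    return f"{start.isoformat()}:"
```

The cron job would then call `download_filings(incremental_range())`.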
-
Been working on this all weekend and it is quite the setup. It looks simple, but not really. I would highly recommend adding native S3-compatible support by passing environment variables for keys and secrets. When it downloads, it can write to S3, and it can also fetch from S3.
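To make the request concrete, something like this is what I mean. The `AWS_*` names are the standard AWS ones; the `EDGAR_S3_*` variables and the helper itself are just hypothetical:

```python
import os

def s3_config_from_env():
    """Assemble S3-compatible settings from environment variables."""
    missing = [name for name in ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")
               if name not in os.environ]
    if missing:
        raise RuntimeError(f"missing credentials: {missing}")
    return {
        "key": os.environ["AWS_ACCESS_KEY_ID"],
        "secret": os.environ["AWS_SECRET_ACCESS_KEY"],
        "bucket": os.environ.get("EDGAR_S3_BUCKET", "edgar-data"),
        "endpoint_url": os.environ.get("EDGAR_S3_ENDPOINT"),  # e.g. an R2/MinIO URL
    }
```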
-
## Update: Native Cloud Storage Support Coming

Based on @saul-data's feedback about the complexity of the FUSE approach ("spent all weekend on setup"), we've created a plan for native cloud storage support via optional dependencies.

### What's Changing

Instead of requiring FUSE mounts, you'll be able to do:

```bash
# Install cloud extras
pip install edgartools[s3]    # AWS S3
pip install edgartools[gcs]   # Google Cloud Storage
pip install edgartools[azure] # Azure Blob Storage
pip install edgartools[cloud] # All providers
```

```python
import edgar

# Simple one-liner configuration
edgar.use_cloud_storage('s3://my-edgar-bucket/data/')

# Or via environment variable
# export EDGAR_STORAGE_URL=s3://my-edgar-bucket/data/
edgar.use_cloud_storage()

# Now all storage operations use cloud
from edgar.storage import download_filings
download_filings('2025-01-01:2025-01-31')  # Downloads directly to S3

# Reading also works from cloud
filing = edgar.find("AAPL", form="10-K")[0]
html = filing.html()  # Reads from S3 cache if available
```

### S3-Compatible Services (MinIO, Cloudflare R2, etc.)

```python
edgar.use_cloud_storage(
    's3://my-bucket/',
    client_kwargs={'endpoint_url': 'https://minio.example.com'}
)
```

### Timeline

Targeting v4.35.0 or v4.36.0 for this feature.

### Feedback Welcome

Does this approach address your needs? Any specific cloud provider requirements or use cases we should consider?
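For anyone curious how a single URL can select among the providers: the scheme would pick the backend, and the rest splits into bucket and prefix. This is only an illustrative sketch, not the planned implementation:

```python
from urllib.parse import urlparse

SUPPORTED_SCHEMES = {"s3", "gs", "az"}

def parse_storage_url(url):
    """Split e.g. 's3://my-edgar-bucket/data/' into (scheme, bucket, prefix)."""
    parts = urlparse(url)
    if parts.scheme not in SUPPORTED_SCHEMES:
        raise ValueError(f"unsupported storage scheme: {parts.scheme!r}")
    return parts.scheme, parts.netloc, parts.path.lstrip("/")
```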
-
My only feedback would have been about S3-compatible services, but you seem to have that covered. We are using R2, and we sometimes need to input the region. This is great, thank you!
-
## 🚀 Native Cloud Storage Implementation Available for Testing

The native cloud storage feature is now implemented and available on the `feature/cloud-storage` branch.

### Installation

```bash
pip install "edgartools[s3] @ git+https://github.com/dgunning/edgartools.git@feature/cloud-storage"
```

Or for other providers:

```bash
pip install "edgartools[gcs] @ git+https://github.com/dgunning/edgartools.git@feature/cloud-storage"    # GCS
pip install "edgartools[azure] @ git+https://github.com/dgunning/edgartools.git@feature/cloud-storage"  # Azure
```

### Usage

```python
import edgar

# AWS S3 (uses default credentials from ~/.aws or environment)
edgar.use_cloud_storage("s3://my-edgar-bucket/")

# Cloudflare R2 (S3-compatible with custom endpoint)
edgar.use_cloud_storage(
    "s3://my-bucket/",
    client_kwargs={
        "endpoint_url": "https://ACCOUNT_ID.r2.cloudflarestorage.com",
        "region_name": "auto"  # R2 requires this
    }
)

# Google Cloud Storage
edgar.use_cloud_storage("gs://my-edgar-bucket/")

# Azure Blob Storage
edgar.use_cloud_storage("az://my-container/edgar/")

# Now reading filings works from cloud storage
filing = edgar.find("0000320193-24-000123")
html = filing.html()  # Reads from cloud if available
```

### Current Limitations

- Reading: ✅ fully supported
- Writing: for now, populate cloud storage separately (e.g. download locally, then sync the files to your bucket). We are tracking cloud write support as a follow-up enhancement.

Please try it out and let us know your experience, especially with R2 @saul-data!
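For clarity on what "reads from cloud if available" implies: the lookup order is presumably local-cache-first with a cloud fallback. A generic sketch, with stub getters standing in for the real cache and cloud client:

```python
def read_with_fallback(key, local_get, cloud_get):
    """Return the locally cached copy if present, else fetch from cloud storage.

    `local_get` and `cloud_get` are callables returning bytes or None.
    """
    data = local_get(key)
    if data is not None:
        return data
    return cloud_get(key)

# Stub backends to demonstrate the lookup order
local_cache = {"aapl-10k.html": b"<html>local</html>"}
cloud_store = {
    "aapl-10k.html": b"<html>cloud</html>",
    "msft-10k.html": b"<html>cloud</html>",
}
```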
-
Released 4.34.0 with better support for cloud storage: https://edgartools.readthedocs.io/en/latest/guides/cloud-storage/

I plan to test and make some additional improvements.
-
This is great; still going through it. You may want to remove Goofys from the documentation; I don't think it is being maintained anymore. AWS came out with a Rust-based alternative: https://github.com/awslabs/mountpoint-s3
-
@dgunning - this didn't work for me; I got an error message:

```bash
# AWS S3, Cloudflare R2, MinIO, DigitalOcean Spaces
uv pip install edgartools[s3]
```
-
Try:

```bash
uv pip install "edgartools[s3]"
```

The quotes stop the shell (zsh in particular) from treating the square brackets as a glob pattern. I will update the documentation.
-
Sure, will look into that.
-
@dgunning Does this also download to cloud, or just `download_filings()`?

```python
# Download all data types (submissions, facts, reference data)
download_edgar_data()
```
-
@dgunning - this code: does it upload all the filings for that day, or just the forms specified? I think it downloads all the forms and, I am guessing, filters for the forms specified?

```python
from edgar import get_filings, download_filings

filings = get_filings(form=["10-K", "10-Q", "13F-HR"], filing_date="2025-11-01:")
download_filings(filings=filings, upload_to_cloud=True)
```
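What I am imagining it does internally, i.e. pull the day's filing index and filter client-side. Filings are plain dicts here purely for illustration:

```python
def filter_by_form(filings, forms):
    """Keep only filings whose form type is in the requested list."""
    wanted = set(forms)
    return [f for f in filings if f["form"] in wanted]

# A day's index, reduced to the fields that matter here
day_index = [
    {"form": "10-K", "company": "Example Corp A"},
    {"form": "8-K", "company": "Example Corp B"},
    {"form": "13F-HR", "company": "Example Fund C"},
]
```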

Uh oh!
There was an error while loading. Please reload this page.
-
In addition to local file storage, it would be great to be able to configure a cloud storage endpoint with the same file/folder structure as local storage. This would make it easier to download all the history and keep it in, say, S3, and even keep S3 up to date with the latest submissions.