Self-hosting EDGAR #179
Replies: 8 comments · 11 replies
-
Hi, I'm Tom. I am thinking about using John Friedman's solution for self-hosting SEC filing files and connecting edgartools to it. We already have storage for company submissions JSON files and the company facts; we just don't download the actual filing files themselves. I am currently reviewing his documentation and his blog post about it to see what we would need to implement.
-
Hi @tom-flamelit, local storage is partially implemented: https://dgunning.github.io/edgartools/local-data/. I need your feedback and suggestions.
-
Awesome! Will test in next few days and report back
-
I am going to leave it open to the community to decide on the further development of local storage. I've done a lot of the internal wiring of local storage into the core Edgar functionality so it operates seamlessly, but the mechanics of how to integrate into the current local storage need to be worked out.

Downloading data is so far implemented for submissions (company data), facts, and reference data:

```python
def download_edgar_data(submissions: bool = True,
                        facts: bool = True,
                        reference: bool = True):
    """
    Download Edgar data to the local storage directory

    :param submissions: Download submissions
    :param facts: Download facts
    :param reference: Download reference data
    """
```

There is a new method:

```python
def download_filings(filing_date: Optional[str] = None,
                     data_directory: Optional[str] = None,
                     overwrite_existing: bool = False):
    """
    Download feed files for the specified date or date range.

    Examples:

        download_filings('2025-01-03:')
        download_filings('2025-01-03', overwrite_existing=False)
        download_filings('2024-01-01:2025-01-05', overwrite_existing=True)

    Args:
        filing_date: String in format 'YYYY-MM-DD', 'YYYY-MM-DD:', ':YYYY-MM-DD',
            or 'YYYY-MM-DD:YYYY-MM-DD'
        data_directory: Directory to save the downloaded files. Defaults to the Edgar data directory.
        overwrite_existing: If True, overwrite existing files. Default is False.
    """
```
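To make the accepted `filing_date` formats concrete, here is a small hypothetical helper (my own names, not part of edgartools) that splits the four documented forms into a `(start, end)` pair, roughly what `download_filings` would need to do internally:

```python
from datetime import date
from typing import Optional, Tuple

def parse_filing_date(filing_date: str) -> Tuple[Optional[date], Optional[date]]:
    """Split a filing_date string into (start, end); None means open-ended.

    Accepted forms: 'YYYY-MM-DD', 'YYYY-MM-DD:', ':YYYY-MM-DD',
    and 'YYYY-MM-DD:YYYY-MM-DD'.
    """
    if ":" in filing_date:
        start_str, _, end_str = filing_date.partition(":")
        start = date.fromisoformat(start_str) if start_str else None
        end = date.fromisoformat(end_str) if end_str else None
    else:
        # A single date means just that one day
        start = end = date.fromisoformat(filing_date)
    if start and end and start > end:
        raise ValueError(f"Start date {start} is after end date {end}")
    return start, end
```

So `'2025-01-03:'` means "from 2025-01-03 onward" and `':2025-01-05'` means "everything up to 2025-01-05".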
-
Would it also be valuable to integrate cloud storage as a storage backend?
-
I've been experimenting with a different approach here, using a custom HTTPX cacher. This could write to any storage platform, including S3, and would be transparent to edgartools. Adding local storage to edgartools seems to increase overall complexity, versus adding a cache of (effectively) HTTP responses. The cache could be built on the fly, or primed explicitly by hitting the endpoints. If there's interest in this approach, I could publish it. This is a very different approach than the local_storage approach and would operate at the request level.
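To illustrate what "operating at the request level" could look like, here is a minimal sketch under my own assumptions (these names are illustrative, not paultiq's actual code): a storage backend keyed by request URL, where an S3 implementation would expose the same two methods and the caller never knows which backend is in use.

```python
import hashlib
from pathlib import Path
from typing import Optional, Protocol

class CacheStorage(Protocol):
    """Any backend (dict, filesystem, S3, ...) just needs get/set by key."""
    def get(self, key: str) -> Optional[bytes]: ...
    def set(self, key: str, body: bytes) -> None: ...

def cache_key(url: str) -> str:
    # Hash the URL so keys are safe as file names or S3 object keys
    return hashlib.sha256(url.encode()).hexdigest()

class FileStorage:
    """Filesystem-backed cache; an S3 backend would use put_object/get_object."""
    def __init__(self, directory: Path):
        self.directory = directory
        directory.mkdir(parents=True, exist_ok=True)

    def get(self, key: str) -> Optional[bytes]:
        path = self.directory / key
        return path.read_bytes() if path.exists() else None

    def set(self, key: str, body: bytes) -> None:
        (self.directory / key).write_bytes(body)

def fetch_cached(url: str, storage: CacheStorage, fetch) -> bytes:
    """Return the response body for url, fetching and storing only on a miss."""
    key = cache_key(url)
    cached = storage.get(key)
    if cached is not None:
        return cached
    body = fetch(url)  # in real use, e.g. httpx.get(url).content
    storage.set(key, body)
    return body
```

In the real implementation this logic would live inside an HTTPX transport or cache layer, so edgartools code would not change at all.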
-
Hey @paultiq, nice idea. A couple of thoughts. That being said, I like the idea; it would add value. Can you elaborate on how it would be configured, e.g. to use S3?
-
The infrastructure is all in place to do this at the HTTP caching layer: httpclient_cache, using Hishel, "a library that implements HTTP Caching for HTTPX and HTTP Core libraries in accordance with RFC 9111, the most recent caching specification."

Note: this is, as mentioned earlier, one of multiple strategies you could use. At the low HTTP caching layer, you're caching the data before it's processed, so you'll still incur the performance hit of processing each file.

Using an S3 storage cache: to switch to an S3 cache, you'd need to change the storage from: https://hishel.com/#Configurations

Fine tuning the cache policies: there's some fine tuning of the caching policies that will be needed to make this really useful. The comments at the top of httpclient_cache.py have a few examples, but I suspect you'll want to implement your own controller.

Example using a local file cache:

```python
from edgar import httpclient_cache, set_identity, Company
from pathlib import Path
import logging

logging.basicConfig(format='%(asctime)s %(name)s %(levelname)-8s %(message)s',
                    level=logging.INFO, datefmt='%Y-%m-%d %H:%M:%S')
logging.getLogger("hishel.controller").setLevel(logging.DEBUG)

httpclient_cache.install_cached_client(
    cache_directory=Path(r"my_cachedir"),
    controller_args={"allow_heuristics": True, "allow_stale": True, "always_revalidate": False},
)

# set_identity("you@email.com")
filings = Company('MS').get_filings(form="10-Q").latest(10)
```
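On the S3 question specifically: Hishel's documentation lists an S3 storage backend alongside the filesystem one, so swapping backends might look roughly like the sketch below. This is an assumption-heavy sketch, not tested against edgartools; the bucket name is a placeholder, it requires AWS credentials, and how the storage gets handed to httpclient_cache's installed client is something I haven't verified.

```python
import boto3
import hishel

# Placeholder bucket; credentials come from the normal boto3 chain.
s3_client = boto3.client("s3")

# Per Hishel's docs, S3Storage takes a bucket name and a boto3 client.
storage = hishel.S3Storage(bucket_name="my-edgar-cache", client=s3_client)

# A cache client using that storage; the same controller settings as the
# file-cache example above would still apply.
client = hishel.CacheClient(storage=storage)
```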
-
Here you mention setting up a backend to self-host EDGAR: #168 (comment)
What do you have in mind?