Self-hosting EDGAR #179
Replies: 8 comments · 11 replies
-
Hi, I'm Tom. I am thinking about using John Friedman's solution for self-hosting SEC filing files and connecting edgartools to it. We already have storage for company submissions JSON files and the company facts; we just don't download the actual filing files themselves. I am currently reviewing his documentation and his blog post about it to see what we would need to implement.
-
Hi @tom-flamelit, local storage is partially implemented: https://dgunning.github.io/edgartools/local-data/. I need your feedback and suggestions.
-
Awesome! Will test in next few days and report back
-
I am going to leave it open to the community to decide on the further development of local storage. I've done a lot of the internal wiring of local storage into the core Edgar functionality so it operates seamlessly, but the mechanics of how to integrate into the current local storage need to be worked out.

Downloading data is so far implemented for submissions (company data), facts, and reference data:

```python
def download_edgar_data(submissions: bool = True,
                        facts: bool = True,
                        reference: bool = True):
    """
    Download Edgar data to the local storage directory

    :param submissions: Download submissions
    :param facts: Download facts
    :param reference: Download reference data
    """
```

There is a new method:

```python
def download_filings(filing_date: Optional[str] = None,
                     data_directory: Optional[str] = None,
                     overwrite_existing: bool = False):
    """
    Download feed files for the specified date or date range.

    Examples:

        download_filings('2025-01-03:')
        download_filings('2025-01-03', overwrite_existing=False)
        download_filings('2024-01-01:2025-01-05', overwrite_existing=True)

    Args:
        filing_date: String in format 'YYYY-MM-DD', 'YYYY-MM-DD:', ':YYYY-MM-DD',
            or 'YYYY-MM-DD:YYYY-MM-DD'
        data_directory: Directory to save the downloaded files. Defaults to the Edgar data directory.
        overwrite_existing: If True, overwrite existing files. Default is False.
    """
```
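To make the accepted `filing_date` formats concrete, here is a small hypothetical helper (my own names, not part of edgartools) that splits the four documented forms into a `(start, end)` pair, roughly what `download_filings` would need to do internally:

```python
from datetime import date
from typing import Optional, Tuple

def parse_filing_date(filing_date: str) -> Tuple[Optional[date], Optional[date]]:
    """Split a filing_date string into (start, end); None means open-ended.

    Accepted forms: 'YYYY-MM-DD', 'YYYY-MM-DD:', ':YYYY-MM-DD',
    and 'YYYY-MM-DD:YYYY-MM-DD'.
    """
    if ":" in filing_date:
        start_str, _, end_str = filing_date.partition(":")
        start = date.fromisoformat(start_str) if start_str else None
        end = date.fromisoformat(end_str) if end_str else None
    else:
        # A single date means just that one day
        start = end = date.fromisoformat(filing_date)
    if start and end and start > end:
        raise ValueError(f"Start date {start} is after end date {end}")
    return start, end
```

So `'2025-01-03:'` means "from 2025-01-03 onward" and `':2025-01-05'` means "everything up to 2025-01-05".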
-
Would it also be valuable to integrate cloud storage as a storage backend?
-
I've been experimenting with a different approach here, using a custom HTTPX cacher. This could write to any storage platform, including S3, and would be transparent to edgartools. Adding local storage to edgartools seems to increase overall complexity, versus adding a cache of (effectively) HTTP responses. The cache could be built on the fly, or primed explicitly by hitting the endpoints. If there's interest in this approach, I could publish it. This is a very different approach than the local_storage approach and would operate at the request level.
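To illustrate what "operating at the request level" could look like, here is a minimal sketch under my own assumptions (these names are illustrative, not paultiq's actual code): a storage backend keyed by request URL, where an S3 implementation would expose the same two methods and the caller never knows which backend is in use.

```python
import hashlib
from pathlib import Path
from typing import Optional, Protocol

class CacheStorage(Protocol):
    """Any backend (dict, filesystem, S3, ...) just needs get/set by key."""
    def get(self, key: str) -> Optional[bytes]: ...
    def set(self, key: str, body: bytes) -> None: ...

def cache_key(url: str) -> str:
    # Hash the URL so keys are safe as file names or S3 object keys
    return hashlib.sha256(url.encode()).hexdigest()

class FileStorage:
    """Filesystem-backed cache; an S3 backend would use put_object/get_object."""
    def __init__(self, directory: Path):
        self.directory = directory
        directory.mkdir(parents=True, exist_ok=True)

    def get(self, key: str) -> Optional[bytes]:
        path = self.directory / key
        return path.read_bytes() if path.exists() else None

    def set(self, key: str, body: bytes) -> None:
        (self.directory / key).write_bytes(body)

def fetch_cached(url: str, storage: CacheStorage, fetch) -> bytes:
    """Return the response body for url, fetching and storing only on a miss."""
    key = cache_key(url)
    cached = storage.get(key)
    if cached is not None:
        return cached
    body = fetch(url)  # in real use, e.g. httpx.get(url).content
    storage.set(key, body)
    return body
```

In the real implementation this logic would live inside an HTTPX transport or cache layer, so edgartools code would not change at all.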
-
Hey @paultiq, nice idea. A couple of thoughts. That being said, I like the idea; it would add value. Can you elaborate on how it would be configured, e.g. to use S3?
-
The infrastructure is all in place to do this at the HTTP caching layer: httpclient_cache, using Hishel, "a library that implements HTTP Caching for HTTPX and HTTP Core libraries in accordance with RFC 9111, the most recent caching specification."

Note: this is, as mentioned earlier, one of multiple strategies you could use. At the low HTTP caching layer, you're caching the data before it's processed, so you'll still incur the performance hit of processing each file.

Using an S3 storage cache: to switch to an S3 cache, you'd need to change the storage from: https://hishel.com/#Configurations

Fine tuning the cache policies: there's some fine tuning of the caching policies that will be needed to make this really useful. The comments at the top of httpclient_cache.py have a few examples, but I suspect you'll want to implement your own controller.

Example using a local file cache:

```python
from edgar import httpclient_cache, set_identity, Company
from pathlib import Path
import logging

logging.basicConfig(format='%(asctime)s %(name)s %(levelname)-8s %(message)s',
                    level=logging.INFO, datefmt='%Y-%m-%d %H:%M:%S')
logging.getLogger("hishel.controller").setLevel(logging.DEBUG)

httpclient_cache.install_cached_client(
    cache_directory=Path(r"my_cachedir"),
    controller_args={"allow_heuristics": True, "allow_stale": True, "always_revalidate": False},
)

# set_identity("you@email.com")
filings = Company('MS').get_filings(form="10-Q").latest(10)
```
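On the S3 question specifically: Hishel's documentation lists an S3 storage backend alongside the filesystem one, so swapping backends might look roughly like the sketch below. This is an assumption-heavy sketch, not tested against edgartools; the bucket name is a placeholder, it requires AWS credentials, and how the storage gets handed to httpclient_cache's installed client is something I haven't verified.

```python
import boto3
import hishel

# Placeholder bucket; credentials come from the normal boto3 chain.
s3_client = boto3.client("s3")

# Per Hishel's docs, S3Storage takes a bucket name and a boto3 client.
storage = hishel.S3Storage(bucket_name="my-edgar-cache", client=s3_client)

# A cache client using that storage; the same controller settings as the
# file-cache example above would still apply.
client = hishel.CacheClient(storage=storage)
```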
-
Here you mention setting up a backend to self-host EDGAR: #168 (comment)
What do you have in mind?