Advarchs is a simple tool for retrieving data from web archives. It is especially useful if you are working with remote data stored in compressed spreadsheets or similar formats.
Say you need to perform some data analytics on an Excel spreadsheet that is refreshed every month and stored in RAR format. You can target that file and convert it to a pandas DataFrame with the following procedure:
import pandas as pd
import os
import tempfile
from advarchs import webfilename, extract_web_archive
TEMP_DIR = tempfile.gettempdir()
url = "http://www.site.com/archive.rar"
arch_file_name = webfilename(url)
arch_path = os.path.join(TEMP_DIR, arch_file_name)
xlsx_files = extract_web_archive(url, arch_path, ffilter=['xlsx'])
for xlsx_f in xlsx_files:
    xlsx = pd.ExcelFile(xlsx_f)
    ...

Requirements:
Python 3.5+
p7zip
On CentOS and Ubuntu <= 16.04, the following packages are needed:
unrar
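
Because extraction relies on these external tools, it can help to confirm they are installed before running the example. The following is a minimal sketch using only the standard library; it is not part of advarchs. The executable names (`7z`, `7za`, `7zr`, `unrar`) are assumptions about how the packages are typically installed, and `missing_tools` is a hypothetical helper name.

```python
import shutil

# Assumed executable names: p7zip usually installs `7z` (or `7za`/`7zr`),
# and the unrar package installs `unrar`. Adjust for your distribution.
REQUIRED_TOOLS = {
    "p7zip": ("7z", "7za", "7zr"),
    "unrar": ("unrar",),
}


def missing_tools(required=REQUIRED_TOOLS, which=shutil.which):
    """Return the package names for which no candidate executable is on PATH."""
    return [
        package
        for package, candidates in required.items()
        if not any(which(name) for name in candidates)
    ]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing external tools:", ", ".join(missing))
    else:
        print("All external tools found.")
```

Note that unrar is listed above only for CentOS and Ubuntu <= 16.04, so a missing `unrar` may be harmless on other systems.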
pip install advarchs
See CONTRIBUTING
This project adheres to the Contributor Covenant 1.2. By participating, you are advised to adhere to this Code of Conduct in all your interactions with this project.