Commit d693cb2

fix rk, update docs, fix json gen, update html viz
1 parent 979a4a7 commit d693cb2

File tree

16 files changed: +160 -70 lines changed

.gitignore

Lines changed: 3 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -1,6 +1,9 @@
11
*.csv
22
*.json
33
*.html
4+
*.txt
5+
6+
test.py
47

58
waybacktweets/__pycache__
69
waybacktweets/api/__pycache__

README.md

Lines changed: 68 additions & 17 deletions
@@ -1,7 +1,6 @@
11
# Wayback Tweets
22

3-
[![PyPI](https://img.shields.io/pypi/v/waybacktweets)](https://pypi.org/project/waybacktweets) [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.12528447.svg)](https://doi.org/10.5281/zenodo.12528447) [![Streamlit App](https://static.streamlit.io/badges/streamlit_badge_black_white.svg)](https://waybacktweets.streamlit.app) [![Open In Collab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1tnaM3rMWpoSHBZ4P_6iHFPjraWRQ3OGe?usp=sharing)
4-
3+
[![PyPI](https://img.shields.io/pypi/v/waybacktweets)](https://pypi.org/project/waybacktweets) [![PyPI Downloads](https://static.pepy.tech/badge/waybacktweets)](https://pepy.tech/projects/waybacktweets)
54

65
Retrieves archived tweets CDX data from the Wayback Machine, performs necessary parsing (see [Field Options](https://claromes.github.io/waybacktweets/field_options.html)), and saves the data in HTML, for easy viewing of the tweets using the iframe tags, CSV, and JSON formats.
76

@@ -11,21 +10,50 @@ Retrieves archived tweets CDX data from the Wayback Machine, performs necessary
1110
pip install waybacktweets
1211
```
1312

14-
## Quickstart
15-
16-
### Using Wayback Tweets as a standalone command line tool
17-
18-
waybacktweets [OPTIONS] USERNAME
13+
## CLI
1914

2015
```shell
21-
waybacktweets --from 20150101 --to 20191231 --limit 250 jack
16+
Usage: waybacktweets [OPTIONS] USERNAME
17+
18+
USERNAME: The Twitter username without @
19+
20+
Options:
21+
-c, --collapse [urlkey|digest|timestamp:XX]
22+
Collapse results based on a field, or a
23+
substring of a field. XX in the timestamp
24+
value ranges from 1 to 14, comparing the
25+
first XX digits of the timestamp field. It
26+
is recommended to use from 4 onwards, to
27+
compare at least by years.
28+
-f, --from DATE Filtering by date range from this date.
29+
Format: YYYYmmdd
30+
-t, --to DATE Filtering by date range up to this date.
31+
Format: YYYYmmdd
32+
-l, --limit INTEGER Query result limits.
33+
-rk, --resumption_key TEXT Allows for a simple way to scroll through
34+
the results. Key to continue the query from
35+
the end of the previous query.
36+
-mt, --matchtype [exact|prefix|host|domain]
37+
Results matching a certain prefix, a certain
38+
host or all subdomains.
39+
-v, --verbose Shows the log.
40+
--version Show the version and exit.
41+
-h, --help Show this message and exit.
42+
43+
Examples:
44+
45+
Retrieve all tweets: waybacktweets jack
46+
47+
With options and verbose output: waybacktweets --from 20200305 --to 20231231 --limit 300 --verbose jack
48+
49+
Documentation:
50+
51+
https://claromes.github.io/waybacktweets/
2252
```
2353

24-
### Using Wayback Tweets as a Web App
25-
26-
[Open the application](https://waybacktweets.streamlit.app), a prototype written in Python with the Streamlit framework and hosted on Streamlit Cloud.
54+
## Module
2755

28-
### Using Wayback Tweets as a Python Module
56+
[![Open In Collab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1tnaM3rMWpoSHBZ4P_6iHFPjraWRQ3OGe?usp=sharing)
2957

3058
```python
3159
from waybacktweets import WaybackTweets, TweetsParser, TweetsExporter
@@ -37,29 +65,52 @@ archived_tweets = api.get()
3765

3866
if archived_tweets:
3967
field_options = [
68+
"archived_urlkey",
4069
"archived_timestamp",
41-
"original_tweet_url",
70+
"parsed_archived_timestamp",
4271
"archived_tweet_url",
72+
"parsed_archived_tweet_url",
73+
"original_tweet_url",
74+
"parsed_tweet_url",
75+
"available_tweet_text",
76+
"available_tweet_is_RT",
77+
"available_tweet_info",
78+
"archived_mimetype",
4379
"archived_statuscode",
80+
"archived_digest",
81+
"archived_length",
82+
"resumption_key",
4483
]
4584

4685
parser = TweetsParser(archived_tweets, USERNAME, field_options)
4786
parsed_tweets = parser.parse()
4887

4988
exporter = TweetsExporter(parsed_tweets, USERNAME, field_options)
5089
exporter.save_to_csv()
90+
exporter.save_to_json()
91+
exporter.save_to_html()
5192
```
5293

94+
## Web App
95+
96+
[![Streamlit App](https://static.streamlit.io/badges/streamlit_badge_black_white.svg)](https://waybacktweets.streamlit.app)
97+
98+
A prototype written in Python with the Streamlit framework and hosted on Streamlit Cloud.
99+
100+
> [!NOTE]
101+
> Starting from version 1.0, the web app will not receive all updates from the official package. To access all features, prefer the package via PyPI.
102+
53103
## Documentation
54104

55105
- [Wayback Tweets documentation](https://claromes.github.io/waybacktweets)
56106
- [Wayback CDX Server API (Beta) documentation](https://archive.org/developers/wayback-cdx-server.html)
57107

58108
## Acknowledgements
59109

60-
- Tristan Lee (Bellingcat's Data Scientist) for the idea of the application.
110+
- Tristan Lee (Bellingcat's Data Scientist) for the idea.
61111
- Jessica Smith (Snowflake's Community Growth Specialist) and Streamlit/Snowflake team for the additional server resources on Streamlit Cloud.
62-
- OSINT Community for recommending the application.
112+
- OSINT Community for recommending the package and the application.
63113

64-
> [!NOTE]
65-
> If the Streamlit application is down, please check the [Streamlit Cloud Status](https://www.streamlitstatus.com/).
114+
## License
115+
116+
[GPL-3.0](LICENSE.md)
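
The `--collapse` help text in the README hunk above says that `timestamp:XX` compares the first XX digits of the timestamp field. The following is a minimal sketch of that comparison on made-up local timestamps. The function name `collapse_by_timestamp` and the sample values are hypothetical, and this is neither the package's code nor the CDX server's implementation. It assumes the behavior described in the Wayback CDX Server documentation: collapsing applies to adjacent rows and keeps the first row of each run.

```python
from itertools import groupby


def collapse_by_timestamp(timestamps, digits):
    """Keep the first of each run of adjacent timestamps sharing a prefix."""
    if not 1 <= digits <= 14:
        raise ValueError("digits must be between 1 and 14")
    # groupby only merges adjacent items, which mirrors the assumed
    # "adjacent rows" collapsing behavior.
    return [
        next(group)
        for _, group in groupby(timestamps, key=lambda ts: ts[:digits])
    ]


# Hypothetical 14-digit timestamps (YYYYmmddHHMMSS), already sorted.
sample = [
    "20150101120000",
    "20150315093000",
    "20160704000000",
    "20160704010000",
]

print(collapse_by_timestamp(sample, 4))  # at most one capture per year
print(collapse_by_timestamp(sample, 8))  # at most one capture per day
```

With `digits=4` the sketch keeps one capture per year, which is why the help text recommends starting from 4.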

docs/conf.py

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@
55
project = "Wayback Tweets"
66
release, version = get_version("waybacktweets")
77
rst_epilog = f".. |release| replace:: v{release}"
8-
copyright = f"2023 - {datetime.datetime.now().year}, Claromes · Icon by The Doodle Library · Title font by Google, licensed under the Open Font License · Pre-release: v{release}" # noqa: E501
8+
copyright = f"2023 - {datetime.datetime.now().year}, Claromes · Icon by The Doodle Library · Title font by Google, licensed under the Open Font License · Release: v{release}" # noqa: E501
99
author = "Claromes"
1010

1111
# -- General configuration ---------------------------------------------------

docs/field_options.rst

Lines changed: 2 additions & 0 deletions
@@ -40,3 +40,5 @@ The package performs several parses to facilitate the analysis of archived tweet
4040
- ``archived_digest``: (`str`) The ``SHA1`` hash digest of the content, excluding the headers. It's usually a base-32-encoded string.
4141

4242
- ``archived_length``: (`int`) The compressed byte size of the corresponding WARC record, which includes WARC headers, HTTP headers, and content payload.
43+
44+
- ``resumption_key``: (`str`) Allows for a simple way to scroll through the results. Key to continue the query from the end of the previous query.
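
The `resumption_key` entry above describes continuing a query from where the previous one ended. A minimal sketch of how such a follow-up request could be built against the Wayback CDX Server, using its documented `showResumeKey` and `resumeKey` parameters. The helper `build_cdx_url`, the URL pattern `twitter.com/{username}/status/*`, and the placeholder key are assumptions for illustration; this commit does not show how the package builds its own requests.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Public Wayback CDX Server endpoint.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_url(username, limit=None, resume_key=None):
    """Build a CDX query URL that asks the server to return a resume key."""
    params = {
        # Assumed URL pattern for a user's tweets; not taken from this commit.
        "url": f"twitter.com/{username}/status/*",
        "output": "json",
        "showResumeKey": "true",
    }
    if limit is not None:
        params["limit"] = limit
    if resume_key:
        # Continue from the end of the previous query.
        params["resumeKey"] = resume_key
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


first_page = build_cdx_url("jack", limit=250)
# Placeholder value; a real key comes from the previous response.
next_page = build_cdx_url("jack", limit=250, resume_key="KEY_FROM_PREVIOUS_RESPONSE")

print(first_page)
print(parse_qs(urlsplit(next_page).query)["resumeKey"])
```

The sketch only builds URLs and makes no network request, so it can be run offline.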

docs/handson.rst

Lines changed: 0 additions & 1 deletion
@@ -19,4 +19,3 @@ Hands-On Examples
1919
:target: https://colab.research.google.com/drive/1tnaM3rMWpoSHBZ4P_6iHFPjraWRQ3OGe?usp=sharing
2020
:alt: Open In Collab
2121

22-
.. raw:: html

docs/index.rst

Lines changed: 6 additions & 8 deletions
@@ -39,23 +39,21 @@ Command-Line Interface
3939

4040
cli
4141

42-
Streamlit Web App
43-
-------------------
42+
API Reference
43+
---------------
4444

4545
.. toctree::
4646
:maxdepth: 2
4747

48-
streamlit
49-
48+
api
5049

51-
API Reference
52-
---------------
50+
Streamlit Web App
51+
-------------------
5352

5453
.. toctree::
5554
:maxdepth: 2
5655

57-
api
58-
56+
streamlit
5957

6058
Additional Information
6159
-----------------------

docs/outputs.rst

Lines changed: 5 additions & 1 deletion
@@ -14,10 +14,14 @@ This format allows for easy viewing of the archived tweets, through the use of t
1414

1515
- ``original_tweet_url``: (`str`) The original tweet URL.
1616

17-
- ``parsed_tweet_url``: (`str`) The original tweet URL after parsing. Old URLs were archived in a nested manner. The parsing applied here unnests these URLs, when necessary. Check the :ref:`utils`.
17+
- ``parsed_tweet_url``: (`str`) The original tweet URL after parsing. Old URLs were archived in a nested manner. The parsing applied here unnests these URLs when necessary. Refer to the :ref:`utils` for more details.
1818

1919
Additionally, other fields are displayed.
2020

21+
.. note::
22+
23+
The iframes (accordions) are best viewed in Firefox.
24+
2125
CSV
2226
--------
2327

docs/quickstart.rst

Lines changed: 21 additions & 8 deletions
@@ -12,13 +12,6 @@ waybacktweets [OPTIONS] USERNAME
1212
1313
waybacktweets --from 20150101 --to 20191231 --limit 250 jack
1414
15-
Web App
16-
-------------
17-
18-
Using Wayback Tweets as a Streamlit Web App.
19-
20-
`Open the application <https://waybacktweets.streamlit.app>`_, a prototype written in Python with the Streamlit framework and hosted on Streamlit Cloud.
21-
2215
Module
2316
-------------
2417

@@ -35,14 +28,34 @@ Using Wayback Tweets as a Python Module.
3528
3629
if archived_tweets:
3730
field_options = [
31+
"archived_urlkey",
3832
"archived_timestamp",
39-
"original_tweet_url",
33+
"parsed_archived_timestamp",
4034
"archived_tweet_url",
35+
"parsed_archived_tweet_url",
36+
"original_tweet_url",
37+
"parsed_tweet_url",
38+
"available_tweet_text",
39+
"available_tweet_is_RT",
40+
"available_tweet_info",
41+
"archived_mimetype",
4142
"archived_statuscode",
43+
"archived_digest",
44+
"archived_length",
45+
"resumption_key",
4246
]
4347
4448
parser = TweetsParser(archived_tweets, USERNAME, field_options)
4549
parsed_tweets = parser.parse()
4650
4751
exporter = TweetsExporter(parsed_tweets, USERNAME, field_options)
4852
exporter.save_to_csv()
53+
exporter.save_to_json()
54+
exporter.save_to_html()
55+
56+
Web App
57+
-------------
58+
59+
Using Wayback Tweets as a Streamlit Web App.
60+
61+
`Open the application <https://waybacktweets.streamlit.app>`_, a prototype written in Python with the Streamlit framework and hosted on Streamlit Cloud.
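
The quickstart command in this hunk passes `--from 20150101 --to 20191231`, and the CLI help gives the date format as `YYYYmmdd`. A minimal sketch of checking such values before running a query. The helper `parse_cdx_date` is hypothetical and not part of the package; whether the CLI performs a similar check is not shown in this commit.

```python
from datetime import date, datetime


def parse_cdx_date(value: str) -> date:
    """Parse a YYYYmmdd string, rejecting anything that is not 8 ASCII digits."""
    # strptime alone accepts some shorter inputs, so check the shape first.
    if len(value) != 8 or not (value.isascii() and value.isdigit()):
        raise ValueError(f"expected YYYYmmdd, got {value!r}")
    return datetime.strptime(value, "%Y%m%d").date()


start = parse_cdx_date("20150101")
end = parse_cdx_date("20191231")

if start > end:
    raise ValueError("--from must not be later than --to")

print(start.isoformat(), end.isoformat())
```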

docs/streamlit.rst

Lines changed: 4 additions & 2 deletions
@@ -1,6 +1,10 @@
11
Web App
22
=========
33

4+
.. note::
5+
6+
Starting from version 1.0, the web app will not receive all updates from the official package. To access all features, prefer the package via PyPI.
7+
48
The application is a prototype hosted on Streamlit Cloud, serving as an alternative to the command line tool.
59

610
`Open the application <https://waybacktweets.streamlit.app>`_.
@@ -13,8 +17,6 @@ Filters
1317

1418
- Limit: Query result limits.
1519

16-
- Resumption Key: Allows for a simple way to scroll through the results. Key to continue the query from the end of the previous query.
17-
1820
- Only unique Wayback Machine URLs: Filtering by the collapse option using the ``urlkey`` field and the URL Match Scope ``prefix``
1921

2022

pyproject.toml

Lines changed: 4 additions & 2 deletions
@@ -1,13 +1,14 @@
11
[tool.poetry]
22
name = "waybacktweets"
3-
version = "1.0rc1"
3+
version = "1.0"
44
description = "Retrieves archived tweets CDX data from the Wayback Machine, performs necessary parsing, and saves the data."
55
authors = ["Claromes <support@claromes.com>"]
66
license = "GPLv3"
77
readme = "README.md"
88
repository = "https://github.com/claromes/waybacktweets"
99
keywords = [
1010
"twitter",
11+
"X",
1112
"tweet",
1213
"internet-archive",
1314
"wayback-machine",
@@ -16,13 +17,14 @@ keywords = [
1617
"command-line",
1718
]
1819
classifiers = [
19-
"Development Status :: 4 - Beta",
20+
"Development Status :: 5 - Production/Stable",
2021
"Intended Audience :: Developers",
2122
"Intended Audience :: Science/Research",
2223
"License :: OSI Approved :: GNU General Public License v3 (GPLv3)",
2324
"Natural Language :: English",
2425
"Programming Language :: Python :: 3.10",
2526
"Programming Language :: Python :: 3.11",
27+
"Programming Language :: Python :: 3.12",
2628
"Topic :: Software Development",
2729
"Topic :: Utilities",
2830
]
