
Commit d03816a

Update big query script (#155)

* Move data collection to dedicated folder
* Add last blocks collected
* Add block number and timestamp conditions to queries
* Flake8
* Remove force_query arg
* Fix typo
* Add quotes around timestamp in queries

1 parent ae769b0 commit d03816a

File tree

4 files changed: +59 -21 lines changed
Lines changed: 37 additions & 16 deletions

@@ -1,39 +1,40 @@
 """
 This script can be used to run queries on BigQuery for any number of blockchains, and save the results in the
 raw_block_data directory of the project.
-The relevant queries must be stored in a file named 'queries.yaml' in the root directory of the project.
+The relevant queries must be stored in a file named 'queries.yaml' in the `data_collection_scripts` directory of
+the project.
 
 Attention! Before running this script, you need to generate service account credentials from Google, as described
 here (https://developers.google.com/workspace/guides/create-credentials#service-account) and save your key in the
-root directory of the project under the name 'google-service-account-key.json'
+`data_collection_scripts` directory of the project under the name 'google-service-account-key.json'
 """
 import consensus_decentralization.helper as hlp
 import google.cloud.bigquery as bq
 import json
 import argparse
 import logging
 from yaml import safe_load
+from datetime import datetime
 
 from consensus_decentralization.helper import ROOT_DIR, RAW_DATA_DIR
 
 
-def collect_data(ledgers, force_query):
+def collect_data(ledgers, from_block, to_date):
     if not RAW_DATA_DIR.is_dir():
         RAW_DATA_DIR.mkdir()
 
-    with open(ROOT_DIR / "queries.yaml") as f:
+    data_collection_dir = ROOT_DIR / "data_collection_scripts"
+
+    with open(data_collection_dir / "queries.yaml") as f:
         queries = safe_load(f)
 
-    client = bq.Client.from_service_account_json(json_credentials_path=ROOT_DIR / "google-service-account-key.json")
+    client = bq.Client.from_service_account_json(json_credentials_path=data_collection_dir / "google-service-account-key.json")
 
     for ledger in ledgers:
         file = RAW_DATA_DIR / f'{ledger}_raw_data.json'
-        if not force_query and file.is_file():
-            logging.info(f'{ledger} data already exists locally. '
-                         f'For querying {ledger} anyway please run the script using the flag --force-query')
-            continue
         logging.info(f"Querying {ledger}..")
-        query = (queries[ledger])
+
+        query = (queries[ledger]).replace("{{block_number}}", str(from_block[ledger]) if from_block[ledger] else "-1").replace("{{timestamp}}", to_date)
         query_job = client.query(query)
         try:
             rows = query_job.result()
@@ -44,13 +45,30 @@ def collect_data(ledgers, force_query):
             continue
 
         logging.info(f"Writing {ledger} data to file..")
-        # write json lines to file
-        with open(file, 'w') as f:
+        # Append result to file
+        with open(file, 'a') as f:
             for row in rows:
                 f.write(json.dumps(dict(row), default=str) + "\n")
         logging.info(f'Done writing {ledger} data to file.\n')
 
 
+def get_last_block_collected(ledger):
+    """
+    Get the last block collected for a ledger. This is useful for knowing where to start collecting data from.
+    Assumes that the data is stored in a json lines file, ordered in increasing block number.
+    :param ledger: the ledger to get the last block collected for
+    :returns: the number of the last block collected for the specified ledger
+    """
+    file = RAW_DATA_DIR / f'{ledger}_raw_data.json'
+    if not file.is_file():
+        return None
+    with open(file) as f:
+        for line in f:
+            pass
+    last_block = json.loads(line)
+    return last_block['number']
+
+
 if __name__ == '__main__':
     logging.basicConfig(format='[%(asctime)s] %(message)s', datefmt='%Y/%m/%d %I:%M:%S %p', level=logging.INFO)
 
@@ -66,9 +84,12 @@ def collect_data(ledgers, force_query):
         help='The ledgers to collect data for.'
     )
     parser.add_argument(
-        '--force-query',
-        action='store_true',
-        help='Flag to specify whether to query for project data regardless if the relevant data already exist.'
+        '--to_date',
+        type=hlp.valid_date,
+        default=datetime.today().strftime('%Y-%m-%d'),
+        help='The date until which to get data for (YYYY-MM-DD format). Defaults to today.'
     )
+
     args = parser.parse_args()
-    collect_data(ledgers=args.ledgers, force_query=args.force_query)
+    from_block = {ledger: get_last_block_collected(ledger) for ledger in args.ledgers}
+    collect_data(ledgers=args.ledgers, from_block=from_block, to_date=args.to_date)
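The incremental-collection mechanics introduced by this diff (filling the `{{block_number}}`/`{{timestamp}}` placeholders and reading the last collected block back from a JSON-lines file) can be sketched standalone. The helper names below (`fill_query`, `last_block_number`) and the sample data are illustrative, not the script's actual API:

```python
import json
import tempfile
from pathlib import Path

def fill_query(template, from_block, to_date):
    # Mirrors the .replace() chain in collect_data: when no blocks have been
    # collected yet, -1 makes the `number > -1` condition match every block.
    return (template
            .replace("{{block_number}}", str(from_block) if from_block else "-1")
            .replace("{{timestamp}}", to_date))

def last_block_number(path):
    # Mirrors get_last_block_collected: the file is assumed to be JSON lines
    # ordered by increasing block number, so the last line holds the last block.
    if not path.is_file():
        return None
    last_line = None
    with open(path) as f:
        for line in f:
            last_line = line
    return json.loads(last_line)["number"]

template = "WHERE number > {{block_number}} AND timestamp < '{{timestamp}}'"
print(fill_query(template, None, "2024-01-01"))
# WHERE number > -1 AND timestamp < '2024-01-01'

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "bitcoin_raw_data.json"
    with open(path, "a") as f:            # appended over runs, like the script
        for number in (100, 101, 102):
            f.write(json.dumps({"number": number}) + "\n")
    print(last_block_number(path))        # 102
```

Because results are now appended rather than overwritten, each run only queries blocks newer than the last line of the existing file, which is why the file must stay ordered by block number.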

google-service-account-key-SAMPLE.json renamed to data_collection_scripts/google-service-account-key-SAMPLE.json

File renamed without changes.
Lines changed: 17 additions & 0 deletions

@@ -3,36 +3,46 @@ bitcoin:
   FROM `bigquery-public-data.crypto_bitcoin.transactions`
   JOIN `bigquery-public-data.crypto_bitcoin.blocks` ON `bigquery-public-data.crypto_bitcoin.transactions`.block_number = `bigquery-public-data.crypto_bitcoin.blocks`.number
   WHERE is_coinbase is TRUE
+  AND number > {{block_number}}
   AND timestamp > '2018-01-01'
+  AND timestamp < '{{timestamp}}'
   ORDER BY timestamp
 
 bitcoin_cash:
   SELECT block_number as number, block_timestamp as timestamp, coinbase_param as identifiers, `bigquery-public-data.crypto_bitcoin_cash.transactions`.outputs
   FROM `bigquery-public-data.crypto_bitcoin_cash.transactions`
   JOIN `bigquery-public-data.crypto_bitcoin_cash.blocks` ON `bigquery-public-data.crypto_bitcoin_cash.transactions`.block_number = `bigquery-public-data.crypto_bitcoin_cash.blocks`.number
   WHERE is_coinbase is TRUE
+  AND number > {{block_number}}
   AND timestamp > '2018-01-01'
+  AND timestamp < '{{timestamp}}'
   ORDER BY timestamp
 
 cardano:
   SELECT `blockchain-analytics-392322.cardano_mainnet.block`.slot_no as number, `blockchain-analytics-392322.cardano_mainnet.pool_offline_data`.ticker_name as identifiers, `blockchain-analytics-392322.cardano_mainnet.block`.block_time as timestamp,`blockchain-analytics-392322.cardano_mainnet.block`.pool_hash as reward_addresses
   FROM `blockchain-analytics-392322.cardano_mainnet.block`
   LEFT JOIN `blockchain-analytics-392322.cardano_mainnet.pool_offline_data` ON `blockchain-analytics-392322.cardano_mainnet.block`.pool_hash = `blockchain-analytics-392322.cardano_mainnet.pool_offline_data`.pool_hash
   WHERE `blockchain-analytics-392322.cardano_mainnet.block`.block_time > '2018-01-01'
+  AND `blockchain-analytics-392322.cardano_mainnet.block`.block_time < '{{timestamp}}'
+  AND number > {{block_number}}
   ORDER BY `blockchain-analytics-392322.cardano_mainnet.block`.block_time
 
 dogecoin:
   SELECT block_number as number, block_timestamp as timestamp, coinbase_param as identifiers, `bigquery-public-data.crypto_dogecoin.transactions`.outputs
   FROM `bigquery-public-data.crypto_dogecoin.transactions`
   JOIN `bigquery-public-data.crypto_dogecoin.blocks` ON `bigquery-public-data.crypto_dogecoin.transactions`.block_number = `bigquery-public-data.crypto_dogecoin.blocks`.number
   WHERE is_coinbase is TRUE
+  AND number > {{block_number}}
   AND timestamp > '2018-01-01'
+  AND timestamp < '{{timestamp}}'
   ORDER BY timestamp
 
 ethereum:
   SELECT number, timestamp, miner as reward_addresses, extra_data as identifiers
   FROM `bigquery-public-data.crypto_ethereum.blocks`
   WHERE timestamp > '2018-01-01'
+  AND timestamp < '{{timestamp}}'
+  AND number > {{block_number}}
   ORDER BY timestamp
 
 
@@ -41,20 +51,27 @@ litecoin:
   FROM `bigquery-public-data.crypto_litecoin.transactions`
   JOIN `bigquery-public-data.crypto_litecoin.blocks` ON `bigquery-public-data.crypto_litecoin.transactions`.block_number = `bigquery-public-data.crypto_litecoin.blocks`.number
   WHERE is_coinbase is TRUE
+  AND number > {{block_number}}
   AND timestamp > '2018-01-01'
+  AND timestamp < '{{timestamp}}'
   ORDER BY timestamp
 
+
 tezos:
   SELECT level as number, timestamp, baker as reward_addresses
   FROM `public-data-finance.crypto_tezos.blocks`
   WHERE timestamp > '2018-01-01'
+  AND timestamp < '{{timestamp}}'
+  AND number > {{block_number}}
   ORDER BY timestamp
 
 zcash:
   SELECT block_number as number, block_timestamp as timestamp, coinbase_param as identifiers, `bigquery-public-data.crypto_zcash.transactions`.outputs
   FROM `bigquery-public-data.crypto_zcash.transactions`
   JOIN `bigquery-public-data.crypto_zcash.blocks` ON `bigquery-public-data.crypto_zcash.transactions`.block_number = `bigquery-public-data.crypto_zcash.blocks`.number
   WHERE is_coinbase is TRUE
+  AND number > {{block_number}}
   AND timestamp > '2018-01-01'
+  AND timestamp < '{{timestamp}}'
   ORDER BY timestamp

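All of these queries share the same two placeholders, which the collection script fills via plain string replacement. Note the quoting (per the commit-message item "Add quotes around timestamp in queries"): the single quotes live in the template, so the substituted date lands inside a SQL string literal, while the block number substitutes as a bare integer. A minimal sketch, using a hypothetical template and values rather than the real queries.yaml:

```python
# Hypothetical template mimicking the queries above; the real ones live in
# data_collection_scripts/queries.yaml.
template = (
    "SELECT number, timestamp\n"
    "FROM `bigquery-public-data.crypto_ethereum.blocks`\n"
    "WHERE timestamp > '2018-01-01'\n"
    "AND timestamp < '{{timestamp}}'\n"   # quoted: becomes a SQL string literal
    "AND number > {{block_number}}\n"     # unquoted: becomes a bare integer
    "ORDER BY timestamp"
)

rendered = (template
            .replace("{{block_number}}", "17000000")  # hypothetical last block
            .replace("{{timestamp}}", "2024-01-01"))
print(rendered)
```

Without the quotes in the template, the date would render as `timestamp < 2024-01-01`, which BigQuery would parse as arithmetic on integers rather than a timestamp comparison.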
docs/data.md

Lines changed: 5 additions & 5 deletions

@@ -97,22 +97,22 @@ ORDER BY timestamp
 
 Instead of executing each of these queries separately on the BigQuery console and saving the results manually, it is
 also possible to automate the process using a
-[script](https://github.com/Blockchain-Technology-Lab/consensus-decentralization/blob/main/consensus_decentralization/collect_data.py)
+[script](https://github.com/Blockchain-Technology-Lab/consensus-decentralization/blob/main/data_collection_scripts/collect_block_data.py)
 and collect all relevant data in one go. Executing this script will run queries
-from [this file](https://github.com/Blockchain-Technology-Lab/consensus-decentralization/blob/main/queries.yaml).
+from [this file](https://github.com/Blockchain-Technology-Lab/consensus-decentralization/blob/main/data_collection_scripts/queries.yaml).
 
 IMPORTANT: the script uses service account credentials for authentication, therefore before running it, you need to
 generate the relevant credentials from Google, as described
 [here](https://developers.google.com/workspace/guides/create-credentials#service-account) and save your key in the
-root directory of the project under the name 'google-service-account-key.json'. There is a
-[sample file](https://github.com/Blockchain-Technology-Lab/consensus-decentralization/blob/main/google-service-account-key-SAMPLE.json)
+`data_collection_scripts` directory of the project under the name 'google-service-account-key.json'. There is a
+[sample file](https://github.com/Blockchain-Technology-Lab/consensus-decentralization/blob/main/data_collection_scripts/google-service-account-key-SAMPLE.json)
 that you can consult, which shows what your credentials are supposed to look like (but note that this is for
 informational purposes only, this file is not used in the code).
 
 Once you have set up the credentials, you can just run the following command from the root
 directory to retrieve data for all supported blockchains:
 
-`python -m consensus_decentralization.collect_data`
+`python -m data_collection_scripts.collect_block_data`
 
 There are also two command line arguments that can be used to customize the data collection process:

0 commit comments