
Commit d60b78f

Merge pull request #63 from Blockchain-Technology-Lab/docs
Update documentation
2 parents 81346fd + a1b8af6 commit d60b78f

4 files changed: +155 −70 lines changed

docs/contribute.md

Lines changed: 73 additions & 1 deletion
@@ -4,4 +4,76 @@

You can contribute to the tool by adding support for a ledger, updating the mapping process for an existing ledger, or adding a new metric. In all cases, the information should be submitted via a GitHub PR.

## Add support for ledgers

You can add support for a ledger that is not already supported as follows.

### Mapping information

In the directory `mapping_information/`, there exist two folders: `addresses` and `special_addresses`.

`addresses` contains information about the owner or manager of an address. This information should be publicly available and verifiable; for example, it may come from a public explorer, social media or forum posts, articles, etc. Each file in this folder is named `<project_name>.json` (for the corresponding ledger) and contains a dictionary where the key is the address and the value is a dictionary with the following information:

(i) the name of the entity (that controls the address);
(ii) the source of the information (e.g., an explorer's URL);
(iii) (optional) a boolean value `is_contract` (if omitted, it is assumed false);
(iv) (optional) `extra_info` that might be relevant or interesting (not used for the analysis).
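To illustrate the schema described above, a hypothetical entry of an `addresses/<project_name>.json` file can be sketched in Python; the address, entity name, and source URL are made up for illustration.

```python
import json

# Hypothetical entry for mapping_information/addresses/<project_name>.json,
# following the schema described above; the concrete values are illustrative,
# not real mapping data.
entry = {
    "1ExampleAddressXXXXXXXXXXXXXXXXXXX": {
        "name": "Example Exchange",
        "source": "https://explorer.example.com/address/1ExampleAddressXXXXXXXXXXXXXXXXXXX",
        "is_contract": False,          # optional; assumed false if omitted
        "extra_info": "cold wallet",   # optional; not used in the analysis
    }
}

print(json.dumps(entry, indent=4))
```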
`special_addresses` contains information about addresses that should be treated specially, e.g., excluded from the analysis. This includes burn addresses, protocol-related addresses (e.g., Ethereum's staking contract), treasury addresses, etc. Here each file is named `<project_name>.json` and contains a list of dictionaries with the following information:

(i) the address;
(ii) the source of the information;
(iii) `extra_info` which describes the reason why the address is special.
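A sketch of one such list entry, following the schema above; the example uses Ethereum's staking (deposit) contract, which the text mentions, but the `source` URL and phrasing are illustrative.

```python
import json

# Hypothetical entry for mapping_information/special_addresses/ethereum.json;
# the structure follows the description above, the values are illustrative.
special_entries = [
    {
        "address": "0x00000000219ab540356cBB839Cbe05303d7705Fa",
        "source": "https://etherscan.io/address/0x00000000219ab540356cBB839Cbe05303d7705Fa",
        "extra_info": "Ethereum staking (deposit) contract",
    }
]

print(json.dumps(special_entries, indent=4))
```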
To contribute mapping information you can either update an existing file, by changing and/or adding some entries, or create a new file for a newly-supported ledger.

### Price information

The directory `price_data/` contains information about the supported ledgers' market price. Each file in this folder is named `<project_name>.csv` (for the corresponding ledger). The csv file has no header and each line contains two comma-separated values:

(i) a day (in the form YYYY-MM-DD);
(ii) the USD market price of the token on the set day.

To contribute price information you can either update an existing file, by adding entries for days where data is missing, or create a new file for a newly-supported ledger and add historical price data.
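A hypothetical fragment of such a CSV, parsed with Python's standard `csv` module; the price values are illustrative, not real market data.

```python
import csv
import io

# A made-up fragment of a price_data/<project_name>.csv file: no header,
# each row is a YYYY-MM-DD day and the USD market price on that day.
csv_text = "2023-01-01,16547.5\n2023-01-02,16688.47\n"

prices = {}
for day, usd_price in csv.reader(io.StringIO(csv_text)):
    prices[day] = float(usd_price)

print(prices["2023-01-01"])
```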
## Add metrics

To add a new metric, you should do the following steps.

First, create a relevant function in the script `tokenomics_decentralization/metrics.py`. The function should be named `compute_{metric_name}` and is given two parameters:

(i) a list of tuples, where each tuple's first value is a numeric type that defines the balance of an address;
(ii) an integer that defines the circulation (that is, the sum of all address balances).
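A minimal sketch of what such a function could look like, using a max-power-ratio-style metric as an example; the function body is illustrative, not the repository's actual implementation, and nothing is assumed about the tuple entries beyond the first value.

```python
# Sketch of a metric function following the interface described above:
# `balance_entries` is a list of tuples whose first value is an address
# balance, and `circulation` is the sum of all balances.
def compute_max_power_ratio(balance_entries, circulation):
    if circulation == 0:
        return 0
    return max(entry[0] for entry in balance_entries) / circulation

print(compute_max_power_ratio([(60,), (30,), (10,)], 100))  # → 0.6
```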
Second, import this new function in `tokenomics_decentralization/analyze.py`. In this file, include the function as a value in the dictionary `compute_functions` of the `analyze_snapshot` function, using as a key the name of the function (which will be used in the config file).
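A sketch of this registration step; the surrounding code of `analyze_snapshot` is elided here, and the dictionary shown is a stand-in for the repository's actual `compute_functions` mapping (the key name and metric function are illustrative).

```python
# Illustrative stand-in for the registration described above.
def compute_max_power_ratio(balance_entries, circulation):
    if circulation == 0:
        return 0
    return max(entry[0] for entry in balance_entries) / circulation

# In analyze.py, the new function is added to the compute_functions
# dictionary, keyed by the name that will appear in the config file.
compute_functions = {
    'max power ratio': compute_max_power_ratio,
}

print(compute_functions['max power ratio']([(5,), (5,)], 10))  # → 0.5
```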
Third, add the name of the metric (which was used as the key to the dictionary in `analyze.py`) to the file `config.yaml` under `metrics`. You can optionally also add it under the plot parameters, if you want it to be included in the plots by default.
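For illustration, the `metrics` section of `config.yaml` could then look as follows; this is a sketch, and the metric names listed are examples rather than the repository's exact entries.

```yaml
# Illustrative fragment of config.yaml; metric names are examples.
metrics:
  - gini
  - max power ratio
```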
Finally, you should add unit tests for the new metric [here](https://github.com/Blockchain-Technology-Lab/tokenomics-decentralization/tree/main/tests) and update the [corresponding documentation page](https://github.com/Blockchain-Technology-Lab/tokenomics-decentralization/blob/main/docs/metrics.md).

docs/data.md

Lines changed: 0 additions & 49 deletions
@@ -46,55 +46,6 @@ WITH double_entry_book AS (

ORDER BY balance DESC
```

The following Cardano section was removed:

### Cardano

```
SELECT *
FROM
(
    WITH blocks AS (
        SELECT
            slot_no AS block_number,
            block_time
        FROM `iog-data-analytics.cardano_mainnet.block`
        WHERE block_time < "{{timestamp}}"
    ),
    OUTPUTS AS (
        SELECT
            slot_no AS output_slot_number,
            CAST(JSON_VALUE(a, '$.out_address') AS STRING) AS address,
            CAST(JSON_VALUE(a, '$.out_idx') AS INT64) AS out_idx,
            CAST(JSON_VALUE(a, '$.out_value') AS INT64) AS value
        FROM `iog-data-analytics.cardano_mainnet.vw_tx_in_out_with_inputs_value`
        JOIN blocks ON block_number = slot_no
        JOIN UNNEST(JSON_QUERY_ARRAY(outputs)) AS a
    ),
    INPUTS AS (
        SELECT
            address,
            CAST(JSON_VALUE(i, '$.out_value') AS INT64) AS value
        FROM `iog-data-analytics.cardano_mainnet.vw_tx_in_out_with_inputs_value`
        JOIN OUTPUTS ON slot_no = output_slot_number
        JOIN UNNEST(JSON_QUERY_ARRAY(inputs)) AS i ON CAST(JSON_VALUE(i, '$.in_idx') AS INT64) = OUTPUTS.out_idx
    ),
    INCOMING AS (
        SELECT address, SUM(CAST(value AS numeric)) AS sum_incoming
        FROM INPUTS
        GROUP BY address
    ),
    OUTGOING AS (
        SELECT address, SUM(CAST(value AS numeric)) AS sum_outgoing
        FROM OUTPUTS
        GROUP BY address
    )
    SELECT i.address, i.sum_incoming - o.sum_outgoing AS balance
    FROM INCOMING AS i
    JOIN OUTGOING AS o ON i.address = o.address
)
WHERE balance > 0
ORDER BY balance DESC
```

### Dogecoin

```

docs/metrics.md

Lines changed: 13 additions & 18 deletions
@@ -2,24 +2,19 @@

The metrics that have been implemented so far are the following:

1. **Nakamoto coefficient**: The Nakamoto coefficient represents the minimum number of entities that collectively control more than 50% of all tokens in circulation at a given point in time. The output of the metric is an integer.
2. **Gini coefficient**: The Gini coefficient represents the degree of inequality in token ownership. The output of the metric is a decimal number in [0,1]. Values close to 0 indicate equality (all entities in the system control the same amount of assets) and values close to 1 indicate inequality (one entity holds most or all tokens).
3. **Entropy**: Shannon entropy represents the expected amount of information in the distribution of tokens across entities. The output of the metric is a real number. Typically, a higher value of entropy indicates higher decentralization (lower predictability).
4. **HHI**: The Herfindahl-Hirschman Index (HHI) is a measure of market concentration. It is defined as the sum of the squares of the market shares (as whole numbers, e.g. 40 for 40%) of the entities in the system. The output of the metric is a real number in (0, 10000]. Values close to 0 indicate low concentration (many entities hold a similar number of tokens) and values close to 10000 indicate high concentration (one entity controls most or all tokens). The U.S. Department of Justice has set the following thresholds for interpreting HHI values (in traditional markets):
    - (0, 1500): Competitive market
    - [1500, 2500]: Moderately concentrated market

@@ -28,9 +23,9 @@

...or the redundancy, in a population. In practice, it is calculated as the maximum possible entropy minus the observed entropy. The output is a real number. Values close to 0 indicate equality and values towards infinity indicate inequality. Therefore, a high Theil Index suggests a population that is highly centralized.
6. **Max power ratio**: The max power ratio represents the share of tokens that are owned by the most "powerful" entity, i.e. the wealthiest entity. The output of the metric is a decimal number in [0,1].
7. **Tau-decentralization index**: The tau-decentralization index is a generalization of the Nakamoto coefficient. It is defined as the minimum number of entities that collectively control more than a given threshold of the total tokens in circulation. The threshold parameter is a decimal in [0, 1] (0.66 by default) and the output of the metric is an integer.
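Two of the definitions above, the Nakamoto coefficient (with its tau-decentralization generalization) and the HHI, can be illustrated on a toy token distribution. This is a sketch of the definitions only, not the tool's implementation.

```python
# Toy illustration of the definitions above; not the repository's code.
def nakamoto_coefficient(balances, threshold=0.5):
    # Minimum number of entities jointly controlling strictly more than
    # `threshold` of the circulating tokens. threshold=0.5 gives the
    # Nakamoto coefficient; other values give the tau-decentralization index.
    total = sum(balances)
    running, count = 0, 0
    for balance in sorted(balances, reverse=True):
        running += balance
        count += 1
        if running > threshold * total:
            return count
    return count

def hhi(balances):
    # Sum of squared market shares, expressed as whole-number percentages.
    total = sum(balances)
    return sum((100 * balance / total) ** 2 for balance in balances)

balances = [50, 30, 10, 10]
print(nakamoto_coefficient(balances))  # → 2
print(hhi(balances))                   # → 3600.0
```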

docs/setup.md

Lines changed: 69 additions & 2 deletions
@@ -16,10 +16,77 @@ project:

python -m pip install -r requirements.txt

## Execution

The tokenomics decentralization analysis tool is a CLI tool. To run the tool, simply do:

python run.py

The execution is controlled and parameterized by the configuration file `config.yaml` as follows.
`metrics` defines the metrics that should be computed in the analysis. By default all supported metrics are included here (to add support for a new metric see the [contributions page](https://blockchain-technology-lab.github.io/tokenomics-decentralization/contribute/)).

`ledgers` defines the ledgers that should be analyzed. By default, all supported ledgers are included here (to add support for a new ledger see the [contributions page](https://blockchain-technology-lab.github.io/tokenomics-decentralization/contribute/)).
`execution_flags` defines various flags that control the data handling:

* `force_map_addresses`: the address helper data from the directory `mapping_information` is re-computed; you should set this flag to true if the data has been updated since the last execution for the given ledger
* `force_map_balances`: the balance data of the ledger's addresses is recomputed; you should set this flag to true if the data has been updated since the last execution for the given ledger
* `force_analyze`: the metrics are recomputed; you should set this flag to true if any type of data has been updated since the last execution for the given ledger
`analyze_flags` defines various analysis-related flags:

* `no_clustering`: a boolean that disables clustering of addresses (under the same entity, as defined in the mapping information)
* `top_limit_type`: a string taking one of two values (`absolute` or `percentage`) that enables applying a threshold on the addresses that will be considered
* `top_limit_value`: the value of the top limit that should be applied; if 0, then no limit is used (regardless of the value of `top_limit_type`); if the type is `absolute`, then the `top_limit_value` should be an integer (e.g., if set to 100, then only the 100 wealthiest entities/addresses will be considered in the analysis); if the type is `percentage`, then the `top_limit_value` should be a decimal in [0, 1] (e.g., if set to 0.50, then only the top 50% of wealthiest entities/addresses will be considered)
* `exclude_contract_addresses`: a boolean value that enables the exclusion of contract addresses from the analysis
* `exclude_below_usd_cent`: a boolean value that enables the exclusion of addresses whose balance at the analyzed point in time was less than $0.01 (based on the historical price information in the directory `price_data`)
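A sketch of what these flags could look like in `config.yaml`; the key names follow the text above, but the concrete values and layout are illustrative assumptions, not recommended defaults.

```yaml
# Illustrative fragment of config.yaml; values are examples only.
execution_flags:
  force_map_addresses: false
  force_map_balances: false
  force_analyze: false

analyze_flags:
  no_clustering: false
  top_limit_type: absolute
  top_limit_value: 0
  exclude_contract_addresses: false
  exclude_below_usd_cent: false
```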
`snapshot_dates` and `granularity` control the snapshots for which an analysis will be performed. `granularity` is a string that can be empty or one of `day`, `week`, `month`, `year`. If granularity is empty, then `snapshot_dates` define the exact time points for which an analysis will be conducted, in the form YYYY-MM-DD. Otherwise, if granularity is set, then the two farthest entries in `snapshot_dates` define the timeframe over which the analysis will be conducted, at the set granularity. For example, if the farthest points are `2010` and `2023` and the granularity is set to `month`, then (the first day of) every month in the years 2010-2023 (inclusive) will be analyzed.
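The month-granularity behavior described above can be sketched as follows; this is an illustration of the described logic, not the tool's code.

```python
from datetime import date

# With granularity "month", the first day of every month between the two
# farthest snapshot dates (inclusive) is analyzed, as described above.
def monthly_snapshots(start, end):
    snapshots = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        snapshots.append(date(year, month, 1))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return snapshots

dates = monthly_snapshots(date(2010, 1, 1), date(2010, 4, 1))
print([d.isoformat() for d in dates])  # → ['2010-01-01', '2010-02-01', '2010-03-01', '2010-04-01']
```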
`input_directories` and `output_directories` are both lists of directories that define the source of data. `input_directories` defines the directories that contain raw address balance information, as obtained from BigQuery or a full node (for more information about this see the [data collection page](https://blockchain-technology-lab.github.io/tokenomics-decentralization/data/)). `output_directories` defines the directories to store the databases which contain the mapping information and analyzed data. The first entry in the output directories is also used to store the output files of the analysis and the plots.
Finally, `plot_parameters` contains various parameters that control the type of plots that will be produced and the data they include.
...
