@IvanUkhov IvanUkhov commented Jul 24, 2025

Makes progress on #4073

Fonts

Resources

Structure

The queries are split by the section where they are used:

  • design/ is about foundries and families,
  • development/ is about tools and technologies, and
  • performance/ is about hosting and serving.

Each file name starts with one of the following prefixes indicating the primary subject of the corresponding analysis:

  • fonts_ is about font files,
  • pages_ is about HTML pages,
  • scripts_ is about JavaScript scripts, and
  • styles_ is about CSS style sheets.

The prefix is followed by the property studied, given in the singular and potentially extended by one or several suffixes narrowing down the scope, as in fonts_size_by_table.sql and pages_link_relation.sql.
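The naming convention above can be read mechanically. The following is a minimal sketch, not part of the pull request: the function name describe and the returned keys are hypothetical, and the example path assumes that the file lives under performance/.

```python
from pathlib import PurePosixPath

# Prefixes listed in the readme; the values are only descriptive labels.
SUBJECTS = {
    "fonts": "font files",
    "pages": "HTML pages",
    "scripts": "JavaScript scripts",
    "styles": "CSS style sheets",
}


def describe(path: str) -> dict:
    """Split a query path into its section, subject, and remaining name."""
    parsed = PurePosixPath(path)
    # Everything before the first underscore is the subject prefix.
    prefix, _, rest = parsed.stem.partition("_")
    if prefix not in SUBJECTS:
        raise ValueError(f"unknown prefix: {prefix!r}")
    return {
        "section": parsed.parent.name,  # design, development, or performance
        "subject": SUBJECTS[prefix],
        "name": rest,  # property plus optional suffixes, e.g. size_by_table
    }


# Hypothetical location of one of the files named in the readme.
info = describe("performance/fonts_size_by_table.sql")
```

Here info["subject"] is the label for the fonts_ prefix, and info["name"] keeps the property together with its suffixes, since the readme does not define a delimiter between the two.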

Content

Each query starts with a preamble indicating the section, question, and normalization type, as illustrated below:

-- Section: Performance
-- Question: What is the distribution of the file size broken down by table?
-- Normalization: Pages
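A preamble of this shape is easy to read back programmatically, for example to name a sheet after the question. The sketch below is an illustration only and not necessarily how execute.py does it; the function name parse_preamble and the lowercase keys are assumptions.

```python
import re

# Matches the three preamble fields shown in the readme.
PREAMBLE_LINE = re.compile(r"^--\s*(Section|Question|Normalization):\s*(.+?)\s*$")


def parse_preamble(sql: str) -> dict:
    """Collect the preamble fields from the leading comment lines of a query."""
    fields = {}
    for line in sql.splitlines():
        stripped = line.strip()
        # Stop at the first line of actual SQL; other comments are skipped.
        if stripped and not stripped.startswith("--"):
            break
        match = PREAMBLE_LINE.match(stripped)
        if match:
            fields[match.group(1).lower()] = match.group(2)
    return fields
```

Scanning all leading comment lines, rather than stopping at the first unknown one, keeps the sketch working regardless of whether other comments precede the preamble.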

Many queries rely on temporary functions for convenience and clarity. The functions that appear in several queries are extracted into a common file called common.sql. Whenever any of the functions defined in common.sql is used by a query, the query has the following pseudo-directive at the top:

-- INCLUDE https://github.com/HTTPArchive/almanac.httparchive.org/blob/main/sql/{year}/fonts/common.sql

The pseudo-directive has to be replaced with the content of common.sql prior to executing the query in question.
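The substitution can be sketched in a few lines. This is a hedged illustration, not the code from execute.py: the function name expand_includes is hypothetical, and the content of common.sql is assumed to have been read from disk already, so the URL in the directive is not fetched.

```python
import re

# One pseudo-directive per line: "-- INCLUDE <url>".
INCLUDE = re.compile(r"^--[ \t]*INCLUDE[ \t]+(\S+)[ \t]*$", re.MULTILINE)


def expand_includes(sql: str, common_sql: str) -> str:
    """Replace every INCLUDE pseudo-directive with the given content."""
    # A function is used as the replacement so that backslashes in the
    # included SQL are not interpreted as regex escapes.
    return INCLUDE.sub(lambda _: common_sql.rstrip("\n"), sql)
```

For example, passing a query that starts with the pseudo-directive together with the text of common.sql yields a single self-contained script that defines the temporary functions before using them.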

In addition, queries generally have parameters, as in @date, so as to be able to run them for different configurations. The values for the parameters will have to be supplied upon execution.
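Before running a query, it can help to check that every parameter has a value. The following is a rough sketch under stated assumptions: the helper names are hypothetical, the table name and the @rank parameter in the usage example are made up, and the regular expression does not distinguish parameters from an @ inside a string literal or comment.

```python
import re

# Named parameters look like @date; this pattern is a simplification.
PARAMETER = re.compile(r"@([A-Za-z_][A-Za-z0-9_]*)")


def find_parameters(sql: str) -> list:
    """Return the distinct parameter names referenced in a query, sorted."""
    return sorted(set(PARAMETER.findall(sql)))


def missing_parameters(sql: str, values: dict) -> list:
    """Return the parameters used by the query that have no supplied value."""
    return [name for name in find_parameters(sql) if name not in values]


# Hypothetical query and values, for illustration only.
example = "SELECT * FROM some_table WHERE date = @date AND rank <= @rank"
unset = missing_parameters(example, {"date": "2025-07-01"})
```

In this usage example, unset lists the parameters that still need a value before the query can be executed.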

All of the above is taken care of automatically if the queries are executed using execute.py, which we discuss next.

Execution

The queries can be executed using the execute.py script. The results are first saved in local CSV files sitting next to the SQL files and then uploaded to the spreadsheet. In the spreadsheet, for each query, a separate sheet is created and named after the question the query answers, which is given in its preamble. If the CSV file already exists, the corresponding query is not executed. If cell A1 is already populated, the corresponding sheet is not updated.
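The rule that an existing CSV file suppresses execution can be sketched as follows. This is an assumption-laden illustration rather than the actual logic of execute.py: the function name pending_queries is hypothetical, and the CSV file is assumed to share the SQL file's name with the extension changed to .csv.

```python
from pathlib import Path


def pending_queries(sql_paths) -> list:
    """Return the queries whose CSV result does not exist yet."""
    pending = []
    for sql_path in map(Path, sql_paths):
        # Assumed layout: design/fonts_x.sql -> design/fonts_x.csv
        csv_path = sql_path.with_suffix(".csv")
        if not csv_path.exists():
            pending.append(sql_path)
    return pending
```

Under this assumed layout, deleting a CSV file would be enough to make the corresponding query eligible for execution again.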

First, ensure that the Application Default Credentials authorization strategy is configured, and that the HTTP Archive project is used as the quota project:

gcloud auth application-default login \
  --scopes https://www.googleapis.com/auth/cloud-platform,https://www.googleapis.com/auth/spreadsheets
gcloud auth application-default set-quota-project httparchive

Second, install the Python prerequisites for the script:

pip install -r requirements.txt

The script can be run for all or a subset of the queries as illustrated below:

python execute.py
python execute.py design/*.sql
python execute.py development/fonts_*.sql

By default, it operates in a dry-run mode: it does not run the queries but prints an estimate of the amount of data that would be processed by each query. To actually run the queries, pass the --no-dry-run option as follows:

python execute.py --no-dry-run
python execute.py --no-dry-run design/*.sql
python execute.py --no-dry-run development/fonts_*.sql
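A command-line interface with this behavior can be expressed with the standard argparse module. The sketch below is not taken from execute.py; the function name build_parser and the help texts are assumptions, and argparse.BooleanOptionalAction requires Python 3.9 or later.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with dry-run enabled by default and optional paths."""
    parser = argparse.ArgumentParser(description="Execute the queries.")
    # BooleanOptionalAction generates both --dry-run and --no-dry-run.
    parser.add_argument(
        "--dry-run",
        action=argparse.BooleanOptionalAction,
        default=True,
        help="estimate the amount of data processed instead of running",
    )
    # Zero paths means all queries; the shell expands globs such as design/*.sql.
    parser.add_argument("paths", nargs="*", help="SQL files to process")
    return parser


arguments = build_parser().parse_args(["--no-dry-run", "design/example.sql"])
```

Defaulting to the dry-run mode means that a bare invocation never incurs query costs; the file design/example.sql above is a placeholder.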

@IvanUkhov IvanUkhov force-pushed the fonts branch 2 times, most recently from 01d64b2 to 1d97236 on July 31, 2025 08:03
@IvanUkhov IvanUkhov force-pushed the fonts branch 2 times, most recently from 68e71e2 to e12330c on August 2, 2025 03:37
@IvanUkhov IvanUkhov changed the title from "Fonts 2025" to "Fonts 2025 queries" on Aug 2, 2025
@IvanUkhov IvanUkhov force-pushed the fonts branch 2 times, most recently from 38e6979 to ac584bd on August 4, 2025 09:35
@IvanUkhov IvanUkhov marked this pull request as ready for review August 4, 2025 09:36
@IvanUkhov
Contributor Author

@tunetheweb, I think you were the one reviewing the queries last year. If you do not mind, I would like to invite you to review this year, too, but please feel free to assign someone else. This year, we did not change anything. We just migrated the queries to the crawl dataset and added a Python script for execution. The instructions are in the readme.

Member

@tunetheweb tunetheweb left a comment

LGTM with some non-blocking comments.

@IvanUkhov
Contributor Author

(The linter is failing due to the code elsewhere.)

@tunetheweb
Member

(The linter is failing due to the code elsewhere.)

Fixing in #4196

@tunetheweb
Member

That's fixed in main now, if you can resync this branch, @IvanUkhov.

After that, are you good to merge this?

@IvanUkhov
Contributor Author

Thank you. Rebased.

Well, I have not received any feedback from the lead. I would merge, if you are OK with potential follow-up PRs.

@tunetheweb
Member

Thank you. Rebased.

Well, I have not received any feedback from the lead. I would merge, if you are OK with potential follow-up PRs.

Yeah, let's do that.

@tunetheweb tunetheweb merged commit b6bcddb into HTTPArchive:main Aug 22, 2025
4 checks passed