Contributor

@burakguneli burakguneli commented Aug 18, 2025

πŸ“ Description

Working on queries for #4086.

✅ Migrated

The following queries have been migrated from 2024 to 2025 using the new crawl dataset: https://har.fyi/guides/migrating-to-crawl-dataset/

  • anchor-rel-attribute-usage-2025.sql
  • anchor-same-site-occurance-stats-2025.sql
  • content-language-2025.sql
  • core-web-vitals-2025.sql
  • hreflang-header-usage-2025.sql
  • hreflang-link-tag-usage-2025.sql
  • html-response-content-language-2025.sql
  • html-response-vary-header-used-2025.sql
  • iframe-loading-property-usage-2025.sql
  • image-alt-stats-2025.sql
  • image-loading-property-usage-2025.sql
  • invalid-head-elements-2025.sql
  • invalid-head-sites-2025.sql
  • lighthouse-seo-stats-2025.sql
  • mark-up-stats-2025.sql
  • media-property-usage-link-tags-rel-alternate-2025.sql
  • meta-tag-usage-by-name-2025.sql
  • meta-tag-usage-by-property-2025.sql
  • outgoing_links_by_rank-2025.sql
  • pages-canonical-stats-2025.sql
  • pages-containing-a-video-element-2025.sql
  • robots-meta-usage-2025.sql
  • robots-text-size-2025.sql
  • robots-txt-size-2025.sql
  • robots-txt-status-codes -2025.sql
  • robots-txt-user-agent-usage-2025.sql
  • seo-stats-2025.sql
  • seo-stats-by-percentile-2025.sql
  • structured-data-formats-2025.sql
  • structured-data-schema-types-2025.sql
  • unused-css-js-2025.sql
  • videos_per_page-2025.sql

@burakguneli burakguneli marked this pull request as draft August 18, 2025 15:20
@burakguneli burakguneli changed the title from "Migrate 2024 SEO queries to 2025 crawl dataset" to "[WIP] Migrate 2024 SEO queries to 2025 crawl dataset" Aug 18, 2025
@tunetheweb tunetheweb changed the title from "[WIP] Migrate 2024 SEO queries to 2025 crawl dataset" to "SEO queries 2025" Aug 18, 2025
@tunetheweb tunetheweb added the analysis Querying the dataset label Aug 18, 2025

Not convinced the filename and the query outputs are aligned here - needs a check!

Query returning no data

@chr156r33n chr156r33n Aug 23, 2025

Now works, new SQL:

CREATE TEMPORARY FUNCTION getRelStatsWptBodies(wpt_bodies_string STRING)
RETURNS STRUCT< rel ARRAY<STRING> >
LANGUAGE js AS '''
var result = {rel: []};
function getKey(dict){
  const arr = [], obj = Object.keys(dict || {});
  for (var i=0;i<obj.length;i++){
    if (Number(dict[obj[i]]) > 0) arr.push(obj[i]);
  }
  return arr;
}
try {
  var wpt_bodies = JSON.parse(wpt_bodies_string);
  if (Array.isArray(wpt_bodies) || typeof wpt_bodies !== 'object') return result;

  if (wpt_bodies.anchors && wpt_bodies.anchors.rendered && wpt_bodies.anchors.rendered.rel_attributes) {
    result.rel = getKey(wpt_bodies.anchors.rendered.rel_attributes);
  }
} catch (e) {}
return result;
''';

WITH rel_stats_table AS (
  SELECT
    client,
    root_page,
    page,
    CASE
      WHEN is_root_page = FALSE THEN 'Secondarypage'
      WHEN is_root_page = TRUE  THEN 'Homepage'
      ELSE 'No Assigned Page'
    END AS is_root_page,

    getRelStatsWptBodies(
      TO_JSON_STRING(JSON_QUERY(TO_JSON(custom_metrics), '$.wpt_bodies'))
    ) AS wpt_bodies_info

  FROM `httparchive.crawl.pages`
  WHERE date = '2025-06-01'
)

SELECT
  client,
  is_root_page,
  rel,
  COUNT(DISTINCT page) AS sites,
  SUM(COUNT(DISTINCT page)) OVER (PARTITION BY client, is_root_page) AS total,
  COUNT(0) / SUM(COUNT(DISTINCT page)) OVER (PARTITION BY client, is_root_page) AS pct
FROM rel_stats_table, UNNEST(wpt_bodies_info.rel) AS rel
GROUP BY client, is_root_page, rel
ORDER BY sites DESC, rel, client DESC;

Output - new sample - https://docs.google.com/spreadsheets/d/1mBFI6sXDuqP72No4VpZUv-5xBI8U_AUdgzmmz_c9FDY/edit?usp=sharing

@chr156r33n chr156r33n Aug 23, 2025

Key differences (old vs new)

  1. Source location

    • Old: JSON_EXTRACT_SCALAR(payload, '$._wpt_bodies')
    • New: TO_JSON_STRING(JSON_QUERY(TO_JSON(custom_metrics), '$.wpt_bodies'))
    • _wpt_bodies moved from inside payload (JSON string) to custom_metrics (STRUCT).
  2. Type handling

    • Old UDFs received a STRING.
    • JSON_QUERY now returns JSON, so wrap with TO_JSON_STRING(...) before passing to JS UDFs.
  3. JSON path stability

    • Old crawls: wpt_bodies.anchors.rendered.rel_attributes
    • Newer crawls: wpt_bodies.anchors.raw.rel_attributes
  4. Other points

    • REGEXP_CONTAINS on JSON → wrap in TO_JSON_STRING(...).
    • Aggregating BOOLs → use COUNTIF(...).
    • custom_metrics is a STRUCT β†’ convert with TO_JSON(...) before pathing.
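
The path difference in point 3 could be absorbed inside the UDF body rather than by editing each query. The following is a minimal sketch, not a tested part of this PR: `pickRelAttributes` and `relKeys` are hypothetical names, the order of preference (`rendered` first, then `raw`) is an assumption, and whether the two branches are comparable for this analysis has not been established in this thread.

```javascript
// Hypothetical helper: return the rel_attributes object from the first
// anchors branch that has one, trying `rendered` and then `raw`.
// The preference order is an assumption, not something confirmed in this PR.
function pickRelAttributes(wptBodies) {
  if (!wptBodies || typeof wptBodies !== 'object' || Array.isArray(wptBodies)) {
    return {};
  }
  var anchors = wptBodies.anchors || {};
  var branches = [anchors.rendered, anchors.raw];
  for (var i = 0; i < branches.length; i++) {
    var branch = branches[i];
    if (branch && branch.rel_attributes && typeof branch.rel_attributes === 'object') {
      return branch.rel_attributes;
    }
  }
  return {};
}

// Same filtering as getKey in the query above: keep rel values whose count is > 0.
function relKeys(wptBodies) {
  var attrs = pickRelAttributes(wptBodies);
  return Object.keys(attrs).filter(function (key) {
    return Number(attrs[key]) > 0;
  });
}
```

Inside the UDF, `result.rel = relKeys(wpt_bodies);` would then replace the explicit `anchors.rendered.rel_attributes` check.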

Member

This:

TO_JSON_STRING(JSON_QUERY(TO_JSON(custom_metrics), '$.wpt_bodies'))

can probably be simplified to this:

custom_metrics.wpt_bodies

The beauty of the new JSON type columns is that you can reference them directly.

Member

Old UDFs received a STRING.

You can change these to expect JSON input instead. That saves you converting to a string before passing it in, and then converting back to JSON inside the UDF.
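
As a rough illustration of that suggestion, here is a sketch of what the rel-attribute UDF body from earlier in this thread might look like once its parameter is declared as JSON instead of STRING. It assumes BigQuery hands a JSON argument to the JavaScript body as an already-parsed value, so the `JSON.parse` call and its `try`/`catch` are dropped. `relStatsFromJson` is a hypothetical wrapper name, used only so the body can be run outside BigQuery.

```javascript
// Hypothetical wrapper around the UDF body. In BigQuery the function body
// would be the statements inside, with `wpt_bodies` as the JSON parameter.
function relStatsFromJson(wpt_bodies) {
  var result = { rel: [] };

  // Assumption: the argument is already a parsed value, not a string.
  if (!wpt_bodies || Array.isArray(wpt_bodies) || typeof wpt_bodies !== 'object') {
    return result;
  }

  var rendered = wpt_bodies.anchors && wpt_bodies.anchors.rendered;
  var attrs = rendered && rendered.rel_attributes;
  if (!attrs || typeof attrs !== 'object') return result;

  // Keep only rel values that occur at least once on the page.
  var keys = Object.keys(attrs);
  for (var i = 0; i < keys.length; i++) {
    if (Number(attrs[keys[i]]) > 0) result.rel.push(keys[i]);
  }
  return result;
}
```

In the query this would pair with a declaration along the lines of `getRelStatsWptBodies(wpt_bodies JSON)` and a call such as `getRelStatsWptBodies(custom_metrics.wpt_bodies)`. Neither has been run against the crawl dataset here.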

@chr156r33n chr156r33n Aug 22, 2025

@chr156r33n chr156r33n Aug 23, 2025

Fixed, SQL, new sampled output - https://docs.google.com/spreadsheets/d/1MdX8mhyuuz5vPyiHq4BSQ0S1Tf2dytMcQFsNUt5dMqA/edit?usp=sharing

#standardSQL
-- Anchor same site occurrence stats

CREATE TEMPORARY FUNCTION getLinkDesciptionsWptBodies(wpt_bodies_string STRING)
RETURNS STRUCT<
  links_same_site INT64,
  links_window_location INT64,
  links_window_open INT64,
  links_href_javascript INT64
>
LANGUAGE js AS '''
var result = {
  links_same_site: 0,
  links_window_location: 0,
  links_window_open: 0,
  links_href_javascript: 0
};
try {
  var w = JSON.parse(wpt_bodies_string);
  if (Array.isArray(w) || typeof w !== 'object') return result;

  var r = w && w.anchors && w.anchors.rendered ? w.anchors.rendered : null;
  if (!r) return result;

  // Defensive: coerce to numbers or 0
  result.links_same_site       = Number(r.same_site) || 0;
  var spd = (r.same_page && r.same_page.dynamic) ? r.same_page.dynamic : {};
  var oa  = spd.onclick_attributes || {};

  result.links_window_location = Number(oa.window_location) || 0;
  result.links_window_open     = Number(oa.window_open) || 0;
  result.links_href_javascript = Number(spd.href_javascript) || 0;
} catch (e) {}
return result;
''';

WITH same_links_info AS (
  SELECT
    client,
    root_page,
    page,
    CASE WHEN is_root_page THEN 'Homepage' ELSE 'Secondarypage' END AS is_root_page,
    -- CHANGED: read from custom_metrics.wpt_bodies (STRUCT -> JSON -> STRING)
    getLinkDesciptionsWptBodies(
      TO_JSON_STRING(JSON_QUERY(TO_JSON(custom_metrics), '$.wpt_bodies'))
    ) AS wpt_bodies_info
  FROM `httparchive.crawl.pages`
  WHERE date = '2025-06-01'
)

SELECT
  client,
  wpt_bodies_info.links_same_site AS links_same_site,
  is_root_page,
  COUNT(DISTINCT page) AS sites,
  SAFE_DIVIDE(COUNT(0), COUNT(DISTINCT page)) AS pct_links_same_site,
  AVG(wpt_bodies_info.links_window_location) AS avg_links_window_location,
  AVG(wpt_bodies_info.links_window_open)     AS avg_links_window_open,
  AVG(wpt_bodies_info.links_href_javascript) AS avg_links_href_javascript,
  AVG(wpt_bodies_info.links_window_location
    + wpt_bodies_info.links_window_open
    + wpt_bodies_info.links_href_javascript) AS avg_links_any,
  MAX(wpt_bodies_info.links_window_location
    + wpt_bodies_info.links_window_open
    + wpt_bodies_info.links_href_javascript) AS max_links_any,
  SUM(COUNT(DISTINCT page)) OVER (PARTITION BY client, is_root_page) AS total,
  COUNT(0) / SUM(COUNT(DISTINCT page)) OVER (PARTITION BY client, is_root_page) AS pct
FROM same_links_info
GROUP BY client, is_root_page, wpt_bodies_info.links_same_site
ORDER BY links_same_site ASC;

Query returns no data

Query returns no data

Query returns no data

Query returns no data

Query returns no data.

-- Extract Lighthouse audits (one row per audit per page)
CREATE TEMPORARY FUNCTION getAudits(audits STRING)
RETURNS ARRAY<STRUCT<id STRING, weight INT64, title STRING, description STRING, score FLOAT64>>
LANGUAGE js AS """
var out = [];
if (!audits) return out;
try {
  var obj = JSON.parse(audits);
  if (!obj || typeof obj !== 'object' || Array.isArray(obj)) return out;
  for (var k in obj) {
    if (Object.prototype.hasOwnProperty.call(obj, k)) {
      var a = obj[k] || {};
      // score can be 0..1 or null
      var sc = (a.score === undefined || a.score === null) ? null : Number(a.score);
      // weight usually isn't on audits; default to 0
      var wt = (a.weight === undefined || a.weight === null) ? 0 : Number(a.weight);
      if (isNaN(wt)) wt = 0;
      out.push({
        id: String(k),
        weight: Math.floor(wt),
        title: a.title || "",
        description: a.description || "",
        score: sc
      });
    }
  }
} catch (e) {}
return out;
""";

WITH lighthouse_extraction AS (
  SELECT
    client,
    CASE
      WHEN is_root_page = FALSE THEN 'Secondarypage'
      WHEN is_root_page = TRUE  THEN 'Homepage'
      ELSE 'No Assigned Page'
    END AS is_root_page,
    page,
    lighthouse AS report
  FROM `httparchive.crawl.pages` TABLESAMPLE SYSTEM (0.002 PERCENT)
  WHERE date = '2025-06-01'
)
SELECT
  client,
  audits.id AS id,
  is_root_page,
  COUNTIF(audits.score > 0) AS num_pages,
  COUNT(DISTINCT page) AS sites,
  COUNTIF(audits.score IS NOT NULL) AS total_applicable,
  SAFE_DIVIDE(COUNTIF(audits.score > 0), COUNTIF(audits.score IS NOT NULL)) AS pct,
  APPROX_QUANTILES(audits.weight, 100)[OFFSET(50)] AS median_weight,
  MAX(audits.title) AS title,
  MAX(audits.description) AS description,
  SUM(COUNT(DISTINCT page)) OVER (PARTITION BY client, is_root_page) AS total
FROM lighthouse_extraction
CROSS JOIN UNNEST(getAudits(TO_JSON_STRING(JSON_QUERY(report, '$.audits')))) AS audits
GROUP BY client, is_root_page, id
ORDER BY client, median_weight DESC, id;

Query works - sampled output https://docs.google.com/spreadsheets/d/11XwOhMb0sFlji6yBDaZk-XAGp5HD6IVDBn1FrPy9lXg/edit?usp=sharing - but is not returning any of the key metrics

Query returns data - sampled output https://docs.google.com/spreadsheets/d/1568FZo0Z4kqRZR1QK619gawEsAMSd2Orz_cdQdYv07s/edit?usp=sharing - but I don't think it is capturing the values correctly.

3 participants