Add stacking/sync script for long wide table#15
Add stacking/sync script for long wide table #15 — yukinko-iwasaki wants to merge 22 commits into main from stacking-scripts
Conversation
| import pyspark.sql.functions as f | ||
|
|
There was a problem hiding this comment.
The PySpark functions module (`pyspark.sql.functions`) is imported twice, on lines 3 and 7. Remove the duplicate import to follow Python best practices and improve code maintainability.
| import pyspark.sql.functions as f |
| spark.sql(f""" | ||
| UPDATE `prd_csc_mega`.`sgld48`.`test_ingestion_metadata` | ||
| SET stacked_ouo_table_version = stacked_ouo_table_version + 1, | ||
| stacked_all_table_version = stacked_all_table_version + 1 | ||
| WHERE table_name IN {names_tuple} | ||
| """) |
There was a problem hiding this comment.
This code uses string formatting to construct an SQL UPDATE statement, which is vulnerable to SQL injection. If the table_name field contains malicious SQL code, it could be executed. Use parameterized queries or proper SQL escaping to prevent SQL injection vulnerabilities.
# 1. Read the source table, randomize the row order with f.rand(),
#    and limit the sample to 20 rows.
test_df = spark.table("`prd_csc_mega`.`sgld48`.`_ingestion_metadata`") \
    .orderBy(f.rand()) \
    .limit(20)

# 2. Save as the new test table.
#    'overwrite' mode mimics 'CREATE OR REPLACE'.
test_df.write \
    .mode("overwrite") \
    .format("delta") \
    .saveAsTable("`prd_csc_mega`.`sgld48`.`test_ingestion_metadata`")

# 3. Grab 5 random table names and "freeze" them into a Python list.
#    .collect() pulls the data out of the Spark plan and into local memory.
random_rows = spark.table("`prd_csc_mega`.`sgld48`.`test_ingestion_metadata`") \
    .select("table_name") \
    .orderBy(f.rand()) \
    .limit(5) \
    .collect()

# Create a list of strings: ['table_a', 'table_b', ...]
target_tables = [row.table_name for row in random_rows]

# 4. Update the Delta table using the frozen list.
if target_tables:
    # Build the SQL IN-list by quoting each name and doubling any embedded
    # single quotes, instead of interpolating str(tuple(...)). This prevents
    # SQL injection through table_name values and also handles the
    # single-item case without the trailing-comma .replace() hack.
    in_list = ", ".join("'" + name.replace("'", "''") + "'" for name in target_tables)

    spark.sql(f"""
        UPDATE `prd_csc_mega`.`sgld48`.`test_ingestion_metadata`
        SET stacked_ouo_table_version = stacked_ouo_table_version + 1,
            stacked_all_table_version = stacked_all_table_version + 1
        WHERE table_name IN ({in_list})
    """)

    print(f"Successfully updated: {target_tables}")
else:
    print("No rows found to update.")
There was a problem hiding this comment.
This test file doesn't define any actual test functions. All pytest test functions must start with "test_" prefix. The code appears to be a script that directly executes operations rather than a test. Consider restructuring this as proper test functions with assertions to validate expected behavior.
# 1. Read the source table, shuffle it with f.rand(), and keep 20 rows.
test_df = spark.table("`prd_csc_mega`.`sgld48`.`_ingestion_metadata`") \
    .orderBy(f.rand()) \
    .limit(20)

# 2. Save as the new test table ('overwrite' mode mimics 'CREATE OR REPLACE').
test_df.write \
    .mode("overwrite") \
    .format("delta") \
    .saveAsTable("`prd_csc_mega`.`sgld48`.`test_ingestion_metadata`")

# 3. Grab 5 random table names and "freeze" them into a Python list.
#    .collect() pulls the data out of the Spark plan and into local memory.
random_rows = spark.table("`prd_csc_mega`.`sgld48`.`test_ingestion_metadata`") \
    .select("table_name") \
    .orderBy(f.rand()) \
    .limit(5) \
    .collect()

# Create a list of strings: ['table_a', 'table_b', ...]
target_tables = [row.table_name for row in random_rows]

# 4. Update the Delta table using the frozen list.
if target_tables:
    # Quote and escape each name individually rather than interpolating
    # str(tuple(...)): this closes the SQL-injection hole flagged in review
    # (a table_name containing a quote could break out of the literal) and
    # makes the one-element case correct without the ",)" replace hack.
    escaped = ["'" + t.replace("'", "''") + "'" for t in target_tables]

    spark.sql(f"""
        UPDATE `prd_csc_mega`.`sgld48`.`test_ingestion_metadata`
        SET stacked_ouo_table_version = stacked_ouo_table_version + 1,
            stacked_all_table_version = stacked_all_table_version + 1
        WHERE table_name IN ({", ".join(escaped)})
    """)

    print(f"Successfully updated: {target_tables}")
else:
    print("No rows found to update.")
| def _prepare_and_update_test_ingestion_metadata(): | |
| # 1. Read the source table | |
| # 2. Randomize using f.rand() | |
| # 3. Limit to 20 rows | |
| test_df = spark.table("`prd_csc_mega`.`sgld48`.`_ingestion_metadata`") \ | |
| .orderBy(f.rand()) \ | |
| .limit(20) | |
| # 4. Save as the new test table | |
| # 'overwrite' mode mimics 'CREATE OR REPLACE' | |
| test_df.write \ | |
| .mode("overwrite") \ | |
| .format("delta") \ | |
| .saveAsTable("`prd_csc_mega`.`sgld48`.`test_ingestion_metadata`") | |
| # 1. Grab 5 random table names and "freeze" them into a Python list | |
| # .collect() pulls the data out of the Spark plan and into local memory | |
| random_rows = spark.table("`prd_csc_mega`.`sgld48`.`test_ingestion_metadata`") \ | |
| .select("table_name") \ | |
| .orderBy(f.rand()) \ | |
| .limit(5) \ | |
| .collect() | |
| # Create a list of strings: ['table_a', 'table_b', ...] | |
| target_tables = [row.table_name for row in random_rows] | |
| # 2. Update the Delta table using the fixed list | |
| # This uses the standard SQL 'IN' syntax but fills it with our Python list | |
| if target_tables: | |
| names_tuple = str(tuple(target_tables)).replace(",)", ")") # Handle single-item edge case | |
| spark.sql(f""" | |
| UPDATE `prd_csc_mega`.`sgld48`.`test_ingestion_metadata` | |
| SET stacked_ouo_table_version = stacked_ouo_table_version + 1, | |
| stacked_all_table_version = stacked_all_table_version + 1 | |
| WHERE table_name IN {names_tuple} | |
| """) | |
| print(f"Successfully updated: {target_tables}") | |
| else: | |
| print("No rows found to update.") | |
def test_prepare_and_update_test_ingestion_metadata():
    """
    Basic pytest-compatible test: run the stacking pipeline, then confirm the
    refreshed test table is readable and holds no more than 20 sampled rows.
    """
    _prepare_and_update_test_ingestion_metadata()
    # Reading the table proves it exists; the count checks the sampling limit.
    row_count = spark.table(
        "`prd_csc_mega`.`sgld48`.`test_ingestion_metadata`"
    ).count()
    assert 0 <= row_count <= 20
| spark_write_table(ouo_df, HARMONIZED_OFFICIAL, mode = "overwrite", options = list("overwriteSchema" = "true")) | ||
| } | ||
|
|
||
| # |
There was a problem hiding this comment.
The comment on line 183 is incomplete or unnecessary. It consists only of a "#" with no content, which should either be removed or completed with meaningful information.
| # |
| # Databricks notebook source | ||
| #TESTING |
There was a problem hiding this comment.
This file has a Databricks notebook source comment, but it appears to be a test file in the pytest directory. Test files should not be Databricks notebooks and should follow the proper pytest structure. The comment "#TESTING" on line 2 also appears to be a placeholder or debugging comment that should be removed.
| # Databricks notebook source | |
| #TESTING |
gld_incremental_sync.r
Outdated
| if (SparkR::tableExists(HARMONIZED_CONFIDENTIAL)) { | ||
| harmonized_all <- tbl(sc, HARMONIZED_CONFIDENTIAL) | ||
| } else { | ||
| create_query <- paste0( | ||
| "CREATE TABLE ", HARMONIZED_CONFIDENTIAL, | ||
| " (", columns_sql, ") USING DELTA" | ||
| ) | ||
| DBI::dbExecute(sc, create_query) | ||
| harmonized_all <- tbl(sc, HARMONIZED_CONFIDENTIAL) | ||
| } | ||
|
|
||
| # Check and create HARMONIZED_OFFICIAL if needed | ||
| if (SparkR::tableExists(HARMONIZED_OFFICIAL)) { |
There was a problem hiding this comment.
Using SparkR::tableExists while the rest of the script uses sparklyr is inconsistent and may not work correctly. The sparklyr package provides its own table existence checking through DBI::dbExistsTable or by catching errors when trying to access the table. This mixing of Spark interfaces could lead to compatibility issues.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…mega-gld into stacking-scripts
No description provided.