Adding defog Academic database for testing #415


Draft

john-sanchez31 wants to merge 23 commits into main from John/acaddb

Contributor

john-sanchez31 commented Aug 25, 2025

Resolves #414

john-sanchez31 added 9 commits

August 19, 2025 16:51


          derm treatment basic1

06057fc


          derm treatment basic questions

1fd4be7


          defog dermtreatments adv questions

170246b


          Merge branch 'main' into John/defogdbs

4564b1a


          dermtreatment adv added


          defog dermtreatment adv questions and mysql defog tests

cedb935


          dermtreatments gen questions

ba497ba


          dermtreatments gen sql files

c11a1ce


          defog academic db init

ffe50a4


john-sanchez31 added 12 commits

August 27, 2025 09:57


          Merge branch 'main' into John/acaddb

d4a592a


          conflicts with main solved

b6eb2d4


          init data postgres and sf, metadata added

68aab6c


          adding pydough functions, fixing metadata

3057d52


          WIP: academic metadata naming and descriptions

92ad7d3


          academic metadata

68228da


          metadata fixed and WIP gen1 pydough

8fd10b1


          WIP 12 gen questions

690b662


          Merge branch 'main' into John/acaddb

ddc3cfd


          gen test 13-20

4078f93


          Merge branch 'main' into John/acaddb

70bb6f4


          mysql and sf e2e test added

831c1bc

john-sanchez31 commented


tests/gen_data/init_defog_sf.sql

    
              (2, '2023-06-01', 500.00, 2500.00, 1, 'bcryptHash(qpwo9874zyGk!)', NULL, 'mobile_yjp08q', '198.51.100.233, 70.121.39.25', true, false, '2023-06-01 09:00:00');


Contributor Author

john-sanchez31 Oct 8, 2025

This file doesn't load the data into Snowflake; I added the data directly. Why do we have this file, exactly?

Contributor

knassre-bodo Oct 9, 2025

The one in question is sf_task.sql

john-sanchez31 added 2 commits

October 8, 2025 09:28


          minor fixes

ec9106d


          Merge branch 'main' into John/acaddb [run all]

28e4857

john-sanchez31 requested review from a team, hadia206, knassre-bodo and juankx-bodo and removed request for a team

October 8, 2025 17:19

juankx-bodo reviewed


tests/gen_data/init_defog_postgres.sql

    
              (14, 13, 'Acetaminophen', '2023-01-08', '2023-01-14', 500, 'mg', 6),

              (15, 14, 'Hydrocortisone cream', '2023-02-25', '2023-03-07', 10, 'g', 12);

              -- ACADEMIC

Contributor

juankx-bodo Oct 9, 2025

My concern with this format (all Defog tables in the same schema, main) is that we could get name conflicts if two tables share a name. These tables should live in a separate schema, ACADEMIC, following the same organization we use in Snowflake. The same can be done for SQLite using ATTACH, and probably for MySQL using schemas.

Contributor

juankx-bodo Oct 9, 2025

In that case, we don't really need four different metadata files, as long as the types used are compatible with the metadata types.

Contributor

knassre-bodo Oct 9, 2025

This is copying how things are done for the real defog benchmark. We can change things potentially, but for now let's keep it consistent.
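
The ATTACH idea raised in this thread can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: the in-memory databases, the `aid`/`name` columns, and the inserted rows are made up for the example and are not taken from the PR's init scripts.

```python
import sqlite3

# In-memory databases keep the sketch self-contained; a real setup would
# most likely attach a database file on disk (that layout is an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS academic")

# Two tables named "author" can coexist because they live in different schemas.
conn.execute("CREATE TABLE main.author (aid INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE academic.author (aid INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO main.author VALUES (1, 'from main')")
conn.execute("INSERT INTO academic.author VALUES (1, 'from academic')")

# Each query is resolved against the schema named in its table path.
main_rows = conn.execute("SELECT name FROM main.author").fetchall()
academic_rows = conn.execute("SELECT name FROM academic.author").fetchall()
print(main_rows, academic_rows)
```

With this arrangement a table path such as `academic.author` would mean the same thing in SQLite as in Snowflake, which is the property the shared-metadata proposal relies on.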

tests/gen_data/init_defog_sf.sql

    
              -- https://github.com/defog-ai/defog-data/blob/main/defog_data/academic/academic.sql

              -------------------------------------------------------------------------------

              CREATE SCHEMA ACADEMIC;

Contributor

juankx-bodo Oct 9, 2025

We would prefer to use this same format for the other engines. Then, we could share the same metadata.

tests/test_metadata/defog_graphs.json

    
                    {

                      "name": "authors",

                      "type": "simple table",

                      "table path": "main.author",

Contributor

juankx-bodo Oct 9, 2025

Is there any critical difference between schemas, other than the table path of the tables? (e.g., main for SQLite and PostgreSQL, none for MySQL, and schema_name for Snowflake?)

Using different arrangements for each engine will make this process unscalable as we add more Defog tests and new engines.
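
If the table path really is the only per-engine difference, a single metadata template could in principle be rendered once per engine. The helper below is a hypothetical sketch of that idea, not something PyDough provides: the engine keys, the `SCHEMA_PREFIX` mapping, and `table_path` are all illustrative names, and the prefixes simply mirror the ones listed in the comment above.

```python
from typing import Dict, Optional

# Schema prefix per engine, as described in the review comment.
# None means the engine uses the bare table name.
SCHEMA_PREFIX: Dict[str, Optional[str]] = {
    "sqlite": "main",
    "postgres": "main",
    "mysql": None,
    "snowflake": "ACADEMIC",
}


def table_path(engine: str, table: str) -> str:
    """Build the engine-specific table path for a logical table name."""
    prefix = SCHEMA_PREFIX[engine]
    return table if prefix is None else f"{prefix}.{table}"


# Render the path of the same logical table for every engine.
paths = {engine: table_path(engine, "author") for engine in SCHEMA_PREFIX}
print(paths)
```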

Contributor

juankx-bodo commented Oct 9, 2025

Change the PR status from draft to ready for review.

knassre-bodo reviewed


Contributor

knassre-bodo left a comment

Did a first pass of revisions (half of them are minor formatting nitpicks)

tests/test_pydough_functions/defog_test_functions.py

Comment on lines +2568 to +2571

    
                  return Academic.CALCULATE(

                      publication_to_author_ratio=NDISTINCT(publications.publication_id)

                      / NDISTINCT(authors.author_id)

                  )

Contributor

knassre-bodo Oct 9, 2025

We don't need NDISTINCT here. That was done in the original query because it joined the two tables before aggregating. We can rewrite it as this:

Suggested change

      
    - return Academic.CALCULATE(
    -     publication_to_author_ratio=NDISTINCT(publications.publication_id)
    -     / NDISTINCT(authors.author_id)
    - )
    + n_pub = COUNT(publications)
    + n_auth = COUNT(authors)
    + return Academic.CALCULATE(
    +     publication_to_author_ratio=n_pub / KEEP_IF(n_auth, n_auth > 0)
    + )
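
The reasoning behind this suggestion can be illustrated in plain Python. The toy data below is invented, and `safe_ratio` is a hypothetical stand-in for the guarded division (it returns None, playing the role of NULL, when the denominator is not positive). The sketch assumes every publication and author appears in the link table, which is when the two approaches agree.

```python
# Toy data (made up): 3 publications, 2 authors, and a link table.
publication_ids = [1, 2, 3]
author_ids = [10, 20]
writes = [(10, 1), (10, 2), (20, 2), (20, 3)]  # (author_id, publication_id)

# Join-first approach: the join yields one row per link, so a plain count
# is inflated and a distinct count is needed to undo the fan-out.
joined_pub_count = len([pid for _, pid in writes])
joined_distinct_pubs = len({pid for _, pid in writes})
joined_distinct_authors = len({aid for aid, _ in writes})


def safe_ratio(numerator, denominator):
    """Divide, returning None (standing in for NULL) if the denominator is not positive."""
    return numerator / denominator if denominator > 0 else None


# Aggregate-first approach: count each table on its own, no distinct needed.
ratio_join = safe_ratio(joined_distinct_pubs, joined_distinct_authors)
ratio_direct = safe_ratio(len(publication_ids), len(author_ids))
print(joined_pub_count, ratio_join, ratio_direct)
```

Counting each collection separately avoids the fan-out entirely, which is why the distinct count becomes unnecessary.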

tests/test_pydough_functions/defog_test_functions.py

Comment on lines +2582 to +2584

    
                  return Academic.CALCULATE(

                      ratio=NDISTINCT(publications.conference_id) / NDISTINCT(publications.journal_id)

                  )

Contributor

knassre-bodo Oct 9, 2025

This isn't what the question is asking. If you look at the SQL query refsol closely, you'll notice it is using NDISTINCT on the pid values based on whether there was a cid or jid. We can rewrite that as this:

Suggested change

      
    - return Academic.CALCULATE(
    -     ratio=NDISTINCT(publications.conference_id) / NDISTINCT(publications.journal_id)
    - )
    + n_confs = SUM(PRESENT(publications.conference_id))
    + n_jours = SUM(PRESENT(publications.journal_id))
    + return Academic.CALCULATE(
    +     ratio=n_confs / KEEP_IF(n_jours, n_jours > 0)
    + )
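
The difference between the two readings of the question shows up on a small example. The rows below are invented for illustration: counting distinct conference and journal ids gives one ratio, while counting publications that have a conference id or a journal id gives another.

```python
# Toy publications (made up): three conference papers, one journal paper.
publications = [
    {"publication_id": 1, "conference_id": 1, "journal_id": None},
    {"publication_id": 2, "conference_id": 1, "journal_id": None},
    {"publication_id": 3, "conference_id": 2, "journal_id": None},
    {"publication_id": 4, "conference_id": None, "journal_id": 5},
]

# Original version: ratio of distinct conferences to distinct journals.
distinct_confs = len({p["conference_id"] for p in publications if p["conference_id"] is not None})
distinct_jours = len({p["journal_id"] for p in publications if p["journal_id"] is not None})
ratio_distinct_ids = distinct_confs / distinct_jours

# Suggested version: ratio of publications with a conference id to
# publications with a journal id (publication ids are unique per row).
n_confs = sum(p["conference_id"] is not None for p in publications)
n_jours = sum(p["journal_id"] is not None for p in publications)
ratio_publications = n_confs / n_jours

print(ratio_distinct_ids, ratio_publications)
```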

tests/test_pydough_functions/defog_test_functions.py

    
                  domain name?

                  """

                  return domains.CALCULATE(

                      name, average_references=AVG(domain_publications.publication.reference_num)

Contributor

knassre-bodo Oct 9, 2025

Suggested change

      
    - name, average_references=AVG(domain_publications.publication.reference_num)
    + name,
    + average_references=AVG(domain_publications.publication.reference_num)

tests/test_pydough_functions/defog_test_functions.py

Comment on lines +2471 to +2473

    
                  return publications.PARTITION(name="years", by=year).CALCULATE(

                      year, COUNT(publications)

                  )

Contributor

knassre-bodo Oct 9, 2025

Suggested change

      
    - return publications.PARTITION(name="years", by=year).CALCULATE(
    -     year, COUNT(publications)
    - )
    + return (
    +     publications
    +     .PARTITION(name="years", by=year)
    +     .CALCULATE(year, COUNT(publications))
    + )

tests/test_pydough_functions/defog_test_functions.py

Comment on lines +2459 to +2461

    
                  return authors.WHERE(HAS(publications_selected)).CALCULATE(

                      name, total_citations=SUM(publications_selected.citation_num)

                  )

Contributor

knassre-bodo Oct 9, 2025

Suggested change

      
    - return authors.WHERE(HAS(publications_selected)).CALCULATE(
    -     name, total_citations=SUM(publications_selected.citation_num)
    - )
    + return (
    +     authors
    +     .WHERE(HAS(publications_selected))
    +     .CALCULATE(name, total_citations=SUM(publications_selected.citation_num))
    + )

tests/test_pydough_functions/defog_test_functions.py

Comment on lines +2642 to +2647

    
                          ratio=IFF(

                              HAS(organizations.authors),

                              NDISTINCT(organizations.authors.author_id)

                              / NDISTINCT(organizations.organization_id),

                              0,

                          ),

Contributor

knassre-bodo Oct 9, 2025

Mathematically, the IFF/HAS here is not needed. We can simplify a bit (the ratio automatically becomes 0 if there are no authors):

Suggested change

      
    - ratio=IFF(
    -     HAS(organizations.authors),
    -     NDISTINCT(organizations.authors.author_id)
    -     / NDISTINCT(organizations.organization_id),
    -     0,
    - ),
    + ratio=COUNT(organizations.authors) / COUNT(organizations)
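
A quick plain-Python check of the claim that the guard is redundant. The organizations and author names are made up, both helper functions are hypothetical, and the sketch assumes there is at least one organization so the denominator is never zero.

```python
def ratio_with_guard(orgs):
    """Mirror the original shape: return 0 explicitly when there are no authors."""
    n_authors = sum(len(authors) for authors in orgs.values())
    if n_authors == 0:
        return 0
    return n_authors / len(orgs)


def ratio_plain(orgs):
    """Mirror the simplified shape: a zero numerator already yields 0."""
    return sum(len(authors) for authors in orgs.values()) / len(orgs)


some_authors = {"org_a": ["alice", "bob"], "org_b": [], "org_c": ["carol"]}
no_authors = {"org_a": [], "org_b": []}

print(ratio_with_guard(some_authors), ratio_plain(some_authors))
print(ratio_with_guard(no_authors), ratio_plain(no_authors))
```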

tests/test_pydough_functions/defog_test_functions.py

Comment on lines +2661 to +2671

    
                  return (

                      writes.CALCULATE(author_name=author.name)

                      .PARTITION(name="authors", by=author_name)

                      .CALCULATE(

                          author_name,

                          count_publication=NDISTINCT(

                              writes.WHERE(publication.year == 2021).publication_id

                          ),

                      )

                      .TOP_K(1, by=count_publication.DESC())

                  )

Contributor

knassre-bodo Oct 9, 2025

You don't need the partition at all. This whole thing becomes simpler if you start from the perspective of authors:

Suggested change

      
    - return (
    -     writes.CALCULATE(author_name=author.name)
    -     .PARTITION(name="authors", by=author_name)
    -     .CALCULATE(
    -         author_name,
    -         count_publication=NDISTINCT(
    -             writes.WHERE(publication.year == 2021).publication_id
    -         ),
    -     )
    -     .TOP_K(1, by=count_publication.DESC())
    - )
    + selected_pubs = writes.publication.WHERE(year == 2021)
    + return (
    +     authors
    +     .WHERE(HAS(selected_pubs))
    +     .CALCULATE(
    +         name,
    +         count_publication=NDISTINCT(selected_pubs.publication_id),
    +     )
    +     .TOP_K(1, by=count_publication.DESC())
    + )
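
The "start from authors" shape of this suggestion can be mirrored in plain Python. The ids, names, and years below are invented; the loop plays the role of iterating over authors, and the emptiness check plays the role of the HAS filter.

```python
# Toy data (made up).
publication_year = {1: 2021, 2: 2021, 3: 2020}
author_names = {10: "A. Author", 20: "B. Author", 30: "C. Author"}
writes = [(10, 1), (10, 2), (20, 2), (20, 3), (30, 3)]  # (author_id, publication_id)

counts = {}
for author_id in author_names:
    # Distinct 2021 publications written by this author.
    selected = {
        pid
        for aid, pid in writes
        if aid == author_id and publication_year[pid] == 2021
    }
    if selected:  # keep only authors with at least one selected publication
        counts[author_id] = len(selected)

# Pick the single author with the most 2021 publications.
top_author_id = max(counts, key=counts.get)
top = (author_names[top_author_id], counts[top_author_id])
print(top)
```

No grouping step is needed because each author is already a single record in the starting collection.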

tests/test_pydough_functions/defog_test_functions.py

Comment on lines +2681 to +2683

    
                  return conferences.CALCULATE(name, count_publications=COUNT(proceedings)).ORDER_BY(

                      count_publications.DESC(), name.DESC()

                  )

Contributor

knassre-bodo Oct 9, 2025

Suggested change

      
    - return conferences.CALCULATE(name, count_publications=COUNT(proceedings)).ORDER_BY(
    -     count_publications.DESC(), name.DESC()
    - )
    + return (
    +     conferences
    +     .CALCULATE(name, count_publications=COUNT(proceedings))
    +     .ORDER_BY(count_publications.DESC(), name.DESC())
    + )

tests/test_pydough_functions/defog_test_functions.py

Comment on lines +2694 to +2696

    
                  return journals.CALCULATE(

                      name, jid=journal_id, num_publications=COUNT(archives)

                  ).ORDER_BY(num_publications.DESC())

Contributor

knassre-bodo Oct 9, 2025

Suggested change

      
    - return journals.CALCULATE(
    -     name, jid=journal_id, num_publications=COUNT(archives)
    - ).ORDER_BY(num_publications.DESC())
    + return (
    +     journals
    +     .CALCULATE(name, journal_id, num_publications=COUNT(archives))
    +     .ORDER_BY(num_publications.DESC())
    + )

tests/test_pydough_functions/defog_test_functions.py

Comment on lines +2708 to +2710

    
                  return conferences.CALCULATE(name, num_publications=COUNT(proceedings)).ORDER_BY(

                      num_publications.DESC(), name

                  )

Contributor

knassre-bodo Oct 9, 2025

Suggested change

      
    - return conferences.CALCULATE(name, num_publications=COUNT(proceedings)).ORDER_BY(
    -     num_publications.DESC(), name
    - )
    + return (
    +     conferences
    +     .CALCULATE(name, num_publications=COUNT(proceedings))
    +     .ORDER_BY(num_publications.DESC(), name.ASC())
    + )
