Skip to content

Conversation

john-sanchez31
Copy link
Contributor

Resolves #414

Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

(2, '2023-06-01', 500.00, 2500.00, 1, 'bcryptHash(qpwo9874zyGk!)', NULL, 'mobile_yjp08q', '198.51.100.233, 70.121.39.25', true, false, '2023-06-01 09:00:00');
(2, '2023-06-01', 500.00, 2500.00, 1, 'bcryptHash(qpwo9874zyGk!)', NULL, 'mobile_yjp08q', '198.51.100.233, 70.121.39.25', true, false, '2023-06-01 09:00:00');


Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file doesn't load the data in snowflake, I added it directly. Why we have this file exactly?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The one in question is sf_task.sql

@john-sanchez31 john-sanchez31 requested review from a team, hadia206, knassre-bodo and juankx-bodo and removed request for a team October 8, 2025 17:19
(14, 13, 'Acetaminophen', '2023-01-08', '2023-01-14', 500, 'mg', 6),
(15, 14, 'Hydrocortisone cream', '2023-02-25', '2023-03-07', 10, 'g', 12);

-- ACADEMIC
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My concern with this format (all defog tables in the same schema: main) is that we could have name conflicts if any table has the same name. These tables should be in a different schema ACADEMIC, the same administrative order we use in snowflake. The same can be done for SQLite using ATTACH. Probably also for MySQL using schemas.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If that is the case, we don't really need 4 different metadata files if the types used are compatible with the metadata type.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is copying how things are done for the real defog benchmark. We can change things potentially, but for now let's keep it consistent.

-- https://github.com/defog-ai/defog-data/blob/main/defog_data/academic/academic.sql

-------------------------------------------------------------------------------
CREATE SCHEMA ACADEMIC;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We would prefer to use this same format for the other engines. Then, we could share the same metadata.

{
"name": "authors",
"type": "simple table",
"table path": "main.author",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any critical difference between schemas, other than the table path of the tables? (e.g., main for SQLite and PostgreSQL, none for MySQL, and schema_name for Snowflake?)

Using different arrangements for each engine will make this process unscalable as we add more Defog tests and new engines.

@juankx-bodo
Copy link
Contributor

Change the PR status from draft to ready for review.

Copy link
Contributor

@knassre-bodo knassre-bodo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did a first pass of revisions (half of them are minor formatting nitpicks)

Comment on lines +2568 to +2571
return Academic.CALCULATE(
publication_to_author_ratio=NDISTINCT(publications.publication_id)
/ NDISTINCT(authors.author_id)
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need NDISTINCT here. That was done in the original query because it joined the two tables before aggregating. We can rewrite it as this:

Suggested change
return Academic.CALCULATE(
publication_to_author_ratio=NDISTINCT(publications.publication_id)
/ NDISTINCT(authors.author_id)
)
n_pub = COUNT(publications)
n_auth = COUNT(authors)
return Academic.CALCULATE(
publication_to_author_ratio=n_pub / KEEP_IF(n_auth, n_auth > 0)
)

Comment on lines +2582 to +2584
return Academic.CALCULATE(
ratio=NDISTINCT(publications.conference_id) / NDISTINCT(publications.journal_id)
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't what the question is asking. If you look at the SQL query refsol closely, you'll notice it is using NDISTINCT on the pid values based on whether there was a cid or jid. We can rewrite that as this:

Suggested change
return Academic.CALCULATE(
ratio=NDISTINCT(publications.conference_id) / NDISTINCT(publications.journal_id)
)
n_confs = SUM(PRESENT(publications.conference_id))
n_jours = SUM(PRESENT(publications.journal_id))
return Academic.CALCULATE(
ratio=n_pubs / KEEP_IF(n_jours, n_jours > 0)
)

domain name?
"""
return domains.CALCULATE(
name, average_references=AVG(domain_publications.publication.reference_num)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
name, average_references=AVG(domain_publications.publication.reference_num)
name,
average_references=AVG(domain_publications.publication.reference_num)

Comment on lines +2471 to +2473
return publications.PARTITION(name="years", by=year).CALCULATE(
year, COUNT(publications)
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return publications.PARTITION(name="years", by=year).CALCULATE(
year, COUNT(publications)
)
return (
publications
.PARTITION(name="years", by=year)
.CALCULATE(year, COUNT(publications))
)

Comment on lines +2459 to +2461
return authors.WHERE(HAS(publications_selected)).CALCULATE(
name, total_citations=SUM(publications_selected.citation_num)
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return authors.WHERE(HAS(publications_selected)).CALCULATE(
name, total_citations=SUM(publications_selected.citation_num)
)
return (
authors
.WHERE(HAS(publications_selected))
.CALCULATE(name, total_citations=SUM(publications_selected.citation_num))
)

Comment on lines +2642 to +2647
ratio=IFF(
HAS(organizations.authors),
NDISTINCT(organizations.authors.author_id)
/ NDISTINCT(organizations.organization_id),
0,
),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mathematically, the IF/HAS here are not needed. Can simplify a bit (this will automatically become 0 if there are no authors):

Suggested change
ratio=IFF(
HAS(organizations.authors),
NDISTINCT(organizations.authors.author_id)
/ NDISTINCT(organizations.organization_id),
0,
),
ratio=COUNT(organizations.authors) / COUNT(organizations)

Comment on lines +2661 to +2671
return (
writes.CALCULATE(author_name=author.name)
.PARTITION(name="authors", by=author_name)
.CALCULATE(
author_name,
count_publication=NDISTINCT(
writes.WHERE(publication.year == 2021).publication_id
),
)
.TOP_K(1, by=count_publication.DESC())
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need the partition at all. This whole thing becomes simpler if you start from the perspective of authors:

Suggested change
return (
writes.CALCULATE(author_name=author.name)
.PARTITION(name="authors", by=author_name)
.CALCULATE(
author_name,
count_publication=NDISTINCT(
writes.WHERE(publication.year == 2021).publication_id
),
)
.TOP_K(1, by=count_publication.DESC())
)
selected_pubs = writes.publication.WHERE(year == 2021)
return (
authors
.WHERE(HAS(selected_pubs))
.CALCULATE(
name,
count_publication=NDISTINCT(selected_pubs.publication_id),
)
.TOP_K(1, by=count_publication.DESC())
)

Comment on lines +2681 to +2683
return conferences.CALCULATE(name, count_publications=COUNT(proceedings)).ORDER_BY(
count_publications.DESC(), name.DESC()
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return conferences.CALCULATE(name, count_publications=COUNT(proceedings)).ORDER_BY(
count_publications.DESC(), name.DESC()
)
return (
conferences
.CALCULATE(name, count_publications=COUNT(proceedings))
.ORDER_BY(count_publications.DESC(), name.DESC())
)

Comment on lines +2694 to +2696
return journals.CALCULATE(
name, jid=journal_id, num_publications=COUNT(archives)
).ORDER_BY(num_publications.DESC())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return journals.CALCULATE(
name, jid=journal_id, num_publications=COUNT(archives)
).ORDER_BY(num_publications.DESC())
return (
journals
.CALCULATE(name, journal_id, num_publications=COUNT(archives))
.ORDER_BY(num_publications.DESC())

Comment on lines +2708 to +2710
return conferences.CALCULATE(name, num_publications=COUNT(proceedings)).ORDER_BY(
num_publications.DESC(), name
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return conferences.CALCULATE(name, num_publications=COUNT(proceedings)).ORDER_BY(
num_publications.DESC(), name
)
return (
conferences
.CALCULATE(name, num_publications=COUNT(proceedings))
.ORDER_BY(num_publications.DESC(), name.ASC())
)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add Academic Defog Database and its questions
3 participants