-
Notifications
You must be signed in to change notification settings - Fork 3
Adding defog Academic database for testing #415
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
(2, '2023-06-01', 500.00, 2500.00, 1, 'bcryptHash(qpwo9874zyGk!)', NULL, 'mobile_yjp08q', '198.51.100.233, 70.121.39.25', true, false, '2023-06-01 09:00:00'); | ||
(2, '2023-06-01', 500.00, 2500.00, 1, 'bcryptHash(qpwo9874zyGk!)', NULL, 'mobile_yjp08q', '198.51.100.233, 70.121.39.25', true, false, '2023-06-01 09:00:00'); | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file doesn't load the data in snowflake, I added it directly. Why we have this file exactly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The one in question is sf_task.sql
(14, 13, 'Acetaminophen', '2023-01-08', '2023-01-14', 500, 'mg', 6), | ||
(15, 14, 'Hydrocortisone cream', '2023-02-25', '2023-03-07', 10, 'g', 12); | ||
|
||
-- ACADEMIC |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My concern with this format (all defog tables in the same schema: main
) is that we could have name conflicts if any table has the same name. These tables should be in a different schema ACADEMIC
, the same administrative order we use in snowflake. The same can be done for SQLite using ATTACH. Probably also for MySQL using schemas.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If that is the case, we don't really need 4 different metadata files if the types used are compatible with the metadata type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is copying how things are done for the real defog benchmark. We can change things potentially, but for now let's keep it consistent.
-- https://github.com/defog-ai/defog-data/blob/main/defog_data/academic/academic.sql | ||
|
||
------------------------------------------------------------------------------- | ||
CREATE SCHEMA ACADEMIC; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We would prefer to use this same format for the other engines. Then, we could share the same metadata.
{ | ||
"name": "authors", | ||
"type": "simple table", | ||
"table path": "main.author", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any critical difference between schemas, other than the table path of the tables? (e.g., main
for SQLite and PostgreSQL, none
for MySQL, and schema_name
for Snowflake?)
Using different arrangements for each engine will make this process unscalable as we add more Defog tests and new engines.
Change the PR status from |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did a first pass of revisions (half of them are minor formatting nitpicks)
return Academic.CALCULATE( | ||
publication_to_author_ratio=NDISTINCT(publications.publication_id) | ||
/ NDISTINCT(authors.author_id) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need NDISTINCT
here. That was done in the original query because it joined the two tables before aggregating. We can rewrite it as this:
return Academic.CALCULATE( | |
publication_to_author_ratio=NDISTINCT(publications.publication_id) | |
/ NDISTINCT(authors.author_id) | |
) | |
n_pub = COUNT(publications) | |
n_auth = COUNT(authors) | |
return Academic.CALCULATE( | |
publication_to_author_ratio=n_pub / KEEP_IF(n_auth, n_auth > 0) | |
) |
return Academic.CALCULATE( | ||
ratio=NDISTINCT(publications.conference_id) / NDISTINCT(publications.journal_id) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't what the question is asking. If you look at the SQL query refsol closely, you'll notice it is using NDISTINCT
on the pid
values based on whether there was a cid
or jid
. We can rewrite that as this:
return Academic.CALCULATE( | |
ratio=NDISTINCT(publications.conference_id) / NDISTINCT(publications.journal_id) | |
) | |
n_confs = SUM(PRESENT(publications.conference_id)) | |
n_jours = SUM(PRESENT(publications.journal_id)) | |
return Academic.CALCULATE( | |
ratio=n_pubs / KEEP_IF(n_jours, n_jours > 0) | |
) |
domain name? | ||
""" | ||
return domains.CALCULATE( | ||
name, average_references=AVG(domain_publications.publication.reference_num) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
name, average_references=AVG(domain_publications.publication.reference_num) | |
name, | |
average_references=AVG(domain_publications.publication.reference_num) |
return publications.PARTITION(name="years", by=year).CALCULATE( | ||
year, COUNT(publications) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return publications.PARTITION(name="years", by=year).CALCULATE( | |
year, COUNT(publications) | |
) | |
return ( | |
publications | |
.PARTITION(name="years", by=year) | |
.CALCULATE(year, COUNT(publications)) | |
) |
return authors.WHERE(HAS(publications_selected)).CALCULATE( | ||
name, total_citations=SUM(publications_selected.citation_num) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return authors.WHERE(HAS(publications_selected)).CALCULATE( | |
name, total_citations=SUM(publications_selected.citation_num) | |
) | |
return ( | |
authors | |
.WHERE(HAS(publications_selected)) | |
.CALCULATE(name, total_citations=SUM(publications_selected.citation_num)) | |
) |
ratio=IFF( | ||
HAS(organizations.authors), | ||
NDISTINCT(organizations.authors.author_id) | ||
/ NDISTINCT(organizations.organization_id), | ||
0, | ||
), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mathematically, the IF/HAS here are not needed. Can simplify a bit (this will automatically become 0 if there are no authors):
ratio=IFF( | |
HAS(organizations.authors), | |
NDISTINCT(organizations.authors.author_id) | |
/ NDISTINCT(organizations.organization_id), | |
0, | |
), | |
ratio=COUNT(organizations.authors) / COUNT(organizations) |
return ( | ||
writes.CALCULATE(author_name=author.name) | ||
.PARTITION(name="authors", by=author_name) | ||
.CALCULATE( | ||
author_name, | ||
count_publication=NDISTINCT( | ||
writes.WHERE(publication.year == 2021).publication_id | ||
), | ||
) | ||
.TOP_K(1, by=count_publication.DESC()) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need the partition at all. This whole thing becomes simpler if you start from the perspective of authors
:
return ( | |
writes.CALCULATE(author_name=author.name) | |
.PARTITION(name="authors", by=author_name) | |
.CALCULATE( | |
author_name, | |
count_publication=NDISTINCT( | |
writes.WHERE(publication.year == 2021).publication_id | |
), | |
) | |
.TOP_K(1, by=count_publication.DESC()) | |
) | |
selected_pubs = writes.publication.WHERE(year == 2021) | |
return ( | |
authors | |
.WHERE(HAS(selected_pubs)) | |
.CALCULATE( | |
name, | |
count_publication=NDISTINCT(selected_pubs.publication_id), | |
) | |
.TOP_K(1, by=count_publication.DESC()) | |
) |
return conferences.CALCULATE(name, count_publications=COUNT(proceedings)).ORDER_BY( | ||
count_publications.DESC(), name.DESC() | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return conferences.CALCULATE(name, count_publications=COUNT(proceedings)).ORDER_BY( | |
count_publications.DESC(), name.DESC() | |
) | |
return ( | |
conferences | |
.CALCULATE(name, count_publications=COUNT(proceedings)) | |
.ORDER_BY(count_publications.DESC(), name.DESC()) | |
) |
return journals.CALCULATE( | ||
name, jid=journal_id, num_publications=COUNT(archives) | ||
).ORDER_BY(num_publications.DESC()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return journals.CALCULATE( | |
name, jid=journal_id, num_publications=COUNT(archives) | |
).ORDER_BY(num_publications.DESC()) | |
return ( | |
journals | |
.CALCULATE(name, journal_id, num_publications=COUNT(archives)) | |
.ORDER_BY(num_publications.DESC()) |
return conferences.CALCULATE(name, num_publications=COUNT(proceedings)).ORDER_BY( | ||
num_publications.DESC(), name | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return conferences.CALCULATE(name, num_publications=COUNT(proceedings)).ORDER_BY( | |
num_publications.DESC(), name | |
) | |
return ( | |
conferences | |
.CALCULATE(name, num_publications=COUNT(proceedings)) | |
.ORDER_BY(num_publications.DESC(), name.ASC()) | |
) |
Resolves #414