I'm new to the world of AI, but I'm a programmer with some experience (though none with vector databases).
I apologize for the long text, but thanks in advance for any help.
I'm planning to launch a free scientific "data" aggregator that uses data from scientific studies published on websites that make them publicly and openly available, and presents everything in an organized, centralized way. To illustrate what I mean: this includes grouping all studies that use a given approach to understanding a disease and "rewriting" all of their content as one linear flow of text (to avoid repetition and allow full understanding, as if everything had come from a single article), for example.
However, I've been running into some difficulties figuring out how to implement this.
These are the main difficulties:
First, scientific articles are usually very long and would easily exceed even the 32k-token limit of GPT-4, for example. The immediate, logical approach would be to simply break the content into smaller chunks. However, what would be the best way to summarize or extract the relevant content from the entire article without losing the context of the information (especially the most relevant parts, which may be dispersed throughout the article)?
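To make this concrete, this is roughly the chunk-then-combine ("map-reduce") approach I've been considering. `call_llm()` is just a placeholder for whatever completion API ends up being used, and the chunk size, overlap and prompts are illustrative guesses, not tested values:

```python
# Rough sketch of chunked ("map-reduce") summarization.
# call_llm() is a placeholder for any LLM completion API (OpenAI, local model, etc.).

def call_llm(prompt: str) -> str:
    """Placeholder for an LLM completion call."""
    raise NotImplementedError

def split_into_chunks(text: str, max_words: int = 2000, overlap: int = 200) -> list[str]:
    """Split on word boundaries with a small overlap so information that
    straddles a chunk boundary is not lost entirely."""
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + max_words]))
        start += max_words - overlap
    return chunks

def summarize_article(article_text: str) -> str:
    # "Map" step: extract the relevant findings from each chunk independently.
    partial_summaries = [
        call_llm(
            "Extract the key findings, methods and limitations from this "
            f"excerpt of a scientific article:\n\n{chunk}"
        )
        for chunk in split_into_chunks(article_text)
    ]
    # "Reduce" step: merge the partial summaries into one linear text.
    return call_llm(
        "Combine the following partial summaries of the same article into a "
        "single coherent summary, removing repetition:\n\n"
        + "\n\n".join(partial_summaries)
    )
```

My worry is exactly the weak point of this scheme: each chunk is summarized without seeing the others, so context that is spread across the article can still be lost.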
Second, it is not uncommon for preliminary studies to be published and subsequently updated (this was common during the COVID-19 pandemic), or (the more likely case) for new scientific studies to complement others done recently. Repetition is something the website really needs to avoid (after all, it saves users time). How can I "safely" check whether a specific aggregate already contains information mentioned in a new article, so that only genuinely new data is inserted (that is, only what is effectively new for the aggregate)?
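What I imagine here is an embedding-similarity check before inserting anything. This is only a minimal sketch: `embed_text()` is a placeholder for any embedding API, the 0.9 threshold is a guess that would need tuning, and a real vector database would replace the brute-force loop:

```python
# Minimal sketch of an "is this already in the aggregate?" check using
# embeddings and cosine similarity. embed_text() is a placeholder.
import math

def embed_text(text: str) -> list[float]:
    """Placeholder for an embedding call (OpenAI, sentence-transformers, etc.)."""
    raise NotImplementedError

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def is_already_covered(new_passage: str, existing_passages: list[str],
                       threshold: float = 0.9) -> bool:
    """Return True if the aggregate already contains something very similar."""
    new_vec = embed_text(new_passage)
    return any(
        cosine_similarity(new_vec, embed_text(p)) >= threshold
        for p in existing_passages
    )
```

In practice the existing passages would be embedded once and stored, and a borderline match could be double-checked by the LLM itself ("does passage B add anything beyond passage A?") rather than trusting the threshold alone, but I don't know if that is the right pattern.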
Third, complementing the second point, I don't want, for example, an aggregate that gathers data from cancer studies from 20 years ago to be updated with current scientific articles (after all, there will very likely be a significant difference in the effectiveness of anything proposed). When a certain period of time considered "safe" is exceeded (this is also a little difficult to determine, since the chronological aspect is not always the only relevant one), I want a new aggregate to be created instead of updating the existing one. How can I ensure that only content that relates to the subject, and is recent enough, can update an aggregate?
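The simplest version of this decision I can picture is gating on both topic similarity and recency, assuming each aggregate stores the publication date of its most recent source. The 5-year window and 0.8 threshold below are arbitrary placeholders; the "safe" period probably varies by field:

```python
# Sketch of the update-vs-new-aggregate decision. The window and threshold
# are placeholders, not recommendations.
from datetime import date

MAX_AGE_YEARS = 5
TOPIC_SIMILARITY_THRESHOLD = 0.8

def should_update_aggregate(aggregate_latest_date: date,
                            new_article_date: date,
                            topic_similarity: float) -> bool:
    """Update the existing aggregate only if the new article is on-topic and
    the aggregate's newest source is still within the recency window."""
    age_years = (new_article_date - aggregate_latest_date).days / 365.25
    return topic_similarity >= TOPIC_SIMILARITY_THRESHOLD and age_years <= MAX_AGE_YEARS
```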
Beyond those difficulties, the cost factor is very important: AI services are not cheap, and there are millions of scientific studies to analyze. The cheaper the solution, the better.
I really appreciate anyone who can help!