Create bib. sections using an automated search for referenced citations#4700
Create bib. sections using an automated search for referenced citations#4700
Conversation
Co-authored-by: balacij <1627302+balacij@users.noreply.github.com>
Previously only extracted citations from Ref constructors in sentences. Now also extracts from DecRefs stored in DataDefinition, GenDefn, InstanceModel, and TheoryModel structures. This fixes missing citations like koothoor2013 in SWHSNoPCM which are referenced via dRefInfo in InstanceModels. Co-authored-by: balacij <1627302+balacij@users.noreply.github.com>
This comment was marked as outdated.
This comment was marked as outdated.
Copilot seems to forget to add dependencies...
Pull fixes from b76a720 that implement thorough sentence extraction: - Extract sentences from IOrgSec's Section parameter - Extract from PDSub subsections - Extract from CorrSolnPpties and FReqsSub LabelledContent - Extract from Defini and Graph comprehensively - Add helper functions getIntroSub, getPDSub, getC, getLC Revert GetChunks back to using findAllCitations - the plate-based approach was incomplete. The real issue was missing sentence extraction in ExtractDocDesc, which is now fixed. Remove ExtractNotebook.hs as it's no longer needed. Co-authored-by: balacij <1627302+balacij@users.noreply.github.com>
This comment was marked as outdated.
This comment was marked as outdated.
…xtraction - Re-add getCitations, citeDB, and modelRefPlate functions to ExtractDocDesc - Restore ExtractNotebook.hs for notebook citation extraction - Update GetChunks to export lookupCitations helper (not findAllCitations) - Keep comprehensive sentence extraction improvements from upstream - Remove commented ciPlate code as requested This combines the plate-based citation extraction approach with the improved sentence traversal logic, ensuring all citations are found while maintaining the goal of only including references actually used in documents. Co-authored-by: balacij <1627302+balacij@users.noreply.github.com>
The issue was that some citations (like smithLai2005 in orgOfDocIntro) are only created when DocDesc is converted to Sections during mkSections. The previous extraction only scanned DocDesc, missing these citations. Solution: - Added getCitationsFromSections to extract from generated Section list - Added citeDBFromSections to look up citations from sections - Modified fillReferences to extract from both DocDesc AND generated sections - Updated mkSections to accept optional BibRef for second pass - In mkDoc, compute full bibliography after fillReferences and pass to mkSections This ensures all citations are found, including those in sentences generated during document construction like the organization section intro. Co-authored-by: balacij <1627302+balacij@users.noreply.github.com>
Co-authored-by: balacij <1627302+balacij@users.noreply.github.com>
|
@JacquesCarette @smiths This PR is ready for review now. Thanks! @Xinlu-Y If you don't mind reviewing it as well, that would be appreciated! |
| getContList = concatMap getCon' | ||
|
|
||
| -- | Extracts 'Sentence's from raw content. | ||
| getCon :: RawContent -> [Sentence] |
There was a problem hiding this comment.
why is this called getCon when it extracts sentences? getSent or extractSent would be better.
| in l : ls ++ rs | ||
| getCon (Defini _ ics) = concatMap (concatMap getCon' . snd) ics | ||
|
|
||
| -- | Get the bibliography from something that has a field. |
There was a problem hiding this comment.
getBib should return [CiteField] and not prematurely extract the inner sentences.
There was a problem hiding this comment.
With your other comment in mind, it ended up being that getBib could be removed entirely.
| getBib a = map getField $ concatMap (^. getFields) a | ||
|
|
||
| -- | Unwraps a 'CiteField' into a 'Sentence'. | ||
| getField :: CiteField -> Sentence |
There was a problem hiding this comment.
Why is this called getField? Plus, why does it extract those pieces and not others? This seems like a very weird piece of code.
There was a problem hiding this comment.
I never really looked into this code closely, I just moved it here.
It's useless.
getField is used exclusively in getBib. getBib is used exclusively in getCon and and egetCon. In both situations, the extracted "sentences" are always of the form S s, where s :: String, i.e., where there are no UIDs or ModelExprs.
We can remove getField and getBib code without affecting stable.
Removed in 6b1aca8 .
| getField Volume{} = EmptyS | ||
| getField Year{} = EmptyS | ||
|
|
||
| -- | Translates different types of lists into a 'Sentence' form. |
There was a problem hiding this comment.
Next 3 functions: why is all this flattening needed? Who is the consumer of this? This seems weird.
There was a problem hiding this comment.
These 3 functions are used in getCon. getCon is used for finding all referenced (a) all chunks (looking specifically for citations here and things mentioned with their abbreviations in #4299) and (b) all units referenced in an about-to-be-rendered SRS and/or Lesson Plan.
Yes, this code is all very fishy. That's why I've been very adamant on #3864, but maybe I was wrong about that direction specifically. Perhaps using HasChunkRefs is our real solution to the greater issue of knowing what chunks are referenced by "atoms" and "molecules". However, HasChunkRefs exposes a list of UIDs, which are meant to be "as internal as possible" to drasil-database and chunk types. They should no longer be used as "chunk references." So, either UIDRefs or UnitypedUIDRefs would need to be returned by HasChunkRefs somehow, preferrably retaining type information in both for when code such as this and the aforementioned areas. Are we okay for renderers to "know" that much about specific types? I'm not sure. Is there a real design discussion to be had here? Yes, for sure.
I would like to return to #2895 soon as well, which is useful to think about in the context of a re-design of how the renderers work and how we should know about what chunks/things are referenced within a produced document.
There was a problem hiding this comment.
Moved all the functions into extractSents's where-block in 9f3627d
| import Data.List (nub, sortBy) | ||
| import qualified Data.Set as Set | ||
| import Data.Maybe (mapMaybe) | ||
| import qualified Data.Set as S |
There was a problem hiding this comment.
Frankly, I rather prefer Set over S !
There was a problem hiding this comment.
Reverted in 9b07be7
It has some benefits. I would like to see the type name unqualified, but I don't like adding two imports from the same module.
| <div class="section"> | ||
| <h1>References</h1> | ||
| <dl class="reference-list"> | ||
| <dt id="jfBeucheIntro">[<b>jfBeucheIntro</b>]</dt> |
There was a problem hiding this comment.
I'm surprised that this reference to a Physics book is no longer needed?
| <dd> | ||
| Wikipedia Contributors. <em>Damping</em>. July, 2019. <a href="https://en.wikipedia.org/wiki/Damping_ratio">https://en.wikipedia.org/wiki/Damping_ratio</a>. | ||
| </dd> | ||
| <dt id="sciComp2013">[<b>sciComp2013</b>]</dt> |
… 'where' block.
|
Because of the unexpected, removed bib. entries from GamePhysics, I will mark this PR as a draft (so we don't merge it accidentally) until the entries are investigated in #4731. |
|
Thanks for cleaning up the citation extraction structure! I’m handling the follow-up separately via case-local citation registration. Would you mind taking a look at #4746? |
Closes #4699
Blocked on #4731
Avoid using
findAllCitationsto gather bibliography entries.Why?
findAllCitationsgets all instances ofCitations from aChunkDBwhich means we gather more than we need for the documents.This PR uses
Setwhenever possible, but does not clean up code related to the existing 'gather all referenced symbols' code for the table of symbols creation. That should be done in a separate PR.