LLM-based freeform data coding #545

@SachaG

We need two functions:

generateCodebook

generateCodebook(question: QuestionMetadata, potentialEntities: Entity[], rawAnswers: RawDataAnswer[]) // returns {codes: Code[]}

// with

type Code = {
  id: string // id of the corresponding entity
  existsInRepo: boolean // whether the code already exists in the entities repo
  entity?: Entity // entity metadata; required when adding a new entity, may be omitted for existing ones
}
  • We need the question so that we can submit the question prompt to the LLM.
  • potentialEntities are the existing, potentially matching entities, as defined by the question's matchTags function. Ideally fewer than 1000 (or whatever the context cutoff is).
  • It can potentially be empty if this is a brand new question covering a totally new topic.
  • Any entities mentioned in previous years but not yet part of that set of entities should be manually included before the existing codebook is passed to the LLM.
  • Any entity with existsInRepo: false should then manually be added to the Entities repo in the appropriate file.
  • To save on time/data, it's ok if the LLM only returns the id for existing entities and not the entire entity object.
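
As a rough sketch of how the returned codes could be post-processed, the helper below splits them into codes that match an entity already passed in and codes that still need to be added to the Entities repo. The `reconcileCodes` name, the minimal `Entity` shape, and the choice to recompute `existsInRepo` from `potentialEntities` instead of trusting the LLM's flag are all assumptions made for illustration, not part of the spec above.

```typescript
// Hypothetical minimal shape; the real Entity type carries more metadata.
type Entity = { id: string; name?: string };

type Code = {
  id: string; // id of the corresponding entity
  existsInRepo: boolean; // whether the code already exists in the entities repo
  entity?: Entity; // only expected for new entities
};

// Split LLM-generated codes into existing ones and ones to add manually.
// Assumption: membership in potentialEntities is the source of truth,
// so the flag returned by the LLM is overwritten.
function reconcileCodes(
  codes: Code[],
  potentialEntities: Entity[]
): { existing: Code[]; toAdd: Code[] } {
  const knownIds = new Set(potentialEntities.map((e) => e.id));
  const existing: Code[] = [];
  const toAdd: Code[] = [];
  for (const code of codes) {
    if (knownIds.has(code.id)) {
      // Existing entities only need their id.
      existing.push({ id: code.id, existsInRepo: true });
    } else {
      // New entities keep whatever metadata the LLM provided.
      toAdd.push({ ...code, existsInRepo: false });
    }
  }
  return { existing, toAdd };
}
```

The `toAdd` list is what would then be reviewed and added by hand to the Entities repo.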

generateMatches

generateMatches(question: QuestionMetadata, matchingEntities: Entity[], rawAnswers: RawDataAnswer[]) // returns {matches: AnswerMatch[]}

// where AnswerMatch is defined as:

type AnswerMatch = {
  index: number;
  answer: string;
  answerId: string;
  tokenIds: string[];
};
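
Since the matches come back from an LLM, a sanity check before storing them may be useful. The sketch below is only illustrative: `validateMatches` is a hypothetical helper, and it assumes that `index` is the position of the answer in `rawAnswers` and that `tokenIds` are ids of entities from `matchingEntities`. Neither interpretation is confirmed above.

```typescript
type AnswerMatch = {
  index: number;
  answer: string;
  answerId: string;
  tokenIds: string[];
};

// Return a list of human-readable problems; an empty list means the matches look consistent.
// Assumptions: index points into rawAnswers, tokenIds refer to known entity ids.
function validateMatches(
  matches: AnswerMatch[],
  entityIds: string[],
  answerCount: number
): string[] {
  const known = new Set(entityIds);
  const problems: string[] = [];
  for (const m of matches) {
    if (!Number.isInteger(m.index) || m.index < 0 || m.index >= answerCount) {
      problems.push(`answer ${m.answerId}: index ${m.index} out of range`);
    }
    for (const id of m.tokenIds) {
      if (!known.has(id)) {
        problems.push(`answer ${m.answerId}: unknown token id "${id}"`);
      }
    }
  }
  return problems;
}
```

Any reported problem would most likely point to a hallucinated id or a misaligned answer, and those matches could be dropped or resubmitted.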
