
Using multi-sample/semantic-entropy for topic categorization (and for summarization?) #12

@jucor

Dear Jigsaw team

If you are concerned about hallucinations in the topic categorization, I have just summarized various ideas we have at Polis to reduce them by exploiting test-time compute with multi-sampling and/or semantic entropy: compdemocracy/polis#1880 .

You are already exploiting the stochasticity of the LLM with the rejection/retry scheme when the LLM gives an empty answer, in:

```typescript
for (let attempts = 1; attempts <= MAX_RETRIES; attempts++) {
  // convert JSON to string representation that will be sent to the model
  const uncategorizedCommentsForModel: string[] = uncategorized.map((comment) =>
    JSON.stringify({ id: comment.id, text: comment.text })
  );
  const outputSchema: TSchema = Type.Array(
    includeSubtopics ? SubtopicCategorizedComment : TopicCategorizedComment
  );
  const newCategorized: CommentRecord[] = (await model.generateData(
    getPrompt(instructions, uncategorizedCommentsForModel, additionalInstructions),
    outputSchema
  )) as CommentRecord[];
  const newProcessedComments = processCategorizedComments(
    newCategorized,
    inputComments,
    uncategorized,
    includeSubtopics,
    topics
  );
  categorized = categorized.concat(newProcessedComments.commentRecords);
  uncategorized = newProcessedComments.uncategorizedComments;
  if (uncategorized.length === 0) {
    break; // All comments categorized successfully
  }
  if (attempts < MAX_RETRIES) {
    console.warn(
      `Expected all ${uncategorizedCommentsForModel.length} comments to be categorized, but ${uncategorized.length} are not categorized properly. Retrying in ${RETRY_DELAY_MS / 1000} seconds...`
    );
    await new Promise((resolve) => setTimeout(resolve, RETRY_DELAY_MS));
  } else {
    categorized = categorized.concat(assignDefaultCategory(uncategorized, includeSubtopics));
  }
}
```

Multi-sampling / semantic entropy would go one step further by systematically calling the LLM multiple times, not just upon failure (the two ideas can also be nested). You would loop not to MAX_RETRIES but to N_SAMPLES (say, 5 or 10, ideally parallelized), followed by an aggregation step.
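To make the idea concrete, here is a minimal sketch of what that multi-sample loop could look like. This is hypothetical code, not a patch against your codebase: `categorizeOnce` stands in for one invocation of the existing retry loop above, and the `Assignment` shape is a simplification of your `CommentRecord`. The N samples run concurrently via `Promise.all`, and each comment keeps the topic assigned most often across samples (majority vote):

```typescript
// Hypothetical sketch: sample the categorizer N times in parallel, then
// keep, for each comment, the topic assigned most often across samples.
type Assignment = { id: string; topic: string };

async function categorizeWithMultiSampling(
  categorizeOnce: () => Promise<Assignment[]>, // one full categorization pass
  nSamples: number
): Promise<Assignment[]> {
  // Run the N samples concurrently rather than sequentially.
  const samples = await Promise.all(
    Array.from({ length: nSamples }, () => categorizeOnce())
  );

  // Tally, per comment id, how often each topic was assigned.
  const votes = new Map<string, Map<string, number>>();
  for (const sample of samples) {
    for (const { id, topic } of sample) {
      const tally = votes.get(id) ?? new Map<string, number>();
      tally.set(topic, (tally.get(topic) ?? 0) + 1);
      votes.set(id, tally);
    }
  }

  // Majority vote: pick the modal topic for each comment.
  const result: Assignment[] = [];
  for (const [id, tally] of votes) {
    let best = "";
    let bestCount = -1;
    for (const [topic, count] of tally) {
      if (count > bestCount) {
        best = topic;
        bestCount = count;
      }
    }
    result.push({ id, topic: best });
  }
  return result;
}
```

The vote tallies are also exactly what a semantic-entropy-style confidence score would be computed from, so the aggregation step gets both the label and its uncertainty for free.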

Yes, it will take longer, hence a trade-off with #11 :) but hopefully the parallelization will more than compensate. Cost-wise, it multiplies the number of LLM calls over all the comments, and thus the total cost, which may or may not be an issue depending on the current typical cost and on whether topics are already "good enough".
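One more note on the semantic-entropy side: because topic labels are discrete, the entropy of the empirical label distribution across samples is a direct analogue of semantic entropy, with no semantic clustering needed. A small helper (hypothetical, standalone) to illustrate; high-entropy comments could then be flagged for review or given the default category instead of a confidently wrong one:

```typescript
// Entropy (in bits) of the empirical distribution of sampled topic labels.
// 0 means all samples agreed; higher values mean the model is uncertain.
function labelEntropy(topics: string[]): number {
  const counts = new Map<string, number>();
  for (const t of topics) counts.set(t, (counts.get(t) ?? 0) + 1);
  let entropy = 0;
  for (const count of counts.values()) {
    const p = count / topics.length;
    entropy -= p * Math.log2(p);
  }
  return entropy;
}
```

For example, five samples all saying "Housing" give entropy 0, while a 3/2 split between two topics gives about 0.97 bits, so a simple threshold separates confident assignments from ones worth a second look.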
