[Share] After creating annotations, automatically remove spaces and garbled characters from the content #269
Replies: 7 comments 11 replies
-
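Introduction
This script removes redundant spaces and line breaks from annotations, converts full-width letters and digits to half-width, and normalizes punctuation.
Usage
The script can be triggered automatically (event: Create Annotation) or manually (menu item: In Annotation Menu).
Copy the following code in full into the "Data" field:
Version 1: Rule-based Processing (no network access required)
/**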
* Format Chinese Annotations
* @author wakewon
* @usage Create Annotation & In Annotation Menu
* @link https://github.com/windingwind/zotero-actions-tags/discussions/269
* @see https://github.com/windingwind/zotero-actions-tags/discussions/269
*/
if (!item) return;
const topItem = Zotero.Items.getTopLevel([item])[0];
const formatLang = ["", "zh", "zh-CN", "zh_CN"];
const lang = topItem.getField("language");
if (!formatLang.includes(lang)) return "[Action: Format Chinese Annotations] Skip due to language";
return await editAnnotation(item);
async function editAnnotation(annotationItem) {
if (!annotationItem.isAnnotation()) return "[Action: Format Chinese Annotations] Not an annotation item";
if (!annotationItem.annotationText) return "[Action: Format Chinese Annotations] No text found in this annotation";
annotationItem.annotationText = await formatText(annotationItem.annotationText);
return;
}
async function formatText(text) {
const punctuationMap = { '.': '。', ',': ',', '!': '!', '?': '?', ':': ':', ';': ';' };
const fullWidthToHalfWidth = s => String.fromCharCode(s.charCodeAt(0) - 0xFEE0);
const chineseCharacters = '[\u4e00-\u9fa5\u3400-\u4DBF\uF900-\uFAFF]'; // Chinese characters (simplified and traditional)
return text
.replace(/[\r\n]/g, '') // Remove all line breaks
.replace(/[\uE5D2\uE5CF\uE5CE\uE5E5]/g, '') // Remove special characters
.replace(/[A-Za-z0-9!"'()[]{}<>,.:;-]/g, fullWidthToHalfWidth) // Full-width to half-width
.replace(/\s+/g, ' ') // Replace consecutive spaces with a single space
.replace(/(?<=\d)\s+|\s+(?=\d)/g, '') // Remove spaces around digits
.replace(/\s*(?=[.,:;!?"()\[\]。?!,、;:“”‘’()《》【】])|(?<=[.,:;!?"()\[\]。?!,、;:“”‘’()《》【】])\s*/g, '') // Remove spaces around punctuation
.replace(new RegExp(`(\\S)\\s+(?=${chineseCharacters})|(?<=${chineseCharacters})\\s+(\\S)`, 'g'), '\$1\$2') // Remove spaces between Chinese characters
.replace(new RegExp(`(${chineseCharacters}+)([,.!?:;]+)`, 'g'), (m, c, p) => c + p.split('').map(p => punctuationMap[p]).join('')) // Replace English punctuation marks with Chinese ones
.replace(new RegExp(`([,.!?:;]+)(${chineseCharacters}+)`, 'g'), (m, p, c) => p.split('').map(p => punctuationMap[p]).join('') + c) // Replace English punctuation marks with Chinese ones
.replace(new RegExp(`\\(([^()]*${chineseCharacters}[^()]*)\\)|\\[([^\\[\\]]*${chineseCharacters}[^\\[\\]]*)\\]`, 'g'), (m, c1, c2) => c1 ? `(${c1})` : `【${c2}】`) // Replace parentheses and brackets around Chinese text with full-width ones
.replace(/([0-9a-zA-Z])(/g, "\$1" + String.fromCharCode(0xFF08)) // Full-width parentheses around digits and letters
.replace(/)([0-9a-zA-Z])/g, String.fromCharCode(0xFF09) + "\$1") // Full-width parentheses around digits and letters
.replace(/([a-zA-Z]+)([,.!?:;]+)([a-zA-Z]+)/g, (m, w1, p, w2) => w1 + p + ' ' + w2) // Add space for English punctuations
.replace(/(\S)\(/g, '\$1 (') // Add space before parenthesis
.replace(new RegExp(`\\)(?=${chineseCharacters})`, 'g'), ') ') // Add space after parenthesis
.replace(/([,.!?:;)])(?!\s|(?<=\.)\d)/g, '\$1 ') // Add a space after punctuation if not followed by a space or a digit after '.'
.replace(/🔤(.*)/g, (match, p1) => p1.trim() ? `\n🔤${p1}` : '🔤'); // Add a newline before 🔤 if there is content after it
}
Version 2: Processing with AI (requires network access and a valid OpenAI API key)
/**
* AI Normalize Punctuation
* This script standardizes punctuation in the selected text, handling both Chinese and English punctuation.
* It uses the OpenAI API for text processing.
*
* @usage In Annotation Menu
* @link https://github.com/windingwind/zotero-actions-tags/discussions/269
* @see https://github.com/windingwind/zotero-actions-tags/discussions/269
*/
/** { 👍 "openai" } service provider */
const SERVICE = "openai";
// OpenAI API configuration
const OPENAI = {
API_KEY: "InputYourKeyHere", // Replace with your OpenAI API key.
MODEL: "gpt-3.5-turbo", // Default model name; change as needed.
API_URL: "https://api.openai.com/v1/chat/completions", // Request URL; change as needed.
};
if (!item) return;
const topItem = Zotero.Items.getTopLevel([item])[0];
const formatLang = ["", "zh", "zh-CN", "zh_CN"];
const lang = topItem.getField("language");
if (!formatLang.includes(lang)) return "[Action: AI Normalize Punctuation] Skip due to language";
return await normalizePunctuation(item);
async function normalizePunctuation(annotationItem) {
if (!annotationItem.isAnnotation()) return "[Action: AI Normalize Punctuation] Not an annotation item";
if (!annotationItem.annotationText) return "[Action: AI Normalize Punctuation] No text found in this annotation";
const selectedText = annotationItem.annotationText;
let result;
let success;
switch (SERVICE) {
case "openai":
({ result, success } = await callOpenAI(selectedText));
break;
default:
result = "Service Not Found";
success = false;
}
if (success) {
annotationItem.annotationText = `${result}`;
return `Formatted Text: ${result}`;
} else {
return `Error: ${result}`;
}
}
async function callOpenAI(text) {
const prompt = `
As a text normalization processor, please format the following text according to these rules:
1. Punctuation rules:
- Use Chinese punctuation (。,;:!?) for Chinese content
- Use English punctuation (.,:;!?) for English content
- Remove duplicate punctuation marks
- Ensure proper spacing around punctuation
2. Space normalization:
- Remove leading/trailing whitespace
- Remove redundant spaces between words
- Maintain single space between English and Chinese text
- Keep single space between sentences
3. Mathematical formula conversion:
- When encountering ordinary text within a formula environment, convert the paragraph containing LaTeX formulas into a mixed form of regular text and in-line formulas, ensuring that mathematical formulas are represented using in-line formulas (i.e., enclosed by $ and $)
- Preserve LaTeX commands and symbols
Return only the processed text without any explanations. Input text:
${text}
`;
const data = {
model: OPENAI.MODEL,
messages: [
{ role: "system", content: "You are an automated language processing program." },
{ role: "user", content: prompt }
],
max_tokens: 1000,
temperature: 0.2,
};
try {
const xhr = await Zotero.HTTP.request(
"POST",
OPENAI.API_URL,
{
headers: {
'Authorization': `Bearer ${OPENAI.API_KEY}`,
'Content-Type': 'application/json; charset=utf-8',
},
body: JSON.stringify(data),
responseType: "json",
}
);
if (xhr && xhr.status && xhr.status === 200 && xhr.response.choices && xhr.response.choices.length > 0) {
return {
success: true,
result: xhr.response.choices[0].message.content.trim(),
};
} else {
return {
result: xhr.response.error ? xhr.response.error.message : 'Unknown error',
success: false,
};
}
} catch (error) {
console.error('Error calling OpenAI API:', error);
return {
result: error.message,
success: false,
};
}
}
Customized Usage
Skip items in specific languages
This script only processes PDF documents under items whose language field is zh, zh-CN, or zh_CN, or that have no language information.
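To process items in other languages, one option is to extend the formatLang array used by both versions. The sketch below isolates that check in a hypothetical helper, shouldFormat, so it can be tried outside Zotero. The helper name and the extra codes zh-TW and zh-HK are assumptions added for illustration and are not part of the original script.

```javascript
// Language codes whose annotations should be processed.
// "" covers items with no language set, as in the original script.
// "zh-TW" and "zh-HK" are illustrative additions, not part of the original list.
const formatLang = ["", "zh", "zh-CN", "zh_CN", "zh-TW", "zh-HK"];

// Hypothetical helper: returns true when the item's language value is in the list.
function shouldFormat(lang) {
  // Treat a missing value the same as an empty language field.
  return formatLang.includes(lang ?? "");
}

console.log(shouldFormat("zh-CN"));
console.log(shouldFormat("en"));
```

In the script itself, only the formatLang line needs to change; the existing includes() check then picks up the new codes.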
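Turn off alert pop-ups
If you want to turn off a particular pop-up alert, remove the double-quoted message text (for example, "[Action: Format Chinese Annotations] Skip due to language"), including the quotes, from the corresponding line of the code.
Acknowledgements
This script is mainly based on #107 and #220, and most of the code was written with the help of gpt-4o. Thanks again to the authors of the original scripts, and to GPT for its strong support!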
-
Thanks a lot! This is awesome!
-
Thanks, this is great. Finally, a convenient way to fix the spacing problem.
-
Hello, how do I open the "Edit Action" window?
-
Thank you so much. This setup perfectly solves the visual annoyance caused by OCR.
-
Thank you!!
-
Is there an existing issue for this?
Environment
Describe the feature request
Is your feature request related to a problem? Please describe.
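Thanks for developing Actions & Tags! The automation saves a lot of effort. Would it be possible to use Actions & Tags to automatically remove spaces and garbled characters from an annotation's content after the annotation is added?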
Why do you need this feature?
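Chinese text in PDFs sometimes contains many spaces; even when a line has no spaces, line breaks introduce them. Tools such as Shortcuts and Quicker can already remove the spaces once the text is selected, but could Actions & Tags remove them fully automatically? Thanks to the developer!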