[Share] After creating annotations, automatically remove spaces and garbled characters from the content #269

yzy1228682367 · 2024-02-20T12:16:38Z

yzy1228682367
Feb 20, 2024

Is there an existing issue for this?

I have searched the existing issues

Environment

OS: macOS Sonoma 14.1.2
Zotero Version: zotero 7 beta 60
Plugin Version: 1.0.0-beta.34

Describe the feature request

Is your feature request related to a problem? Please describe.
Thanks for developing Actions & Tags! The automation saves a lot of effort. Could Actions & Tags be used to automatically remove spaces and garbled characters from an annotation's content after the annotation is created?

Why do you need this feature?
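Chinese text in PDFs often contains many spaces; even when a line has no spaces, line breaks introduce them. Tools such as Shortcuts or Quicker can already remove the spaces from selected text, but could Actions & Tags do this fully automatically? Thanks to the developer!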


wakewon · 2024-02-25T16:53:19Z

wakewon
Feb 25, 2024

Introduction

This script removes extra spaces and line breaks from annotations, converts full-width letters and digits to half-width, and normalizes punctuation.

Usage

This script can be triggered automatically (Event: Create Annotation) or manually (Menu Label: In Annotation Menu).

Copy the following script in full into the "Data" field:

Version 1: Rule-based processing (no network access required)

/**
 * Format Chinese Annotations
 * @author wakewon
 * @usage Create Annotation & In Annotation Menu
 * @link https://github.com/windingwind/zotero-actions-tags/discussions/269
 * @see https://github.com/windingwind/zotero-actions-tags/discussions/269
*/

if (!item) return;

const topItem = Zotero.Items.getTopLevel([item])[0];
const formatLang = ["", "zh", "zh-CN", "zh_CN"];
const lang = topItem.getField("language");

if (!formatLang.includes(lang)) return "[Action: Format Chinese Annotations] Skip due to language";

return await editAnnotation(item);

async function editAnnotation(annotationItem) {
  if (!annotationItem.isAnnotation()) return "[Action: Format Chinese Annotations] Not an annotation item";
  if (!annotationItem.annotationText) return "[Action: Format Chinese Annotations] No text found in this annotation";
  
  annotationItem.annotationText = await formatText(annotationItem.annotationText);
  return;
}

async function formatText(text) {
  const punctuationMap = { '.': '。', ',': '，', '!': '！', '?': '？', ':': '：', ';': '；' };
  const fullWidthToHalfWidth = s => String.fromCharCode(s.charCodeAt(0) - 0xFEE0);
  const chineseCharacters = '[\u4e00-\u9fa5\u3400-\u4DBF\uF900-\uFAFF]'; // Chinese characters (simplified and traditional)

  return text
    .replace(/[\r\n]/g, '') // Remove all line breaks
    .replace(/[\uE5D2\uE5CF\uE5CE\uE5E5]/g, '') // Remove special characters
    .replace(/[Ａ-Ｚａ-ｚ０-９！＂＇（）［］｛｝＜＞，．：；－]/g, fullWidthToHalfWidth) // Full-width to half-width
    .replace(/\s+/g, ' ') // Replace consecutive spaces with a single space
    .replace(/(?<=\d)\s+|\s+(?=\d)/g, '') // Remove spaces around digits
    .replace(/\s*(?=[.,:;!?"()\[\]。？！，、；：“”‘’（）《》【】])|(?<=[.,:;!?"()\[\]。？！，、；：“”‘’（）《》【】])\s*/g, '') // Remove spaces around punctuation
    .replace(new RegExp(`(\\S)\\s+(?=${chineseCharacters})|(?<=${chineseCharacters})\\s+(\\S)`, 'g'), '$1$2') // Remove spaces between Chinese characters
    .replace(new RegExp(`(${chineseCharacters}+)([,.!?:;]+)`, 'g'), (m, c, p) => c + p.split('').map(p => punctuationMap[p]).join('')) // Replace English punctuation marks with Chinese ones
    .replace(new RegExp(`([,.!?:;]+)(${chineseCharacters}+)`, 'g'), (m, p, c) => p.split('').map(p => punctuationMap[p]).join('') + c) // Replace English punctuation marks with Chinese ones
    .replace(new RegExp(`\\(([^()]*${chineseCharacters}[^()]*)\\)|\\[([^\\[\\]]*${chineseCharacters}[^\\[\\]]*)\\]`, 'g'), (m, c1, c2) => c1 ? `（${c1}）` : `【${c2}】`) // Replace full-width parentheses
    .replace(/([0-9a-zA-Z])（/g, "$1" + String.fromCharCode(0xFF08)) // Full-width parentheses around digits and letters
    .replace(/）([0-9a-zA-Z])/g, String.fromCharCode(0xFF09) + "$1") // Full-width parentheses around digits and letters
    .replace(/([a-zA-Z]+)([,.!?:;]+)([a-zA-Z]+)/g, (m, w1, p, w2) => w1 + p + ' ' + w2) // Add space for English punctuations
    .replace(/(\S)\(/g, '$1 (') // Add space before parenthesis
    .replace(new RegExp(`\\)(?=${chineseCharacters})`, 'g'), ') ') // Add space after parenthesis (the lookahead leaves the Chinese character in place)
    .replace(/([,.!?:;)])(?!\s|(?<=\.)\d)/g, '$1 ') // Add a space after punctuation if not followed by a space or a digit after '.'
    .replace(/🔤(.*)/g, (match, p1) => p1.trim() ? `\n🔤${p1}` : '🔤'); // Add a newline before 🔤 if there is content after it
}
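
To see two of the rules above in isolation, here is a minimal sketch that runs in plain Node.js without any Zotero APIs. The helper names toHalfWidth and stripSpacesAroundChinese are made up for illustration and are not part of the script; toHalfWidth also converts the whole full-width block rather than the hand-picked subset used above.

```javascript
// Standalone sketch of two rules from the script; the function names are hypothetical.
const CJK = "[\\u4e00-\\u9fa5\\u3400-\\u4DBF\\uF900-\\uFAFF]";

// Full-width ASCII variants (U+FF01..U+FF5E) sit exactly 0xFEE0 above their ASCII counterparts.
function toHalfWidth(text) {
  return text.replace(/[\uFF01-\uFF5E]/g, (ch) =>
    String.fromCharCode(ch.charCodeAt(0) - 0xfee0)
  );
}

// Remove whitespace that touches a Chinese character on either side.
function stripSpacesAroundChinese(text) {
  const pattern = new RegExp(`\\s+(?=${CJK})|(?<=${CJK})\\s+`, "g");
  return text.replace(pattern, "");
}

console.log(toHalfWidth("ＡＢＣ１２３")); // ABC123
console.log(stripSpacesAroundChinese("中 文 文 本")); // 中文文本
```

The order of the rules matters in the full script: full-width characters are converted first so that the later punctuation and spacing rules only have to deal with one form of each character.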

Version 2: Processing with AI (requires network access and a valid OpenAI API key)

You need a valid OpenAI API key and a network connection that can reach the OpenAI service (or a working alternative service URL). Before using the script, set API_KEY below by replacing the text inside the double quotes with your OpenAI key.

API_URL must be the full request URL, in the same form as in the script. If your provider gives only a short domain (e.g. api.chatanywhere.com.cn), replace api.openai.com in the script with that domain and keep the rest of the path, i.e. change it to https://api.chatanywhere.com.cn/v1/chat/completions.
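
As a quick check of the domain swap, the sketch below builds the full URL from a short domain in plain Node.js. buildApiUrl is a hypothetical helper for illustration only; in the script itself you simply paste the resulting string into API_URL.

```javascript
// Hypothetical helper (not part of the script): derive the full endpoint from a provider's short domain.
function buildApiUrl(host) {
  const url = new URL("https://api.openai.com/v1/chat/completions");
  url.host = host; // swap only the host; scheme and path stay the same
  return url.toString();
}

console.log(buildApiUrl("api.chatanywhere.com.cn"));
// https://api.chatanywhere.com.cn/v1/chat/completions
```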

/**
 * AI Normalize Punctuation
 * This script standardizes punctuation in the selected text, handling both Chinese and English punctuation.
 * It uses the OpenAI API for text processing.
 * 
 * @usage In Annotation Menu
 * @link https://github.com/windingwind/zotero-actions-tags/discussions/269
 * @see https://github.com/windingwind/zotero-actions-tags/discussions/269
 */

/** { 👍 "openai" } service provider */
const SERVICE = "openai";

// OpenAI API configuration
const OPENAI = {
  API_KEY: "InputYourKeyHere", // Replace with your OpenAI API key.
  MODEL: "gpt-3.5-turbo", // Default model name; change as needed.
  API_URL: "https://api.openai.com/v1/chat/completions", // Request URL; change as needed.
};

if (!item) return;

const topItem = Zotero.Items.getTopLevel([item])[0];
const formatLang = ["", "zh", "zh-CN", "zh_CN"];
const lang = topItem.getField("language"); // Language of the parent item

if (!formatLang.includes(lang)) return "[Action: AI Normalize Punctuation] Skip due to language";

return await normalizePunctuation(item);

async function normalizePunctuation(annotationItem) {
  if (!annotationItem.isAnnotation()) return "[Action: AI Normalize Punctuation] Not an annotation item";
  if (!annotationItem.annotationText) return "[Action: AI Normalize Punctuation] No text found in this annotation";


  const selectedText = annotationItem.annotationText;
  let result;
  let success;

  switch (SERVICE) {
    case "openai":
      ({ result, success } = await callOpenAI(selectedText));
      break;
    default:
      result = "Service Not Found";
      success = false;
  }

  if (success) {
    annotationItem.annotationText = `${result}`;
    return `Formatted Text: ${result}`;
  } else {
    return `Error: ${result}`;
  }
}

async function callOpenAI(text) {
  const prompt = `
  As a text normalization processor, please format the following text according to these rules:

  1. Punctuation rules:
    - Use Chinese punctuation (。，；：！？) for Chinese content
    - Use English punctuation (.,:;!?) for English content
    - Remove duplicate punctuation marks
    - Ensure proper spacing around punctuation

  2. Space normalization:
    - Remove leading/trailing whitespace
    - Remove redundant spaces between words
    - Maintain single space between English and Chinese text
    - Keep single space between sentences

  3. Mathematical formula conversion:
    - When encountering ordinary text within a formula environment, convert the paragraph containing LaTeX formulas into a mixed form of regular text and in-line formulas, ensuring that mathematical formulas are represented using in-line formulas (i.e., enclosed by $ and $)
    - Preserve LaTeX commands and symbols

  Return only the processed text without any explanations. Input text:
  
  ${text}
  `;

  const data = {
    model: OPENAI.MODEL,
    messages: [
      { role: "system", content: "You are an automated language processing program." },
      { role: "user", content: prompt }
    ],
    max_tokens: 1000,
    temperature: 0.2,
  };

  try {
    const xhr = await Zotero.HTTP.request(
      "POST",
      OPENAI.API_URL,
      {
        headers: {
          'Authorization': `Bearer ${OPENAI.API_KEY}`,
          'Content-Type': 'application/json; charset=utf-8',
        },
        body: JSON.stringify(data),
        responseType: "json",
      }
    );

    if (xhr && xhr.status && xhr.status === 200 && xhr.response.choices && xhr.response.choices.length > 0) {
      return {
        success: true,
        result: xhr.response.choices[0].message.content.trim(),
      };
    } else {
      return {
        result: xhr.response.error ? xhr.response.error.message : 'Unknown error',
        success: false,
      };
    }
  } catch (error) {
    console.error('Error calling OpenAI API:', error);
    return {
      result: error.message,
      success: false,
    };
  }
}
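
The response check in callOpenAI reads several nested fields. A minimal sketch of the same check written with optional chaining is shown below, so that a missing or empty body falls through to the error branch instead of throwing. extractCompletion is a hypothetical helper, and the sample objects only mimic the shape the script reads (choices[0].message.content and error.message); they are not real API output.

```javascript
// Hypothetical helper: pull the reply text out of a Chat Completions-style response body.
function extractCompletion(response) {
  const content = response?.choices?.[0]?.message?.content;
  if (typeof content === "string" && content.trim()) {
    return { success: true, result: content.trim() };
  }
  // Fall back to the service's error message when one is present.
  return { success: false, result: response?.error?.message ?? "Unknown error" };
}

// Illustrative inputs only.
console.log(extractCompletion({ choices: [{ message: { content: " 你好。 " } }] }));
console.log(extractCompletion({ error: { message: "Invalid API key" } }));
console.log(extractCompletion(null));
```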

Customized Usage

Skip items in specific languages

This script only processes PDF documents under items whose language field is zh, zh-CN, or zh_CN, or that have no language set.

To process items in more languages, add them to const formatLang = ["", "zh", "zh-CN", "zh_CN"];.
To process items in any language, comment out the line if (!formatLang.includes(lang)) return "[Action: Format Chinese Annotations] Skip due to language"; by adding two slashes // in front of it.
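
If listing every variant becomes tedious, one possible alternative is to match on a language prefix instead of an exact list. The sketch below is an assumption for illustration, not part of the original script; shouldFormat and its prefixes parameter are hypothetical names.

```javascript
// Hypothetical replacement for the formatLang.includes(lang) check.
function shouldFormat(lang, prefixes = ["zh"]) {
  if (!lang) return true; // items without a language are processed, as in the original list
  const normalized = lang.trim().toLowerCase().replace(/_/g, "-");
  return prefixes.some((p) => normalized === p || normalized.startsWith(p + "-"));
}

console.log(shouldFormat("zh_CN")); // true
console.log(shouldFormat("zh-TW")); // true
console.log(shouldFormat("en"));    // false
console.log(shouldFormat(""));      // true
```

In the script, the check would then read if (!shouldFormat(lang)) return "...";. Note that this also accepts tags such as zh-TW, which the original exact list does not.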

Turn off alert pop-ups

To turn off a particular pop-up, delete the double-quoted string after the corresponding return in the code.
For example, to turn off the alert shown when an item is skipped because of its language, change return "[Action: Format Chinese Annotations] Skip due to language"; to return;.
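
To toggle all pop-ups at once rather than editing each return, one option is a single switch. This is only a sketch: SHOW_MESSAGES and message are hypothetical names that do not exist in the script, and it relies on the behavior described above, namely that a bare return (i.e. undefined) shows no pop-up.

```javascript
// Hypothetical switch, not part of the original script.
const SHOW_MESSAGES = false;

// Returns the text when messages are enabled, otherwise undefined (same as a bare `return;`).
function message(text) {
  return SHOW_MESSAGES ? text : undefined;
}

// In the script, a line such as
//   return "[Action: Format Chinese Annotations] Skip due to language";
// would become
//   return message("[Action: Format Chinese Annotations] Skip due to language");
console.log(message("[Action: Format Chinese Annotations] Skip due to language")); // undefined
```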

Acknowledgements

This script is mainly based on #107 and #220, and most of the code was written with the help of gpt-4o. Thanks again to the authors of the original scripts and to GPT for its strong support!

0 replies

yzy1228682367 · 2024-02-26T06:27:49Z

yzy1228682367
Feb 26, 2024
Author

Thank you so much! This is amazing!

0 replies

yslemmo · 2024-04-20T12:03:00Z

yslemmo
Apr 20, 2024

Thanks, this is great. Finally a convenient way to solve the spaces problem.

3 replies

pengxy0 May 11, 2024

Hi, how exactly do I use this? Nothing changes on my end. Could you take a look? Here are my settings and the steps I followed.

wakewon May 11, 2024

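This script (automatically) fixes the extra spaces in the text recorded by a highlight when you select text in the PDF reader and create the highlight (Create Annotation); see the two screenshots in my earlier reply. It is not triggered when selected text is added directly to an item note (adding an item note from existing annotations). You could try modifying the code to see whether it can fit your workflow.

From the screenshot it is hard to tell what exactly the missed "space" is. Could you upload a sample PDF so it can be tested?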

pengxy0 May 11, 2024

I understand what the script does now. It looks like the item notes made for every PDF contain these spaces. I don't know why, haha.

QingDi0817 · 2024-09-13T06:31:08Z

QingDi0817
Sep 13, 2024

Hello, how do I open this edit-action window?

2 replies

wakewon Sep 13, 2024

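This is a custom script for the zotero-actions-tags plugin. You need to install the plugin first, then add an action by following the tutorial on the plugin's home page.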

QingDi0817 Sep 14, 2024

OK, thanks!

Sinoftj · 2024-12-09T15:16:38Z

Sinoftj
Dec 9, 2024

I don't know why, but no matter which "Event" I select, the script never runs automatically; I can only trigger it manually.

0 replies

rong-fei · 2025-02-19T05:06:24Z

a641324093 · 2025-07-16T09:01:31Z

a641324093
Jul 16, 2025

Thanks a lot!!

0 replies

Replies: 7 comments · 11 replies