Skip to content

Commit 67e542d

Browse files
JohnMcLearclaude
andauthored
fix(editor): preserve U+00A0 non-breaking space (#3037) (#7585)
* fix(editor): preserve U+00A0 non-breaking space (#3037) Non-breaking spaces were silently normalized to regular spaces at every ingestion point, so typed/pasted/imported nbsps never reached the changeset and users could not glue words against line-wrap in French or other languages that require nbsp typography. Removed the four strip sites that replaced U+00A0 with U+0020: - src/node/db/Pad.ts cleanText - src/static/js/contentcollector.ts textify - src/static/js/ace2_inner.ts textify - src/static/js/ace2_inner.ts importText raw-text guard Updated both processSpaces functions (domline and ExportHtml) to tokenize U+00A0 as a separate unit, emit it verbatim as &nbsp;, and treat it as content (not whitespace) for the run-collapse bookkeeping so adjacent regular-space runs aren't miscounted. Added backend round-trip tests for spliceText and setText, and extended the cleanText case table. Updated the existing contentcollector and importexport specs whose expectations encoded the previous buggy behavior; they now assert genuine nbsp preservation. Verified manually in Firefox: clipboard U+00A0 → paste → pad → getText returns c2 a0; getHTML emits `100&nbsp;km`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(contentcollector): collapse display-artifact nbsp runs on DOM read-back processSpaces is a lossy one-way display transform: leading/trailing spaces and all-but-the-last of a run get rendered as &nbsp; so HTML doesn't collapse them. When incorporateUserChanges reads text back from the DOM, those display-artifact nbsps were being stored in the changeset model instead of being normalized back to plain spaces. This broke handleReturnIndentation, whose /^ *(?:)/ regex only matches ASCII spaces: auto-indent after `foo:\n` produced 4 spaces instead of the expected prev-indent (2) + THE_TAB (4) = 6, because the previous line's model had nbsps where it used to have spaces. Fix: in contentcollector.textify, collapse any [  ]+ run back to plain spaces UNLESS the run is pure U+00A0 AND strictly interior to word chars. That preserves user-intended typographic nbsps like "100 km" while undoing the one-way display transform. Updated 7 contentcollector tests and 7 importexport tests whose assertions needed to reflect the new rule (boundary/mixed runs collapse; pure-interior nbsp runs preserve). Fixes the Playwright regression in indentation.spec.ts:117 that the previous commit introduced. * fix(contentcollector): canonicalize nbsp runs at line assembly, not per text node Addresses Qodo code review feedback on PR #7585. ## Bug fix — nbsp lost at DOM text-node boundary The previous approach ran the "collapse display-artifact nbsp" rule inside textify(), which is called per individual DOM TEXT_NODE. A user-intended nbsp sitting at a text-node boundary (e.g., <span>100</span><span>&nbsp;km </span>) was incorrectly seen as non-interior (before === '' for the second text node) and normalized back to a regular space. Fix: move the canonicalization out of textify() and run it on each fully assembled line string inside cc.finish(). The rule remains: [ ]+ run -> plain spaces UNLESS pure U+00A0 AND strictly interior to non-ws chars It is length-preserving, so attribute offsets and line lengths are unaffected. Added a regression test (contentcollector.spec.ts) for the cross-span case. ## Docs concern Reverted the type-only addition of spliceText to PadType. spliceText is an existing Pad runtime method; the backend test now uses a cast (`(pad as any).spliceText`) so the PR does not expand the declared public type surface, avoiding a separate documentation requirement. --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent de5feb2 commit 67e542d

8 files changed

Lines changed: 85 additions & 17 deletions

File tree

src/node/db/Pad.ts

Lines changed: 1 addition & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -51,8 +51,7 @@ type PadSettings = {
5151
*/
5252
exports.cleanText = (txt:string): string => txt.replace(/\r\n/g, '\n')
5353
.replace(/\r/g, '\n')
54-
.replace(/\t/g, ' ')
55-
.replace(/\xa0/g, ' ');
54+
.replace(/\t/g, ' ');
5655

5756
class Pad {
5857
private db: Database;

src/node/utils/ExportHtml.ts

Lines changed: 9 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -537,13 +537,16 @@ const _processSpaces = (s: string) => {
537537
const doesWrap = true;
538538
if (s.indexOf('<') < 0 && !doesWrap) {
539539
// short-cut
540-
return s.replace(/ /g, '&nbsp;');
540+
return s.replace(/[ \u00a0]/g, '&nbsp;');
541541
}
542-
const parts = [];
543-
s.replace(/<[^>]*>?| |[^ <]+/g, (m) => {
542+
const parts: string[] = [];
543+
s.replace(/<[^>]*>?|[ \u00a0]|[^ \u00a0<]+/g, (m) => {
544544
parts.push(m);
545545
return m
546546
});
547+
// U+00A0 is content for run-bookkeeping - it terminates a space run
548+
// just like a word character would, so runs of regular spaces adjacent
549+
// to a nbsp are not miscounted as one long run (issue #3037).
547550
if (doesWrap) {
548551
let endOfLine = true;
549552
let beforeSpace = false;
@@ -578,6 +581,9 @@ const _processSpaces = (s: string) => {
578581
}
579582
}
580583
}
584+
for (let i = 0; i < parts.length; i++) {
585+
if (parts[i] === '\u00a0') parts[i] = '&nbsp;';
586+
}
581587
return parts.join('');
582588
};
583589

src/static/js/ace2_inner.ts

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -476,8 +476,8 @@ function Ace2Inner(editorInfo, cssManagers) {
476476
if (text.charAt(text.length - 1) !== '\n') {
477477
throw new Error('new raw text must end with newline');
478478
}
479-
if (/[\r\t\xa0]/.exec(text)) {
480-
throw new Error('new raw text must not contain CR, tab, or nbsp');
479+
if (/[\r\t]/.exec(text)) {
480+
throw new Error('new raw text must not contain CR or tab');
481481
}
482482
lines = text.substring(0, text.length - 1).split('\n');
483483
} else {
@@ -2042,7 +2042,7 @@ function Ace2Inner(editorInfo, cssManagers) {
20422042
(nonEmpty) => domline.createDomLine(nonEmpty, doesWrap, browser, document);
20432043

20442044
const textify =
2045-
(str) => str.replace(/[\n\r ]/g, ' ').replace(/\xa0/g, ' ').replace(/\t/g, ' ');
2045+
(str) => str.replace(/[\n\r ]/g, ' ').replace(/\t/g, ' ');
20462046

20472047
const _blockElems = {
20482048
div: 1,

src/static/js/contentcollector.ts

Lines changed: 29 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -79,9 +79,31 @@ const makeContentCollector = (collectStyles, abrowser, apool, className2Author)
7979
const textify = (str) => sanitizeUnicode(
8080
str.replace(/(\n | \n)/g, ' ')
8181
.replace(/[\n\r ]/g, ' ')
82-
.replace(/\xa0/g, ' ')
8382
.replace(/\t/g, ' '));
8483

84+
// processSpaces (domline.ts, ExportHtml.ts) is a lossy one-way display
85+
// transform: leading/trailing spaces and all-but-the-last space in a run
86+
// are rendered as &nbsp; to defeat HTML whitespace collapsing, so any
87+
// round-trip through the DOM sees nbsps where the model has plain spaces.
88+
// To keep the model canonical, a [ \u00a0]+ run read back from the fully
89+
// assembled line is collapsed to plain spaces unless it is strictly
90+
// interior to non-whitespace chars AND contains only U+00A0 (issue #3037 -
91+
// user-intended nbsp between words such as "100 km"). The rule runs on
92+
// the whole line (not per DOM text node) so that nbsps sitting at a
93+
// span boundary - e.g. <span>100</span><span>&nbsp;km</span> - are still
94+
// seen as interior. The transform is length-preserving so attribution
95+
// offsets stay consistent.
96+
const canonicalizeNbspRuns = (s: string) => s.replace(
97+
/[ \u00a0]+/g,
98+
(run: string, offset: number, src: string) => {
99+
const before = offset > 0 ? src[offset - 1] : '';
100+
const after = offset + run.length < src.length
101+
? src[offset + run.length] : '';
102+
const pureNbsp = !run.includes(' ');
103+
const interiorOfWord = /\S/.test(before) && /\S/.test(after);
104+
return pureNbsp && interiorOfWord ? run : ' '.repeat(run.length);
105+
});
106+
85107
const getAssoc = (node, name) => node[`_magicdom_${name}`];
86108

87109
const lines = (() => {
@@ -642,6 +664,12 @@ const makeContentCollector = (collectStyles, abrowser, apool, className2Author)
642664
lineStrings.length--;
643665
lineAttribs.length--;
644666

667+
// Canonicalize display-artifact nbsps on the whole-line string (issue
668+
// #3037). Length-preserving, so no attribute adjustments are required.
669+
for (let i = 0; i < lineStrings.length; i++) {
670+
lineStrings[i] = canonicalizeNbspRuns(lineStrings[i]);
671+
}
672+
645673
const ss = getSelectionStart();
646674
const se = getSelectionEnd();
647675

src/static/js/domline.ts

Lines changed: 8 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -231,12 +231,15 @@ domline.createDomLine = (nonEmpty, doesWrap, optBrowser, optDocument) => {
231231
domline.processSpaces = (s, doesWrap) => {
232232
if (s.indexOf('<') < 0 && !doesWrap) {
233233
// short-cut
234-
return s.replace(/ /g, '&nbsp;');
234+
return s.replace(/[ \u00a0]/g, '&nbsp;');
235235
}
236236
const parts = [];
237-
s.replace(/<[^>]*>?| |[^ <]+/g, (m) => {
237+
s.replace(/<[^>]*>?|[ \u00a0]|[^ \u00a0<]+/g, (m) => {
238238
parts.push(m);
239239
});
240+
// U+00A0 is content for run-bookkeeping - it terminates a space run
241+
// just like a word character would, so runs of regular spaces adjacent
242+
// to a nbsp are not miscounted as one long run (issue #3037).
240243
if (doesWrap) {
241244
let endOfLine = true;
242245
let beforeSpace = false;
@@ -271,6 +274,9 @@ domline.processSpaces = (s, doesWrap) => {
271274
}
272275
}
273276
}
277+
for (let i = 0; i < parts.length; i++) {
278+
if (parts[i] === '\u00a0') parts[i] = '&nbsp;';
279+
}
274280
return parts.join('');
275281
};
276282

src/tests/backend/specs/Pad.ts

Lines changed: 19 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -45,6 +45,9 @@ describe(__filename, function () {
4545
['x\ry\n', 'x\ny\n'],
4646
['x\r\ny\n', 'x\ny\n'],
4747
['x\r\r\ny\n', 'x\n\ny\n'],
48+
// Non-breaking space (U+00A0) must survive cleanText (issue #3037).
49+
['100\u00a0km\n', '100\u00a0km\n'],
50+
['a\u00a0\u00a0b\n', 'a\u00a0\u00a0b\n'],
4851
];
4952
for (const [input, want] of testCases) {
5053
it(`${JSON.stringify(input)} -> ${JSON.stringify(want)}`, async function () {
@@ -53,6 +56,22 @@ describe(__filename, function () {
5356
}
5457
});
5558

59+
describe('non-breaking space preservation (issue #3037)', function () {
60+
it('spliceText round-trips U+00A0', async function () {
61+
pad = await padManager.getPad(padId, '');
62+
// spliceText is an existing runtime Pad method; cast avoids
63+
// adding a type-only declaration to PadType in this PR.
64+
await (pad as any).spliceText(0, 0, '100\u00a0km');
65+
assert.equal(pad!.text(), '100\u00a0km\n');
66+
});
67+
68+
it('setText round-trips U+00A0', async function () {
69+
pad = await padManager.getPad(padId, '');
70+
await pad!.setText('a\u00a0b\n');
71+
assert.equal(pad!.text(), 'a\u00a0b\n');
72+
});
73+
});
74+
5675
describe('padDefaultContent hook', function () {
5776
it('runs when a pad is created without specific text', async function () {
5877
const p = new Promise<void>((resolve) => {

src/tests/backend/specs/api/importexport.ts

Lines changed: 4 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -79,8 +79,8 @@ const testImports:MapArrayType<any> = {
7979
// XXX the HTML between "than" and "one" looks strange
8080
description: 'non-breaking space should be preserved, but can be replaced when it',
8181
input: '<html><body>Text&nbsp;with&nbsp; more&nbsp;&nbsp;&nbsp;than &nbsp;one space.<br></body></html>',
82-
wantHTML: '<!DOCTYPE HTML><html><body>Text with&nbsp; more&nbsp;&nbsp; than&nbsp; one space.<br><br></body></html>',
83-
wantText: 'Text with more than one space.\n\n',
82+
wantHTML: '<!DOCTYPE HTML><html><body>Text&nbsp;with&nbsp; more&nbsp;&nbsp;&nbsp;than&nbsp; one space.<br><br></body></html>',
83+
wantText: 'Text\u00a0with more\u00a0\u00a0\u00a0than one space.\n\n',
8484
},
8585
'multiplenbsp': {
8686
description: 'Multiple non-breaking space should be preserved',
@@ -91,8 +91,8 @@ const testImports:MapArrayType<any> = {
9191
'multipleNonBreakingSpaceBetweenWords': {
9292
description: 'A normal space is always inserted before a word',
9393
input: '<html><body>&nbsp;&nbsp;word1&nbsp;&nbsp;word2&nbsp;&nbsp;&nbsp;word3<br></body></html>',
94-
wantHTML: '<!DOCTYPE HTML><html><body>&nbsp; word1&nbsp; word2&nbsp;&nbsp; word3<br><br></body></html>',
95-
wantText: ' word1 word2 word3\n\n',
94+
wantHTML: '<!DOCTYPE HTML><html><body>&nbsp; word1&nbsp;&nbsp;word2&nbsp;&nbsp;&nbsp;word3<br><br></body></html>',
95+
wantText: ' word1\u00a0\u00a0word2\u00a0\u00a0\u00a0word3\n\n',
9696
},
9797
'nonBreakingSpacePreceededBySpaceBetweenWords': {
9898
description: 'A non-breaking space preceded by a normal space',

src/tests/backend/specs/contentcollector.ts

Lines changed: 12 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -185,7 +185,7 @@ const testCases = [
185185
description: 'non-breaking and normal space should be preserved',
186186
html: '<html><body>Text&nbsp;with&nbsp; more&nbsp;&nbsp;&nbsp;than &nbsp;one space.<br></body></html>',
187187
wantAlines: ['+10'],
188-
wantText: ['Text with more than one space.'],
188+
wantText: ['Text\u00a0with more\u00a0\u00a0\u00a0than one space.'],
189189
},
190190
{
191191
description: 'Multiple nbsp should be preserved',
@@ -197,7 +197,7 @@ const testCases = [
197197
description: 'Multiple nbsp between words ',
198198
html: '<html><body>&nbsp;&nbsp;word1&nbsp;&nbsp;word2&nbsp;&nbsp;&nbsp;word3<br></body></html>',
199199
wantAlines: ['+m'],
200-
wantText: [' word1 word2 word3'],
200+
wantText: [' word1\u00a0\u00a0word2\u00a0\u00a0\u00a0word3'],
201201
},
202202
{
203203
description: 'A non-breaking space preceded by a normal space',
@@ -359,6 +359,16 @@ pre
359359
],
360360
wantText: ['before', '*bare item', 'after'],
361361
},
362+
{
363+
// Regression for PR #7585 review feedback: a user-intended nbsp
364+
// sitting at a DOM text-node boundary (split across spans) must
365+
// still be preserved because canonicalization runs on the whole
366+
// line, not per text node.
367+
description: 'nbsp preserved across span boundary',
368+
html: '<html><body><p><span>100</span><span>&nbsp;km</span></p></body></html>',
369+
wantAlines: ['+6'],
370+
wantText: ['100\u00a0km'],
371+
},
362372
];
363373

364374
describe(__filename, function () {

0 commit comments

Comments
 (0)