@MoonL79 MoonL79 commented Jan 22, 2026

No description provided.

Copilot AI left a comment

Pull request overview

This PR enhances HTML format detection in TeXmacs by implementing multiple statistical and heuristic methods to identify HTML content. The changes add sophisticated detection capabilities beyond simple tag matching.

Changes:

  • Implemented density-based detection algorithms (angle brackets, HTML tags, and attributes)
  • Added line-by-line HTML feature detection with configurable thresholds
  • Added div tag balance checking and short text detection logic
  • Extended html-recognizes-at? function with comprehensive tag checks and statistical fallbacks
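
The PR's own div balance check is not quoted in this review, so the following is only a minimal sketch of what such a check could look like. The names `count-occurrences` and `div-balanced-sketch?` are hypothetical, and the rule used here (at least one `<div`, with equal numbers of opening and closing tags) is an assumption rather than the PR's actual logic.

```scheme
;; Hypothetical helper: count non-overlapping occurrences of pat in s.
;; pat is assumed to be non-empty.
(define (count-occurrences s pat)
  (let ((slen (string-length s))
        (plen (string-length pat)))
    (let loop ((i 0) (n 0))
      (cond
        ((> (+ i plen) slen) n)
        ((string=? (substring s i (+ i plen)) pat)
         (loop (+ i plen) (+ n 1)))
        (else (loop (+ i 1) n))))))

;; Assumed rule: the text counts as balanced when it contains at least
;; one "<div" and exactly as many "</div" as "<div".
;; The input is assumed to be lowercased already.
(define (div-balanced-sketch? s)
  (let ((opens (count-occurrences s "<div"))
        (closes (count-occurrences s "</div")))
    (and (> opens 0) (= opens closes))))

(div-balanced-sketch? "<div><div>a</div></div>") ; two opens, two closes
(div-balanced-sketch? "<div>a")                  ; one open, no close
```

A check like this only compares counts; it does not verify nesting order, which may be acceptable for a detection heuristic.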

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 16 comments.

| File | Description |
| --- | --- |
| devel/222_34.md | Documentation describing the enhanced HTML detection features and testing instructions |
| TeXmacs/tests/222_34.scm | Comprehensive test suite with 15 HTML test cases and 16 non-HTML test cases covering various edge cases |
| TeXmacs/plugins/html/progs/data/html.scm | Implementation of enhanced HTML detection algorithms including density calculations and the updated recognition function |

Comment on lines +70 to +72
(/ (+ (charactor-from-string substr #\<)
(charactor-from-string substr #\>))
len))))
Copilot AI Jan 23, 2026

Logic error in density calculation. The function sums the densities returned by charactor-from-string (which are already ratios count/len) and then divides by len again. This double division produces incorrect results. The correct approach is to sum the character counts and then divide once by the total length, or use the already-computed density values without further division.

Suggested change
(/ (+ (charactor-from-string substr #\<)
(charactor-from-string substr #\>))
len))))
Replace with:
(+ (charactor-from-string substr #\<)
(charactor-from-string substr #\>)))))
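
The other option named in the comment, summing raw character counts and dividing once, could look like the sketch below. `count-char` and `angle-bracket-density-sketch` are hypothetical names introduced for illustration; they are not functions from the PR.

```scheme
;; Hypothetical helper: number of times character c occurs in string s.
(define (count-char s c)
  (let loop ((i 0) (n 0))
    (if (>= i (string-length s))
        n
        (loop (+ i 1)
              (if (char=? (string-ref s i) c) (+ n 1) n)))))

;; Angle-bracket density: add the raw counts first, then divide once
;; by the length of the string that was scanned.
(define (angle-bracket-density-sketch s)
  (let ((len (string-length s)))
    (if (= len 0)
        0
        (/ (+ (count-char s #\<) (count-char s #\>))
           len))))

(angle-bracket-density-sketch "<b>hi</b>") ; 4 brackets in 9 characters
```

Because the helper returns a count rather than a ratio, there is exactly one division in the whole computation.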

Comment on lines +126 to +128
(substr (substring s 0 limit)))
(/ (+ (charactor-from-string substr #\=)
(charactor-from-string substr #\"))
Copilot AI Jan 23, 2026

Logic error in density calculation. The function sums the densities returned by charactor-from-string (which are already ratios count/len) and then divides by len again. This double division produces incorrect results. The correct approach is to sum the character counts and then divide once by the total length, or use the already-computed density values without further division.
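
To see how large the error is, here is a small worked example with made-up numbers. It assumes, as the comment states, that the per-character helper already returns `count/len`; the sample length and counts are purely illustrative.

```scheme
;; Assumed numbers for illustration: a 200-character sample containing
;; 6 '=' characters and 12 '"' characters.
(define len 200)
(define eq-density (/ 6 len))     ; what a density-returning helper would give
(define quote-density (/ 12 len))

;; Intended result: the combined density of both characters (18/200).
(define intended (+ eq-density quote-density))

;; Result when the summed densities are divided by len a second time.
(define double-divided (/ (+ eq-density quote-density) len))

;; The second division shrinks the value by a factor of len.
(/ intended double-divided) ; 200
```

Any threshold tuned for the intended density would therefore almost never be reached by the double-divided value.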

(determine-short-html-string s)
#f))

(define (is-html-string? s)
Copilot AI Jan 23, 2026

Inconsistent indentation. This line has a leading space before the opening parenthesis, while all other function definitions in the file start at column 1. Remove the leading space for consistency.

Suggested change
 (define (is-html-string? s)
Replace with:
(define (is-html-string? s)

Comment on lines +81 to +117
(let ((count (+ (html-string-count-substring lc-substr "<div")
(html-string-count-substring lc-substr "<span")
(html-string-count-substring lc-substr "<p")
(html-string-count-substring lc-substr "<a")
(html-string-count-substring lc-substr "<img")
(html-string-count-substring lc-substr "<ul")
(html-string-count-substring lc-substr "<ol")
(html-string-count-substring lc-substr "<li")
(html-string-count-substring lc-substr "<table")
(html-string-count-substring lc-substr "<tr")
(html-string-count-substring lc-substr "<td")
(html-string-count-substring lc-substr "<th")
(html-string-count-substring lc-substr "<h1")
(html-string-count-substring lc-substr "<h2")
(html-string-count-substring lc-substr "<h3")
(html-string-count-substring lc-substr "<h4")
(html-string-count-substring lc-substr "<h5")
(html-string-count-substring lc-substr "<h6")
(html-string-count-substring lc-substr "<form")
(html-string-count-substring lc-substr "<input")
(html-string-count-substring lc-substr "<button")
(html-string-count-substring lc-substr "<textarea")
(html-string-count-substring lc-substr "<select")
(html-string-count-substring lc-substr "<option")
(html-string-count-substring lc-substr "<style")
(html-string-count-substring lc-substr "<script")
(html-string-count-substring lc-substr "<meta")
(html-string-count-substring lc-substr "<link")
(html-string-count-substring lc-substr "</div")
(html-string-count-substring lc-substr "</ul")
(html-string-count-substring lc-substr "</ol")
(html-string-count-substring lc-substr "</table")
(html-string-count-substring lc-substr "</tr")
(html-string-count-substring lc-substr "</form")
(html-string-count-substring lc-substr "</style")
(html-string-count-substring lc-substr "</script"))))
(/ count len)))))
Copilot AI Jan 23, 2026

Performance concern: This function performs multiple linear scans of the same string, calling html-string-count-substring 36 times (once per tag prefix). Each call scans the entire substring. For better performance, consider combining these checks into a single pass through the string, using a state machine or regex pattern matching to identify all tag types in one scan.

Suggested change
(let ((count (+ (html-string-count-substring lc-substr "<div")
(html-string-count-substring lc-substr "<span")
(html-string-count-substring lc-substr "<p")
(html-string-count-substring lc-substr "<a")
(html-string-count-substring lc-substr "<img")
(html-string-count-substring lc-substr "<ul")
(html-string-count-substring lc-substr "<ol")
(html-string-count-substring lc-substr "<li")
(html-string-count-substring lc-substr "<table")
(html-string-count-substring lc-substr "<tr")
(html-string-count-substring lc-substr "<td")
(html-string-count-substring lc-substr "<th")
(html-string-count-substring lc-substr "<h1")
(html-string-count-substring lc-substr "<h2")
(html-string-count-substring lc-substr "<h3")
(html-string-count-substring lc-substr "<h4")
(html-string-count-substring lc-substr "<h5")
(html-string-count-substring lc-substr "<h6")
(html-string-count-substring lc-substr "<form")
(html-string-count-substring lc-substr "<input")
(html-string-count-substring lc-substr "<button")
(html-string-count-substring lc-substr "<textarea")
(html-string-count-substring lc-substr "<select")
(html-string-count-substring lc-substr "<option")
(html-string-count-substring lc-substr "<style")
(html-string-count-substring lc-substr "<script")
(html-string-count-substring lc-substr "<meta")
(html-string-count-substring lc-substr "<link")
(html-string-count-substring lc-substr "</div")
(html-string-count-substring lc-substr "</ul")
(html-string-count-substring lc-substr "</ol")
(html-string-count-substring lc-substr "</table")
(html-string-count-substring lc-substr "</tr")
(html-string-count-substring lc-substr "</form")
(html-string-count-substring lc-substr "</style")
(html-string-count-substring lc-substr "</script"))))
(/ count len)))))
Replace with:
(letrec* ((string-prefix-at?
           (lambda (s prefix idx)
             (let* ((s-len (string-length s))
                    (p-len (string-length prefix)))
               (if (> (+ idx p-len) s-len)
                   #f
                   (let loop ((j 0))
                     (if (= j p-len)
                         #t
                         (if (char=? (string-ref s (+ idx j))
                                     (string-ref prefix j))
                             (loop (+ j 1))
                             #f)))))))
          (tags '("<div" "<span" "<p" "<a" "<img" "<ul" "<ol" "<li"
                  "<table" "<tr" "<td" "<th"
                  "<h1" "<h2" "<h3" "<h4" "<h5" "<h6"
                  "<form" "<input" "<button" "<textarea" "<select" "<option"
                  "<style" "<script" "<meta" "<link"
                  "</div" "</ul" "</ol" "</table" "</tr" "</form"
                  "</style" "</script"))
          (substr-len (string-length lc-substr)))
  ;; Walk the string once; at each index, test every tag prefix.
  (let loop ((i 0) (count 0))
    (if (>= i substr-len)
        (/ count len)
        (let ((new-count
               (let tag-loop ((ts tags) (c count))
                 (if (null? ts)
                     c
                     (if (string-prefix-at? lc-substr (car ts) i)
                         (tag-loop (cdr ts) (+ c 1))
                         (tag-loop (cdr ts) c))))))
          (loop (+ i 1) new-count)))))
  ))) ; closes the enclosing let*, if, and define, which open above the commented lines
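
The "state machine" idea mentioned in the comment could also be sketched as reading the tag name once at each `<` and looking it up, instead of testing every prefix at every index. This is only an illustration under stated assumptions: `known-tag-names`, `tag-name-char?`, and `count-known-tags` are hypothetical names, the tag list is shortened, and the input is assumed to be lowercased already.

```scheme
;; Hypothetical, shortened set of tag names treated as HTML evidence.
(define known-tag-names
  '("div" "span" "p" "a" "img" "ul" "ol" "li" "table"))

;; Can c appear in a tag name? (letter or digit)
(define (tag-name-char? c)
  (or (char-alphabetic? c) (char-numeric? c)))

;; Count known tags in s. At every '<' (optionally followed by '/'),
;; read the whole tag name, look it up, and resume after the name,
;; so no character is examined by more than one tag-name scan.
(define (count-known-tags s)
  (let ((len (string-length s)))
    (let loop ((i 0) (count 0))
      (cond
        ((>= i len) count)
        ((char=? (string-ref s i) #\<)
         (let* ((start (if (and (< (+ i 1) len)
                                (char=? (string-ref s (+ i 1)) #\/))
                           (+ i 2)
                           (+ i 1)))
                (end (let scan ((j start))
                       (if (and (< j len)
                                (tag-name-char? (string-ref s j)))
                           (scan (+ j 1))
                           j)))
                (name (substring s start end)))
           (loop end
                 (if (member name known-tag-names)
                     (+ count 1)
                     count))))
        (else (loop (+ i 1) count))))))

(count-known-tags "<div><p>x</p></div><pre>") ; div, p, /p, /div match; pre does not
```

Note that this matches whole tag names, so it would not count `<pre` as `<p` the way prefix matching does, and it counts closing tags for every listed name. Adopting it would therefore change the measured densities and might require re-tuning any thresholds.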

(check (html-recognizes-at? non-html-text15 0) => #f)
(check (html-recognizes-at? non-html-text16 0) => #f))

(tm-define (test_222_33)
Copilot AI Jan 23, 2026

The function name is inconsistent with the test ID. The function is named "test_222_33" but the file and module are for "222_34". This should be "test_222_34" to match the correct test identifier.

Suggested change
(tm-define (test_222_33)
Replace with:
(tm-define (test_222_34)

Comment on lines +69 to +71
(/ (+ (charactor-from-string substr #\<)
(charactor-from-string substr #\>))
len))))
Copilot AI Jan 23, 2026

Logic error in density calculation. The function sums the densities returned by charactor-from-string (which are already ratios count/len) and then divides by len again. This double division produces incorrect results. The correct approach is to sum the character counts and then divide once by the total length, or use the already-computed density values without further division.

Suggested change
(/ (+ (charactor-from-string substr #\<)
(charactor-from-string substr #\>))
len))))
Replace with:
(+ (charactor-from-string substr #\<)
(charactor-from-string substr #\>)))))

Comment on lines +126 to +128
(/ (+ (charactor-from-string substr #\=)
(charactor-from-string substr #\"))
len))))
Copilot AI Jan 23, 2026

Logic error in density calculation. The function sums the densities returned by charactor-from-string (which are already ratios count/len) and then divides by len again. This double division produces incorrect results. The correct approach is to sum the character counts and then divide once by the total length, or use the already-computed density values without further division.

Comment on lines +75 to +118
(define (html-tag-density s)
(if (string-null? s)
0
(let* ((len (string-length s))
(limit (if (>= len 1000) 1000 len))
(substr (substring s 0 limit))
(lc-substr (string-downcase substr)))
(let ((count (+ (html-string-count-substring lc-substr "<div")
(html-string-count-substring lc-substr "<span")
(html-string-count-substring lc-substr "<p")
(html-string-count-substring lc-substr "<a")
(html-string-count-substring lc-substr "<img")
(html-string-count-substring lc-substr "<ul")
(html-string-count-substring lc-substr "<ol")
(html-string-count-substring lc-substr "<li")
(html-string-count-substring lc-substr "<table")
(html-string-count-substring lc-substr "<tr")
(html-string-count-substring lc-substr "<td")
(html-string-count-substring lc-substr "<th")
(html-string-count-substring lc-substr "<h1")
(html-string-count-substring lc-substr "<h2")
(html-string-count-substring lc-substr "<h3")
(html-string-count-substring lc-substr "<h4")
(html-string-count-substring lc-substr "<h5")
(html-string-count-substring lc-substr "<h6")
(html-string-count-substring lc-substr "<form")
(html-string-count-substring lc-substr "<input")
(html-string-count-substring lc-substr "<button")
(html-string-count-substring lc-substr "<textarea")
(html-string-count-substring lc-substr "<select")
(html-string-count-substring lc-substr "<option")
(html-string-count-substring lc-substr "<style")
(html-string-count-substring lc-substr "<script")
(html-string-count-substring lc-substr "<meta")
(html-string-count-substring lc-substr "<link")
(html-string-count-substring lc-substr "</div")
(html-string-count-substring lc-substr "</ul")
(html-string-count-substring lc-substr "</ol")
(html-string-count-substring lc-substr "</table")
(html-string-count-substring lc-substr "</tr")
(html-string-count-substring lc-substr "</form")
(html-string-count-substring lc-substr "</style")
(html-string-count-substring lc-substr "</script"))))
(/ count len)))))
Copilot AI Jan 23, 2026

Performance concern: This function performs multiple linear scans of the same string, calling html-string-count-substring 36 times (once per tag prefix). Each call scans the entire substring. For better performance, consider combining these checks into a single pass through the string, using a state machine or regex pattern matching to identify all tag types in one scan.

Suggested change
(define (html-tag-density s)
(if (string-null? s)
0
(let* ((len (string-length s))
(limit (if (>= len 1000) 1000 len))
(substr (substring s 0 limit))
(lc-substr (string-downcase substr)))
(let ((count (+ (html-string-count-substring lc-substr "<div")
(html-string-count-substring lc-substr "<span")
(html-string-count-substring lc-substr "<p")
(html-string-count-substring lc-substr "<a")
(html-string-count-substring lc-substr "<img")
(html-string-count-substring lc-substr "<ul")
(html-string-count-substring lc-substr "<ol")
(html-string-count-substring lc-substr "<li")
(html-string-count-substring lc-substr "<table")
(html-string-count-substring lc-substr "<tr")
(html-string-count-substring lc-substr "<td")
(html-string-count-substring lc-substr "<th")
(html-string-count-substring lc-substr "<h1")
(html-string-count-substring lc-substr "<h2")
(html-string-count-substring lc-substr "<h3")
(html-string-count-substring lc-substr "<h4")
(html-string-count-substring lc-substr "<h5")
(html-string-count-substring lc-substr "<h6")
(html-string-count-substring lc-substr "<form")
(html-string-count-substring lc-substr "<input")
(html-string-count-substring lc-substr "<button")
(html-string-count-substring lc-substr "<textarea")
(html-string-count-substring lc-substr "<select")
(html-string-count-substring lc-substr "<option")
(html-string-count-substring lc-substr "<style")
(html-string-count-substring lc-substr "<script")
(html-string-count-substring lc-substr "<meta")
(html-string-count-substring lc-substr "<link")
(html-string-count-substring lc-substr "</div")
(html-string-count-substring lc-substr "</ul")
(html-string-count-substring lc-substr "</ol")
(html-string-count-substring lc-substr "</table")
(html-string-count-substring lc-substr "</tr")
(html-string-count-substring lc-substr "</form")
(html-string-count-substring lc-substr "</style")
(html-string-count-substring lc-substr "</script"))))
(/ count len)))))
Replace with:
;; Helper: check whether string `s` has prefix `prefix` starting at index `start`.
(define (html-string-prefix-at? s prefix start)
  (let* ((slen (string-length s))
         (plen (string-length prefix)))
    (and (<= (+ start plen) slen)
         (let loop ((i 0))
           (if (= i plen)
               #t
               (and (char=? (string-ref s (+ start i))
                            (string-ref prefix i))
                    (loop (+ i 1))))))))
;; Helper: count all occurrences of any tag prefix in a single pass over `s`.
;; Prefixes are only tested at positions that hold a `<` character.
(define (html-count-tag-prefixes s prefixes)
  (let ((len (string-length s)))
    (let loop ((i 0) (count 0))
      (if (>= i len)
          count
          (let ((new-count
                 (if (char=? (string-ref s i) #\<)
                     (+ count
                        (let scan-prefixes ((ps prefixes) (c 0))
                          (if (null? ps)
                              c
                              (let ((p (car ps)))
                                (if (html-string-prefix-at? s p i)
                                    (scan-prefixes (cdr ps) (+ c 1))
                                    (scan-prefixes (cdr ps) c))))))
                     count)))
            (loop (+ i 1) new-count))))))
(define (html-tag-density s)
  (if (string-null? s)
      0
      (let* ((len (string-length s))
             (limit (if (>= len 1000) 1000 len))
             (substr (substring s 0 limit))
             (lc-substr (string-downcase substr))
             (tag-prefixes
              '("<div" "<span" "<p" "<a" "<img" "<ul" "<ol" "<li"
                "<table" "<tr" "<td" "<th"
                "<h1" "<h2" "<h3" "<h4" "<h5" "<h6"
                "<form" "<input" "<button" "<textarea" "<select" "<option"
                "<style" "<script" "<meta" "<link"
                "</div" "</ul" "</ol" "</table" "</tr" "</form"
                "</style" "</script")))
        (let ((count (html-count-tag-prefixes lc-substr tag-prefixes)))
          (/ count len)))))

Comment on lines +131 to +159
;; Does this line of text contain HTML tags?
(define (html-line-contains-features? line)
(let ((lc-line (string-downcase line)))
(or
(> (html-string-count-substring lc-line "<div") 0)
(> (html-string-count-substring lc-line "<span") 0)
(> (html-string-count-substring lc-line "<p") 0)
(> (html-string-count-substring lc-line "<a") 0)
(> (html-string-count-substring lc-line "<img") 0)
(> (html-string-count-substring lc-line "<ul") 0)
(> (html-string-count-substring lc-line "<ol") 0)
(> (html-string-count-substring lc-line "<li") 0)
(> (html-string-count-substring lc-line "<table") 0)
(> (html-string-count-substring lc-line "<tr") 0)
(> (html-string-count-substring lc-line "<td") 0)
(> (html-string-count-substring lc-line "<th") 0)
(> (html-string-count-substring lc-line "<h1") 0)
(> (html-string-count-substring lc-line "<h2") 0)
(> (html-string-count-substring lc-line "<h3") 0)
(> (html-string-count-substring lc-line "<h4") 0)
(> (html-string-count-substring lc-line "<h5") 0)
(> (html-string-count-substring lc-line "<h6") 0)
(> (html-string-count-substring lc-line "</div") 0)
(> (html-string-count-substring lc-line "</span") 0)
(> (html-string-count-substring lc-line "</p") 0)
(> (html-string-count-substring lc-line "</a") 0)
(> (html-string-count-substring lc-line "/>") 0)
(> (html-string-count-substring lc-line "<!doctype") 0)
(> (html-string-count-substring lc-line "<?xml") 0))))
Copilot AI Jan 23, 2026

Performance concern: This function performs multiple linear scans of the same string, calling html-string-count-substring up to 25 times (the or form stops at the first match). Each call scans the entire line. For better performance, consider combining these checks into a single pass through the line, or using a more efficient pattern matching approach.

Suggested change
;; Does this line of text contain HTML tags?
(define (html-line-contains-features? line)
(let ((lc-line (string-downcase line)))
(or
(> (html-string-count-substring lc-line "<div") 0)
(> (html-string-count-substring lc-line "<span") 0)
(> (html-string-count-substring lc-line "<p") 0)
(> (html-string-count-substring lc-line "<a") 0)
(> (html-string-count-substring lc-line "<img") 0)
(> (html-string-count-substring lc-line "<ul") 0)
(> (html-string-count-substring lc-line "<ol") 0)
(> (html-string-count-substring lc-line "<li") 0)
(> (html-string-count-substring lc-line "<table") 0)
(> (html-string-count-substring lc-line "<tr") 0)
(> (html-string-count-substring lc-line "<td") 0)
(> (html-string-count-substring lc-line "<th") 0)
(> (html-string-count-substring lc-line "<h1") 0)
(> (html-string-count-substring lc-line "<h2") 0)
(> (html-string-count-substring lc-line "<h3") 0)
(> (html-string-count-substring lc-line "<h4") 0)
(> (html-string-count-substring lc-line "<h5") 0)
(> (html-string-count-substring lc-line "<h6") 0)
(> (html-string-count-substring lc-line "</div") 0)
(> (html-string-count-substring lc-line "</span") 0)
(> (html-string-count-substring lc-line "</p") 0)
(> (html-string-count-substring lc-line "</a") 0)
(> (html-string-count-substring lc-line "/>") 0)
(> (html-string-count-substring lc-line "<!doctype") 0)
(> (html-string-count-substring lc-line "<?xml") 0))))
Replace with:
;; List of HTML feature prefixes (all lowercase, to match string-downcase)
(define html-feature-prefixes
  '("<div" "<span" "<p" "<a" "<img" "<ul" "<ol" "<li"
    "<table" "<tr" "<td" "<th"
    "<h1" "<h2" "<h3" "<h4" "<h5" "<h6"
    "</div" "</span" "</p" "</a"
    "/>" "<!doctype" "<?xml"))
;; Check whether string s starts with prefix at position idx
(define (string-starts-with-at? s idx prefix)
  (let* ((s-len (string-length s))
         (p-len (string-length prefix)))
    (if (> (+ idx p-len) s-len)
        #f
        (let loop ((j 0))
          (cond
            ((= j p-len) #t)
            ((char=? (string-ref s (+ idx j))
                     (string-ref prefix j))
             (loop (+ j 1)))
            (else #f))))))
;; Check whether the string contains any of the given prefixes
(define (string-contains-any-prefix? s prefixes)
  (let ((s-len (string-length s)))
    (let loop ((i 0))
      (if (>= i s-len)
          #f
          (let check-prefixes ((ps prefixes))
            (cond
              ((null? ps) (loop (+ i 1)))
              ((string-starts-with-at? s i (car ps)) #t)
              (else (check-prefixes (cdr ps)))))))))
;; Does this line of text contain HTML tags?
(define (html-line-contains-features? line)
  (let ((lc-line (string-downcase line)))
    (string-contains-any-prefix? lc-line html-feature-prefixes)))
