Commit 7192d60

Fix encoding issues with the mbox import
We force-encoded each line to a single-byte encoding (ISO-8859-1 in script/mbox_import.rb), because early emails used either ASCII or base64. This did not work properly with non-English characters and also produced artifacts in the form of incorrect UTF-8 characters. Instead, we now treat the input as UTF-8 and replace invalid byte sequences. Some emails may use another encoding internally; if so, we can fix that later and reimport.

The commit also updates the import so that it can reimport the body of an email that already exists, and adds a second script that imports one single email, for testing import changes on specific emails that show artifacts.
1 parent 333f363 commit 7192d60
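
The artifacts described in the commit message can be reproduced in isolation. The snippet below is a minimal sketch, not code from this repository: it takes the UTF-8 bytes of "café", relabels them as ISO-8859-1 the way the old import line did (the `replace: nil` option is left out here), and compares the result with treating the same bytes as UTF-8. The sample string and variable names are illustrative assumptions.

```ruby
# "é" in UTF-8 is the two bytes 0xC3 0xA9. String#b gives a binary copy,
# similar to bytes read from an old mbox file with no encoding information.
raw = "caf\xC3\xA9".b

# Old approach (sketched): relabel the bytes as ISO-8859-1, then transcode.
# Every byte becomes its own character, so the two bytes of "é" turn into
# the two characters U+00C3 and U+00A9.
old_style = raw.dup.force_encoding("ISO-8859-1").encode("utf-8")

# New approach (sketched): treat the bytes as UTF-8 directly.
new_style = raw.dup.force_encoding("UTF-8")

puts old_style.length  # one character more than the intended text
puts new_style.length
```

The old path never raises, because every byte is a valid ISO-8859-1 character, which is why the damage only became visible as stray characters in stored bodies.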

3 files changed: 72 additions and 6 deletions

app/services/email_ingestor.rb

Lines changed: 6 additions & 5 deletions
@@ -9,16 +9,19 @@ def ingest_raw(raw_message, fallback_threading: false)
     return nil unless message_id
     sent_at = sanitize_email_date(m.date, m[:date], message_id)
 
-    return nil if message_id && Message.find_by_message_id(message_id)
+    body = normalize_body(extract_body(m))
+    existing_message = Message.find_by_message_id(message_id)
+    if existing_message
+      existing_message.update_columns(body: body)
+      return existing_message
+    end
 
     import_log = ''
 
     from = build_from_aliases(m, sent_at)
     to = create_users(m[:to], sent_at)
     cc = create_users(m[:cc], sent_at)
 
-    body = extract_body(m)
-
     subject = m.subject || 'No title'
 
     reply_to_msg, import_log = resolve_threading(m, import_log)
@@ -30,8 +33,6 @@ def ingest_raw(raw_message, fallback_threading: false)
     topic = reply_to_msg ? reply_to_msg.topic : Topic.create!(creator: from[0], title: subject, created_at: sent_at)
     import_log = nil if import_log == ''
 
-    body = normalize_body(body)
-
     msg = Message.create!(
       topic: topic,
       sender: from[0],
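
The change to `ingest_raw` turns a "skip if already imported" check into "refresh the body if already imported". The sketch below shows that control flow without Rails, under stated assumptions: `StoredMessage` and `InMemoryIngestor` are hypothetical stand-ins for the `Message` model and `EmailIngestor`, and a Hash replaces the database lookup.

```ruby
# Hypothetical in-memory stand-in for the Message table, keyed by message id.
StoredMessage = Struct.new(:message_id, :subject, :body)

class InMemoryIngestor
  attr_reader :store

  def initialize
    @store = {}
  end

  # Returns the stored record. On a repeat import only the body is
  # refreshed; all other fields keep their values from the first import.
  def ingest(message_id, subject, body)
    return nil unless message_id

    existing = @store[message_id]
    if existing
      existing.body = body
      return existing
    end

    @store[message_id] = StoredMessage.new(message_id, subject, body)
  end
end

ingestor = InMemoryIngestor.new
first  = ingestor.ingest("<a@example.org>", "Hello", "caf??")
second = ingestor.ingest("<a@example.org>", "Changed subject", "café")
```

In the real code the refresh goes through `update_columns`, which in Rails writes straight to the database and skips validations, callbacks, and the `updated_at` timestamp, so a reimport changes nothing but the body column.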

script/mbox_import.rb

Lines changed: 1 addition & 1 deletion
@@ -116,7 +116,7 @@ def parse_message(message)
   f.each_line do |line|
 
     # Some old lines contain illegal characters
-    line = line.force_encoding("ISO-8859-1").encode("utf-8", replace: nil)
+    line = line.encode("utf-8", :invalid => :replace)
 
     # all new messages refer to lists.postgresql.org, but not old emails
     # And we can't simply check for From, as it also matches inline attachments containing git diffs
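
For experimenting with the per-line cleanup outside the import script, the sketch below uses `String#scrub` (available since Ruby 2.1). This is a swapped-in technique and not what the commit does: the commit calls `String#encode` with `:invalid => :replace`. The helper name `sanitize_line` and the sample lines are assumptions made for illustration.

```ruby
# Hypothetical helper: label the line as UTF-8 and replace every invalid
# byte sequence with a marker (U+FFFD, the Unicode replacement character,
# unless another marker is given).
def sanitize_line(line, marker = "\uFFFD")
  line.dup.force_encoding("UTF-8").scrub(marker)
end

# A valid UTF-8 line passes through unchanged.
valid = sanitize_line("na\u00EFve\n")

# 0xE9 is "é" in ISO-8859-1 but an incomplete sequence in UTF-8,
# so it is replaced by the marker.
broken = sanitize_line("caf\xE9 au lait\n")
```

Emails that really are ISO-8859-1 lose their accented characters under this treatment, which matches the commit message's caveat that other encodings may need a later fix and reimport.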

script/mbox_single_import.rb

Lines changed: 65 additions & 0 deletions
@@ -0,0 +1,65 @@
+require_relative "../config/environment"
+require_relative "../app/services/email_ingestor"
+
+if ARGV.length != 2
+  puts "Usage: #{$PROGRAM_NAME} /path/to/mbox <message-id>"
+  exit 1
+end
+
+mbox_path = ARGV[0]
+target_id = MessageIdNormalizer.normalize(ARGV[1])
+
+if target_id.nil? || target_id.empty?
+  puts "ERROR: message-id is blank after normalization"
+  exit 1
+end
+
+def normalize_message_id(message)
+  MessageIdNormalizer.normalize(Mail.new(message).message_id)
+rescue => e
+  warn "WARN: failed to parse message id (#{e.class}: #{e.message})"
+  ''
+end
+
+def process_message(message, target_id)
+  return false if message.empty?
+
+  message_id = normalize_message_id(message)
+  return false if message_id.empty?
+  return false unless message_id == target_id
+
+  msg = EmailIngestor.new.ingest_raw(message, fallback_threading: true)
+  if msg
+    puts "Reimported #{msg.message_id}"
+  else
+    puts "Message #{target_id} not imported (invalid message id?)"
+  end
+  true
+end
+
+found = false
+message = ""
+
+puts "Scanning #{mbox_path} for #{target_id}..."
+
+File.open(mbox_path, "r") do |f|
+  f.each_line do |line|
+    line = line.encode("utf-8", :invalid => :replace)
+
+    if line.match(/^From [^@]+@[a-z\d\-]+(\.[a-z\d\-]+)*\.[a-z]+/i)
+      if process_message(message, target_id)
+        found = true
+        break
+      end
+      message = ""
+    else
+      message << line
+    end
+  end
+end
+
+if !found
+  found = process_message(message, target_id)
+end
+
+puts "Message not found in #{mbox_path}" unless found
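
The scanning loop in this script can be tried without a Rails environment or a real archive. The sketch below reuses the separator regex from the script and runs it over an in-memory sample. Everything else is an assumption for illustration: `split_mbox` and `find_message` are hypothetical helpers, the Message-ID is matched with a naive header regex instead of the Mail gem and `MessageIdNormalizer`, and the sample addresses are made up.

```ruby
require "stringio"

# Separator pattern copied from the script: "From " followed by an address.
FROM_LINE = /^From [^@]+@[a-z\d\-]+(\.[a-z\d\-]+)*\.[a-z]+/i

# Hypothetical helper: collects the raw text of each message, dropping the
# "From " separator line itself, as the script does.
def split_mbox(io)
  messages = []
  message = +""
  io.each_line do |line|
    if line.match?(FROM_LINE)
      messages << message unless message.empty?
      message = +""
    else
      message << line
    end
  end
  # The last message has no separator after it, so flush it explicitly.
  messages << message unless message.empty?
  messages
end

# Hypothetical lookup using a naive Message-ID header match.
def find_message(io, target_id)
  split_mbox(io).find do |raw|
    raw[/^Message-ID:\s*<([^>]+)>/i, 1] == target_id
  end
end

SAMPLE = <<~MBOX
  From alice@example.org Mon Jan  1 00:00:00 2001
  Message-ID: <one@example.org>
  Subject: first

  Hello
  From bob@example.org Tue Jan  2 00:00:00 2001
  Message-ID: <two@example.org>
  Subject: second

  World
MBOX

hit = find_message(StringIO.new(SAMPLE), "two@example.org")
```

Unlike this sketch, the script streams the file and stops at the first match, which matters for large archives. Both share the final flush after the loop, without which the last message in the file would never be checked.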
