Skip to content

Commit f7b5ff6

Browse files
TaoKgitster
authored andcommitted
git-p4: improve encoding handling to support inconsistent encodings
git-p4 is designed to run correctly under python2.7 and python3, but its functional behavior wrt importing user-entered text differs across these environments: Under python2, git-p4 "naively" writes the Perforce bytestream into git metadata (and does not set an "encoding" header on the commits); this means that any non-utf-8 byte sequences end up creating invalidly-encoded commit metadata in git. Under python3, git-p4 attempts to decode the Perforce bytestream as utf-8 data, and fails badly (with an unhelpful error) when non-utf-8 data is encountered. Perforce clients (especially p4v) encourage user entry of changelist descriptions (and user full names) in OS-local encoding, and store the resulting bytestream to the server unmodified - such that different clients can end up creating mutually-unintelligible messages. The most common inconsistency, in many Perforce environments, is likely to be utf-8 (typical in linux) vs cp-1252 (typical in windows). Make the changelist-description- and user-fullname-handling code python-runtime-agnostic, introducing three "strategies" selectable via config: - 'passthrough', behaving as previously under python2, - 'strict', behaving as previously under python3, and - 'fallback', favoring utf-8 but supporting a secondary encoding when utf-8 decoding fails, and finally escaping high-range bytes if the decoding with the secondary encoding also fails. Keep the python2 default behavior as-is ('legacy' strategy), but switch the python3 default strategy to 'fallback' with default fallback encoding 'cp1252'. Also include tests exercising these encoding strategies, documentation for the new config, and improve the user-facing error messages when decoding does fail. Signed-off-by: Tao Klerks <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>
1 parent 6cd33dc commit f7b5ff6

File tree

5 files changed

+572
-18
lines changed

5 files changed

+572
-18
lines changed

Documentation/git-p4.txt

Lines changed: 36 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -636,7 +636,42 @@ git-p4.pathEncoding::
636636
Git expects paths encoded as UTF-8. Use this config to tell git-p4
637637
what encoding Perforce had used for the paths. This encoding is used
638638
to transcode the paths to UTF-8. As an example, Perforce on Windows
639-
often uses "cp1252" to encode path names.
639+
often uses "cp1252" to encode path names. If this option is passed
640+
into a p4 clone request, it is persisted in the resulting new git
641+
repo.
642+
643+
git-p4.metadataDecodingStrategy::
644+
Perforce keeps the encoding of a changelist descriptions and user
645+
full names as stored by the client on a given OS. The p4v client
646+
uses the OS-local encoding, and so different users can end up storing
647+
different changelist descriptions or user full names in different
648+
encodings, in the same depot.
649+
Git tolerates inconsistent/incorrect encodings in commit messages
650+
and author names, but expects them to be specified in utf-8.
651+
git-p4 can use three different decoding strategies in handling the
652+
encoding uncertainty in Perforce: 'passthrough' simply passes the
653+
original bytes through from Perforce to git, creating usable but
654+
incorrectly-encoded data when the Perforce data is encoded as
655+
anything other than utf-8. 'strict' expects the Perforce data to be
656+
encoded as utf-8, and fails to import when this is not true.
657+
'fallback' attempts to interpret the data as utf-8, and otherwise
658+
falls back to using a secondary encoding - by default the common
659+
windows encoding 'cp-1252' - with upper-range bytes escaped if
660+
decoding with the fallback encoding also fails.
661+
Under python2 the default strategy is 'passthrough' for historical
662+
reasons, and under python3 the default is 'fallback'.
663+
When 'strict' is selected and decoding fails, the error message will
664+
propose changing this config parameter as a workaround. If this
665+
option is passed into a p4 clone request, it is persisted into the
666+
resulting new git repo.
667+
668+
git-p4.metadataFallbackEncoding::
669+
Specify the fallback encoding to use when decoding Perforce author
670+
names and changelists descriptions using the 'fallback' strategy
671+
(see git-p4.metadataDecodingStrategy). The fallback encoding will
672+
only be used when decoding as utf-8 fails. This option defaults to
673+
cp1252, a common windows encoding. If this option is passed into a
674+
p4 clone request, it is persisted into the resulting new git repo.
640675

641676
git-p4.largeFileSystem::
642677
Specify the system that is used for large (binary) files. Please note

git-p4.py

Lines changed: 107 additions & 16 deletions
Original file line numberDiff line numberDiff line change
@@ -15,6 +15,7 @@
1515
# pylint: disable=too-many-statements,too-many-instance-attributes
1616
# pylint: disable=too-many-branches,too-many-nested-blocks
1717
#
18+
import struct
1819
import sys
1920
if sys.version_info.major < 3 and sys.version_info.minor < 7:
2021
sys.stderr.write("git-p4: requires Python 2.7 or later.\n")
@@ -54,6 +55,9 @@
5455
# The block size is reduced automatically if required
5556
defaultBlockSize = 1<<20
5657

58+
defaultMetadataDecodingStrategy = 'passthrough' if sys.version_info.major == 2 else 'fallback'
59+
defaultFallbackMetadataEncoding = 'cp1252'
60+
5761
p4_access_checked = False
5862

5963
re_ko_keywords = re.compile(br'\$(Id|Header)(:[^$\n]+)?\$')
@@ -203,6 +207,70 @@ def decode_text_stream(s):
203207
def encode_text_stream(s):
204208
return s.encode('utf_8') if isinstance(s, unicode) else s
205209

210+
211+
class MetadataDecodingException(Exception):
212+
def __init__(self, input_string):
213+
self.input_string = input_string
214+
215+
def __str__(self):
216+
return """Decoding perforce metadata failed!
217+
The failing string was:
218+
---
219+
{}
220+
---
221+
Consider setting the git-p4.metadataDecodingStrategy config option to
222+
'fallback', to allow metadata to be decoded using a fallback encoding,
223+
defaulting to cp1252.""".format(self.input_string)
224+
225+
226+
encoding_fallback_warning_issued = False
227+
encoding_escape_warning_issued = False
228+
def metadata_stream_to_writable_bytes(s):
229+
encodingStrategy = gitConfig('git-p4.metadataDecodingStrategy') or defaultMetadataDecodingStrategy
230+
fallbackEncoding = gitConfig('git-p4.metadataFallbackEncoding') or defaultFallbackMetadataEncoding
231+
if not isinstance(s, bytes):
232+
return s.encode('utf_8')
233+
if encodingStrategy == 'passthrough':
234+
return s
235+
try:
236+
s.decode('utf_8')
237+
return s
238+
except UnicodeDecodeError:
239+
if encodingStrategy == 'fallback' and fallbackEncoding:
240+
global encoding_fallback_warning_issued
241+
global encoding_escape_warning_issued
242+
try:
243+
if not encoding_fallback_warning_issued:
244+
print("\nCould not decode value as utf-8; using configured fallback encoding %s: %s" % (fallbackEncoding, s))
245+
print("\n(this warning is only displayed once during an import)")
246+
encoding_fallback_warning_issued = True
247+
return s.decode(fallbackEncoding).encode('utf_8')
248+
except Exception as exc:
249+
if not encoding_escape_warning_issued:
250+
print("\nCould not decode value with configured fallback encoding %s; escaping bytes over 127: %s" % (fallbackEncoding, s))
251+
print("\n(this warning is only displayed once during an import)")
252+
encoding_escape_warning_issued = True
253+
escaped_bytes = b''
254+
# bytes and strings work very differently in python2 vs python3...
255+
if str is bytes:
256+
for byte in s:
257+
byte_number = struct.unpack('>B', byte)[0]
258+
if byte_number > 127:
259+
escaped_bytes += b'%'
260+
escaped_bytes += hex(byte_number)[2:].upper()
261+
else:
262+
escaped_bytes += byte
263+
else:
264+
for byte_number in s:
265+
if byte_number > 127:
266+
escaped_bytes += b'%'
267+
escaped_bytes += hex(byte_number).upper().encode()[2:]
268+
else:
269+
escaped_bytes += bytes([byte_number])
270+
return escaped_bytes
271+
272+
raise MetadataDecodingException(s)
273+
206274
def decode_path(path):
207275
"""Decode a given string (bytes or otherwise) using configured path encoding options
208276
"""
@@ -702,11 +770,12 @@ def p4CmdList(cmd, stdin=None, stdin_mode='w+b', cb=None, skip_info=False,
702770
if bytes is not str:
703771
# Decode unmarshalled dict to use str keys and values, except for:
704772
# - `data` which may contain arbitrary binary data
705-
# - `depotFile[0-9]*`, `path`, or `clientFile` which may contain non-UTF8 encoded text
773+
# - `desc` or `FullName` which may contain non-UTF8 encoded text handled below, eagerly converted to bytes
774+
# - `depotFile[0-9]*`, `path`, or `clientFile` which may contain non-UTF8 encoded text, handled by decode_path()
706775
decoded_entry = {}
707776
for key, value in entry.items():
708777
key = key.decode()
709-
if isinstance(value, bytes) and not (key in ('data', 'path', 'clientFile') or key.startswith('depotFile')):
778+
if isinstance(value, bytes) and not (key in ('data', 'desc', 'FullName', 'path', 'clientFile') or key.startswith('depotFile')):
710779
value = value.decode()
711780
decoded_entry[key] = value
712781
# Parse out data if it's an error response
@@ -716,6 +785,10 @@ def p4CmdList(cmd, stdin=None, stdin_mode='w+b', cb=None, skip_info=False,
716785
if skip_info:
717786
if 'code' in entry and entry['code'] == 'info':
718787
continue
788+
if 'desc' in entry:
789+
entry['desc'] = metadata_stream_to_writable_bytes(entry['desc'])
790+
if 'FullName' in entry:
791+
entry['FullName'] = metadata_stream_to_writable_bytes(entry['FullName'])
719792
if cb is not None:
720793
cb(entry)
721794
else:
@@ -1435,7 +1508,13 @@ def getUserMapFromPerforceServer(self):
14351508
for output in p4CmdList(["users"]):
14361509
if "User" not in output:
14371510
continue
1438-
self.users[output["User"]] = output["FullName"] + " <" + output["Email"] + ">"
1511+
# "FullName" is bytes. "Email" on the other hand might be bytes
1512+
# or unicode string depending on whether we are running under
1513+
# python2 or python3. To support
1514+
# git-p4.metadataDecodingStrategy=fallback, self.users dict values
1515+
# are always bytes, ready to be written to git.
1516+
emailbytes = metadata_stream_to_writable_bytes(output["Email"])
1517+
self.users[output["User"]] = output["FullName"] + b" <" + emailbytes + b">"
14391518
self.emails[output["Email"]] = output["User"]
14401519

14411520
mapUserConfigRegex = re.compile(r"^\s*(\S+)\s*=\s*(.+)\s*<(\S+)>\s*$", re.VERBOSE)
@@ -1445,26 +1524,28 @@ def getUserMapFromPerforceServer(self):
14451524
user = mapUser[0][0]
14461525
fullname = mapUser[0][1]
14471526
email = mapUser[0][2]
1448-
self.users[user] = fullname + " <" + email + ">"
1527+
fulluser = fullname + " <" + email + ">"
1528+
self.users[user] = metadata_stream_to_writable_bytes(fulluser)
14491529
self.emails[email] = user
14501530

1451-
s = ''
1531+
s = b''
14521532
for (key, val) in self.users.items():
1453-
s += "%s\t%s\n" % (key.expandtabs(1), val.expandtabs(1))
1533+
keybytes = metadata_stream_to_writable_bytes(key)
1534+
s += b"%s\t%s\n" % (keybytes.expandtabs(1), val.expandtabs(1))
14541535

1455-
open(self.getUserCacheFilename(), 'w').write(s)
1536+
open(self.getUserCacheFilename(), 'wb').write(s)
14561537
self.userMapFromPerforceServer = True
14571538

14581539
def loadUserMapFromCache(self):
14591540
self.users = {}
14601541
self.userMapFromPerforceServer = False
14611542
try:
1462-
cache = open(self.getUserCacheFilename(), 'r')
1543+
cache = open(self.getUserCacheFilename(), 'rb')
14631544
lines = cache.readlines()
14641545
cache.close()
14651546
for line in lines:
1466-
entry = line.strip().split("\t")
1467-
self.users[entry[0]] = entry[1]
1547+
entry = line.strip().split(b"\t")
1548+
self.users[entry[0].decode('utf_8')] = entry[1]
14681549
except IOError:
14691550
self.getUserMapFromPerforceServer()
14701551

@@ -3020,7 +3101,8 @@ def make_email(self, userid):
30203101
if userid in self.users:
30213102
return self.users[userid]
30223103
else:
3023-
return "%s <a@b>" % userid
3104+
userid_bytes = metadata_stream_to_writable_bytes(userid)
3105+
return b"%s <a@b>" % userid_bytes
30243106

30253107
def streamTag(self, gitStream, labelName, labelDetails, commit, epoch):
30263108
""" Stream a p4 tag.
@@ -3043,9 +3125,10 @@ def streamTag(self, gitStream, labelName, labelDetails, commit, epoch):
30433125
email = self.make_email(owner)
30443126
else:
30453127
email = self.make_email(self.p4UserId())
3046-
tagger = "%s %s %s" % (email, epoch, self.tz)
30473128

3048-
gitStream.write("tagger %s\n" % tagger)
3129+
gitStream.write("tagger ")
3130+
gitStream.write(email)
3131+
gitStream.write(" %s %s\n" % (epoch, self.tz))
30493132

30503133
print("labelDetails=",labelDetails)
30513134
if 'Description' in labelDetails:
@@ -3138,12 +3221,12 @@ def commit(self, details, files, branch, parent = "", allow_empty=False):
31383221
self.gitStream.write("commit %s\n" % branch)
31393222
self.gitStream.write("mark :%s\n" % details["change"])
31403223
self.committedChanges.add(int(details["change"]))
3141-
committer = ""
31423224
if author not in self.users:
31433225
self.getUserMapFromPerforceServer()
3144-
committer = "%s %s %s" % (self.make_email(author), epoch, self.tz)
31453226

3146-
self.gitStream.write("committer %s\n" % committer)
3227+
self.gitStream.write("committer ")
3228+
self.gitStream.write(self.make_email(author))
3229+
self.gitStream.write(" %s %s\n" % (epoch, self.tz))
31473230

31483231
self.gitStream.write("data <<EOT\n")
31493232
self.gitStream.write(details["desc"])
@@ -4055,6 +4138,14 @@ def run(self, args):
40554138
if self.useClientSpec_from_options:
40564139
system(["git", "config", "--bool", "git-p4.useclientspec", "true"])
40574140

4141+
# persist any git-p4 encoding-handling config options passed in for clone:
4142+
if gitConfig('git-p4.metadataDecodingStrategy'):
4143+
system(["git", "config", "git-p4.metadataDecodingStrategy", gitConfig('git-p4.metadataDecodingStrategy')])
4144+
if gitConfig('git-p4.metadataFallbackEncoding'):
4145+
system(["git", "config", "git-p4.metadataFallbackEncoding", gitConfig('git-p4.metadataFallbackEncoding')])
4146+
if gitConfig('git-p4.pathEncoding'):
4147+
system(["git", "config", "git-p4.pathEncoding", gitConfig('git-p4.pathEncoding')])
4148+
40584149
return True
40594150

40604151
class P4Unshelve(Command):

t/lib-git-p4.sh

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -142,10 +142,11 @@ start_p4d () {
142142

143143
p4_add_user () {
144144
name=$1 &&
145+
fullname="${2:-Dr. $1}"
145146
p4 user -f -i <<-EOF
146147
User: $name
147148
Email: $name@example.com
148-
FullName: Dr. $name
149+
FullName: $fullname
149150
EOF
150151
}
151152

0 commit comments

Comments
 (0)