Commit eb97a81

Merge pull request #33 from jecisc/speed-up-start-end
Speed up TSFASTImporter (from 7 s to 115 ms for a Python file of 900 LoC)
2 parents dcd6573 + 4684e03 commit eb97a81

File tree

2 files changed: 80 additions and 6 deletions

src/TreeSitter-FAST-Utils/TSFASTImporter.class.st

Lines changed: 38 additions & 6 deletions
@@ -1,14 +1,29 @@
11
"
2+
## Description
3+
24
I am a generic importer for a FAST model.
35
46
I will create all the nodes and relations of the FAST model taking a root node as parameter.
57
68
I mirror the Tree Sitter AST exactly, but I have a subclass that allows tweaking the generated model.
79
8-
Implementation details:
10+
# Implementation details
11+
12+
## Context
13+
914
- The context contains the stack of all elements ""parent"" to the node that is currently being visited.
1015
- The #currentFMProperty can either be nil or a FMProperty. If it is a property, it means that the nodes being visited are part of a field of their parent that has the same name as a contained entities property of the FAST entity. We keep it so that the children are stored in this property instead of the generic one.
1116
- #containedEntitiesPropertiesMap caches, for each kind of FAST class, the possible children properties, for performance reasons.
17+
18+
### Source positions management
19+
20+
TreeSitter provides the positions of the nodes in the parsed string as a number of bytes, but the current implementation of FAST requires the positions as a number of characters.
21+
In the original implementation, we computed, for each node, the number of characters from the start and end positions given in bytes.
22+
23+
We now take a different approach. We know that the source code we provide to Tree Sitter is encoded in UTF-8.
24+
With this information we build a map, cached in #bytesToCharactersMap, that associates the index of each leading byte with the index of the corresponding character.
25+
26+
This allows building the index map once and then simply using it to convert byte positions into character positions, which speeds up the import considerably.
1227
"
1328
Class {
1429
#name : 'TSFASTImporter',
@@ -19,12 +34,28 @@ Class {
1934
'originString',
2035
'containedEntitiesPropertiesMap',
2136
'context',
22-
'currentFMProperty'
37+
'currentFMProperty',
38+
'bytesToCharactersMap'
2339
],
2440
#category : 'TreeSitter-FAST-Utils',
2541
#package : 'TreeSitter-FAST-Utils'
2642
}
2743

44+
{ #category : 'private' }
45+
TSFASTImporter >> bytesToCharacterMap [
46+
"The FAST importer assumes that the string is UTF-8 encoded. If we parse a file in UTF-16 or another encoding, we should decode it and re-encode it in UTF-8.
47+
48+
In Famix we cannot do that since the source code is in files, but in FAST we keep the source code in a Pharo string, which makes this possible."
49+
50+
^ bytesToCharactersMap ifNil: [ bytesToCharactersMap := ZnUTF8Encoder default mapBytesToCharactersFor: self originString ]
51+
]
52+
53+
{ #category : 'private' }
54+
TSFASTImporter >> characterPositionAtByte: aNumber [
55+
56+
^ self bytesToCharacterMap at: aNumber ifAbsent: [ SubscriptOutOfBounds signalFor: aNumber ]
57+
]
58+
2859
{ #category : 'accessing' }
2960
TSFASTImporter >> classesPrefix [
3061

@@ -62,8 +93,9 @@ TSFASTImporter >> instantiateFastEntityFrom: aTSNode [
6293
fastEntity := self newInstanceOfClassNamed: self classesPrefix , aTSNode type pascalized.
6394

6495
model add: fastEntity.
65-
fastEntity startPos: (aTSNode startPositionFromSourceText: self originString).
66-
fastEntity endPos: (aTSNode endPositionFromSourceText: self originString).
96+
97+
fastEntity startPos: (self characterPositionAtByte: aTSNode startByte) + 1.
98+
fastEntity endPos: (self characterPositionAtByte: aTSNode endByte).
6799

68100
^ fastEntity
69101
]
@@ -87,9 +119,9 @@ TSFASTImporter >> originString [
87119
]
88120

89121
{ #category : 'accessing' }
90-
TSFASTImporter >> originString: anObject [
122+
TSFASTImporter >> originString: aString [
91123

92-
originString := anObject
124+
originString := aString
93125
]
94126

95127
{ #category : 'accessing' }
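
The new `startPos` / `endPos` computation above can be tried in a playground. The following is a minimal sketch, not the importer's own code: it assumes that `startByte` is a 0-based offset and that `endByte` is exclusive (as in the tree-sitter C API), and the byte offsets of the node are chosen by hand for illustration rather than obtained from a real parse. It uses the `mapBytesToCharactersFor:` extension added in this commit.

```smalltalk
| source map startByte endByte startPos endPos |
source := 'héllo'.
map := ZnUTF8Encoder default mapBytesToCharactersFor: source.

"Hypothetical node covering 'llo'. In UTF-8, 'h' takes 1 byte and 'é' takes 2,
so 'llo' spans bytes 3 (inclusive) to 6 (exclusive)."
startByte := 3.
endByte := 6.

"Same arithmetic as in instantiateFastEntityFrom: the + 1 turns the 0-based
start into a 1-based Pharo index; the exclusive end needs no adjustment."
startPos := (map at: startByte) + 1.
endPos := map at: endByte.

source copyFrom: startPos to: endPos "'llo'"
```

Reading the substring back with `copyFrom:to:` is a quick way to check that the two positions are consistent with Pharo's 1-based, inclusive indexing.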
Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
1+
Extension { #name : 'ZnUTF8Encoder' }
2+
3+
{ #category : '*TreeSitter-FAST-Utils' }
4+
ZnUTF8Encoder >> mapBytesToCharactersFor: aString [
5+
"I take a String as parameter, encode it in UTF-8, and fill a dictionary that associates the byte index of each character boundary with the index of the corresponding character."
6+
7+
| byteStream byteCount characterCount result |
8+
result := IdentityDictionary new: aString size.
9+
byteStream := (self encodeString: aString) readStream.
10+
byteCount := 0.
11+
characterCount := 0.
12+
result at: byteCount put: characterCount.
13+
14+
[ byteStream atEnd ] whileFalse: [
15+
| firstByte byteLenght |
16+
firstByte := byteStream next.
17+
18+
"In UTF-8, the leading bits of the first byte give the length of the character:
19+
- 0xxxxxxx, it means it is an ascii character on 1 byte.
20+
- 110xxxxx it means we have a 2 bytes character.
21+
- 1110xxxx it means we have a 3 bytes character.
22+
- 11110xxx it means we have a 4 bytes character."
23+
byteLenght := (firstByte bitAnd: 2r10000000) = 0
24+
ifTrue: [ 1 ]
25+
ifFalse: [
26+
(firstByte bitAnd: 2r11100000) = 2r11000000
27+
ifTrue: [ 2 ]
28+
ifFalse: [
29+
(firstByte bitAnd: 2r11110000) = 2r11100000
30+
ifTrue: [ 3 ]
31+
ifFalse: [
32+
(firstByte bitAnd: 2r11111000) = 2r11110000
33+
ifTrue: [ 4 ]
34+
ifFalse: [ self errorIllegalLeadingByte ] ] ] ].
35+
36+
byteStream skip: byteLenght - 1.
37+
byteCount := byteCount + byteLenght.
38+
characterCount := characterCount + 1.
39+
result at: byteCount put: characterCount ].
40+
41+
^ result
42+
]
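
As an illustration of what the resulting dictionary contains, here is a small playground sketch. The sample string is an assumption chosen to mix 1-, 2- and 3-byte characters; the expected values in the comments follow from the loop above.

```smalltalk
| map |
"'a' is encoded on 1 byte, 'é' on 2 bytes and '€' on 3 bytes."
map := ZnUTF8Encoder default mapBytesToCharactersFor: 'aé€'.

"Only character boundaries are registered as keys."
map keys asSortedCollection asArray. "#(0 1 3 6)"

map at: 3. "2: two characters end before byte offset 3"
map at: 6. "3: total number of characters"

"Offsets that fall inside a multi-byte character are absent from the map."
map at: 2 ifAbsent: [ nil ] "nil"
```

Since continuation bytes are never registered, a lookup on a byte offset that is not a character boundary fails, which is why `characterPositionAtByte:` signals `SubscriptOutOfBounds` in that case.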
