Commit b590b8c

put rendered code into vignettes
1 parent 299ab26 commit b590b8c

File tree

3 files changed

+440
-261
lines changed

vignettes/data_structures.Rmd

Lines changed: 192 additions & 140 deletions
@@ -9,145 +9,197 @@ vignette: >
9 9
%\VignetteEncoding{UTF-8}
10 10
---
11 11

12-
This vignette illustrates how the core of `styler` currently^[at commit `e6ddee0f510d3c9e3e22ef68586068fa5c6bc140`] works, i.e. how
13-
rules are applied to a parse table and how limitations of this approach can be
14-
overcome with a refined approach.
15-
16-
## Status quo - the flat approach
17-
18-
Roughly speaking, a string containing code to be formatted is parsed with `parse`
19-
and the output is passed to `getParseData` in order to obtain a parse
20-
table with detailed information about every token. For a simple example string
21-
"`a <- function(x) { if(x > 1) { 1+1 } else {x} }`" to be formatted, the parse
22-
table on which `styler` performs the manipulations looks similar to the one
23-
presented below.
24-
25-
```{r, message = FALSE}
26-
library("styler")
27-
library("dplyr")
28-
29-
code <- "a <- function(x) { if(x > 1) { 1+1 } else {x} }"
30-
31-
(parse_table <- styler:::compute_parse_data_flat_enhanced(code))
32-
```
33-
The column `spaces` was computed from the columns `col1` and `col2`, `newlines`
34-
was computed from `line1` and `line2` respectively.
35-
36-
So far, styler can set the spaces around the operators correctly. In our example,
37-
that involves adding spaces around `+`, so in the `spaces` column, element nine
38-
and ten must be set to one. This means that a space is added after `1` and after `+`.
39-
To get the spacing right and cover the various cases, a set of functions has to
40-
be applied to the parse table sequentially (and in the right order),
41-
which is essentially done via `Reduce()`.
42-
After all modifications on the table are completed, `serialize_parse_data()`
43-
collapses the `text` column and adds the number of spaces and
44-
line breaks specified in `spaces` and `newlines` in between the elements of
45-
`text`. If we serialize our table and don't perform any modification, we
46-
obviously just get back what we started with.
47-
```{r}
48-
styler:::serialize_parse_data_flat(parse_table)
49-
```
50-
51-
## Refining the flat approach - nesting the parse table
52-
53-
Although the flat approach is a good place to start, e.g. for fixing spaces
54-
between operators, it has its limitations. In particular, it treats each token
55-
the same way in the sense that it does not account for the context of the token,
56-
i.e. in which sub-expression it appears.
57-
To set the indention correctly, we need a hierarchical view on the parse data,
58-
since all tokens in a sub-expression have the same indention level. Hence,
59-
a natural approach would be to create a nested parse table instead of a flat
60-
parse table and then take a recursion over all elements in the table, so for
61-
each sub(-sub etc.)-expression, a separate parse table would be created and the
62-
modifications would be applied to this table before putting everything back
63-
together. A function to create a nested parse table already exists in `styler`.
64-
Let's have a look at the top level:
65-
66-
```{r}
67-
(l1 <- styler:::compute_parse_data_nested(code)[-1])
68-
69-
```
70-
71-
The tibble contains the column `child`, which itself contains a tibble.
72-
If we "enter" the first child, we can see that the expression was split up
73-
further.
74-
75-
```{r}
76-
l1$child[[1]] %>%
77-
select(text, terminal, child, token)
78-
```
12+
This vignette illustrates how the core of `styler` currently[1] works,
13+
i.e. how rules are applied to a parse table and how limitations of this
14+
approach can be overcome with a refined approach.
15+
16+
Status quo - the flat approach
17+
------------------------------
18+
19+
Roughly speaking, a string containing code to be formatted is parsed
20+
with `parse` and the output is passed to `getParseData` in order to
21+
obtain a parse table with detailed information about every token. For a
22+
simple example string
23+
"`a <- function(x) { if(x > 1) { 1+1 } else {x} }`" to be formatted, the
24+
parse table on which `styler` performs the manipulations looks similar
25+
to the one presented below.
26+
27+
library("styler")
28+
library("dplyr")
29+
30+
code <- "a <- function(x) { if(x > 1) { 1+1 } else {x} }"
31+
32+
(parse_table <- styler:::compute_parse_data_flat_enhanced(code))
33+
34+
## # A tibble: 24 x 14
35+
## line1 col1 line2 col2 token text terminal short newlines
36+
## <int> <int> <int> <int> <chr> <chr> <lgl> <chr> <int>
37+
## 1 1 0 1 0 START NA <NA> 0
38+
## 2 1 1 1 1 SYMBOL a TRUE a 0
39+
## 3 1 3 1 4 LEFT_ASSIGN <- TRUE <- 0
40+
## 4 1 6 1 13 FUNCTION function TRUE funct 0
41+
## 5 1 14 1 14 '(' ( TRUE ( 0
42+
## 6 1 15 1 15 SYMBOL_FORMALS x TRUE x 0
43+
## 7 1 16 1 16 ')' ) TRUE ) 0
44+
## 8 1 18 1 18 '{' { TRUE { 0
45+
## 9 1 20 1 21 IF if TRUE if 0
46+
## 10 1 22 1 22 '(' ( TRUE ( 0
47+
## # ... with 14 more rows, and 5 more variables: lag_newlines <int>,
48+
## # spaces <int>, multi_line <lgl>, indention_ref_id <lgl>, indent <dbl>
49+
50+
The column `spaces` was computed from the columns `col1` and `col2`,
51+
`newlines` was computed from `line1` and `line2` respectively.
52+
53+
So far, styler can set the spaces around the operators correctly. In our
54+
example, that involves adding spaces around `+`, so in the `spaces`
55+
column, element nine and ten must be set to one. This means that a space
56+
is added after `1` and after `+`. To get the spacing right and cover the
57+
various cases, a set of functions has to be applied to the parse table
58+
sequentially (and in the right order), which is essentially done via
59+
`Reduce()`. After all modifications on the table are completed,
60+
`serialize_parse_data()` collapses the `text` column and adds the number
61+
of spaces and line breaks specified in `spaces` and `newlines` in
62+
between the elements of `text`. If we serialize our table and don't
63+
perform any modification, we obviously just get back what we started
64+
with.
65+
66+
styler:::serialize_parse_data_flat(parse_table)
67+
68+
## [1] "a <- function(x) { if(x > 1) { 1+1 } else {x} }"
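
The flat pipeline described above can be approximated with base R alone. The following is a minimal sketch, not styler's implementation: `flat_table()`, `space_before_plus()`, `space_after_plus()` and `serialize_flat()` are hypothetical helpers introduced for illustration, and the sketch assumes single-line input.

```r
# Hypothetical helpers built on base R only; not part of styler's API.
code <- "a <- function(x) { if(x > 1) { 1+1 } else {x} }"

# Build a flat table of terminals with a `spaces` column (spaces after a token).
flat_table <- function(code) {
  pd <- utils::getParseData(parse(text = code, keep.source = TRUE))
  pd <- pd[pd$terminal, ]
  pd <- pd[order(pd$line1, pd$col1), ]
  # Gap between the end of a token and the start of the next one.
  next_col1 <- c(pd$col1[-1], NA)
  pd$spaces <- next_col1 - pd$col2 - 1L
  pd$spaces[nrow(pd)] <- 0L # nothing follows the last token
  pd
}

# Each rule takes a table and returns a modified table.
space_before_plus <- function(pd) {
  pd$spaces[which(pd$token == "'+'") - 1L] <- 1L
  pd
}
space_after_plus <- function(pd) {
  pd$spaces[pd$token == "'+'"] <- 1L
  pd
}

# Collapse `text`, inserting the number of spaces recorded for each token.
serialize_flat <- function(pd) {
  paste0(pd$text, strrep(" ", pd$spaces), collapse = "")
}

# Apply the rules one after another, in order, via Reduce().
rules <- list(space_before_plus, space_after_plus)
styled <- Reduce(function(pd, rule) rule(pd), rules, init = flat_table(code))

serialize_flat(styled)
# [1] "a <- function(x) { if(x > 1) { 1 + 1 } else {x} }"
```

Serializing the unmodified table returns the input string unchanged, which mirrors the round trip shown above.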
69+
70+
Refining the flat approach - nesting the parse table
71+
----------------------------------------------------
72+
73+
Although the flat approach is a good place to start, e.g. for fixing
74+
spaces between operators, it has its limitations. In particular, it
75+
treats each token the same way in the sense that it does not account for
76+
the context of the token, i.e. in which sub-expression it appears. To
77+
set the indention correctly, we need a hierarchical view on the parse
78+
data, since all tokens in a sub-expression have the same indention
79+
level. Hence, a natural approach would be to create a nested parse table
80+
instead of a flat parse table and then take a recursion over all
81+
elements in the table, so for each sub(-sub etc.)-expression, a separate
82+
parse table would be created and the modifications would be applied to
83+
this table before putting everything back together. A function to create
84+
a nested parse table already exists in `styler`. Let's have a look at
85+
the top level:
86+
87+
(l1 <- styler:::compute_parse_data_nested(code)[-1])
88+
89+
## # A tibble: 1 x 13
90+
## col1 line2 col2 id parent token terminal text short token_before
91+
## <int> <int> <int> <int> <int> <chr> <lgl> <chr> <chr> <chr>
92+
## 1 1 1 47 49 0 expr FALSE <NA>
93+
## # ... with 3 more variables: token_after <chr>, internal <lgl>,
94+
## # child <list>
95+
96+
The tibble contains the column `child`, which itself contains a tibble.
97+
If we "enter" the first child, we can see that the expression was split
98+
up further.
99+
100+
l1$child[[1]] %>%
101+
select(text, terminal, child, token)
102+
103+
## # A tibble: 3 x 4
104+
## text terminal child token
105+
## <chr> <lgl> <list> <chr>
106+
## 1 FALSE <tibble [1 x 14]> expr
107+
## 2 <- TRUE <NULL> LEFT_ASSIGN
108+
## 3 FALSE <tibble [5 x 14]> expr
79 109

80 110
And further...
81-
```{r}
82-
l1$child[[1]]$child[[3]]$child[[5]]
83-
```
84-
85-
... and so on. Every child that is not a terminal contains another tibble where
86-
the sub-expression is split up further - until we are left with tibbles that
87-
only contain terminals.
88-
89-
90-
Recall the above example. `a <- function(x) { if(x > 1) { 1+1 } else {x} }`.
91-
In the last printed parse table, we can see that the whole if condition
92-
is a sub-expression of `code`, surrounded by two curly brackets. Hence,
93-
one would like to set the indention level for this sub-expression before
94-
doing anything with it in more detail. Later, as we progress deeper into
95-
the nested table, we hit a similar pattern:
96-
97-
```{r}
98-
l1$child[[1]]$child[[3]]$child[[5]]$child[[2]]$child[[5]]
99-
```
100-
Again, we have two curly brackets and an expression inside. We would like to
101-
set the indention level for the expression `1+1` in the same way as for the
102-
whole if condition.
103-
104-
The simple example above makes it evident that a recursive approach to this
105-
problem would be the most natural.
106-
107-
The code for a function that sketches the idea and illustrates such a
108-
recursion is given below.
109-
110-
It takes a nested parse table as input and then does the recursion over all
111-
children. If the child is a terminal, it returns the text, otherwise,
112-
it "enters" the child to find the terminals inside of the child and returns them.
113-
114-
```{r}
115-
serialize <- function(x) {
116-
out <- Map(
117-
function(terminal, text, child) {
118-
if (terminal)
119-
text
120-
else
121-
serialize(child)
122-
},
123-
x$terminal, x$text, x$child
124-
)
125-
out
126-
}
127-
128-
x <- styler:::compute_parse_data_nested(code)
129-
serialize(x) %>% unlist
130-
```
131-
132-
How exactly to implement a similar recursion that returns not each text
133-
token separately, but
134-
the styled text as one string (or one string per line) is subject to future work,
135-
as are the functions to be
136-
applied to a sub-expression parse table that create correct indention.
137-
Similar to `compute_parse_data_flat_enhanced`, the columns `spaces` and `newlines`
138-
would be required to be computed by `compute_parse_data_nested` as well as a
139-
new column `indention`.
140-
141-
142-
## Final Remarks
143-
144-
Although a flat structure would possibly also allow us to solve the problem of
145-
indention, it is a less elegant and flexible solution to the problem. It would
146-
involve looking for an opening curly bracket in the parse table, setting the
147-
indention level for all subsequent rows in the parse table until the next
148-
opening or closing curly bracket is hit and then indenting one level further or
149-
setting indention back to where it was at the beginning of the table.
150-
151-
Note that the vignette just addressed the question of indention caused by
152-
curly brackets and has not dealt with other operators that would trigger
153-
indention, such as `(` or `+`.
111+
112+
l1$child[[1]]$child[[3]]$child[[5]]
113+
114+
## # A tibble: 3 x 14
115+
## line1 col1 line2 col2 id parent token terminal text short
116+
## <int> <int> <int> <int> <int> <int> <chr> <lgl> <chr> <chr>
117+
## 1 1 18 1 18 9 45 '{' TRUE { {
118+
## 2 1 20 1 45 42 45 expr FALSE
119+
## 3 1 47 1 47 40 45 '}' TRUE } }
120+
## # ... with 4 more variables: token_before <chr>, token_after <chr>,
121+
## # internal <lgl>, child <list>
122+
123+
... and so on. Every child that is not a terminal contains another
124+
tibble where the sub-expression is split up further - until we are left
125+
with tibbles that only contain terminals.
126+
127+
Recall the above example.
128+
`a <- function(x) { if(x > 1) { 1+1 } else {x} }`. In the last printed
129+
parse table, we can see that the whole if condition is a
130+
sub-expression of `code`, surrounded by two curly brackets. Hence, one
131+
would like to set the indention level for this sub-expression before
132+
doing anything with it in more detail. Later, as we progress deeper
133+
into the nested table, we hit a similar pattern:
134+
135+
l1$child[[1]]$child[[3]]$child[[5]]$child[[2]]$child[[5]]
136+
137+
## # A tibble: 3 x 14
138+
## line1 col1 line2 col2 id parent token terminal text short
139+
## <int> <int> <int> <int> <int> <int> <chr> <lgl> <chr> <chr>
140+
## 1 1 30 1 30 20 30 '{' TRUE { {
141+
## 2 1 32 1 34 27 30 expr FALSE
142+
## 3 1 36 1 36 26 30 '}' TRUE } }
143+
## # ... with 4 more variables: token_before <chr>, token_after <chr>,
144+
## # internal <lgl>, child <list>
145+
146+
Again, we have two curly brackets and an expression inside. We would
147+
like to set the indention level for the expression `1+1` in the same way
148+
as for the whole if condition.
149+
150+
The simple example above makes it evident that a recursive approach to
151+
this problem would be the most natural.
152+
153+
The code for a function that sketches the idea and illustrates
154+
such a recursion is given below.
155+
156+
It takes a nested parse table as input and then does the recursion over
157+
all children. If the child is a terminal, it returns the text,
158+
otherwise, it "enters" the child to find the terminals inside of the
159+
child and returns them.
160+
161+
serialize <- function(x) {
162+
out <- Map(
163+
function(terminal, text, child) {
164+
if (terminal)
165+
text
166+
else
167+
serialize(child)
168+
},
169+
x$terminal, x$text, x$child
170+
)
171+
out
172+
}
173+
174+
x <- styler:::compute_parse_data_nested(code)
175+
serialize(x) %>% unlist
176+
177+
## [1] "a" "<-" "function" "(" "x" ")"
178+
## [7] "{" "if" "(" "x" ">" "1"
179+
## [13] ")" "{" "1" "+" "1" "}"
180+
## [19] "else" "{" "x" "}" "}"
181+
182+
How exactly to implement a similar recursion that returns not each
183+
text token separately, but the styled text as one string (or one string
184+
per line) is subject to future work, as are the functions to be
185+
applied to a sub-expression parse table that create correct indention.
186+
Similar to `compute_parse_data_flat_enhanced`, the columns `spaces` and
187+
`newlines` would be required to be computed by
188+
`compute_parse_data_nested` as well as a new column `indention`.
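
One way such an indention level could be derived recursively is sketched below using base R only. This is an illustration under stated assumptions, not styler's code: `nest_parse_data()` and `indent_levels()` are hypothetical names, the nesting is rebuilt from the `id` and `parent` columns of `getParseData()`, and only curly brackets trigger indention.

```r
# Hypothetical sketch; neither function is part of styler.
code <- "a <- function(x) { if(x > 1) { 1+1 } else {x} }"
pd <- utils::getParseData(parse(text = code, keep.source = TRUE))

# Build a nested table: every non-terminal row gets its children in `child`.
nest_parse_data <- function(pd, parent_id = 0L) {
  kids <- pd[pd$parent == parent_id, ]
  kids <- kids[order(kids$line1, kids$col1), ]
  kids$child <- lapply(seq_len(nrow(kids)), function(i) {
    if (kids$terminal[i]) NULL else nest_parse_data(pd, kids$id[i])
  })
  kids
}

# Recurse over the nested table and return one indention level per terminal.
# If a table contains an opening curly bracket, its non-terminal children
# are one level deeper; the brackets themselves keep the outer level.
indent_levels <- function(nested, level = 0L) {
  has_braces <- any(nested$token == "'{'")
  out <- Map(
    function(terminal, text, child) {
      if (terminal) {
        setNames(level, text)
      } else {
        indent_levels(child, level + has_braces)
      }
    },
    nested$terminal, nested$text, nested$child
  )
  unlist(unname(out))
}

levels <- indent_levels(nest_parse_data(pd))
```

In this example, the tokens of the if condition receive level 1, and the tokens of `1+1` as well as the `x` in the else branch receive level 2, because the same rule is applied at every depth of the recursion.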
189+
190+
Final Remarks
191+
-------------
192+
193+
Although a flat structure would possibly also allow us to solve the
194+
problem of indention, it is a less elegant and flexible solution to the
195+
problem. It would involve looking for an opening curly bracket in the
196+
parse table, setting the indention level for all subsequent rows in the
197+
parse table until the next opening or closing curly bracket is hit and
198+
then indenting one level further or setting indention back to where it
199+
was at the beginning of the table.
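
For comparison, the flat alternative described above can be sketched as a running count of opening and closing curly brackets. This is a base R illustration under the assumption that only curly brackets matter; the column name `indent` is chosen here for the example.

```r
# Hypothetical sketch of brace counting on a flat table of terminals.
code <- "a <- function(x) { if(x > 1) { 1+1 } else {x} }"
pd <- utils::getParseData(parse(text = code, keep.source = TRUE))
pd <- pd[pd$terminal, ]
pd <- pd[order(pd$line1, pd$col1), ]

opening <- pd$token == "'{'"
closing <- pd$token == "'}'"

# Nesting depth after each token: brackets opened so far minus brackets closed.
depth_after <- cumsum(opening) - cumsum(closing)

# An opening bracket belongs to the outer level, so subtract it again.
pd$indent <- depth_after - opening

pd[, c("text", "indent")]
```

The counter works for this example, but every further trigger of indention, such as `(`, would need its own bookkeeping in the same pass, which is where the nested representation becomes more convenient.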
200+
201+
Note that the vignette just addressed the question of indention caused
202+
by curly brackets and has not dealt with other operators that would
203+
trigger indention, such as `(` or `+`.
204+
205+
[1] at commit `e6ddee0f510d3c9e3e22ef68586068fa5c6bc140`
