Structured Typesetting (STS) generation#9
Conversation
wlammen
left a comment
There was a problem hiding this comment.
Why did you add the makefile? The readme instructions note several ways to compile metamath, including gcc m*.c -o metamath, and using automake.
wlammen
left a comment
There was a problem hiding this comment.
metamath.c line 3032 duplicated code from line 2393
|
Thank you Wolf for your review!
It actually does, see changes in
That's very verbose, though. Maybe just "STRUCTURED" ?
Sure, I can remove the makefile. |
|
How do you render this "structured typesetting"? Is this MathML, MathJax or something like this ? For texts, and in particular help texts, Norm used the double-space convention between sentences, and I admit I like that. |
Yes, the
Your wish has come true with the last commit! |
mmhlpb.c
Outdated
| H("Syntax: VERIFY STS <format>"); | ||
| H(""); | ||
| H("This command error-checks that the STS rules definition covers all syntax"); | ||
| H("defined in the Metamath source file loaded. It runs through all non"); |
There was a problem hiding this comment.
Double space here ! And I would write "non-definitional" with a hyphen, or even "nondefinitional", since "non" is not a word.
|
Thanks. So maybe you can use the option "/ MATHML" ? I'm fine with "/ STRUCTURED" too. I cannot approve this PR since I am not fluent in C :-( |
|
The option STS or /STRUCTURED is the second alternative to /HTML and /ALT_HTML. Are you able to roughly tell in what way these options differ by just looking at the name? If not, people will have to guess at what the result of invocations are. Do you expect this option to be typed by users of metamath directly? Or will it be likely be part of a script rarely changing? If you type the option frequently, a short name is handy, otherwise even a very long option name will hardly annoy anyone, but is easily understandable to casual readers of the script. I am still busy with Xmas, have just skimmed your PR. I currently cannot look into this further. |
I wanted to keep it generic because one could generate anything with it, not just MathML. |
Yes, and it is itself followed by the name of the production to use. A command would typically be: If you had, say, a
Yes, that's roughly the process. So if I follow you, we could use e.g. /STS_HTML as an option tag? |
|
What about EXPAND_HTML? Or HTML_EXPAND? |
|
When running metamath, it is enough to type any non-ambiguous prefix, so here, |
There was a problem hiding this comment.
.gitignore: OK, see here (https://stackoverflow.com/questions/6626136/best-practice-for-adding-gitignore-to-repo)
Maybe one should put this info into the README?
wlammen
left a comment
There was a problem hiding this comment.
metamath.c line 60 suggest a version number and add a changelog entry.
wlammen
left a comment
There was a problem hiding this comment.
parsetSTSRules type: why parset...? Looks like a typo
There was a problem hiding this comment.
meamath.c line 2977: Suspicious value of i. The other options / TIME or / NO_VERSIONING add 2 to the number of options, seemingly counting the slash and the tag as different entries. Why is this not done for / STS? If the count is correct, IMO a comment should clarify the underlying logic.
to be continued...
|
@wlammen I'm going to copy your remarks directly into a code review, it creates a sub-thread per remark and I think it's much easier to follow like that. |
Ok, I've updated the history and proposed a version 0.199. |
| @@ -0,0 +1,2 @@ | |||
| *.o | |||
| metamath | |||
There was a problem hiding this comment.
@wlammen writes:
OK, see here (https://stackoverflow.com/questions/6626136/best-practice-for-adding-gitignore-to-repo)
Maybe one should put this info into the README?
I'm not sure which info you would like to add to the README.
There is a standard .gitignore file for C projects here, we could give it a try.
Maybe better in a separate PR?
There was a problem hiding this comment.
The reason why I added this comment is, because the metamath sources are distributed via a tar archive as well. Obviously you wont need the gitignore in this case. I think, some kind of distribution how-to is finally called for. That are my thoughts. I was curious why you added the file. Separate PR is fine with me.
| /* 19-Jul-2017 tar Added for STS/MathML output */ | ||
| extern flag stsFlag; /* STS output (for "structural typesetting") */ | ||
| extern vstring stsOutput; /* output mode chosen for STS (follows STS flag) */ | ||
| extern vstring postProcess; /* command to pipe the output into (used for MathJax prerendering) */ |
There was a problem hiding this comment.
@wlammen wrote:
stsFlag (and possibly following) should be prefixed with g_ (g_stsFlag, matching g_altHtmlFlag for example), see changelog 0.187 metamath.c, line 88
Yes, this global variable naming convention has been introduced after I programmed the STS module. As an improvement, I could retrofit this to follow it too.
There was a problem hiding this comment.
These are the current rules.
| if (switchPos("/ ALT_HTML") != 0 || switchPos("/ STS") != 0 ) { | ||
| print2("?Please specify only one of / HTML , / ALT_HTML and / STS.\n"); |
There was a problem hiding this comment.
@wlammen wrote:
optimization: The function switchPos("/ STS") is called three times more or less in succession. Consider evaluating the result once and use a local variable to recall the result within the if conditions.
I don't think this is called 3 times. There are 3 else..if branches, and it is called one time in each, to ensure only one HTML output formatting option is chosen.
There was a problem hiding this comment.
You're right. It still looks awkward since the call and the parameter is coded several times.
metamath.c
Outdated
| /* 7-Jul-2017 added MathML/STS */ | ||
| if (switchPos("/ STS")) i = i + 1; |
There was a problem hiding this comment.
@wlammen wrote:
Suspicious value of i. The other options / TIME or / NO_VERSIONING add 2 to the number of options, seemingly counting the slash and the tag as different entries. Why is this not done for / STS? If the count is correct, IMO a comment should clarify the underlying logic.
Interesting. So, a typical SHOW STATEMENT command looks like this:
SHOW STATEMENT syl / HTML
That would be g_rawArgs = 5, which already includes 2 arguments for the / HTML.
As you correctly guessed, for / NO_VERSIONING and / TIME, there are two more arguments, therefore the +2.
In the case of STS, the / STS would already be accounted for in the 5 (taking the place of the / HTML), so that's not why the +1 is for. Rather, in the case of STS, there is one more argument, namely the name of the output processing, e.g. mathml. That is what the +1 is for.
I'll add a comment to make that clearer.
There was a problem hiding this comment.
I am subject to a learning curve as well while reviewing, wrt both to your source code and the review style.
| if (lastArgMatches("STS")) { | ||
| i++; | ||
| if (strlen(stsOutput)) { | ||
| if (!getFullArg(i,cat("* Using which output mode <",stsOutput,">? ",NULL))) |
| print2("?No source file has been read in. Use READ first.\n"); | ||
| goto pclbad; | ||
| } | ||
| if (strlen(stsOutput)) { |
There was a problem hiding this comment.
Duplicated code from line 586 (with i == 2)
mmcmds.c
Outdated
| let(&str2, ""); | ||
| str2 = tokenToTex(g_MathToken[nmbrTmpPtr2[k]].tokenName, showStmt); | ||
| /* 27 Jul 2017 tar For MathML/STS */ | ||
| if(stsFlag) str2 = stsToken(nmbrTmpPtr2[k], showStmt); |
There was a problem hiding this comment.
Duplicated code from line 357
mmcmds.c
Outdated
| let(&str2, ""); | ||
| str2 = tokenToTex(g_MathToken[nmbrTmpPtr2[k]].tokenName, showStmt); | ||
| /* 27 Jul 2017 tar For MathML/STS */ | ||
| if(stsFlag) str2 = stsToken(nmbrTmpPtr2[k], showStmt); |
There was a problem hiding this comment.
Duplicated code from line 377
| long i; | ||
| long hash = 0; | ||
| i = -1; | ||
| while (i < 13 && s[i] != -1) { |
There was a problem hiding this comment.
Suspicious i in first loop: -1, which is at least out of bounds for salt, likely for s, too.
There was a problem hiding this comment.
I'll answer this one now as it is interesting, useful for the rest of the review... and definitely deserves more information in the comments!
Actually, indices -1, -2 and -3 are valid and used in nmbrString:
- Index
-1is the length of the number string. SeenmbrLen(mmdata.c line 1023) - Index
-2is the allocated length, i.e. how many numbers are available totally (could be more than the actual current length of the string). SeenmbrAllocLen(mmdata.c line 1032) - Index
-3is the location inmemUsedPool(for memory management)
In this specific case, I wanted to include the length of the string in the hash, which I think makes sense but shall have been explained.
In any case, you clearly spotted a mistake here because in the salt, index -1 is clearly invalid!
There was a problem hiding this comment.
Any use of negative indices should be wrapped behind a macro to provide more safety and clarity. (That is, there should be a macro like #define nmbrLen(p) p[-1] where p has type nmbrString which is a typedef for int* or what have you.
| static long salt[] = { 4938, 48977, 6897, 7293, 2663, 7925, 2999, 12238, | ||
| 40033, 14038, 10699, 29746, 56108, 34526, 63576, 52053, 61949, 41177, 43740, 22822 | ||
| }; | ||
| long i; |
There was a problem hiding this comment.
Declaration is Initialization rule (section 4.5 in https://docs.oracle.com/cd/E17984_01/doc.898/e14699/variables_data_structs.htm) Initialize with value from line 1113
| /* This simply computes a XOR of the first numbers */ | ||
| int nmbrHash(nmbrString *s) | ||
| { | ||
| static long salt[] = { 4938, 48977, 6897, 7293, 2663, 7925, 2999, 12238, |
There was a problem hiding this comment.
Only 13 (?) values actually needed. Remove/comment out unneeded ones.
Where is the hash algorithm explained? Add a comment/link.
to be continued...
| long i; | ||
| if (sstart - 1 + len > nmbrLen(s) | ||
| || tstart - 1 + len > nmbrLen(t)) return 0; | ||
| for (i = 0; s[sstart-1+i] == t[tstart-1+i] && i<len; i++); |
There was a problem hiding this comment.
Suspicious index in first loop: -1, if sstart == 0. (1) Out of bounds access to s and t, or (2) Parameters sstart and tstart must be > 0.
Either provide parameter checks (cf. line 1211), or (minimum) state limitations in comment line 1122
| long nmbrInstrN(long start_position, long occ, nmbrString *string1, | ||
| nmbrString *string2, long start2, long length2) | ||
| { | ||
| if (start_position < 1) start_position = 1; |
There was a problem hiding this comment.
This means: If garbage is provided in parameter start_position, then I sanitize it to something more useful, hoping for the best. This kind of fault tolerance supports a sloppy programming on the caller's side. Better throw an exception, or call bug()
| return (sout); | ||
| } | ||
|
|
||
| /* Search for the nth occurrence of string2 in string1 */ |
There was a problem hiding this comment.
Explain parameters (semantics and limitations) in a comment. For example, what does occ mean? n would match the functions name.
| start_position--; | ||
| for(; occ > 0;occ--) { | ||
| long ls1, i, j; | ||
| ls1 = nmbrLen(string1); |
There was a problem hiding this comment.
Pull constant evaluations out of the loop.
| if (start_position < 1) start_position = 1; | ||
| start_position--; | ||
| for(; occ > 0;occ--) { | ||
| long ls1, i, j; |
There was a problem hiding this comment.
declare i and j in for commands
mmdata.c
Outdated
| } | ||
| } | ||
| if (found) { | ||
| start_position = i+1; |
| for (i = start_position - 1; i <= ls1 - length2; i++) { | ||
| flag found = 1; | ||
| for (j = 0; j < length2; j++) { | ||
| if (string1[i+j] != string2[start2-1+j]) { |
There was a problem hiding this comment.
This condition is usually part of the for-command.
| /* Add a single number to start of a nmbrString - faster than nmbrCat */ | ||
| nmbrString *nmbrUnshiftElement(nmbrString *g, long element) | ||
| { | ||
| long length; |
There was a problem hiding this comment.
declaration should be initialization
| } | ||
|
|
||
|
|
||
| /* Add a single number to start of a nmbrString - faster than nmbrCat */ |
There was a problem hiding this comment.
The return value is pushed on an internal variable stack with an implicit memory management. This important detail is not mentioned in the comment.
| long nmbrAllocLen(nmbrString *s); | ||
| void nmbrZapLen(nmbrString *s, long length); | ||
|
|
||
| /* Search for the nth occurrence of string2 in string1 */ |
There was a problem hiding this comment.
Should explain parameters and specify their limitations.
| #define NO_LINKEDITEM -1 | ||
| linked *linkedItems; | ||
| int free_linkedItem; | ||
| flag htinit_done = 0; |
There was a problem hiding this comment.
This is private to htinit() so declare it there as a static variable
| /* Static buffer for the linked lists */ | ||
| #define NB_LINKEDITEMS 50000L | ||
| #define NO_LINKEDITEM -1 | ||
| linked *linkedItems; |
There was a problem hiding this comment.
It is safer to initialize this to NULL and 0
|
|
||
| /* Create and fill the structure */ | ||
| hashtable ht; | ||
| ht.name = name; |
There was a problem hiding this comment.
The problem here is that you copy a pointer, not its value, from a parameter to the result. The caller handles the parameter's contents with a series of let() instructions, finally freeing it. All these operations are done without ht.name in mind. This opens up all kinds of memory allocation/access failures. Even if you know, this won't happen in the foreseeable future, you need a guarantee, or a contract, here to decouple caller and callee.
| /* Found it, free the object and remove it from the chain */ | ||
| hashtable->freeFunc(&linkedItems[*pli].key, &linkedItems[*pli].object); | ||
| int old_free = free_linkedItem; | ||
| free_linkedItem = *pli; |
There was a problem hiding this comment.
DRY duplicated code from line 76
to be continued...
|
@wlammen thank you very much for your careful review! |
|
New Year's Eve is closing in. I need some time for it. |
| /* The structure containing information about the STS variable tokens */ | ||
| struct stsVar_struct { | ||
| long stsType; /* type of the token in the STS (must be a constant tokenId) */ | ||
| long stsSchemeId; /* number of the schemed in which this variable is defined + 1. */ |
| }; | ||
|
|
||
| /* Current output format for STS */ | ||
| vstring stsFormat = ""; |
There was a problem hiding this comment.
g_ prefix missing from global variables (here and elsewhere in this file)
| /* Math symbol comparison for bsearch */ | ||
| /* Here, key is pointer to a character string. */ | ||
| /* Here we search only the global tokens, those | ||
| * which endStatement is the last statement */ |
There was a problem hiding this comment.
English: where endStatement...
| /* Here, key is pointer to a character string. */ | ||
| /* Here we search only the global tokens, those | ||
| * which endStatement is the last statement */ | ||
| int mathSrchGlbCmp(const void *key, const void *data) |
There was a problem hiding this comment.
is there a reason to avoid specific C types in parameter declarations? And so dodge C type checking?
key: char const *
data: mathToken_struct const *
| if(g_MathToken[ *((long *)data) ].endStatement == g_statements) return 0; | ||
|
|
||
| /* Find the direction in which the target token is */ | ||
| for(long *ptr = (long*)data; !strcmp(key, g_MathToken[ *ptr ].tokenName); ptr++) |
There was a problem hiding this comment.
Optimization: The first loop seems to check element data again, and the outcome is known to be 0. So I suggest to initialize with long* ptr = (long*) data + 1;
| /* Cache to speed up conversions */ | ||
| hashtable stsCache; | ||
|
|
||
| /* Math symbol comparison for bsearch */ |
There was a problem hiding this comment.
The comment should explain the result, that is limited to -1, 0 and +1, and what is delivered when.
It seems possible that the token list contains the same key multiple times in succession. Since this is part of a binary search, data may point somewhere in the middle of such a series. Is it guaranteed to hold just one global element (there is a suspicious active flag in the structure, that may allow a disabled and enabled element in the same series)? Is it guaranteed there is always a global element present? If either assumption is missed, the binary search may fail.
Clarify the so-called pre-/postconditions in a comment.
| long i; | ||
| char *fbPtr; | ||
| long textLen, tokenLen_; | ||
| long *g_mathKeyPtr; /* bsearch returned value */ |
| /* Make sure that g_mathTokens has been initialized */ | ||
| if (!g_mathTokens) bug(1717); | ||
|
|
||
| textLen = (long)strlen(text); |
There was a problem hiding this comment.
definition is initialization long textLen = (long) strlen(text); same for wrklen etc.
| #include "mmvstr.h" | ||
| #include "mmdata.h" | ||
| #include "mminou.h" | ||
| #include "mmpars.h" /* For rawSourceError and mathSrchCmp and lookupLabel */ |
| wrkNmbrPtr[mathStringLen] = *g_mathKeyPtr; | ||
| mathStringLen++; | ||
| fbPtr = fbPtr + tokenLen_ + 1; /* Move on to next token */ | ||
| if(fbPtr >= text + textLen) break; |
There was a problem hiding this comment.
This should be the while condition
| return NULL_NMBRSTRING; | ||
| } | ||
| wrkNmbrPtr[mathStringLen] = *g_mathKeyPtr; | ||
| mathStringLen++; |
There was a problem hiding this comment.
tip: wrkNmbrPtr[mathStringLen++] = ... saves the following line and can reduce code size
same for fbPtr += tokenLen_ + 1;
| return mathString; | ||
| } | ||
|
|
||
| /* Store a couple key/object into the cache */ |
There was a problem hiding this comment.
Document Pre/Postconditions
mmwsts.c
Outdated
| } | ||
|
|
||
| /* Dump a couple key/object from the cache */ | ||
| void stsDumpCache(nmbrString *key, vstring object) { |
There was a problem hiding this comment.
suspicious cast to eqFunc in line 235: The signature of this function does not match that of eqFunc: int (eqFunc)(void *, void *)
mmwsts.c
Outdated
| if(stsUseCache) { | ||
| stsCache = htcreate(format, STS_CACHE_BUCKETS, "", (hashFunc*)&nmbrHash, (eqFunc*)&nmbrEq, | ||
| (letFunc*)&stsStoreCache, (freeFunc*)&stsFreeCache, | ||
| (eqFunc*)&stsDumpCache); |
There was a problem hiding this comment.
suspicious cast: signature of stsDumpCache does not match eqFunc.
| } | ||
|
|
||
| /* Parse a file containing the structured typesetting rules. */ | ||
| int parseSTSRules(vstring format) { |
There was a problem hiding this comment.
Coding style: This function is way too long. It covers more than 300 lines. This exceeds the recommended max length of 20 lines by far, and its code is easily broken down into steps that can be moved into helper functions. https://stackoverflow.com/questions/475675/when-is-a-function-too-long
document pre/postconditions: example: stsFormat is both an input and an output variable, but that is not easily seen.
| g_outputToString = 0; | ||
|
|
||
| /* If the same format was already parsed, nothing to do. */ | ||
| if(strcmp(stsFormat, format) == 0) { |
There was a problem hiding this comment.
move the early out code to the beginning of the function where parameter checks usually take place
…amath-exe into tirix-structured-typesetting
| /* Dumpts the whole table */ | ||
| void htdump(hashtable *hashtable) { | ||
| print2("Hashtable %s:\n", hashtable->name); | ||
| //for(int bucket=0;bucket<hashtable->bucket_count;bucket++) { |
There was a problem hiding this comment.
This type of comment is not ANSI C compatible (see section 3.1.9). Use /* ... */ instead
There was a problem hiding this comment.
I thought we were agreed on C99 now? I don't mind seeing // creep in, even if we don't do any bulk conversions.
There was a problem hiding this comment.
Have we? Can you point me to where the decision took place?
There was a problem hiding this comment.
I'm thinking of #8 (comment) . In short: we already compile only on C99, even before taking into account the recent refactors. (I'm not opposed to pushing the minimum beyond C99 (i.e. C11), but I don't think we should consider ANSI C (C89) any more.)
There was a problem hiding this comment.
In C99 this issue is not relevant and can be ignored.
| { | ||
| if (start_position < 1) start_position = 1; | ||
| start_position--; | ||
| for(; occ > 0;occ--) { |
There was a problem hiding this comment.
Optimization: Use a dedicated loop variable in for loops. A compiler will allocate a CPU register for that.
for (long o = occ + 1; --o > 0;) {... (replace occ with o in loop)...}
is how I would write it.
Info: There is a tiny semantic change in my example: If the memory model of the computer supports signed magnitude instead of two's complement AND occ is MAX_LONG then (occ + 1)-1 == occ may not hold. We can safely ignore this nowadays.
There was a problem hiding this comment.
long is a signed type, and signed overflow is UB, so I think that this change is legal by the spec.
There was a problem hiding this comment.
@digama0 Exactly. Because occ+1 is UB when occ==MAX_LONG, --o may contain anything, and the loop starts with a random value. This cannot happen with the original code. We can safely ignore this semantic difference here.
| } | ||
|
|
||
| /* Search for the nth occurrence of string2 in string1 */ | ||
| long nmbrInstrN(long start_position, long occ, nmbrString *string1, |
There was a problem hiding this comment.
why is occ defined as long? Do you really expect more than two billion occurrences (or even 32000 should int be only 16 bit wide) of a substring in a string?
mmdata.c
Outdated
| } | ||
| } | ||
| if (found) { | ||
| start_position = i+1; |
| #define NO_LINKEDITEM -1 | ||
| linked *linkedItems; | ||
| int free_linkedItem; | ||
| flag htinit_done = 0; |
There was a problem hiding this comment.
make this a static variable within htinit(). It is completely private implementation detail in this function.
| return mmlLine; | ||
| } | ||
|
|
||
|
|
There was a problem hiding this comment.
change this to
#if 0
test code
#endif
to comply with ANSI C
|
I think there are already lots of ideas and issues for refactoring the source. In particular extracting code from long functions into auxiliary sub-function and documenting pre/postconditions can help with further review, since the source becomes a lot more readable then. In addition merge conflicts have to be resolved. |
|
Thank you very much @wlammen for your efforts reviewing my code! Indeed now, before this can be merged, I'll have to solve the conflicts with all the refactoring that's going on. |
|
I took care of merging this with master. @tirix , you should double check the last commit, which fixes a few warnings I was getting with the original version. |
|
Thank you Mario! |
I've tried to make as few changes as possible. The changes are in the following source files:
metamath.cfor the changes to theSHOW STATEMENTcommand and the newVERIFY STScommand, as well as the output post-processing,mmcmdl.cfor the newHELPcommand options, and the new commands,mmcmds.cfor the changes to theSHOW STATEMENTcommand at top level (distinct and dummy variables, syntax hints),mmdata.cfor some utility functions,mmhlpa.candmmhlpb.cfor the new built-inHELPoptions,mmhtbl.c, a new file with a hash table implementation,mmwsts.c, a new file with the main STS implementation,mmwtex.c, for the main hook into the STS formula output and for in-line comment formulas.This also adds a
.gitignorefile to ignore object files.