diff --git a/courses/RascalAmendmentProposals/RAP1/RAP1.md b/courses/RascalAmendmentProposals/RAP1/RAP1.md new file mode 100644 index 000000000..cef5c5d48 --- /dev/null +++ b/courses/RascalAmendmentProposals/RAP1/RAP1.md @@ -0,0 +1,183 @@ +--- +title: RAP 1 - Deployment of Rascal Packages +sidebar_position: 1 +--- + +| RAP | 1 | +| :---- | :---- | +| Title | Deployment of Rascal Packages | +| Author | Paul Klint, Jurgen Vinju | +| Status | Draft | +| Type | Rascal Infrastructure | + +## Abstract + +Rascal needs a mechanism to work with third-party Rascal packages (which may be standalone Rascal programs or Rascal libraries). This document describes the required structure of such packages, the required meta-data, and the standard way in which they are made available in a given Rascal installation. + +## Specification + +A *Rascal Package* is a self-contained entity that contains compiled Rascal code that can be distributed and used. Optionally it may also contain source code and documentation. + +A *Rascal Package Server* is a server that allows uploading, searching and downloading of Rascal Packages. There can be more than one server, but the list of servers has to be known at the moment of package installation. Initially, we provide a server at [r2d2.usethesource.io](http://r2d2.usethesource.io). Note that “r2d2” stands for “rascal release delivery and deployment”. + +A *Rascal Installation* is a computer with a file system containing all installed *Rascal Packages* which can depend on each other and on which *Rascal Source Projects* can depend. + +A *Rascal Source Project* is a directory structure with the source code of a Rascal-based project and enough metadata to be able to compile and run this source code on the developer's machine. 
+ +The life cycle of a Rascal Package consists of the following phases: + +* \[**Build**\]: Convert a *Rascal Source Project* to a deployable *Rascal Package*. +* \[**Deployment**\]: The *Rascal Package* is uploaded to a (public) *Rascal Package Server*. +* \[**Discovery**\]: The name and the keywords defined in the *Rascal Package* make it findable. +* \[**Installation**\]: Once found, the *Rascal Package* can be installed in a *Rascal Installation*. +* \[**Local Installation**\]: Installs a local \[**Build**\] of a *Rascal Source Project* directly as a *Rascal Package* in the local *Rascal Installation*, i.e. without first deploying it. This is for simulating and testing a release, and for being able to work on collections of dependent packages without having to release them officially all the time. +* \[**Configuration**\]: Once installed, a Rascal Package optionally needs to be configured with knowledge about the system it is installed in. An example configuration for a “gitar” package (git analysis in Rascal) would be the absolute location of the git binary tool. Configuration is a way to deal with non-Rascal, non-Java dependencies, but it can also be used to set preferences or tweak run-time parameters of a Rascal package. +* \[**Use**\]: From then on, the *Rascal Package* can be used as if it is an integral part of Rascal. +* \[**Update**\]: Update to a newer version (a nice-to-have combination of steps from \[**Discovery**\], \[**Installation**\] and \[**Use**\] for a new version of an already installed package). +* \[**Uninstall**\]: Remove the Rascal Package from the current Rascal installation. +* \[**Deprecate**\]: Mark the Rascal Package as “obsolete” on the *Package Server*. + +### Rascal Source Project Layout + +This is what Rascal source projects look like. 
The structure is standardized such that it is easy to package the required compiled information into a *Rascal Package* jar file later. + +* Have the structure as described below. +* Have a name that starts with a lowercase letter, say mylib. + +The following directory structure is required: + +| MyLib/src/mylib | All Rascal source files (possibly organized in a hierarchy) that form the library. This is optional, programs can also be deployed without source code | +| :---- | :---- | +| MyLib/META-INF | Directory with meta-data | +| MANIFEST.MF | Java-related (we depend on Maven for java dependencies) | +| RASCAL.MF | Rascal-related | +| MyLib/courses | Sources of courses and documentation | +| README.md | Brief README for the program | +| LICENSE | LICENSE file for the program | +| bin | A folder for compiled binaries | +| bin/\* | The compiled Rascal files are right there at the root | +| bin/courses | The compiled courses files are nested in \`courses\` | + +### Rascal Package Layout + +A Rascal Package is a jar file, with the following internal structure, with a name \-\.jar. The package layout reflects the layout of a *Rascal Source Project*, except that it contains binaries at the root to make them as quickly accessible as possible, and of course the source code is optionally there. + +The following directory structure is required inside of the jar file: + +| src | Optional folder containing all Rascal files | +| :---- | :---- | +| courses | Optional folder containing compiled course material | +| META-INF | Directory with metadata | +| MANIFEST.MF | Java related metadata file | +| RASCAL.MF | Rascal-related metadata file | +| README.md | Brief README for the program | +| LICENSE | LICENSE file for the program | +| .\* | All compiled binary files are nested here right at the root | +| CITATION | Optional example of how to cite this software release, i.e. 
a DOI or a simple ACM-style citation example to the paper which belongs to this software, or the software itself as published on Zenodo or arXiv | + +### RASCAL.MF File + +The RASCAL.MF contains the following attributes that are relevant for program deployment: + +* Name: The name of this program. +* Version: The version of this program. +* Synopsis: A brief description of what the package does. +* Web: An optional URL to point to the website of the project (perhaps the GitHub home page). +* Authors: The authors of the package. +* Maintainers: The current maintainers of the package. +* Dependencies: A list of other Packages this program depends upon together with their required version number. +* Java-Dependencies: TBD +* Keywords: A list of keywords that can be used to discover this program. +* Sources: The nested folders that contain root source folders, “src” by default. +* Include-Source: Boolean to indicate whether source code is to be included in the Package. +* Configuration: TBD, place to declare \[**Configuration**\] values, e.g. loc pathToGit=|file:///usr/bin/git|, str heap-size \= “1G” + +### Building and Deploying of a Package File + +Each library will be deployed as a single jar and adding a library to a given Rascal installation starts with this jar. + +* \[**Build**\] A shell command to wrap the current working project into a deployable jar file. + 1. Compiles all Rascal files in the project + 2. Compiles all documentation in the project + 3. **TBD**: what about Java class files? + 4. Copies the bin files to the target jar in the right place + 5. If source is distributed, URI locations pointing to the source code are adapted in the binaries to point to the jar (for the debugger) + 6. RASCAL.MF from the source project is adapted to reflect the content of the jar + 7. When the jar is finished a checksum is computed and stored next to the jar + + +* \[**Deployment**\] The packaged jar will be uploaded to a Rascal Package Server, e.g. 
r2d2.rascal-mpl.org/\-\.jar. This is simply an HTTP file server. It could also be our Nexus (which is indeed a file server), but we should not force people to use Nexus as their file server. HTTP should be the only requirement. Next to the jar should be a checksum file with the same name, i.e. \-\.sha1, and a synopsis (description) file, i.e. \-\.descr which could be the first sentence of the README.md or something like that (see \[**Discovery**\]) +* \[**Discovery**\] The Rascal shell provides a means to query the currently available libraries on a given server, i.e. by default r2d2.usethesource.io. It should also be possible to query the list of currently installed Rascal packages. E.g. “pkg find \[string\]” will find all packages matching the optional string name on the server or list them all, and “pkg list \[string\]” does the same for the local packages. The find and list commands also present the single-line synopsis next to the name and all available version numbers for the package. Find will also label already installed packages by a (\*) or something to make sure the user can see what already is there and what is available for download. +* \[**Installation**\] The Rascal shell offers an “install” command to download a library and install it in a local machine repository, i.e. \/.r2d2/myLibrary-1.0.0.jar. There is also a staging directory to make sure installation is an atomic operation, i.e. \~/.r2d2/staging. “pkg install \ \[version\]” will go through the following steps: + 1. Clear the staging directory (for possible left-overs from a failed previous install) + 2. Push this package name and server on the installation stack file in the staging directory file “TODO”. + 3. Download a package from the **server**, put it in the staging directory. + 4. Download the checksum file. + 5. Compute the checksum independently. + 6. Check the checksum, and bail out if it is broken. + 7. Check for \-\.deprecated file on the server. 
If deprecated, show the contents of the file and warn the user. Ask if this can be ignored, if \[yes\] continue, if \[no\] bail out (default is “no”) + 8. Download the LICENSE file from the library and put it in the staging directory + 9. Extract the Dependencies from the RASCAL.MF file of the project + 10. For each of the Dependencies: + * Check if already installed. + * Check if part of the current TODO stack, if yes bail out due to cyclic dependency. + * If not, go to step ‘a’ for the current package and finish the process recursively + 11. When the current package and all of its dependencies are downloaded as above: + * Collect the LICENSE files and ask the user to agree with them + * If \[no\] on one of the licenses, bail out. + * Check for LICENSE compatibility? + * If \[yes\] move all packages from staging directory to local repository directory. + * Unpack all jars in directories \-\; this will make nested jars more easily available on the Java run-time classpath and loading modules is a lot faster from a file system than out of a jar. + 12. If one of the dependencies bailed out due to checksum failure, failure to agree with the license, or failure to ignore deprecation, then report the reason for aborting and clear all files in the staging directory. +* \[**Local Installation**\] a variant of the above, where a *Rascal Source Project* is packaged and deployed directly in the local *Rascal Installation*. + 1. In this way larger projects can be split into reusable packages without having to publicly release ongoing work + 2. In this way we simulate the use of a deployed package as faithfully as possible on the developer's machine +* \[**Configuration**\] triggered after \[**Installation**\] but also re-configuration is possible after installation. + 1. The shell provides a way to trigger a number of configuration questions (e.g. as defined in RASCAL.MF (TBD)), e.g. “pkg config \” + 2. 
The result of configuration is stored in a file \~/.r2d2/\-\.config + 3. A Rascal standard library module provides access to the values in the config file +* \[**Dependency**\] by editing RASCAL.MF (or via a shell command to add it to the RASCAL.MF? “pkg add \ \”?), the name of a library is added to a project at Used-Libraries. This dependency declaration optionally contains a version number. If it does not, then the latest version (semantic versioning) is always added to the search path. + 1. This effectively adds the contents of jars to the libraryPath of the project in the order of occurrence in Used-Libraries. + 2. The process of adding to the libraryPath should check existence of the library and bail out with an error message if it is not there anymore (see \[**Uninstall**\]), or propose to “pkg install” it again automatically? +* \[**Use**\] the programmer imports modules from the libraries on the library path as if they were part of the current project. + 1. When the shell starts or when a dependency is added, the documentation of each library is also added to the help index. + 2. When the shell starts or when a dependency is added (see \[**Dependency**\]), the libraryPath of the shell is extended to make the library indeed importable. If the dependency does not exist in the local Rascal installation, then an error message should be reported +* \[**Uninstall**\] a “pkg uninstall \ \[version\] \[“rec”\]” removes a package and its dependencies from the local repository, unless they are used by another package. + 1. This can generate dangling references in source projects (see \[**Dependency**\]) + 2. This should fail if any of the installed packages in the local repository still depends on the package, unless they are on the list of packages currently being removed as well. + 3. The process should mimic the inverse of the \[**Installation**\] process and be made atomic as a whole, i.e. not take place at all if any of the packages cannot be uninstalled. + 4. 
The process should also remove local \[**Configuration**\] files (if successful) +* \[**Deprecate**\] Label an already deployed (versioned) package as “obsolete” or “deprecated”, with a reason (security issue, bug, license issue, not maintained anymore, superseded by another project, etc.) + 1. The file will not be removed from the server but... + 2. An additional file will be uploaded \-\.deprecated, which contains the reason for the deprecation. + 3. The \[**Installation**\] command will check for the deprecated file and show its contents, then ask if the user wants to force the installation or bail out. + +## Motivation + +* It is necessary for Rascal developers and users to deploy their Rascal code as a library for others to use in other Rascal and Java projects. This proposal tries to solve this problem. +* Rascal projects, especially library projects, need dependencies on Java libraries. Currently we solve this by including binary jars. This situation needs resolution, which can be done in the same proposal. + +## Rationale + +## Backwards Compatibility + +This is a new feature. + +## Reference implementation + +TBD + +* Can these features be built on top of an existing build and deployment system which is easy and ubiquitous enough (e.g. reuse mvn?), or should we simply make the download features from scratch? It’s not that much work to upload or download a jar. + +### References + +[https://www.haskell.org/cabal/proposal/index.html](https://www.haskell.org/cabal/proposal/index.html) + +[https://www.python.org/dev/peps/](https://www.python.org/dev/peps/) + +[https://medium.com/@sdboyer/so-you-want-to-write-a-package-manager-4ae9c17d9527\#.gxzx0aqv0](https://medium.com/@sdboyer/so-you-want-to-write-a-package-manager-4ae9c17d9527#.gxzx0aqv0) + +[http://yt-project.org/](http://yt-project.org/) + +[^1]: RAP is at the moment following Python’s PEP ([https://www.python.org/dev/peps/](https://www.python.org/dev/peps/)). We need to look at other projects to see what is best. 
See for instance, [http://yt-project.org/](http://yt-project.org/) diff --git a/courses/RascalAmendmentProposals/RAP10/RAP10.md b/courses/RascalAmendmentProposals/RAP10/RAP10.md new file mode 100644 index 000000000..593155d66 --- /dev/null +++ b/courses/RascalAmendmentProposals/RAP10/RAP10.md @@ -0,0 +1,139 @@ +--- +title: RAP 10 - Concurrent Source Location Access +sidebar_position: 10 +--- + +| RAP | 10 | +| :---- | :---- | +| Title | Concurrent Source Location Access | +| Author | Jurgen Vinju, Davy Landman | +| Status | Draft | +| Type | Rascal Language | + +## Abstract + +Rascal can be executed in JVM threads (as in the Eclipse context for example) and also we plan to add concurrency features to Rascal itself (RAP 8). This puts a lot more pressure on our IO mechanism than before, leading to races on disk and on other external resources identified by values of type \`loc\`. + +We propose to extend the URIResolverRegistry (which is Rascal’s generic resource access mechanism) with a cross-cutting “locking” feature that is safe (up to *unpredicted* aliasing of location URIs). + +A second part of the proposal is to expose this locking feature on the language level in Rascal using a structured programming concept. + +## Motivation + +* Many use cases of Rascal involve file IO + * Often in a dynamic context where multiple file processors read and write concurrently, such as the Eclipse IDE or an LSP server. + * More and more in a concurrent and even parallel context, where multi-core architectures are used to speed up larger computations +* File IO is hazardous in a concurrent context, due to race conditions +* So, we need some form of locking mechanism on file IO. +* The \`loc\` data-type and its uniform functionality across all kinds of external resources provides an ideal opportunity to solve locking in a uniform and safe manner. 
+ +## Specification + +* Understanding what locking means for parts of files identified by offset/length, line/column indexing is **not in scope** of this proposal: + * For now we assume the entire location is locked even if only a part of the file is specified + * A whole pandora’s box of **aliasing** opens when considering offsets and lengths, and simply tracking this may provide already a lot of overhead. Let’s leave this for now. +* File locking semantics is implemented on “physical” source locations, not logical ones: + * \`file\`, \`project\`, \`http\`, \`cwd\`, \`home\`, etc are physical locations on which locking can be applied directly + * \`java+method\`, is a logical source location. If file locking is applied to it, it will first be resolved to a physical location. + * This avoids aliasing which would break the locking semantics (i.e. locking a physical location while somebody else accesses the same resource using a logical location, or vice versa) +* Physical source locations also may have aliases, and **file locking will not be safe** in the presence of such aliases: + * Consider |home:///.bashrc| and |file:///Users/jurgenv/.bashrc|, + * Consider |project://rascal/src| and |file:///Users/jurgenv/Workspace/rascal/src| + * The proposed locking **may break on this** simply because the URIs are used to identify a file and not the “actual” file. That’s due to the whole design of abstracting from actual resource implementations via the URI schemes. + * To remedy this issue URI scheme implementations which “know” they are aliases (such as tmp, home and cwd, possibly “project”) **should resolve the problem internally** (see below) + * The designer of a scheme must realize if the URI they use are indeed “Unique Resource Identifiers” and if not they should consider all alternative names a resource might have, and make sure to lock those as well. + 1. **TODO**: this aliasing problem generates a lot of complexity. 
Alias graphs can be insidiously complex and lead to unexpected deadlocks. Below more consideration for this problem, but for now it is not completely clear how to deal with this issue safely. +* To implement locking across different file systems we propose an extensible mechanism, with a core locking feature and a plugin mechanism for additional effects: + * URIResolverRegistry will offer + 1. boolean lock(Object owner, ISourceLocation l) and + 2. void unlock(Object owner, ISourceLocation l) + * “Owner” identifies a computation “somehow”, i.e. via the identifier of a Thread or a ThreadPool computation name. + 1. The owner object is assumed to have a correct and complete implementation of the \`equals\` and \`hashCode\` methods which follow the JVM equals/hashCode contract. + * ISourceLocationLockProvider is a new (optional) interface to be implemented by classes which also implement ISourceLocationInputResolver or ISourceLocationOutputProvider (or both): + 1. lock(Object owner, ISourceLocation l) + 2. unlock(Object owner, ISourceLocation l) + * When \`lock\` is called on URIResolverRegistry, + 1. Logical locations will be resolved to physical locations first + 2. the registry will add the location to a global registry + * the registry itself is a singleton already + * It will do this in a thread-safe way (via a lock-free data-structure) + * The computation which is locking is identified somehow (see below “repeated locks”), for example using the Thread identity? DAVY? How to do this? + 3. All other read and write operations will guard for the lock by using the location as a semaphore. + 4. If a resource is a directory/folder, then it will guard all recursive children as well. + 5. 
After the global semaphore, and if an ISourceLocationLockProvider exists for the scheme in question, the \`lock\` method of this interface is also called + * For example, in the Eclipse context the \`project\` scheme will start locking files in the Eclipse way, such that other processes in Eclipse which do not use the URIResolverRegistry will also not race. + * Aliased locations the scheme knows of will be locked as well by the ISourceLocationLockProvider + * URIResolverRegistry will guard against **infinite aliasing loops** + * **TODO: how?** + * **TODO: complex scenarios involving aliasing can introduce deadlocks here because the locking of aliases is now implemented as a side-effect, and so it does not have a specified ordering among aliases. This can trigger (rare and hard-to-diagnose) deadlocks by out-of-order parallel locking. How to resolve this problem?** + * Add \`getAliases\` to ISourceLocationLockProvider? + * This could add the locations to the list to be locked and included in alphabetical order + * Still this can be messed up by subsequent calls to lock with different locs aliasing the same files in complex ways + * Maybe lock should receive a list of files, for which first all aliases are resolved, and then ordered alphabetically? + * When \`unlock\` is called on URIResolverRegistry: + 1. Logical locations will be resolved to physical locations first + 2. If an ISourceLocationLockProvider exists for the scheme in question the \`unlock\` method for this interface is invoked to retract external lock mechanisms + * Aliased locks should also be unlocked + * **TODO: deadlock alert (see above)** + 3. The registry will remove the location from the global lock registry in a thread-safe manner + + * Repeated locks: Locking an already locked file is allowed, also via recursive directory locks (outside in) + 1. The computation which has the lock can lock again with no effect + 2. 
Computations can only unlock locations they have locked, otherwise an exception is thrown. + 3. If a nested location is locked because a parent location is already locked, the nested location is registered separately and must also be unlocked separately. + 4. If an exact location is locked twice, the lock method will return \`false\` because the lock is ignored. This is to help terminate **infinite alias loops**. + * The **file** scheme and all of its derived schemes use the JVM file lock mechanism: + 1. This provides protection from different JVM processes fighting for the same file + 2. It’s not good enough for concurrent threads inside a single JVM, but this is covered by the URIResolverRegistry + * The file scheme has many derivatives, namely \`cwd\`, \`home\`, \`tmp\`, etc. + 1. Their ISourceLocationLockProviders can resolve the aliasing issue by locking the absolute file resource they are aliasing. + 2. We might change these to logical file locations, but that results in less portable code. For example, \`|home:///|\` would work on any machine after deployment, while if we always rewrite it to a \`file\` location the code using such a location would not be portable anymore to another machine. Same for \`cwd\` and \`project\` and \`tmp\`. It would defeat the purpose of these schemes, which is to provide transparent portable URIs for common locations on user machines. The aliasing issue *must* be resolved, however, for these schemes to continue to be useful. + * To access the locking feature from Rascal, we will not expose the lock and unlock functions, but provide a structured programming concept: + 1. Because locking always requires balanced lock/unlock and Rascal should be a “safe” language to program in. + 2. Because locking interacts heavily with control flow jumps (break, continue, throw, return) it requires language-level integration. 
+ * The alternative would be that programmers have to write a lot of try-finally blocks and introduce variables to manage the state. + 3. syntax Statement \= sync: Label label "sync" "(" {Expression ","}+ locks ")" Statement body + 4. Static semantics: the locks are all expressions of type \`loc\` + 5. Dynamic semantics: + * The loc(k)s will be ordered alphabetically first into a list + * Then URIResolverRegistry.lock will be called on each lock from left to right in the list + * Then the \`body\` is executed + * Then URIResolverRegistry.unlock will be called on each lock from right to left in the list + 6. The **return** and **throw** statements break out of the sync block and they **always unlock the resources** as if the block was fully completed. + 7. If **break** and **continue** jump out of a sync block (i.e. when sync is nested in a loop) then they also always unlock the resources as if the block was fully completed. + 8. **Special convenience syntax for locking on file collections**: + * sync(\*setOfLocs, myFileLoc, \*listOfLocs) + * The \* splices sets and lists of locations into the argument list for sync resulting in the elements of \[\*setOfLocs, myFileLoc, \*listOfLocs\]. + * After this the sync semantics is unchanged + * The syntax is *very* necessary: + * To avoid having to write a recursive function for every unbounded set of files to lock + * This recursive function would be **brittle** due to having to be careful in ordering the files consistently, even when the set of files is not exactly the same as another part of the program which is locking an extended subset of the same files. + * The spliced list is ordered consistently using a full and stable ordering mechanism to avoid all kinds of hideous implicit deadlocks. + * **Note that** locking lists of locs interacts with the URI **aliasing** issues, since different elements of the list might point to the same location, possibly via indirect aliasing paths. + 9. 
**Special convenience syntax for reading and writing locs**: + * \`loc ()\` will return a \`str\` with the contents of the file + * Example: str x \= \`|home:///.bashrc|();\` + * \`loc (str x, bool append=false)\` will write a \`str\` to the file (replacing it, or appending to it) + * Example: \`|home:///.bashrc|(“\#\! /bin/bash”);\` + * Motivation: if \`**sync**\` is builtin, and does not require importing an IO library, it would be inconsistent to have to import a module to read/write from/to source locations. + * See also **\[RAP 2\]** which avoids importing util::ValueUI and ParseTree by making types simulate parsing functions using the CallOrTree syntax. + * The CallOrTree semantics would be overloaded with one more feature, letting a location “act” as a read/write function. + * However, locations are **not** suddenly or accidentally also sub-typed of functions\! + * This is \_only\_ about overloading CallOrTree, unambiguously + * CallOrTree is already overloaded for locations with the offset-length and line/column notation, so no big deal here. +* Dealing with global variables and closures, the other sources of “side-effect” in Rascal can be taken care of by extending the \`sync\` statement a bit: + * syntax Statement \= sync: Label label "sync" Statement body; + * Semantics: \`sync\` without any location parameter synchronizes on the module instance which contains the sync block. + * This can be used in function bodies which use global variables to make these thread safe, as well as closures. + * Perhaps this should be a separate RAP, since other mechanisms could also be used to fix races on captured variables and globals, and there might be no need to add a feature for that in the language. Locations however are out of control of the language implementation and must be protected via a locking mechanism. 
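
The dynamic semantics of `sync` described above (order the locations, lock from left to right, run the body, unlock from right to left, also on `throw` or `return`) can be sketched in Java. This is only a minimal illustration under stated assumptions, not the proposed `URIResolverRegistry` implementation: the class `SyncSketch`, its methods, and the use of plain strings as stand-ins for `loc` values are hypothetical, and aliasing, directory locks and `ISourceLocationLockProvider` are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Supplier;

/** Hypothetical sketch of the `sync` statement; not the URIResolverRegistry API. */
final class SyncSketch {
    // One lock per location string; stands in for the global lock registry.
    private static final ConcurrentHashMap<String, ReentrantLock> LOCKS = new ConcurrentHashMap<>();

    /** Runs body while holding the locks of all given locations. */
    static <T> T sync(List<String> locations, Supplier<T> body) {
        // Sort and de-duplicate so that every caller acquires locks in the same order.
        List<String> ordered = new ArrayList<>(new TreeSet<>(locations));
        int acquired = 0;
        try {
            for (String loc : ordered) {
                LOCKS.computeIfAbsent(loc, k -> new ReentrantLock()).lock();
                acquired++;
            }
            return body.get();
        } finally {
            // Release from right to left, also when the body throws.
            for (int i = acquired - 1; i >= 0; i--) {
                LOCKS.get(ordered.get(i)).unlock();
            }
        }
    }

    /** True if some thread currently holds the lock for this location. */
    static boolean isLocked(String loc) {
        ReentrantLock lock = LOCKS.get(loc);
        return lock != null && lock.isLocked();
    }

    public static void main(String[] args) {
        String result = sync(Arrays.asList("file:///b", "file:///a"),
                () -> "locked a: " + isLocked("file:///a"));
        System.out.println(result);
        System.out.println("locked a afterwards: " + isLocked("file:///a"));
    }
}
```

Because every caller sorts its locations before locking, two computations that lock overlapping sets cannot each hold a lock the other is waiting for. Aliases break this guarantee, which is why the TODO items on alias ordering above matter.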
+ + +## Examples + +## Backwards Compatibility + +## Implementation + +## References + +* diff --git a/courses/RascalAmendmentProposals/RAP11/RAP11.md b/courses/RascalAmendmentProposals/RAP11/RAP11.md new file mode 100644 index 000000000..06cb2f391 --- /dev/null +++ b/courses/RascalAmendmentProposals/RAP11/RAP11.md @@ -0,0 +1,80 @@ +--- +title: RAP 11 - Better Datetime +sidebar_position: 11 +--- + +| RAP[^1] | 11 | +| :---- | :---- | +| Title | Re-implementation of Rascal Datetime functionality based on Java time, adding support for incomplete datetime information, and replacing offsets with zone ids as primary encoding | +| Author | Davy Landman, Jurgen Vinju | +| Status | Draft | +| Type | Rascal Language | + +### Issue + +1. The current date-time feature in Rascal (and vallang) is based on com.ibm.icu. The Java standard library has since caught up and now features excellent support via java.time, which was modeled on Joda-Time. +2. The current implementation does not support partial datetime information, for example missing a timezone offset. But a lot of data does not have this information and so it must be representable. Programmers should be able to choose how and when to complete the missing data. +3. Rascal does not support \`datetime\` without a date field, and it does not allow the programmer to test for this missing information either. +4. We currently only have offset information. This should be replaced with zone information, keeping zone offsets only to disambiguate duplicate local date-times, since offsets can change, especially for dates in the future. [More details](https://codeblog.jonskeet.uk/2019/03/27/storing-utc-is-not-a-silver-bullet). +5. We have limited libraries (or language support) to mutate zone / offset information. 
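
Items 2 and 4 above can be illustrated with `java.time`. The sketch below is hedged: the class `ZoneIdSketch` and the method `complete` are hypothetical names, and the code only shows how Java itself completes a zone-less `LocalDateTime`, not how Rascal would do it. It relies on `ZonedDateTime.ofLocal`, which keeps the local date-time, uses the preferred offset only when it is valid for the zone, and therefore needs the offset solely for the duplicated hour at the end of daylight saving time.

```java
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;

/** Hypothetical sketch: completing incomplete datetime data with a zone id. */
final class ZoneIdSketch {
    static final ZoneId AMSTERDAM = ZoneId.of("Europe/Amsterdam");

    /** Attaches the zone; the preferred offset (may be null) is only used if valid. */
    static ZonedDateTime complete(LocalDateTime local, ZoneOffset preferred) {
        return ZonedDateTime.ofLocal(local, AMSTERDAM, preferred);
    }

    public static void main(String[] args) {
        // Incomplete information: no zone and no offset.
        LocalDateTime incomplete = LocalDateTime.parse("2020-01-01T20:00:00");

        // The zone id alone determines the offset (CET, +01:00, in January).
        System.out.println(complete(incomplete, null));
        // prints 2020-01-01T20:00+01:00[Europe/Amsterdam]

        // An offset that is invalid for the zone is replaced by the valid one.
        System.out.println(complete(incomplete, ZoneOffset.ofHours(4)));
        // prints 2020-01-01T20:00+01:00[Europe/Amsterdam]

        // 02:30 on 25 October 2020 occurs twice in Amsterdam; only here
        // does the offset carry extra information.
        LocalDateTime ambiguous = LocalDateTime.of(2020, 10, 25, 2, 30);
        System.out.println(complete(ambiguous, ZoneOffset.ofHours(2)));
        // prints 2020-10-25T02:30+02:00[Europe/Amsterdam]
        System.out.println(complete(ambiguous, ZoneOffset.ofHours(1)));
        // prints 2020-10-25T02:30+01:00[Europe/Amsterdam]
    }
}
```

The choice of zone (`Europe/Amsterdam`) is made explicitly by the programmer here, in line with the requirement that missing data is never filled in heuristically.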
+ +Example of why [time zone ids](https://stackoverflow.com/tags/timezone/info) are preferred over zone offsets: + +A user reads data from a CSV file. It contains local date-times which they know should be interpreted in the `Brazil/East` time zone. The user has an outdated time zone database installed (a 3-year-old Java 8 installation, [for example](https://hi.service-now.com/kb_view.do?sysparm_article=KB0622033)). Since Brazil stopped with daylight saving time in November 2019, mapping a local date to a zone offset will be incorrect on the user's computer. It will map `2020-02-02 10:00:00` to `2020-02-02T10:00:00-02:00` (12:00 UTC). On this computer all will go well, but running the same script on a different computer (with an updated Java version) will translate it to `2020-02-02T10:00:00-03:00` (13:00 UTC). Now you could argue that locally this isn’t much of a problem, but as soon as you start exporting this data (and importing it somewhere else) problems start to emerge. If you encode dates as `2020-02-02T10:00[Brazil/East]`, no information is lost, and you can always (with the best locally available knowledge) translate this to different time zones or do date math on it. + +### Analysis + +1. Com.ibm.icu has bugs that Java time does not have; +2. And Java time is more “standard” +3. ICU is quite a big dependency, since it carries a copy of all zone information. +4. So it makes sense to move to Java time, as both a preventive and corrective maintenance task + +1. If one gets this datetime information in, say, a CSV file: $2020-01-01T10:00$ the information is **incomplete** + 1. We don’t know which absolute point in time this is, because a timezone is lacking + 2. Also it is imprecise, because the milliseconds offset is missing +2. Currently Rascal does not support such incomplete datetime information. We produce a parse error +3. We do want to be able to represent incomplete information about datetime +4. 
We do not want to heuristically fill in the missing data without the programmer’s intervention + 1. Downstream metrics (say time measurements) may become inaccurate (noisy) or even imprecise (off) if arbitrary offsets are introduced. + 2. Rascal/vallang is wysiwyg and filling in the missing offsets would not honor that design element +5. Much of the Java time library **needs complete information** to even work correctly. +6. There is now a question of + 1. How to represent incomplete datetime data + 2. How to fill in the missing offsets + 3. When to fill in the missing offsets + + + +### Solution proposal + +#### Complexity (why UTC or zone offsets are not enough) + +Date time is complex, please read the blog by Jon Skeet (StackOverflow fame & author of nodatime, a .net version of jodatime) why storing UTC is not the solution for future dates: [https://codeblog.jonskeet.uk/2019/03/27/storing-utc-is-not-a-silver-bullet/](https://codeblog.jonskeet.uk/2019/03/27/storing-utc-is-not-a-silver-bullet/). + +Rough summary: for datetimes in the past, it’s okay to convert them to UTC and store them that way, but for dates in the future, dates should be stored with the time zone code and the local date\&time for that zone. Note, not the zone offset, nor the utc, but the [time zone code](https://en.wikipedia.org/wiki/List_of_tz_database_time_zones) (like “Europe/Amsterdam”) since for example countries sometimes decide to move the day-light savings time a week earlier or later, so then the mapping to UTC changes. + +Note that even with a time zone code you need an zone-offset to disambiguate the overlapping times around the transition of summer to winter time. See the manual of [ZonedDateTime](https://docs.oracle.com/javase/8/docs/api/java/time/ZonedDateTime.html) for a good explanation. 
In general this + +#### Solution + +##### Types + +* Add \`time\` type for values that represent (absolute) time instants on any given day +* Add \`date\` type for values that represent a given day without a specific time +* Keep \`datetime\` type to represent values that have both absolute date and time + +##### Literals + +Change Rascal literals to also include a zone code. Sadly, ISO 8601 doesn’t contain a standard for it yet, so different encodings exist (sometimes as a separate field next to the datetime). [Java’s approach](https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html#ISO_DATE_TIME) is quite compact: + +`$2020-01-01T20:00:00+01:00[Europe/Amsterdam]$` + +Both the offset and the time zone are optional. You can also just give the timezone information. +Caveats: + +* `$2020-01-01T20:00:00[Europe/Amsterdam]$` is the same as `$2020-01-01T20:00:00+01:00[Europe/Amsterdam]$`, since Amsterdam is at UTC+01:00 in January +* `$2020-01-01T20:00:00+04:00[Europe/Amsterdam]$` will be corrected to `$2020-01-01T20:00:00+01:00[Europe/Amsterdam]$` + + + +[^1]: RAP is at the moment following Python’s PEP ([https://www.python.org/dev/peps/](https://www.python.org/dev/peps/)). We need to look at other projects to see what is best. See for instance, [http://yt-project.org/](http://yt-project.org/) diff --git a/courses/RascalAmendmentProposals/RAP12/RAP12.md b/courses/RascalAmendmentProposals/RAP12/RAP12.md new file mode 100644 index 000000000..558a96f30 --- /dev/null +++ b/courses/RascalAmendmentProposals/RAP12/RAP12.md @@ -0,0 +1,58 @@ +--- +title: RAP 12 - Separate String edit from Visit functionality +sidebar_position: 12 +--- + +| RAP | 12 | +| :---- | :---- | +| Title | Separate String edit from Visit functionality | +| Author | Jurgen Vinju | +| Status | Draft | +| Type | Rascal Language | + +### Issue + +1. Visit behaves very differently on strings than on other values. 
Namely, when a value of type string is reached, it generates all tails of the string and executes all patterns on these tails. + 1. This is considerably different from the visit behavior on other “containers” such as lists, sets and maps. There, visit simply traverses each element in isolation and not “all tails”. + 2. The fact that a string is treated as a container is inconsistent with other language features such as the generator \`\<-\` syntax. There we do not iterate over all tails of a list, for example. + 3. The implementation in the interpreter switches to this special string behavior if the **dynamic type** of the \_subject\_ is a string. This contradicts compiler behavior, which should select which kind of visit is necessary based on static information. + 4. If you are simply using visit to visit all sub-values, it can be very surprising that the string tail visiting behavior is triggered. For example, when a \`case value x: i+=1;\` simply counts all values in the tree, the total count would be augmented with an increment for every tail of every string (\!\!\!). This is counterintuitive. + + +### Analysis + +2. There seems to be a case of “too much overloading” of the visit syntax. We do need this kind of power to edit strings concisely, but editing strings is not simply a special case of recursively visiting a data-structure. For strings we need additional semantics (a cursor inside of the string). +3. It would be easy to separate the functionality into one \`edit\` statement and a \`visit\` statement, where \`visit\` would behave as before but not dive into tails of strings, and \`edit\` would behave as \`visit\` behaves now on strings (namely to visit all tails).
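The split described in the analysis can be simulated outside Rascal. The sketch below is written in Java purely for illustration; `EditSketch`, `Case`, `edit` and `visitWhole` are hypothetical names and not part of any Rascal API. It assumes that `edit` tries each case at a cursor that moves from left to right and continues after the matched substring, while a `visit` that no longer dives into tails only replaces a string when a case matches the entire value.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EditSketch {
    /** Hypothetical case: a pattern plus a literal replacement text. */
    static final class Case {
        final Pattern pattern;
        final String replacement;

        private Case(Pattern pattern, String replacement) {
            this.pattern = pattern;
            this.replacement = replacement;
        }

        static Case literal(String text, String replacement) {
            return new Case(Pattern.compile(Pattern.quote(text)), replacement);
        }

        static Case regex(String regex, String replacement) {
            return new Case(Pattern.compile(regex), replacement);
        }
    }

    /** Assumed `edit` semantics: try every case at the cursor, left to right. */
    static String edit(String subject, List<Case> cases) {
        StringBuilder out = new StringBuilder();
        int cursor = 0;
        while (cursor < subject.length()) {
            boolean matched = false;
            for (Case c : cases) {
                Matcher m = c.pattern.matcher(subject).region(cursor, subject.length());
                // lookingAt anchors the match at the cursor; empty matches are skipped
                if (m.lookingAt() && m.end() > cursor) {
                    out.append(c.replacement);
                    cursor = m.end(); // continue after the matched substring
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                out.append(subject.charAt(cursor)); // no case applies: keep the character
                cursor++;
            }
        }
        return out.toString();
    }

    /** Assumed `visit` semantics after the split: only whole strings can match. */
    static String visitWhole(String subject, List<Case> cases) {
        for (Case c : cases) {
            if (c.pattern.matcher(subject).matches()) {
                return c.replacement;
            }
        }
        return subject;
    }

    public static void main(String[] args) {
        List<Case> cases = Arrays.asList(Case.literal("a", "b"));
        System.out.println(edit("aaa", cases));
        System.out.println(visitWhole("aaa", cases));
    }
}
```

In this model `edit` on “aaa” rewrites every occurrence of “a”, while `visitWhole` leaves “aaa” unchanged and only rewrites the one-character string “a”.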
+ +### Solution proposal + +Introduce a new type of statement/expression for editing strings: + +result \= **edit**(subject) { + **case** “string” \=\> “string” + **case** /regex/ \=\> regex +} + +* The subject should (statically) be a string +* Cases of an edit statement should be either literal strings or regular expressions +* Each pattern is applied to the string starting from a cursor which moves from left to right through the string +* Substitutions replace only the matched substring, and editing is continued on the string that continues after the matched substring + +This is exactly the same as how visit currently works on strings. + +This example shows what would happen: + +result \= **edit**(“aaa”) { + **case** “a” \=\> “b” +} + +This would return “bbb”, since “a” matches at all cursor positions. + +Contrarily the following visit statement would behave differently: + +result \= **visit** (“aaa”) { + **case** “a” \=\> “b” +} + +Would return “aaa” since no value matched the “a” pattern. Only entire strings “a” would be replaced by “b” and strings such as “aaa” will remain unchanged. + diff --git a/courses/RascalAmendmentProposals/RAP13/RAP13.md b/courses/RascalAmendmentProposals/RAP13/RAP13.md new file mode 100644 index 000000000..97ad24b20 --- /dev/null +++ b/courses/RascalAmendmentProposals/RAP13/RAP13.md @@ -0,0 +1,94 @@ +--- +title: RAP 13 - Name-parametrized syntax role modifiers +sidebar_position: 13 +--- + +| RAP | 13 | +| :---- | :---- | +| Title | Name-parametrized syntax role modifiers | +| Author | Jurgen Vinju, Paul Klint, Tijs van der Storm | +| Status | Draft | +| Type | Rascal Language, Static & Dynamic Type System | + +## Abstract + +We want to introduce *syntax role modifiers* like `data[Statement]` , `syntax[Statement]` , `layout[Statement]` , and also “name-parameterized” modifiers like `data[&Name]`. 
+ +## Motivation + +This RAP solves 3 problems in one go: + +* The five syntax types \-**data**, **syntax**, **lexical**, **layout**, and **keywords**\- are *indistinguishable* from each other in Rascal source code because they all use identifier names (like `Statement`, `Expression` and `Declaration`). This leads to confusion since the types internally are different, more technically: they are *type*\-*incomparable* even if the names are equal*.* So, there exist equally looking types which are semantically different. Does not compute. +* Transformations between syntax types can not be expressed as function types, unless the types are defined in different modules (then module prefixes can be used). + * E.g. `Statement implode(Statement x)` **does** **not** make sense, while `data[Statement] implode(syntax[Statement] pt)` **does** make sense. + * Implicitly defined syntax types, such as by `implode` or `explode` can not be used at all; while with syntax role modifiers we could implement “name preserving” function types. + * `data[&N] implode(syntax[&N] pt)` would express a transformation from syntax tree to data tree while **preserving the name** of the type and changing its role modifier. +* These issues are made worse by the plan to introduce “concrete syntax for external parsers”, namely for every `data` type there will be an implicit `syntax` type. Before with “implode” we could still pass the grammar of the target type as a parameter \- `&T implode(type[&T], &U <: Tree input)` \- but with “explode” there will be no such grammar typed in by the user. Hence *the concrete syntax feature for external parser is impossible* without a solution for the above. It is not expressible in Rascal momentarily. +* This was already a hole in the type system, but we worked around it using modules and module prefixes, or judiciously not having two definitions of the same name in scope at the same time. 
It would be good to have this problem addressed, also independently of the concrete syntax urgency. + +## Specification + +No new **syntax** is necessary for the new feature, we reuse the parametrized alias notation: + +* Explicit notation: **`syntax`**`[Identifier]` , **`lexical`**`[Identifier]` , **`layout`**`[Identifier]` , **`keywords`**`[Identifier]` and **`data`**`[Identifier]` +* Parameterized notation: **`syntax`**`[&Identifier]` , **`lexical`**`[&Identifier]` , **`layout`**`[&Identifier]` , **`keywords`**`[&Identifier]` and `data[&Identifier]` +* Have to allow the use of these reserved keywords as “type names” before the parameters syntactically in the grammar. +* Do not have to reserve any new keywords + +The **semantics** of the **explicit** **notation** is “type alias”, where: + +* **`data`**`[Id]` is an alias for `Id` where `Id` is declared as **`data`** `Id =` ... +* **`syntax`**`[Id]` is an alias for `Id` where `Id` is declared as **`syntax`** `Id =` ... +* **`lexical`**`[Id]` is an alias for `Id` where `Id` is declared as **`lexical`** `Id =` ... +* **`layout`**`[Id]` is an alias for `Id` where `Id` is declared as **`layout`** `Id =` ... +* **`keywords`**`[Id]` is an alias for `Id` where `Id` is declared as **`keywords`** `Id =` ... + +With such explicit aliases, in case of an ambiguous name resolution for `Statement` where it is both in scope as a **`data`** and in scope as **`syntax`** type, the programmer can explicitly disambiguate by choosing either **`data`**`[Statement]` or **`syntax`**`[Statement]`. + +Since the above semantics is a simple matter of desugaring, there are no implications for the downstream type system. + +The **semantics** of the **parametrized** **notation** is as follows: + +* No other parameters are allowed instead of identifier names and type parameter names. I.e. **`data`**`[list[int]]` does not exist and it is not allowed. 
+* In pattern matching positions with **free variable \&N**, the type **`data`**`[&N]` matches only with abstract data-types, and binds `&N` to the subject type. We have analogous rules for syntax, lexical, layout and keywords. +* In pattern matching positions with **bound variable** **\&N**, or in the return type of functions where bound variables are instantiated, the data-type **`syntax`**`[&N]` instantiates to a syntax type **with the same name** as the type that `&N` was bound to. + * `&N` is statically known to have been bound to one of the syntax types: data, syntax, lexical, layout or keywords. Other possibilities are not allowed, to prevent such cases as **`data`**`[list[int]]` at run-time. + * Analogously for the other syntax roles, of course. + +## Backwards Compatibility + +* The additional type-checking rules are not necessary for run-time execution, unless a program actually breaks the rules. This means we can add the feature to the interpreter and compiler and then independently to the type-checker. +* There are no syntactic incompatibilities. +* All existing programs will work since they do not use this new feature. +* There may be an urge to change the type of `implode` to a “name-preserving” variant, but that would break a lot of existing code. If we want to write a “name-safe” implode function we had better come up with a different name or a different number of parameters. + +## Implementation + +Internally this could be implemented by a set of rewrite rules such as: + +* `parametrized-adt(“data”, [sort(“X”)]) = adt(“X”)` +* `parametrized-adt(“syntax”, [adt(“X”)]) = sort(“X”)` +* The semantics always ignores the modifier of the internal syntax type, but reuses its name to create an instance of the external type. + +And there are (at least) these details to consider: + +* Allow `data[..]` , i.e. reserved keywords as names of parametrized types. 
+* Implement syntactic sugar, where people use the notation rewrite it immediately to the respective type representation, using the above rules. +* Add additional checks for open modifiers like `data[&N]` during pattern matching: it may only match data-types with that specific role and not just the generic `node` or any other syntax type. +* Add specific instantiation semantics to parametrized data-types (i.e. the implementation of `Type.instantiate`), that implement the above rewrite rules (“lazy desugaring”). I.e. when `data[sort(“S”)]` is constructed reduce it to `adt(“S”)`. +* That’s all. + +## Alternative solutions + +* Could disallow identifiers for type names of syntax completely, and only allow `data[X]` instead of `X` . + * Pro: removes all ambiguity and introduces a one-to-one mapping from type literals to type instances + * Against: this makes a lot of code **longer and wider**, unnecessarily. I.e. `data Bool = and(data[Bool] lhs, data[Bool] rhs)` would not be so nice. +* Could make all concrete syntax untyped. So all instances of concrete trees would become of type `Tree` + * Pro: although this simplifies the introduction of concrete syntax, + * Against: it is also **messy** and the client code would not be very self-documenting. +* Could create Rascal code generators that produce modules with quasi syntax rules for every abstract data-type in a reified grammar, and then use module prefix disambiguation to select the concrete type over the abstract types in client code where needed. + * Pro: this could work, + * Against: but it is **ugly**, and we’d never be able to get rid of it anymore. Also it depends on how people accidentally write and import other modules. 
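The rewrite rules listed under Implementation can be modelled in a few lines. This is a toy sketch in Java, not the vallang or type-checker implementation: `SyntaxRoleSketch`, `Role`, `Named`, `matches` and `instantiate` are hypothetical names chosen for illustration. It assumes that instantiating a modifier ignores the role of the bound type and reuses only its name, and that an open modifier such as `data[&N]` matches only subjects that already have that role.

```java
import java.util.Objects;

class SyntaxRoleSketch {
    /** The five syntax roles named in this RAP. */
    enum Role { DATA, SYNTAX, LEXICAL, LAYOUT, KEYWORDS }

    /** Hypothetical stand-in for a named type such as adt("X") or sort("X"). */
    static final class Named {
        final Role role;
        final String name;

        Named(Role role, String name) {
            this.role = role;
            this.name = name;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Named)) {
                return false;
            }
            Named other = (Named) o;
            return role == other.role && name.equals(other.name);
        }

        @Override
        public int hashCode() {
            return Objects.hash(role, name);
        }

        @Override
        public String toString() {
            return role.name().toLowerCase() + "[" + name + "]";
        }
    }

    /** Free variable: modifier[&N] matches only subjects that have exactly this role. */
    static boolean matches(Role modifier, Named subject) {
        return subject.role == modifier;
    }

    /** Bound variable: drop the role of the bound type, keep its name. */
    static Named instantiate(Role modifier, Named bound) {
        return new Named(modifier, bound.name);
    }

    public static void main(String[] args) {
        Named sortX = new Named(Role.SYNTAX, "X");
        // models parametrized-adt("data", [sort("X")]) = adt("X")
        System.out.println(instantiate(Role.DATA, sortX));
        // data[&N] does not match a syntax type
        System.out.println(matches(Role.DATA, sortX));
    }
}
```

Because `instantiate` only accepts a `Named` value, a type such as `list[int]` cannot be passed at all, which mirrors the restriction that `data[list[int]]` must not exist.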
+ +## References + diff --git a/courses/RascalAmendmentProposals/RAP14/RAP14.md b/courses/RascalAmendmentProposals/RAP14/RAP14.md new file mode 100644 index 000000000..59c440d64 --- /dev/null +++ b/courses/RascalAmendmentProposals/RAP14/RAP14.md @@ -0,0 +1,157 @@ +--- +title: RAP 14 - Module Compatibility +sidebar_position: 14 +--- + +| RAP | 14 | +| :---- | :---- | +| Title | Module Compatibility | +| Author | Jurgen Vinju, Paul Klint, Tijs van der Storm, Davy Landman | +| Status | Draft | +| Type | Rascal Language, Static & Dynamic Type System, Runtime System | + +## Abstract + +The question this RAP answers is this: given two modules; L and C, where L is a “library module” and C is a “client” module that imports or extends L, **which changes in L do not break C**? + +In other words, under which circumstances is L’ (the next version of L) backwards compatible with L for *any* C’ that depends on L where C’ is equal to C only that it is configured now to depend on L’? + +## Motivation + +Backwards compatibility is important: + +* It prevents big-bang releases of an entire ecosystem. If a low utility project is updated with new features we need the old features to keep working for a while. Case in point is the Rascal standard library, which currently breaks everything (and always). +* It enables bootstrapping (it is a hard precondition), otherwise newer versions of the compiler could not ever run on older versions of the run-time. +* It allows clients to release their updates on their own time schedule. Otherwise every library would force their clients to update immediately, or become useless to their own clients. + +The urgency of this RAP is that: + +* We diagnosed that the current binary forms of TPL files are not backward compatible (yet). This will be fixed, and to provide the framework for testing these fixes, this document describes the semantics of backwards compatibility. 
The current document does not describe the current situation with the .TPL files but rather the envisioned situation of a correctly working Rascal compiler. +* The code generator part of the compiler is not in use yet; hence solidifying this part of the semantics is a good idea now. +* The number of Rascal library projects is currently exploding due to the integration with maven. +* Binary compatibility is a strong enabler for (faster) incremental compilation. It means that modules that are imported or extended can be recompiled, while their clients in binary form can remain the same. The new functionality will be merged at **link-time**, in our case by the JVM class loader mechanism and the Rascal module loader on top of that. + * Detecting binary incompatibility correctly is thus required for correct incremental compilation. The “damage” of an update can be limited to the client libraries that are impacted by a non-backward compatible change. + * There is also another “link-time” moment, when binary tpls are loaded while compiling sources of the depending project. This is also a moment where incompatibilities can be detected. +* Binary compatibility is particularly important in the *package ecosystem,* when in a dependency graph multiple versions of the same package appear. With binary compatibility, the run-time configuration can choose their latest versions as long as they are all binary compatible. Inversely, we can not force a clients “C” to re-compile a library L1 that they use because L1 accidentally uses L2’ which is only source compatible. + +## Specification + +This specification is inspired by the [Java Language Specification](https://docs.oracle.com/javase/specs/jls/se7/html/jls-13.html). Since the back-end of Rascal is generated Java code, we will inherit compatibility aspects from that. 
+ +* Since we generate a **very specific kind of Java**, backward incompatibilities can both be **extra** (on top of what is incompatible with Java) and **less** (some Java incompatibilities are unreachable from Rascal). +* In other words, we have to define Rascal module compatibility from scratch; the JVM semantics will come into play only on the implementation level of these concepts. + +**Definitions** + + +* **Breaking module changes:** For a module to “break” means that clients do not compile anymore without errors, or throws exceptions during linking or loading. + * Semantics may change, for example new exceptions may be thrown at run-time, but we still accept the change as non-breaking. + * The “breaking” is an aspect of the *external interface of a module*. It should still work well enough to be called and interacted with, following the interfaces of the previous version. + * No more demands are made on non-breaking module updates. +* A **source module** M is represented by the Rascal source code of a module in a **.rsc file** +* A **binary module** M is represented by these files: + * A **.tpl file** in a target folder or jar file + * A **.class file** in a target folder or jar file + * A **.constants** file in a target folder or jarfile + * It is assumed that Rascal guarantees these 3 files to always be in-sync. +* One binary module is always generated from one source module +* A library module L’ is either **import-compatible** or **extend-compatible** (or both) with its previous version L. + * Import-compatibility implies that all client modules C that *import* L do not break with L’. + * Extend compatibility implies that all client modules C that *extend* L do not break + * The concepts are different because `extend` looks inside almost every aspect of a library module, while import is (more of) a black box reuse mechanism. Also extend is transitive and import is not, which is reflected in the semantics of compatibility below. 
+* A library module L’ is **source-compatible** or **binary compatible** (or both) with its previous version L: + * A client module C does not have to be re-compiled to work with L’ if L’ is binary-compatible + * Every client module C has (only) to be recompiled (against the binary module) L’, without any changes to the source module C, for L’ to be source-compatible with any C. +* x-Incompatibility is the logical inverse of x-compatibility, with x from {import, extend, source, binary} +* In the cartesian product *{source,binary} x {import, extend}* are four semantically relevant compatibility situations. + * It is good to remember that an library writer can not decide whether the client will extend or import their module, + * and vice versa, a client writer can not decide whether the next version of the library they use is binary or source compatible. +* **Formal compatibility** versus **actual compatibility** + * **Formal x-compatibility** of L/L’ reasons about all the possible (hypothetical) clients C that would upgrade to L’ + * **Actual x-compatibility** of L/L’ reasons about the actual real uses of L in all existing real-world clients of L that would upgrade to L’ + * The current document is exclusively about **formal x-compatibility\!** +* The link between compatibility and **semantic versioning** is that incompatible projects must update their major versions, and 0.x projects their minor versions *always* when they have incompatible modules. + * This will allow checkers to warn early for breaking combinations of packages and their versions. + * This will allow the run-time system to assume the latest minor/patch version within each major version to always be compatible with each other and load these in case of conflicts in the search path. + +**Implications** + +Given the syntax and semantics of Rascal, here we list concrete changes to library modules that would imply source or binary incompatibility. 
+ +* Source-incompatibility always implies binary incompatibility for Rascal. +* L and L’ are always assumed to be 100% statically correct when reading what comes after. All situations in which L or L’ are not statically correct are filtered by the static checker and thus render any discussion of compatibility moot. Correctness means that info and warning messages are allowed but not errors. +* L is extend-x-incompatible if one of its own imports is import-x-incompatible or extend-x-incompatible; + * hence extend-x-incompatibility is transitive over the inverse extend relation. +* L is extend-x-incompatible (at least) if L is import-x-incompatible. +* L is import-x-incompatible if: + * At least one alternative of an overloaded function is removed.[^1] Removing alternatives breaks the dispatch function inside the client module C for the overloaded functions. After RAP 6 this would only happen in modules that *extend* the broken module. + * At least one alternative of an overloaded function is added.[^2] Adding alternatives breaks the dispatch function inside the client module C for the overloaded functions. After RAP 6 this would only happen in modules that *extend* the broken module. + * An alternative constructor of an ADT is removed (adding is fine) + * An alias definition is removed (adding is fine) + * A public global variable is removed (adding is fine) + * A java function changed; its @javaClass tag points to another JVM class + * A rascal function is changed to a java function with a @javaClass tag. + * A positional parameter is added to or removed from a function + * A keyword parameter is removed from a function (adding is ok) + * A positional field is added to or removed from a constructor + * A keyword field is removed from a constructor, but only if a common keyword field with the same name does not still exist + * A common keyword field is removed from a data declaration. 
(adding is ok) + * Renaming keyword parameters or positional parameters in functions, constructors and syntax definitions (since they are part of the API of their respective type/function) + * Module name changes + * Fully qualified module name changes. +* L is additionally extend-x-incompatible (not simply due to import-x-incompatibility or incompatibility of imported or extended modules) if: + * An alternative (or more) is added to an overloaded function via + * A normal addition typed into the current module + * Extending a new module that has the same (overloaded) function as in the current module, or another extended module + * Importing a new module (as above). *Note that this behavior would change if we apply the simplifications of RAP 6* + * An alternative (or more) is removed from an overloaded function via: + + + + * Having been removed from an extended module, or not extending said module anymore. + * Having been removed from an imported module, or not importing said module anymore (see also RAP 6\) +* Conversely, this is a list of changes that should be import-compatible and extend-compatible: + * Adding non-functional tags to functions (like @synopsis) + * Changes to private functions are always import-compatible, but not extend-compatible + * Changing the values of non-functional tags (like [Design Sketch for Nescio](https://docs.google.com/document/u/0/d/1p931tsUqSMZAvh77Vx_FXQCaVP_rjXu4nak6_zJPVwQ/edit)) + * Changing the order of declarations in a module (functions, data, syntax, aliases, imports, extends) + * Adding new declarations + * functions (with new names or different arities; this excludes overloading changes from above), + * data-types, + * constructors, + * aliases + * tags + * syntax definitions + * Imports (unless overloads are affected, see also RAP 6\) + * Extends (unless overloads are affected) + * Changing the bodies (expressions and statements) of functions + * From statements to expression or back + * Throwing exceptions or not + 
* Adding or removing comments + * Changing algorithms + * Changing the order of keyword fields of constructors, unless there is a data-dependency from one field to another. The fields must depend on earlier (left) fields (they do due to the above 100% correctness guarantee of L’). + * Changing the order of keyword parameters of functions, unless there is a data-dependency from one parameter to another. The parameters must depend on earlier (left) parameters (they do due to the above 100% correctness guarantee of L’). + * Changing the source file location on disk, as long as it remains the same relative to the `srcs` path in pathConfig. +* X-binary-incompatibilities that are not x-source-incompatibilities (x is import or extend): + * *This is when simple recompilation fixes the problem without changing client source code* + * When one of the double (or more) declarations of a keyword field with *different* *defaults* is removed but at least one remains unchanged ??? TODO + * When extended modules change (transitively) in an x-x-incompatible way, but the module itself needs no changes. + * When imported modules change in an x-x-incompatible way, but the module itself needs no changes (see also RAP 6\) +* Incompatibility aggregation: + * A project P’ is x-incompatible with its previous version P when at least one of its modules L’ in P’ is x-incompatible with the corresponding L in P. + * If P’ is extend-incompatible but import-compatible, this is a valuable distinction to communicate since most libraries are used via import and not extend. + * After RAP6 this will happen much more often. + + + + + +## Implementation + +## Alternative solutions + +## References + +[^1]: Note that RAP 6, when implemented, would remove this incompatibility. It would become extend-incompatible with alternative removal only. + +[^2]: Note that RAP 6, when implemented, would remove this incompatibility. It would become extend-incompatible with alternative addition only. 
diff --git a/courses/RascalAmendmentProposals/RAP15/RAP15.md b/courses/RascalAmendmentProposals/RAP15/RAP15.md new file mode 100644 index 000000000..c556d096f --- /dev/null +++ b/courses/RascalAmendmentProposals/RAP15/RAP15.md @@ -0,0 +1,119 @@ +--- +title: RAP 15 - Conditional Patterns to avoid non-linear matching +sidebar_position: 15 +--- + +| RAP | 15 | +| :---- | :---- | +| Title | Conditional Patterns | +| Author | Jurgen Vinju | +| Status | Draft | +| Type | Rascal Language, Static & Dynamic Type System, Runtime System | + +## Abstract + +The proposal is to extend the Pattern notation like so: + `syntax Pattern = Pattern pattern “if” Expression condition` + +The semantics is that if the `pattern` matches then we check if the boolean `condition` evaluates to true in the current environment (which includes optional bindings from the current pattern). + +And to introduce short-hand notation for all binary predicate operators (==, \!=, :=, \!:=, \<, \<=, \>=, \>) like so: +`syntax Pattern = Pattern pattern “==” Expression value` + `| Pattern pattern “!=” Expression value` + `| ...` + +And the short-hand expansion is: +`p == v` expands to `tmp:p if tmp == v` + +## Motivation + +This RAP solves *a number of issues* in the design of Rascal’s syntax and semantics in one go: + +* **Non-linear patterns (reused pattern variables)** often accidentally capture variables of the outer scope, leading to unexpected failure. This is one of the oldest issues in the design and has been mitigated by the checker providing info/warnings when it happens. +* When-clauses are written out of the execution order +* When-clauses are not allowed for functions with statement blocks as bodies +* Randomized input for tests has to be filtered inside of tests, often returning `true` when the test input is invalid (and wasting opportunity for the valid input). + +The urgency of this RAP is *low*. 
However, the non-linear matching issue has waited for more than 10 years now and it is still a weekly cause of time loss debugging this trivial issue in new code. + +Also there are other ways to solve the nonlinear matching problem. For example by introducing specific syntax for it in the Pattern notation. The benefit of this new notation is that it fixes the non-linear matching problem but also other problems with “when” and with random tests. + +**It would be less nice if we have to add conditional patterns anyway and have added yet another syntax for non-linear matching.** + +## Specification + +`syntax Pattern = Pattern pattern “if” Expression condition` + +Static semantics: + +* The pattern may bind variables that are used in the condition +* The condition may bind variables *inside* of it, but does not leak new variables beyond the current pattern. +* The condition must be of type `bool` + +Dynamic semantics: + +* First the pattern is matched against the current subject; with optional bindings as a side-effect. Note that if the pattern is not singular, backtracking may occur later. +* Then a new backtracking and binding scope is wrapped around the evaluation of the condition (such that complex back-tracking conditions can introduce variables and be cleaned up nicely) +* The condition finds the first way to evaluate to true (includes possible backtracking over the original pattern, but certainly also over non-singular parameters of && and ||. +* If there is no way to evaluate the condition to true, the entire pattern fails. +* Otherwise the pattern succeeds and continues with the bindings introduced by the pattern side (and drops the new bindings introduced by the condition side). + +Test semantics + +* Tests with a conditional pattern formal will try to generate a satisfying instance for a specific amount of maximum tries. 
This max counter is different from the current counter which generates a number of bindings for all parameters (usually 100^n where n is the number of parameters). +* If a condition fails, the test input generator will try again +* If a condition succeeds, the test input generator will move on to the next parameter, or start the body of the test function. + + +Short-hand notation + +* Expanding `p == v` to `tmp:p if tmp == v` for every binary operator is simple. +* Note how `_ == a` and `b := a` can be used for two different types of non-linear match. The first matches with equality (including keyword fields). The second matches with equality-modulo-keyword fields. + +New static restrictions on patterns: + +* Implicit nonlinear patterns are “duplicate declaration” errors: + * so: `and(x, x)` is an error (the second x is a duplicate declaration) + * Should be written as `and(x, _ := x)` or `and(x, _ == x)` +* All variable introductions in patterns are “fresh” from now on, so the static semantics of: + * `int x` is the same as + * `x` + * Except for the possibility of type inference in the latter. + +## Examples + +``` +// "dependent types" +int fac(0) = 1; +int fac(int n > 0) = fac(n - 1) * n; + +bool evenOdd(int E if E % 2 == 0, + int O if O % 2 == 1) = true; + +default bool evenOdd(int _, int _) = false; + +// remove duplicates with list matching +Bool and([*Bool x, Bool a, *Bool y, _ == a, *Bool z]) + = and([*x, a, *y, *z]); +``` + +## Implementation + +This simplifies the implementation of the type-checker and the interpreter alike. Given the fact that we have had the informational message for non-linear matches for so long, we can change it to a warning as soon as we have added the conditional patterns. + +1. First we add the conditional pattern notations and the short-hands +2. Then we add warnings: + 1. change the current informational message on nonlinear matches to a warning with suggestion for refactoring. + 2. 
Add warnings for all “when” clauses to be refactored to conditional patterns (if possible) + 3. Add warnings for tests that use `==>` for input filtering or `if(expression) return true;` +3. Then, in time, we remove the old semantics: + 1. Make non-linear matches produce double declaration errors + 2. Remove syntax and semantics of `when` + +## Alternative solutions + +* Think of another syntax for non-linear matching and forget about all the other possible conditions. +* Introduce a type system with dependent types (the current solution comes really close to a *dynamic* dependently typed system; a static dependently types system is much more involved and also hard to work with for programmers since nothing runs until it type-checks). + +## References + diff --git a/courses/RascalAmendmentProposals/RAP2/RAP2.md b/courses/RascalAmendmentProposals/RAP2/RAP2.md new file mode 100644 index 000000000..16fb02077 --- /dev/null +++ b/courses/RascalAmendmentProposals/RAP2/RAP2.md @@ -0,0 +1,90 @@ +--- +title: RAP 2 - Types are Parsers +sidebar_position: 2 +--- + +| RAP | 2 | +| :---- | :---- | +| Title | Types are Parsers | +| Author | Jurgen Vinju | +| Status | Draft | +| Type | Rascal Language | + +## Abstract + +The goal is to remove a number of features for calling parsers of different types and unify them in a single concept. **This concept is that any Rascal type represents a parser and we should be able to \`call\` this parser using a unified syntax.** We propose to not introduce new syntax at all, but to reuse the function call syntax. To parse using a Rascal type, the function name is represented by a reified type, like so: + +* \#int(“123”) + * parses the string “123” to the integer \`123\`. + * based on rascal expression parser or rascal-value parser +* \#start\[CompilationUnit\](|home://myCprogram.c|) + * uses the parser generated for C compilation units to parse a given C program in the given file. 
+ * Based on (data-dependent) Rascal-generated parser + +Library functions like \`parse\` and \`readTextValueString\` will become deprecated after this feature has been added. Also the “as type” expression notation is redundant after this addition. + +We will have fewer library functions and fewer expression syntaxes after this, without loss of functionality. + +Parsing functions, as generated by types here, have parameters. The first and only positional parameter is what they will parse. The other parameters will be keyword parameters to provide options to the parse, such as \`allowAmbiguity\` and \`src\` and to provide contextual data to data-dependent parsers (when they are added). + +## Motivation + +* There are several ways of calling a parser right now in Rascal, none of them unique in terms of features, just different in syntax. It’s unnecessary. E.g. parse functions, read functions like readTextValueString, the \[AsType\] syntax, pattern matching. Is there more? + * Each has its own syntax + * Each has its own way of dealing with exceptions + * Each has its own configurability +* Non-terminals from syntax definitions are parsers conceptually, and very much so in the top-down sense we use them now. When they become parametrized, this only adds to the feeling of a non-terminal “being” a parser function. +* We often need to build builtin values from string input, but there is no convenient notation to do so, e.g. readTextValueString(...) 
+ +## Specification + +No new syntax is necessary for the new feature: + +* syntax Expression \= Expression “(“ {Expression “,”}\* “)” // callOrTree exists already +* syntax Expression \= “\#” Type // type reification exists already + +The new semantics is for the callOrTree syntax, statically we also allow these applications now: + +* \ ( loc input, bool allowAmbiguity \= false ) +* \ ( str input, loc src \= |unknown:///|, bool allowAmbiguity \= false ) +* Both expressions return a value of type \`\&T\` (as instantiated) when successful, or they throw a syntaxError or a validationError exception value (see below) + +Semantically, the parser which will be called switches on the kind of type and the kind of input: + +* \[builtin, text\] For builtin types such as list, int, applied to a text file or string, the Rascal text value parser will be called; +* \[builtin, bin\] For builtin types applied to a binary serialized file, the Rascal binary value parser will be called +* \[data, text\] the Rascal text value parser will be called, and the value will be validated against the abstract syntax definition defined by the given reified type. Note that this gives rise to programmatically constructed abstract grammars which are used as parsers. 
+* \[data, bin\] the Rascal binary value parser will be called and the result will be checked against the expected top-level type +* \[syntax, text\] + * if the input does not start with “appl(” then a generated parser will be called to parse the text using the given non-terminal/grammar in the reified type + * If the input is an already serialized parse tree, the code will use the values reader and validate if the top-type is indeed the expected non-terminal or throw a validation exception +* \[syntax, bin\] the input is an already serialized parse tree in binary input, the code will use the values reader and validate if the top-type is indeed the expected non-terminal or throw a validation exception + +Exceptions thrown by the new expression: + +* syntaxError(loc src, str cause \= “”) // explains a syntactical problem with its exact location and a probable cause if possible. Note that we had parseError before and it may be good to not change this at the same time +* validationError(loc src, str cause= “”) // even though parsing was successful, the resulting structure did not match the expected type + +## Backwards Compatibility + +* The new feature is designed to simulate the old semantics of these things exactly, with a different syntax, i.e. semantically backward compatible: + * AsType expressions + * Parse functions in ParseTree.rsc and Prelude.java + * ReadValueFrom… +* The notable problem is exception semantics. The new notation will throw only parseError and validationError, which is different from the previous parsing API +* The new feature is syntactically different, so the old features must be labeled “deprecated” for a while and co-exist with the new feature +* A simple refactoring or quick-fix tool can be provided to translate the old notations to the single new notation. 
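
The parse-then-validate behaviour and the two exception kinds described in the Specification can be modelled outside Rascal. The following Python sketch is only an illustration under stated assumptions: `parse_as`, `ParseSyntaxError` and `ValidationError` are hypothetical names standing in for the reified-type call, `syntaxError` and `validationError`, and Python's `ast.literal_eval` merely plays the role of the Rascal text value parser.

```python
import ast


class ParseSyntaxError(Exception):
    """Stand-in for the proposed syntaxError: the text itself is malformed."""


class ValidationError(Exception):
    """Stand-in for the proposed validationError: parsed fine, wrong type."""


def parse_as(expected_type, text):
    """Model of `#T(text)`: parse a literal, then validate it against T."""
    try:
        # ast.literal_eval plays the role of the Rascal text value parser here.
        value = ast.literal_eval(text)
    except (SyntaxError, ValueError) as error:
        raise ParseSyntaxError(str(error)) from error
    # Exact type check, so that Python's bool does not pass for int.
    if type(value) is not expected_type:
        raise ValidationError(
            f"expected {expected_type.__name__}, got {type(value).__name__}"
        )
    return value
```

In this model `parse_as(int, "123")` corresponds to `#int("123")`: malformed input raises the first exception, while well-formed input of the wrong shape raises the second.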
+ +## Implementation + +We envision one or two RascalPrimitives to cover for the different kinds of types provided to the callOrTree: + +* Splitting out to more specific functions at compile time (primitive, data or syntax) +* Each RascalPrimitive will have to dynamically dispatch based on the content (str, loc, bin or text) + +The type checker should add specialized semantics for callOrTree expressions with reified type as “function”, treating them in effect as calls to the old parse function for example. + +The compiler should translate the expressions to direct calls to the above RascalPrimitives + +## References + diff --git a/courses/RascalAmendmentProposals/RAP3/RAP3.md b/courses/RascalAmendmentProposals/RAP3/RAP3.md new file mode 100644 index 000000000..509c5ebe4 --- /dev/null +++ b/courses/RascalAmendmentProposals/RAP3/RAP3.md @@ -0,0 +1,53 @@ +--- +title: RAP 3 - Concrete Patterns for External Parsers +sidebar_position: 3 +--- + +| RAP | 3 | +| :---- | :---- | +| Title | Concrete Patterns for External Parsers | +| Author | Jurgen Vinju | +| Status | Draft | +| Type | Rascal Language | + +## Abstract + +We use external compiler and IDE front-ends to lift on community efforts of constructing high quality language processors. We use Rascal to process the abstract syntax trees which are produced but we can only match using abstract patterns on these currently. A system by Arjan Mooij, internally at ESI, elegantly uses the CDT (Eclipse C parser) to parse concrete syntax strings and generate AST patterns from them. A similar feature can be added to Rascal, such that we can use elegant concrete syntax patterns on otherwise abstract syntax trees as well. + +## Motivation + +* Concrete syntax patterns are much more concise and also more independent of underlying tree structure +* Concrete syntax patterns are easier to write for domain experts +* We make good use of external parsers + +## Specification + +We have alternatives for implementing this feature. 
Currently concrete syntax is written as follows: + +(NonTerminal) \`concrete-syntax-string-with-holes\` + +In this example we use the NonTerminal as a parser to parse the concrete-syntax string at compile-time. The holes are replaced with simple unique placeholders before parsing, and after parsing the resulting parse tree is changed to put the original holes back. Then the pattern interpreter or pattern compiler goes to work to translate the tree to either a pattern matching automaton or a constructor tree if the pattern is at an expression location rather than a pattern matching location. + +We propose to generalize this notation to allow any string function to be applied to the concrete syntax fragment, like so: + +data Exp \= … ; // an abstract data definition or any other type +Exp javaExp(str x, loc l); // given this function which can parse a Java expression string to an abstract data-type + +(javaExp) \`1 \+ 1\` // a concrete Java expression which will be parsed by the javaExp function (at compile time) + +The semantics would be that the normal string analysis and substitution takes place to replace the holes with simple placeholders, then the string is passed to the given function, then the resulting value is visited to replace the placeholders with the holes again. The resulting value is a normal Rascal pattern which can be further processed by the interpreter or compiler. + +The implicit constraint is of course that the same parser is used to parse pattern strings as the parser which is used to parse subject programs to match against the patterns. This will remain a semantic constraint which is enforced by the programmer manually. However, due to Rascal’s type system, you can of course only match against patterns of the right type (i.e. pattern and subject must have comparable types statically). 
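
The three steps described above (replace holes by placeholders, call the external parser, put the holes back) can be tried out in a small model. The Python sketch below is not the proposed Rascal implementation: it assumes a hypothetical hole syntax `<name>`, uses Python's own `ast` module as the "external parser", and represents trees as nested tuples. The names `parse_pattern`, `parse_expression`, `to_term` and `match` are invented for this illustration.

```python
import ast
import re

HOLE = re.compile(r"<(\w+)>")  # assumed hole syntax for this sketch: <name>


def to_term(node, holes):
    """Convert a Python AST into nested tuples, restoring holes on the way."""
    if isinstance(node, ast.Name) and node.id in holes:
        return ("hole", holes[node.id])
    if isinstance(node, ast.AST):
        children = [to_term(value, holes) for _, value in ast.iter_fields(node)]
        return (type(node).__name__, *children)
    if isinstance(node, list):
        return [to_term(item, holes) for item in node]
    return node  # plain leaf such as a number, a string or None


def parse_expression(text, holes=None):
    """The 'external parser': Python's own expression parser."""
    return to_term(ast.parse(text, mode="eval").body, holes or {})


def parse_pattern(text):
    """Replace holes by placeholders, parse, then put the holes back."""
    holes = {}

    def placeholder(found):
        key = f"__hole{len(holes)}__"
        holes[key] = found.group(1)
        return key

    return parse_expression(HOLE.sub(placeholder, text), holes)


def match(pattern, subject, bindings=None):
    """Return the hole bindings if `subject` matches `pattern`, else None."""
    bindings = {} if bindings is None else bindings
    if isinstance(pattern, tuple) and pattern[:1] == ("hole",):
        # A repeated hole simply rebinds; non-linear matching is out of scope.
        bindings[pattern[1]] = subject
        return bindings
    if isinstance(pattern, (tuple, list)):
        if type(pattern) is not type(subject) or len(pattern) != len(subject):
            return None
        for sub_pattern, sub_subject in zip(pattern, subject):
            if match(sub_pattern, sub_subject, bindings) is None:
                return None
        return bindings
    return bindings if pattern == subject else None
```

Because pattern and subject go through the same parser, `match(parse_pattern("1 + <x>"), parse_expression("1 + 2 * 3"))` binds `x` to the tree for `2 * 3`, which mirrors the constraint that pattern strings and subject programs must be parsed by the same parser.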
+ +## Backwards Compatibility + +* Since functions were not allowed in this position before, this extension should not affect existing code. + +## Implementation + +* The challenge is to lift the value which is produced by the external parser back to a pattern expression in the interpreter and then nest the placeholders back in. +* With the concrete syntax feature we have a similar issue regarding the syntax trees, which will have to be implemented differently. +* We could also support pattern compilation/interpretation for values, mimicking the expression semantics; that would be a quicker hack perhaps. + +## References + diff --git a/courses/RascalAmendmentProposals/RAP4/RAP4.md b/courses/RascalAmendmentProposals/RAP4/RAP4.md new file mode 100644 index 000000000..1edebe2e1 --- /dev/null +++ b/courses/RascalAmendmentProposals/RAP4/RAP4.md @@ -0,0 +1,325 @@ +--- +title: RAP 4 - Rascal Function Semantics +sidebar_position: 4 +--- + +## Abstract + +Rascal functions are special, different from functions, procedures or methods in other kinds of functional and procedural languages. Rascal functions are much like term rewriting rules instead. Yet we see Rascal as a functional/imperative programming language with a static type system. This small document clarifies what Rascal functions are, summarizing a (hopefully) coherent design story. + +We explore how their syntax, static and dynamic semantics work, with examples and counter-examples. The issue this document tries to clear up is the static and dynamic semantics of function calling, including types of functions, and their sub-typing relation, which influences pattern matching, which influences function dispatch. + +These four language concepts (functions, sub-typing, pattern matching and function dispatch) are intertwined and mutually dependent. Therefore this document aims at describing these concepts in concert, avoiding completeness where possible to keep the document concise, by employing abstraction and selection combined with the occasional hand-waving. 
+ +The definitions in this document are all defined informally by-example. This is a working document in this respect. The document is intended to motivate a number of design decisions, rather than document them in a formal sense. This is for another document. + +## Motivation + +Basic starting points for the design of Rascal functions are these: + +* Rascal functions are open to extension, to match the openness of data and syntax types; the data processed by these functions. If a module adds a new syntactic construct, then existing functions should be extensible (without changing existing definitions in other modules) to work on this new data-type. +* Rascal function parameters use pattern matching, for simplicity and consistency sake the pattern match language of Rascal function parameters is equal to the pattern match language for the `:=` and `<-` operators, and for `switch` and `visit` cases. +* Rewrite rules, as in term rewriting [Klop] have these two properties as well: open to extension and based on pattern matching. + +Still, simple Rascal programs should be familiar to normal Java and C[#,++,] programmers and not just term rewriting and functional experts. +* An important goal is to give a “familiar” look and feel to Rascal’s functions which have term rewriting semantics, in terms of syntax, static semantics and dynamic semantics. +* Another important goal is to make sure the static semantics as implemented by the type checker is a good over-approximate model of the dynamic semantics of Rascal; the type checker should not prevent to run programs which are fine. At the same time, the type checker should reject programs which will always throw exceptions, and warn about programs which throw exceptions in all likelihood. 
+* We understand from the start that due to the pattern matching semantics of Rascal’s function parameters, static checking can be made sound but not complete; not all call sites are guaranteed to find a matching function application at run-time, although statically we can discern between 3 cases: +* if certainly no function will match, (this should be a static error) +* maybe some functions will match, or maybe not (hence the incompleteness) +* that certainly one function will match (this would be a point of compiler optimization) +* Due to rewrite rule incompleteness, a run-time exception is thrown when no function matches (the CallFailed exception). This exception can also be caught to allow for complex back-tracking strategies if so desired. +* We strive for a design with no exceptional rules, **all kinds of values and kinds of types should be treated equally**. + +## Specification and examples + +#### Simple functions with simple patterns + +A single function, with a single alternative is defined as follows by a return type, a possibly empty list of formal parameter _patterns_ and a result expression: + +``` +int z() = 42; +int f(int i) = 2 * i; +``` + +Alternatively, return expressions can be constructed using imperative control flow: + +``` +int f(int i) { + return 2 * i; +} +``` + +For the purpose of the current definition, the body of a function is irrelevant. What matters is that the formal parameters are specified by patterns. In the above example, `int i` is a pattern which matches any value which happens to be of type `int` at _run-time_. + +Calling this function goes like so: +`f(1)`, and will return `2` in this case. 
+ +The whole pattern language is at the disposal of the programmer defining a function, this includes literal patterns and deep patterns for example: + +``` +int f(0) = 0; // match only with integers equal to 0 +int f(/0) = 0; // match any value which has a 0 in it somewhere +``` + +The `when` condition to a function strictly enhances the fallibility of a function with arbitrary predicates on the parameters bound by the formal patterns: + +``` +int f(int i) = 2 * i when i mod 2 == 0; +``` + +The function fails to apply when the condition is `false` and does apply when it is true. Conditions may also introduce variables using the `:=` or `<-` pattern match boolean expressions. + +Finally, the body of a simple function may contain a `fail` statement like so: + +``` +int f(int i) { + if (i mod 2 == 0) { + fail f; + } + + return i * 2; +} +``` + +A `fail` to the encapsulating function name, as in `fail f` has the same effect as a failing `when` condition. + + +#### Types of formal parameter patterns of simple functions + +The static type of a formal pattern (it can only be _static_ since a pattern is a code feature not a data feature) is simply the static type of the _outermost_ pattern operator. I.e. `1` is an int, `int a` is an int. When `data X = x()` then the type of `x()` in `int f(x()) = 0;` is `X`, etc. + +Patterns are interesting in a static type checking sense; because patterns match dynamic values. For a pattern to match it must be that the type of the run-time value is a (non-strict) subtype of the static type of the pattern. Focusing on the typed variable pattern `int a`: it matches any value `v` _if and only if_ the dynamic type of `v` is a sub-type of the static type of the pattern: `int`. + +Note that in this sense a _pattern match_ is a bit like a _cast_ in other programming languages with sub-typing, such as Java. As you may know, downcasts may fail and so may pattern matching. 
+ +A failed match in the context of a `visit`, `switch`, `<-` or `:=` will produce a `false` boolean value. For patterns at formal parameter positions of functions this semantics is a little different. If pattern matching on the formal patterns fails, then the function fails to apply dynamically and an exception is thrown. More on this feature of Rascal below. + +Variable patterns, without an explicit type, as in `a` in `int f(a) = 0;` are not allowed in function headers if they are unbound in the current scope. This is because _type inference_ for fresh variables is scoped only within function bodies. The reason for this design decision is to also scope the impact of a programmer mistake to the lexical scope of a function body. Otherwise, if function headers would also have type inference, errors could propagate between functions and cause all kinds of intractable and non-local type inference effects (see the Haskell language). + +An exception is made for the immediate children of ADT constructors; for example: + +``` +data X = x(int i); +int f(x(j)) = 2 * j; +``` + +Here we allow the omission of `int` before `j`, because `x` could be uniquely resolved and this implies a unique type for `j` without further inference. If there is another `f` in scope with the same amount of arguments, then it is required to type `j`, as in `int j`, or to qualify the constructor as in `X::x(j)`. + +#### Simple functions, with complex patterns + +The language for formal parameters to functions is the complete pattern language of Rascal, this includes the non-unitary pattern match operators for deep matching (`/`), list matching (`[*_,*_]`) and set matching (`{*_,e}`). + +A function with a non-unitary pattern will _backtrack_ over all formal patterns until a match is found: + +``` +int f([*_, 1, *_, int last]) = last; +``` + +when applied: `f([0,1,2])` the first match to succeed will bind the variables in the pattern, and then execute the body of the function. 
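
The backtracking over the non-unitary pattern `[*_, 1, *_, int last]` described above can be modelled with a generator that yields one binding per possible match. This Python sketch is only a model of the semantics, not of the Rascal implementation; `matches`, `call`, `Fail` and the Python `CallFailed` class are names chosen for the illustration.

```python
class CallFailed(Exception):
    """Raised when no match of the formal pattern leads to a result."""


class Fail(Exception):
    """Models Rascal's `fail` statement inside a function body."""


def matches(values):
    """Yield one binding per way `[*_, 1, *_, int last]` matches `values`."""
    for value in values[:-1]:
        # Each position of a 1 before the final element is a separate match.
        if value == 1 and isinstance(values[-1], int):
            yield {"last": values[-1]}


def call(body, values):
    """Try the body for each match in turn; backtrack when the body fails."""
    for bindings in matches(values):
        try:
            return body(**bindings)
        except Fail:
            continue  # try the next way of matching
    raise CallFailed(values)
```

With this model, `call(lambda last: last, [0, 1, 2])` takes the first successful match and returns `2`, while a list without a `1` before its last element raises `CallFailed`.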
+ +Note that the `fail` statement, as in `fail f;`, when used in the body of a function called `f` will make a function backtrack to the next possible match until all matches are tried. Only after the final match fails, an exception will be thrown. + +#### Multiple parameters + +There is a simple generalization from single parameters to multiple parameters. Patterns can bind values from left to right using typed variables. Patterns on the right may use values bound earlier on the left. Like so: + +``` +bool same(int a, a) = true; +``` + +#### Failing functions + +If the actual parameters to a function, their dynamic value that is, do not match the formal parameter patterns of a function definition, then the application of the function _fails_. Other reasons for a function to fail are a failing `when` condition or the execution of the `fail` statement. The `CallFailed` exception is catchable. + +I.e. the `same` function applied to different integers will fail to execute: `same(1,2)` will throw the following dynamic exception: `CallFailed(...fd..., [1, 2])` where `...fd...` represents an abstract representation of the `same` function and the second argument of the exception is the list of arguments which failed to match. + +This function will always fail: + +``` +int f(int i) { fail; } +``` + +Another big reason to make sure the Rascal run-time can deal with failing functions in a predictable and complete manner, is that it makes the semantics of functions more compositional. _The semantics of a single function being completely defined, independent of the following concepts of higher-orderness and function overloading, is key to keeping the design concise and predictable._ + + +#### Overloading and choice for dynamic dispatch + +The open extensibility of functions, designed to match the open extensibility of data types and context-free grammars, is predicated on Rascal’s functions failing to match. 
When one overloaded definition of a function fails, the next one is tried, etc., until all alternatives have been depleted. + +For example: + +int suc(0) = 1; +int suc(1) = 2; +int suc(2) = 3; + +When `suc(2)` is evaluated, the semantics of Rascal dictate that only the final definition succeeds. The alternatives of a function are not tried in any specific order, at least that order is left undefined and open to compiler optimizations. + +To have some control over the order of applying overloaded functions, we have the `default` keyword: + +default int suc(int i) = i + 1; + +The overloaded alternatives of the `suc` function which are labeled with `default` are guaranteed to be tried only after the other non-labeled definitions have been tried. + +Also when functions fail explicitly using the `fail` statement or when their `when` clauses fail, the other functions will be tried next, i.e. these two definitions together define a complete function over the integers: + +int f(int i) = 2 * i + 1 when i % 2 == 0; +int f(int i) = 3 * i when i % 2 != 0; + +Or one could define a function by dispatching over the alternative definitions of an ADT: + +int cc(ifThenElse(_, _, _)) = 1; +int cc(while(_, _)) = 1; +default int cc(Statement _) = 0; + +In other words, the fallible pattern matching feature for function parameters is the foremost (and preferred) mechanism for dynamic dispatch in Rascal. Overloaded definitions can be distributed over multiple modules and will be fused into one if one module extends another. +Finally, sometimes it's handy to define an overloaded function from two or more existing ones with different names. The + operator on functions does exactly this, e.g.: + +bool mod2(int i) = true when i % 2 == 0; +bool mod3(int i) = true when i % 3 == 0; +default bool other(int i) = false; + +(mod2 + mod3 + other)(9) // will return true because mod3 matches + +#### Higher order functions + +Now this is interesting. 
Since all formal parameters to functions are in fact patterns, the consequence is that passing around functions is also done via pattern matching itself. + +**Intermezzo**: higher-order functions in the context of a language with sub-typing can be called confusing at least. In other languages without strong pattern matching, but with sub-typing, already typing higher-order functions is complex. The main reason is the necessary contra-variance for the types of the formal parameter positions. I.e. unlike co-variance, where we have it that `list[int] <: list[value]` for functions it is necessarily the case that `int (value) <: int (int)` and not vice versa. You can see this via Liskov's substitution principle; something which is a sub-type of something else must be immediately substitutable at all the code positions where the super-type used to be. This means a function, at the place where it is called, must at least be able to handle the argument types declared locally. Perhaps more, but certainly not fewer. Hence we define contra-variance at the argument positions for such languages. Note however, that Rascal is not like other languages and the sub-typing rules for parameter positions are exactly what is going to be different. + +So, we are now defining what function typing and function sub-typing means, if formal parameters are patterns instead of simply typed formal parameter names. Based on this we will know how to _match_ function types, and thus how to pass functions as actual parameters to higher-order functions. + +A first class function is a Rascal run-time value with an actual concrete function type. So what is a function type? An example is this: `int (real)`. This is the type of a function with return type `int` and first formal pattern type `real`. 
+ +Some other example types of functions are: + +``` +int f(0) = 1; // type is int(int) +str f([]) = "hello"; // type is str(list[void]) +data X = x(); // type is X() +list[value] f(x(), 1) = [1]; // type is list[value] (X, int) +``` + +Pattern matching against function types allows programmers to pass these functions around: + +``` +int apply(int (int) func, int arg) = func(arg); +int f(0) = 1; + +int example = apply(f, 0); // f is matched against `int (int)` to bind the parameter `func` of `apply` +``` + +Since Rascal's types are _not data-dependent_ but patterns are very much data-dependent, we have an interesting issue here with static checking: it can not possibly be _complete_ for function calling in theory. Whatever we come up with (short of lifting the whole pattern language to the type language), we will always be able somehow to call functions statically which _will fail_ at run-time. + +For example, there is in principle no guarantee that the function passed to `apply` is actually fully defined on _all integers_. Note this is consistent with function application in general in Rascal. The function `int f(0) = 1;` is of type `int (int)`, yet it will throw a run-time exception on any integer except `0`. + +A Rascal type checker therefore should not reject the following code, since it is entirely valid: + +``` +value x = someExpressionProducingValues; +int f(int i) = 0; +f(x); +``` + +`x` is of a strict supertype of the parameter type of `f`, yet this is allowed by Rascal. Like with all pattern matching in Rascal, the only requirement for the static type of the pattern and the subject term is for them to be _compatible_, i.e. for pattern _p_ and subject _s_, it is required that either `subtype(typeof(p), typeof(s))` or `subtype(typeof(s), typeof(p))` (inclusive or). Type incompatibility would imply dead code, since the pattern could never, ever, match. Hence this is a good static error message. 
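
The compatibility requirement above is easy to state as code. The Python sketch below models it on a small fragment of Rascal's type lattice (`void` below everything, `int`, `real` and `rat` below `num`, everything below `value`); the `PARENT` table and the function names are assumptions made for this illustration, not part of the type checker.

```python
# Direct supertypes in a small fragment of Rascal's type lattice.
PARENT = {"int": "num", "real": "num", "rat": "num", "num": "value", "str": "value"}


def subtype(sub, sup):
    """True if `sub` is a (non-strict) subtype of `sup`."""
    if sub == "void":
        return True  # void is a subtype of every type
    while sub is not None:
        if sub == sup:
            return True
        sub = PARENT.get(sub)  # climb one step towards `value`
    return False


def comparable(pattern_type, subject_type):
    """The static requirement on a pattern and its subject (inclusive or)."""
    return subtype(pattern_type, subject_type) or subtype(subject_type, pattern_type)
```

Under this model the call `f(x)` with `x` of type `value` against the pattern `int i` is accepted, since `comparable("int", "value")` holds, whereas a `str` subject against an `int` pattern is incomparable and would be reported as a static error.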
+ +The benefit of this static semantics is that dynamically, functions can be used to filter and dispatch on all kinds of data using pattern matching, which is essential for modular extensibility and other beneficial features of the language. + +That kind of power is paid for by the loss of a kind of completeness here for the static type system: a well-typed program may still go wrong! Namely, we do not promise for every function call that the function will not _fail_ to match, as other programming languages (Haskell, Java, C, C++, C#, etc.) do. The behavior, in fact, is much more like ASF+SDF and Prolog. + +To drive the point home, the semantics of pattern matching on first class functions with typed patterns can be expressed by the following definition of sub-type for function types with one argument: + +``` +bool subtype(T1 (T2), T3 (T4)) = subtype(T1, T3) && comparable(T2, T4); +``` + +This generalizes to multiple arguments via pairwise compatibility, naturally, and functions with different arities are never comparable. + +As a consequence, when passing in functions as values to a higher-order (and probably generic) function there is a pretty good chance that the actual application of the function will not match. So higher-order functions should cater for this event by catching that exception and dealing with it: + +int apply (int (int) f, int arg) { + try { + return f(arg); + } + catch CallFailed(_, _): { + return -1; + } +} + +There is a short-hand for this using the IfDefinedOtherWise operator: `return f(arg) ? -1;` + +#### Warnings for probably incomplete function calls + +In many cases, it is possible to _warn_ the user about imminent pattern match failure, as in the above example. It is _not_, however, a static error. For consistency's sake pattern matching is only statically erroneous in case of type incompatibility, even for higher-order functions. 
Especially when functions are nested in deeper values, this becomes absolutely essential functionality. Also in the context of polymorphic types, like `&T`, this behavior is essential. + +Warnings on patterns which will never match (i.e. at function call sites) should be sound and not complete, i.e. we prefer the warnings to have no false positives, but false negatives are expected. The type checker will not always be able to predict imminent application failure. If we go for sound warnings, probably all interesting (dynamically dispatched, or higher-order) function calls would be labeled as possibly incomplete. + +For the `apply` example, above, it can be predicted that failure will occur, and the type checker would generate a warning that the call to `apply` never match. The checker could infer this because at the call site enough context information is present to decide the dynamic type of the first parameter to `apply`. This is not always the case of course. + +### Types for overloaded functions + +A question is what the dynamic type of an overloaded function would be. Especially when such a function is passed as a higher-order parameter this is an interesting issue since the type of the _dynamic value_ drives pattern matching. + +Examples: + +``` +int f(int _) = 0; +int f(real _) = 1; + +int g(int _) = 0; +real g(real _) = 0.0; +``` + +The type of the value produced by evaluating the expression `f`; what is it? We introduce here, for the sake of argument, a new type constructor `+` for a disjunctive type. Note that this actually does not exist in Rascal, we just use it to explain here why we did not introduce disjunctive types. The type of `f` namely might be exactly: `int(int) + int(real)` and the type of `g` is `int(int) + real(real)`. + +Such disjunctive types do not exist in Rascal’s type system, and for a good reason. They make type inference non-unique in a confusing way. There is an easier solution for this conundrum. 
+Because function types are both co- and contra-variant in their argument types in Rascal (the arguments only need to be comparable), a conservative approximation of the type of an overloaded function is to simply compute the least-upper-bound of the argument types. + +So the type of `int(int) + int(real)` is actually `int(num)` in Rascal. + +This design decision inherits the same weakness as typing call sites, it may be incomplete and thus calls to overloaded functions which were passed in as arguments may lead to CallFailed exceptions at run-time. In this sense the higher-order type system for function calls is consistent with the first-order calling semantics of functions. + +### Function construction + +Next to overloading by name, Rascal also features the construction of first class dynamically dispatched function alternatives via the addition operator: `+`. + +``` +int f(int i) = 0; +int g(real r) = 1; +int example = (f + g)(0); +``` + +The semantics is the same as overloading. The type of `(f+g)` is the least upper bound function type of `f` and `g` and the first function to match `0` will execute. When none of the alternatives match, the call expression fails as usual. Of course f and g must have the same number of parameters for this to typecheck: otherwise their `lub` would be `value` and function calling would be statically not supported. + +Another operator is function composition (f o g). + +#### Keyword parameters + +Functions in Rascal may also have additional parameters in the form of keyword parameters. These are quite different from formal pattern parameters: + +``` +int f(int i = 0) = i * 2; +``` + +The differences are: + +* a keyword parameter is always a typed variable name, and not a general pattern; +* a keyword parameter _always_ has an associated default value +* functions do not _match_ or _fail_ on keyword parameters. 
They simply bind actual keyword parameters to formal keyword parameters (if provided and of the right type), and otherwise the default value is bound. + +Consequently, keyword parameters are not a part of the dynamic dispatch feature of Rascal. + +Also, keyword parameters are not part of the type of a function. + +## Backward compatibility + +The current document actually describes mostly Rascal as it is already, with the following exceptions: + +* the current type checker produces errors sometimes where the proposal above would generate a warning, and it produces errors where the above proposal would not produce any warning or error. This implies a strictly weaker type checking which is _statically_ not backward compatible but can not dynamically change semantics. It will simply make more dynamic semantics possible, including more function calls to fail, but also more function calls to succeed than before. +* It can be that the compiler/type checker would filter out "dead code" before, which is not dead anymore when implementing the above proposal. With non-deterministic overlap, or order-dependent function definitions, this may lead to new non-deterministic behavior of existing programs. A warning feature for overlapping pattern matches for the overloaded alternatives seems to be in order. +* the ? 
operator on functions is not implemented + +## References + + + diff --git a/courses/RascalAmendmentProposals/RAP5/RAP5.md b/courses/RascalAmendmentProposals/RAP5/RAP5.md new file mode 100644 index 000000000..cb45f4476 --- /dev/null +++ b/courses/RascalAmendmentProposals/RAP5/RAP5.md @@ -0,0 +1,318 @@ +--- +title: RAP 5 - A single exact number type for Rascal +sidebar_position: 5 +--- + +| RAP | 5 | +| :---- | :---- | +| Title | **A single exact number kind for Rascal** | +| Author | Jurgen Vinju | +| Status | Draft | +| Type | Rascal Language | + +## Abstract + +This RAP proposes to unify all of Rascal’s number kinds (*int*, *real*, *rat*) into a single representation. The primary goal is to achieve simplicity while keeping interoperability with other external data sources and sinks. The secondary goal is to achieve the intended specification of Rascal arithmetic to be fully symbolic, i.e. *exact*, and so to coincide very strongly with the notions of elementary school and highschool arithmetic. + +All numbers will have associative and commutative semantics for \+ and \*, and classic distributivity for \* over \+. The rationals will have classic inversion laws of \*, / and \+, \-. This is unlike any programming language. There will be no implicit coercions, or rounding, or overflow for Rascal numbers. + +The `int` type will reflect all round numbers, positive or negative. The `rat` type (supertype of `int`) will reflect all rational numbers (including the ints) and the `real` type (supertype of `rat`) will additionally reflect all other irrational numbers (including the rats). It is implied there will be only one kind of `0` value. Future versions may include an even general number types (`complex`), each again strict super-types of the previous. + +Fractional representation may be used inside for efficiency’s sake, but for usability’s sake we’ll use decimal notation always for reading and writing numbers. 
This includes the rational numbers with infinite decimal sequences. For example: `1 / 3` will return `0.(3)` which stands for `0.33333333…(infinitely)`. `(⅓) * 3 == 1` with this new system, and `0.(3) * 3 == 1` as well. (The bracket notation is a standard mathematical notation for infinitely repeating decimal sequences.) + +Finally, we add midpoint-radius notation for *automatically handling approximations* of irrational numbers (pi, e) and raw data with inaccuracies. Programs will not explicitly have to deal with (im)precision, but Rascal will always show exactly (by default) how imprecise a calculation might be. Example: `pi(err=0.0005)` will return `3.142 ± 0.0005` to indicate that the real number pi is some number between 3.1415 and 3.1425 (and the current best guess is 3.142). We have a new theory on making the above algebraic laws of commutativity, associativity and distributivity work for these “midpoint radius” numbers. + +## Motivation + +* Complexity and incoherence: Rascal has a number of design principles: symbolic representations, immutability, strong typing, simplicity, which are currently not fully attained with the number types. There are problems with the previous design: + * there is currently a plethora of overloaded operators designed to make the three number types work together: int, real and rat. + * The `==` operator coerces the different types of numbers, contradicting one of the main axioms of Rascal, “what-you-see-is-what-you-get”, captured by the specification: for all values x, y: `x == y iff "<x>" == "<y>"` + * Also, even though `1. == 1`, it’s not true currently that `{1., 1} == {1}` (!!!) + * Interaction with external data-sources leads to a combinatorial explosion when multiple function arguments are at play; this is because `rat`, `real` and `int` are currently *incomparable* types. 
+ * The least-upper-bound `num` type is not used by Rascal programmers because it does not add the abstraction required to **not** think about the differences between `int`, `rat` and `real`. I.e. the substitution principle does not blossom in the current design. + * Generic functions over `num` do not work because there is no shared concept of the `0` and `1` values. So even though it seems the `+` operator works on all types of numbers, the `num` type itself is not a fully implemented calculus. + * Example: `&T <: num sum(list[&T <: num] l) = (0 | it + i | i <- l);` + * `sum` does not type-check because `0` is not a value of all of the sub-types of `num`, just for `int` (for rat there is `0r` and for real there is `0.`). *This is not only cumbersome, it is also confusing.* + * In the newly proposed system the above `sum` function is valid and returns an int for a list of ints, a rat for a list of rats and a num otherwise, because 0 ∈ int ⇒ 0 ∈ rat ⇒ 0 ∈ real +* *Opportunity 1*: since Rascal’s values are all symbolic and immutable anyway, there is an overhead which we pay and we do not exploit. + * We could have fully *exact* numbers without loss of precision without making Rascal slower. Rascal’s numbers were already “symbolic” and relatively slow at that as compared to floating point arithmetic in Julia or C. + * The current RAP should not make Rascal’s numbers slower than they are now, and possibly they will be faster + * The current RAP removes the need for run-time conversions of numbers, which may save some cycles +* *Opportunity 2*: similarly a single number type would pave the way for units of measure integrated into the value system. We do not include this in the current RAP, but without a single number kind the unit system would grow complex quickly. We see the simplicity of this RAP as an enabler for an elegant unit type system. 
+* *Opportunity 3*: currently API functions which interface with JVM libraries have to support all different kinds of Rascal’s numbers: int, real and rat. Especially with multiple arguments to a function this explodes. Library designers have to constantly think about a nearly impossible trade-off between convenience and efficiency. This problem can be removed by the current RAP. +* *Opportunity 4*: a new unique selling point for Rascal + * Inexactness is a big threat to validity for many research methods in many different fields of science, not in the least **software analytics** and **software metrics**. + * Researchers use R, Python and Julia libraries without knowing about the impact of inexactness of floating point arithmetic or the inaccuracies of their own data (!) + * A re-implementation of statistical libraries on top of exact numbers in Rascal will open up a new audience and set of “killer” apps for Rascal users, possibly also outside the context of **software analytics** and **software metrics**. + +## Specification + +Note that these changes have to be reflected in **vallang** as well as in **Rascal** and in the links to external number manipulation libraries (apache.math) in library code. This is a BIG change. + +### Syntax and types + +1. Type system changes: + 1. The `num` type disappears. For forward compatibility, in case a more general type than `real` is added (like `complex`) we should not have a number type which is more general than a type which will be added later. + 2. The number type hierarchy will be: **void** \<: **int** \<: **rat** \<: **real** \<: **value** + 1. `**real**` is the most general number type + 2. `**rat**` is a proper sub-type of `**real**` + 3. `**int**` is a proper sub-type of `**rat**` + 4. The idea is that the system reflects the mathematical classes of numbers as much as possible. (see also LISP, Racket, Kawa “numerical tower”) + 3. 
It is understood that the run-time type of any number is always at its most specific, even though statically the type of a variable is more generic. + 1. This is essential for the interaction of the numbers with pattern matching + 2. This is essential for the canonical representation of numbers and containers with numbers in them (for equational reasoning, aliasing and sharing) + 4. In the previous type system, values of int and rat and real were of incomparable types, now `int` is a strict sub-type of `rat`. This will affect existing dynamically dispatched functions since they may not be mutually exclusive anymore. +2. The syntax of numbers becomes a single non-terminal with a number of alternatives. (all types in Rascal coincide with some kind of literal language). + 1. `[0-9]+` : rounded integers + 2. `[0-9]+ “.” [0-9]+` : finite rational numbers in decimal notation + 3.` [0-9]+ “.” [0-9]+ “(“ [0-9]+ “)”` : infinite decimal numbers with “repetents” to represent exactly rationals which have infinite decimal expansions. I.e. `0.(3) == ⅓` + 4. The `x r y` notation disappears and is replaced by the `x/y` division expression + 5. Syntactic support for midpoint-radius notation: `number1 ± number2` means the first number is exact within a range of `number1-number2` to `number1 \+ number2`. Number2 must be exact (no range), and so must number 1 (no range nesting). + 6. “±” NumberLiteral: means “approximately this number”, where each digit is interpreted as if it were rounded to the nearest significant digit. So this literal `±1.00` will reduce to `1 ± 0.005` and this `±1.0` will reduce to `1 ± 0.05`. This is a convenience syntax. + 7. Scientific notation with “e” still supported on top of decimal notation; this interacts nicely with the prefix ± notation for significant digits. + + 8. `.3` not supported anymore, you must write `0.3` for technical (ambiguity) reasons in the grammar + 9. 
The repetent notation with the brackets requires an additional disambiguation for the CallOrTree expression (no literal numbers can be functions) +3. Static types for arithmetic expressions + 1. `lub` function is defined according to int \<: rat \<: real + 2. For all expressions `e1`, `e2`, `type(e1) <: real` and `type(e2) <: real`: + 1. type(e1 \+ e2) = lub(type(e1), type(e2)) + 1. Implied: ints remain ints, rats remain rats and reals are reals + 2. Implied: int \+ rat becomes rat, rat \+ real becomes real, etc. + 3. Implied: no run-time coercion is even necessary. The numbers are already of the right kind statically (\_always\_). + 2. type(e1 \- e2) = lub(type(e1), type(e2)) + 3. type(e1 \* e2) = lub(type(e1), type(e2)) + 4. type(e1 / e2) = lub(`rat`, lub(type(e1), type(e2))) + 1. / is not closed on integers: always at least a rat (statically) + 5. type(e1 **div** e2) = `int` + 6. type(e1 **mod** e2) = lub(type(e1), type(e2)) + 7. type([0-9]+ whole) = `int` + 8. type([0-9]+ whole . [0-9]+ fraction) = `rat` **when** `fraction != 0` + 9. type([0-9]+ whole . [0]+ fraction) = `int` + 1. **Implied:** 0.0 is an **int**. This is not necessarily obvious if you come from another PL, but for learnability in the REPL newbies can immediately (statically) see and understand that 0 == 0.0 == 0.000000(0). + 10. type([0-9]+ whole . [0-9]+ fraction “(“ [0-9]+ repetent “)”) = `rat` **when** fraction != 0\* and repetent != 0\* + 11. type([0-9]+ whole . [0]+ fraction “(“ [0]+ repetent “)”) = `int` + 12. type(e1 ± e2) = `real` if e2 != 0 + 1. Note that both e1 and e2 as left and right-hand sides of `±` always reduce (semantically) to values of type `rat` dynamically due to the axioms of `±`. + 13. Types of inherent number property fields: + 1. type(e1.numerator) = `int` + 2. type(e1.denominator) = `int` + 3. type(e1.radius) = `rat` + 4. type(e1.midpoint) = `rat` + 5. type(e1.whole) = `str` + 6. type(e1.fraction) = `str` + 7. 
type(e1.repetent) = `str` + 8. type(e1.scale) = `int` + 9. type(e1.precision) = `int` + 3. Recall: even if statically an expression like `e1 / e2` has type `real`, still the dynamic type of the result might be `int`. Consider: `1.001 / 1.001 == 1` + 1. This is sound because int \<: rat \<: real + + + +4. Each specific instance of a number has a canonical syntactic form when it is printed: + 1. Leading and trailing zeros are dropped + 2. Rationals are printed in decimal notation with repetents + 1. We always print `0.(010752688172043)` and not `1/93` + 3. x/1 is printed as x, so is x \* 1 + 4. Error ranges are printed using “±” as 0.0001 ± 0.000000001 + 5. Might use E scientific notation to avoid printing a lot of 0’s after the `.` + +5. Division. With `/` division defined on all numbers, it is also defined on integers to return rational values. This is backward incompatible and requires a few additions: + 1. `1 / 3` will produce the rat `0.(3)` and not the int `0` anymore + 2. New operators `mod` and `div` are added for whole number division + 1. 1 div 3 will produce 0 + 2. 1 mod 3 will produce 1 + 3. Mod and div can also be defined for general rationals and midpoint-radius numbers to mean “how many times does this fit?” + 3. Only `int` values index into lists, so `a[numExpression]` remains a static Rascal error when `a` has a type sub-type of `list[value]`. It has to be `a[intExpression]`. + +6. The internal representation of any number is always logically a gcd-normalized fraction + 1. No more rounding errors + 2. A challenge to make fast + 3. Canonical representation due to the GCD, good for equality checks of (nested) collections with numbers in them (!) + +7. Because there is only one number type all number operations are defined always + 1. == + 2. \+ + 3. / + 4. `div` (is / but whole number division) + 5. `mod` + 6. And there is only one `0` and one `1` +8. No more coercions and no more overloading + 1. 
No more coercions, as the designers of Julia also advocate + 2. But also no more implicit overloading or upgrading of numbers to higher precisions. This is enabled by having only ONE canonical representation of numbers inside, which is rationals and pairs of rationals in the midpoint-radius case. + 1. Julia, for example, supports many different representations which can be traded by programmers (efficiency against exactness). Rascal does not offer these trade-offs anyway (never did support float or double), but we did overload `int + real` and `real / int` etc. to generate a complex hierarchy of combinatorial size with possibly “interesting” conversions between rat, real and int. + 3. The linear and predictable scheme of number I/O and conversions will be: + 1. parse a number notation, or import it, + 2. represent it as a rational, + 3. compute with the rationals, + 4. print or export a number notation +9. Midpoint-radius notation for dealing with irrational and inaccurate numbers: + 1. Any irrational number such as pi or e or the outcome of the `sin` function, will be represented by a well-defined interval in midpoint-radius notation + 2. Any inaccurate number (for example obtained by a noisy measurement) can also be represented by the same kind of intervals + 3. Using the midpoint-radius notation, we can offer a uniquely usable algebra with commutativity, associativity, distributivity and (partial) inversion laws. + 4. The specific definitions of this algebra are still under embargo + 5. Midpoint-radius notation is like so: `midpoint ± radius` + 1. The midpoint represents *the most likely* *outcome* of a computation + 2. The radius represents an absolute radius around the midpoint. The actual value of the number may be any real number (rational or irrational) in `[midpoint - radius, midpoint + radius]` (inclusive bounds) + 3. All the arithmetic operators manage the radii automatically + 1. Most code can be oblivious to the error ranges + 2. 
The midpoint calculation is always isomorphic to a calculation on rational numbers without the error radius (i.e. error oblivious) + 3. The radii are always a conservative over-approximation of the error + 4. The radii are as tight as we can get them for the general arithmetic operators, but no specific theory will be included in the semantics of Rascal to tighten them more. This is up to the programmer. + 6. Midpoint/radius numbers require more functions or operators for detecting overlap. + 1. Midpoint/radius numbers are only equal if both the midpoint and the radius are equal + 1. `A ± e1 == B ± e2 ⇔ A == B && e1 == e2` + 2. Midpoint/radius numbers are partially ordered, not fully ordered like `rat` + 1. `A ± e1 < B ± e2 ⇔ A + e1 < B - e2` + 3. Existing `in` operator detects interval inclusion (boolean): + 1. `a in b ± e ⇔ a >= b - e && a <= b + e` + 2. `a ± e1 in b ± e2 ⇔ (a - e1) in b ± e2 && (a + e1) in b ± e2` + 4. Existing `&` operator detects interval overlap (boolean) + 1. & is currently defined on sets as set intersection + 2. We’re not defining interval intersection since the empty interval can not be represented using midpoint/radius notation + +### Support functions + +* `str significant(num x)` prints a number with a number of digits up to the precision indicated by its radius. It fills with zeroes to the right. +* `num fromSignificant(str x)` parses a number, assuming the number of digits after the point represents its accuracy as a rounded number, including trailing zeroes. So: `fromSignificant("1.00")` would return `1 ± 0.005`. +* Bridging to JVM languages: + * `INumber.as{Float,Double,Integer,Long,TwosComplement}() throws NumberFormatException` (if the number is more accurate than the target container can represent) + * `IValueFactory.number({String, float,double,int,long,byte[] twoscomplement} num)` converts the value *without loss of the original precision*. 
+ * Finding out about loss of precision when communicating with JVM libraries: + * `num fitFloat(num x)` uses the most precise float representation possible (with the least error) to represent `x` + * `num fitDouble(num x)` uses the most precise double representation possible (with the least error) to represent `x` + * `num floatError(num x)` returns the error made by fitting `x` into a JVM float + * `num doubleError(num x)` returns the error made by fitting `x` into a JVM double + * Library support for mapping INumber to `java.math.*` big decimals and big integers and the like. +* Mathematics library + * Some irrational numbers are represented by functions which produce rational approximations up to a given error: + * `num pi(rat err=1/1000)` produces pi within the bound given by `err`; for example, with `err=0.00001` the result would be printed as `3.14159 ± 0.00001` + * `num e(rat err=1/1000)` produces e within the bound given by `err` + * Goniometric functions (sin, tan, cos, etc.) work the same way: + * `tan(num n, rat err=1/1000)` produces the `tan` of `n` within the bound given by `err` + * When a mathematical function is **undefined** for certain number(s), the function throws a value of the ADT `NumberException`, with the constructor name equal to the name of the function followed by “Undefined” and its parameter(s) the input of the function. + * E.g. `data NumberException = tanUndefined(num x)` (happens only when x == π/2, but that is logically impossible since Rascal would not have an exact representation of π/2, so it should never happen). + * Or, `data NumberException = avgUndefined(list[num] l)` happens for `[]`, the empty list. + * The goal is to allow for precise error location, as well as allow for function-specific error-handling by catching/matching these run-time exceptions. +* Statistics library + * An important use case of Rascal numbers is source code statistics. 
+ * Loss of precision via external libraries would be throwing in the towel, but the loss of precision can be handled by reflecting the imprecision of `float` and `double` using the error ranges? + * Proposal: semi-automated source-to-source translation of Java, Julia or R **open-source** libraries for statistics towards Rascal statistics library. + + +### Semantics + +* Computation semantics is completely based on traditional rational number algebra. The standard field of 0, 1, \+, \* extended with /, `div` and `mod` for convenience’s sake. +* Leading zeros are always dropped. +* 0.0 == 0, 1.0 == 1, etc. due to standard algebra exact digit counts (“precision”) of a number are **not semantically meaningful** although symbolically different: + * Because *1.0 == 1 iff 10 == 10 iff 1 == 1* + * All numbers in Rascal are simply fully \_exact\_ + * This means that a discussion on precision and resolution of rational numbers becomes relevant only at the boundaries between Rascal and other systems, + * Or we use the midpoint/radius notation for imprecise numbers + * The `str significant(num x)` function would return `”1.00”` for `significant(1 ± 0.005)`, to indicate that the original number was significantly precise up to 2 decimal digits after the dot. +* `{1., 1} == {1}` as a corollary to the previous, + * Because `{1. , 1} == {1}` iff `{1, 1} == {1}` iff `{1} == {1}`. + * `\[1. , 1] == [1, 1]`, by the same reasoning + * Note: this opens possibility for **further optimizing the persistent collection data-structures** under vallang (using capsule) due to the possibility of a canonical internal representation of numbers, leading to short-circuiting (in)equality tests of possibly very large collections of numbers. +* Divide by zero is a runtime exception. 
+ * We considered using Bergstra’s meadows, but the issue here is that the 0 division cancellation law is almost never really applicable (the axiom is pretty complex), and thus most computations which end up dividing by zero end up producing “NaN” or “a” or “I had a division by zero somewhere”. By that time the cause of the division by zero can be buried deeply in the past, which hampers debugging and error reporting to the end-user. + * “Fail fast, fail clearly” +* Repetents are *always* normalized to rationals inside, such that + * `0.(9) == 1` + * `0.(3) == ⅓` +* We arrive at new and simple axioms which were not possible before due to rounding issues: + * `y != 0 ⇒ x / y * y == x` + * `z != 0 && y != 0 ⇒ (x / y) / z == x / (y * z)` + * `y != 0 && z != 0 ⇒ x / (y / z) == x * (z / y)` +* We arrive at axioms which were not possible due to number coercion and operator overloading: + * `(x + y) + z == x + (y + z)` + * `x == y iff "<x>" == "<y>"`, the “WYSIWYG” axiom, but care has to be taken that a canonical printing operation exists (see above) and that it never accidentally rounds numbers. +* Error range semantics needs to be reflected on all built-in operations, such as \+, \-, /, \*, etc. (algebraic rules are under embargo for now) + * Supports commutativity, associativity and distributive law + * Inversion law holds for the midpoints, but not for the radii + * “Obliviousness law” requires the midpoint calculations never to be affected by the error radius + * This is a usability feature: the programmer can always check the outcome of a calculation manually using high school arithmetic and forgetting about the error radius + * This is an error-analysis refactoring feature: algorithms can be changed under the constraint of keeping all midpoints intact and improving only the bounds of the radii. 
+ * Two important axioms for the radius operator: + * Errors accumulate: e1 ± (e2 ± e3) = e1 ± (e2 \+ e3) + * Bigger errors subsume smaller errors: (e1 ± e2) ± e3 = e1 ± max(e2, e3) + * These two reductions imply canonical rational values for the left and right-hand side of `±`, since all nesting of `±` is rewritten to `+` and `max` on `rat` + +### Design drawbacks and pitfalls + +* Exact numbers are almost never equal + * Have to use `round` a lot more than before? + * Ignorance used to be bliss, now you see the threat to validity of your code all the time and you have to deal with it explicitly. +* Rational numbers are more expensive than float and double + * Yes, slowdowns by factors of 10 to 20 have been reported in other contexts + * Considering the growth in speed of computer hardware this slowdown can also be seen as insignificant. +* Many, many syntax and API breaking changes + * You cannot have your cake and eat it, unfortunately. +* Ugly notation of repetents with double brackets. + * Hard to fix + * Unicode overline is also possible when printing, but very hard to type in. +* Statistics libraries in pure Rascal are a pretty big construction and maintenance effort. + * Without these libraries the point of introducing exact numbers into Rascal is almost moot, since no code will actually *use* them. + * The only argument left would be design simplicity. + * Without a semi-automatic acquisition of all these algorithms, or an open-source enthusiast willing to contribute, this might be the breaking point for this RAP. +* Error radii impose an additional overhead, but it would be a USP to have visible error rates on all numbers at all times. +* Error radii only exist in numerical analysis packages and have not been applied in programming languages before as such. 
+ * Probably because speed is always preferred over correctness + * But also because the specific kind proposed here is unique in providing a distributive law (which most people require to use) + * Alternative representations such as plain intervals suffer immensely from syntactic and semantic usability problems: + * No inversion, distributivity or associativity + * Error drift (the ranges start drifting from the midpoints) + * Syntactic overhead (each computation needs definition on at least two points in each program) + * So intervals are for numerical experts (not the audience of Rascal) + + + +### Implementation details + +* Using builtin integers to represent small rationals, i.e. the LSBs for the numerator and the MSBs for the denominator. This can go down to `byte` level as well for very small numbers. + * We might even improve on the current representation overhead of big decimals +* Eventually large numerators/denominators will be represented by byte arrays (or big integer libraries) +* Canonical representation is essential, **so GCD becomes a main bottleneck** and this needs to be investigated and mitigated + * Lazy + * Incremental/persistent + * Shared +* No known big decimal libraries support repetents (!), not even notationally. Because it is expensive and because we should use rationals. + +## Backwards Compatibility + +* Support (most of) the old syntax for numbers, but normalize to the new notation when printing the values +* Rounding errors disappear which may result in new values not seen before +* Library re-design introduces many API changes + +## Implementation + +* Vallang implements fast rationals, itself, based on learning from other open-source libraries. 
Avoid wrapping other implementations for speed reasons +* Investigate use of JIT compilation and unboxing/boxing and/or MethodHandles +* Library implementations (semi-automatic source-to-source translations) +* Implementation involves removing a lot of existing code which becomes unnecessary in vallang, the Rascal interpreter, the Rascal type-checker and compiler(s), including removing unnecessary library code. + +## Future work + +What this RAP does not touch at all and still is important: + +* **Dimensions and units**, run-time type system and static type system (studied in concert) + * Unit and dimension type systems have been investigated broadly and deeply in literature and in practice. + * It should not be hard to borrow the right design from somewhere? + * There is an overlap to be expected with our algebraic data-types, and perhaps a meaningful integration which avoids this overlap and allows for full extensibility of unit dimensions would be possible? +* **Easier syntax and semantics for mapping and folding number operations** over the number containers (lists, sets, maps and relations) + * i.e. `list[real] + list[real]` could pairwise add the numbers if `+` weren’t already list concatenation. + * Now we have to write `[ a[i] + b[i] | i <- index(a) ]` or something in this vein, not even taking notice of the possible difference in length of the two lists. +* **Diagramming**: Integration with diagram, graph and table visualisation libraries + * A **single number type** makes integration with external diagramming and graphing tools much easier + * First step: adapting Shapes and Salix bridges, perhaps also the older Figure library. + * Idea: optimize diagramming a lot: look at screen or paper resolution constraints to filter invisible data before processing it into the diagram. + * So: `scatterPlot(rel[real,real] input, resolution=600 dpi)` would not “print” dots which overlap by more than 50% (or 75% or 90%??) if printed at 600dpi. 
Here “print” means render or communicate to the graphics engine at all. + * Note that this filtering is very different from mathematical rounding. Here a “dot” is printed in a two dimensional fashion even though it is represented by mathematically precise single dimension \ position. + * “Overlapping” dots for more than 50% is a two-dimensional perspective. + * The overlapping points are removed entirely from the output set to be visualized, simply because some other point already represents their “visual value”. + * This filtering entails a communication and rendering optimization with possible orders of magnitude gains in efficiency. Consider an exponentially distributed and filled data-set, while paper and screens are simply linear in both directions. + * There are obvious threats to validity here and possible introduction of visible artefacts which must be taken care of very carefully. + * Possible related work study? Who has solved this issue in which big data processing workbench already? + +## References + diff --git a/courses/RascalAmendmentProposals/RAP6/RAP6.md b/courses/RascalAmendmentProposals/RAP6/RAP6.md new file mode 100644 index 000000000..fd25be8fc --- /dev/null +++ b/courses/RascalAmendmentProposals/RAP6/RAP6.md @@ -0,0 +1,124 @@ +--- +title: RAP 6 - improved import/extend semantics +sidebar_position: 6 +--- + +| RAP[^1] | 6 | +| :---- | :---- | +| Title | improved import/extend semantics | +| Author | Jurgen Vinju | +| Status | Draft | +| Type | Rascal Language | + +### Issue + +We have received reports, and encountered ourselves, some baffling emergent behavior of the current semantics of \`import\` and \`extend\` module semantics in Rascal. We see unexpected and complex behaviors, often after long debugging sessions, where the implementation is not wrong per se, but just very hard to get. Ergo, we have to do something about the design of these two language features. It’s not comprehensible now. 
+ +### Analysis + +First the current relevant features of import and extend are listed: + +* “import X;” makes the declared items (variables, functions, data-types and syntax-definitions) of the imported module “X” visible in the importing module. +* An “imported” module is a singleton instance, with its own module state with respect to its global variables + * If more than one importer modules A and B import the same imported module X, they share the same view on the state of the imported module X. +* “import X;” does not make the declared items of modules it imports itself visible to the importing module. Import is not transitive. + * This is like Java import also works and what our (beginner) audience expects + * It helps to keep namespaces clean + * It is therefore necessary for information hiding and reuse + * An exception is made for non-terminals and data-types, their names are propagated +* When an importing module declares a function or a data-type of the same name which already exists in the imported module something interesting happens: + * The data/syntax-type is not shadowed but merged + * A function is not shadowed but it’s alternatives are added to an overloaded function (however, not \_recursively\_ to other alternatives in more deeply imported modules) + * This behavior stems from a time when “extend” did not exist yet in Rascal. It’s a backward compatibility issue. When “extend” did not exist yet, this was the only reason why data-types, syntax definitions could be (marginally) modularly extensible. The “extend” feature was added later because this is not enough, in combination with extensible overloaded functions. + * When we added extensible overloaded recursive functions over the extensible data-types it became also urgent to add “extend”. + +* “extend X”; like “import” also makes the declared items (variables, functions, data-types and syntax-definitions) of the imported module “X” visible in the importing module. 
+* An “extended” module is not an instance and it does not have state. + * Instead of making reference to an external module instance, all its declarations are cloned into the local scope of the extending module + * This also goes for global variables. Each extended global has its own instance in the extending module. +* “extend X;” makes all the declared items of the modules that X itself imports visible to the extending module. “Extend” is transitive. + * Even the “import” declarations of the extended module are cloned + * As are the “extend” declarations. + * And all private parts of the module as well. +* When an extending module declares a function or a data-type of the same name which already exists in the extended module something useful happens: + * The data/syntax-type is merged, as if the declarations were next to each other in the same module + * A function’s alternatives are added to the overloaded function of the extending module + * By effectively merging the declarations of data-types and syntax-definitions and functions into the same extending module, both recursive functions and recursive data and syntax types are now openly extensible + * Recursive calls in extended modules now resolve to the bigger overloaded function rather than to the overloaded function as it was in the original scope. + +**Observation 1:** by not shadowing names declared by imported modules, “import” merges definitions almost like “extend” does, but not completely transitively and recursively. This **semi-merge** generates hard-to-predict run-time behavior (why did this function not match?) + +* No static errors or warnings are ever produced to signal that this semi-merging is going on + +**Observation 2:** open extensibility for data-types (languages) and for the recursive functions that operate on these data-types is a distinguishing Rascal feature with a strong language-oriented flavor. It is an important (yet advanced) language feature. 
“Extend” does not have any information hiding feature, which is necessary for the “openness” it requires. + +**Observation 3:** “import” is useful for libraries of non-extensible functions and specifically for information hiding. We can not do without “import” either: larger Rascal programs would become nearly impossible to write and maintain (remember ASF+SDF which only had “extend” semantics for its “import” declarations). + +**Observation 4:** the feature interactions between import and extend are gruesome. + +* With globals involved, it becomes unclear what instance we are talking about +* With function merging involved it becomes unclear which overloaded alternatives are active at which level in the import/extend hierarchy + +### Solution proposal + +We propose to remove as much functionality overlap and interactions between **import** and **extend** as possible by removing the historical features of import which belong to extend, in order to: + +* Avoid complex feature interactions +* Produce more static and early warnings to the programmer + +Unfortunately, this proposal can not be backward compatible to previous Rascal versions. *It breaks the semantics of existing imports.* + +The concrete proposal is to: + +* remove all the function, data-type and syntax definition **merging effects** from “import”; +* **do not propagate** syntax and data-type names over transitive imports any more; +* let local names in **the importing module shadow** equal names from the imported modules. 
+ * Imported but shadowed names will still be accessible via qualified module names + * Imported names that are not shadowed will be accessible as before + +Positive consequences: + +* Users who need to program extensible languages will be forced to use “extend” inside their language implementations to fix the new static errors they would get if they use “import”; +* Users who simply need to use a library or a final language implementation, without having to extend it, will be better off using “import” for its information hiding features. + * A warning for an unnecessary “extend”, i.e. one where no function or type is effectively merged, would give feedback to prefer import over extend. + * They can use their own function names without having to “know” all the names in the modules they are importing. +* The current type-checkers have to jump through hoops to implement the current import semantics which merges definitions instead of shadowing; they will become simpler in that regard. + +By effecting this change, the type checker will start producing more warnings and errors automatically. For example: + +module A +import basic::Identifiers; + +module B +import A; + +syntax Exp \= Id; // error: undeclared non-terminal Id (before it would get the Id from basic::Identifiers via the transitive import of A). + +and: + +module A + +data X \= x(); + +int f(x()) \= 0; + +module B +import A; + +// X shadows the X from module A: +data X \= y(X x); // possible warning: X is not productive, there is no base case + +int f(y(x())) \= 1; // error: undeclared constructor x on local type \`X\` + +Negative consequences: + +* This will break existing Rascal programs, but when it breaks most of the time a static error will pop up. 
+ * Users of Rascal will have to change imports into extends to fix these issues +* New static checks have to be designed and implemented, with good error messages to: + * Suggest using import over extend (when nothing needs merging) + * Suggest using extend over import (when definitions become incomplete due to shadowing) + * Such as non-productive non-terminals + * And such as overloaded functions which miss cases + * Suggest qualified names when a function is shadowed but reachable from an import. + +[^1]: RAP is at the moment following Python’s PEP ([https://www.python.org/dev/peps/](https://www.python.org/dev/peps/)). We need to look at other projects to see what is best. See for instance, [http://yt-project.org/](http://yt-project.org/) diff --git a/courses/RascalAmendmentProposals/RAP7/RAP7.md b/courses/RascalAmendmentProposals/RAP7/RAP7.md new file mode 100644 index 000000000..29ec63a73 --- /dev/null +++ b/courses/RascalAmendmentProposals/RAP7/RAP7.md @@ -0,0 +1,107 @@ +--- +title: RAP 7 - Final Pattern Variables +sidebar_position: 7 +--- + +| RAP | 7 | +| :---- | :---- | +| Title | Final Pattern Variables | +| Author | Jurgen Vinju | +| Status | Draft | +| Type | Rascal Language | + +## Abstract + +The proposal is to disallow assignments (=) into all pattern variables, including those of function parameters and the left-hand side of generators. This effectively makes all those names \`final\` in Java jargon. + +## Motivation + +The basic motivation is to simplify the programming and analysis of Rascal programs, prepare for more use of closures when/if we add parallelism, and fix the issues that newcomers keep running into with closures. + +1. Rascal functions are supposed to define actual “functions”, mathematically. + 1. A function is a single-valued relation between its input parameter values and its output value: every input maps to exactly one output. + 2. 
By assigning into the formal parameters of functions, this functional relation is ill-defined during the execution of a function (dependent on control flow and path) + 3. Equational reasoning (i.e. when thinking about the correctness of a recursive function) is very hard if the parameters change value. +2. Closures capture variable references now, also of loop variables and function parameters, which generates newbie questions but also confounds advanced readers of Rascal (including its designers). + 1. By making pattern variables immutable, closures must capture the value of these variables instead of the references; + 2. Local and global variables will model only mutable state in Rascal. + 3. Programmers will not accidentally be able to capture state anymore, but they will be able to do it on purpose. + 4. If/when adding concurrency to Rascal, both the equational reasoning argument and the capturing references argument become much more pressing: + 1. Accidentally capturing state will lead to either more implicit locking, or more implicit races (depending on the concurrency design) + 2. Functional parameter passing and return are fundamental to the map/reduce paradigm, as long as parameters are immutable this makes a lot more sense. + +## Specification + +* The static checker should disallow \`x \= exp;\` for all \`x\` introduced as pattern variables; this is the core of the proposal and the rest follows. +* Closures should capture all pattern variables by-value (*or else we would not be able to satisfy the constraint that they are final*) + +For example: + + int f(int j) { + j \= 0; // error\! 
Cannot assign to pattern variable j + + for (k \<- \[0..9\]) { + k \*= 2; // error\! Cannot assign to pattern variable k + println(k); + } + } + + // prints 0123456789 (and not 9999999999 anymore\!): + void testClosureNoState() { + x \= for (int j \<- \[0..10\]) { + append () { return j; }; + } + for (f \<- x) { + print(f()); + } + } + + // prints 9999999999: + void testClosureWithState() { + int state \= \-1; + x \= for (int j \<- \[0..10\]) { + state \= j; // no error, this is a local variable + append () { return state; }; + } + for (f \<- x) { + print(f()); + } + } + + // prints 0123456789: + void testClosureWithScopedState() { + x \= for (int j \<- \[0..10\]) { + int state \= j; // no error, this is a local variable + append () { return state; }; // \`state\` is new every loop + } + for (f \<- x) { + print(f()); + } + } + +## Backwards Compatibility + +* This change is **not** backwards compatible + * New static errors will appear where assignments are made into pattern variables + * The run-time semantics of closures will silently change from “9999999999” to “0123456789”, which might break code using the Figure library (for example) + * It’s possible to (temporarily) add warnings to closures which capture references to pattern variables, to make this visible. + * It’s highly likely that the “9999999999” behavior was a bug anyway... + * A better proposal is to provide a tool (menu option) to highlight all these occurrences to the user once and allow them to rewrite if necessary. + * Because capturing values will be the default for loops and interactions with salix and the figure library; + * Warnings by the type checker would be mostly false positives; + +## Implementation + +* Distinguish pattern variable role from normal local and global variable role in type checker (is already the case) +* Add rule to disallow assignments into pattern variables +* Do not lift pattern variables to references anymore (currently all variables which are captured are lifted). 
This is a filter to be added to that part of the compiler. + +## References + +* [https://stackoverflow.com/questions/21340116/onmousedown-pointers-inside-a-loop-in-rascal/21386790?r=SearchResults\&s=1|25.3812\#21386790](https://stackoverflow.com/questions/21340116/onmousedown-pointers-inside-a-loop-in-rascal/21386790?r=SearchResults&s=1|25.3812#21386790) +* [https://stackoverflow.com/questions/41070422/figure-doesnt-show-correct-string-on-event/41110035?r=SearchResults\&s=4|24.2505\#41110035](https://stackoverflow.com/questions/41070422/figure-doesnt-show-correct-string-on-event/41110035?r=SearchResults&s=4|24.2505#41110035) +* [https://stackoverflow.com/questions/54278043/box-callback-functions-returning-the-same-string-in-rascal/54278923?r=SearchResults\&s=6|21.5850\#54278923](https://stackoverflow.com/questions/54278043/box-callback-functions-returning-the-same-string-in-rascal/54278923?r=SearchResults&s=6|21.5850#54278923) +* [https://github.com/heathermiller/spores](https://github.com/heathermiller/spores) +* [http://blog.sethladd.com/2012/01/for-loops-in-dart-or-fresh-bindings-for.html](http://blog.sethladd.com/2012/01/for-loops-in-dart-or-fresh-bindings-for.html) +* + diff --git a/courses/RascalAmendmentProposals/RAP8/RAP8.md b/courses/RascalAmendmentProposals/RAP8/RAP8.md new file mode 100644 index 000000000..f882c92e4 --- /dev/null +++ b/courses/RascalAmendmentProposals/RAP8/RAP8.md @@ -0,0 +1,137 @@ +--- +title: RAP 8 - Simple and Almost Safe Concurrency for Rascal +sidebar_position: 8 +--- + +| RAP | 8 | +| :---- | :---- | +| Title | Simple and Almost Safe Concurrency for Rascal | +| Author | Jurgen Vinju, and … | +| Status | Draft | +| Type | Rascal Language | + +## Abstract + +Since Rascal has (mostly) immutable data and it’s on the JVM it is basically ready to benefit from multicore architectures using the JVMs multithreading and concurrency features. 
+ +The proposal is to introduce a single structured “for-loop-like” programming construct, called \`**fork**\`, which extends the closure feature of Rascal by allowing its blocks to run concurrently, sharing read-access to local variable state and write access to an (ordered) result list. + +The \`fork\` statement ends when all of its concurrent loop blocks have ended (if it returns results) and it then returns a list of values in the order of spawning each block. If no results are computed, the fork statement ends directly after spawning all of its computations. + +Because file locations (the \`loc\` type) are highly prominent in Rascal, and these “point” to external files and other resources outside of the JVM, the concurrency feature also requires us to think about resource locking. This can be designed and implemented orthogonally, see RAP 10\. + +## Motivation + +* Rascal cannot utilize multithreading yet while there is ample opportunity for it + * Map-reduce over files of a project for example +* We want something which is “safe” to fit the Rascal design: + * No null pointers or uninitialized values or computations which return null + * No concepts of \`threads\` or \`processes\`, just concurrently running functions finishing sooner because they may run in parallel + * Seamless integration of exceptions + * No garbled stack traces + * No leakage of JVM thread semantics into stacktraces +* The goal is \_not\_ to invent a generic safe concurrent programming model, the goal is \_only\_ to allow Rascal code to run concurrently without stepping out of the language. + * A generic safe concurrent programming model would not fit and would interact with way too many design elements; + * We don’t need a generic concurrent programming model to satisfy the need for using parallelism in Rascal. 
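The behaviour summarized in the abstract (spawn one block per generator result, wait for all of them, return the results in spawning order) can be approximated in plain Java with `java.util.concurrent`. The sketch below is only an illustration under assumptions: the class `ForkSketch` and the helper `forkAll` are hypothetical names and not part of Rascal or its runtime, it creates a pool per call instead of the shared global pool discussed later, and it reports the first failing block in spawning order rather than the first in time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ForkSketch {

    // Hypothetical helper: runs all blocks concurrently and returns their
    // results in spawning order, like the proposed `fork` with `append`.
    static <T> List<T> forkAll(List<Callable<T>> blocks) throws Exception {
        // Assumption taken from the text: one thread fewer than available cores.
        int threads = Math.max(1, Runtime.getRuntime().availableProcessors() - 1);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // The position in this list plays the role of the "spawning index".
            List<Future<T>> futures = new ArrayList<>();
            for (Callable<T> block : blocks) {
                futures.add(pool.submit(block));
            }
            // Block until every computation has finished, collecting in order.
            List<T> results = new ArrayList<>();
            for (Future<T> future : futures) {
                results.add(future.get());
            }
            return results;
        } catch (ExecutionException e) {
            // Re-throw the exception of the failed block itself when possible.
            Throwable cause = e.getCause();
            if (cause instanceof Exception) {
                throw (Exception) cause;
            }
            throw e;
        } finally {
            // Interrupts blocks that are still running (only relevant on failure).
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        List<Callable<Integer>> blocks = new ArrayList<>();
        for (int i = 0; i < 5; i++) {
            final int captured = i; // captured by value, as the proposal requires
            blocks.add(() -> captured * captured);
        }
        System.out.println(forkAll(blocks)); // prints [0, 1, 4, 9, 16]
    }
}
```

Because every block only reads values it captured and writes to its own result slot, no explicit locking appears in the sketch, which is the property the proposal relies on.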
+ +## Specification + +* syntax Statement \= fork: Label label "fork" "(" {Expression ","}+ generators ")" Statement body +* Generators have the same semantics as in if-statements and for-loops, including filtering, backtracking, etc. There are no exceptions to this rule. +* The existing \`append\` statement interacts with \`fork\` just like in \`for\` loops + * However, it does not add to the end of the list but inserts the value in order of creation of the running block (so the order of the generators\!) + * \`append\` stores the result of a computation at a position in the result list which corresponds to the order in which that code was spawned by the fork loop. + * The length of the returned list is always less than or equal to the number of spawned computations by the \`fork\` loop. If some closure stopped via a \`break\` or \`continue\` statement, then certain results may be missing. + * The static type of the fork result is “list-of” the least upper bound of all static types of the append expressions in the fork block, in the worst case \`list\[value\]\`. +* The body of the fork statement is interpreted similarly to a closure: + * The body is the body of a closure whose formal parameters are all the variables captured by the body. + * All variables captured by the fork scope are **captured by-value** + * This includes lexically nested closures using variables from outer scopes\! + * The closure returns the values provided by \`append\` statements + * The closure returns \`void\` on \`break\` and \`continue\` statements + * Due to the **immutability of the captured values, no locking is required**, and this is where Rascal benefits from immutability in the presence of concurrency. 
+* Run-time semantics of fork: + * First all closures are created by executing through the generators; each closure has a “spawning-index” which indicates its order of occurrence in the \`fork\` loop + * Then the closures are registered and executed with a (global) thread-pool which is configured automatically by inspecting JVM configuration parameters (e.g. one thread fewer than the number of available cores) + * The \`append Exp\` statement translates to \`return Exp;\` in the closure + * Blocking semantics of \`fork\`: + * The fork loop blocks until all closures have finished executing if the body contains an \`append\` statement + * The fork loop also blocks if the body does not contain an \`append\` statement. This is to guarantee that any nested IO has finished when we exit the fork block. + * \`break;\` and \`continue;\` terminate a closure without returning a value. They both translate to \`return;\` in the closure and do not differ in semantics. Care has to be taken with the labels, since local for loops and while loops also interpret \`break\` and \`continue\`. + * The return values of the closures are collected in an array at their spawning-index, every closure \_must\_ return a value. + * The spawning index is a unique reference into the array, which is why writing into the array is **lock-free.** + * The \`fork\` loop returns a \`list\` which contains the array of results in the order of the spawning-index + * The static type of the result is “list-of” the least upper bound of all static types of the append expressions in the fork block. + * If an exception is thrown by at least one of the closures: + * It is re-thrown by the \`fork\` loop; + * But not before the other closures are signaled to cancel. + * Their results are lost. + * Cancellation of running closures is guaranteed on loop iteration positions and function calls (i.e. recursion). + * The stack trace shows the failed closure on top of the path leading to the fork loop. 
+ * Only the first exception is reported. If other threads throw exceptions just before they could be canceled, their stacktraces are lost. + * Notes on safety: + * The above design guarantees safety against races and deadlocks (up to a point) due to the absence of variable reference capturing; + * The only “accidental” way for the concurrent blocks to write back to shared data is via the result array which has no aliases per position + * Each loop iteration has access to the same shared but immutable data, this is cheap since only references to shared data need to be passed. (Shallow) cloning is never even necessary to share data safely. + * Deadlocks and races are not possible for code which does not pass closure values to the body of fork loops. + * Deadlocks and races are indeed pretty hard to construct, but still possible: + * The fork would need to share references to previously constructed variable-reference capturing closures, or share file locks. + * To make this sane and usable, at least all such captured variables should be declared \`volatile\` in the generated code. However, this may be overkill. + * Open question: do we need to provide locking mechanisms for reference captured variables on the language level? + * Hopefully not, since it is only required by corner cases due to the above design. Almost never will writable references be captured by concurrent blocks of code. + * Perhaps these corner cases can be detected statically, but that is doubtful since closures can hide as generic \`value\`-typed values anywhere on the stack or on the heap. +* Notes on fairness: + * Siblings of fork loops are not required to be fairly treated, i.e. no preemptive scheduling is required; + * Nested fork-loops require some degree of fairness, i.e. an outer fork must block and release its thread when they’re waiting on an inner block (possibly executing on another thread) to finalize. 
+ * This is important when implementing the interaction with the thread pool: while waiting for the results of all forked blocks, the waiting thread should yield its thread to the thread pool such that the things it depends on can terminate. +* Note on efficiency: + * Nested forks may lead to a lot of intermediate list construction. This is very similar to the nested/recursive construction of long strings via templates. It could be smart to implement lazy and balanced concatenation for lists for this, like we did for strings. + * Java streams do this as well, but not balanced. + +## Examples + +// parallel analysis of all files in a project +facts \= **fork** (/**loc** f \<- crawl(myProject)) { + **append** analyze(f); +} + +// map-reduce over a tree +**int** sum(**int** i) \= i; +**int** sum(**node** n) { + result \= **fork**(child \<- n) { // map + **append** sum(child); + } + **return** (0 | **it** \+ i | i \<- result); // reduce +} + +## Backwards Compatibility + +* This is a new feature with a new syntax, so no old code uses it +* Retaining the semantics of exceptions is important for language design consistency, and for debugging parallel programs as well. + +## Implementation + +* The cancellation policy on exceptions is the hardest part to implement; it may interact with the \_monitor feature\_ of Rascal, which also has a cancellation feature. + * Cancellation is implemented via throwing InterruptedException if a thread is signaled +* Compiling the block part of a fork to a value-capturing closure is “easy” +* The thread-pool semantics should be based on java.util.concurrent standard library primitives, including: + * the handling of exceptions + * the spawning of jobs + * the collection of job results + * \`fork\` loops may be nested, via function calls or directly, implying: + * The thread-pool semantics should be thread-safe itself + * A global thread-pool or a local thread-pool per fork? 
+ * Thread pools are heavy objects + * Fork loops should be cheap and not spawn threads all the time + * Nested fork loops should share cores efficiently and not fight for resources + * It looks like investing in an intricate but thread-safe global thread pool for fork loops is best +* All captured variable references in all closures should become \`volatile\`?? + * In case we use closures which capture the same reference from different parallel computations. + * Still not sure about this; we also just might let this go wrong. Sharing captured variables between parallel fork jobs is simply not a good idea anyway. +* No additional features to the static checker seem necessary, however: + * That assumption requires wiring the by-value passing semantics on the outside of the blocks, and renaming all the variables used inside, or simulating local function definition semantics with shadowing formal parameters. + * The goal would be to compile closures defined \_inside\_ the fork loop without changing the compiler for the closure functionality. + +## References + +* https://stackoverflow.com/questions/106591/do-you-ever-use-the-volatile-keyword-in-java diff --git a/courses/RascalAmendmentProposals/RAP9/RAP9.md b/courses/RascalAmendmentProposals/RAP9/RAP9.md new file mode 100644 index 000000000..09b03f8af --- /dev/null +++ b/courses/RascalAmendmentProposals/RAP9/RAP9.md @@ -0,0 +1,166 @@ +--- +title: RAP 9 - Events - a simple unified intermediate format for exceptions, errors, asserts, test results and Java stack traces +sidebar_position: 9 +--- + +| RAP | 9 | +| :---- | :---- | +| Title | Events - a simple unified intermediate format for exceptions, errors, asserts, test results and Java stack traces | +| Author | Jurgen Vinju | +| Status | Draft | +| Type | Rascal Language | + +## Abstract + +Proposal to merge the (open-ended) data-types for Rascal exceptions, Java exceptions, error messages, assert failures and test results into a single common representation. 
+ +The representation features a cause tree, source locations, context parameters and user-friendly messages. This new representation can be used to \`throw\` exceptions, collect static error messages and warnings, and report on test results. All in the same unified manner. + +Different back-end tools can be reused based on this shared representation, such as UI (error panes, highlights, stack traces) and analysis (root cause analysis, statistical debugging). + +## Motivation + +Exceptions are modeled as values in Rascal. In the standard library we have a common data-type called \`RuntimeException\` for them. A special kind of exception is reserved for assertion failures. Then we have \`Message\` to represent errors (info, warning, error, fatal), and we have test results (hidden inside the quick check feature). Finally normal Java exceptions also play an important role in our runtime which are now all mapped to the \`Java(str message)\` Exception constructor. + +These formats all share characteristics: + +* They are an intermediate representation between some thing that happened and its reporting in the UI; +* They each require user reporting, causal tracing and analysis; +* They are all \`open\` ended, in the sense that many types of exceptions, errors and test results will exist which can not be thought of in advance. + +In the interest of simplicity and reuse (of representation, of analysis and of UI components), we can unify these representations into a single data-type. This will remove unnecessary code from different kinds of tools and enable newbies to start off with something that has everything they needed before they even thought of it. + +## Specification + +data Severity \= info() | warning() | error() | fatal(); + +@synopsis{An event is something to throw or report which has a severity, a source location, a cause, and a user-friendly message. 
Event kinds are modelled using an open-ended set of constructors of the data-type Event} +data Event( + Severity severity \= info(), + loc src \= |unknown:|, + Event cause \= root(), + str message \= "" +); + +@synopsis{Events can be composed using boolean logic to explain their causes} +data Event + \= and(list\[Event\] causes) // all these causes are required to produce the event + | or(list\[Event\] causes) // at least one of these causes was required + | root() // this is the root cause of the event + | not(Event cause) // the absence of this event caused the event + ; + +// for the above it could be interesting to generate messages which compose +// the messages of the constituent causes. Elided here for the sake of brevity. + +@synopsis{Events to model stack traces} +data Event + \= methodCall(str class, str name, Severity severity=info()) + | functionCall(str module, str name, Severity severity=info()) + | mainCall(str class) + | staticInit(str class) + ; + +@synopsis{Events to model test results and assertion results, with their causes.} +data Event + \= testFailure(str name, map\[str, value\] parameters, Severity severity=error()) + | testSuccess(str name, map\[str, value\] parameters, Severity severity=info()) + | assertFailed(Severity severity=error()) + | assertSucceeded(Severity severity=info()) + | expectedEqual(value lhs, value rhs) + | expectedUnequal(value lhs, value rhs) + | expectedMatch(value lhs, value rhs) + | expectedNoMatch(value lhs, value rhs) + ; + +@synopsis{Events to model static errors and warnings} +data Event + \= unexpectedType(Type got, Type expected, Severity severity=error()) + | undeclaredName(str name, Severity severity=error()) + ; /\* etc \*/ + +@synopsis{Events to model run-time errors} +data Event + \= matchFailed(Severity severity=warning()) + | divisionByZero(Severity severity=error()) + | permissionDenied(Severity severity=error()) + ; /\* etc \*/ + +1. a “stack trace” is modeled via the “cause” of an event. + 1. 
The \`throw\` statement, when confronted with an expression of sub-type \`node\`, will reify the stack by adding to the \`causes\` field the element \`functionCall(...)\` that represents the current function with its current parameters. + 1. If the Event thrown already has user-defined causes, then these are kept as conjunctive causes. For example, if the current throw is part of a catch block, the programmer might link the caught Event into the current Event by setting: \`catch Event caughtEvent: { throw newEvent(cause=caughtEvent); }\` + 2. Every functionCall on the causes chain has its own singleton cause, namely its own caller. + 3. \`throw\` will also generate the \`src\` location of itself into the Event, and also annotate functionCall and methodCall values with this information where possible. + 4. the causes of a test failure may be the \`expected\*\` events or an exception thrown during the execution of the test, or both. +2. static error messages may be caused by observations or inferences from the source code rather than run-time events. + 1. The Event data-type may wrap arbitrarily complex symbols to represent the results of name and type analysis which are relevant to explain an error message. + 2. The \`not\` event can be used to express that some errors are caused by the absence of something rather than its presence. For example: \`not(subTypeOf(x, y))\` explains that an error is caused because a required event (x being a subtype of y) has not happened. + 3. Errors may also be linked in this way, and repeated errors can be cleaned up if they are caused by other errors, to prevent spamming the user. This clean-up can be done generically based on a list of Events rather than the checker having to manage these dependencies explicitly. All the checker has to do is to record meticulously what causes every error. +3. The standard attributes of \`Events\` can be produced automatically by \`throw\` but also provided programmatically (e.g. by a type checker or test runner). 
This makes no difference for the downstream processing of Events. +4. We have to rewrite all existing Messages, Exceptions and test result representations + 1. Declarations in Message, Exception + 2. Representations in QuickCheck +5. Java exceptions are modeled as constructors of Event as well. If not known in advance they are declared and generated at run-time using Java reflection. + 1. the name of the event constructor is the simple name of the Java Throwable class: e.getClass().getSimpleName(). + 2. the message is set to \`e.getMessage()\` + 3. the causes are set to the top of the stack-trace, recursively represented by \`methodCall\` Event constructors + 4. The Java \`getCause()\`, if not \`null\`, is a conjunctive cause of the top exception, so if it exists a Java exception will have both a stacktrace of its own and another exception as causes. + +## Examples + +data Event \= insufficientGiniDatapoints(str message="The Gini coefficient requires at least three datapoints to make sense"); // default keyword parameter links event kind to user-friendly message + +int giniCoefficient(list\[num\] data) { + if (size(data) \<= 2\) + throw insufficientGiniDatapoints(); // throw fills in causality (trace, location) + … +} + +data Event \= failedTodo(str title, str message="the task \<title\> failed to complete"); + +void process(list\[TODO\] x) { + for (t \<- x) { + try { + t.task(); + } + catch Event e : { + throw failedTodo(t.title, causes=\[e\]); // link the cause and the stack trace + } + } +} + +data Event \= unexpectedType(str exp, Type actual, Type expected, Severity severity=error(), str message="\<exp\> requires parameters of type \<expected\>, but we have a \<actual\> here." ); +list\[Event\] typecheck(Program p) \= \[\*check(ast) | /Expression ast := p\]; +list\[Event\] check(add(Exp l, Exp r)) \= \[unexpectedType("addition", l.type, \\int(), src=l.src)\] + when l.type \!= \\int(); +default list\[Event\] check(Expression e) \= \[\]; + +## Backwards Compatibility + +1. 
Code which produced the Message datatype or threw exceptions of the \`Exception\` data-type will have to be reviewed and modified + 1. TypePal will need to be adapted to generate Events rather than Messages + 2. We have to move user-friendly messages to the declaration sites of error kinds. +2. UI facing components, such as Eclipse support and LSP support, will have to be adapted. +3. The test runners will need to be extended to report in this style. This is a new feature with no backward compatibility issues. + +## Implementation + +* Code generated for the throw statement needs changing +* Code generated for \`java\` methods needs to be wrapped with a try-catch, such that the stack trace can be reified as \`methodCall\` events, and the exception can be modeled as a new constructor of Event using reflection. E.g: + * catch (Throwable e) { + * C \= tf.constructor(Event, e.getClass().getSimpleName()); + * throw vf.constructor(C).setParameter("causes", …); + * } +* Reifying the stack trace as Event::methodCall and Event::functionCall IConstructors may be quite an expensive operation. + * It could be worth the trouble to implement IConstructor again especially for these two constructors, and let them wrap a JVM exception trace for on-demand reification. + * And lazily produce the next wrapper for their \`causes\` on request (i.e. lazily implementing \`getParameter\` or \`getField\`) + * Possibly a generic \`*LazyConstructor* class *implements* *IConstructor\`* could be added to Vallang which would take Producer\ as constructor parameters rather than IValues directly, and a \`*LazyKeywordParameterWrapper implements IWithKeywordParameters\`*, likewise would take Producer\ lambdas. 
+* I know of no way to retrieve method parameter values from Java stack traces, except: + * [https://github.com/cretz/stackparam](https://github.com/cretz/stackparam) +* Java 9 introduced a new stack trace API, \`StackWalker\`: + * [https://docs.oracle.com/javase/10/docs/api/java/lang/StackWalker.html](https://docs.oracle.com/javase/10/docs/api/java/lang/StackWalker.html) +* Displaying stack traces: + * The REPL would print the stack trace in a traditional manner rather than showing the actual value of the stack trace. + + +## diff --git a/courses/RascalAmendmentProposals/RascalAmendmentProposals.md b/courses/RascalAmendmentProposals/RascalAmendmentProposals.md new file mode 100644 index 000000000..3d8fd39c4 --- /dev/null +++ b/courses/RascalAmendmentProposals/RascalAmendmentProposals.md @@ -0,0 +1,33 @@ +--- +title: Rascal Amendment Proposals +sidebar_position: 8 +keywords: + - RFC + - RAP + - architecture + - design + - compatibility +--- + +Rascal Amendment Proposals are short documents that motivate and detail significant changes to +the Rascal language, its implementation architecture or its ecosystem of library packages. + +The following documents are thoughts on (future) changes to Rascal which may or may +not be turned into actual maintenance projects on the language. Completed RAPs are marked complete. 
+ +* [ ] ((RAP1)) - Rascal deployment and package management +* [ ] ((RAP2)) - "Types are parsers" +* [ ] ((RAP3)) - Concrete patterns for external parsers +* [ ] ((RAP4)) - Rascal functions documented +* [ ] ((RAP5)) - A single exact number type +* [ ] ((RAP6)) - Disentangle semantics of import and extend +* [ ] ((RAP7)) - Final pattern variables +* [ ] ((RAP8)) - Simple and almost safe concurrency +* [ ] ((RAP9)) - "Events": a simple unified intermediate format for exceptions, errors and test results +* [ ] ((RAP10)) - Concurrent source location access +* [ ] ((RAP11)) - New datetime implementation plus support for partial datetime +* [ ] ((RAP12)) - Separate string editing from visit statement +* [ ] ((RAP13)) - Name-parametrized syntax modifiers +* [ ] ((RAP14)) - Backward compatibility for Rascal modules +* [ ] ((RAP15)) - Conditional patterns and removal of accidental non-linear matching + diff --git a/pom.xml b/pom.xml index 8a1e6d58d..edb3d2cab 100644 --- a/pom.xml +++ b/pom.xml @@ -206,6 +206,7 @@ ${project.basedir}/courses/GettingStarted ${project.basedir}/courses/Rascalopedia ${project.basedir}/courses/Bibliography + ${project.basedir}/courses/RascalAmendmentProposals ${project.basedir}/courses/Packages