Completing CLR interop
Generally, CLR interop works the same as JVM interop. However, there are some unique aspects of the CLR that ClojureCLR does not yet address. Namely,
- ref and out parameters
- Type references (including generic types, nullable types, and assembly-qualified type names)
- Assembly references
- Multi-dimensional arrays
Below you will find some analysis of each of the points outlined above. I wanted to state each problem clearly and to lay out the competing options where they exist. However, I do have some preferences, so let me state them clearly here.
- Introduce syntactic forms for marking ref and out parameters in calls to methods.
- When at least one ref or out parameter is involved in a host expression, the call is set up to return a vector whose first element is the return value and whose successive elements are the values returned in the ref and out parameters.
- (Possible, not convinced) Introduce another binding form designed just for host expressions like these to avoid the overhead of vector creation.
Examples:
(let [ [p q r] (.method x 12 (refparam y) (outparam)) ... ] ... )
(with-results [ [p q r] (.method x 12 (refparam y) (outparam)) ... ] ... )
Introduce a new Lisp reader construct, probably |…|, to introduce arbitrary strings into symbol names directly. These would be used on either side of a / to deal separately with namespace name versus symbol name. This could be designed so that ab|$&*()|def creates a symbol with name "ab$&*()def", or so that |…| must surround the whole name, as in |ab$&*()def|. This would allow constructs such as:
(|com.myco.mytype+nested, MyAssembly, Version=1.3.0.0, Culture=neutral, PublicKeyToken=b14a123334343434|/DoSomething x y)
I’m not sure this is a complete answer, but something like the following extension to import would be nice:
(import
; establishes mapping from Class1 to Some.Namespace.Class1, etc.
'(Some.Namespace Class1 Class2)
; establishes mapping from Class3 to |Some.Namespace.Class3, assembly id|, etc.
'([ |Assembly id| Some.Namespace] Class3 Class4)
; establishes mapping from SomeSymbol to |Some.Namespace.Class5, assembly id|, etc.
'([ |Assembly id| Some.Namespace] [Class5 SomeSymbol] [Class6 AnotherSymbol] ))
CLR methods allow ref and out parameters. Two problems arise:
- We need to indicate in some way that ref or out parameters are being used in certain positions in method calls.
- We need to assign the changed/new values to something.
Having a positive indication is necessary: CLR allows overloading on ref/out parameters. In other words, the following is legal:
class Test
{
static void m(int x) { ... }
static void m(ref int x) { ... }
}
At present, a host expression of the form (.m v x) is ambiguous. Providing type hints, as in (.m v (int x)), still does not suffice. Even with reflection at runtime, there is no way to indicate that a ref parameter is being passed.
Two solutions jump to mind:
- metadata tagging: (.m v #^{:ref true} x)
- A special syntactic form, in the same sense as (int x) is used: (.m v (out x))
  - Unfortunately, ref is already in use.
Metadata tagging might work. It is not possible to tag primitive types (among others), as in (.m v #^{:ref true} 12), but that is not a real limitation here: for ref and out parameters we will need to have a variable anyway. However, metadata tagging might not be compatible with solutions to the second problem, assigning the changed values.
ref and out parameters imply an assignment of a value. Clojure does not have an assignment semantics for local variable names. It has been suggested that we allow ref and out parameters to be filled only by atoms/vars/refs, but that precludes the efficiency of local variable bindings and unnecessarily conflates this problem with the problems being solved by those mechanisms.
Possible solutions:
- Introduce a new mutable binding type solely for this application — one hesitates to do so
- Multiple return values
- Introduce a new binding form into the language
- Something I haven’t thought of — I hope
As for a new mutable binding type: let’s forget I mentioned it.
One way to handle ref and out parameters is to treat the changed values as multiple return values. In languages that don’t support multiple returns, this is typically done by returning a vector of values. We could just use the destructuring bind of a let to handle this. Calling a method with signature String m(Object x, ref Int32 y, out String z) would look like:
(let [ [p q r] (.m x (refparam y) (outparam)) ] ... )
This requires explicit indication of ref and out positions so that the destructuring bind for multiple return values can be distinguished from a destructuring of a seq return value from a regular method call.
- Advantage: no new mechanisms are required. Host expression analysis can detect the (clr:ref y) or (clr:out) forms syntactically and arrange for the appropriate machinery to be inserted around this call.
- Disadvantage: We force a vector to be created on each call. In a tight loop, this could have a non-trivial performance impact. Small vectors can be handled fairly cheaply, but there is still a cost.
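The vector convention can be tried out in plain Clojure without any new interop machinery. The following is only a sketch: m-as-vector is a hypothetical stand-in for a CLR method with the signature above, written as an ordinary function, and the values it computes are invented for illustration.

```clojure
;; Hypothetical stand-in for a CLR method with the signature
;;   String m(Object x, ref Int32 y, out String z)
;; Not real interop: it only mimics the proposed convention by returning
;; [return-value new-y new-z] as a vector.
(defn m-as-vector
  [x y]
  (let [new-y (inc y)               ; value the method would write back to y
        z     (str x "-" new-y)]    ; value the method would write to z
    [(str "result:" x) new-y z]))

;; Ordinary let destructuring picks the three values apart.
(let [[p q r] (m-as-vector "a" 41)]
  (println p q r))
```

The allocation of the three-element vector on every call is exactly the cost named in the disadvantage above.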
With a new binding form, we need to provide a scope for the variables that receive the values. A form such as
(in-out [ [p q r] (.m x (clr:ref x) (clr:out)) ... ] ... )
follows the pattern of let. Only host expressions would be allowed in the value positions. The only point would be to avoid the vector creation. A host expression not occurring within this special form could either return just the return value of the call or operate as multiple return values via a vector. The latter is most likely preferable, as it coexists with the let solution.
If you have a solution I haven’t thought of: please, go for it.
Clojure uses symbols to name types in two ways:
- a package-qualified symbol (one containing periods internally) is taken to name the Java class with the same character sequence
- a namespace may contain a mapping from a symbol to a Java class, via import.
Resolving a symbol is the process of determining the value of a symbol during evaluation. Relevant pieces of code from Compiler.resolveIn:
...
else if(sym.name.indexOf('.') > 0 || sym.name.charAt(0) == '[')
    {
    return RT.classForName(sym.name);
    }
...
else {
    Object o = n.getMapping(sym);
    if(o == null) {
        if(RT.booleanCast(RT.ALLOW_UNRESOLVED_VARS.deref())) {
            return sym;
        }
        else {
            throw new Exception("Unable to resolve symbol: " + sym + " in this context");
        }
    }
    return o;
    }
}
There is similar code used by the syntax-quote processor in the Lisp reader.
Identifying types with symbol names works reasonably well for Java because package-qualified class names are syntactically compatible with symbols.
Not so for the CLR. Typenames can contain arbitrary characters. Backslashes can escape characters that do have special meaning in the typename syntax (comma, plus, ampersand, asterisk, left and right square bracket, left and right angle bracket, backslash). Fully-qualified type names can contain an assembly identifier, which involves spaces and commas. Thus, fully-qualified type names cannot be represented as symbols.
I do not see a way we can just use strings. We can use (symbol s) to construct a symbol from an arbitrary string but only by wrapping all interop statements with a syntax-quote. That can get nasty when trying to do a Type/member construct, such as:
`( ~(symbol "com.myco.mytype+nested, MyAssembly, Version=1.3.0.0, Culture=neutral, PublicKeyToken=b14a123334343434" "DoSomething") x y)
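To make the point about arbitrary strings concrete, here is a small sketch using the existing two-argument form of symbol, which takes a namespace string and a name string. The var names type-name and member-sym are made up for this example. It only shows that such a symbol can be constructed at runtime; it does not give a way to write one in source text.

```clojure
;; An assembly-qualified type name: spaces, commas, '+' and '=' make it
;; unreadable as a plain symbol token.
(def type-name
  "com.myco.mytype+nested, MyAssembly, Version=1.3.0.0, Culture=neutral, PublicKeyToken=b14a123334343434")

;; (symbol ns name) accepts arbitrary strings for both parts.
(def member-sym (symbol type-name "DoSomething"))

(namespace member-sym) ; the full assembly-qualified type name
(name member-sym)      ; "DoSomething"
```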
One solution would be to add new Lisp reader functionality that would allow arbitrary names and namespace names for symbols. It could be a special macro character, or a #-macro. This could be the rough equivalent of the Common Lisp | syntax, which is described as follows:
Vertical bars are used in pairs to surround the name (or part of the name) of a symbol that has many special characters in it. It is roughly equivalent to putting a backslash in front of every character so surrounded. For example, |A(B)|, A|(|B|)|, and A\(B\) all mean the symbol whose name consists of the four characters A, (, B, and ).
We would only need to do this for the namespace name and name parts, leaving the / separating namespace from name in the open. We could also simplify by surrounding the whole name and not only part of a name. The code above would become
(|com.myco.mytype+nested, MyAssembly, Version=1.3.0.0, Culture=neutral, PublicKeyToken=b14a123334343434|/DoSomething x y)
I would recommend either |…| or #|…| as the convention.
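As a rough illustration of the whole-name variant, the function below turns a token of the shape |namespace part|/name into a symbol. It is a sketch only: parse-barred-symbol is a hypothetical helper working on strings, not a reader extension, and it assumes that only the namespace part is wrapped in bars and that the wrapped text contains no | character.

```clojure
(defn parse-barred-symbol
  "Hypothetical helper: parses a token such as \"|Some.Type, Assembly id|/Member\"
  into a symbol whose namespace is the text between the bars.
  Tokens without the bar-wrapped prefix fall back to plain (symbol token)."
  [token]
  (if-let [[_ ns-part name-part] (re-matches #"\|([^|]*)\|/(.+)" token)]
    (symbol ns-part name-part)
    (symbol token)))

(parse-barred-symbol "|Some.Namespace.Class3, Assembly id|/DoSomething")
```

A real implementation would live in the Lisp reader and would also have to decide how such symbols are printed, so that they can be read back.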
As shown in the previous example, fully-qualifying type names with assembly names is uuuggly. And, in fact, we can’t do it at the moment. So how does ClojureCLR deal with type references at the moment? It looks for the type name in the current assembly and mscorlib (the default behavior of Type.GetType(String)). It then looks for the type name in all loaded assemblies. If there is a unique type with that name, it takes it. If there is not, then it fails.
Clojure on the JVM uses class loaders and classpath hacking to achieve type uniqueness.
ClojureCLR at the moment is not robust in handling type identity. A piece of code that evaluates properly one moment can be hosed on the next evaluation by the loading of an assembly between the two evals.
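The fragility is easy to see in a toy model. The sketch below is not ClojureCLR code: it represents the loaded assemblies as a plain map from a made-up assembly name to the set of type names it defines, and find-type is a hypothetical function that models only the "unique among all loaded assemblies" step described above.

```clojure
(defn find-type
  "Toy model: returns [assembly type-name] when exactly one loaded assembly
  defines type-name, and throws when the name is missing or ambiguous."
  [loaded type-name]
  (let [hits (for [[asm types] loaded
                   :when (contains? types type-name)]
               asm)]
    (if (= 1 (count hits))
      [(first hits) type-name]
      (throw (ex-info "Type is missing or ambiguous"
                      {:type type-name :assemblies (vec hits)})))))

;; Before: one assembly defines the type, so the lookup succeeds.
(def loaded-before {"AssemblyA" #{"Some.Namespace.Class1"}})

;; After: a second assembly defining the same name has been loaded.
(def loaded-after (assoc loaded-before "AssemblyB" #{"Some.Namespace.Class1"}))

(find-type loaded-before "Some.Namespace.Class1")
;; (find-type loaded-after "Some.Namespace.Class1") ; throws: now ambiguous
```

The same expression succeeds against loaded-before and fails against loaded-after, which is the behavior described above when an assembly is loaded between two evals.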
I don’t have a definitive answer to this. One solution is to extend namespace mapping of types to deal with this. I’m open to suggestions on the syntactical details, but something like:
(import
; establishes mapping from Class1 to Some.Namespace.Class1, etc.
'(Some.Namespace Class1 Class2)
; establishes mapping from Class3 to |Some.Namespace.Class3, assembly id|, etc.
'([ |Assembly id| Some.Namespace] Class3 Class4)
; establishes mapping from SomeSymbol to |Some.Namespace.Class5, assembly id|, etc.
'([ |Assembly id| Some.Namespace] [Class5 SomeSymbol] [Class6 AnotherSymbol] ))
I’m guessing this would handle most cases of potential ambiguity and greatly simplify user code.
The JVM does not have true multi-dimensional arrays, just ragged arrays. The core Clojure functions that manipulate multi-dimensional arrays assume raggedness.
The CLR of course has ragged arrays, but it also supports true (rectangular) multi-dimensional arrays. In the implementation of the core Clojure functions on the CLR, we assumed ragged arrays. Thus, we have no support for true multi-dimensional arrays.
The functions of interest are:
- (aget array idx+): Returns the value at the index/indices. Works on arrays of all types.
- (aset array idx+ val): Sets the value at the index/indices. Works on arrays of reference types. Returns val.
- (make-array class dim+): Creates and returns an array of instances of the specified class of the specified dimension(s).
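The ragged assumption behind these functions can be observed directly. The sketch below uses only existing core functions and an Object element type; it was written against Clojure on the JVM, and the name grid is just for illustration.

```clojure
;; A 2 x 3 array built with several dimensions is an array of row arrays.
(def grid (make-array Object 2 3))

(aset grid 0 1 :x)                    ; walks to row 0, then sets index 1
(aget grid 0 1)                       ; reads the value back through row 0

;; Because rows are independent arrays, a row can be swapped for one of a
;; different length, which a true rectangular array would not allow.
(aset grid 1 (make-array Object 5))

(count (aget grid 0))                 ; length of row 0
(count (aget grid 1))                 ; length of the replaced row 1
```

Each multi-index aget or aset first fetches the row array and then indexes into it, which is why a rectangular CLR array does not fit this model.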
We could easily overload make-array to take a second argument of a vector of ints specifying the dimensions. Thus:
(make-array Int32 4 5 6) ; => a ragged array
(make-array Int32 [4 5 6]) ; => a multi-dimensional array
Or we could just have a new function called make-multidim-array.
For aget and aset, I think overloading them in this way would not be advised due to performance implications. We can expect these functions to be called in tight loops. Better to introduce new functions:
(aget-md array idx+)
(aset-md array idx+ val)
We would also need to introduce equivalents to aset-int, etc.
I’m open to suggestions on names.