@@ -165,3 +165,120 @@ For embedders, it may be important to be able to interrupt Python threads by
165
165
other means. We use the TruffleSafepoint mechanism to mark our threads waiting
166
166
to acquire the GIL as blocked for the purpose of safepoints. The Truffle
167
167
safepoint action mechanism can thus be used to kill threads waiting on the GIL.
168
+
169
+ ## C Extensions and Memory Management
170
+
171
+ ### High-level
172
+
173
+ C extensions assume reference counting, but on the managed side we want to leverage
174
+ Java tracing GC. This creates a mismatch. The approach is to combine the two.
175
+
176
+ On the native side we use reference counting. The native code is responsible for doing
177
+ the counting, i.e., calling the ` Py_IncRef ` and ` Py_DecRef ` API functions. Inside those
178
+ functions we add special handling for the point when the first reference from the native
179
+ code is created and when the last reference from the native code is destroyed.
180
+
181
+ On the managed side we rely on tracing GC, so managed references are not ref-counted.
182
+ For the ref-counting scheme on the native side, we approximate all the managed references
183
+ as a single reference, i.e., we increment the refcount when the object is referenced from managed
184
+ code, and using a ` PhantomReference ` and reference queue we decrement the refcount when
185
+ there are no longer any managed references (but we do not clean the object as long as
186
+ ` refcount > 0 ` , because that means that there are still native references to it).
187
+
188
+ ### Details
189
+
190
+ There are two kinds of Python objects in GraalPy: managed and native.
191
+
192
+ #### Managed Objects
193
+
194
+ Managed objects are allocated in the interpreter. If there is no native code involved,
195
+ we do not do anything special and let the Java GC handle them. When a managed object
196
+ leaks to a native extension:
197
+
198
+ * We wrap it in `PythonObjectNativeWrapper`. This is mostly in order to provide a different
199
+ interop protocol: we do not want to expose ` toNative ` and ` asPointer ` on Python objects.
200
+
201
+ * When NFI or Sulong calls `toNative`/`asPointer`, we:
202
+ * Allocate C memory that will represent the object on the native side (including the refcount field)
203
+ * Add a mapping of that memory address to the ` PythonObjectNativeWrapper ` object to a hash map ` CApiTransitions.nativeLookup ` .
204
+ * Initialize the refcount field to the constant `MANAGED_REFCNT` (a large number, because some
205
+ extensions like to special case on some small refcount values)
206
+ * Create `PythonObjectReference`, a weak reference to the `PythonObjectNativeWrapper`.
206
+ When this reference is enqueued (i.e., no managed references exist), we decrement the refcount by
207
+ `MANAGED_REFCNT` and if the refcount falls to `0`, we deallocate the native memory of the object;
209
+ otherwise we need to wait for the native code to eventually call ` Py_DecRef ` and make it ` 0 ` .
210
+
211
+ * When extension code wants to create a new reference, it will call ` Py_IncRef ` .
212
+ In the C implementation of ` Py_IncRef ` we check if a managed object with
213
+ `refcount==MANAGED_REFCNT` wants to increment its refcount. In such a case, the native code is
214
+ creating a first reference to the managed object, so we must make sure to keep the object alive
215
+ as long as there are some native references. We set a field ` PythonObjectReference.strongReference ` ,
216
+ which will keep the ` PythonObjectNativeWrapper ` alive even when all other managed references die.
217
+
218
+ * When extension code is done with the object, it will call ` Py_DecRef ` .
219
+ In the C implementation of `Py_DecRef` we check if a managed object with `refcount==MANAGED_REFCNT+1`
220
+ wants to decrement its refcount to `MANAGED_REFCNT`, which means that there are no native references
221
+ to that object anymore. In such a case we clear the `PythonObjectReference.strongReference` field,
222
+ and the memory management is then again left solely to the Java tracing GC.
223
+
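The lifecycle in the list above can be sketched as a toy model in Python. This is only an illustration, not GraalPy code: `Wrapper`, `ObjectReference`, `inc_ref`, `dec_ref` and the value `10` for `MANAGED_REFCNT` are hypothetical stand-ins, and a `weakref` callback plays the role of the Java reference queue.

```python
import weakref

MANAGED_REFCNT = 10  # illustrative value only; the real constant is not given here


class Wrapper:
    """Toy stand-in for PythonObjectNativeWrapper."""


class ObjectReference:
    """Toy stand-in for PythonObjectReference plus the native refcount field."""

    def __init__(self, wrapper):
        self.refcount = MANAGED_REFCNT   # all managed references count as one block
        self.strong_reference = None     # set only while native references exist
        self.native_memory_freed = False
        # The callback stands in for processing the reference queue.
        self._weak = weakref.ref(wrapper, self._on_enqueued)

    def _on_enqueued(self, _ref):
        # No managed references are left: drop the managed share of the refcount.
        self.refcount -= MANAGED_REFCNT
        if self.refcount == 0:
            self.native_memory_freed = True

    def inc_ref(self):
        # First native reference: keep the wrapper alive with a strong reference.
        if self.refcount == MANAGED_REFCNT:
            self.strong_reference = self._weak()
        self.refcount += 1

    def dec_ref(self):
        self.refcount -= 1
        if self.refcount == MANAGED_REFCNT:
            # Last native reference is gone: hand the object back to the tracing GC.
            self.strong_reference = None
        elif self.refcount == 0:
            self.native_memory_freed = True
```

In this model, dropping every managed reference while a native reference exists does not free anything, because `strong_reference` still holds the wrapper. Only after the matching `dec_ref` clears that field can the wrapper be collected and the native memory released. The callback fires immediately on CPython; other runtimes may need an explicit `gc.collect()`.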
224
+ #### Native Objects
225
+
226
+ Native objects are backed by native memory and might never leak to managed code. If they do not
227
+ leak to managed code, they are reference counted as usual, where a `Py_DecRef` call that reaches
228
+ ` 0 ` will deallocate the object. If a native object does leak to managed code:
229
+
230
+ * We increment the refcount of the native object by ` MANAGED_REFCNT `
231
+ * We create:
232
+ * ` PythonAbstractNativeObject ` Java object to represent it
233
+ * ` NativeObjectReference ` , a weak reference to the ` PythonAbstractNativeObject ` .
234
+ * Save the mapping from the native object address to the ` NativeObjectReference `
235
+ object into the hash map `CApiTransitions.nativeLookup` (next time this native object leaks to
236
+ the managed code, we only fetch the existing wrapper and don't do any of this).
237
+ * When ` NativeObjectReference ` is enqueued, we decrement the refcount by ` MANAGED_REFCNT `
238
+ and if it falls to ` 0 ` , it means that there are no references to the object even from
239
+ native code, so we can destroy it. If it does not fall to `0`, we just wait for the native
240
+ code to eventually call ` Py_DecRef ` that makes it fall to ` 0 ` .
241
+
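A minimal Python sketch of the native-object case follows. It is a model under stated assumptions rather than the actual implementation: `NativeObject`, `ManagedWrapper`, `to_managed` and the module-level `native_lookup` dictionary are hypothetical names, the value of `MANAGED_REFCNT` is arbitrary, and a `weakref` callback again stands in for the reference queue.

```python
import weakref

MANAGED_REFCNT = 10  # illustrative value only

native_lookup = {}   # address -> weak reference, modelled on CApiTransitions.nativeLookup


class NativeObject:
    """Toy stand-in for a refcounted object living in native memory."""

    def __init__(self, address):
        self.address = address
        self.refcount = 1        # the creating native code holds one reference
        self.destroyed = False

    def dec_ref(self):
        self.refcount -= 1
        if self.refcount == 0:
            self.destroyed = True


class ManagedWrapper:
    """Toy stand-in for PythonAbstractNativeObject."""

    def __init__(self, native):
        self.native = native


def to_managed(native):
    """Model a native object leaking to managed code."""
    ref = native_lookup.get(native.address)
    wrapper = ref() if ref is not None else None
    if wrapper is not None:
        return wrapper           # already leaked before: reuse the wrapper

    native.refcount += MANAGED_REFCNT
    wrapper = ManagedWrapper(native)

    def on_enqueued(_ref):
        # No managed references are left: give back the managed share.
        native_lookup.pop(native.address, None)
        native.refcount -= MANAGED_REFCNT
        if native.refcount == 0:
            native.destroyed = True

    native_lookup[native.address] = weakref.ref(wrapper, on_enqueued)
    return wrapper
```

Calling `to_managed` twice on the same object returns the same wrapper and increments the refcount only once. When the wrapper dies, the object survives as long as native code still holds references, and a later `dec_ref` destroys it.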
242
+ ### Cycle GC
243
+
244
+ We leverage CPython's GC module to detect cycles for objects that participate
245
+ in the reference counting scheme (native objects or managed objects that leaked to native).
246
+ See: https://devguide.python.org/internals/garbage-collector/index.html .
247
+
248
+ There are two issues:
249
+
250
+ * Objects that are referenced from the managed code have `refcount >= MANAGED_REFCNT` and
251
+ until Java GC runs we do not know if they are garbage or not.
252
+ * We cannot traverse the managed objects: since we don't do refcounting on the managed
253
+ side, we cannot traverse them and decrement refcounts to see if there is a cycle.
254
+
255
+ The high-level solution is that when we see a "dead" cycle going through a managed object
256
+ (i.e., cycle not referenced by any native object from the "outside" of the collected set),
257
+ we fully replicate the object graphs (and the cycle) on the managed side (native objects
257
+ in the cycle that were not yet referenced from managed code get a new `NativeObjectReference`
258
+ and have their refcount incremented by `MANAGED_REFCNT`). Managed objects already refer
260
+ to the ` PythonAbstractNativeObject ` wrappers of the native objects (e.g., some Python container
261
+ with managed storage), but we also make the native wrappers refer to whatever their referents
262
+ are on the Java side (we use ` tp_traverse ` to find their referents).
263
+
264
+ Then we make the managed objects in the cycle only weakly referenced on the Java side.
265
+ One can think about this as raising the baseline reference count at which the
266
+ object becomes eligible for being GC'ed and thus freed. Normally when the object has
267
+ ` refcount > MANAGED_REFCNT ` we keep it alive with a strong reference assuming that
268
+ there are some native references to it. In this case, we know that all the native
269
+ references to that object are part of a potentially dead cycle, and we do not
270
+ count them into this limit. Let us call this limit the *weak to strong limit*.
271
+
272
+ After this, if the managed objects are garbage, eventually Java GC will collect them
273
+ together with the whole cycle.
274
+
275
+ If some of the managed objects are not garbage, and they leak back to native code,
276
+ the native code can then access and resurrect the whole cycle. With respect to refcount
277
+ integrity this is fine, because we did not alter the refcounts. The native references
278
+ between the objects are still factored in their refcounts. What may seem like a problem
279
+ is that we pushed the *weak to strong limit* for some objects. Such an object may leak to
280
+ native and get `Py_IncRef`'ed, making it a strong reference again. Since `Py_DecRef` is
281
+ checking the same ` MANAGED_REFCNT ` limit for all objects, the subsequent ` Py_DecRef `
282
+ call for this object will not detect that the reference should be made weak again!
283
+ However, this is OK; it only delays the collection: we will make it weak again in
284
+ the next run of the cycle GC.
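
The delayed collection described above can be traced with a small Python model. Everything here is an assumption made for illustration: `TrackedReference`, `cycle_gc_weaken`, `inc_ref` and `dec_ref` are hypothetical names, the value of `MANAGED_REFCNT` is arbitrary, and the model assumes that any new native reference makes the Java-side reference strong again.

```python
MANAGED_REFCNT = 10  # illustrative value only


class TrackedReference:
    """Toy model: a refcount plus a flag saying whether the Java-side reference is strong."""

    def __init__(self, native_refs):
        self.refcount = MANAGED_REFCNT + native_refs
        self.strong = native_refs > 0


def cycle_gc_weaken(ref):
    # The cycle GC found that all native references come from a potentially
    # dead cycle, so the object becomes weakly referenced despite its refcount.
    ref.strong = False


def inc_ref(ref):
    ref.refcount += 1
    ref.strong = True  # assumption: a new native reference makes it strong again


def dec_ref(ref):
    ref.refcount -= 1
    # The check uses the same fixed limit for every object.
    if ref.refcount == MANAGED_REFCNT:
        ref.strong = False


ref = TrackedReference(native_refs=2)  # two references from inside the cycle
cycle_gc_weaken(ref)                   # limit effectively pushed to MANAGED_REFCNT + 2
inc_ref(ref)                           # object leaks back to native code
dec_ref(ref)                           # refcount is MANAGED_REFCNT + 2, not MANAGED_REFCNT
still_strong_after_decref = ref.strong
cycle_gc_weaken(ref)                   # the next cycle GC run makes it weak again
```

After the `dec_ref` call the refcount is back at `MANAGED_REFCNT + 2`, which the fixed check does not recognise as the pushed limit, so the reference stays strong until the next cycle GC run.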