Skip to content

Conversation

@iluuu1994
Copy link
Member

@iluuu1994 iluuu1994 commented Mar 25, 2024

RFC: https://wiki.php.net/rfc/structs-v2

This is some early experimentation on structs, which are objects with value semantics (the same as arrays and strings). The main motivation is making data structures modeled with classes more ergonomic. For example, there's a desire for a faster Vector implementation (like the one from php-ds). However, reference types can lead to defensive copying to avoid accidental changes at a distance to an owned value, or bugs if copying is mistakenly omitted. readonly solves the latter by completely disallowing mutation. However, this essentially just forces a copy, which is bad for performance, especially for big data structures.

Structs instead automatically "separate" (clone) themselves from any other reference only once a modification is made to the object. If it is only referenced from a single place, no copy is needed. These are also called CoW (copy-on-write) semantics.

Method calls pose an interesting problem: When there is a chain of property accesses, ending in a method call, the entire chain must be separated to avoid the change from leaking to other references. However, with the standard method call syntax, it is not clear whether a call will refer to a self-mutating method call (imagine a Vector::push() method, requiring separation), or an immutable method (not requiring separation) in advance. To signal to the engine that the chain needs to be separated, as it would for arrays on $a['b']['c] = 'c';, we use the $parent->children->push!($child); syntax instead. If $parent is referenced from multiple places, it will be cloned and the clone will be stored in $parent. The same happens for the children property. This ensures that any other reference to the same instance as $parent remains unmodified.

Side note: We're looking for a different syntax for method calls, because Bob would like to use it for macros at some point.

TODOs:

  • mutating at decl-site support
  • Disallow SplObjectStorage/WeakMap. Hashing structs is complex and should be solved in a separate RFC.
    • References are problematic for hashes. If a data object contains references, the reference may implicitly change the hash for objects that are already stored in some bucket. Any future lookup will be impossible. So, using data objects containing references as keys must necessarily clone the object and unref any references.
    • SplObjectStorage internally adds objects to an array with the handle as the key. This key is guaranteed to be unique. Using a hash for structs will not be straight forward, because we cannot avoid hash collisions. Thus, we will either need support for structs in arrays themselves, or we'll need to add a higher-level bucket to the underlying array so that we can handle collisions ourselves.
    • WeakMap is a weird use-case for data values, because their RCs change unpredictably. It might be best to disallow them as WeakMap keys. Moreover, WeakMap will not add a refcount for its object key, making it possible to change the hash by modifying the object in-place, leaving the object in the wrong bucket.
  • Disallow ArrayObject to avoid uncontrolled changes and integer keys
  • Opcache/Optimizer
    • RC inference
    • JIT
  • Disallow ReflectionProperty::setValue() and ReflectionMethod::invoke(). They require @prefer-ref in userland for overrides of these methods.
  • Sending $this by reference within mutating methods is problematic. This is because the mutating method will addref to $this. Sending $this by ref would attempt to wrap it in a reference, and then separate it in the callee. This problem extends to non-static closures. This could be solved by setting EX(This) to a reference in the caller, so that the callee can forward this forward this reference. The mutating method would need to generate FETCH_THIS or a similar opcode to unwrap the reference, as well as one that does not (for sending by-ref, specifically). However, this sounds quite complicated, so it may be better to postpone this.

Benchmark:

Valgrind shows a small performance regression. The real-time showed a small slowdown of +0.07% for mean and +0.10% for fastest of 20 runs with -T10,300 of the Symfony Demo benchmark.

Details

Before:
2.995101 + 2.997169 + 2.998048 + 3.000360 + 3.000639 + 3.003076 + 3.003258 + 3.003537 + 3.003724 + 3.003811 + 3.004047 + 3.004114 + 3.004636 + 3.004637 + 3.004900 + 3.005175 + 3.005566 + 3.006546 + 3.007003 + 3.007804
After:
2.997989 + 3.000772 + 3.003236 + 3.003272 + 3.003358 + 3.003549 + 3.004557 + 3.004694 + 3.004716 + 3.004723 + 3.004893 + 3.005222 + 3.005734 + 3.007125 + 3.007395 + 3.007596 + 3.008235 + 3.008681 + 3.009244 + 3.010537

Mean: 100 / 3.00315755 * 3.0052764 = +0.07%
Fastest: 100 / 2.995101 * 2.997989 = +0.10%

@iluuu1994 iluuu1994 added the RFC label Apr 2, 2024
@@ -0,0 +1,27 @@
--TEST--
Copy link
Contributor

@mvorisek mvorisek Apr 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to see a test with spl_object_id() confirming the expected CoW semantic like:

$o = new data class;
$origId = spl_object_id($o);

$fx = function ($o) use ($origId) {
    var_dump(spl_object_id($o) - $origId);

    $o->x = 2; // mutate the data class instance to enforce object copy
    var_dump(spl_object_id($o) - $origId);
    $o->x = 3;
    var_dump(spl_object_id($o) - $origId);

    $o2 = $o;
    var_dump(spl_object_id($o2) - $origId);
    $o2->x = 3;
    var_dump(spl_object_id($o2) - $origId);
    $o2->x = 12;
    var_dump(spl_object_id($o2) - $origId);
    $o2->x = 13;
    var_dump(spl_object_id($o2) - $origId);
};
$fx();

var_dump(spl_object_id($o) - $origId);

@iluuu1994 iluuu1994 requested a review from dstogov May 6, 2024 17:14
@iluuu1994 iluuu1994 changed the title [RFC] Implement data classes (WIP) [RFC] Implement structs (WIP) May 8, 2024
Comment on lines +9112 to +9241
zend_object_clone_obj_t clone_call = Z_OBJ(EX(This))->handlers->clone_obj;
ZEND_ASSERT(clone_call);
ZVAL_OBJ(result, clone_call(Z_OBJ(EX(This))));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you separate the object at this point? Why not to delay this for CoW?

Comment on lines +1042 to +1045
if (OP1_TYPE & (IS_VAR|IS_CV)) {
SEPARATE_DATA_OBJ(object);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why it is not necessary to separate data objects for $this?

Copy link
Member

@dstogov dstogov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. I don't like the "!" syntax at call-site. I understand that it mainly need for the efficient implementation. It reflects that we call the "mutating" method (and need to separate arryas and structs before INIT_METHOD_CALL). I would prefer to use regular call syntax and not to think about "!" at each call.

  2. Now it's allowed to modify $this "struct" in non-"mutating" methods. This should be prohibited or "mutating" flag should be add explicitly or it should be removed at all. There is some design problem...

  3. Now it's possible to have "struct" with dynamic properties - "#[AllowDynamicProperties]". Really I would expect "struct" to be non-extensible.

  4. I don't see why "struct" can't be inherited by another "struct". In C# this is done to keep structs small and allocate them on stack. We use regular heap-allocated "zend_object"s anyway.

  5. There is something wrong with comparison

<?php
struct S1 {
	function __construct(public int $a, public int &$b) {}
}
$c = $d = 2;
$a = new S1(1, $c);
$b = new S1(1, $d);
var_dump($a === $b);
var_dump($a, $b);
var_dump($a === $b);
?>
--OUTPUT--
bool(false)
object(S1)#1 (2) {
  ["a"]=>
  int(1)
  ["b"]=>
  &int(2)
}
object(S1)#2 (2) {
  ["a"]=>
  int(1)
  ["b"]=>
  &int(2)
}
bool(true)
  1. The behavior becomes too complex... Can you predict the output before running this? Did you guess and how many time did you think?
<?php
struct S1 {
	function __construct(public int $a) {}
	public mutating function foo() {
		echo __FUNCTION__,"\n";
		$this->a++;
	}
}
struct S2 {
	function __construct(public S1 $s) {}
	public mutating function bar() {
		$this->s->foo!();
		return $this->s;
	}
}
$s = new S2(new S1(1));
$copy = $s;
$s->bar!()->foo!();
var_dump($s, $copy);
?>

@staabm
Copy link
Contributor

staabm commented Apr 9, 2025

I wonder whether it would be useful (and maybe could even be utilized to generate faster code in the engine) if structs would not support magic methods

@iluuu1994
Copy link
Member Author

I think blocking magic methods is unlikely to make a difference. The slowdown comes from checking whether an object needs to be "separated" (cloned when RC>1) in various paths. This could only really be avoided if we know beforehand whether an object can be a struct, but due to limited typing and context this is very often not the case, and thus I have not optimized for it. To me, the +0.07%/+0.10% slowdown is acceptable, but whether it will be for the rest of internals is for them to decide. 🙂

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants