Code Overhaul. #3
Conversation
…tor code into multiple files, add initial unit tests, add parallelism
I do like the refactoring part, splitting everything up and even integrating some of the optimizations from previous work.
All of that didn't really change much of the existing function signatures, and where it did it mostly improved the semantics of what the code is doing.
What I am not so sure about is whether it was a good idea to add the parallelism part in here. It significantly changes the main loop that is executed, and having everything in the same PR as the refactoring makes it hard to review.
libraries/cropping.py
Outdated
| return cube |
| |
| |
| def expand_cube(cube: np.ndarray) -> np.ndarray: |
I don't think cropping.py is the best name for the file. The cropping part is just an implementation detail of the expand_cube function.
Not sure what the best name is, but something that captures that this is about deriving all new expanded cubes from an existing cube.
I have named it resizing for now, as the most generic name I can think of. I'm not resolving this thread, as I'm open to suggestions if anyone has better ideas.
|
Are we sticking with Python for now? I know the original README mentions a transition to C or Java would be faster. Also, I'd just like to throw my vote in for Rust or Haskell if we go that route: they both compile to native-level performance and could make the code very readable. |
|
I at least am happy to move to another language, but so far I haven't had much success finding a good option; numpy's in-place multi-dimensional matrix rotation is extremely helpful, and so far I haven't found any good alternatives. Essentially, until someone is willing to at least build a proof of concept in another language, I personally will be sticking with Python. |
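For context, this is roughly the numpy convenience being referred to: a hedged sketch of generating all 24 axis-aligned rotations of a 3D array with np.rot90 (illustrative only, not necessarily the exact code in this repo):

```python
import numpy as np

def all_rotations(polycube: np.ndarray):
    """Yield all 24 axis-aligned orientations of a 3D array (sketch)."""
    def spins(cube, axes):
        # the four 90-degree turns within one plane
        for i in range(4):
            yield np.rot90(cube, i, axes)

    # four spins about axis 0
    yield from spins(polycube, (1, 2))
    # flip 180 degrees, then spin about axis 0 again
    yield from spins(np.rot90(polycube, 2, (0, 2)), (1, 2))
    # point axis 0 along axis 2 (both ways), then spin
    yield from spins(np.rot90(polycube, 1, (0, 2)), (0, 1))
    yield from spins(np.rot90(polycube, -1, (0, 2)), (0, 1))
    # point axis 0 along axis 1 (both ways), then spin
    yield from spins(np.rot90(polycube, 1, (0, 1)), (0, 2))
    yield from spins(np.rot90(polycube, -1, (0, 1)), (0, 2))
```

Since np.rot90 returns cheap views rather than copies, all 24 orientations come essentially for free, which is hard to match without a good n-dimensional array library.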
|
I'd definitely keep the Python version as a reference and as an easy way for people to experiment with completely new ideas (and not just optimizations). Implementations in different languages can easily be added alongside the Python version. It might be desirable to have the same command line interface for all the implementations, to make it easier to compare/benchmark them. |
…cropping library to resizing, fix type annotations
|
Parallelism has been moved out of this branch; this has also given me the opportunity to spot some optimizations I missed. Without break early on seen hash: With break early on seen hash: |
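My reading of the "break early on seen hash" optimization, as a hedged sketch (`all_rotations` and `pack` stand in for the repo's rotation and packing helpers, and the function name is illustrative):

```python
def canonical_id_or_none(polycube, known_ids):
    # Compute the canonical id (largest packing over the 24 rotations), but
    # stop as soon as any rotation's packing is already known: the known set
    # only ever contains canonical ids, so a hit means this cube is a
    # duplicate and the remaining rotations need not be packed.
    best = None
    for rotation in all_rotations(polycube):
        packed = pack(rotation)
        if packed in known_ids:
            return None  # seen before - break early
        if best is None or packed > best:
            best = packed
    return best  # canonical id of a genuinely new cube
```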
scamille
left a comment
Great work on the tests.
|
Included faster packing code from https://github.com/RibenaJT at mikepound/cubes#4 (comment). Unless there is pushback or further major comments, I will be pulling this into main on Friday the 14th at 12:00 UTC. |
libraries/packing.py
Outdated
| # return cube_hash |
| |
| data = polycube.tobytes() + polycube.shape[0].to_bytes(1, 'big') + polycube.shape[1].to_bytes(1, 'big') + polycube.shape[2].to_bytes(1, 'big') |
| return int.from_bytes(data, 'big') |
Do we need int.from_bytes(data, 'big'), or can we simply return data?
There is some time overhead of int.from_bytes.
Not large for polycube of size 8000, for example (about 10 µs), but can add up, and is more noticeable for larger sizes.
int.from_bytes constructs an integer (if data is very big, e.g. 100_000 bytes - this can take a very long time).
- But the `bytes` objects are already comparable (for the `get_canoincal_packing` function in cubes.py).
- And when adding `cube_hash` to the `known_hashes` in the `generate_polycubes` function in cubes.py, the `set` internally computes a 64-bit integer `hash(cube_hash)` anyway.
So the int specifically is not required; bytes will be enough for our purposes, right?
Maybe cube_hash would be better named as cube_id - because it's not yet a hash (which set uses internally), it's just another representation of a numpy array - that is hashable and comparable, and which corresponds to our cube (rotationally invariantly).
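For concreteness, the suggestion amounts to something like the following sketch (the function name `pack` is assumed; the body mirrors the diff above, minus the final conversion):

```python
import numpy as np

def pack(polycube: np.ndarray) -> bytes:
    # Same data as in the diff above, but returned as bytes: the raw cube
    # contents plus the three dimensions, so cubes of different shapes with
    # identical flattened contents cannot collide.  bytes are hashable and
    # comparable, which is all the calling code needs.
    return (polycube.tobytes()
            + polycube.shape[0].to_bytes(1, 'big')
            + polycube.shape[1].to_bytes(1, 'big')
            + polycube.shape[2].to_bytes(1, 'big'))
```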
pull request: bertie2#1
I haven't had time to try it, so it may or may not help in practice, but I did wonder if cube_hash should be more like a true hash - i.e. a value that indicates that 2 shapes MAY be identical (collisions allowed) - and if a hash match is found, then those candidates would be tested for true equality.
I thought the hash could be a hash of these properties combined (which I think should yield the same hash for all rotations, so maybe cutting down the time spent on rotations?):
- As now, the width/height/depth (sorted to ensure all rotations give the same hash)
- the number of cubes in the shape
- a list of numbers that are the number of cubes in each 3d "slice" of the shape, again sorted to ensure rotations give the same hash.
E.g. a 2x2x2 cube with one corner missing would be (2, 2, 2, 7, ((3,4), (3,4), (3,4))) - with the slice lists to be sorted in some way.
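A rough sketch of the kind of rotation-invariant fingerprint being described, assuming a boolean numpy array (the helper name and exact property layout are illustrative, not code from this PR):

```python
import numpy as np

def rough_fingerprint(polycube: np.ndarray) -> tuple:
    # sorted dimensions: identical for every rotation of the shape
    dims = tuple(sorted(polycube.shape))
    # total number of filled cells
    count = int(polycube.sum())
    # cube counts per slice along each axis, sorted within each axis and
    # across axes so that rotations (axis permutations/reversals) agree
    slices = tuple(sorted(
        tuple(sorted(int(polycube.take(i, axis=axis).sum())
                     for i in range(polycube.shape[axis])))
        for axis in range(3)
    ))
    return (dims, count, slices)
```

For the 2x2x2 cube with one corner missing this gives ((2, 2, 2), 7, ((3, 4), (3, 4), (3, 4))). Distinct shapes can still collide, so a match would need the full rotation-by-rotation equality check discussed in the replies below.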
@RibenaJT
Your comment seems unrelated to the changes that I've proposed in my comment.
But I'll reply to you here:
Oh, so you're thinking of something like a heuristic - applied even before any hash calculations for all 24 rotations...
However:
- in case of a collision, how would they be tested for equality? We would still need to consider all 24 rotations, right?
- in case of no collision, we would still need to consider all 24 rotations for future comparisons (I think it's almost certain that some future polycubes will collide with the current one, so we might as well not wait until this happens and compute the `cube_id` right away)
So we're calculating these 24 rotations in any case anyway?
- "the number of cubes in the shape" - I don't think it's needed, we are not computing the polycubes of different N at the same time; so the polycubes of different N are never stored in the same set, and so can never be confused with each other, unless you have a different application in mind
- What do you mean by 3d "slice"? - a 2d slice of each layer I assume, from bottom to top layer, for example
Yes, sorry, it should probably have been a new comment.
Yes, if there was a collision - you would still need to compare all 24 rotations.
If no collision - I don't think you need to rotate, since the hash should be the same for all rotations (if the properties are sorted in a way that makes the orientation of the shape irrelevant, e.g. sorting the w/h/d/slices so that a 3x2x4 oblong would have a hash of {2,3,4,{sorted_slices}} for all orientations) - therefore we know it is a new shape.
It was just a vague idea - it all depends on how "discriminating" the hash is to how effective this would be (versus the cost of rotating).
Yes, by 3d slice, I meant taking all (2d) slices of the shape along all 3 axes.
replaced unnecessary `int.from_bytes(data, 'big')` with just `data` - change naming: cube_hash to cube_id - typo correction: get_canoincal_packing - relevant docstring updates
6% faster with better cropped cube
simplify `int.from_bytes(data, 'big')` + other
|
@VladimirFokow I have modified on top of your code a little to still pack the bits. It's still just a little bit slower (benchmarks below), but it saves a lot of memory, and at n>12 we are memory limited. I am still using tobytes and frombuffer like you did, rather than manual shifting. with bit-packing: without packing: |
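A hedged sketch of what the bit-packed variant could look like, using np.packbits/np.unpackbits together with tobytes/frombuffer rather than manual shifting (helper names are illustrative, not the exact code in this PR):

```python
import numpy as np

def pack(polycube: np.ndarray) -> bytes:
    # pack the boolean array down to 1 bit per cell, then append the shape
    packed = np.packbits(polycube.ravel()).tobytes()
    return (packed
            + polycube.shape[0].to_bytes(1, 'big')
            + polycube.shape[1].to_bytes(1, 'big')
            + polycube.shape[2].to_bytes(1, 'big'))

def unpack(cube_id: bytes) -> np.ndarray:
    # inverse: recover the boolean array from the packed id
    shape = tuple(cube_id[-3:])
    bits = np.unpackbits(np.frombuffer(cube_id[:-3], dtype=np.uint8),
                         count=shape[0] * shape[1] * shape[2])
    return bits.reshape(shape).astype(np.bool_)
```

Relative to storing the raw bytes, this is roughly an 8x memory saving per stored id, at the cost of the extra packbits/unpackbits work.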
|
@bertie2 yes
|
|
Merging imminently; cleaned up the README and included test data for ease of use. |
- implements improvements from the other repo, from myself and ana-096
- refactors the code into multiple files
- adds .gitignore
- adds initial unit tests
- adds parallelism, by implementing a single unique hash for each shape: the largest value (in terms of the unsigned integer value of the bit array) amongst the hashes for each rotation (a rough sketch follows this list)
- makes rendering a command line flag
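A rough sketch of how that canonical hash enables the parallelism described above (all names here - `pack`, `all_rotations`, `expand_cube` - are stand-ins for the repo's helpers, not the exact code in this PR):

```python
from multiprocessing import Pool

import numpy as np

def canonical_id(polycube: np.ndarray) -> bytes:
    # one unique id per shape: the largest packing across all 24 rotations
    return max(pack(rotation) for rotation in all_rotations(polycube))

def expand_and_canonicalise(polycube: np.ndarray):
    # worker job: expand one cube and tag each candidate with its canonical id
    return [(canonical_id(candidate), candidate) for candidate in expand_cube(polycube)]

def next_generation(polycubes):
    # workers only need their own slice of cubes, not the full set of known
    # hashes; deduplication by canonical id happens once, in the parent
    seen, result = set(), []
    with Pool() as pool:
        for batch in pool.imap_unordered(expand_and_canonicalise, polycubes):
            for cube_id, candidate in batch:
                if cube_id not in seen:
                    seen.add(cube_id)
                    result.append(candidate)
    return result
```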
Looking for feedback on all of this: whether we want to merge it as is, cherry-pick certain parts, or ignore it as rubbish. I think at least the refactor and unit tests are desperately needed if we are to work on this in parallel.
Please note this is an opportunity for the whole community to review and consider - anyone, feel free to take a look and leave feedback if you have time, even if you won't necessarily be working on it.
Finally, this is leading towards a distributed system for managing this computation: since each job slice doesn't need a full hash set of all existing cubes to do its processing, the current calls to multiprocessing could all be replaced with network calls to a matching client for massive scale.