
Conversation

@bertie2
Collaborator

@bertie2 bertie2 commented Jul 12, 2023

implements improvements from the other repo, from myself and ana-096.
refactors the code into multiple files.
adds a .gitignore.
adds initial unit tests.
adds parallelism, by giving each shape a single unique hash: among the hashes of all of its rotations, we take the largest value (in terms of the unsigned integer value of the bit array) - see the sketch at the end of this description.
makes rendering a command line flag.

looking for feedback on all of this: whether we want to merge it as is, cherry-pick certain parts, or ignore it as rubbish. I think at least the refactor and unit tests are desperately needed if we are to work on this in parallel.

please note this is an opportunity for the whole community to review and consider: anyone should feel free to take a look and leave feedback if you have time, even if you won't necessarily be working on it.

finally, this is leading towards a distributed system for managing this computation: since each job slice doesn't need a full hash set of all existing cubes to do its processing, the current calls to multiprocessing could all be replaced with network calls to a matching client for massive scale.
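To make the hashing scheme concrete, here is a rough sketch of the idea (an illustration only, with assumed helper names, not the actual code in this PR): enumerate the 24 axis-aligned rotations of the voxel array and take the largest encoding as the shape's single id, so every orientation of a shape maps to the same value.

import numpy as np

def all_rotations(cube: np.ndarray):
    """Yield the 24 axis-aligned rotations of a 3D array."""
    def spins(c):
        # the 4 rotations about the last axis
        for k in range(4):
            yield np.rot90(c, k, axes=(0, 1))
    # 6 pre-rotations move each of the 6 axis directions onto the spin axis,
    # then the 4 spins about it give 6 * 4 = 24 distinct orientations.
    yield from spins(cube)
    yield from spins(np.rot90(cube, 2, axes=(0, 2)))
    yield from spins(np.rot90(cube, 1, axes=(0, 2)))
    yield from spins(np.rot90(cube, 3, axes=(0, 2)))
    yield from spins(np.rot90(cube, 1, axes=(1, 2)))
    yield from spins(np.rot90(cube, 3, axes=(1, 2)))

def canonical_id(cube: np.ndarray) -> bytes:
    # largest encoding over all rotations; assumes each dimension fits in one byte
    return max(r.tobytes() + bytes(r.shape) for r in all_rotations(cube))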

…tor code into multiple files, add initial unit tests, add parallelism

@scamille scamille left a comment


I do like the refactoring part, splitting everything up and even integrating some of the optimizations from previous work.
All of that didn't really change many of the existing function signatures, and where it did, it mostly improved the semantics of what the code is doing.

What I am not so sure about is whether it was a good idea to add the parallelism part here. It significantly changes the main loop that is executed, and having everything in the same PR as the refactoring makes it hard to review.

return cube


def expand_cube(cube: np.ndarray) -> np.ndarray:


I don't think cropping.py is the best name for the file. The cropping part is just an implementation detail of the expand_cube function.

Not sure what the best name is, but it should be something that captures that this is about deriving all the new expanded cubes from an existing cube.

Collaborator Author


I have named it resizing for now, as the most generic name I can think of. I'm leaving this thread unresolved, as I'm open to suggestions if anyone has any better ideas.

@matthunz

Are we sticking with Python for now? I know the original README mentions a transition to C or Java would be faster.

Also, I'd just like to throw my vote in for Rust or Haskell if we go that route. They both compile to native code for native-level performance and could make the code very readable.

@bertie2
Collaborator Author

bertie2 commented Jul 12, 2023

I at least am happy to move to another language, but so far I haven't had much success finding a good option: NumPy's in-place multi-dimensional matrix rotation is extremely helpful, and I haven't found any good alternatives to it.

essentially, until someone is willing to at least build a proof of concept in another language, I personally will be sticking with Python.
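As a small illustration of why that matters (a toy example, not code from the repo): a single np.rot90 call rotates a whole 3D occupancy array 90 degrees in any chosen plane and handles the change of dimensions for you.

import numpy as np

cube = np.zeros((2, 2, 3), dtype=np.uint8)
cube[0, 0, :] = 1                            # a 1x1x3 bar along the last axis
rotated = np.rot90(cube, k=1, axes=(0, 2))   # 90-degree rotation in the (0, 2) plane
print(rotated.shape)                         # (3, 2, 2) -- the bar now lies along axis 0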

@scamille

I'd definitely keep the Python version as a reference and as an easy way for people to experiment with completely new ideas (and not just optimizations). Implementations in different languages can easily be added alongside the Python version.

It might be desirable to have the same command line interface for all the implementations, to make it easier to compare/benchmark them.

@bertie2
Collaborator Author

bertie2 commented Jul 12, 2023

parallelism has been moved out of this branch; this has also given me the opportunity to spot some optimizations I missed.

without break early on seen hash:

python.exe .\cubes.py 9
Loading polycubes n=8 from cache: 6922 shapes

Got polycubes from cache n=8

Hashing polycubes n=9
completed 100.00%

Generating polycubes from hash n=9
completed 100.00%
Wrote file for polycubes n=9

Found 48311 unique polycubes

Elapsed time: 122.02s

with break early on seen hash:

python.exe .\cubes.py 9 
Loading polycubes n=8 from cache: 6922 shapes

Got polycubes from cache n=8

Hashing polycubes n=9
completed 100.00%

Generating polycubes from hash n=9
completed 100.00%
Wrote file for polycubes n=9

Found 48311 unique polycubes

Elapsed time: 59.456s
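For clarity, the "break early on seen hash" change follows roughly this pattern (a sketch with assumed names, not the exact code in the branch; all_rotations is taken to yield the 24 rotations, as in the sketch in the PR description): stop encoding rotations as soon as one of them is already in the set of known shapes.

def try_add(candidate, known_ids):
    best = None
    for rotation in all_rotations(candidate):
        rot_id = rotation.tobytes() + bytes(rotation.shape)   # assumes dims < 256
        if rot_id in known_ids:
            return False            # break early: this shape was already seen
        if best is None or rot_id > best:
            best = rot_id           # keep the largest id as the canonical one
    known_ids.add(best)             # no rotation matched, so record the new shape
    return True

Since most candidate expansions are duplicates, the membership test usually hits well before all 24 rotations have been encoded, which is where the roughly 2x speedup above comes from.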


@scamille scamille left a comment


Great work on the tests.

@bertie2
Collaborator Author

bertie2 commented Jul 12, 2023

included faster packing code from https://github.com/RibenaJT at mikepound/cubes#4 (comment)
also rewrote the unpacking code while at it; overall speedup of 15%, as we spend more time in C land and less time doing Python operations.

unless there is pushback or further major comments, I will be pulling this into main on Friday 14th at 12:00 UTC.

# return cube_hash

data = polycube.tobytes() + polycube.shape[0].to_bytes(1, 'big') + polycube.shape[1].to_bytes(1, 'big') + polycube.shape[2].to_bytes(1, 'big')
return int.from_bytes(data, 'big')
Contributor

@VladimirFokow VladimirFokow Jul 13, 2023


Do we need int.from_bytes(data, 'big'), or can we simply return data?

There is some time overhead to int.from_bytes.
It's not large for a polycube of size 8000, for example (about 10 µs), but it can add up, and it is more noticeable for larger sizes.

int.from_bytes constructs an integer (if data is very big, e.g. 100_000 bytes - this can take a very long time).

  • But the bytes objects are already comparable (for the get_canoincal_packing function in cubes.py).
  • And when adding cube_hash to the known_hashes in the generate_polycubes function in cubes.py, the set internally computes a 64-bit integer hash(cube_hash) anyway.

So the int specifically is not required; bytes will be enough for our purposes, right?


Maybe cube_hash would be better named cube_id, because it's not yet a hash (which the set computes internally); it's just another representation of a numpy array that is hashable and comparable, and which corresponds to our cube (rotation-invariantly).

pull request: bertie2#1
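In other words, the suggestion amounts to roughly this (a sketch using the cube_id name proposed above, not the merged diff):

def cube_id(polycube):
    # same byte layout as before, but returned as bytes instead of converting to an int
    data = (polycube.tobytes()
            + polycube.shape[0].to_bytes(1, 'big')
            + polycube.shape[1].to_bytes(1, 'big')
            + polycube.shape[2].to_bytes(1, 'big'))
    return data   # bytes are hashable (fine for the set) and comparable (fine for picking a canonical one)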


@RibenaJT
I haven't had time to try it, so it may or may not help in practice, but I did wonder if cube_hash should be more like a true hash, i.e. a value that indicates that 2 shapes MAY be identical (collisions allowed); if a hash match is found, those candidates would then be tested for true equality.

I thought the hash could be a hash of these properties combined (which I think should yield the same hash for all rotations, so maybe cutting down the time spent on rotations?) - see the sketch after the example below:

  • as now, the width/height/depth (sorted to ensure all rotations give the same hash)
  • the number of cubes in the shape
  • a list of numbers that are the number of cubes in each 3D "slice" of the shape, again sorted to ensure rotations give the same hash.

E.g. a 2x2x2 cube with one corner missing would be (2, 2, 2, 7, ((3,4), (3,4), (3,4))) -- with the slice lists to be sorted in some way.
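A rough sketch of that fingerprint idea (assuming the shape is a 3D 0/1 NumPy array; this is only the idea from this thread, not code in the PR):

import numpy as np

def shape_fingerprint(cube):
    dims = tuple(sorted(cube.shape))        # w/h/d, independent of orientation
    count = int(cube.sum())                 # number of cubes in the shape
    slices = tuple(sorted(
        tuple(sorted(int(cube.take(i, axis=ax).sum()) for i in range(cube.shape[ax])))
        for ax in range(3)
    ))                                      # cube count per 2D slice, per axis, sorted
    return (dims, count, slices)

For the 2x2x2 cube with one corner missing this gives ((2, 2, 2), 7, ((3, 4), (3, 4), (3, 4))), matching the example above; any rotation of the shape produces the same tuple, so it could act as a cheap pre-filter before the full 24-rotation comparison.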

@VladimirFokow
Contributor


@RibenaJT
Your comment seems unrelated to the changes that I've proposed in my comment.

But I'll reply to you here:
oh, so you're thinking of a heuristic - something to check even before any hash calculations for all 24 rotations...
However:

  • in case of a collision, how would they be tested for equality? We would still need to consider all 24 rotations, right?
  • in case of no collision, we would still need to consider all 24 rotations for future comparisons (I think almost surely some future polycubes will collide with the current one, so we might as well not wait until that happens and compute the cube_id right away).
    So we're calculating these 24 rotations in any case anyway?

  • "the number of cubes in the shape" - I don't think this is needed: we are not computing the polycubes of different N at the same time, so polycubes of different N are never stored in the same set and can never be confused with each other, unless you have a different application in mind.
  • What do you mean by a 3D "slice"? A 2D slice of each layer, I assume, going from the bottom layer to the top, for example.


@RibenaJT
Yes, sorry, it should probably have been a new comment.

Yes, if there was a collision, you would still need to compare all 24 rotations.
If there is no collision, I don't think you need to rotate, since the hash should be the same for all rotations (provided the properties are sorted in a way that makes the orientation of the shape irrelevant, e.g. sorting the w/h/d/slices so that a 3x2x4 oblong would have a hash of {2,3,4,{sorted_slices}} for all orientations) - therefore we know it is a new shape.

It was just a vague idea - how effective it would be depends on how "discriminating" the hash is, versus the cost of rotating.

Yes, by 3D slice, I meant taking all the (2D) slices of the shape along all 3 axes.

replaced unnecessary `int.from_bytes(data, 'big')` with just `data`

- change naming: cube_hash to cube_id

- typo correction: get_canoincal_packing

- relevant docstring updates
@bertie2
Collaborator Author

bertie2 commented Jul 13, 2023

@VladimirFokow I have modified on top of your code a little to still pack the bits. It's still just a little bit slower (benchmarks below), but it saves a lot of memory, and at n>12 we are memory limited. I am still using tobytes and frombuffer like you did, rather than manual shifting.
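For anyone following along, the packed representation works roughly like this (a sketch with illustrative names, assuming the polycube is a 3D array of 0s and 1s with each dimension under 256; not the exact code in the branch):

import numpy as np

def pack(polycube):
    # 8 voxels per byte, plus the 3 dimensions so the array can be rebuilt
    return np.packbits(polycube, axis=None).tobytes() + bytes(polycube.shape)

def unpack(data):
    shape = tuple(data[-3:])                 # last 3 bytes are the dimensions
    size = shape[0] * shape[1] * shape[2]
    bits = np.unpackbits(np.frombuffer(data[:-3], dtype=np.uint8), count=size)
    return bits.reshape(shape)

Storing one bit per voxel instead of one byte is what saves the memory, which matters once the set of known shapes starts to dominate RAM at larger n.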

with bit-packing:

 python .\cubes.py --no-cache 9                          

Hashing polycubes n=3
completed 100.00%

Generating polycubes from hash n=3
completed 100.00%

Hashing polycubes n=4
completed 100.00%

Generating polycubes from hash n=4
completed 100.00%

Hashing polycubes n=5
completed 100.00%

Generating polycubes from hash n=5
completed 100.00%

Hashing polycubes n=6
completed 100.00%

Generating polycubes from hash n=6
completed 100.00%

Hashing polycubes n=7

Generating polycubes from hash n=7
completed 100.00%

Hashing polycubes n=8
completed 100.00%

Generating polycubes from hash n=8
completed 100.00%

Hashing polycubes n=9
completed 100.00%

Generating polycubes from hash n=9
completed 100.00%

Found 48311 unique polycubes

Elapsed time: 64.867s

without packing:

 python .\cubes.py --no-cache 9

Hashing polycubes n=3
completed 100.00%

Generating polycubes from hash n=3
completed 100.00%

Hashing polycubes n=4
completed 100.00%

Generating polycubes from hash n=4
completed 100.00%

Hashing polycubes n=5
completed 100.00%

Generating polycubes from hash n=5
completed 100.00%

Hashing polycubes n=6
completed 100.00%

Generating polycubes from hash n=6
completed 100.00%

Hashing polycubes n=7
completed 100.00%

Generating polycubes from hash n=7
completed 100.00%

Hashing polycubes n=8
completed 100.00%

Generating polycubes from hash n=8
completed 100.00%

Hashing polycubes n=9
completed 100.00%

Generating polycubes from hash n=9
completed 100.00%

Found 48311 unique polycubes

Elapsed time: 57.474s

@VladimirFokow
Contributor

VladimirFokow commented Jul 13, 2023

@bertie2 yes

np.packbits reduces the memory 8 times (if the polycube array was of dtype np.int8), but it takes a toll on time for non-flat large arrays. If you .flatten() inside the np.packbits call, the .flatten() now takes this time instead, so you're not escaping the time penalty.
I guess I was focusing too much on optimizing time and not memory, but apparently memory is the more urgent issue, so 👍

@bertie2
Collaborator Author

bertie2 commented Jul 16, 2023

merging imminently; cleaned up the README and included test data for ease of use.

@bertie2 bertie2 merged commit 60dbcdb into mikepound:main Jul 16, 2023
