
Make faster matToNDArray safe with non-image number of channels #28

Merged
alanocallaghan merged 2 commits into main from revert-faster-conversion
Feb 11, 2025

Conversation

@alanocallaghan
Contributor

@alanocallaghan alanocallaghan commented Feb 10, 2025

This reverts commit f2e5ab0.

The listed commit breaks matToNDArray in InstanSeg for at least some numbers of input channels. It seems that 1 or 3 input channels work; others fail with:

ai.djl.translate.TranslateException: java.lang.RuntimeException: OpenCV(4.10.0) /home/runner/work/javacpp-presets/javacpp-presets/opencv/cppbuild/linux-x86_64/opencv-4.10.0/modules/core/src/arithm.cpp:658: error: (-209:Sizes of input arguments do not match) The operation is neither 'array op array' (where arrays have the same size and the same number of channels), nor 'array op scalar', nor 'scalar op array' in function 'arithm_op'

	at ai.djl.inference.Predictor.batchPredict(Predictor.java:197)
	at ai.djl.inference.Predictor.predict(Predictor.java:133)
	at qupath.ext.instanseg.core.TilePredictionProcessor.process(TilePredictionProcessor.java:143)
	at qupath.ext.instanseg.core.TilePredictionProcessor.process(TilePredictionProcessor.java:35)
	at qupath.lib.experimental.pixels.PixelProcessor$ProcessorTask.run(PixelProcessor.java:273)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base/java.lang.Thread.run(Thread.java:1583)
Caused by: java.lang.RuntimeException: OpenCV(4.10.0) /home/runner/work/javacpp-presets/javacpp-presets/opencv/cppbuild/linux-x86_64/opencv-4.10.0/modules/core/src/arithm.cpp:658: error: (-209:Sizes of input arguments do not match) The operation is neither 'array op array' (where arrays have the same size and the same number of channels), nor 'array op scalar', nor 'scalar op array' in function 'arithm_op'

	at org.bytedeco.opencv.global.opencv_dnn.blobFromImage(Native Method)
	at qupath.ext.djl.DjlTools.matToNDArray(DjlTools.java:415)
	at qupath.ext.instanseg.core.MatTranslator.processInput(MatTranslator.java:61)
	at qupath.ext.instanseg.core.MatTranslator.processInput(MatTranslator.java:20)
	at ai.djl.translate.Translator.batchProcessInput(Translator.java:103)
	at ai.djl.inference.Predictor.batchPredict(Predictor.java:185)
	... 11 common frames omitted
19:03:25.184 [JavaFX Application Thread] [INFO ] q.l.g.TaskRunnerFX$PluginProgressMonitorFX - Processing complete in 0.24 seconds
19:03:25.193 [JavaFX Application Thread] [INFO ] q.l.g.TaskRunnerFX$PluginProgressMonitorFX - Processing complete in 0.01 seconds
19:03:25.412 [instanseg1] [INFO ] q.ext.instanseg.ui.InstanSegTask - Results: InstanSegResults[nPixelsProcessed=262144, nTilesProcessed=1, nTilesFailed=1, nObjectsDetected=0, processingTimeMillis=856, wasInterrupted=false]
19:03:25.436 [instanseg1] [ERROR] q.ext.instanseg.ui.InstanSegTask - 1 tiles failed. This could be a memory issue.
Consider decreasing tile size or the number of threads used.

@petebankhead
Member

Hmmm, guess it makes sense that OpenCV doesn't support other channel numbers.

Unfortunate though, I have a vague memory that the performance cost of having to call array.concat was much too big, at least under some conditions.

I'll have a look tomorrow to see if I can figure out why I made this change, and why I didn't realise it was broken.

I suppose a compromise might be to check the channel numbers? Even if it only works for 3 channels, that's potentially a common enough use case to be worthwhile.

@alanocallaghan
Contributor Author

I have not understood the OpenCV/DJL code well enough to offer any real advice or objections on solutions, but generally speaking I'd say that if it's faster and safe in some situations, it's a good idea to use it then, and only fall back to the slow version when necessary.

@petebankhead
Member

I wrote a script to debug it at https://gist.github.com/petebankhead/f088f5880e808a7976cabb7c974d76c3

Basically, I can confirm your observation that it's broken except for 1, 3 or 4 channels (which are all valid for images from OpenCV's perspective).

I'm not seeing a substantial difference in performance in the script, either using CPU or GPU (MPS).

What I think was happening is that it was giving me a performance bottleneck when running in real-world scenarios using MPS with InstanSeg and multithreading. The old code required creating lots of arrays and then concatenating them, and I think that ended up causing too much competition and shifting of data. VisualVM told me this was where the problem was, and switching to blobFromImage helped by simplifying to a single call.

The other option I would have tried was to construct a float[] or FloatBuffer on the Java side with everything reshaped properly, and then call manager.create to create the proper NDArray in a single step. But I didn't fancy having to figure out how to transpose everything properly, and I thought blobFromImage gave me an easy way out.
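The FloatBuffer option described above boils down to an HWC-to-CHW index permutation on the Java side before a single array-creation call. A minimal pure-Java sketch of that transpose (the DJL/OpenCV wiring and the hypothetical `manager.create` call are omitted, and the class and method names here are illustrative, not from DjlTools):

```java
// Sketch of the HWC -> CHW transpose that a manual FloatBuffer-based
// conversion would need before handing the data to manager.create().
// Pure-Java illustration only; the DJL/OpenCV wiring is omitted.
public class HwcToChw {

    // Interleaved (height, width, channels) -> planar (channels, height, width).
    static float[] hwcToChw(float[] hwc, int height, int width, int channels) {
        float[] chw = new float[hwc.length];
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                for (int c = 0; c < channels; c++) {
                    // Source index: pixels stored channel-last (interleaved)
                    int src = (y * width + x) * channels + c;
                    // Destination index: one full H*W plane per channel
                    int dst = c * height * width + y * width + x;
                    chw[dst] = hwc[src];
                }
            }
        }
        return chw;
    }

    public static void main(String[] args) {
        // 1x2 image with 2 channels: pixel0=(1,2), pixel1=(3,4)
        float[] hwc = {1f, 2f, 3f, 4f};
        float[] chw = hwcToChw(hwc, 1, 2, 2);
        // Channel-0 plane {1,3} followed by channel-1 plane {2,4}
        System.out.println(java.util.Arrays.toString(chw));
    }
}
```

Unlike blobFromImage, an index permutation like this has no restriction on the channel count, which is why it could serve as the general fallback.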

Conclusion:

  1. I don't want to revert the commit entirely, but we do need to check if we have 3 or 4 channels and only use blobFromImage in that case.
  2. We really need tests for this code... and also to confirm that we get the same results with/without blobFromImage.
  3. Ideally we would have a more optimized conversion for other numbers of channels that doesn't rely on array.concat. For now, we could just add a comment in the code to that effect.
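The check in point 1 could be as simple as gating on the channel counts OpenCV treats as valid images (1, 3 or 4, per the debugging above). A sketch, with an assumed method name rather than the real DjlTools code:

```java
// Illustrative guard for point 1 above: only take the blobFromImage fast
// path when the Mat has a channel count OpenCV treats as an image.
// The method name and structure are assumptions, not the real DjlTools code.
public class ChannelGuard {

    // blobFromImage is only known-safe for 1, 3 or 4 channels;
    // anything else should fall back to the slower concat-based path.
    static boolean canUseBlobFromImage(int nChannels) {
        return nChannels == 1 || nChannels == 3 || nChannels == 4;
    }

    public static void main(String[] args) {
        for (int c = 1; c <= 5; c++) {
            System.out.println(c + " channels -> fast path: " + canUseBlobFromImage(c));
        }
    }
}
```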

@petebankhead
Member

Reassuringly, when I change the top of the script to

// How many channels do you want?
int nChannels = 4

boolean useGPU = true

try (var scope = new PointerScope()) {
    Device device
    if (useGPU)
        device = GeneralTools.isMac() ? Device.fromName("mps") : Device.gpu()
    else
        device = Device.cpu()

    try (var manager = NDManager.newBaseManager(device)) {
        var mat = new Mat(32, 64, opencv_core.CV_32FC(nChannels))
        opencv_core.randn(mat, OpenCVTools.scalarMat(1.0, mat.depth()), OpenCVTools.scalarMat(1.0, mat.depth()))
        long startTime = System.nanoTime()
        var array = matToNDArray(manager, mat, "NCHW", false)
        var array2 = matToNDArray(manager, mat, "NCHW", true)
        long endTime = System.nanoTime()
        println array
        println "Time: ${(endTime - startTime)/1e9} seconds"
        println "Same? ${array.contentEquals(array2)}"
    }
}

it seems that the results are the same when using blobFromImage and when skipping it to revert to the previous code.

@alanocallaghan
Contributor Author

Seems like a general-purpose manual reshape would be the best way to go, although I don't know that I'm volunteering to do it, given that I'd want to test it very well and I certainly don't understand the task well enough to get started yet.

For the time being, reverting everything but the special-case sizes sounds sensible.

@alanocallaghan
Contributor Author

It's not immediately obvious to me how to test this code without packaging PyTorch or TensorFlow into the DJL extension, which we presumably don't want to do. Otherwise, I think we get ai.djl.engine.EngineException: No deep learning engine found.

I'll revisit; with decent tests I'd be happy to write a manual transposer.

@alanocallaghan alanocallaghan changed the title Revert "Faster Mat to NDArray channels-first conversion" Make faster matToNDArray safe with non-image number of channels Feb 11, 2025
@alanocallaghan alanocallaghan merged commit 072b603 into main Feb 11, 2025
1 check passed
@alanocallaghan alanocallaghan deleted the revert-faster-conversion branch February 11, 2025 19:40
