SyncBatchNorm problems #12001

kaleidoscopical · 2018-08-02T15:17:05Z

kaleidoscopical
Aug 2, 2018

asnumpy() fails when I simply replace BatchNorm() with contrib.SyncBatchNorm(). Could anyone explain how to use the new function?

haojin2 · 2018-08-02T16:56:41Z

haojin2
Aug 2, 2018
Collaborator

@zhanghang1989

zhanghang1989 · 2018-08-02T19:08:22Z

zhanghang1989
Aug 2, 2018

SyncBN is implemented in a blocking way. Please use DataParallelModel in https://github.com/dmlc/gluon-cv/blob/master/gluoncv/utils/parallel.py#L11 for debugging, printing, or calling asnumpy().

Roshrini · 2018-08-02T21:11:09Z

Roshrini
Aug 2, 2018
Collaborator

@nswamy Can you please add label: Question

kaleidoscopical · 2018-08-02T23:16:14Z

kaleidoscopical
Aug 2, 2018
Author

@zhanghang1989 It seems unsuitable for non-gluon users. Any suggestions?

zhanghang1989 · 2018-08-03T01:57:33Z

zhanghang1989
Aug 3, 2018

I am not familiar with the Symbol API. Can someone help with this?

safrooze · 2018-08-09T21:37:18Z

safrooze
Aug 9, 2018

@kaleidoscopical I haven't personally used SyncBatchNorm, but the operator is available in both nd and sym forms. Which one are you using?

kaleidoscopical · 2018-08-11T07:52:03Z

kaleidoscopical
Aug 11, 2018
Author

@safrooze I have tried both of them. The nd version seems to perform similarly to the original BatchNorm, but the sym version fails when asnumpy() is called.

zhanghang1989 · 2018-08-20T22:58:51Z

zhanghang1989
Aug 20, 2018

If you use asnumpy() or print(), please use DataParallelModel: https://github.com/dmlc/gluon-cv/blob/master/gluoncv/utils/parallel.py#L11

A training example that uses SyncBatchNorm can be found at https://github.com/dmlc/gluon-cv/blob/master/scripts/segmentation/train.py

zhanghang1989 · 2018-08-20T23:02:46Z

zhanghang1989
Aug 20, 2018

If standard BatchNorm works, change it to SyncBatchNorm and set num_devices correctly; it will work as well.
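
Why the device count matters can be illustrated with a toy model in plain Python. This is only a sketch under assumptions, not MXNet's implementation: each "device" is a thread that waits at a barrier sized by the configured device count, which mimics a blocking synchronization step. The function name `run_devices` and the timeout are hypothetical and exist only for this illustration. As far as I know, the count is passed as `num_devices` in the Gluon layer and as `ndev` in the operator; treat those names as assumptions and check them against your MXNet version.

```python
import threading


def run_devices(ndev_configured, ndev_actual, timeout=1.0):
    """Toy model: every 'device' thread blocks at a barrier sized by ndev_configured."""
    barrier = threading.Barrier(ndev_configured)
    results = []
    lock = threading.Lock()

    def device(rank):
        try:
            # Blocks until ndev_configured threads have arrived, or the timeout expires.
            barrier.wait(timeout=timeout)
            outcome = "synced"
        except threading.BrokenBarrierError:
            # Raised when the barrier is never filled within the timeout.
            outcome = "stalled"
        with lock:
            results.append((rank, outcome))

    threads = [threading.Thread(target=device, args=(r,)) for r in range(ndev_actual)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(results)


# Configured count matches the number of participating devices: everyone proceeds.
ok = run_devices(ndev_configured=2, ndev_actual=2)
# Configured count is larger than the number of devices: nobody gets past the barrier.
stuck = run_devices(ndev_configured=4, ndev_actual=2)
print(ok)
print(stuck)
```

In the toy model a timeout turns the hang into a visible "stalled" result; a real blocking implementation without a timeout would simply wait forever, which is one possible reading of the hangs reported later in this thread.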

piyushghai · 2018-10-09T05:22:07Z

piyushghai
Oct 9, 2018

@kaleidoscopical Were you able to get a suitable answer to your question?

cctgem · 2018-11-09T06:08:55Z

cctgem
Nov 9, 2018

@kaleidoscopical Did you find a way to use SyncBatchNorm with the Symbol API?

zhanghang1989 · 2018-11-11T22:02:42Z

zhanghang1989
Nov 11, 2018

You can't call asnumpy() or print() with SyncBatchNorm. Please use standard BatchNorm for debugging, then change it to SyncBatchNorm with the correct num_devices after debugging.

kaleidoscopical · 2018-11-17T17:04:54Z

kaleidoscopical
Nov 17, 2018
Author

According to a quick test, the failure described in this issue has disappeared in the newest version of MXNet. I will post a further report on efficiency and performance if time allows.

@zhanghang1989 Thanks for your kind help. asnumpy() is eventually called in the metric evaluation step, e.g. in mx.metric.CrossEntropy(). That is where the failure I described at the beginning of this issue occurred. I am not sure what has changed in the newest version, or whether SyncBatchNorm is really being executed, because the speed drop is negligible (great if it really is executed).
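
One way to reason about whether synchronization is in effect is to compare the statistics each variant would use. The sketch below is plain Python on toy numbers, not MXNet code: ordinary batch normalization would normalize each device's shard with that shard's own mean, while a synchronized version aggregates counts, sums, and sums of squares across devices to obtain whole-batch statistics. The helper names `local_stats` and `synced_mean_var` are hypothetical.

```python
from statistics import mean, pvariance


def local_stats(shard):
    """Per-device partial statistics: element count, sum, and sum of squares."""
    return len(shard), sum(shard), sum(x * x for x in shard)


def synced_mean_var(shards):
    """Combine the partial statistics of all devices into a global mean and variance."""
    n = total = total_sq = 0.0
    for shard in shards:
        count, s, sq = local_stats(shard)
        n += count
        total += s
        total_sq += sq
    global_mean = total / n
    # Population variance via E[x^2] - (E[x])^2.
    global_var = total_sq / n - global_mean ** 2
    return global_mean, global_var


# Two "devices", each holding one shard of the batch.
shards = [[1.0, 2.0, 3.0], [7.0, 8.0, 9.0]]

per_device_means = [mean(shard) for shard in shards]
global_mean, global_var = synced_mean_var(shards)

print(per_device_means)
print(global_mean, global_var)
```

When the shards differ, the per-device means differ from the global mean, so comparing the statistics a layer actually reports against these two cases is a hedged way to tell the variants apart.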

tranvanhoa533 · 2018-12-06T02:34:03Z

tranvanhoa533
Dec 6, 2018

@kaleidoscopical Which version of MXNet did you use? I use MXNet 1.3.1, but asnumpy() still fails.

kaleidoscopical · 2018-12-15T08:40:58Z

kaleidoscopical
Dec 15, 2018
Author

@tranvanhoa533 Which error message does it show?

Hi @zhanghang1989!
I think there is still a problem when training with mixed precision, which leads to

mxnet.base.MXNetError: [17:38:53] include/mxnet/././tensor_blob.h:236: Check failed: mshadow::DataType<DType>::kFlag == type_flag_ TBlob.get_with_shape: data type do not match specified type.Expected: 2 v.s. given 0

This error message disappears when I add explicit type casts with mx.sym.cast right before (float16 to float32) and right after (float32 to float16) mx.sym.contrib.SyncBatchNorm.

This is tedious and costs considerably more memory and speed. Any suggestions for solving it?
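
The numbers in the error message above can be decoded with a small helper. The sketch assumes the mshadow type-flag numbering as I recall it from mshadow's `base.h` (0 is float32, 1 is float64, 2 is float16, and so on); treat that table as an assumption and verify it against your MXNet source. The helper name `decode_type_mismatch` is hypothetical.

```python
import re

# Assumed mshadow TypeFlag numbering (verify against mshadow/base.h for your version).
MSHADOW_TYPE_FLAGS = {
    0: "float32",
    1: "float64",
    2: "float16",
    3: "uint8",
    4: "int32",
    5: "int8",
    6: "int64",
}


def decode_type_mismatch(message):
    """Extract the two type flags from a TBlob type-mismatch message and name them."""
    match = re.search(r"Expected:\s*(\d+)\s*v\.s\.\s*given\s*(\d+)", message)
    if match is None:
        return None
    first, second = (int(group) for group in match.groups())
    return (
        MSHADOW_TYPE_FLAGS.get(first, "unknown"),
        MSHADOW_TYPE_FLAGS.get(second, "unknown"),
    )


msg = (
    "TBlob.get_with_shape: data type do not match specified type."
    "Expected: 2 v.s. given 0"
)
decoded = decode_type_mismatch(msg)
print(decoded)
```

Under that numbering the message involves float16 and float32, which is consistent with the workaround of casting to float32 before the layer and back to float16 after it.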

zhanghang1989 · 2018-12-17T18:42:00Z

zhanghang1989
Dec 17, 2018

SyncBN does not support fp16 training yet.

cccorn · 2019-03-23T12:22:22Z

cccorn
Mar 23, 2019

I ran into the same problem with MXNet 1.5.0 and found a solution by L_xiaoming at https://discuss.gluon.ai/t/topic/7842

The solution is to specify the 'key' parameter of the SyncBatchNorm layer; you can simply use the same string as the layer's name.

But I don't know why it works.
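
A possible explanation, offered as an assumption rather than a confirmed account of MXNet internals: as far as I recall, the operator documentation describes `key` as a hash key for synchronization that should be the same for the same layer across devices. If shared state is looked up by that key, two different layers that end up with the same key would mix their data. The toy model below is plain Python, not MXNet code, and `SharedStats` and `layer_means` are hypothetical names.

```python
from collections import defaultdict


class SharedStats:
    """Toy stand-in for cross-device shared state that is looked up by a string key."""

    def __init__(self):
        self._sums = defaultdict(float)
        self._counts = defaultdict(int)

    def push(self, key, values):
        # Every device adds its shard's contribution under the layer's key.
        self._sums[key] += sum(values)
        self._counts[key] += len(values)

    def mean(self, key):
        return self._sums[key] / self._counts[key]


def layer_means(keys):
    """Return the mean each layer would see, given a mapping from layer name to key."""
    shared = SharedStats()
    # Two layers, each with activations split across two "devices".
    activations = {
        "bn1": [[0.0, 2.0], [4.0, 6.0]],
        "bn2": [[100.0, 102.0], [104.0, 106.0]],
    }
    for layer, shards in activations.items():
        for shard in shards:
            shared.push(keys[layer], shard)
    return {layer: shared.mean(keys[layer]) for layer in activations}


# Each layer uses its own name as key: statistics stay separate.
distinct = layer_means({"bn1": "bn1", "bn2": "bn2"})
# Both layers share one key: their statistics are blended together.
collided = layer_means({"bn1": "same", "bn2": "same"})
print(distinct)
print(collided)
```

In this toy model, using the layer name as the key keeps each layer's statistics separate, which matches the advice above; whether the real operator fails in the same way without a distinct key is not established in this thread.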

pengwangucla · 2019-03-29T08:01:13Z

pengwangucla
Mar 29, 2019

I tried all of the solutions, and none of them works. It is really weird when I only use

I met the same problem with mxnet version 1.5.0, and I found a solution from L_xiaoming in https://discuss.gluon.ai/t/topic/7842

The solution is to specify the parameter 'key' in the SyncBatchNorm layer, and you can just use the same string as the layer's name.

But I don't know why it works.

The problem still exists after specifying the key. There is no matching error; however, the program hangs and never continues. The maximum number of SyncBatchNorm layers I can use is 2. I don't know why.

zhanghang1989 · 2019-04-01T16:29:25Z

zhanghang1989
Apr 1, 2019

GluonCV uses SyncBatchNorm to train all of its segmentation models and YOLOv3. Please take a look at how it is used there.
