Conversation

@dsharma283

This patch enables the use of adapters with HDR and NDR link speeds.
Issue id: 3431

Signed-off-by: Devesh Sharma [email protected]

@ompiteam-bot

Can one of the admins verify this patch?

@jsquyres
Member

bot:retest

@bwbarrett
Member

bot:ompi:retest

@jsquyres jsquyres left a comment

According to #3431, it looks like you are trying to solve two issues:

  1. Support 25/50Gbps devices.
  2. Fix a segv.

In #3431, the real error in the segv appears to be:

*** Error in `/usr/local/imb/openmpi/dcheck/IMB-MPI1': free(): invalid pointer: 0x00007ff37b2f34d8 ***

However, it's not clear from that output whether that free() is in the Open MPI code or in IMB. I.e., is Open MPI handling the rejected port badly (i.e., not propagating the error properly), or is IMB failing?

If there's a separate fix for a segv in a failure path of the openib BTL when a port is rejected, it would be useful to include that as a separate commit on this PR.

Review thread on common_verbs_port.c:

    /* HDR: 50Gbps * 64/66, in megabits */
    *bandwidth = 50000;
    break;
    case 128:
Member

Two questions:

  1. According to https://github.com/torvalds/linux/blob/master/include/rdma/ib_verbs.h#L428-L435, I do not see values for 64 and 128 defined in the kernel ABI for the active_speed field when querying IB port attributes. Where are you getting these 64/128 values from?

  2. As a secondary (but related) issue, is there a better way than using hard-coded integer values here in common_verbs_port.c? On the kernel side, we have the nice IB_SPEED_* enums; do public equivalents exist in userspace that we can use here in Open MPI? (ditto for IB_WIDTH_*)

Author

Hi Jeff,

Thanks for your comments and feedback.
Regarding the invalid pointer free(), I did some initial debugging and found that Open MPI raises this stack trace whenever it fails to select any usable BTL for a given process. For example, with my transport, this happens if I set btl_openib_flags such that only PUT support is available (flags value 306). I will open a separate PR for this issue.

Yes, it is true that values 64 and 128 are currently not defined in the kernel/user IB stack. I will push the needed change to the IB stack very soon.

To the best of my knowledge, those macros do not exist in user space. I can try pushing them to rdma-core as well.

Member

  1. Yes, either a separate PR or another commit on this PR to fix that segv would be great; thanks.
  2. Once you get a patch accepted upstream for the 64/128 values, we can probably take this commit. I just wouldn't want to take a commit here that is dependent upon a vendor-specific OFED stack (which then might vary between vendors). Hope that makes sense.
  3. I think getting the macros would be an extra bonus, and can likely be a separate upstream patch for libibverbs. I.e., depending on the timing, either a separate PR or a separate commit on this PR that converts these naked values to enums would be awesome.

Many thanks.

@chrissamuel Nov 14, 2017

Just to note that Mellanox pointed us to this PR regarding OMPI not working with CX5 cards and RoCE.

However, we do not see the backtrace that @dsharma283 reports, just the error:

--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   spartan-gpgpu008
  Local device: mlx5_0
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   spartan-gpgpu007
  Local device: mlx5_0
--------------------------------------------------------------------------
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: spartan-gpgpu008: task 1: Exited with exit code 17
srun: error: spartan-gpgpu007: task 0: Killed

@dsharma283
Author

I have posted a change to the upstream linux-rdma community to pull in the port-speed and link-width enums. I will supply a new revision of this patch once the change is accepted.
http://marc.info/?l=linux-rdma&m=149425108020822&w=2

For reference, I am leaving this pull request open until I come back with a proper fix.

@chrissamuel

I can confirm that the supplied change to opal/mca/common/verbs/common_verbs_port.c makes the verbs interface work with RoCE on Mellanox CX5 cards connected at 50 Gb/s.

Before (via TCP btl):

       8 bytes took        53 usec (   0.301 MB/sec)
      16 bytes took        56 usec (   0.569 MB/sec)
[...]

After (via openib btl):

       8 bytes took        15 usec (   1.080 MB/sec)
      16 bytes took         6 usec (   5.096 MB/sec)
[...]

Patched with just:

[samuel@spartan-build openmpi-3.0.0]$ diff -iubw opal/mca/common/verbs/common_verbs_port.c.orig opal/mca/common/verbs/common_verbs_port.c
--- opal/mca/common/verbs/common_verbs_port.c.orig      2017-09-13 07:31:33.000000000 +1000
+++ opal/mca/common/verbs/common_verbs_port.c   2017-11-15 10:45:53.643484102 +1100
@@ -68,6 +68,14 @@
         /* EDR: 25.78125 Gbps * 64/66, in megabits */
         *bandwidth = 25000;
         break;
+    case 64:
+        /* HDR: 50Gbps * 64/66, in megabits */
+        *bandwidth = 50000;
+        break;
+    case 128:
+        /* NDR: 100Gbps * 64/66, in megabits */
+        *bandwidth = 100000;
+        break;
     default:
         /* Who knows? */
         return OPAL_ERR_NOT_FOUND;

@jsquyres
Member

@chrissamuel @dsharma283 Did that patch get accepted upstream?

@chrissamuel

Not as far as I can see; it was discussed here:

https://patchwork.kernel.org/patch/9716209/

There was talk of re-architecting it in a way that might result in some distros being unsupported.

@dsharma283
Author

dsharma283 commented Nov 15, 2017 via email

@chrissamuel

@dsharma283 I see no evidence of anything using IBV_ in the linux-rdma repos here: https://github.com/linux-rdma

@chrissamuel

chrissamuel commented Nov 16, 2017

@jsquyres Looking at your link, it appears the file has changed since then, and IB_SPEED_HDR is now defined in ib_port_speed:

https://github.com/torvalds/linux/blob/master/include/rdma/ib_verbs.h#L464-L472

enum ib_port_speed {
	IB_SPEED_SDR	= 1,
	IB_SPEED_DDR	= 2,
	IB_SPEED_QDR	= 4,
	IB_SPEED_FDR10	= 8,
	IB_SPEED_FDR	= 16,
	IB_SPEED_EDR	= 32,
	IB_SPEED_HDR	= 64
};

Could we perhaps merge a reduced version of this PR to just recognise IB_SPEED_HDR devices please?
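Such a reduced version would amount to keeping only the HDR hunk of the patch quoted earlier, handling the kernel's IB_SPEED_HDR value (64) and dropping the not-yet-upstream 128 case. Sketched as a diff against common_verbs_port.c (line offsets approximate):

```diff
--- opal/mca/common/verbs/common_verbs_port.c.orig
+++ opal/mca/common/verbs/common_verbs_port.c
@@ -68,6 +68,10 @@
         /* EDR: 25.78125 Gbps * 64/66, in megabits */
         *bandwidth = 25000;
         break;
+    case 64:
+        /* HDR: 50 Gbps, in megabits */
+        *bandwidth = 50000;
+        break;
     default:
         /* Who knows? */
         return OPAL_ERR_NOT_FOUND;
```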

@hppritcha
Member

@jsquyres Could you double-check whether this is okay now?

This patch enables the use of adapters with HDR speeds.
Issue id: 3431

Signed-off-by: Devesh Sharma <[email protected]>
@jsquyres
Member

I updated the patch to remove the "128" case (because that's not upstream).

@jsquyres jsquyres changed the title from "ompi/opal: add support for HDR and NDR link speeds" to "ompi/opal: add support for HDR link speeds" on Mar 22, 2018
@jsquyres jsquyres merged commit a15d823 into open-mpi:master Mar 22, 2018
@dsharma283
Author

dsharma283 commented Mar 26, 2018 via email

@jsquyres
Member

No worries. Note that this cherry pick is slated for the next releases in the v2.1.x, v3.0.x, and v3.1.x series (#4945, #4946, and #4947).
