- 
                Notifications
    You must be signed in to change notification settings 
- Fork 11
Add ability to set thread affinity #51
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good I could find just few nits in threads.c you might want to address.
thanks.
        
          
                source/perflib/threads.c
              
                Outdated
          
        
      | unsigned int ret = 0; | ||
|  | ||
| for (size_t i = 0; i < sizeof(a) * CHAR_BIT; i++) | ||
| ret += ((a & (1ULL << i)) == 0); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this what I've messed up in my sgguested change. we discussed this off-llist we are supposed to count bits which are set, right? if so then we need ret += ((a & (1ULL << i)) != 0); here.
| goto err; | ||
| } | ||
|  | ||
| ta = OPENSSL_malloc(sizeof(*ta) * threadcount); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is good to use OPENSSL_malloc() so tests work with libraries which don't provide OPENSSL_malloc_array()
| args[i].num = i; | ||
| perflib_run_thread(&threads[i], &args[i]); | ||
| if (!(run_threads[i] = perflib_run_thread_(&threads[i], &args[i], | ||
| ta + i))) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we use &ta[i] here so it is clear we work with array, thanks.
Signed-off-by: Eugene Syromiatnikov <[email protected]>
efab928    to
    c033606      
    Compare
  
    Signed-off-by: Eugene Syromiatnikov <[email protected]>
Signed-off-by: Eugene Syromiatnikov <[email protected]>
Signed-off-by: Eugene Syromiatnikov <[email protected]>
Co-Authored-by: Alexandr Nedvedicky <[email protected]> Signed-off-by: Eugene Syromiatnikov <[email protected]>
Signed-off-by: Eugene Syromiatnikov <[email protected]>
Signed-off-by: Eugene Syromiatnikov <[email protected]>
Signed-off-by: Eugene Syromiatnikov <[email protected]>
Signed-off-by: Eugene Syromiatnikov <[email protected]>
c033606    to
    30f2a1f      
    Compare
  
    | static ossl_inline unsigned int popcount(affinity_t a) | ||
| { | ||
| return __builtin_popcountl(a); | ||
| } | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we really need to special case the ability to use a compiler built in here? It seems like the balance between the ifdeffery here and a single function that counts up to sizeof(unsigned long) * 8 bits is biased in favor of just having one function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just don't like the idea of rolling own implementation when the built-in is right here, but I don't really care here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we want to use compiler built-ins can we also enable them for clang?
diff --git a/source/perflib/threads.c b/source/perflib/threads.c
index 8cf3a76..4f9187c 100644
--- a/source/perflib/threads.c
+++ b/source/perflib/threads.c
@@ -22,7 +22,7 @@
 /** affinity_t-typed value with nth bit set. */
 #define AFFINITY_BIT(n) ((affinity_t)1U << (n))
 
-#if defined(__GNUC__)
+#if defined(__GNUC__) || defined(__clang__)
 
 static ossl_inline unsigned int popcount(affinity_t a)
 {
@@ -41,7 +41,7 @@ static ossl_inline unsigned int popcount(affinity_t a)
     return ret;
 }
 
-#endif /* __GNUC__ */
+#endif /* __GNUC__ or __clang__ */
 
 int perflib_roundrobin_affinity(affinity_t *cpu_set_bits, size_t cpu_set_size,
                                 size_t num, size_t cnt, void *arg)
to be honest I'm with Neal here. My reasoning is the peftools need to be portable to as many platforms/compilers as (conveniently) possible. you are rolling the builtin implementation anway so using a bultinn one here does not buy as much.
on the other hand if limit ourselves to clang and GCC tools, then I'm fine with going to bultin only one.
the true reason I don't like the if/else here is it leaves a dead/untested code behind. In my opinion the true choice here should be:
- being portable, then roll your own
 or
- let's rely on compiler then code will work on platforms where bultiin is provided
in my view the perftools are roll your own case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So what's the goal and the problem we are seeking to solve here that we need code to support to solve.
How will we support this, when people start fiddling with it and complain about impact because they do it poorly, and attempt to be smarter than the scheduler and do it badly.
To ask this another way, if our goal is to have less noisy data for falling-down-the-hill performance tests, does this belong in testing support for us and not in the main library?
mea culpa, that's what this is, so yeah it's in the test library, I have less objections.
| "\t-v verbose output, includes min, max, stddev, and median times\n" | ||
| "\t-T timeout for each test run in seconds, can be fractional" | ||
| "\t-T timeout for each test run in seconds, can be fractional\n" | ||
| "\t-b Set CPU affinity for the threads (in round robin fashion)\n" | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what about adding this option to all the other tests in the repo?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was prototyping on pkeyread, but, yeah, adding it to other tests should be trivial.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I understand Nikola's question better now. and I think he is making a good point. let me ask the question different way: what is a difference between running the test using the command:
./pkeyread -f all -k all -b 16
and
taskset  0xffff ./pkeyread -f all -k all 16
If I understand things right, then th -b is a shortcut so people don't need to think of using a taskset(1) is my understanding correct?
| OSSL_TIME max_time; | ||
|  | ||
| int err = 0; | ||
| int error = 0; | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this change? Theres a good portion of this PR dedicated to renaming variables that doesn't really have anything to do with the addition of thread affinity management.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this address linker issues. there is function err() which conflicts with variables err. the changes in this PR just discovered this conflict. so the change got included here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wouldn't it be more prudent to just rename the err() function that commit 953a33f introduced then?
Though I'm surprised that a reasonable compiler can't tell the difference between a variable reference and a function call of the same name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wouldn't it be more prudent to just rename the err() function that commit 953a33f introduced then?
Though I'm surprised that a reasonable compiler can't tell the difference between a variable reference and a function call of the same name.
the err() is a windows version of err(3) which is commonly used on *nix it would make me sad to see it go,
|  | ||
| #include <string.h> | ||
| #include <openssl/crypto.h> | ||
| #include <openssl/macros.h> | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this needed? There is no other change in this file
| } | ||
|  | ||
| void | ||
| err(int status, const char *fmt, ...) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why to duplicate errx function? Same for warn and warnx
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
err/warn append the output of perror() to the message, while errx/warnx just print the provided string (along with the program name as a prefix).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
err/warn append the output of perror() to the message, while errx/warnx just print the provided string (along with the program name as a prefix).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see now; sorry for the noise
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't follow why we need to duplicate all this stuff for thread affinity.. This seems unrelated.
If we want this it should probably be done separately, or we should ask ourselves why we can't use the standard stuff.
| # include <err.h> | ||
|  | ||
| # else /* _WIN32 */ | ||
|  | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't use new lines around include in #if.
| The work is ok, but I'm a little bit lost why the work is needed. | 
| 
 my understanding is you want to pin a thread to CPU so scheduler does not migrate the thread which runs performance test around the system. I think this does not present on system with low number of cores. it becomes more important on large multicore systems. | 
| static ossl_inline unsigned int popcount(affinity_t a) | ||
| { | ||
| return __builtin_popcountl(a); | ||
| } | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we want to use compiler built-ins can we also enable them for clang?
diff --git a/source/perflib/threads.c b/source/perflib/threads.c
index 8cf3a76..4f9187c 100644
--- a/source/perflib/threads.c
+++ b/source/perflib/threads.c
@@ -22,7 +22,7 @@
 /** affinity_t-typed value with nth bit set. */
 #define AFFINITY_BIT(n) ((affinity_t)1U << (n))
 
-#if defined(__GNUC__)
+#if defined(__GNUC__) || defined(__clang__)
 
 static ossl_inline unsigned int popcount(affinity_t a)
 {
@@ -41,7 +41,7 @@ static ossl_inline unsigned int popcount(affinity_t a)
     return ret;
 }
 
-#endif /* __GNUC__ */
+#endif /* __GNUC__ or __clang__ */
 
 int perflib_roundrobin_affinity(affinity_t *cpu_set_bits, size_t cpu_set_size,
                                 size_t num, size_t cnt, void *arg)
to be honest I'm with Neal here. My reasoning is the peftools need to be portable to as many platforms/compilers as (conveniently) possible. you are rolling the builtin implementation anway so using a bultinn one here does not buy as much.
on the other hand if limit ourselves to clang and GCC tools, then I'm fine with going to bultin only one.
the true reason I don't like the if/else here is it leaves a dead/untested code behind. In my opinion the true choice here should be:
- being portable, then roll your own
 or
- let's rely on compiler then code will work on platforms where bultiin is provided
in my view the perftools are roll your own case.
| "\t-v verbose output, includes min, max, stddev, and median times\n" | ||
| "\t-T timeout for each test run in seconds, can be fractional" | ||
| "\t-T timeout for each test run in seconds, can be fractional\n" | ||
| "\t-b Set CPU affinity for the threads (in round robin fashion)\n" | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I understand Nikola's question better now. and I think he is making a good point. let me ask the question different way: what is a difference between running the test using the command:
./pkeyread -f all -k all -b 16
and
taskset  0xffff ./pkeyread -f all -k all 16
If I understand things right, then th -b is a shortcut so people don't need to think of using a taskset(1) is my understanding correct?
| 
 I think the difference between: and Is that in the latter case we rely on the OS scheduler to place threads on unique cores. In the former case thread 1 is guaranteed to have an affinity of 0x1, thread 2 an affinity of 0x2, thread 3 an affinity of 0x4, etc. In the latter all threads can run on any ore in the affinity set. Will they likely be scheduled to unique cores? Probably. Are they guaranteed to be? No. I guess the question to ask is "Does that matter to us?", and honestly, I'm not sure of the answer there. | 
| 
 So, the original reason I ended up writing that is that while working on  
 All those factors are predominantly relevant only when running on NUMA systems, naturally. | 
| > All those factors are predominantly relevant only when running on NUMA systems, naturally. understood. my preference here is to get away with  In my opinion the less we do here the better. | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be honest I'm not entirely convinced about affinity changes here. My preference is to get away with task_set (and similar command line tools on other than linux OSes if stach command is provided). Would you mind to keep it in your perftools fork for a while? I would include it when I will be sure perftools need it. At the moment I just think time has not come yet. thanks.
| > 
> 
> All those factors are predominantly relevant only when running on NUMA systems, naturally. back in the time you were hunting down the x509storeissuer performance would running the test using task_set help you or it was not sufficient so you had to opt to implement thread pinning using glibc? | 
| } | ||
|  | ||
| void | ||
| err(int status, const char *fmt, ...) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't follow why we need to duplicate all this stuff for thread affinity.. This seems unrelated.
If we want this it should probably be done separately, or we should ask ourselves why we can't use the standard stuff.
| do { \ | ||
| fprintf(stderr, "%s:%d(%s): ", __FILE__, __LINE__, __FUNCTION__); \ | ||
| errx(__VA_ARGS__); \ | ||
| } while (0) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like yet another way to do this just for this PR, I'm questioning why this needs to be included,
| static ossl_inline unsigned int popcount(affinity_t a) | ||
| { | ||
| return __builtin_popcountl(a); | ||
| } | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So what's the goal and the problem we are seeking to solve here that we need code to support to solve.
How will we support this, when people start fiddling with it and complain about impact because they do it poorly, and attempt to be smarter than the scheduler and do it badly.
To ask this another way, if our goal is to have less noisy data for falling-down-the-hill performance tests, does this belong in testing support for us and not in the main library?
mea culpa, that's what this is, so yeah it's in the test library, I have less objections.
| IMO we should try to rely on using command line tools to deal with thread affinity (like  so I would drop the affinity related changes from here. other pieces left behind still seem useful as they make things tidy. | 
| As mentioned in [1], I haven't managed to find any discernible difference in performance and noise levels on  | 
Includes the relevant update to the
pkeyreadtest, as it already tries to report some thread indices in the-vmode.