MinerX scheduler questions #7945

SgtCoin · 2022-01-13T20:25:15Z

The MinerX team has requested details on the new scheduler process. Mainly: what is the expected behaviour with multiple workers on the same node with multiple GPUs? I am opening this discussion so the MinerX team can clarify their questions.

MinerX, please post your questions and comments in this discussion for review.

rjan90 · 2022-01-14T23:06:59Z

Maintainer

Documenting a couple of the variables whose meaning and use I understand:

The MAX_PARALLELISM variable specifies whether to use multiple threads when the GPU is NOT in use: -1 = multithread, and 1 should be single thread.
The MAX_PARALLELISM_GPU variable specifies how many CPU cores to use when the GPU is used. Note: if it is set to 0, it inherits the MaxParallelism setting (either multithread or single thread).
All non-GPU tasks: AP_<size>_GPU_UTILIZATION, C1_<size>_GPU_UTILIZATION, PC1_<size>_GPU_UTILIZATION should only be set to 0 (or not set at all; zero is the default for them). If the value is above 0, the scheduler will think that these tasks require a GPU.
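
A minimal sketch of the rules described above, in Python. This is not Lotus code: the function name, its arguments, and the returned labels are hypothetical and only restate the three rules as written in this comment.

```python
def describe_task(max_parallelism, max_parallelism_gpu, gpu_utilization):
    """Return (needs_gpu, cpu_mode) for one task type, per the rules above.

    Hypothetical helper for illustration only; not part of Lotus.
    """
    # Any utilization above 0 makes the scheduler treat the task as a GPU task.
    needs_gpu = gpu_utilization > 0
    # With the GPU in use, MAX_PARALLELISM_GPU applies unless it is 0,
    # in which case the task falls back to MAX_PARALLELISM.
    if needs_gpu and max_parallelism_gpu != 0:
        setting = max_parallelism_gpu
    else:
        setting = max_parallelism
    if setting == -1:
        cpu_mode = "multithread"
    elif setting == 1:
        cpu_mode = "single thread"
    else:
        cpu_mode = f"{setting} cores"
    return needs_gpu, cpu_mode


# Illustrative values only, not recommended settings.
print(describe_task(1, 0, 0))     # a non-GPU task left at utilization 0
print(describe_task(-1, 6, 0.5))  # a GPU task at half utilization
```

The point of the sketch is the last rule: a non-GPU task with a utilization above 0 would come back with needs_gpu set, which is why those variables should stay at 0.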

benjaminh83 · 2022-01-17T07:15:18Z

I have test sealing workers with one and two GPUs. The enhanced scheduler only seems to work perfectly for the one-GPU setup:

1 GPU worker: I set variables like
PC2_32G_GPU_UTILIZATION=0.5
C2_32G_GPU_UTILIZATION=0.5
And I get nice behaviour where my 3090 is allowed to do 2 C2, 2 PC2, or 1 C2 plus 1 PC2.

2 GPU worker: I do the same, but it does not work as I was expecting.
The scheduler still allows up to 4 jobs, but the handling is not really ideal. It seems like it's doing one job at a time and utilising both GPUs for that one job.
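
The job counts above follow from simple arithmetic on the utilization values. Here is a rough Python sketch of that arithmetic, assuming the scheduler admits tasks as long as the summed GPU share stays within the number of GPUs. That admission rule and both function names are assumptions for illustration, not the actual Lotus scheduler logic.

```python
from itertools import product


def fits(counts, utilization, gpu_count):
    """True if the summed GPU share of the running tasks fits in gpu_count GPUs."""
    used = sum(n * utilization[task] for task, n in counts.items())
    return used <= gpu_count


def full_combinations(utilization, gpu_count, limit=8):
    """List task mixes that fit and leave no room for one more task.

    Assumes every utilization value is above 0; `limit` only bounds the search.
    """
    tasks = sorted(utilization)
    result = []
    for ns in product(range(limit + 1), repeat=len(tasks)):
        counts = dict(zip(tasks, ns))
        if not fits(counts, utilization, gpu_count):
            continue
        # Saturated: adding one more task of any type would no longer fit.
        if all(not fits({**counts, t: counts[t] + 1}, utilization, gpu_count)
               for t in tasks):
            result.append(counts)
    return result


utilization = {"C2": 0.5, "PC2": 0.5}
print(full_combinations(utilization, gpu_count=1))
print(full_combinations(utilization, gpu_count=2))
```

Under this assumption one GPU yields the three mixes listed for the 3090, and two GPUs yield mixes of 4 jobs in total. The sketch only models how many jobs are admitted; it says nothing about which physical GPU each job lands on, which is the part that does not behave as expected here.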

I guess this is because the actual execution of the job is done inside the lotus-worker, while the enhanced scheduler is just the one assigning jobs.

So for me, using these variables does not really seem to get me all the way. I would still have to run 1 worker per physical GPU to ensure it's taking full advantage.

Secondly, I noticed that a lotus-worker running 2 C2 jobs still only provides one CPU thread per GPU to feed it data. This might be a bottleneck. If I run the old trick with 2 workers for one 3090, then each process will run 1 thread at 100%. I guess lotus-worker should assign one thread per job to load data to the GPU.

So these issues might not be related to the enhanced scheduler directly, but a lot of us would like to use 0.5 GPU per job, and it seems like this would need some more tweaking in the lotus-worker to actually work.

Thanks!

neondragon · 2022-01-17T10:01:47Z

Configuration
Two GPUs in the server running Lotus Miner.

Expected
Multiple WindowPoSt will run concurrently.
WinningPoSt will do one of:

When GPU(s) not in use:
  Run on GPU(s)

When GPU(s) in use by other threads:
  Use the GPU in parallel with other threads
  Immediately preempt all other threads using the GPU
  Run on the CPU instead

Actual
Maximum of one WindowPoSt runs concurrently, but it uses all GPUs in parallel.
WinningPoSt is blocked until WindowPoSt releases the GPU lock. When it does run, it's far too late.
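
As a toy illustration of the "Run on the CPU instead" option from the Expected list, here is a Python sketch that tries the GPU lock without blocking and falls back to the CPU when it is held. The lock, the function, and the two prover callbacks are hypothetical stand-ins, not Lotus internals.

```python
import threading

# Hypothetical stand-in for the GPU lock that WindowPoSt holds while proving.
gpu_lock = threading.Lock()


def run_winning_post(prove_on_gpu, prove_on_cpu):
    """Use the GPU if it is free right now, otherwise fall back to the CPU."""
    # Non-blocking acquire: returns False immediately if the lock is held.
    if gpu_lock.acquire(blocking=False):
        try:
            return prove_on_gpu()
        finally:
            gpu_lock.release()
    # The GPU is busy, so do not wait for it.
    return prove_on_cpu()


print(run_winning_post(lambda: "gpu", lambda: "cpu"))  # lock is free here
with gpu_lock:  # simulate WindowPoSt holding the GPU
    print(run_winning_post(lambda: "gpu", lambda: "cpu"))
```

The design choice is that WinningPoSt never waits on the lock, so a long WindowPoSt cannot delay it past its deadline. Whether a CPU proof would be fast enough is a separate question this sketch does not answer.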

kernelogic · 2022-01-17T21:45:13Z

With env C2_32G_GPU_UTILIZATION=0.5 and PC2_32G_GPU_UTILIZATION=0.5, I can confirm that 2 PC2s or 2 C2s can run with a single 3090 and a single worker process.

SgtCoin · 2022-01-25T19:44:13Z

Author

@magik6k Can you lay out what we expect the scheduler to do, given the feedback above?

benjaminh83 · 2022-01-28T09:17:53Z

I wondered why my system starts 2 PC2 jobs at the same time, yet one always finishes before the other. It turns out that the lotus-worker runs the GPU stages one by one: first base tree_c for both jobs, and then base tree_r afterwards. So it makes sense that the first job is done once its base tree_r computation finishes, and only then does the second job go through the same stage.

Unfortunately, this is very inefficient. We want both jobs hitting the GPU at the same time - just like we run multiple PC1 jobs against the CPU!

Logs/example here:

2022-01-28T09:48:44.324 INFO filcrypto::proofs::api > seal_pre_commit_phase2: start
2022-01-28T09:48:44.325 INFO filecoin_proofs::api > validate_cache_for_precommit_phase2:start
2022-01-28T09:48:44.325 INFO filecoin_proofs::api > validate_cache_for_precommit_phase2:finish
2022-01-28T09:48:44.325 INFO filecoin_proofs::api::seal > seal_pre_commit_phase2:start
2022-01-28T09:48:44.325 INFO storage_proofs_porep::stacked::vanilla::proof > replicate_phase2
2022-01-28T09:48:44.325 INFO storage_proofs_porep::stacked::vanilla::proof > Building trees [1048576 descriptors max available]
2022-01-28T09:48:44.325 INFO storage_proofs_porep::stacked::vanilla::proof > generating tree c using the GPU
2022-01-28T09:48:44.325 INFO storage_proofs_porep::stacked::vanilla::proof > Building column hashes
2022-01-28T09:49:06.890 INFO storage_proofs_porep::stacked::vanilla::proof > persisting base tree_c 1/8 of length 153391689
2022-01-28T09:49:48.017 INFO storage_proofs_porep::stacked::vanilla::proof > persisting base tree_c 2/8 of length 153391689
2022-01-28T09:50:27.522 INFO storage_proofs_porep::stacked::vanilla::proof > persisting base tree_c 3/8 of length 153391689
2022-01-28T09:51:06.549 INFO storage_proofs_porep::stacked::vanilla::proof > persisting base tree_c 4/8 of length 153391689
2022-01-28T09:51:48.979 INFO storage_proofs_porep::stacked::vanilla::proof > persisting base tree_c 5/8 of length 153391689
2022-01-28T09:52:34.789 INFO storage_proofs_porep::stacked::vanilla::proof > persisting base tree_c 6/8 of length 153391689
2022-01-28T09:53:20.034 INFO storage_proofs_porep::stacked::vanilla::proof > persisting base tree_c 7/8 of length 153391689
2022-01-28T09:54:05.077 INFO storage_proofs_porep::stacked::vanilla::proof > persisting base tree_c 8/8 of length 153391689
2022-01-28T09:54:05.553 INFO neptune::proteus::program > Using kernel on CUDA.
2022-01-28T09:54:05.588 INFO neptune::proteus::program > Using kernel on CUDA.
2022-01-28T09:54:09.092 INFO storage_proofs_porep::stacked::vanilla::proof > tree_c done
2022-01-28T09:54:09.092 INFO storage_proofs_porep::stacked::vanilla::proof > building tree_r_last
2022-01-28T09:54:09.092 INFO storage_proofs_porep::stacked::vanilla::proof > generating tree r last using the GPU
2022-01-28T09:54:51.532 INFO storage_proofs_porep::stacked::vanilla::proof > persisting base tree_c 1/8 of length 153391689
2022-01-28T09:55:35.904 INFO storage_proofs_porep::stacked::vanilla::proof > persisting base tree_c 2/8 of length 153391689
2022-01-28T09:56:22.361 INFO storage_proofs_porep::stacked::vanilla::proof > persisting base tree_c 3/8 of length 153391689
2022-01-28T09:57:07.261 INFO storage_proofs_porep::stacked::vanilla::proof > persisting base tree_c 4/8 of length 153391689
2022-01-28T09:57:53.300 INFO storage_proofs_porep::stacked::vanilla::proof > persisting base tree_c 5/8 of length 153391689
2022-01-28T09:58:37.402 INFO storage_proofs_porep::stacked::vanilla::proof > persisting base tree_c 6/8 of length 153391689
2022-01-28T09:59:22.817 INFO storage_proofs_porep::stacked::vanilla::proof > persisting base tree_c 7/8 of length 153391689
2022-01-28T10:00:07.081 INFO storage_proofs_porep::stacked::vanilla::proof > persisting base tree_c 8/8 of length 153391689
2022-01-28T10:00:07.558 INFO neptune::proteus::program > Using kernel on CUDA.
2022-01-28T10:00:11.535 INFO storage_proofs_porep::stacked::vanilla::proof > tree_c done
2022-01-28T10:00:11.535 INFO storage_proofs_porep::stacked::vanilla::proof > building tree_r_last
2022-01-28T10:00:11.535 INFO storage_proofs_porep::stacked::vanilla::proof > generating tree r last using the GPU
2022-01-28T10:00:25.486 INFO storage_proofs_porep::stacked::vanilla::proof > building base tree_r_last with GPU 1/8
2022-01-28T10:00:49.237 INFO storage_proofs_porep::stacked::vanilla::proof > building base tree_r_last with GPU 2/8
2022-01-28T10:01:15.729 INFO storage_proofs_porep::stacked::vanilla::proof > building base tree_r_last with GPU 3/8
2022-01-28T10:01:38.563 INFO storage_proofs_porep::stacked::vanilla::proof > building base tree_r_last with GPU 4/8
2022-01-28T10:02:03.092 INFO storage_proofs_porep::stacked::vanilla::proof > building base tree_r_last with GPU 5/8
2022-01-28T10:02:27.175 INFO storage_proofs_porep::stacked::vanilla::proof > building base tree_r_last with GPU 6/8
2022-01-28T10:02:52.249 INFO storage_proofs_porep::stacked::vanilla::proof > building base tree_r_last with GPU 7/8
2022-01-28T10:03:14.829 INFO storage_proofs_porep::stacked::vanilla::proof > building base tree_r_last with GPU 8/8
2022-01-28T10:03:21.178 INFO neptune::proteus::program > Using kernel on CUDA.
2022-01-28T10:03:21.726 INFO storage_proofs_porep::stacked::vanilla::proof > tree_r_last done
2022-01-28T10:03:21.740 INFO storage_proofs_core::data > dropping data /mnt/md4/sealing3/sealed/s-t02576-8537
2022-01-28T10:03:38.892 INFO filecoin_proofs::api::seal > seal_pre_commit_phase2:finish
2022-01-28T10:03:38.893 INFO filcrypto::proofs::api > seal_pre_commit_phase2: finish
2022-01-28T10:03:38.906 INFO filcrypto::proofs::api > seal_commit_phase1: start
2022-01-28T10:03:38.907 INFO filecoin_proofs::api > validate_cache_for_commit:start
2022-01-28T10:03:38.907 INFO filecoin_proofs::api > validate_cache_for_commit:finish
2022-01-28T10:03:38.907 INFO filecoin_proofs::api::seal > seal_commit_phase1:start: SectorId(8537)
2022-01-28T10:03:39.634 INFO filecoin_proofs::api::seal > seal_commit_phase1:finish: SectorId(8537)
2022-01-28T10:03:39.778 INFO filcrypto::proofs::api > seal_commit_phase1: finish
2022-01-28T10:03:39.871 INFO filcrypto::proofs::api > seal_commit_phase1: start
2022-01-28T10:03:39.871 INFO filecoin_proofs::api > validate_cache_for_commit:start
2022-01-28T10:03:39.871 INFO filecoin_proofs::api > validate_cache_for_commit:finish
2022-01-28T10:03:39.871 INFO filecoin_proofs::api::seal > seal_commit_phase1:start: SectorId(8537)
2022-01-28T10:03:40.713 INFO filecoin_proofs::api::seal > seal_commit_phase1:finish: SectorId(8537)
2022-01-28T10:03:40.858 INFO filcrypto::proofs::api > seal_commit_phase1: finish
2022-01-28T10:03:40.901 INFO filcrypto::proofs::api > seal_commit_phase1: start
2022-01-28T10:03:40.901 INFO filecoin_proofs::api > validate_cache_for_commit:start
2022-01-28T10:03:40.901 INFO filecoin_proofs::api > validate_cache_for_commit:finish
2022-01-28T10:03:40.901 INFO filecoin_proofs::api::seal > seal_commit_phase1:start: SectorId(8537)
2022-01-28T10:03:41.794 INFO filecoin_proofs::api::seal > seal_commit_phase1:finish: SectorId(8537)
2022-01-28T10:03:41.928 INFO filcrypto::proofs::api > seal_commit_phase1: finish
2022-01-28T10:03:43.269 INFO storage_proofs_porep::stacked::vanilla::proof > building base tree_r_last with GPU 1/8
2022-01-28T10:04:07.584 INFO storage_proofs_porep::stacked::vanilla::proof > building base tree_r_last with GPU 2/8
2022-01-28T10:04:30.770 INFO storage_proofs_porep::stacked::vanilla::proof > building base tree_r_last with GPU 3/8
2022-01-28T10:04:51.048 INFO storage_proofs_porep::stacked::vanilla::proof > building base tree_r_last with GPU 4/8
2022-01-28T10:05:12.696 INFO storage_proofs_porep::stacked::vanilla::proof > building base tree_r_last with GPU 5/8
2022-01-28T10:05:35.679 INFO storage_proofs_porep::stacked::vanilla::proof > building base tree_r_last with GPU 6/8
2022-01-28T10:05:57.938 INFO storage_proofs_porep::stacked::vanilla::proof > building base tree_r_last with GPU 7/8
2022-01-28T10:06:18.862 INFO storage_proofs_porep::stacked::vanilla::proof > building base tree_r_last with GPU 8/8
2022-01-28T10:06:26.339 INFO storage_proofs_porep::stacked::vanilla::proof > tree_r_last done
2022-01-28T10:06:26.339 INFO storage_proofs_core::data > dropping data /mnt/md4/sealing3/sealed/s-t02576-8544
2022-01-28T10:06:28.631 INFO filecoin_proofs::api::seal > seal_pre_commit_phase2:finish
2022-01-28T10:06:28.631 INFO filcrypto::proofs::api > seal_pre_commit_phase2: finish
2022-01-28T10:06:28.631 INFO filcrypto::proofs::api > seal_commit_phase1: start
2022-01-28T10:06:28.631 INFO filecoin_proofs::api > validate_cache_for_commit:start
2022-01-28T10:06:28.631 INFO filecoin_proofs::api > validate_cache_for_commit:finish
2022-01-28T10:06:28.631 INFO filecoin_proofs::api::seal > seal_commit_phase1:start: SectorId(8544)
2022-01-28T10:06:28.844 INFO filecoin_proofs::api::seal > seal_commit_phase1:finish: SectorId(8544)
2022-01-28T10:06:28.977 INFO filcrypto::proofs::api > seal_commit_phase1: finish
2022-01-28T10:06:28.984 INFO filcrypto::proofs::api > seal_commit_phase1: start
2022-01-28T10:06:28.984 INFO filecoin_proofs::api > validate_cache_for_commit:start
2022-01-28T10:06:28.984 INFO filecoin_proofs::api > validate_cache_for_commit:finish
2022-01-28T10:06:28.984 INFO filecoin_proofs::api::seal > seal_commit_phase1:start: SectorId(8544)
2022-01-28T10:06:29.166 INFO filecoin_proofs::api::seal > seal_commit_phase1:finish: SectorId(8544)
2022-01-28T10:06:29.293 INFO filcrypto::proofs::api > seal_commit_phase1: finish
2022-01-28T10:06:29.300 INFO filcrypto::proofs::api > seal_commit_phase1: start
2022-01-28T10:06:29.300 INFO filecoin_proofs::api > validate_cache_for_commit:start
2022-01-28T10:06:29.300 INFO filecoin_proofs::api > validate_cache_for_commit:finish
2022-01-28T10:06:29.300 INFO filecoin_proofs::api::seal > seal_commit_phase1:start: SectorId(8544)
2022-01-28T10:06:29.478 INFO filecoin_proofs::api::seal > seal_commit_phase1:finish: SectorId(8544)
2022-01-28T10:06:29.605 INFO filcrypto::proofs::api > seal_commit_phase1: finish
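
To put numbers on the serialization, here is a small Python sketch that reads lines in the format above and reports how many seconds after the first line each stage finished. The marker list and function names are my own choices for illustration; the sample lines are copied from the log above.

```python
from datetime import datetime

# Messages treated as stage boundaries (a choice made for this sketch).
MARKERS = ("seal_pre_commit_phase2: start", "tree_c done", "tree_r_last done")

SAMPLE = [
    "2022-01-28T09:48:44.324 INFO filcrypto::proofs::api > seal_pre_commit_phase2: start",
    "2022-01-28T09:54:09.092 INFO storage_proofs_porep::stacked::vanilla::proof > tree_c done",
    "2022-01-28T10:00:11.535 INFO storage_proofs_porep::stacked::vanilla::proof > tree_c done",
    "2022-01-28T10:03:21.726 INFO storage_proofs_porep::stacked::vanilla::proof > tree_r_last done",
    "2022-01-28T10:06:26.339 INFO storage_proofs_porep::stacked::vanilla::proof > tree_r_last done",
]


def parse_line(line):
    """Split a worker log line into (timestamp, message)."""
    stamp, _, rest = line.partition(" ")
    message = rest.split(" > ", 1)[1]
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%f"), message


def marker_offsets(lines):
    """Seconds elapsed since the first line for every marker message."""
    events = [parse_line(line) for line in lines if " > " in line]
    start = events[0][0]
    return [(round((t - start).total_seconds(), 3), msg)
            for t, msg in events if msg in MARKERS]


for offset, message in marker_offsets(SAMPLE):
    print(f"{offset:>9.3f}s  {message}")
```

On the sample lines, the two tree_c completions land roughly six minutes apart and the two tree_r_last completions roughly three minutes apart, which matches the picture of the two jobs taking turns on the GPU rather than overlapping.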
