Hanging during simulations #37

@sanguinariojoe

Xubuntu 15.10, sailfish from master (f111f6e), ATI R9 290


When I launch the lid-driven cavity example (./ldc_2d.py), everything seems to work fine at first, but then the simulation suddenly hangs:

[  1751  INFO Master/sobremesa] Machine master starting with PID 25192 at 2016-04-07 18:18:13 UTC
[  1751  INFO Master/sobremesa] Simulation started with: ./ldc_2d.py
[  1760  INFO Master/sobremesa] Sailfish version: f111f6e4a0953357f0871374aa825bc2eaafc2a0
[  1761  INFO Master/sobremesa] Handling subdomains: [0]
[  1761  INFO Master/sobremesa] Subdomain -> GPU map: {0: 0}
[  1764  INFO Master/sobremesa] Selected backend: opencl
[  2291  INFO Subdomain/0] Initializing subdomain.
[  2291  INFO Subdomain/0] Required memory: 
[  2291  INFO Subdomain/0] . distributions: 5 MiB
[  2291  INFO Subdomain/0] . fields: 0 MiB
[  2422  INFO Subdomain/0] On-GPU invalid result check disabled as the device does not support all required features.
/home/pepe/Downloads/sailfish/sailfish/backend_opencl.py:159: UserWarning: Received OpenCL source code in Unicode, should be ASCII string. Attempting conversion.
  return cl.Program(self.ctx, preamble + source).build() #'-cl-single-precision-constant -cl-fast-relaxed-math')
[  5056 WARNING Subdomain/0] Running infinite simulation.
[  5056  INFO Subdomain/0] Starting simulation.
[  5510  INFO Subdomain/0] iteration:2000  speed:277.77 MLUPS
[  5727  INFO Subdomain/0] iteration:3000  speed:295.56 MLUPS
[  5951  INFO Subdomain/0] iteration:4000  speed:288.61 MLUPS
[  6175  INFO Subdomain/0] iteration:5000  speed:288.83 MLUPS
[  6441  INFO Subdomain/0] iteration:6000  speed:243.48 MLUPS
[  6753  INFO Subdomain/0] iteration:7000  speed:208.11 MLUPS
[  7033  INFO Subdomain/0] iteration:8000  speed:230.93 MLUPS
[  7318  INFO Subdomain/0] iteration:9000  speed:227.47 MLUPS
[  7574  INFO Subdomain/0] iteration:10000  speed:252.54 MLUPS
[  7808  INFO Subdomain/0] iteration:11000  speed:276.91 MLUPS
[  8067  INFO Subdomain/0] iteration:12000  speed:250.54 MLUPS
[  8304  INFO Subdomain/0] iteration:13000  speed:273.10 MLUPS
[  8595  INFO Subdomain/0] iteration:14000  speed:222.76 MLUPS
[  8858  INFO Subdomain/0] iteration:15000  speed:246.14 MLUPS
[  9052  INFO Subdomain/0] iteration:16000  speed:333.59 MLUPS
[  9260  INFO Subdomain/0] iteration:17000  speed:311.17 MLUPS
[  9503  INFO Subdomain/0] iteration:18000  speed:266.69 MLUPS
[  9774  INFO Subdomain/0] iteration:19000  speed:238.98 MLUPS
[ 10013  INFO Subdomain/0] iteration:20000  speed:271.23 MLUPS
[ 10268  INFO Subdomain/0] iteration:21000  speed:253.38 MLUPS
[ 10535  INFO Subdomain/0] iteration:22000  speed:243.09 MLUPS
[ 10782  INFO Subdomain/0] iteration:23000  speed:262.50 MLUPS
[ 11032  INFO Subdomain/0] iteration:24000  speed:258.22 MLUPS
[ 11283  INFO Subdomain/0] iteration:25000  speed:258.77 MLUPS
[ 11527  INFO Subdomain/0] iteration:26000  speed:265.50 MLUPS
[ 11791  INFO Subdomain/0] iteration:27000  speed:245.31 MLUPS
[ 12058  INFO Subdomain/0] iteration:28000  speed:242.33 MLUPS
[ 12311  INFO Subdomain/0] iteration:29000  speed:255.68 MLUPS
[ 12564  INFO Subdomain/0] iteration:30000  speed:256.76 MLUPS
[ 12818  INFO Subdomain/0] iteration:31000  speed:254.30 MLUPS
[ 13066  INFO Subdomain/0] iteration:32000  speed:261.79 MLUPS
[ 13491  INFO Subdomain/0] iteration:33000  speed:152.45 MLUPS
[ 13741  INFO Subdomain/0] iteration:34000  speed:259.01 MLUPS
[ 14018  INFO Subdomain/0] iteration:35000  speed:233.74 MLUPS
[ 14260  INFO Subdomain/0] iteration:36000  speed:267.39 MLUPS
[ 14510  INFO Subdomain/0] iteration:37000  speed:258.93 MLUPS

If I cancel the job, the traceback suggests a synchronization problem between processes: the master is blocked in join(), waiting for the simulation process to exit:

  File "./ldc_2d.py", line 41, in <module>
    ctrl.run()
  File "/home/pepe/Downloads/sailfish/sailfish/controller.py", line 793, in run
    return self._finish_simulation(subdomain_specs, summary_receiver)
  File "/home/pepe/Downloads/sailfish/sailfish/controller.py", line 708, in _finish_simulation
    self._simulation_process.join()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 145, in join
    res = self._popen.wait(timeout)
  File "/usr/lib/python2.7/multiprocessing/forking.py", line 154, in wait
    return self.poll(0)
  File "/usr/lib/python2.7/multiprocessing/forking.py", line 135, in poll
    pid, sts = os.waitpid(self.pid, flag)
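
The traceback ends in os.waitpid, so the master is only waiting in Process.join() for a child that never returns. As a generic illustration (this is not Sailfish code; stuck_worker, join_or_terminate and the timeout value are hypothetical names chosen for this sketch), a join with a timeout can tell a hung child apart from one that finished:

```python
import multiprocessing
import time


def stuck_worker():
    # Hypothetical stand-in for a simulation process that never returns.
    while True:
        time.sleep(0.1)


def join_or_terminate(proc, timeout):
    """Wait up to `timeout` seconds for `proc`, then terminate it if needed.

    Returns True if the child exited on its own, False if it was still
    alive after the timeout and had to be terminated.
    """
    proc.join(timeout)
    if proc.is_alive():
        proc.terminate()
        proc.join()  # reap the terminated child
        return False
    return True


if __name__ == "__main__":
    p = multiprocessing.Process(target=stuck_worker)
    p.start()
    finished = join_or_terminate(p, timeout=1.0)
    print("child finished on its own:", finished)
```

This only makes the symptom visible from the parent side; it says nothing about why the child stops making progress.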

The hang is not specific to the multi-process mode. If I launch the case in a single process with the following command:

./ldc_2d.py --debug_single_process

it hangs again:

[  1718  INFO MainProcess] Machine master starting with PID 25261 at 2016-04-07 18:21:15 UTC
[  1718  INFO MainProcess] Simulation started with: ./ldc_2d.py --debug_single_process
[  1728  INFO MainProcess] Sailfish version: f111f6e4a0953357f0871374aa825bc2eaafc2a0
[  1729  INFO MainProcess] Handling subdomains: [0]
[  1729  INFO MainProcess] Subdomain -> GPU map: {0: 0}
[  1730  INFO MainProcess] Selected backend: opencl
[  2273  INFO MainProcess] Initializing subdomain.
[  2273  INFO MainProcess] Required memory: 
[  2273  INFO MainProcess] . distributions: 5 MiB
[  2273  INFO MainProcess] . fields: 0 MiB
[  2448  INFO MainProcess] On-GPU invalid result check disabled as the device does not support all required features.
/home/pepe/Downloads/sailfish/sailfish/backend_opencl.py:159: UserWarning: Received OpenCL source code in Unicode, should be ASCII string. Attempting conversion.
  return cl.Program(self.ctx, preamble + source).build() #'-cl-single-precision-constant -cl-fast-relaxed-math')
[  5546 WARNING MainProcess] Running infinite simulation.
[  5564  INFO MainProcess] Starting simulation.
[  6078  INFO MainProcess] iteration:2000  speed:266.26 MLUPS
[  6288  INFO MainProcess] iteration:3000  speed:304.68 MLUPS
[  6513  INFO MainProcess] iteration:4000  speed:287.41 MLUPS
[  6740  INFO MainProcess] iteration:5000  speed:285.69 MLUPS
[  6966  INFO MainProcess] iteration:6000  speed:286.89 MLUPS
[  7199  INFO MainProcess] iteration:7000  speed:278.13 MLUPS
[  7452  INFO MainProcess] iteration:8000  speed:255.82 MLUPS
[  7703  INFO MainProcess] iteration:9000  speed:257.62 MLUPS
[  7921  INFO MainProcess] iteration:10000  speed:297.96 MLUPS
[  8164  INFO MainProcess] iteration:11000  speed:266.58 MLUPS
[  8382  INFO MainProcess] iteration:12000  speed:296.28 MLUPS
[  8632  INFO MainProcess] iteration:13000  speed:259.16 MLUPS
[  8895  INFO MainProcess] iteration:14000  speed:246.05 MLUPS
[  9125  INFO MainProcess] iteration:15000  speed:282.82 MLUPS
[  9355  INFO MainProcess] iteration:16000  speed:281.31 MLUPS
[  9590  INFO MainProcess] iteration:17000  speed:275.48 MLUPS
[  9839  INFO MainProcess] iteration:18000  speed:260.35 MLUPS
[ 10076  INFO MainProcess] iteration:19000  speed:272.75 MLUPS
[ 10351  INFO MainProcess] iteration:20000  speed:235.59 MLUPS
[ 10625  INFO MainProcess] iteration:21000  speed:236.49 MLUPS
[ 11062  INFO MainProcess] iteration:22000  speed:148.00 MLUPS
[ 11284  INFO MainProcess] iteration:23000  speed:292.25 MLUPS
[ 11503  INFO MainProcess] iteration:24000  speed:295.61 MLUPS
[ 11764  INFO MainProcess] iteration:25000  speed:248.77 MLUPS
[ 12020  INFO MainProcess] iteration:26000  speed:252.55 MLUPS
[ 12274  INFO MainProcess] iteration:27000  speed:254.87 MLUPS
[ 12531  INFO MainProcess] iteration:28000  speed:252.07 MLUPS
[ 12779  INFO MainProcess] iteration:29000  speed:261.26 MLUPS

And this time I cannot cancel the job :-S
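
Since the single-process run cannot be interrupted, one way to see where it is stuck might be to ask the interpreter to dump its stack on a signal. The sketch below is an assumption-laden illustration, not something Sailfish provides: install_stack_dump is a hypothetical helper, it relies on the faulthandler module (part of the standard library since Python 3.3; for the Python 2.7 shown in the traceback, a separately installed backport would be needed), and faulthandler.register is available on Unix only.

```python
import faulthandler
import signal
import sys


def install_stack_dump(signum=signal.SIGUSR1, stream=sys.stderr):
    """Dump the Python stack of every thread whenever `signum` arrives.

    The handler runs at the C level, so it can still write a traceback
    while the interpreter is blocked inside a long-running native call.
    `stream` must be backed by a real file descriptor.
    """
    faulthandler.register(signum, file=stream, all_threads=True)
    return signum
```

Assuming install_stack_dump() is called near the top of the launch script, running `kill -USR1 <PID>` from another terminal (with the PID printed in the "Machine master starting with PID" log line) should print the current Python stack to stderr without stopping the process.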
