
Commit 984359c

Merge branch 'deepmodeling:master' into master
2 parents a06a725 + 2e260eb commit 984359c

16 files changed: +221 additions, -96 deletions

doc/examples/expanse.md (3 additions, 3 deletions)

@@ -2,21 +2,21 @@

 [Expanse](https://www.sdsc.edu/support/user_guides/expanse.html) is a cluster operated by the San Diego Supercomputer Center. Here we provide an example to run jobs on the expanse.

-The machine parameters are provided below. Expanse uses the SLURM workload manager for job scheduling. `remote_root` has been created in advance. It's worth metioned that we do not recommend to use the password, so [SSH keys](https://www.ssh.com/academy/ssh/key) are used instead to improve security.
+The machine parameters are provided below. Expanse uses the SLURM workload manager for job scheduling. {ref}`remote_root <machine/remote_root>` has been created in advance. It's worth metioned that we do not recommend to use the password, so [SSH keys](https://www.ssh.com/academy/ssh/key) are used instead to improve security.

 ```{literalinclude} ../../examples/machine/expanse.json
 :language: json
 :linenos:
 ```

-Expanse's standard compute nodes are each powered by two 64-core AMD EPYC 7742 processors and contain 256 GB of DDR4 memory. Here, we request one node with 32 cores and 16 GB memory from the `shared` partition. Expanse does not support `--gres=gpu:0` command, so we use `custom_gpu_line` to customize the statement.
+Expanse's standard compute nodes are each powered by two 64-core AMD EPYC 7742 processors and contain 256 GB of DDR4 memory. Here, we request one node with 32 cores and 16 GB memory from the `shared` partition. Expanse does not support `--gres=gpu:0` command, so we use {ref}`custom_gpu_line <resources[Slurm]/kwargs/custom_gpu_line>` to customize the statement.

 ```{literalinclude} ../../examples/resources/expanse_cpu.json
 :language: json
 :linenos:
 ```

-The following task parameter runs a DeePMD-kit task, forwarding an input file and backwarding graph files. Here, the data set will be used among all the tasks, so it is not included in the `forward_files`. Instead, it should be included in the submission's `forward_common_files`.
+The following task parameter runs a DeePMD-kit task, forwarding an input file and backwarding graph files. Here, the data set will be used among all the tasks, so it is not included in the {ref}`forward_files <task/forward_files>`. Instead, it should be included in the submission's {ref}`forward_common_files <task/forward_common_files>`.

 ```{literalinclude} ../../examples/task/deepmd-kit.json
 :language: json
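
For readers following along, the resources file referenced above might look roughly like the sketch below; the concrete values (memory request, group size) are assumptions, not the actual contents of `examples/resources/expanse_cpu.json`.

```python
# Hypothetical sketch of a Slurm resources dict for one shared Expanse node.
# The exact values are assumptions; see examples/resources/expanse_cpu.json
# in the repository for the real file.
expanse_cpu_resources = {
    "number_node": 1,
    "cpu_per_node": 32,
    "gpu_per_node": 0,
    "queue_name": "shared",
    "group_size": 1,  # assumed grouping of tasks per job
    "kwargs": {
        # Expanse rejects --gres=gpu:0, so custom_gpu_line replaces that
        # directive; the memory request shown here is an assumed example.
        "custom_gpu_line": "#SBATCH --mem=16G",
    },
}
```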

doc/examples/shell.md (3 additions, 3 deletions)

@@ -1,17 +1,17 @@
 # Running multiple MD tasks on a GPU workstation

-In this example, we are going to show how to run multiple MD tasks on a GPU workstation. This workstation does not install any job scheduling packages installed, so we will use `Shell` as `batch_type`.
+In this example, we are going to show how to run multiple MD tasks on a GPU workstation. This workstation does not install any job scheduling packages installed, so we will use `Shell` as {ref}`batch_type <machine/batch_type>`.

 ```{literalinclude} ../../examples/machine/mandu.json
 :language: json
 :linenos:
 ```

-The workstation has 48 cores of CPUs and 8 RTX3090 cards. Here we hope each card runs 6 tasks at the same time, as each task does not consume too many GPU resources. Thus, `strategy/if_cuda_multi_devices` is set to `true` and `para_deg` is set to 6.
+The workstation has 48 cores of CPUs and 8 RTX3090 cards. Here we hope each card runs 6 tasks at the same time, as each task does not consume too many GPU resources. Thus, {ref}`strategy/if_cuda_multi_devices <resources/strategy/if_cuda_multi_devices>` is set to `true` and {ref}`para_deg <resources/para_deg>` is set to 6.

 ```{literalinclude} ../../examples/resources/mandu.json
 :language: json
 :linenos:
 ```

-Note that `group_size` should be set to `0` (means infinity) to ensure there is only one job and avoid running multiple jobs at the same time.
+Note that {ref}`group_size <resources/group_size>` should be set to `0` (means infinity) to ensure there is only one job and avoid running multiple jobs at the same time.
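
As a rough sketch of the settings discussed in this file (the real `examples/resources/mandu.json` may differ in detail), a `Shell` resources dict could look like this:

```python
# Hypothetical resources dict for the 48-core, 8x RTX3090 workstation.
# Key names follow the documented options; the exact values are assumptions.
mandu_resources = {
    "number_node": 1,
    "cpu_per_node": 48,
    "gpu_per_node": 8,
    "queue_name": "",   # no scheduler queue when batch_type is Shell
    "group_size": 0,    # 0 means infinity: all tasks go into a single job
    "para_deg": 6,      # six tasks share one GPU at the same time
    "strategy": {
        "if_cuda_multi_devices": True,  # spread tasks over the 8 cards
    },
}
```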

doc/getting-started.md (6 additions, 6 deletions)

@@ -2,14 +2,14 @@

 DPDispatcher provides the following classes:

-- `Task` class, which represents a command to be run on batch job system, as well as the essential files need by the command.
-- `Submission` class, which represents a collection of jobs defined by the HPC system.
+- {class}`Task <dpdispatcher.submission.Task>` class, which represents a command to be run on batch job system, as well as the essential files need by the command.
+- {class}`Submission <dpdispatcher.submission.Submission>` class, which represents a collection of jobs defined by the HPC system.
   And there may be common files to be uploaded by them.
-  DPDispatcher will create and submit these jobs when a `submission` instance execute `run_submission` method.
+  DPDispatcher will create and submit these jobs when a `submission` instance execute {meth}`run_submission <dpdispatcher.submission.Submission.run_submission>` method.
   This method will poke until the jobs finish and return.
-- `Job` class, a class used by `Submission` class, which represents a job on the HPC system.
-  `Submission` will generate `job`s' submitting scripts used by HPC systems automatically with the `Task` and `Resources`
-- `Resources` class, which represents the computing resources for each job within a `submission`.
+- {class}`Job <dpdispatcher.submission.Job>` class, a class used by {class}`Submission <dpdispatcher.submission.Submission>` class, which represents a job on the HPC system.
+  {class}`Submission <dpdispatcher.submission.Submission>` will generate `job`s' submitting scripts used by HPC systems automatically with the {class}`Task <dpdispatcher.submission.Task>` and {class}`Resources <dpdispatcher.submission.Resources>`
+- {class}`Resources <dpdispatcher.submission.Resources>` class, which represents the computing resources for each job within a `submission`.

 You can use DPDispatcher in a Python script to submit five tasks:
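
A minimal sketch of such a script is shown below. The machine and resources dicts, paths, and the command are placeholders; they illustrate how the classes fit together rather than reproduce any example shipped with this commit.

```python
from dpdispatcher import Machine, Resources, Task, Submission

# Placeholder machine/resources; a real setup would point at an actual
# scheduler, remote_root, and credentials.
machine = Machine.load_from_dict({
    "batch_type": "Slurm",
    "context_type": "LocalContext",
    "local_root": "./",
    "remote_root": "/tmp/dpdispatcher_work",
})
resources = Resources.load_from_dict({
    "number_node": 1,
    "cpu_per_node": 4,
    "gpu_per_node": 0,
    "queue_name": "cpu",
    "group_size": 5,
})

# Five tasks, each running a placeholder command in its own work directory.
task_list = [
    Task(
        command="echo hello",
        task_work_path=f"task_{i}/",
        forward_files=[],
        backward_files=[],
    )
    for i in range(5)
]

submission = Submission(
    work_base="./",
    machine=machine,
    resources=resources,
    task_list=task_list,
    forward_common_files=[],
    backward_common_files=[],
)
submission.run_submission()  # blocks ("pokes") until all jobs finish
```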

dpdispatcher/base_context.py (7 additions, 1 deletion)

@@ -1,9 +1,10 @@
+from abc import ABCMeta, abstractmethod
 from dargs import Argument
 from typing import List

 from dpdispatcher import dlog

-class BaseContext(object):
+class BaseContext(metaclass=ABCMeta):
     subclasses_dict = {}
     options = set()
     def __new__(cls, *args, **kwargs):
@@ -37,22 +38,27 @@ def load_from_dict(cls, context_dict):
     def bind_submission(self, submission):
         self.submission = submission

+    @abstractmethod
     def upload(self, submission):
         raise NotImplementedError('abstract method')

+    @abstractmethod
     def download(self,
                  submission,
                  check_exists = False,
                  mark_failure = True,
                  back_error=False):
         raise NotImplementedError('abstract method')

+    @abstractmethod
     def clean(self):
         raise NotImplementedError('abstract method')

+    @abstractmethod
     def write_file(self, fname, write_str):
         raise NotImplementedError('abstract method')

+    @abstractmethod
     def read_file(self, fname):
         raise NotImplementedError('abstract method')

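With `BaseContext` now carrying `ABCMeta` and `@abstractmethod` markers, a concrete context must implement all five methods before it can be instantiated. A hypothetical minimal subclass might look like the following sketch (not part of this commit):

```python
from dpdispatcher.base_context import BaseContext


class DummyLocalContext(BaseContext):
    """Hypothetical context that keeps everything on the local filesystem."""

    def __init__(self, local_root, remote_root=None, *args, **kwargs):
        self.local_root = local_root
        self.remote_root = remote_root or local_root

    def upload(self, submission):
        pass  # nothing to transfer; local_root doubles as remote_root

    def download(self, submission, check_exists=False, mark_failure=True, back_error=False):
        pass

    def clean(self):
        pass

    def write_file(self, fname, write_str):
        with open(fname, "w") as f:
            f.write(write_str)

    def read_file(self, fname):
        with open(fname) as f:
            return f.read()
```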

dpdispatcher/dp_cloud_server.py (3 additions, 3 deletions)

@@ -90,7 +90,7 @@ def do_submit(self, job):
         input_data['command'] = f"bash {job.script_file_name}"
         # input_data['backward_files'] = self._gen_backward_files_list(job)
         if self.context.remote_profile.get('program_id') is None:
-            warnings.warn('program_id will be compulsory in the future.')
+            warnings.warn('program_id is compulsory.')
         job_id, group_id = self.api.job_create(
             job_type=input_data['job_type'],
             oss_path=input_data['job_resources'],
@@ -124,13 +124,13 @@ def check_status(self, job):
         try:
             dp_job_status = check_return["status"]
         except IndexError as e:
-            dlog.error(f"cannot find job information in check_return. job {job.job_id}. check_return:{check_return}; retry one more time after 60 seconds")
+            dlog.error(f"cannot find job information in bohrium for job {job.job_id}. check_return:{check_return}; retry one more time after 60 seconds")
             time.sleep(60)
             retry_return = self.api.get_tasks(job_id, group_id)
             try:
                 dp_job_status = retry_return["status"]
             except IndexError as e:
-                raise RuntimeError(f"cannot find job information in dpcloudserver's database for job {job.job_id} {check_return} {retry_return}")
+                raise RuntimeError(f"cannot find job information in bohrium for job {job.job_id} {check_return} {retry_return}")

         job_state = self.map_dp_job_state(dp_job_status)
         if job_state == JobStatus.finished:

dpdispatcher/dp_cloud_server_context.py (56 additions, 34 deletions)

@@ -1,6 +1,7 @@
 #!/usr/bin/env python
 # coding: utf-8
 # %%
+import time
 import uuid

 from dargs.dargs import Argument
@@ -14,7 +15,10 @@
 from .dpcloudserver import zip_file
 import shutil
 import tqdm
+
 # from zip_file import zip_files
+from .dpcloudserver.config import ALI_OSS_BUCKET_URL
+
 DP_CLOUD_SERVER_HOME_DIR = os.path.join(
     os.path.expanduser('~'),
     '.dpdispatcher/',
@@ -23,14 +27,15 @@
 ENDPOINT = 'http://oss-cn-shenzhen.aliyuncs.com'
 BUCKET_NAME = 'dpcloudserver'

+
 class DpCloudServerContext(BaseContext):
-    def __init__ (self,
-        local_root,
-        remote_root=None,
-        remote_profile={},
-        *args,
-        **kwargs,
-        ):
+    def __init__(self,
+                 local_root,
+                 remote_root=None,
+                 remote_profile={},
+                 *args,
+                 **kwargs,
+                 ):
         self.init_local_root = local_root
         self.init_remote_root = remote_root
         self.temp_local_root = os.path.abspath(local_root)
@@ -83,6 +88,43 @@ def _gen_oss_path(self, job, zip_filename):
         setattr(job, 'upload_path', path)
         return path

+    def upload_job(self, job, common_files=None):
+        MAX_RETRY = 3
+        if common_files is None:
+            common_files = []
+        self.machine.gen_local_script(job)
+        zip_filename = job.job_hash + '.zip'
+        oss_task_zip = self._gen_oss_path(job, zip_filename)
+        zip_task_file = os.path.join(self.local_root, zip_filename)
+
+        upload_file_list = [job.script_file_name, ]
+        upload_file_list.extend(common_files)
+
+        for task in job.job_task_list:
+            for file in task.forward_files:
+                upload_file_list.append(
+                    os.path.join(
+                        task.task_work_path, file
+                    )
+                )
+
+        upload_zip = zip_file.zip_file_list(
+            self.local_root,
+            zip_task_file,
+            file_list=upload_file_list
+        )
+        result = self.api.upload(oss_task_zip, upload_zip, ENDPOINT, BUCKET_NAME)
+        retry_count = 0
+        while True:
+            if self.api.check_file_has_uploaded(ALI_OSS_BUCKET_URL + oss_task_zip):
+                self._backup(self.local_root, upload_zip)
+                break
+            elif retry_count < MAX_RETRY:
+                time.sleep(1 + retry_count)
+                retry_count += 1
+            else:
+                raise ValueError(f"upload retried excess {MAX_RETRY} terminate.")
+
     def upload(self, submission):
         # oss_task_dir = os.path.join('%s/%s/%s.zip' % ('indicate', file_uuid, file_uuid))
         # zip_filename = submission.submission_hash + '.zip'
@@ -100,30 +142,8 @@ def upload(self, submission):
         if len(job_to_be_uploaded) == 0:
             dlog.info("all job has been uploaded, continue")
             return result
-        for job in tqdm.tqdm(job_to_be_uploaded, desc="Uploading to Lebesgue", bar_format=bar_format):
-            self.machine.gen_local_script(job)
-            zip_filename = job.job_hash + '.zip'
-            oss_task_zip = self._gen_oss_path(job, zip_filename)
-            zip_task_file = os.path.join(self.local_root, zip_filename)
-
-            upload_file_list = [job.script_file_name, ]
-            upload_file_list.extend(submission.forward_common_files)
-
-            for task in job.job_task_list:
-                for file in task.forward_files:
-                    upload_file_list.append(
-                        os.path.join(
-                            task.task_work_path, file
-                        )
-                    )
-
-            upload_zip = zip_file.zip_file_list(
-                self.local_root,
-                zip_task_file,
-                file_list=upload_file_list
-            )
-            result = self.api.upload(oss_task_zip, upload_zip, ENDPOINT, BUCKET_NAME)
-            self._backup(self.local_root, upload_zip)
+        for job in tqdm.tqdm(job_to_be_uploaded, desc="Uploading to Lebesgue", bar_format=bar_format, leave=False):
+            self.upload_job(job, submission.forward_common_files)
         return result
         # return oss_task_zip
         # api.upload(self.oss_task_dir, zip_task_file)
@@ -151,7 +171,8 @@ def download(self, submission):
             job_hash = job_hashs[each['task_id']]
             job_infos[job_hash] = each
         bar_format = "{l_bar}{bar}| {n:.02f}/{total:.02f} % [{elapsed}<{remaining}, {rate_fmt}{postfix}]"
-        for job_hash, info in tqdm.tqdm(job_infos.items(), desc="Validating download file from Lebesgue", bar_format=bar_format):
+        for job_hash, info in tqdm.tqdm(job_infos.items(), desc="Validating download file from Lebesgue",
+                                        bar_format=bar_format, leave=False):
             result_filename = job_hash + '_back.zip'
             target_result_zip = os.path.join(self.local_root, result_filename)
             if self._check_if_job_has_already_downloaded(target_result_zip, self.local_root):
@@ -234,7 +255,7 @@ def clean(self):
         # retcode = cmd_pipes['stdout'].channel.recv_exit_status()
         # return retcode, cmd_pipes['stdout'], cmd_pipes['stderr']

-    def kill(self, cmd_pipes) :
+    def kill(self, cmd_pipes):
         pass

     @classmethod
@@ -251,11 +272,12 @@ def machine_subfields(cls) -> List[Argument]:
             Argument("email", str, optional=False, doc="Email"),
             Argument("password", str, optional=False, doc="Password"),
             Argument("program_id", int, optional=False, doc="Program ID"),
+            Argument("keep_backup", bool, optional=True, doc="keep download and upload zip"),
             Argument("input_data", dict, optional=False, doc="Configuration of job"),
         ], doc=doc_remote_profile)]


 class LebesgueContext(DpCloudServerContext):
     pass

-#%%
+# %%
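
The core of the new `upload_job` method is the verification loop at the end: after the OSS upload it polls `check_file_has_uploaded` with a growing delay and gives up after `MAX_RETRY` attempts. Stripped of the OSS details, the pattern is roughly this sketch:

```python
import time


def wait_until_visible(check, max_retry=3):
    """Poll check() with an increasing delay until it returns True.

    check is any zero-argument callable; in upload_job it wraps
    api.check_file_has_uploaded(ALI_OSS_BUCKET_URL + oss_task_zip).
    """
    retry_count = 0
    while True:
        if check():
            return  # upload confirmed, the caller can back up the zip
        if retry_count < max_retry:
            time.sleep(1 + retry_count)  # 1 s, 2 s, 3 s, ...
            retry_count += 1
        else:
            raise ValueError(f"upload still not visible after {max_retry} retries")
```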

dpdispatcher/dpcloudserver/api.py (23 additions, 10 deletions)

@@ -13,7 +13,9 @@
 from dpdispatcher import dlog

 from .retcode import RETCODE
-from .config import HTTP_TIME_OUT, API_HOST
+from .config import HTTP_TIME_OUT, API_HOST, API_LOGGER_STACK_INFO
+
+ENABLE_STACK = True if API_LOGGER_STACK_INFO else False


 class API:
@@ -35,12 +37,13 @@ def get(self, url, params, retry=0):
                     headers=headers
                 )
             except Exception as e:
-                dlog.error(f"request error {e}")
+                dlog.error(f"request error {e}", stack_info=ENABLE_STACK)
                 continue
             if ret.ok:
                 break
             else:
-                dlog.error(f"request error status_code:{ret.status_code} reason: {ret.reason} body: \n{ret.text}")
+                dlog.error(f"request error status_code:{ret.status_code} reason: {ret.reason} body: \n{ret.text}",
+                           stack_info=ENABLE_STACK)
             time.sleep(retry_count * 10)
         if ret is None:
             raise ConnectionError("request fail")
@@ -69,7 +72,7 @@ def post(self, url, params, retry=0):
                     headers=headers
                 )
             except Exception as e:
-                dlog.error(f"request error {e}")
+                dlog.error(f"request error {e}", stack_info=ENABLE_STACK)
                 continue
             if ret.ok:
                 break
@@ -132,7 +135,7 @@ def download_from_url(self, url, save_file):
                     stream=True
                 )
             except Exception as e:
-                dlog.error(f"request error {e}")
+                dlog.error(f"request error {e}", stack_info=ENABLE_STACK)
                 continue
             if ret.ok:
                 break
@@ -147,7 +150,6 @@ def download_from_url(self, url, save_file):
                     f.write(chunk)
         ret.close()

-
     def upload(self, oss_task_zip, zip_task_file, endpoint, bucket_name):
         dlog.debug(f"debug: upload: oss_task_zip:{oss_task_zip}; zip_task_file:{zip_task_file}")
         bucket = self._get_oss_bucket(endpoint, bucket_name)
@@ -170,7 +172,6 @@ def upload(self, oss_task_zip, zip_task_file, endpoint, bucket_name):
         # print('debug:upload_result:', result, dir())
         return result

-
     def job_create(self, job_type, oss_path, input_data, program_id=None, group_id=None):
         post_data = {
             'job_type': job_type,
@@ -244,11 +245,23 @@ def check_job_has_uploaded(self, job_id):
             if len(ret) == 0:
                 return False
             if ret.get('input_data'):
-                return True
+                return self.check_file_has_uploaded(ret.get('input_data'))
             else:
                 return False
         except ValueError as e:
-            dlog.error(e)
+            dlog.error(e, stack_info=ENABLE_STACK)
+            return False
+
+    def check_file_has_uploaded(self, file_url):
+        try:
+            if not file_url:
+                return False
+            resp = requests.head(file_url)
+            if resp.ok:
+                return True
+            return False
+        except Exception as e:
+            dlog.error(e, stack_info=ENABLE_STACK)
             return False

     def get_job_result_url(self, job_id):
@@ -264,7 +277,7 @@ def get_job_result_url(self, job_id):
             else:
                 return None
         except ValueError as e:
-            dlog.error(e)
+            dlog.error(e, stack_info=ENABLE_STACK)
             return None

 # %%
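
The repeated `stack_info=ENABLE_STACK` arguments use the standard `logging` keyword: when `API_LOGGER_STACK_INFO` is set to a non-empty value, each API error is logged together with the current call stack, which makes intermittent request failures easier to trace. A short illustration with a plain logger:

```python
import logging

logging.basicConfig(level=logging.ERROR)
logger = logging.getLogger("demo")

ENABLE_STACK = True  # in dpdispatcher this is derived from API_LOGGER_STACK_INFO

try:
    raise TimeoutError("simulated request failure")
except Exception as e:
    # With stack_info=True the record ends with "Stack (most recent call last): ..."
    logger.error(f"request error {e}", stack_info=ENABLE_STACK)
```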

dpdispatcher/dpcloudserver/config.py (2 additions, 2 deletions)

@@ -1,8 +1,8 @@
 import os
 HTTP_TIME_OUT = 30

-API_HOST = os.environ.get('DPDISPATCHER_LEBESGUE_API_HOST', "https://lebesgue.dp.tech")
-
+API_HOST = os.environ.get('DPDISPATCHER_LEBESGUE_API_HOST', "https://bohrium.dp.tech/")
+API_LOGGER_STACK_INFO = os.environ.get('API_LOGGER_STACK_INFO', "")
 ALI_STS_ENDPOINT = os.environ.get('DPDISPATCHER_LEBESGUE_ALI_STS_ENDPOINT', 'http://oss-cn-shenzhen.aliyuncs.com')
 ALI_STS_BUCKET_NAME = os.environ.get('DPDISPATCHER_LEBESGUE_ALI_STS_BUCKET_NAME', "dpcloudserver")
 ALI_OSS_BUCKET_URL = os.environ.get('DPDISPATCHER_LEBESGUE_ALI_OSS_BUCKET_URL', "https://dpcloudserver.oss-cn-shenzhen.aliyuncs.com/")
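
All of these values are read from environment variables when the module is imported, so overrides have to be in place before `dpdispatcher.dpcloudserver.config` is first loaded, for example:

```python
import os

# Override the API host and enable stack traces in API error logs; both
# variables must be set before dpdispatcher's config module is imported.
os.environ["DPDISPATCHER_LEBESGUE_API_HOST"] = "https://bohrium.dp.tech/"
os.environ["API_LOGGER_STACK_INFO"] = "1"  # any non-empty value enables stack traces

from dpdispatcher.dpcloudserver.config import API_HOST, API_LOGGER_STACK_INFO

print(API_HOST)                     # https://bohrium.dp.tech/
print(bool(API_LOGGER_STACK_INFO))  # True
```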
