
Commit a4057ab

HaFred and SamitHuang authored
Janus-Pro Mixed-task SFT Training (mindspore-lab#911)
* fix unnecessary flake8 check about the colon space
* fix typo and refactor readme
* fix mul issues during conversion tests
* fix mul issues during conversion tests
* fix redundancy
* sv3d conversion test
* pdat
* update linting
* rm flake8 change
* update sampling script: rm unnecessary device_id arg
* update sampling script: rm unnecessary device_id arg
* init janus
* update
* add vq16
* init januspro models ph
* vq load ckpt
* init commit 4 siglip
* update siglip
* same
* fix interpolate
* test vq fp32 mre<1e-5
* fix norm, decode image ok in bf16
* update siglip
* siglip testing, mae around 3e-2
* for merge
* same
* add VLChatProcessor and test ok
* add chat process test
* use pil resize
* support loading model.bin directly from `from_pretrained`, vlm under test
* add mlp projector (und. vision aligner)
* vlm under test: LlamaForCausalLM discrepancy with the hf
* add vqa test
* fix conflicts
* get correct GELU setup
* LlamaModel return_dict=True
* Merge from HaFred/mindone/janus
* gen_inf done, w/o kv cache, siglip precision needs to be aligned
* tqdm better vis
* vqa infer runnable, generated answer not complete
* reshape tensor
* fix length and graph mode for VQA, answer is complete and better aligned
* setup static cache, but self(input_embeds) output tokens still not correct
* fix t2v w/o kv cache, gen ok
* fix dim
* housekeeping t2i
* fix temperature=0 bug
* add readme and slightly refactor inference
* Update README.md
* rm file
* add file
* support `use_cache==True` with an explicit init of static cache in tuples, but the speed is even slower than `use_cache==False`
* add gradio and pyproject.toml
* add file for gradio
* update readme and fix gradio
* Update README.md
* small fix
* fix bug in graph mode vqa
* rm formula example
* patching others
* support loading sharded ckpt, undo saving .safetensors
* add vit block test
* rm attrdict and use addict for py3.10
* use ms.Tensor instead of mint.Tensor for compat
* fix vanilla attn in siglip
* use mint GELU in sigLIP
* fix above
* add vit block test
* align the naming of vqa_inference
* update readme
* add throughput calc
* t2i support graph mode w/ dynamic shape input
* make t2i generation print time traces, and support graph mode now with ops.multinomial
* refine multinomial
* add multinomial test
* padded kv cache for graph mode t2i
* auto select multinomial
* fix transformers FA dtype
* fix merge conflict
* fix vision encode precision, formula parsing ok
* revert llama min_dtype to eliminate en/zh mix in vqa
* for t2i llama_model forward, init a static cache outside the comp graph
* fp32, this is for the prev commit
* fix typo
* fix typo
* fix typo
* llamamodel.construct() handling dynamic shape in graph mode
* fix ms2.5 pynative error
* acc infer by mint.topk
* transformers use mint.multinomial for ms2.5
* better precision aligned (mint.nn.Conv2d), requires ms2.5
* update readme
* support t2i graph mode kv cache, bringing a speed boost
* fix an initial negligence
* t2i graph mode only `set_inputs` for no-cache cases, because with cache it's slower to `set_inputs` for dynamic shape
* fix typo and default
* logging the token/s for vqa excluding graph compilation time
* include performance tables
* update the correct performance for the graph mode-compatible code
* revert: no more .bin loading
* update readme, ms_compiler_cache enabled
* lint files
* lint files
* update main page readme
* lint files
* lint files
* linting
* update both inference files for faster model init
* fix linting
* update readme
* merge samit code
* match with new np
* dev loss
* mixed-task training `bs=6` ok, `bs=8` oom
* mixed-task sft training, support slice dataset or weightrandsamp dataset
* mixed-task sft training, support slice dataset or weightrandsamp dataset
* lint
* update comment
* support graph mode mixed-data sft, with `use_value_and_grad==False`
* support graph mode mixed-data sft, with `use_value_and_grad==False`
* rm redundant code in siglip
* rm no grad trunc normal used in siglip
* lint
* graph mode sft launch script
* lint
* lint
* lint
* lint
* rm redundancy
* return precommit
* update readme and set vlm.construct() default as pynative version for any-task training
* fix readme
* update discrepancy
* train mode selection
* update modeling_vlm which already works under graph to further support pynative at the same time
* fix ci
* fix ci
* update for graphmode mixed-task sft
* update req
* update datasets
* update training script to support all testing examples in training.md
* fix lint
* fix lint
* refactor
* update readme, train script for better readability
* update readme, train script for better readability
* doc: put into the right order for doing single task sft graph mode patching

---------

Co-authored-by: SamitHuang <285365963@qq.com>
1 parent 6df5641 commit a4057ab

21 files changed (+858, −256 lines)

examples/janus/README.md

Lines changed: 3 additions & 3 deletions
```diff
@@ -27,12 +27,12 @@
 <!-- 🤗 Online Demo (<a href="https://huggingface.co/spaces/deepseek-ai/Janus-Pro-7B"><b>Janus-Pro-7B</b></a>, <a href="https://huggingface.co/spaces/deepseek-ai/Janus-1.3B"><b>Janus</b></a>, <a href="https://huggingface.co/spaces/deepseek-ai/JanusFlow-1.3B"><b>JanusFlow</b></a>) -->
 </p>

-We provide an efficient MindSpore implementation of [JanusPro](https://github.com/deepseek-ai/Janus). This repository is built on the models and code released by DeepSeek. We are grateful for their exceptional work and generous contribution to open source.
+We provide an efficient MindSpore implementation of [Janus-Pro](https://github.com/deepseek-ai/Janus). This repository is built on the models and code released by DeepSeek. We are grateful for their exceptional work and generous contribution to open source.


 ## News

-**2025.03.12**: We have reproduced the multi-modal training pipelines referring to the JanusPro [paper](https://github.com/deepseek-ai/Janus), see [docs/training.md](docs/training.md).
+**2025.03.12**: We have reproduced the multi-modal training pipelines referring to the Janus-Pro [paper](https://github.com/deepseek-ai/Janus), see [docs/training.md](docs/training.md).

 **2025.02.10**: MindSpore implementation of Janus-Pro is released, supporting both multimodal understanding and visual generation on Ascend NPU.

@@ -51,7 +51,7 @@ Generation with Data and Model Scaling</b></a>

 ## 2. Model Download

-JanusPro is available to the public to support a broader and more diverse range of research within both academic and commercial communities.
+Janus-Pro is available to the public to support a broader and more diverse range of research within both academic and commercial communities.
 Please note that the use of this model is subject to the terms outlined in [License section](#5-license). Commercial usage is
 permitted under these terms.
```

examples/janus/docs/training.md

Lines changed: 64 additions & 9 deletions
````diff
@@ -1,4 +1,4 @@
-# JanusPro Training
+# Janus-Pro Training

 ## Requirements

@@ -21,12 +21,32 @@ huggingface-cli download jasonhuang23/artwork --repo-type dataset --local-dir d
 huggingface-cli download rbojja/medical-vqa --repo-type dataset --local-dir datasets/medical-vqa
 ```

-## Run Training
+Before launching SFT training with the scripts under [../scripts/](../scripts/), set the env vars `YOUR_DATA_PATH` and `YOUR_DOWNLOADED_JANUS_CKPT_PATH` in each script.
+
+## Run Training for Single Task
+After setting up paths as above, you are good to go.
+
+- Multimodal Understanding Task (VQA)
+
+```shell
+bash scripts/run_sft_vqa.sh
+```

 - Text Generation Task

 ```shell
-bash scripts/run_sft_text.sh
+bash scripts/run_sft_text.sh # if no manual patching, by default it should be changed into pynative
+```
+
+Patching `janus/models/modeling_vlm.py`: **Single task for pure text**
+```diff
+# @ L428
+-- def construct(
+++ # def construct( # just comment the whole function out
+
+# @ L476
+-- def construct_graph_single_task(
+++ def construct(
 ```

 - Text-to-Image Generation Task (T2I)
@@ -35,22 +55,50 @@ bash scripts/run_sft_text.sh
 bash scripts/run_sft_t2i.sh
 ```

-- Multimodal Understanding Task (VQA)
+The default training stage is stage 3, that is, all modules are trainable except for VQ16 for image token decoding. To switch to another stage, modify the `--stage` argument in the training script.
+
+For more detailed arguments, please run `python train.py -h`.
+
+### Multi-task Supervised Fine-tuning (Mixed-SFT)

 ```shell
-bash scripts/run_sft_vqa.sh
+bash scripts/run_sft_mixed_graph.sh
 ```

-The default training stage is stage 3, that is, all modules are trainable except for VQ16 for image token decoding. To switch to other stage, you can modify the `--stage` argument in the training script.
+We also implemented **a stage-3 SFT on medical data, aiming to build a radiology expert model**. The datasets can be retrieved from the following HuggingFace repos.

-For more detailed arguments, please run `python train.py -h`.
+| | #Data Samples | HuggingFace Source |
+| --- | --- | --- |
+| VQA | 100 | rbojja/medical-vqa |
+| pure-text | 20 | qiaojin/PubMedQA |
+| T2I | 80 | mdwiratathya/ROCO-radiology |

+#### Graph Mode SFT Training for Mixed Tasks

-- Multi-task Fune-tuning
+> [!NOTE]
+> We achieve higher training throughput by enabling graph mode compute. However, to do that we need to predefine a compute graph of the vlm for each of the three tasks, since each task feeds the vlm a different set of input args.
+>
+> To run `scripts/run_sft_mixed_graph.sh`, simply go into `janus/models/modeling_vlm.py`, and patch `construct_*()` into `construct()` as follows.
+```diff
+# @ L428
+-- def construct(
+++ # def construct( # just comment the whole function out

-Comming soon
+# @ L570
+-- def construct_graph_mixed_task(
+++ def construct(
+```

+#### Pynative Mode SFT Training for Mixed Tasks
+```diff
+# @ L428
+-- def construct(
+++ # def construct( # just comment the whole function out

+# @ L516
+-- def construct_pynative_mixed_task(
+++ def construct(
+```

 ## Performance

@@ -64,3 +112,10 @@ Experiments are tested on Ascend Atlas 800T A2 machines with mindspore 2.5.0 pyn
 | Janus-Pro-7B | T2I | 1 | 384x384 | 1024 | 1 | 0.49 |
 | Janus-Pro-7B | VQA | 1 | 384x384 | 1024 | 1 | 0.66 |
 | Janus-Pro-7B | Text | 1 | n.a. | 512 | 1 | 0.53 |
+
+For mixed-SFT:
+
+| model | task | ms_mode | # card(s) | image size | max_length | batch size | step time (s/step) |
+|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
+| Janus-Pro-1B | mixed | pynative | 1 | 384x384 | 1024 | 6 | 3.05 |
+| Janus-Pro-1B | mixed | graph | 1 | 384x384 | 1024 | 6 | 2.36 |
````
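The manual patching described above amounts to choosing which `construct_*` variant owns the name `construct` before the cell is compiled, since graph mode traces a single `construct()` and cannot branch on a tensor-valued `task_type` at run time. A toy sketch of the same selection done by attribute rebinding (class and method names here are illustrative, not the repo's API):

```python
# Why the patch works: MindSpore graph mode compiles `construct()` into one
# static graph, so a tensor-valued `task_type` cannot drive Python if/elif
# branches at run time. The repo therefore ships one `construct_*` variant per
# task and asks you to rename the one you need to `construct`. Rebinding the
# class attribute has the same effect as the manual edit.

class VLMSketch:  # stands in for the real vlm cell
    def construct(self, task_type, **inputs):
        raise NotImplementedError("patch a task-specific variant over me")

    def construct_graph_single_task(self, task_type, **inputs):
        return "text-only loss"

    def construct_graph_mixed_task(self, task_type, **inputs):
        return "vqa+text+t2i loss"

# Equivalent of the `# @ L428` / `# @ L570` patch in training.md:
VLMSketch.construct = VLMSketch.construct_graph_mixed_task

model = VLMSketch()
print(model.construct(task_type=None))  # -> vqa+text+t2i loss
```

Only one variant can own the name `construct` at a time, which is why the original `construct()` must be commented out (or shadowed, as here) before a different variant is promoted.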

examples/janus/janus/__init__.py

Whitespace-only changes.

examples/janus/janus/models/modeling_vlm.py

Lines changed: 146 additions & 13 deletions
```diff
@@ -289,7 +289,7 @@ def gen_with_loss(
             attention_mask: shape (bs seq_len), where 1 for valid input seq, 0 for padded seq
             image_seq_mask: 1 - image tokens (exclude BOI and EOI)
             pixel_values: images resized to (384, 384), shape (bs n_images 3 h w)
-            image_tokens: image tokens encoded and quantized by VQ16, shape (bs n_images per_img_seq_len)
+            image_tokens: deprecated, image tokens encoded and quantized by VQ16, shape (bs n_images per_img_seq_len)

         Note: pre-compute VQ encoded tokens for efficiency
         """
@@ -321,12 +321,14 @@
         # these reshape ops is to solve the wierd error in InferShape in MS
         inputs_embeds = inputs_embeds.reshape(-1, D)  # (B, S, D) -> (B * S, D)
         image_seq_mask = image_seq_mask.reshape(-1)  # (B, S) -> (B * S)
-        image_embeds = image_embeds.reshape(-1, D)  # (B, S, D) -> (B * S, D)
+        image_embeds = image_embeds.reshape(-1, D)  # (B, T, D) -> (B * T, D)

-        # another way: inputs_embeds = inputs_embeds * (1 - image_seq_mask) + ops.stop_gradient(image_embeds) * image_seq_mask.to(ms.int)
-        # FIXME: this inplace op doens't support in graph mode
-        # FIXME: check whether need to bprop the graident from image_embedding to LlamModel.embed_tokens (nn.Embedding)
-        inputs_embeds[image_seq_mask] = ops.stop_gradient(image_embeds)
+        # FIXME ms2.5.0 graph mode does not support _tensor_setitem_by_bool_tensor_with_tensor().
+        # Workaround: _tensor_setitem_by_int_tensor_with_tensor()
+        _image_seq_mask = image_seq_mask.nonzero().squeeze()
+        # the tensor.squeeze() above does not work under pynative, reason unknown...
+        # _image_seq_mask = image_seq_mask.nonzero().reshape(-1)  # workaround for both pynative & graph: force flatten
+        inputs_embeds[_image_seq_mask] = image_embeds

         inputs_embeds = inputs_embeds.reshape(B, S, D)
         image_seq_mask = image_seq_mask.reshape(B, S)
@@ -342,7 +344,7 @@
         # 4. gen head projection
         # since Janus use decouple heads for image and text, only image seq is meaningful input to gen head. mask before linear should save compute cost.
         # TODO: tbc influence on gradient ?
-        image_hidden_states = hidden_states[image_seq_mask].reshape(B, -1, D)
+        image_hidden_states = hidden_states[image_seq_mask].reshape(B, T, D)
         logits = self.gen_head(image_hidden_states)

         # 5. loss compute
@@ -404,13 +406,13 @@
         # these reshape ops is to solve the wierd error in InferShape in MS
         inputs_embeds = inputs_embeds.reshape(-1, D)  # (B, S, D) -> (B * S, D)
         image_seq_mask = image_seq_mask.reshape(-1)  # (B, S) -> (B * S)
-        image_embeds = image_embeds.reshape(-1, D)  # (B, S, D) -> (B * S, D)
+        image_embeds = image_embeds.reshape(-1, D)  # (B, T, D) -> (B * T, D)

-        # FIXME: fix as gen_with_loss to support graph mode
-        inputs_embeds[image_seq_mask] = image_embeds  # ops.stop_gradient(image_embeds)
+        # FIXME same workaround as above, for the ms2.5.0 graph mode constraint
+        image_seq_mask = image_seq_mask.nonzero().squeeze()
+        inputs_embeds[image_seq_mask] = image_embeds

         inputs_embeds = inputs_embeds.reshape(B, S, D)
-        image_seq_mask = image_seq_mask.reshape(B, S)

         # 3. LlamaForCausalLM forward with loss
         output = self.language_model(
@@ -420,7 +422,6 @@
             return_dict=False,
         )
         loss = output[0]
-        # logit = output[1]

         return loss

```
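The boolean-to-integer indexing workaround in the hunks above can be demonstrated outside MindSpore; a minimal NumPy sketch with made-up shapes (note that NumPy's `nonzero()` returns a tuple of index arrays, whereas MindSpore's returns an (N, 1) tensor for a 1-D mask, hence the `.squeeze()` in the diff):

```python
import numpy as np

# Instead of boolean-mask assignment (rejected by MindSpore 2.5.0 graph mode),
# convert the mask to integer row indices with nonzero() and assign through
# integer indexing, which lowers to the supported setitem kernel.
B, S, D = 2, 4, 3
inputs_embeds = np.zeros((B * S, D))                             # flattened (B, S, D) -> (B*S, D)
image_seq_mask = np.array([0, 1, 1, 0, 1, 0, 0, 0], dtype=bool)  # (B*S,), 1 marks image tokens
image_embeds = np.ones((int(image_seq_mask.sum()), D))           # (T_total, D)

# boolean route (what graph mode rejects): inputs_embeds[image_seq_mask] = image_embeds
# integer route (the workaround):
idx = image_seq_mask.nonzero()[0]                                # -> array([1, 2, 4])
inputs_embeds[idx] = image_embeds

# both routes scatter the image embeddings into exactly the masked rows
assert (inputs_embeds[image_seq_mask] == 1).all()
assert (inputs_embeds[~image_seq_mask] == 0).all()
```

The two routes write the same rows; only the index dtype handed to the setitem op differs, which is what the graph-mode constraint cares about.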
```diff
@@ -435,7 +436,7 @@
         image_tokens: Optional[Tensor] = None,
     ):
         r"""
-        Added for training, and only used in training!
+        Implemented for single-task pynative training. Supports branch control for a SINGLE task in task_type.
         Args:
             input_ids: input sequence of tokens, shape (bs seq_len). see transformers docstring for details
             task_type: shape (bs,), 0 - pure text, 1 - vqa, 2 - t2i
@@ -472,6 +473,138 @@

         return loss

+    def construct_graph_single_task(
+        self,
+        task_type: Tensor = None,
+        input_ids: Tensor = None,
+        labels: Optional[Tensor] = None,
+        attention_mask: Optional[Tensor] = None,
+        image_seq_mask: Optional[Tensor] = None,
+        pixel_values: Optional[Tensor] = None,
+        image_tokens: Optional[Tensor] = None,
+    ):
+        """
+        Implemented for single-task graph mode sft.
+        As a task_type tensor cannot be used for branch control, this method implements a per-task forward.
+        """
+
+        # text
+        loss = self.language_model(
+            input_ids=input_ids,
+            attention_mask=attention_mask,
+            labels=labels,
+        )[0]
+        # # vqa
+        # loss = self.und_with_loss(
+        #     input_ids=input_ids,
+        #     attention_mask=attention_mask,
+        #     labels=labels,
+        #     image_seq_mask=image_seq_mask,
+        #     pixel_values=pixel_values,
+        # )
+        # # t2i
+        # loss = self.gen_with_loss(
+        #     input_ids=input_ids,
+        #     attention_mask=attention_mask,
+        #     image_seq_mask=image_seq_mask,
+        #     pixel_values=pixel_values,
+        #     image_tokens=image_tokens,
+        #     # labels,
+        # )
+        return loss
+
+    def construct_pynative_mixed_task(
+        self,
+        task_type: Tensor = None,
+        input_ids: Tensor = None,
+        labels: Optional[Tensor] = None,
+        attention_mask: Optional[Tensor] = None,
+        image_seq_mask: Optional[Tensor] = None,
+        pixel_values: Optional[Tensor] = None,
+        image_tokens: Optional[Tensor] = None,
+    ):
+        """Implemented for mixed-task pynative mode sft. Supports branch control for mixed tasks. Go with this if you need MULTIPLE-task sft."""
+
+        losses = []
+        for ti, task in enumerate(task_type):
+            _input_ids = input_ids[ti][None, ...]
+            _labels = labels[ti][None, ...]
+            _attention_mask = attention_mask[ti][None, ...]
+            _image_seq_mask = image_seq_mask[ti][None, ...]
+            _pixel_values = pixel_values[ti][None, ...]
+            if task == 0:
+                # mm understand
+                loss = self.und_with_loss(
+                    input_ids=_input_ids,
+                    attention_mask=_attention_mask,
+                    labels=_labels,
+                    image_seq_mask=_image_seq_mask,
+                    pixel_values=_pixel_values,
+                )
+            elif task == 1:
+                # text
+                loss = self.language_model(
+                    input_ids=_input_ids,
+                    attention_mask=_attention_mask,
+                    labels=_labels,
+                )[0]
+            elif task == 2:
+                # t2i
+                loss = self.gen_with_loss(
+                    input_ids=_input_ids,
+                    attention_mask=_attention_mask,
+                    image_seq_mask=_image_seq_mask,
+                    pixel_values=_pixel_values,
+                    # image_tokens=image_tokens,
+                    # labels,
+                )
+            else:
+                raise ValueError(f"task type should be one of [0, 1, 2], but got {task_type}")
+
+            losses.append(loss)
+
+        loss = mint.mean(mint.stack(losses))
+
+        return loss
+
+    def construct_graph_mixed_task(
+        self,
+        task_type: Tensor = None,
+        input_ids: Tensor = None,
+        labels: Optional[Tensor] = None,
+        attention_mask: Optional[Tensor] = None,
+        image_seq_mask: Optional[Tensor] = None,
+        pixel_values: Optional[Tensor] = None,
+    ):
+        """Implemented for mixed-task graph mode sft. Supports branch control for mixed tasks under graph mode."""
+
+        is_vqa_index = (task_type == 0).nonzero().squeeze(-1)
+        loss_vqa = self.und_with_loss(
+            input_ids=input_ids[is_vqa_index],
+            attention_mask=attention_mask[is_vqa_index],
+            labels=labels[is_vqa_index],
+            image_seq_mask=image_seq_mask[is_vqa_index],
+            pixel_values=pixel_values[is_vqa_index],
+        )
+
+        is_text_index = (task_type == 1).nonzero().squeeze(-1)
+        loss_text = self.language_model(
+            input_ids=input_ids[is_text_index],
+            attention_mask=attention_mask[is_text_index],
+            labels=labels[is_text_index],
+        )[0]
+
+        is_t2i_index = (task_type == 2).nonzero().squeeze(-1)
+        loss_t2i = self.gen_with_loss(
+            input_ids=input_ids[is_t2i_index],
+            attention_mask=attention_mask[is_t2i_index],
+            image_seq_mask=image_seq_mask[is_t2i_index],
+            pixel_values=pixel_values[is_t2i_index],
+        )
+
+        loss = (loss_vqa + loss_text + loss_t2i) / 3
+        return loss
+

 AutoConfig.register("vision", VisionConfig)
 AutoConfig.register("aligner", AlignerConfig)
```
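One consequence of the two mixed-task variants added above: the pynative loop averages the loss over samples, while the graph variant averages the three per-task mean losses, so in an unbalanced batch the two weight tasks differently. A small NumPy check with made-up per-sample loss values:

```python
import numpy as np

# One mixed batch: 4 VQA samples, 1 text, 1 T2I (loss values are made up).
task = np.array([0, 0, 0, 0, 1, 2])
loss = np.array([1.0, 1.0, 1.0, 1.0, 4.0, 7.0])

# pynative variant: mean over samples, like mint.mean(mint.stack(losses))
per_sample = loss.mean()
# graph variant: mean of the three per-task means, like (loss_vqa + loss_text + loss_t2i) / 3
per_task = np.mean([loss[task == t].mean() for t in (0, 1, 2)])

print(per_sample)  # 2.5
print(per_task)    # 4.0 -> rarer tasks carry proportionally more weight
```

The two coincide only when each task contributes the same number of samples, which a weighted random sampler (as in the mixed-task dataset options mentioned in the commit list) can approximate but not guarantee per batch.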

0 commit comments
