Hey everyone, I need some help. I got my hands on two Titan RTX 24GB cards and two RTX 3090 24GB cards. As far as I know, the biggest differentiating factor between them is that the RTX 3090 runs FP16 tensor ops with FP32 accumulate at half rate, which puts it at roughly half the Titan RTX's throughput in that specific area. In every other metric it wipes the floor with the Titan RTX.
The RTX 3090 is also Ampere-based, so it supports Flash Attention 2 (and therefore sample packing) as well as BFloat16, while on the Titan RTX I had to run xformers with no sample packing.
In my testing, with this yaml configuration:
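(A minimal axolotl-style sketch to keep the post self-contained; the base model, dataset, and hyperparameter values below are placeholders rather than my exact setup. The relevant part is the attention/packing flags at the bottom.)

```yaml
# Illustrative axolotl config; model, dataset, and hyperparameters are placeholders.
base_model: NousResearch/Llama-2-7b-hf
datasets:
  - path: mhenrichsen/alpaca_2k_test   # small test dataset
    type: alpaca
sequence_len: 2048
micro_batch_size: 2
gradient_accumulation_steps: 1
num_epochs: 1
learning_rate: 0.0002
optimizer: adamw_bnb_8bit

# Titan RTX (Turing, sm_75): no Flash Attention 2, so sample packing stays off;
# no BFloat16 either, so fp16 instead of bf16.
sample_packing: false
xformers_attention: true
fp16: true
output_dir: ./outputs
```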
Without sample packing, this runs as 24 steps, with these training times:
Titan RTX: 248 seconds
RTX 3090: 325 seconds
But if I enable sample packing on the RTX 3090, it completes in a single step, resulting in:
RTX 3090 sample packing on: 28 seconds
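(For reference, the only change for that run, again in the assumed axolotl-style config above, is flipping the packing and attention flags, since sample packing rides on Flash Attention 2 and Ampere also unlocks bf16:)

```yaml
# RTX 3090 (Ampere, sm_86) run: packing enabled via Flash Attention 2.
sample_packing: true
flash_attention: true     # replaces xformers_attention
bf16: true                # replaces fp16
```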
I understand this happens because the dataset is tiny enough to be packed into a single step instead of the original 24. But is the Titan RTX inherently faster without this optimization? And is there a way to turn on sample packing with the Titan RTX? I'm trying to decide which pair of cards to keep and which to sell. Thanks!