TypeError: Object of type Tensor is not JSON serializable for finetune_hf by Lora #937
Closed · BourneKing started this conversation in General · Replies: 2 comments
-
Any guidance from someone experienced would be greatly appreciated, thank you very much!!!
-
Still waiting for an answer!!
-
LoRA fine-tuning fails on a single NVIDIA GeForce RTX 3090
/ChatGLM3-main/finetune_demo$ nvidia-smi
Thu Mar 7 16:10:39 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07 Driver Version: 535.161.07 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3090 Off | 00000000:01:00.0 On | N/A |
| 0% 49C P8 39W / 350W | 626MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce RTX 3090 Off | 00000000:04:00.0 Off | N/A |
| 0% 47C P8 23W / 350W | 10MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
Sample lines from train.json after conversion:
{"conversations": [{"role": "user", "content": "类型#裙版型#宽松版型#显瘦颜色#黑色图案#撞色裙型#直筒裙裙款式#拼接"}, {"role": "assistant", "content": "采用简洁大体的黑色格调,宽松舒适的裙子内里,配上落肩的袖子拼接,不惧夏日的炎热,穿出清凉舒适。用时尚的英文字母,加上撞色的红白搭配,呈现大气时尚的款式。直筒的裙子轮廓,前短后长的长度裁剪,上身拉长宝宝的身体比例,挑高显瘦。"}]}
{"conversations": [{"role": "user", "content": "类型#上衣颜色#黑色颜色#紫色风格#性感图案#字母图案#文字图案#线条图案#刺绣衣样式#卫衣衣长#短款衣袖型#落肩袖衣款式#连帽"}, {"role": "assistant", "content": "卫衣的短款长度设计能够适当地露出腰线,打造出纤瘦的身材十分性感。衣身的字母刺绣图案有着小巧的样式,黑色的绣线在紫色的衣身上显得很出挑显眼。落肩袖的设计柔化了肩部的线条衬托得人很温柔可爱。紫色的颜色彰显出优雅的气质也不失年轻活力感。连帽的设计让卫衣更加丰满造型感很足,长长的帽绳直到腰际处,有着延长衣身的效果显得身材。"}]}
{"conversations": [{"role": "user", "content": "类型#上衣颜色#黑白风格#简约风格#休闲图案#条纹衣样式#风衣*衣样式#外套"}, {"role": "assistant", "content": "设计师以条纹作为风衣外套的主要设计元素,以简约点缀了外套,带来大气休闲的视觉效果。因为采用的是黑白的经典色,所以有着颇为出色的耐看性与百搭性,可以帮助我们更好的驾驭日常的穿着,而且不容易让人觉得它过时。"}]}
LoRA fine-tuning command, run on Ubuntu 20.04:
CUDA_VISIBLE_DEVICES=1 /home/yons/miniconda3/envs/apple/bin/python finetune_hf.py data/AdvertiseGen_fix /home/yons/llms/models/chatglm3-6b configs/lora.yaml no
It then fails with:
TypeError: Object of type Tensor is not JSON serializable
I've been stuck on this for two days; any pointers would be much appreciated, thank you!!!
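My best guess at a minimal reproduction of the error (my own sketch, not code from the repo): json.dumps() has no encoder for torch.Tensor, and the 'grad_norm' values in the training logs below are CUDA tensors rather than plain floats:

```python
import json
import torch

# Mimics one entry of the trainer's log history shown further down,
# where grad_norm is a tensor instead of a Python float.
log_entry = {"loss": 4.44, "grad_norm": torch.tensor(4.1844), "learning_rate": 5e-5}

try:
    json.dumps(log_entry)
except TypeError as err:
    print(err)  # -> Object of type Tensor is not JSON serializable

# Converting the tensor to a plain float makes the entry serializable again.
log_entry["grad_norm"] = log_entry["grad_norm"].item()
print(json.dumps(log_entry))
```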
The full run and error output are below:
~/llms/ChatGLM3-main/finetune_demo$ CUDA_VISIBLE_DEVICES=1 /home/yons/miniconda3/envs/apple/bin/python finetune_hf.py data/AdvertiseGen_fix /home/yons/llms/models/chatglm3-6b configs/lora.yaml no
Setting eos_token is not supported, use the default one.
Setting pad_token is not supported, use the default one.
Setting unk_token is not supported, use the default one.
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████| 7/7 [00:03<00:00, 1.91it/s]
trainable params: 1,949,696 || all params: 6,245,533,696 || trainable%: 0.031217444255383614
--> Model
--> model has 1.949696M params
train_dataset: Dataset({
features: ['input_ids', 'labels'],
num_rows: 114599
})
val_dataset: Dataset({
features: ['input_ids', 'output_ids'],
num_rows: 1070
})
test_dataset: Dataset({
features: ['input_ids', 'output_ids'],
num_rows: 1070
})
max_steps is given, it will override any value given in num_train_epochs
[2024-03-07 15:58:43,521] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-03-07 15:58:43,642] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed info: version=0.13.4, git-hash=unknown, git-branch=unknown
[2024-03-07 15:58:43,642] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-03-07 15:58:43,642] [INFO] [comm.py:652:init_distributed] Not using the DeepSpeed or dist launchers, attempting to detect MPI environment...
[2024-03-07 15:58:44,171] [INFO] [comm.py:702:mpi_discovery] Discovered MPI settings of world_rank=0, local_rank=0, world_size=1, master_addr=192.168.2.136, master_port=29500
[2024-03-07 15:58:44,171] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-03-07 15:58:59,824] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2024-03-07 15:58:59,824] [INFO] [logging.py:96:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[2024-03-07 15:58:59,825] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
[2024-03-07 15:58:59,826] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = AdamW
[2024-03-07 15:58:59,826] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=AdamW type=<class 'torch.optim.adamw.AdamW'>
[2024-03-07 15:58:59,826] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.float32 ZeRO stage 2 optimizer
[2024-03-07 15:58:59,826] [INFO] [stage_1_and_2.py:149:__init__] Reduce bucket size 500000000
[2024-03-07 15:58:59,826] [INFO] [stage_1_and_2.py:150:__init__] Allgather bucket size 500000000
[2024-03-07 15:58:59,826] [INFO] [stage_1_and_2.py:151:__init__] CPU Offload: False
[2024-03-07 15:58:59,826] [INFO] [stage_1_and_2.py:152:__init__] Round robin gradient partitioning: False
[2024-03-07 15:58:59,919] [INFO] [utils.py:800:see_memory_usage] Before initializing optimizer states
[2024-03-07 15:58:59,919] [INFO] [utils.py:801:see_memory_usage] MA 11.67 GB Max_MA 11.67 GB CA 11.68 GB Max_CA 12 GB
[2024-03-07 15:58:59,919] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory: used = 7.78 GB, percent = 6.2%
[2024-03-07 15:58:59,980] [INFO] [utils.py:800:see_memory_usage] After initializing optimizer states
[2024-03-07 15:58:59,981] [INFO] [utils.py:801:see_memory_usage] MA 11.67 GB Max_MA 11.68 GB CA 11.7 GB Max_CA 12 GB
[2024-03-07 15:58:59,981] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory: used = 7.78 GB, percent = 6.2%
[2024-03-07 15:58:59,981] [INFO] [stage_1_and_2.py:539:__init__] optimizer state initialized
[2024-03-07 15:59:00,035] [INFO] [utils.py:800:see_memory_usage] After initializing ZeRO optimizer
[2024-03-07 15:59:00,035] [INFO] [utils.py:801:see_memory_usage] MA 11.67 GB Max_MA 11.67 GB CA 11.7 GB Max_CA 12 GB
[2024-03-07 15:59:00,035] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory: used = 7.78 GB, percent = 6.2%
[2024-03-07 15:59:00,036] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = AdamW
[2024-03-07 15:59:00,036] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler
[2024-03-07 15:59:00,036] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = None
[2024-03-07 15:59:00,036] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[5e-05], mom=[(0.9, 0.999)]
[2024-03-07 15:59:00,036] [INFO] [config.py:996:print] DeepSpeedEngine configuration:
[2024-03-07 15:59:00,036] [INFO] [config.py:1000:print] activation_checkpointing_config {
"partition_activations": false,
"contiguous_memory_optimization": false,
"cpu_checkpointing": false,
"number_checkpoints": null,
"synchronize_checkpoint_boundary": false,
"profile": false
}
[2024-03-07 15:59:00,036] [INFO] [config.py:1000:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2024-03-07 15:59:00,036] [INFO] [config.py:1000:print] amp_enabled .................. False
[2024-03-07 15:59:00,036] [INFO] [config.py:1000:print] amp_params ................... False
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] autotuning_config ............ {
"enabled": false,
"start_step": null,
"end_step": null,
"metric_path": null,
"arg_mappings": null,
"metric": "throughput",
"model_info": null,
"results_dir": "autotuning_results",
"exps_dir": "autotuning_exps",
"overwrite": true,
"fast": true,
"start_profile_step": 3,
"end_profile_step": 5,
"tuner_type": "gridsearch",
"tuner_early_stopping": 5,
"tuner_num_trials": 50,
"model_info_path": null,
"mp_size": 1,
"max_train_batch_size": null,
"min_train_batch_size": 1,
"max_train_micro_batch_size_per_gpu": 1.024000e+03,
"min_train_micro_batch_size_per_gpu": 1,
"num_tuning_micro_batch_sizes": 3
}
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] bfloat16_enabled ............. False
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] bfloat16_immediate_grad_update False
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] checkpoint_parallel_write_pipeline False
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] checkpoint_tag_validation_enabled True
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] checkpoint_tag_validation_fail False
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f4ac821c7c0>
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] communication_data_type ...... None
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] compile_config ............... enabled=False backend='inductor' kwargs={}
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] curriculum_enabled_legacy .... False
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] curriculum_params_legacy ..... False
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] data_efficiency_enabled ...... False
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] dataloader_drop_last ......... False
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] disable_allgather ............ False
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] dump_state ................... False
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] dynamic_loss_scale_args ...... None
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] eigenvalue_enabled ........... False
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] eigenvalue_gas_boundary_resolution 1
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] eigenvalue_layer_name ........ bert.encoder.layer
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] eigenvalue_layer_num ......... 0
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] eigenvalue_max_iter .......... 100
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] eigenvalue_stability ......... 1e-06
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] eigenvalue_tol ............... 0.01
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] eigenvalue_verbose ........... False
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] elasticity_enabled ........... False
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] flops_profiler_config ........ {
"enabled": false,
"recompute_fwd_factor": 0.0,
"profile_step": 1,
"module_depth": -1,
"top_modules": 1,
"detailed": true,
"output_file": null
}
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] fp16_auto_cast ............... None
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] fp16_enabled ................. False
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] fp16_master_weights_and_gradients False
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] global_rank .................. 0
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] grad_accum_dtype ............. None
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] gradient_accumulation_steps .. 1
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] gradient_clipping ............ 1.0
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] gradient_predivide_factor .... 1.0
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] graph_harvesting ............. False
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] initial_dynamic_scale ........ 65536
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] load_universal_checkpoint .... False
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] loss_scale ................... 0
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] memory_breakdown ............. False
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] mics_hierarchial_params_gather False
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] mics_shard_size .............. -1
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] nebula_config ................ {
"enabled": false,
"persistent_storage_path": null,
"persistent_time_interval": 100,
"num_of_version_in_retention": 2,
"enable_nebula_load": true,
"load_path": null
}
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] optimizer_legacy_fusion ...... False
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] optimizer_name ............... None
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] optimizer_params ............. None
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] pld_enabled .................. False
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] pld_params ................... False
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] prescale_gradients ........... False
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] scheduler_name ............... None
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] scheduler_params ............. None
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] seq_parallel_communication_data_type torch.float32
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] sparse_attention ............. None
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] sparse_gradients_enabled ..... False
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] steps_per_print .............. inf
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] train_batch_size ............. 1
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] train_micro_batch_size_per_gpu 1
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] use_data_before_expert_parallel_ False
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] use_node_local_storage ....... False
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] wall_clock_breakdown ......... False
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] weight_quantization_config ... None
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] world_size ................... 1
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] zero_allow_untested_optimizer True
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] zero_config .................. stage=2 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500000000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500000000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] zero_enabled ................. True
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] zero_force_ds_cpu_optimizer .. True
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] zero_optimization_stage ...... 2
[2024-03-07 15:59:00,037] [INFO] [config.py:986:print_user_config] json = {
"fp16": {
"enabled": false,
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"bf16": {
"enabled": false
},
"zero_optimization": {
"stage": 2,
"allgather_partitions": true,
"allgather_bucket_size": 5.000000e+08,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": 5.000000e+08,
"contiguous_gradients": true
},
"gradient_accumulation_steps": 1,
"gradient_clipping": 1.0,
"steps_per_print": inf,
"train_batch_size": 1,
"train_micro_batch_size_per_gpu": 1,
"wall_clock_breakdown": false,
"zero_allow_untested_optimizer": true
}
***** Running training *****
Num examples = 114,599
Num Epochs = 1
Instantaneous batch size per device = 1
Total train batch size (w. parallel, distributed & accumulation) = 1
Gradient Accumulation steps = 1
Total optimization steps = 3,000
Number of trainable parameters = 1,949,696
{'loss': 4.4402, 'grad_norm': tensor(4.1844, device='cuda:0'), 'learning_rate': 4.9833333333333336e-05, 'epoch': 0.0}
{'loss': 4.9043, 'grad_norm': tensor(3.6091, device='cuda:0'), 'learning_rate': 4.966666666666667e-05, 'epoch': 0.0}
{'loss': 4.6645, 'grad_norm': tensor(4.6015, device='cuda:0'), 'learning_rate': 4.9500000000000004e-05, 'epoch': 0.0}
{'loss': 4.7369, 'grad_norm': tensor(5.0354, device='cuda:0'), 'learning_rate': 4.933333333333334e-05, 'epoch': 0.0}
{'loss': 4.267, 'grad_norm': tensor(4.8521, device='cuda:0'), 'learning_rate': 4.9166666666666665e-05, 'epoch': 0.0}
{'loss': 4.1836, 'grad_norm': tensor(6.1400, device='cuda:0'), 'learning_rate': 4.9e-05, 'epoch': 0.0}
{'loss': 3.7965, 'grad_norm': tensor(6.3060, device='cuda:0'), 'learning_rate': 4.883333333333334e-05, 'epoch': 0.0}
{'loss': 3.7979, 'grad_norm': tensor(6.0029, device='cuda:0'), 'learning_rate': 4.866666666666667e-05, 'epoch': 0.0}
{'loss': 3.51, 'grad_norm': tensor(5.0960, device='cuda:0'), 'learning_rate': 4.85e-05, 'epoch': 0.0}
{'loss': 3.9912, 'grad_norm': tensor(5.5196, device='cuda:0'), 'learning_rate': 4.8333333333333334e-05, 'epoch': 0.0}
{'loss': 3.7437, 'grad_norm': tensor(5.1386, device='cuda:0'), 'learning_rate': 4.8166666666666674e-05, 'epoch': 0.0}
{'loss': 4.0131, 'grad_norm': tensor(5.2996, device='cuda:0'), 'learning_rate': 4.8e-05, 'epoch': 0.0}
{'loss': 3.8314, 'grad_norm': tensor(5.4587, device='cuda:0'), 'learning_rate': 4.7833333333333335e-05, 'epoch': 0.0}
{'loss': 3.6627, 'grad_norm': tensor(7.8379, device='cuda:0'), 'learning_rate': 4.766666666666667e-05, 'epoch': 0.0}
{'loss': 3.4525, 'grad_norm': tensor(6.6903, device='cuda:0'), 'learning_rate': 4.75e-05, 'epoch': 0.0}
{'loss': 3.6549, 'grad_norm': tensor(6.1589, device='cuda:0'), 'learning_rate': 4.7333333333333336e-05, 'epoch': 0.0}
{'loss': 3.6318, 'grad_norm': tensor(7.3151, device='cuda:0'), 'learning_rate': 4.716666666666667e-05, 'epoch': 0.0}
{'loss': 3.9439, 'grad_norm': tensor(6.4505, device='cuda:0'), 'learning_rate': 4.7e-05, 'epoch': 0.0}
{'loss': 3.7131, 'grad_norm': tensor(6.2490, device='cuda:0'), 'learning_rate': 4.683333333333334e-05, 'epoch': 0.0}
{'loss': 3.6848, 'grad_norm': tensor(6.6936, device='cuda:0'), 'learning_rate': 4.666666666666667e-05, 'epoch': 0.0}
{'loss': 3.3516, 'grad_norm': tensor(6.8842, device='cuda:0'), 'learning_rate': 4.6500000000000005e-05, 'epoch': 0.0}
{'loss': 3.7281, 'grad_norm': tensor(7.0181, device='cuda:0'), 'learning_rate': 4.633333333333333e-05, 'epoch': 0.0}
{'loss': 3.5209, 'grad_norm': tensor(6.9573, device='cuda:0'), 'learning_rate': 4.6166666666666666e-05, 'epoch': 0.0}
{'loss': 3.7479, 'grad_norm': tensor(7.1273, device='cuda:0'), 'learning_rate': 4.600000000000001e-05, 'epoch': 0.0}
{'loss': 3.5268, 'grad_norm': tensor(6.9141, device='cuda:0'), 'learning_rate': 4.5833333333333334e-05, 'epoch': 0.0}
{'loss': 3.5688, 'grad_norm': tensor(9.0669, device='cuda:0'), 'learning_rate': 4.566666666666667e-05, 'epoch': 0.0}
{'loss': 3.5719, 'grad_norm': tensor(8.0154, device='cuda:0'), 'learning_rate': 4.55e-05, 'epoch': 0.0}
{'loss': 3.5658, 'grad_norm': tensor(8.4898, device='cuda:0'), 'learning_rate': 4.5333333333333335e-05, 'epoch': 0.0}
{'loss': 3.54, 'grad_norm': tensor(7.5420, device='cuda:0'), 'learning_rate': 4.516666666666667e-05, 'epoch': 0.0}
{'loss': 3.6279, 'grad_norm': tensor(7.9869, device='cuda:0'), 'learning_rate': 4.5e-05, 'epoch': 0.0}
{'loss': 3.6281, 'grad_norm': tensor(7.8166, device='cuda:0'), 'learning_rate': 4.483333333333333e-05, 'epoch': 0.0}
{'loss': 3.4217, 'grad_norm': tensor(7.2795, device='cuda:0'), 'learning_rate': 4.466666666666667e-05, 'epoch': 0.0}
{'loss': 3.3732, 'grad_norm': tensor(8.6140, device='cuda:0'), 'learning_rate': 4.4500000000000004e-05, 'epoch': 0.0}
{'loss': 3.5418, 'grad_norm': tensor(10.2145, device='cuda:0'), 'learning_rate': 4.433333333333334e-05, 'epoch': 0.0}
{'loss': 3.4326, 'grad_norm': tensor(7.4907, device='cuda:0'), 'learning_rate': 4.4166666666666665e-05, 'epoch': 0.0}
{'loss': 3.4521, 'grad_norm': tensor(7.3476, device='cuda:0'), 'learning_rate': 4.4000000000000006e-05, 'epoch': 0.0}
{'loss': 3.5385, 'grad_norm': tensor(7.8114, device='cuda:0'), 'learning_rate': 4.383333333333334e-05, 'epoch': 0.0}
{'loss': 3.5908, 'grad_norm': tensor(9.1731, device='cuda:0'), 'learning_rate': 4.3666666666666666e-05, 'epoch': 0.0}
{'loss': 3.5932, 'grad_norm': tensor(7.8424, device='cuda:0'), 'learning_rate': 4.35e-05, 'epoch': 0.0}
{'loss': 3.3729, 'grad_norm': tensor(7.9566, device='cuda:0'), 'learning_rate': 4.3333333333333334e-05, 'epoch': 0.0}
{'loss': 3.2678, 'grad_norm': tensor(8.0636, device='cuda:0'), 'learning_rate': 4.316666666666667e-05, 'epoch': 0.0}
{'loss': 3.6715, 'grad_norm': tensor(11.0745, device='cuda:0'), 'learning_rate': 4.3e-05, 'epoch': 0.0}
{'loss': 3.5162, 'grad_norm': tensor(9.7909, device='cuda:0'), 'learning_rate': 4.2833333333333335e-05, 'epoch': 0.0}
{'loss': 3.2836, 'grad_norm': tensor(8.5034, device='cuda:0'), 'learning_rate': 4.266666666666667e-05, 'epoch': 0.0}
{'loss': 3.7068, 'grad_norm': tensor(9.4987, device='cuda:0'), 'learning_rate': 4.25e-05, 'epoch': 0.0}
{'loss': 3.7994, 'grad_norm': tensor(9.6808, device='cuda:0'), 'learning_rate': 4.233333333333334e-05, 'epoch': 0.0}
{'loss': 3.8311, 'grad_norm': tensor(13.3884, device='cuda:0'), 'learning_rate': 4.216666666666667e-05, 'epoch': 0.0}
{'loss': 3.2943, 'grad_norm': tensor(10.4600, device='cuda:0'), 'learning_rate': 4.2e-05, 'epoch': 0.0}
{'loss': 3.3723, 'grad_norm': tensor(7.9136, device='cuda:0'), 'learning_rate': 4.183333333333334e-05, 'epoch': 0.0}
{'loss': 3.4182, 'grad_norm': tensor(11.4667, device='cuda:0'), 'learning_rate': 4.166666666666667e-05, 'epoch': 0.0}
17%|█████████████▋ | 500/3000 [00:45<03:55, 10.62it/s]
***** Running Evaluation *****
Num examples = 50
Batch size = 16
17%|█████████████▋ | 500/3000 [00:58<03:55, 10.62it/s]
Building prefix dict from the default dictionary ...
█████████████████████████████████████████| 4/4 [00:11<00:00, 2.89s/it]
Dumping model to file cache /tmp/jieba.cache
Loading model cost 0.260 seconds.
Prefix dict has been built successfully.
{'eval_rouge-1': 30.055642000000002, 'eval_rouge-2': 6.419442, 'eval_rouge-l': 24.02483, 'eval_bleu-4': 0.030992641634256496, 'eval_runtime': 19.3324, 'eval_samples_per_second': 2.586, 'eval_steps_per_second': 0.207, 'epoch': 0.0}
17%|█████████████▋ | 500/3000 [01:05<03:55, 10.62it/s]
Saving model checkpoint to ./output/tmp-checkpoint-500
tokenizer config file saved in ./output/tmp-checkpoint-500/tokenizer_config.json
Special tokens file saved in ./output/tmp-checkpoint-500/special_tokens_map.json
[2024-03-07 16:00:20,630] [INFO] [logging.py:96:log_dist] [Rank 0] [Torch] Checkpoint global_step500 is about to be saved!
/home/yons/miniconda3/envs/apple/lib/python3.10/site-packages/torch/nn/modules/module.py:1877: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
warnings.warn(
[2024-03-07 16:00:20,634] [INFO] [logging.py:96:log_dist] [Rank 0] Saving model checkpoint: ./output/tmp-checkpoint-500/global_step500/mp_rank_00_model_states.pt
[2024-03-07 16:00:20,634] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving ./output/tmp-checkpoint-500/global_step500/mp_rank_00_model_states.pt...
[2024-03-07 16:00:20,659] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved ./output/tmp-checkpoint-500/global_step500/mp_rank_00_model_states.pt.
[2024-03-07 16:00:20,659] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving ./output/tmp-checkpoint-500/global_step500/zero_pp_rank_0_mp_rank_00_optim_states.pt...
[2024-03-07 16:00:20,704] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved ./output/tmp-checkpoint-500/global_step500/zero_pp_rank_0_mp_rank_00_optim_states.pt.
[2024-03-07 16:00:20,705] [INFO] [engine.py:3488:_save_zero_checkpoint] zero checkpoint saved ./output/tmp-checkpoint-500/global_step500/zero_pp_rank_0_mp_rank_00_optim_states.pt
[2024-03-07 16:00:20,705] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step500 is ready now!
Traceback (most recent call last):
File "/home/yons/llms/ChatGLM3-main/finetune_demo/finetune_hf.py", line 587, in
app()
File "/home/yons/llms/ChatGLM3-main/finetune_demo/finetune_hf.py", line 543, in main
trainer.train()
File "/home/yons/miniconda3/envs/apple/lib/python3.10/site-packages/transformers/trainer.py", line 1624, in train
return inner_training_loop(
File "/home/yons/miniconda3/envs/apple/lib/python3.10/site-packages/transformers/trainer.py", line 2029, in _inner_training_loop
self._maybe_log_save_evaluate(tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval)
File "/home/yons/miniconda3/envs/apple/lib/python3.10/site-packages/transformers/trainer.py", line 2423, in _maybe_log_save_evaluate
self._save_checkpoint(model, trial, metrics=metrics)
File "/home/yons/miniconda3/envs/apple/lib/python3.10/site-packages/transformers/trainer.py", line 2525, in _save_checkpoint
self.state.save_to_json(os.path.join(staging_output_dir, TRAINER_STATE_NAME))
File "/home/yons/miniconda3/envs/apple/lib/python3.10/site-packages/transformers/trainer_callback.py", line 113, in save_to_json
json_string = json.dumps(dataclasses.asdict(self), indent=2, sort_keys=True) + "\n"
File "/home/yons/miniconda3/envs/apple/lib/python3.10/json/init.py", line 238, in dumps
**kw).encode(obj)
File "/home/yons/miniconda3/envs/apple/lib/python3.10/json/encoder.py", line 201, in encode
chunks = list(chunks)
File "/home/yons/miniconda3/envs/apple/lib/python3.10/json/encoder.py", line 431, in _iterencode
yield from _iterencode_dict(o, _current_indent_level)
File "/home/yons/miniconda3/envs/apple/lib/python3.10/json/encoder.py", line 405, in _iterencode_dict
yield from chunks
File "/home/yons/miniconda3/envs/apple/lib/python3.10/json/encoder.py", line 325, in _iterencode_list
yield from chunks
File "/home/yons/miniconda3/envs/apple/lib/python3.10/json/encoder.py", line 405, in _iterencode_dict
yield from chunks
File "/home/yons/miniconda3/envs/apple/lib/python3.10/json/encoder.py", line 438, in _iterencode
o = _default(o)
File "/home/yons/miniconda3/envs/apple/lib/python3.10/json/encoder.py", line 179, in default
raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type Tensor is not JSON serializable
17%|█████████████▋ | 500/3000 [01:20<06:43, 6.19it/s]
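In case it helps with diagnosis: the entries in the trainer's log history (the {'loss': ..., 'grad_norm': tensor(...)} lines above) still contain torch.Tensor values when TrainerState.save_to_json() runs at the checkpoint step, which is exactly where the traceback ends. As a stopgap I am considering a callback like the one below (a hypothetical workaround of my own, not an official fix; SanitizeLogHistoryCallback is my own name) that casts tensor values to plain floats before they reach the JSON encoder:

```python
import torch
from transformers import TrainerCallback


class SanitizeLogHistoryCallback(TrainerCallback):
    """Cast torch.Tensor values (e.g. 'grad_norm') in state.log_history to plain floats."""

    def on_log(self, args, state, control, logs=None, **kwargs):
        for entry in state.log_history:
            for key, value in entry.items():
                if isinstance(value, torch.Tensor):
                    entry[key] = value.item()


# Usage (inside finetune_hf.py, before trainer.train() is called):
# trainer.add_callback(SanitizeLogHistoryCallback())
```

Is something like this reasonable, or is there a proper fix (for example a transformers/DeepSpeed version combination) that avoids the tensor-valued grad_norm in the first place?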