
Fine-tuning and Inference for the LLaMA 65B Model with LoRA

Published: 2023-06-02

A few days ago, Meta released the LIMA model: built on LLaMA-65B and fine-tuned on just 1,000 carefully curated samples, with no RLHF at all, it reportedly reaches results comparable to GPT-4. That sparked my interest in exploring the LLaMA 65B model.

My previous posts in this series fine-tuned the LLaMA 7B/13B checkpoints; this article uses LoRA to fine-tune the LLaMA 30B/65B models. The code is available on GitHub: llm-action.



Environment Setup

The base environment is as follows:

Operating system: CentOS 7

CPU: Intel, single node with 1 TB of RAM, 64 physical CPUs with 16 cores each

GPU: 8 × A800 80 GB

Python: 3.10 (upgrade OpenSSL to 1.1.1t first, then build and install Python from source)

NVIDIA driver: 515.65.01 (choose the driver that matches your GPU model)

CUDA Toolkit: 11.7

NCCL: nccl_2.14.3-1+cuda11.7

cuDNN: 8.8.1.3_cuda11

The software environment is the same as in the earlier article "Fine-tuning LLaMA (7B) in Twenty Minutes with Alpaca-LoRA, Matching Stanford Alpaca", so it is not repeated here.

Simply activate the existing virtual environment:

source /home/guodong.li/virtual-venv/alpara-lora-venv-py310-cu117/bin/activate

Dataset Preparation

For the dataset, you can directly use the alpaca_data.json, alpaca_data_cleaned_archive.json, or alpaca_data_gpt4.json files provided by the alpaca-lora project. Alternatively, see the GPT-4-LLM project, which additionally provides 52K instruction-following samples generated with GPT-4 from the Alpaca prompts translated into Chinese.
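For reference, each record in these files follows the Alpaca instruction format, which finetune.py renders into the Alpaca prompt template. A minimal sketch for inspecting the data (the example record in the comment is the first entry of alpaca_data.json; the cleaned variant may differ slightly):

import json

# Each record has "instruction", an optional "input", and the expected "output".
with open("/data/nfs/guodong.li/data/alpaca_data_cleaned.json") as f:
    data = json.load(f)

print(len(data))                                      # roughly 52K records
print(json.dumps(data[0], indent=2, ensure_ascii=False))
# {
#   "instruction": "Give three tips for staying healthy.",
#   "input": "",
#   "output": "1. Eat a balanced diet ..."
# }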

Model Format Conversion

First, convert the original LLaMA 30B/65B weights into Hugging Face format. See the earlier article "From 0 to 1: Reproducing Stanford Alpaca (7B)" for the detailed conversion steps.
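For completeness, a sketch of that step using the convert_llama_weights_to_hf.py script shipped with transformers (the input path is illustrative; point --input_dir at the directory that contains the 65B folder and tokenizer.model). In the transformers version used at the time, this writes the model under a llama-65b subdirectory and the tokenizer under a separate tokenizer subdirectory, which is why the listings below look the way they do; treat that layout detail as an assumption if your version differs.

python convert_llama_weights_to_hf.py \
    --input_dir /path/to/llama-model \
    --model_size 65B \
    --output_dir /data/nfs/guodong.li/pretrain/hf-llama-model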

Original LLaMA 65B model weights:

> tree llama-model/65B/
llama-model/65B/
├── checklist.chk
├── consolidated.00.pth
...
├── consolidated.07.pth
└── params.json

0 directories, 10 files

LLaMA 65B model weights after conversion to HF format:

ls -al hf-llama-model/llama-65b/ hf-llama-model/tokenizer/
hf-llama-model/llama-65b/:
total 127511452
drwxrwxr-x 1 nobody nobody          0 Mar 27 20:44 .
drwxrwxr-x 1 nobody nobody          0 Mar 27 20:35 ..
-rw-rw-r-- 1 nobody nobody        426 Mar 27 20:44 config.json
-rw-rw-r-- 1 nobody nobody        124 Mar 27 20:44 generation_config.json
-rw-rw-r-- 1 nobody nobody 1619037191 Mar 27 20:38 pytorch_model-00001-of-00081.bin
...
-rw-rw-r-- 1 nobody nobody 1048593571 Mar 27 20:44 pytorch_model-00081-of-00081.bin
-rw-rw-r-- 1 nobody nobody      63494 Mar 27 20:44 pytorch_model.bin.index.json

hf-llama-model/tokenizer/:
total 500
drwxrwxr-x 1 nobody nobody      0 Mar 30 10:53 .
drwxrwxr-x 1 nobody nobody      0 Mar 27 20:35 ..
-rw-rw-r-- 1 nobody nobody      2 Mar 30 10:53 special_tokens_map.json
-rw-rw-r-- 1 nobody nobody    141 Mar 30 10:53 tokenizer_config.json
-rw-rw-r-- 1 nobody nobody 499723 Mar 30 10:53 tokenizer.model

Then copy the files from the tokenizer directory into the llama-65b directory:

cp hf-llama-model/tokenizer/* hf-llama-model/llama-65b/

The conversion for LLaMA 30B is analogous and not repeated here.

Model Fine-tuning

LLaMA-30B

First, fine-tune LLaMA 30B; the 30B model is roughly 60 GB of weights. On the A800s, a micro_batch_size of 6 makes full use of GPU memory (see the batch-arithmetic sketch below).
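How batch_size and micro_batch_size relate: finetune.py derives the gradient accumulation steps from them (this is my reading of the upstream alpaca-lora script, so treat the exact division as an assumption; the numbers are the ones used in the run below).

world_size = 8            # --nproc_per_node
batch_size = 96           # --batch_size (global effective batch)
micro_batch_size = 6      # --micro_batch_size (per-GPU, per-step batch)

gradient_accumulation_steps = batch_size // micro_batch_size   # 16
gradient_accumulation_steps //= world_size                     # 2 under 8-way DDP

print(gradient_accumulation_steps)                                   # -> 2
print(micro_batch_size * world_size * gradient_accumulation_steps)   # effective batch: 96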

Training run:

torchrun --nproc_per_node=8 --master_port=29005 finetune.py \
> --base_model "/data/nfs/guodong.li/pretrain/hf-llama-model/llama-30b" \
> --data_path "/data/nfs/guodong.li/data/alpaca_data_cleaned.json" \
> --output_dir "/home/guodong.li/output/alpaca-lora-30b-dp" \
> --batch_size 96 \
> --micro_batch_size 6 \
> --num_epochs 2

CUDA SETUP: CUDA runtime path found: /usr/local/cuda-11.7/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.0
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /home/guodong.li/virtual-venv/alpara-lora-venv-py310-cu117/lib/python3.10/site-packages/bitsandbytes-0.37.2-py3.10.egg/bitsandbytes/libbitsandbytes_cuda117.so...
Training Alpaca-LoRA model with params:
base_model: /data/nfs/guodong.li/pretrain/hf-llama-model/llama-30b
data_path: /data/nfs/guodong.li/data/alpaca_data_cleaned.json
output_dir: /home/guodong.li/output/alpaca-lora-30b-dp
batch_size: 96
micro_batch_size: 6
num_epochs: 2
learning_rate: 0.0003
cutoff_len: 256
val_set_size: 2000
lora_r: 8
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules: ["q_proj", "v_proj"]
train_on_inputs: True
group_by_length: False
wandb_project:
wandb_run_name:
wandb_watch:
wandb_log_model:
resume_from_checkpoint: False
prompt template: alpaca
...
Loading checkpoint shards: 100%|██████████| 61/61 [02:11<00:00, 2.16s/it]
Loading checkpoint shards: 100%|██████████| 61/61 [02:12<00:00, 2.17s/it]
Found cached dataset json (/home/guodong.li/.cache/huggingface/datasets/json/default-2dab63d15cf49261/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e)
100%|██████████| 1/1 [00:00<00:00, 187.05it/s]
trainable params: 12779520 || all params: 32541723136 || trainable%: 0.03927118409369777
Loading cached split indices for dataset at /home/guodong.li/.cache/huggingface/datasets/json/default-2dab63d15cf49261/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e/cache-d8c5d7ac95d53860.arrow and /home/guodong.li/.cache/huggingface/datasets/json/default-2dab63d15cf49261/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e/cache-4a34b0c9feb19e72.arrow
...  (the same cached-dataset and Map progress messages repeat for each of the 8 ranks)
{"loss": 2.0954, "learning_rate": 2.9999999999999997e-05, "epoch": 0.02}
{"loss": 1.984, "learning_rate": 5.6999999999999996e-05, "epoch": 0.04}
{"loss": 1.7062, "learning_rate": 8.4e-05, "epoch": 0.06}
{"loss": 1.3441, "learning_rate": 0.00011399999999999999, "epoch": 0.08}
{"loss": 1.1435, "learning_rate": 0.00014099999999999998, "epoch": 0.1}
{"loss": 0.9968, "learning_rate": 0.00017099999999999998, "epoch": 0.12}
{"loss": 0.9275, "learning_rate": 0.000201, "epoch": 0.13}
...
{"loss": 0.812, "learning_rate": 0.00026904255319148935, "epoch": 0.38}
{"eval_loss": 0.8141900897026062, "eval_runtime": 28.5046, "eval_samples_per_second": 70.164, "eval_steps_per_second": 1.123, "epoch": 0.38}
{"loss": 0.8016, "learning_rate": 0.0002658510638297872, "epoch": 0.4}
{"loss": 0.8024, "learning_rate": 0.0002626595744680851, "epoch": 0.42}
{"loss": 0.7938, "learning_rate": 0.000259468085106383, "epoch": 0.44}
...
{"loss": 0.793, "learning_rate": 0.00021478723404255316, "epoch": 0.71}
{"loss": 0.7884, "learning_rate": 0.00021159574468085105, "epoch": 0.73}
{"loss": 0.7748, "learning_rate": 0.00020840425531914894, "epoch": 0.75}
{"loss": 0.7869, "learning_rate": 0.00020521276595744677, "epoch": 0.77}
{"eval_loss": 0.8041278719902039, "eval_runtime": 28.2371, "eval_samples_per_second": 70.829, "eval_steps_per_second": 1.133, "epoch": 0.77}
{"loss": 0.7846, "learning_rate": 0.00020202127659574466, "epoch": 0.79}
{"loss": 0.791, "learning_rate": 0.00019882978723404255, "epoch": 0.81}
{"loss": 0.7923, "learning_rate": 0.00019563829787234039, "epoch": 0.83}
...
{"loss": 0.7775, "learning_rate": 0.0001573404255319149, "epoch": 1.06}
{"loss": 0.7883, "learning_rate": 0.00015414893617021278, "epoch": 1.08}
{"loss": 0.7805, "learning_rate": 0.0001509574468085106, "epoch": 1.1}
{"loss": 0.7955, "learning_rate": 0.0001477659574468085, "epoch": 1.11}
{"loss": 0.7801, "learning_rate": 0.00014457446808510636, "epoch": 1.13}
{"loss": 0.7933, "learning_rate": 0.00014138297872340425, "epoch": 1.15}
{"eval_loss": 0.8008487820625305, "eval_runtime": 28.9576, "eval_samples_per_second": 69.066, "eval_steps_per_second": 1.105, "epoch": 1.15}
{"loss": 0.785, "learning_rate": 0.0001381914893617021, "epoch": 1.17}
{"loss": 0.7686, "learning_rate": 0.000135, "epoch": 1.19}
{"loss": 0.7717, "learning_rate": 0.00013180851063829786, "epoch": 1.21}
...
{"loss": 0.7688, "learning_rate": 8.393617021276595e-05, "epoch": 1.5}
{"loss": 0.7785, "learning_rate": 8.074468085106383e-05, "epoch": 1.52}
{"loss": 0.7767, "learning_rate": 7.75531914893617e-05, "epoch": 1.54}
{"eval_loss": 0.7986326813697815, "eval_runtime": 28.3196, "eval_samples_per_second": 70.622, "eval_steps_per_second": 1.13, "epoch": 1.54}
{"loss": 0.7907, "learning_rate": 7.436170212765956e-05, "epoch": 1.56}
{"loss": 0.7691, "learning_rate": 7.117021276595744e-05, "epoch": 1.58}
...
{"loss": 0.7649, "learning_rate": 1.6914893617021273e-05, "epoch": 1.9}
{"loss": 0.7624, "learning_rate": 1.3723404255319146e-05, "epoch": 1.92}
{"eval_loss": 0.7973329424858093, "eval_runtime": 29.2014, "eval_samples_per_second": 68.49, "eval_steps_per_second": 1.096, "epoch": 1.92}
{"loss": 0.7824, "learning_rate": 1.0531914893617022e-05, "epoch": 1.94}
{"loss": 0.7772, "learning_rate": 7.3404255319148934e-06, "epoch": 1.96}
{"loss": 0.7762, "learning_rate": 4.148936170212765e-06, "epoch": 1.98}
{"loss": 0.7572, "learning_rate": 9.574468085106382e-07, "epoch": 2.0}
100%|██████████| 1040/1040 [1:18:34<00:00, 4.44s/it]
{"train_runtime": 4716.2302, "train_samples_per_second": 21.179, "train_steps_per_second": 0.221, "train_loss": 0.8336130522764646, "epoch": 2.0}
100%|██████████| 1040/1040 [1:18:34<00:00, 4.53s/it]

Model weight files:

> tree -h /home/guodong.li/output/alpaca-lora-30b-dp
/home/guodong.li/output/alpaca-lora-30b-dp
├── [ 424]  adapter_config.json
├── [ 49M]  adapter_model.bin
└── [4.0K]  checkpoint-1000
    ├── [ 98M]  optimizer.pt
    ├── [ 49M]  pytorch_model.bin
    ├── [ 14K]  rng_state_0.pth
    ├── [ 14K]  rng_state_1.pth
    ├── [ 14K]  rng_state_2.pth
    ├── [ 14K]  rng_state_3.pth
    ├── [ 14K]  rng_state_4.pth
    ├── [ 14K]  rng_state_5.pth
    ├── [ 14K]  rng_state_6.pth
    ├── [ 14K]  rng_state_7.pth
    ├── [ 557]  scaler.pt
    ├── [ 627]  scheduler.pt
    ├── [ 13K]  trainer_state.json
    └── [3.5K]  training_args.bin

1 directory, 16 files

As you can see, on the A800s with 8-way data parallelism and roughly 50K samples, one epoch takes about 40 minutes.
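As a sanity check on the "trainable params" figures reported above (and in the 65B log below), here is the back-of-the-envelope count, assuming the published LLaMA shapes (hidden size 6656 / 60 layers for 30B, 8192 / 80 layers for 65B) and LoRA rank r=8 applied to q_proj and v_proj only:

# Each LoRA-targeted matrix gains an A (r x d) and a B (d x r) adapter.
def lora_params(hidden_size, num_layers, r=8, targets_per_layer=2):
    per_matrix = 2 * hidden_size * r
    return per_matrix * targets_per_layer * num_layers

print(lora_params(6656, 60))   # LLaMA-30B -> 12,779,520, matching the log above
print(lora_params(8192, 80))   # LLaMA-65B -> 20,971,520, matching the 65B log below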

LLaMA-65B

Next, fine-tune LLaMA 65B; the 65B model is roughly 120 GB of weights. To make it fit on a single A800 per data-parallel rank, micro_batch_size is set to 1.
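Why micro_batch_size has to drop to 1: with plain data parallelism every rank holds a full copy of the base model, and alpaca-lora loads it in 8-bit for training (load_in_8bit=True upstream; that loading detail is my assumption about the script, not something stated above). A rough sketch of the arithmetic:

# Back-of-the-envelope memory budget per 80 GB A800 for the 65B run.
params_65b = 65e9
int8_weights_gib = params_65b * 1 / 2**30   # 1 byte per parameter in 8-bit -> ~61 GiB
a800_gib = 80

headroom_gib = a800_gib - int8_weights_gib  # ~19 GiB left for LoRA weights, their
                                            # optimizer states, gradients and activations
print(round(int8_weights_gib), round(headroom_gib))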

Training run:

torchrun --nproc_per_node=8 --master_port=29005 finetune.py \
> --base_model "/data/nfs/guodong.li/pretrain/hf-llama-model/llama-65b" \
> --data_path "/data/nfs/guodong.li/data/alpaca_data_cleaned.json" \
> --output_dir "/home/guodong.li/output/alpaca-lora-65b-dp" \
> --batch_size 8 \
> --micro_batch_size 1 \
> --num_epochs 1
...
CUDA SETUP: CUDA runtime path found: /usr/local/cuda-11.7/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.0
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /home/guodong.li/virtual-venv/alpara-lora-venv-py310-cu117/lib/python3.10/site-packages/bitsandbytes-0.37.2-py3.10.egg/bitsandbytes/libbitsandbytes_cuda117.so...
Training Alpaca-LoRA model with params:
base_model: /data/nfs/guodong.li/pretrain/hf-llama-model/llama-65b
data_path: /data/nfs/guodong.li/data/alpaca_data_cleaned.json
output_dir: /home/guodong.li/output/alpaca-lora-65b-dp
batch_size: 8
micro_batch_size: 1
num_epochs: 1
learning_rate: 0.0003
cutoff_len: 256
val_set_size: 2000
lora_r: 8
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules: ["q_proj", "v_proj"]
train_on_inputs: True
group_by_length: False
wandb_project:
wandb_run_name:
wandb_watch:
wandb_log_model:
resume_from_checkpoint: False
prompt template: alpaca
Loading checkpoint shards: 100%|██████████| 81/81 [02:06<00:00, 1.56s/it]
Loading checkpoint shards: 100%|██████████| 81/81 [02:20<00:00, 1.74s/it]
...
Found cached dataset json (/home/guodong.li/.cache/huggingface/datasets/json/default-2dab63d15cf49261/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e)
100%|██████████| 1/1 [00:00<00:00, 196.47it/s]
trainable params: 20971520 || all params: 65306632192 || trainable%: 0.03211238934867168
Loading cached split indices for dataset at /home/guodong.li/.cache/huggingface/datasets/json/default-2dab63d15cf49261/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e/cache-d8c5d7ac95d53860.arrow and /home/guodong.li/.cache/huggingface/datasets/json/default-2dab63d15cf49261/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e/cache-4a34b0c9feb19e72.arrow
{"loss": 2.1086, "learning_rate": 2.3999999999999997e-05, "epoch": 0.0}
{"loss": 2.0261, "learning_rate": 5.399999999999999e-05, "epoch": 0.0}
{"loss": 1.7054, "learning_rate": 8.4e-05, "epoch": 0.0}
{"loss": 1.2423, "learning_rate": 0.00011099999999999999, "epoch": 0.01}
{"loss": 0.9976, "learning_rate": 0.00013199999999999998, "epoch": 0.01}
{"loss": 0.801, "learning_rate": 0.000162, "epoch": 0.01}
{"loss": 0.839, "learning_rate": 0.00019199999999999998, "epoch": 0.01}
{"loss": 0.8134, "learning_rate": 0.00022199999999999998, "epoch": 0.01}
{"loss": 0.7575, "learning_rate": 0.00025199999999999995, "epoch": 0.01}
...
{"loss": 0.769, "learning_rate": 0.0001992023441315318, "epoch": 0.35}
{"loss": 0.7393, "learning_rate": 0.00019871398339573498, "epoch": 0.35}
{"loss": 0.7269, "learning_rate": 0.0001982256226599381, "epoch": 0.35}
{"loss": 0.6783, "learning_rate": 0.00019773726192414128, "epoch": 0.35}
{"eval_loss": 0.7974867820739746, "eval_runtime": 48.5181, "eval_samples_per_second": 41.222, "eval_steps_per_second": 0.66, "epoch": 0.35}
{"loss": 0.6891, "learning_rate": 0.00019724890118834445, "epoch": 0.35}
{"loss": 0.7216, "learning_rate": 0.0001967605404525476, "epoch": 0.36}
{"loss": 0.7114, "learning_rate": 0.00019627217971675075, "epoch": 0.36}
{"loss": 0.7089, "learning_rate": 0.0001957838189809539, "epoch": 0.36}
...
{"loss": 0.6985, "learning_rate": 5.323132020185577e-06, "epoch": 0.98}
{"loss": 0.7167, "learning_rate": 4.834771284388734e-06, "epoch": 0.99}
{"loss": 0.7433, "learning_rate": 4.346410548591893e-06, "epoch": 0.99}
{"loss": 0.6875, "learning_rate": 3.8580498127950505e-06, "epoch": 0.99}
{"loss": 0.7104, "learning_rate": 3.369689076998209e-06, "epoch": 0.99}
{"loss": 0.7346, "learning_rate": 2.881328341201367e-06, "epoch": 0.99}
{"loss": 0.7062, "learning_rate": 2.3929676054045255e-06, "epoch": 0.99}
{"eval_loss": 0.787121593952179, "eval_runtime": 48.4232, "eval_samples_per_second": 41.303, "eval_steps_per_second": 0.661, "epoch": 0.99}
{"loss": 0.701, "learning_rate": 1.9046068696076832e-06, "epoch": 0.99}
{"loss": 0.7169, "learning_rate": 1.4162461338108414e-06, "epoch": 1.0}
{"loss": 0.763, "learning_rate": 9.278853980139996e-07, "epoch": 1.0}
{"loss": 0.6903, "learning_rate": 4.3952466221715773e-07, "epoch": 1.0}
100%|██████████| 6243/6243 [4:36:50<00:00, 2.42s/it]
{"train_runtime": 16612.2434, "train_samples_per_second": 3.006, "train_steps_per_second": 0.376, "train_loss": 0.7368283385404043, "epoch": 1.0}
100%|██████████| 6243/6243 [4:36:50<00:00, 2.66s/it]

GPU memory usage:

Tue May 23 17:05:37 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.105.01   Driver Version: 515.105.01   CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A800 80G...  Off  | 00000000:34:00.0 Off |                    0 |
| N/A   67C    P0   296W / 300W |  78543MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A800 80G...  Off  | 00000000:35:00.0 Off |                    0 |
| N/A   69C    P0   303W / 300W |  78577MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A800 80G...  Off  | 00000000:36:00.0 Off |                    0 |
| N/A   70C    P0   300W / 300W |  78657MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A800 80G...  Off  | 00000000:37:00.0 Off |                    0 |
| N/A   72C    P0   297W / 300W |  78577MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A800 80G...  Off  | 00000000:9B:00.0 Off |                    0 |
| N/A   71C    P0   292W / 300W |  78641MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA A800 80G...  Off  | 00000000:9C:00.0 Off |                    0 |
| N/A   71C    P0   305W / 300W |  78629MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA A800 80G...  Off  | 00000000:9D:00.0 Off |                    0 |
| N/A   68C    P0   296W / 300W |  78625MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A800 80G...  Off  | 00000000:9E:00.0 Off |                    0 |
| N/A   68C    P0   298W / 300W |  78799MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     33369      C   ...nv-py310-cu117/bin/python    78541MiB |
|    1   N/A  N/A     33370      C   ...nv-py310-cu117/bin/python    78575MiB |
|    2   N/A  N/A     33371      C   ...nv-py310-cu117/bin/python    78655MiB |
|    3   N/A  N/A     33372      C   ...nv-py310-cu117/bin/python    78575MiB |
|    4   N/A  N/A     33373      C   ...nv-py310-cu117/bin/python    78639MiB |
|    5   N/A  N/A     33374      C   ...nv-py310-cu117/bin/python    78627MiB |
|    6   N/A  N/A     33375      C   ...nv-py310-cu117/bin/python    78623MiB |
|    7   N/A  N/A     33376      C   ...nv-py310-cu117/bin/python    78797MiB |
+-----------------------------------------------------------------------------+

Model weights:

> tree -h /home/guodong.li/output/alpaca-lora-65b-dp
/home/guodong.li/output/alpaca-lora-65b-dp
├── [ 424]  adapter_config.json
├── [ 80M]  adapter_model.bin
└── [4.0K]  checkpoint-6200
    ├── [160M]  optimizer.pt
    ├── [ 80M]  pytorch_model.bin
    ├── [ 14K]  rng_state_0.pth
    ├── [ 14K]  rng_state_1.pth
    ├── [ 14K]  rng_state_2.pth
    ├── [ 14K]  rng_state_3.pth
    ├── [ 14K]  rng_state_4.pth
    ├── [ 14K]  rng_state_5.pth
    ├── [ 14K]  rng_state_6.pth
    ├── [ 14K]  rng_state_7.pth
    ├── [ 557]  scaler.pt
    ├── [ 627]  scheduler.pt
    ├── [ 80K]  trainer_state.json
    └── [3.5K]  training_args.bin

1 directory, 16 files

As you can see, on the A800s with 8-way data parallelism and roughly 50K samples, one epoch takes about 4.5 hours.
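Both timing claims can be cross-checked against the trainer's own summary lines (49,942 is the example count reported by the Map step in the logs):

train_examples = 49_942

# 30B run: train_samples_per_second was 21.179
print(train_examples / 21.179 / 60)    # ~39 minutes per epoch
# 65B run: train_samples_per_second was 3.006
print(train_examples / 3.006 / 3600)   # ~4.6 hours per epoch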

Merging the LoRA Weights Back into the Base Model

Next, merge the LoRA weights back into the base model so the result can be used directly for inference. See the earlier article "Fine-tuning LLaMA (7B) in Twenty Minutes with Alpaca-LoRA, Matching Stanford Alpaca" for the changes to export_hf_checkpoint.py.
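Conceptually, the script loads the FP16 base model, attaches the LoRA adapter, folds the low-rank updates back into the base weights, and saves a sharded HF checkpoint. A minimal, generic sketch using peft's merge_and_unload() (not the exact export_hf_checkpoint.py from the repo; the environment-variable names mirror the command below, and the shard size is an assumption that would explain the many small .bin files listed further down):

import os
import torch
from transformers import LlamaForCausalLM
from peft import PeftModel

base_model_path = os.environ.get("BASE_MODEL", "/data/nfs/guodong.li/pretrain/hf-llama-model/llama-65b")
lora_model_path = os.environ.get("LORA_MODEL", "/home/guodong.li/output/alpaca-lora-65b-dp")
output_path = os.environ.get("HF_CHECKPOINT", "/home/guodong.li/output/hf_65b_ckpt")

# Load the FP16 base model on CPU and attach the trained LoRA adapter.
base = LlamaForCausalLM.from_pretrained(base_model_path, torch_dtype=torch.float16, low_cpu_mem_usage=True)
model = PeftModel.from_pretrained(base, lora_model_path)

# Fold the low-rank A/B updates into the base weights and drop the adapter wrappers.
model = model.merge_and_unload()

# Save as a sharded Hugging Face checkpoint (assumed shard size).
model.save_pretrained(output_path, max_shard_size="400MB")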

Merging the weights:

BASE_MODEL=/data/nfs/guodong.li/pretrain/hf-llama-model/llama-65b \
> LORA_MODEL=/home/guodong.li/output/alpaca-lora-65b-dp \
> HF_CHECKPOINT=/home/guodong.li/output/hf_65b_ckpt \
> python export_hf_checkpoint.py

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
/home/guodong.li/virtual-venv/alpara-lora-venv-py310-cu117/lib/python3.10/site-packages/bitsandbytes-0.37.2-py3.10.egg/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath("/opt/rh/devtoolset-9/root/usr/lib/dyninst"), PosixPath("/opt/rh/devtoolset-7/root/usr/lib/dyninst")}
  warn(msg)
CUDA SETUP: CUDA runtime path found: /usr/local/cuda-11.7/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.0
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /home/guodong.li/virtual-venv/alpara-lora-venv-py310-cu117/lib/python3.10/site-packages/bitsandbytes-0.37.2-py3.10.egg/bitsandbytes/libbitsandbytes_cuda117.so...
Loading checkpoint shards: 100%|██████████| 81/81 [01:15<00:00,  1.08it/s]

Merged weight files:

> tree -h hf_65b_ckpt
hf_65b_ckpt
├── [ 580]  config.json
├── [ 137]  generation_config.json
├── [ 537]  pytorch_model-00001-of-00403.bin
├── [500M]  pytorch_model-00002-of-00403.bin
├── [256M]  pytorch_model-00003-of-00403.bin
├── [256M]  pytorch_model-00004-of-00403.bin
├── [344M]  pytorch_model-00005-of-00403.bin
├── [344M]  pytorch_model-00006-of-00403.bin
├── [344M]  pytorch_model-00007-of-00403.bin
...
├── [344M]  pytorch_model-00400-of-00403.bin
├── [344M]  pytorch_model-00401-of-00403.bin
├── [344M]  pytorch_model-00402-of-00403.bin
├── [500M]  pytorch_model-00403-of-00403.bin
└── [ 65K]  pytorch_model.bin.index.json

0 directories, 406 files

Model Inference

Next, run inference with the merged weights. The inference code (inference.py) is as follows:

import sys

import torch
from transformers import LlamaForCausalLM, AutoTokenizer

device = torch.device("cuda:2") if torch.cuda.is_available() else torch.device("cpu")

tokenizer_path = "/data/nfs/guodong.li/pretrain/hf-llama-model/tokenizer"
model_path = "/home/guodong.li/output/hf_65b_ckpt"  # You can modify the path for storing the local model

# Load the merged 65B checkpoint in 8-bit and let device_map="auto" spread it across the available GPUs.
model = LlamaForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16, load_in_8bit=True, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)

print("Human:")
line = input()
while line:
    inputs = "Human: " + line.strip() + "\n\nAssistant:"
    input_ids = tokenizer(inputs, return_tensors="pt").input_ids
    input_ids = input_ids.to(device)
    outputs = model.generate(
        input_ids,
        max_new_tokens=500,
        do_sample=True,
        top_k=30,
        top_p=0.85,
        temperature=0.5,
        repetition_penalty=1.0,
        eos_token_id=2,
        bos_token_id=1,
        pad_token_id=0,
    )
    rets = tokenizer.batch_decode(outputs, skip_special_tokens=True, clean_up_tokenization_spaces=False)
    # Strip the prompt from the decoded output before printing the answer.
    print("Assistant:\n" + rets[0].strip().replace(inputs, ""))
    print("\n------------------------------------------------\nHuman:")
    line = input()

Run the inference script:

> python inference.py
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
/home/guodong.li/virtual-venv/alpara-lora-venv-py310-cu117/lib/python3.10/site-packages/bitsandbytes-0.37.2-py3.10.egg/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath("/opt/rh/devtoolset-9/root/usr/lib/dyninst"), PosixPath("/opt/rh/devtoolset-7/root/usr/lib/dyninst")}
  warn(msg)
CUDA SETUP: CUDA runtime path found: /usr/local/cuda-11.7/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.0
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /home/guodong.li/virtual-venv/alpara-lora-venv-py310-cu117/lib/python3.10/site-packages/bitsandbytes-0.37.2-py3.10.egg/bitsandbytes/libbitsandbytes_cuda117.so...
Loading checkpoint shards: 100%|██████████| 402/402 [01:48<00:00, 3.72it/s]
Human:
What are the five characteristics of a good argument?
Assistant:
A good argument should be clear, concise, logical, supported by evidence, and respectful of the opposing view.

------------------------------------------------
Human:
Generate a list of ten common idioms related to animals.
Assistant:
1. "Like a fish out of water"
2. "Birds of a feather flock together"
3. "Let the cat out of the bag"
4. "Herding cats"
5. "Barking up the wrong tree"
6. "Sly as a fox"
7. "A lion's share"
8. "A bird in the hand is worth two in the bush"
9. "A wolf in sheep's clothing"
10. "A wild goose chase".

------------------------------------------------
Human:
Evaluate the following expression: (6+2)*(2-2).
Assistant:
10.

------------------------------------------------
Human:
Compute the derivative of 3x^3 + 10x.
Assistant:
The derivative of 3x^3 + 10x is 9x^2 + 10.

------------------------------------------------
Human:

GPU memory usage:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.105.01   Driver Version: 515.105.01   CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A800 80G...  Off  | 00000000:34:00.0 Off |                    0 |
| N/A   44C    P0    69W / 300W |  66927MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
...
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A800 80G...  Off  | 00000000:9E:00.0 Off |                    0 |
| N/A   47C    P0    71W / 300W |   7224MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     43499      C   python                          66925MiB |
|    1   N/A  N/A     43499      C   python                            949MiB |
...
|    7   N/A  N/A     43499      C   python                            949MiB |
+-----------------------------------------------------------------------------+

As you can see, even with the model loaded in 8-bit (load_in_8bit=True in the code above), a single card still holds over 60 GB. If you don't have that much GPU memory, consider model-parallel inference: the tensor_parallel and FasterTransformer projects can run LLaMA with the model sharded across GPUs (a hedged sketch with tensor_parallel follows below). For models in the tens of billions of parameters and above, model parallelism is also worth using simply for better inference speed and throughput.
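As a pointer only, here is a hedged sketch of what inference with the tensor_parallel package typically looks like (the API follows that project's README; I have not verified it against this particular checkpoint, so treat it as a starting point rather than a working recipe):

import torch
import tensor_parallel as tp
from transformers import LlamaForCausalLM, AutoTokenizer

model_path = "/home/guodong.li/output/hf_65b_ckpt"
tokenizer = AutoTokenizer.from_pretrained("/data/nfs/guodong.li/pretrain/hf-llama-model/tokenizer")

# Load in FP16 on CPU first (the node has 1 TB of RAM), then shard the linear
# layers across several GPUs so no single card has to hold all ~130 GB of weights.
model = LlamaForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16)
model = tp.tensor_parallel(model, ["cuda:0", "cuda:1", "cuda:2", "cuda:3"])

inputs = tokenizer("Human: What is LoRA?\n\nAssistant:", return_tensors="pt").to("cuda:0")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0], skip_special_tokens=True))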

Conclusion

This article walked through training and inference for LLaMA 30B/65B using the parameter-efficient LoRA fine-tuning technique. I hope it is useful to you.

References:

From 0 to 1: Reproducing Stanford Alpaca (7B)

Fine-tuning LLaMA (7B) in Twenty Minutes with Alpaca-LoRA, Matching Stanford Alpaca

Alpaca-LoRA
