StableDiffusion大模型(Dreambooth)云上训练以及安装CODA指定版本

创建阿里云PAI DSW实例跑kohya

镜像我这里选:

dsw-registry-vpc.cn-shanghai.cr.aliyuncs.com/pai/modelscope:1.10.0-pytorch2.1.0tensorflow2.14.0-gpu-py310-cu118-ubuntu22.04
我这里用默认的镜像,实测截至2024.01.10时,直接拉kohya的github可以直接用,不需要改动cuda之类的操作

安装kohya_ss

kohya_ss仓库地址:
https://github.com/bmaltais/kohya_ss
kohya_ss是个webui训练器,SD web_ui里也有对应的Dreambooth训练插件移植,如果只拿来训练不需要跑完整的sd webui服务,只需要kohya就可以了,kohay也可以练lora
在workspace根目录直接:

1
git clone https://github.com/bmaltais/kohya_ss.git

很快就能完成,接着依次执行

1
2
3
4
cd ./kohya_ss
apt update -y && apt install -y python3-tk
chmod +x ./setup.sh
./setup.sh

虽然镜像里带py310,但是似乎还是要装一下python3-tk
之后安装脚本会自动完成,我大概花了5分钟

然后运行启动webui

1
HF_ENDPOINT=https://hf-mirror.com ./gui.sh --headless


点击这个本地的IP即能点开webui了
HF_ENDPOINT=https://hf-mirror.com是为了防止抱脸会更新卡住而用的镜像网站(我确实因为这个卡过)或者见本站另一篇专门说代理的文章:
https://winotmk.github.io/240109_Linux%E4%B8%8A%E7%9A%84%E5%91%BD%E4%BB%A4%E8%A1%8C%E4%BB%A3%E7%90%86%E5%B7%A5%E5%85%B7/

kohya_ss的Dreambooth训练参数

source model


我这里用自己上传的模型,可以先上传至阿里云oss再挂载进来,所以这里这样选,然后填模型路径就好了

floders

这个tag里比较简单,没什么好说的
Image folder里写上训练集目录,注意写上的目录底下应该是例如10_ABC目录,然后再放图和txt文件,这个10就step的10,和lora训练时候一样

parameters

basic

和lora训练设置大同小异,但是参数要比lora小得多,因为Dreambooth比lora性能消耗要大得多而且非常容易过拟合,图出来一滩浆糊,比如我尝试epoch可能10以内就足够,由于文件比较大Save every N epochs我一般也就3-4,其他参数看个人需求吧

samples

这里能填关键词和每多少轮出个预览图,玩玩用
都准备好了就可以点 start training,但webui不会有任何提示..要看之前启webui的终端
这样就是开始训练了:

不过我第一次成功启动了webui但是点开始训练以后,报过类似这样的错:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
The following directories listed in your path were found to be non-existent: {PosixPath('/usr/local/nvidia/lib64'), PosixPath('/usr/local/nvidia/lib')}
/usr/local/lib/python3.10/dist-packages/bitsandbytes/cuda_setup/main.py:166: UserWarning: /usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda-11 did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
warn(msg)
The following directories listed in your path were found to be non-existent: {PosixPath('//license-pai.cn-shanghai.data.aliyun.com'), PosixPath('http')}
The following directories listed in your path were found to be non-existent: {PosixPath('dsw-registry-vpc.cn-shanghai.cr.aliyuncs.com/cloud-dsw/eas-service'), PosixPath('aigc-torch113-cu117-ubuntu22.04-v0.2.1')}
The following directories listed in your path were found to be non-existent: {PosixPath('http'), PosixPath('8088/dsw-301739'), PosixPath('//127.0.0.1')}
The following directories listed in your path were found to be non-existent: {PosixPath('Asia/Shanghai')}
The following directories listed in your path were found to be non-existent: {PosixPath('tcp'), PosixPath('443'), PosixPath('//10.192.0.1')}
The following directories listed in your path were found to be non-existent: {PosixPath('https'), PosixPath('//dsw-cn-shanghai.data.aliyun.com')}
The following directories listed in your path were found to be non-existent: {PosixPath('tcp'), PosixPath('443'), PosixPath('//10.192.0.1')}
The following directories listed in your path were found to be non-existent: {PosixPath('/home/pai/bin/python')}
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching in backup paths...
DEBUG: Possible options found for libcudart.so: {PosixPath('/usr/local/cuda/lib64/libcudart.so.11.0')}
CUDA SETUP: PyTorch settings found: CUDA_VERSION=118, Highest Compute Capability: 7.0.
CUDA SETUP: To manually override the PyTorch CUDA version please see:https://github.com/TimDettmers/bitsandbytes/blob/main/how_to_use_nonpytorch_cuda.md
/usr/local/lib/python3.10/dist-packages/bitsandbytes/cuda_setup/main.py:166: UserWarning: WARNING: Compute capability < 7.5 detected! Only slow 8-bit matmul is supported for your GPU! If you run into issues with 8-bit matmul, you can try 4-bit quantization: https://huggingface.co/blog/4bit-transformers-bitsandbytes
warn(msg)
CUDA SETUP: Loading binary /usr/local/lib/python3.10/dist-packages/bitsandbytes/libbitsandbytes_cuda118_nocublaslt.so...
libcusparse.so.11: cannot open shared object file: No such file or directory
CUDA SETUP: Something unexpected happened. Please compile from source:
git clone https://github.com/TimDettmers/bitsandbytes.git
cd bitsandbytes
CUDA_VERSION=118 make cuda11x_nomatmul
python setup.py install
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 187, in _run_module_as_main
mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
File "/usr/lib/python3.10/runpy.py", line 146, in _get_module_details
return _get_module_details(pkg_main_name, error)
File "/usr/lib/python3.10/runpy.py", line 110, in _get_module_details
__import__(pkg_name)
File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/__init__.py", line 6, in <module>
from . import cuda_setup, utils, research
File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/research/__init__.py", line 1, in <module>
from . import nn
File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/research/nn/__init__.py", line 1, in <module>
from .modules import LinearFP8Mixed, LinearFP8Global
File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/research/nn/modules.py", line 8, in <module>
from bitsandbytes.optim import GlobalOptimManager
File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/optim/__init__.py", line 6, in <module>
from bitsandbytes.cextension import COMPILED_WITH_CUDA
File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/cextension.py", line 20, in <module>
raise RuntimeError('''

以及如果遇到类似:

1
Could not load dynamic library 'libcudart.so.11.0'

重新装适合的CUDA版本即可解决,如果要装CUDA:

CUDA相关

安装CUDA指定版本

遇到过cuda版本不匹配的问题,记一下配置过程
cuda下载:https://developer.nvidia.com/cuda-downloads
但是有时候需要特定版本:https://developer.nvidia.com/cuda-toolkit-archive
以11.8为例,系统是ubuntu22.04,所以这样选:

下载安装cuda:

1
2
3
cd /mnt/workspace
wget https://developer.download.nvidia.com/compute/cuda/11.8.0/local_installers/cuda_11.8.0_520.61.05_linux.run
sudo sh cuda_11.8.0_520.61.05_linux.run

大约3-4GB左右,运行后需要等一会,会弹出交互界面

这里去掉安装驱动,因为我们已经有驱动了只是想要不同版本的cuda,然后选安装

ps如果遇到装了多份驱动需要卸一个的情况:
https://www.jianshu.com/p/54928967e417

装好以后他会提示:

需要往 LD_LIBRARY_PATHPATH 里添加两条环境变量

1
2
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-11.8/lib64
export PATH=$PATH:/usr/local/cuda-11.8/bin

之后我使用

1
python -m bitsandbytes

如果没有报错应该就是好用的

ps如果是windows上的wsl:

1
export LD_LIBRARY_PATH=/usr/lib/wsl/lib/

切换cuda版本

1
2
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-11.7
export BNB_CUDA_VERSION=117

改环境变量可以手动切换版本(当然得已经装了)

查看cuda版本

1
nvidia-smi

或者可以:
参考:https://blog.csdn.net/Kefenggewu_/article/details/117675079
默认cuda会装在/usr/local,所以查看安装版本可以这样:

1
ls -l /usr/local | grep cuda

或者据说可以nvcc -V # (V大写)

本节另外的参考:https://github.com/TimDettmers/bitsandbytes/blob/main/how_to_use_nonpytorch_cuda.md

链接

kohya_ss:
https://github.com/bmaltais/kohya_ss
一个封装好的kohya-docker的镜像
https://github.com/ashleykleynhans/kohya-docker

dreambooth相关介绍:
https://huggingface.co/docs/diffusers/training/dreambooth
https://github.com/google/dreambooth