2021-10-20 11:44:14.662895: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-10-20 11:44:14.662970: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcublas.so.11'; dlerror: libcublas.so.11: cannot open shared object file: No such file or directory
2021-10-20 11:44:14.663017: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcublasLt.so.11'; dlerror: libcublasLt.so.11: cannot open shared object file: No such file or directory
2021-10-20 11:44:14.664822: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-10-20 11:44:14.665156: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-10-20 11:44:14.667228: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-10-20 11:44:14.667345: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcusparse.so.11'; dlerror: libcusparse.so.11: cannot open shared object file: No such file or directory
2021-10-20 11:44:14.667399: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory
2021-10-20 11:44:14.667411: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1757] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices..
These warnings keep appearing, and I can't tell whether the GPU is actually working.
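To see at a glance which CUDA libraries are missing, the loader messages can be filtered out of the log. This is a minimal sketch that assumes the log text has been captured as a string; the function name `summarize_loader_log` and the regular expressions are illustrative, based only on the message wording shown above.

```python
import re

# Patterns based on the dso_loader messages shown in the log above.
MISSING_LIB = re.compile(r"Could not load dynamic library '([^']+)'")
LOADED_LIB = re.compile(r"Successfully opened dynamic library (\S+)")


def summarize_loader_log(log_text):
    """Return (missing, loaded) library names found in TensorFlow log text."""
    missing = MISSING_LIB.findall(log_text)
    loaded = LOADED_LIB.findall(log_text)
    return missing, loaded


# Three lines copied from the log above, used as sample input.
sample = (
    "2021-10-20 11:44:14.662895: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] "
    "Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: "
    "cannot open shared object file: No such file or directory\n"
    "2021-10-20 11:44:14.664822: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] "
    "Successfully opened dynamic library libcufft.so.10\n"
    "2021-10-20 11:44:14.667399: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] "
    "Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: "
    "cannot open shared object file: No such file or directory\n"
)

if __name__ == "__main__":
    missing, loaded = summarize_loader_log(sample)
    print("missing:", missing)
    print("loaded:", loaded)
```

In the log above, the libraries that load have a .so.10 suffix while the missing ones are .so.11 or .so.8, which already hints at a CUDA version mismatch.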
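On the CPU the run took 7 minutes, and with the GPU supposedly in use it took 4 minutes, but something still seemed off, so I looked into the error and, sure enough, it was a version problem. The currently installed version is tensorflow-gpu==2.4.1, while the environment I originally worked in had tensorflow-gpu==2.1.0. Trying to install 2.1.0 produced a warning that it could not be installed, so I installed 2.2.0 instead, and running the code then gave this: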
/root/miniconda3/lib/python3.8/site-packages/tensorflow_addons/utils/ensure_tf_install.py:53: UserWarning: Tensorflow Addons supports using Python ops for all Tensorflow versions above or equal to 2.4.0 and strictly below 2.7.0 (nightly versions are not supported).
The versions of TensorFlow you are currently using is 2.2.0 and is not supported.
Some things might work, some things might not.
If you were to encounter a bug, do not file an issue.
If you want to make sure you're using a tested and supported configuration, either change the TensorFlow version or the TensorFlow Addons's version.
You can find the compatibility matrix in TensorFlow Addon's readme:
https://github.com/tensorflow/addons
In other words, 2.2 is not supported, so I should change either the TensorFlow version or the TensorFlow Addons version.
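The range in the warning (at least 2.4.0 and strictly below 2.7.0) can be checked before installing anything. This is a rough sketch: the function names are made up for illustration, the default bounds are copied from the warning text above and apply only to the Addons release that printed it, and only plain numeric versions such as "2.4.1" are handled.

```python
def parse_version(text):
    """Turn '2.4.1' into (2, 4, 1) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def addons_supports(tf_version, minimum="2.4.0", upper_exclusive="2.7.0"):
    """Check a TensorFlow version against the range quoted in the warning.

    The default bounds come from the UserWarning above; other Addons
    releases have different ranges (see the compatibility matrix).
    """
    version = parse_version(tf_version)
    return parse_version(minimum) <= version < parse_version(upper_exclusive)


if __name__ == "__main__":
    for candidate in ("2.2.0", "2.4.1"):
        print(candidate, addons_supports(candidate))
```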
My original working environment, which ran fine:
tensorflow-gpu==2.1.0
tensorflow == 2.3.0
tensorflow-addons ==0.13.0
The environment I am rebuilding for the demo deployment:
tensorflow-gpu==2.4.1
tensorflow == 2.4.1
tensorflow-addons ==0.14.0
It still doesn't work ^**^
Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory
I kept looking for other ways to diagnose it.
2021-10-20 14:01:04.732484: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2200000000 Hz
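Apparently this line means the CPU is being used, although TensorFlow may print it on GPU runs too, so it is not conclusive on its own.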
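A more direct check is to ask TensorFlow which GPUs it can see. The sketch below uses `tf.config.list_physical_devices`, which is available in the TensorFlow 2.x versions discussed here; the wrapper name `visible_gpus` is made up for illustration, and it returns None when TensorFlow cannot be imported at all.

```python
def visible_gpus():
    """Return names of GPUs TensorFlow can see, or None if TensorFlow is not importable."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    # An empty list means TensorFlow is installed but registered no GPU devices.
    return [device.name for device in tf.config.list_physical_devices("GPU")]


if __name__ == "__main__":
    gpus = visible_gpus()
    if gpus is None:
        print("TensorFlow is not installed in this environment")
    elif not gpus:
        print("TensorFlow sees no GPU; it will run on the CPU")
    else:
        print("TensorFlow sees:", gpus)
```

When the log contains "Skipping registering GPU devices", an empty list is the expected result here.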
watch -n 1 nvidia-smi
With this running while the code executes, GPU utilization is sampled every second. I learned that if it stays below 10 percent, something is wrong.
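The same check can be scripted instead of watched by eye. This sketch assumes `nvidia-smi` is on the PATH and uses its `--query-gpu=utilization.gpu` CSV output; the helper names are hypothetical, and the 10 percent threshold is the rule of thumb mentioned above, not an official limit. Parsing assumes each GPU reports a plain integer.

```python
import subprocess

# Ask nvidia-smi for one bare integer (percent) per GPU.
QUERY = ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"]


def parse_utilization(output):
    """Parse one integer percentage per non-empty line of nvidia-smi CSV output."""
    return [int(line.strip()) for line in output.splitlines() if line.strip()]


def read_utilization():
    """Run nvidia-smi once and return utilization percentages, one per GPU."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_utilization(result.stdout)


def looks_idle(samples, threshold=10):
    """True if there is at least one sample and all are below the threshold."""
    return bool(samples) and all(value < threshold for value in samples)


if __name__ == "__main__":
    samples = read_utilization()
    print("utilization per GPU (%):", samples)
    if looks_idle(samples):
        print("GPU looks idle; training is probably running on the CPU")
```

A single reading can be misleading during data loading, so it is worth calling `read_utilization` several times during training before drawing a conclusion.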
Running nvcc --version shows nothing either.
sudo apt update
sudo apt-get install cuda-11-
conda install cudatoolkit
https://www.tensorflow.org/install/source#tested_build_configurations
The versions have to match, haha.
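For the TensorFlow versions that came up in these notes, the pairing from the tested build configurations table linked above can be kept as a small lookup. This is only a sketch: the dictionary covers just the 2.1 to 2.4 GPU builds on Linux, the function name is made up, and it assumes patch releases such as 2.4.1 need the same CUDA and cuDNN as their 2.4.0 entry.

```python
# (CUDA, cuDNN) per TensorFlow minor version, transcribed from the
# tested build configurations table for the versions used in these notes.
TESTED_GPU_BUILDS = {
    "2.1": ("10.1", "7.6"),
    "2.2": ("10.1", "7.6"),
    "2.3": ("10.1", "7.6"),
    "2.4": ("11.0", "8.0"),
}


def required_cuda(tf_version):
    """Return (cuda, cudnn) for a version like '2.4.1', or None if not listed.

    Assumption: patch releases share the requirements of their minor version.
    """
    major_minor = ".".join(tf_version.split(".")[:2])
    return TESTED_GPU_BUILDS.get(major_minor)


if __name__ == "__main__":
    for version in ("2.2.0", "2.4.1"):
        print(version, required_cuda(version))
```

This lines up with the logs above: 2.4.1 looked for libcudart.so.11.0 and libcudnn.so.8, while 2.2.0 looked for libcudart.so.10.1.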
The error occurred because the CUDA toolkit version and the driver version did not match.