Using RTX 4000 series which doesn't support faster communication speedups. Ensuring P2P and IB communications are disabled.
DEBUG:cmd:2024-11-05 14:39:22:Popen(['git', 'rev-parse', '--show-toplevel'], cwd=/home/luyanfeng/my_code/github/pybind11-OpenKE, stdin=None, shell=False, universal_newlines=False)
DEBUG:cmd:2024-11-05 14:39:22:Popen(['git', 'rev-parse', '--show-toplevel'], cwd=/home/luyanfeng/my_code/github/pybind11-OpenKE, stdin=None, shell=False, universal_newlines=False)
DEBUG:cmd:2024-11-05 14:39:22:Popen(['git', 'rev-parse', '--show-toplevel'], cwd=/home/luyanfeng/my_code/github/pybind11-OpenKE, stdin=None, shell=False, universal_newlines=False)
DEBUG:cmd:2024-11-05 14:39:22:Popen(['git', 'rev-parse', '--show-toplevel'], cwd=/home/luyanfeng/my_code/github/pybind11-OpenKE, stdin=None, shell=False, universal_newlines=False)
DEBUG:connectionpool:2024-11-05 14:39:23:Starting new HTTPS connection (1): api.wandb.ai:443
DEBUG:connectionpool:2024-11-05 14:39:23:Starting new HTTPS connection (1): api.wandb.ai:443
DEBUG:connectionpool:2024-11-05 14:39:24:https://api.wandb.ai:443 "POST /graphql HTTP/1.1" 200 None
DEBUG:connectionpool:2024-11-05 14:39:24:https://api.wandb.ai:443 "POST /graphql HTTP/1.1" 200 None
DEBUG:connectionpool:2024-11-05 14:39:24:https://api.wandb.ai:443 "POST /graphql HTTP/1.1" 200 None
wandb: Currently logged in as: 3555028709. Use `wandb login --relogin` to force relogin
DEBUG:cmd:2024-11-05 14:39:24:Popen(['git', 'cat-file', '--batch-check'], cwd=/home/luyanfeng/my_code/github/pybind11-OpenKE, stdin=, shell=False, universal_newlines=False)
DEBUG:connectionpool:2024-11-05 14:39:25:https://api.wandb.ai:443 "POST /graphql HTTP/1.1" 200 None
wandb: Currently logged in as: 3555028709. Use `wandb login --relogin` to force relogin
DEBUG:cmd:2024-11-05 14:39:25:Popen(['git', 'cat-file', '--batch-check'], cwd=/home/luyanfeng/my_code/github/pybind11-OpenKE, stdin=, shell=False, universal_newlines=False)
wandb: - Waiting for wandb.init()...
wandb: - Waiting for wandb.init()...
wandb: \ Waiting for wandb.init()...
wandb: \ Waiting for wandb.init()...
wandb: | Waiting for wandb.init()...
wandb: wandb version 0.17.0 is available! To upgrade, please run:
wandb: $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.16.6
wandb: Run data is saved locally in /home/luyanfeng/my_code/github/pybind11-OpenKE/examples/TransE/wandb/run-20240511_143925-89550puv
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run TransE-FB15K-multi
wandb: ⭐️ View project at https://wandb.ai/3555028709/pybind11-ke
wandb: 🚀 View run at https://wandb.ai/3555028709/pybind11-ke/runs/89550puv
wandb: wandb version 0.17.0 is available! To upgrade, please run:
wandb: $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.16.6
wandb: Run data is saved locally in /home/luyanfeng/my_code/github/pybind11-OpenKE/examples/TransE/wandb/run-20240511_143924-7dqahmj4
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run TransE-FB15K-multi
wandb: ⭐️ View project at https://wandb.ai/3555028709/pybind11-ke
wandb: 🚀 View run at https://wandb.ai/3555028709/pybind11-ke/runs/7dqahmj4
INFO:distributed_c10d:2024-11-05 14:39:32:Added key: store_based_barrier_key:1 to store for rank: 1
wandb: Network error (TransientError), entering retry loop.
INFO:distributed_c10d:2024-11-05 14:39:35:Added key: store_based_barrier_key:1 to store for rank: 0
INFO:distributed_c10d:2024-11-05 14:39:35:Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
INFO:distributed_c10d:2024-11-05 14:39:35:Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
INFO:Trainer:2024-11-05 14:39:36:[cuda:0] Initialization completed, start model training.
INFO:Trainer:2024-11-05 14:39:36:[cuda:1] Initialization completed, start model training.
INFO:distributed:2024-11-05 14:39:38:Reducer buckets have been rebuilt in this iteration.
INFO:distributed:2024-11-05 14:39:38:Reducer buckets have been rebuilt in this iteration.
INFO:Trainer:2024-11-05 14:47:39:[cuda:0] Epoch 100 | The model starts evaluation on the validation set.
0%| | 0/196 [00:00 0.693000). Saving model ...
INFO:Trainer:2024-11-05 14:47:45:[cuda:0] Epoch 100 | Training checkpoint saved at ../../checkpoint/transe-100.pth
INFO:Trainer:2024-11-05 14:47:45:[cuda:1] Epoch [ 100/1000] | Batchsize: 8192 | loss: 0.091870 | 4.83229 seconds/epoch
INFO:Trainer:2024-11-05 14:47:45:[cuda:0] Epoch [ 100/1000] | Batchsize: 8192 | loss: 0.092685 | 4.83236 seconds/epoch
INFO:Trainer:2024-11-05 14:55:41:[cuda:0] Epoch 200 | The model starts evaluation on the validation set.
0%| | 0/196 [00:00 0.722000). Saving model ...
INFO:Trainer:2024-11-05 14:55:46:[cuda:0] Epoch 200 | Training checkpoint saved at ../../checkpoint/transe-200.pth
INFO:Trainer:2024-11-05 14:55:46:[cuda:1] Epoch [ 200/1000] | Batchsize: 8192 | loss: 0.074659 | 4.82544 seconds/epoch
INFO:Trainer:2024-11-05 14:55:46:[cuda:0] Epoch [ 200/1000] | Batchsize: 8192 | loss: 0.073929 | 4.82542 seconds/epoch
INFO:Trainer:2024-11-05 15:03:43:[cuda:0] Epoch 300 | The model starts evaluation on the validation set.
0%| | 0/196 [00:00 0.739000). Saving model ...
INFO:Trainer:2024-11-05 15:03:48:[cuda:0] Epoch 300 | Training checkpoint saved at ../../checkpoint/transe-300.pth
INFO:Trainer:2024-11-05 15:03:48:[cuda:1] Epoch [ 300/1000] | Batchsize: 8192 | loss: 0.063407 | 4.82434 seconds/epoch
INFO:Trainer:2024-11-05 15:03:48:[cuda:0] Epoch [ 300/1000] | Batchsize: 8192 | loss: 0.064902 | 4.82432 seconds/epoch
INFO:Trainer:2024-11-05 15:11:45:[cuda:0] Epoch 400 | The model starts evaluation on the validation set.
0%| | 0/196 [00:00 0.764000). Saving model ...
INFO:Trainer:2024-11-05 15:11:50:[cuda:0] Epoch 400 | Training checkpoint saved at ../../checkpoint/transe-400.pth
INFO:Trainer:2024-11-05 15:11:50:[cuda:1] Epoch [ 400/1000] | Batchsize: 8192 | loss: 0.050420 | 4.82323 seconds/epoch
INFO:Trainer:2024-11-05 15:11:50:[cuda:0] Epoch [ 400/1000] | Batchsize: 8192 | loss: 0.050849 | 4.82319 seconds/epoch
INFO:Trainer:2024-11-05 15:19:46:[cuda:0] Epoch 500 | The model starts evaluation on the validation set.
0%| | 0/196 [00:00