Using RTX 4000 series which doesn't support faster communication speedups. Ensuring P2P and IB communications are disabled.
INFO:distributed_c10d:2024-11-05 13:17:49:Added key: store_based_barrier_key:1 to store for rank: 1
INFO:distributed_c10d:2024-11-05 13:17:49:Added key: store_based_barrier_key:1 to store for rank: 0
INFO:distributed_c10d:2024-11-05 13:17:49:Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
INFO:distributed_c10d:2024-11-05 13:17:49:Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
INFO:Trainer:2024-11-05 13:18:00:[cuda:0] Initialization completed, start model training.
INFO:Trainer:2024-11-05 13:18:00:[cuda:1] Initialization completed, start model training.
INFO:distributed:2024-11-05 13:18:02:Reducer buckets have been rebuilt in this iteration.
INFO:distributed:2024-11-05 13:18:02:Reducer buckets have been rebuilt in this iteration.
INFO:Trainer:2024-11-05 13:25:48:[cuda:0] Epoch 100 | The model starts evaluation on the validation set.
0%| | 0/196 [00:00 0.690000). Saving model ...
INFO:Trainer:2024-11-05 13:25:54:[cuda:0] Epoch 100 | Training checkpoint saved at ../../checkpoint/transe-100.pth
INFO:Trainer:2024-11-05 13:25:54:[cuda:0] Epoch [ 100/1000] | Batchsize: 8192 | loss: 0.092149 | 4.68673 seconds/epoch
INFO:Trainer:2024-11-05 13:25:54:[cuda:1] Epoch [ 100/1000] | Batchsize: 8192 | loss: 0.092458 | 4.68671 seconds/epoch
INFO:Trainer:2024-11-05 13:33:47:[cuda:0] Epoch 200 | The model starts evaluation on the validation set.
0%| | 0/196 [00:00 0.719000). Saving model ...
INFO:Trainer:2024-11-05 13:33:52:[cuda:0] Epoch 200 | Training checkpoint saved at ../../checkpoint/transe-200.pth
INFO:Trainer:2024-11-05 13:33:52:[cuda:0] Epoch [ 200/1000] | Batchsize: 8192 | loss: 0.074799 | 4.73561 seconds/epoch
INFO:Trainer:2024-11-05 13:33:52:[cuda:1] Epoch [ 200/1000] | Batchsize: 8192 | loss: 0.074608 | 4.73551 seconds/epoch
INFO:Trainer:2024-11-05 13:41:43:[cuda:0] Epoch 300 | The model starts evaluation on the validation set.
0%| | 0/196 [00:00 0.738000). Saving model ...
INFO:Trainer:2024-11-05 13:41:49:[cuda:0] Epoch 300 | Training checkpoint saved at ../../checkpoint/transe-300.pth
INFO:Trainer:2024-11-05 13:41:49:[cuda:0] Epoch [ 300/1000] | Batchsize: 8192 | loss: 0.065236 | 4.74605 seconds/epoch
INFO:Trainer:2024-11-05 13:41:49:[cuda:1] Epoch [ 300/1000] | Batchsize: 8192 | loss: 0.064975 | 4.74603 seconds/epoch
INFO:Trainer:2024-11-05 13:49:39:[cuda:0] Epoch 400 | The model starts evaluation on the validation set.
0%| | 0/196 [00:00 0.761000). Saving model ...
INFO:Trainer:2024-11-05 13:49:44:[cuda:0] Epoch 400 | Training checkpoint saved at ../../checkpoint/transe-400.pth
INFO:Trainer:2024-11-05 13:49:44:[cuda:0] Epoch [ 400/1000] | Batchsize: 8192 | loss: 0.050724 | 4.74811 seconds/epoch
INFO:Trainer:2024-11-05 13:49:44:[cuda:1] Epoch [ 400/1000] | Batchsize: 8192 | loss: 0.050384 | 4.74808 seconds/epoch
INFO:Trainer:2024-11-05 13:57:32:[cuda:0] Epoch 500 | The model starts evaluation on the validation set.
0%| | 0/196 [00:00