Using RTX 4000 series which doesn't support faster communication speedups. Ensuring P2P and IB communications are disabled.
INFO:Trainer:2024-13-05 20:36:03:[cuda:0] Initialization completed, start model training.
INFO:Trainer:2024-13-05 20:36:04:[cuda:0] Initialization completed, start model training.
INFO:Trainer:2024-13-05 20:36:06:[cuda:0] The model training is completed, taking a total of 3.43352 seconds.
INFO:distributed_c10d:2024-13-05 20:36:07:Added key: store_based_barrier_key:1 to store for rank: 0
INFO:Trainer:2024-13-05 20:36:08:[cuda:0] The model training is completed, taking a total of 3.42585 seconds.
INFO:distributed_c10d:2024-13-05 20:36:09:Added key: store_based_barrier_key:1 to store for rank: 1
INFO:distributed_c10d:2024-13-05 20:36:09:Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
INFO:distributed_c10d:2024-13-05 20:36:09:Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
INFO:Trainer:2024-13-05 20:36:10:[cuda:0] Initialization completed, start model training.
INFO:Trainer:2024-13-05 20:36:10:[cuda:1] Initialization completed, start model training.
INFO:distributed:2024-13-05 20:36:12:Reducer buckets have been rebuilt in this iteration.
INFO:distributed:2024-13-05 20:36:12:Reducer buckets have been rebuilt in this iteration.
INFO:Trainer:2024-13-05 20:46:30:[cuda:0] Epoch 100 | The model starts evaluation on the validation set.
0%| | 0/1754 [00:00 0.430000). Saving model ...
INFO:Trainer:2024-13-05 20:47:20:[cuda:0] Epoch 100 | Training checkpoint saved at ../../checkpoint/transr-100.pth
INFO:Trainer:2024-13-05 20:47:20:[cuda:0] Epoch [ 100/1000] | Batchsize: 2048 | loss: 5.636623 | 6.19339 seconds/epoch
INFO:Trainer:2024-13-05 20:47:20:[cuda:1] Epoch [ 100/1000] | Batchsize: 2048 | loss: 5.639350 | 6.19336 seconds/epoch
INFO:Trainer:2024-13-05 20:56:45:[cuda:0] Epoch 200 | The model starts evaluation on the validation set.
0%| | 0/1754 [00:00 0.447000). Saving model ...
INFO:Trainer:2024-13-05 20:57:35:[cuda:0] Epoch 200 | Training checkpoint saved at ../../checkpoint/transr-200.pth
INFO:Trainer:2024-13-05 20:57:35:[cuda:0] Epoch [ 200/1000] | Batchsize: 2048 | loss: 4.212298 | 6.17439 seconds/epoch
INFO:Trainer:2024-13-05 20:57:35:[cuda:1] Epoch [ 200/1000] | Batchsize: 2048 | loss: 4.219561 | 6.17446 seconds/epoch
INFO:Trainer:2024-13-05 21:07:02:[cuda:0] Epoch 300 | The model starts evaluation on the validation set.
0%| | 0/1754 [00:00 0.460000). Saving model ...
INFO:Trainer:2024-13-05 21:18:10:[cuda:0] Epoch 400 | Training checkpoint saved at ../../checkpoint/transr-400.pth
INFO:Trainer:2024-13-05 21:18:10:[cuda:1] Epoch [ 400/1000] | Batchsize: 2048 | loss: 3.471031 | 6.17463 seconds/epoch
INFO:Trainer:2024-13-05 21:18:10:[cuda:0] Epoch [ 400/1000] | Batchsize: 2048 | loss: 3.458506 | 6.17454 seconds/epoch
INFO:Trainer:2024-13-05 21:27:40:[cuda:0] Epoch 500 | The model starts evaluation on the validation set.
0%| | 0/1754 [00:00