Using RTX 4000 series which doesn't support faster communication speedups. Ensuring P2P and IB communications are disabled.
INFO:distributed_c10d:2024-12-05 16:58:22:Added key: store_based_barrier_key:1 to store for rank: 1
INFO:distributed_c10d:2024-12-05 16:58:23:Added key: store_based_barrier_key:1 to store for rank: 0
INFO:distributed_c10d:2024-12-05 16:58:23:Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
INFO:distributed_c10d:2024-12-05 16:58:23:Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
INFO:Trainer:2024-12-05 16:58:24:[cuda:1] Initialization completed, start model training.
INFO:Trainer:2024-12-05 16:58:24:[cuda:0] Initialization completed, start model training.
INFO:distributed:2024-12-05 16:58:26:Reducer buckets have been rebuilt in this iteration.
INFO:distributed:2024-12-05 16:58:26:Reducer buckets have been rebuilt in this iteration.
INFO:Trainer:2024-12-05 17:04:38:[cuda:0] Epoch 100 | The model starts evaluation on the validation set.
0%| | 0/69 [00:00 0.460000). Saving model ...
INFO:Trainer:2024-12-05 17:04:45:[cuda:0] Epoch 100 | Training checkpoint saved at ../../checkpoint/transh-100.pth
INFO:Trainer:2024-12-05 17:04:45:[cuda:0] Epoch [ 100/1000] | Batchsize: 8192 | loss: 0.499170 | 3.74025 seconds/epoch
INFO:Trainer:2024-12-05 17:04:45:[cuda:1] Epoch [ 100/1000] | Batchsize: 8192 | loss: 0.498411 | 3.74026 seconds/epoch
INFO:Trainer:2024-12-05 17:10:11:[cuda:0] Epoch 200 | The model starts evaluation on the validation set.
0%| | 0/69 [00:00 0.484000). Saving model ...
INFO:Trainer:2024-12-05 17:10:18:[cuda:0] Epoch 200 | Training checkpoint saved at ../../checkpoint/transh-200.pth
INFO:Trainer:2024-12-05 17:10:18:[cuda:1] Epoch [ 200/1000] | Batchsize: 8192 | loss: 0.373016 | 3.53623 seconds/epoch
INFO:Trainer:2024-12-05 17:10:18:[cuda:0] Epoch [ 200/1000] | Batchsize: 8192 | loss: 0.374192 | 3.53624 seconds/epoch
INFO:Trainer:2024-12-05 17:15:53:[cuda:0] Epoch 300 | The model starts evaluation on the validation set.
0%| | 0/69 [00:00 0.497000). Saving model ...
INFO:Trainer:2024-12-05 17:21:41:[cuda:0] Epoch 400 | Training checkpoint saved at ../../checkpoint/transh-400.pth
INFO:Trainer:2024-12-05 17:21:41:[cuda:1] Epoch [ 400/1000] | Batchsize: 8192 | loss: 0.243966 | 3.47454 seconds/epoch
INFO:Trainer:2024-12-05 17:21:41:[cuda:0] Epoch [ 400/1000] | Batchsize: 8192 | loss: 0.242802 | 3.47454 seconds/epoch
INFO:Trainer:2024-12-05 17:27:06:[cuda:0] Epoch 500 | The model starts evaluation on the validation set.
0%| | 0/69 [00:00