DDP/GPU

Topic	Replies	Views	Activity
How to ensure all ranks flush their caches during training using DeepSpeed Stage3	2	5157	May 25, 2023
Manual Optimization with Deepspeed	0	366	May 19, 2023
Module not able to find parameters requiring a gradient	1	2102	May 5, 2023
Is it possible to run part of the model in deepspeed/fsdp and rest in ddp	1	701	April 28, 2023
Lack of documentation on deepspeed / fsdp	0	844	April 24, 2023
Converting deepspeed checkpoints to fp32 checkpoint	2	2039	April 22, 2023
FSDP for both pretrained teacher and trainable student	4	1200	April 18, 2023
How to implement the Dataset or Data module to achieve the following goals?	0	201	April 15, 2023
Validation sanity check hangs after `all_gather`	2	3361	March 31, 2023
DDP and pl.LightningDataModule parallelization Issues	1	692	March 29, 2023
Single-Node multi-GPU Deepspeed training fails with cuda OOM on Azure	0	1865	March 24, 2023
Parallelizing batchsize-1 fully-convolutional training on multiple GPUs (one triplet per GPU)	1	537	March 15, 2023
DistributedDataParallel multi GPU barely faster than single GPU	2	1811	March 10, 2023
RAM Held by workers after validation	1	675	March 10, 2023
SLURM Runtime Error due to "ntasks" variable	3	2625	March 6, 2023
Running DDP across two machines	3	1444	March 3, 2023
Multi-GPU/Multi-Node training with WebDataset	3	5101	March 2, 2023
Try... except statement with DDPSpawn	2	530	February 24, 2023
Cannot pickle torch._C.Generator object — Multi-GPU training	2	2824	February 20, 2023
End all distributed process after ddp	4	2303	February 10, 2023
Rank_zero_only Callback in ddp	2	3040	January 30, 2023
Multi-GPU, TorchMetrics, incorrect aggregation	0	537	January 24, 2023
Multi-GPU training issue - DDP strategy. Training hangs upon distributed GPU initialisation	3	4186	January 18, 2023
How to apply multiple GPUs on not `training_step`?	3	1042	January 4, 2023
RuntimeError: Cannot re-initialize CUDA in forked subprocess	6	8539	December 15, 2022
0/1% GPU Utilization when using 1 GPU, but Higher GPU Utilization with 2+ GPUS	0	1394	December 8, 2022
FullyShardedDataParallel no memory decrease	7	1884	December 8, 2022
Multi-GPU training crashes after some time due to NVLink error (xid74)	2	1751	November 26, 2022
Difference between the checkpoint val_cer and real val_cer on the validation set	0	459	November 15, 2022
How to propagate errors async in distributed training	1	1065	November 10, 2022