
PyTorch DataLoader error "DataLoader worker (PID XXX) is killed by signal": solutions

2022-01-31 15:42:04 · by Why


When using PyTorch's DataLoader with num_workers set to anything other than 0, errors like the ones above can appear. This article documents two solutions to such errors.

DataLoader - num_workers

  • PyTorch's data loading module DataLoader has a parameter num_workers, which sets the number of worker processes used to load data; you can think of it as the number of workers preparing data for the network;

  • So if data loading is complex, having more workers naturally saves a lot of loading time: they load data in parallel while the network is training, and when a training step finishes the next batch is taken straight from memory. With num_workers greater than 1, data loading is therefore accelerated; in the best case there are enough workers that the network never has to wait for data at all;

  • Using more than 1 worker takes up more memory and CPU, and also more shared memory (/dev/shm);

  • Using more than 1 worker spawns multiple worker processes (see the sketch after this list).
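
As a minimal sketch of how num_workers is passed to a DataLoader (the random tensors and batch size here are placeholders for illustration, not from the original post):

import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset standing in for a real one
dataset = TensorDataset(torch.randn(1000, 3, 32, 32), torch.randint(0, 10, (1000,)))

# num_workers=0 loads batches in the main process;
# num_workers=4 spawns 4 worker processes that load batches in parallel
loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=4)

for images, labels in loader:
    pass  # training step goes here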

Problem description

Given how num_workers works, two kinds of errors can occur (I ran into both):

  • Insufficient shared memory:
RuntimeError: DataLoader worker (pid XXX) is killed by signal: Bus error
  • A segmentation fault in a worker leads to a deadlock, so the program hangs with blocked workers:
ERROR: Unexpected segmentation fault encountered in worker.

or

RuntimeError: DataLoader worker (pid 4499) is killed by signal: Segmentation fault.

or

RuntimeError: DataLoader worker (pid(s) ****) exited unexpectedly

Here are the solutions to these two problems.

Problem 1: RuntimeError: DataLoader worker (pid XXX) is killed by signal: Bus error

Cause

  • This problem generally occurs inside Docker: Docker's default shared memory is only 64 MB, so with many workers the space runs out and the error is raised.

Solution

1 The workaround (give up the speedup)
  • Set num_workers to 0
2 The real fix
  • When creating the Docker container, configure a larger shared memory by adding the parameter --shm-size="15g", which allocates 15 GB of shared memory (adjust to your situation):
nvidia-docker run -it --name [container_name] --shm-size="15g" ...
  • Check it with df -h:
# df -h
Filesystem                                          Size  Used Avail Use% Mounted on
overlay                                             3.6T  3.1T  317G  91% /
tmpfs                                                64M     0   64M   0% /dev
tmpfs                                                63G     0   63G   0% /sys/fs/cgroup
/dev/sdb1                                           3.6T  3.1T  317G  91% /workspace/tmp
shm                                                  15G  8.1G  7.0G  54% /dev/shm
tmpfs                                                63G   12K   63G   1% /proc/driver/nvidia
/dev/sda1                                           219G  170G   39G  82% /usr/bin/nvidia-smi
udev                                                 63G     0   63G   0% /dev/nvidia3
tmpfs                                                63G     0   63G   0% /proc/acpi
tmpfs                                                63G     0   63G   0% /proc/scsi
tmpfs                                                63G     0   63G   0% /sys/firmware
  • where shm is the shared memory space.
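
If you want to sanity-check the available shared memory from inside the container before choosing num_workers, a small Python sketch like this works (the one-worker-per-GB budget is just an illustrative assumption, not a rule from the original post):

import shutil

# /dev/shm is the shared-memory mount that DataLoader worker processes
# use to pass loaded batches back to the main process
total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm: {free / 2**30:.1f} GB free of {total / 2**30:.1f} GB")

# Illustrative heuristic (assumption): budget roughly 1 GB per worker, capped at 8
num_workers = max(0, min(8, int(free / 2**30)))
print(f"Using num_workers={num_workers}")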

Problem 2: RuntimeError: DataLoader worker (pid(s) ****) exited unexpectedly

Cause

  • Because the DataLoader loads data with multiple workers in parallel, other multi-threaded operations in the program can interact badly with it; such nested threading easily ends in a deadlock;
  • The exact trigger depends on the environment; in my case it was the mixture of OpenCV's internal multithreading with the DataLoader;
  • My OpenCV version was 3.4.2; the same code ran without problems on OpenCV 4.2.0.34.

Solution

1 The workaround (give up the speedup)
  • Set num_workers to 0
2 The real fix
  • Disable OpenCV's multithreading in the dataset's __getitem__ method:
def __getitem__(self, idx):
    import cv2
    cv2.setNumThreads(0)  # disable OpenCV's internal thread pool inside each worker
    ...
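
For context, here is a minimal, self-contained sketch of where that call sits in a custom dataset; the file list, image size, and tensor conversion are placeholders for illustration, not from the original post:

import cv2
import torch
from torch.utils.data import Dataset, DataLoader

class ImageDataset(Dataset):
    def __init__(self, image_paths):
        self.image_paths = image_paths  # list of image file paths (placeholder)

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        # Turn off OpenCV's internal threading inside each worker process
        # so it does not interfere with the DataLoader's own parallelism
        cv2.setNumThreads(0)
        img = cv2.imread(self.image_paths[idx])
        img = cv2.resize(img, (224, 224))
        return torch.from_numpy(img).permute(2, 0, 1).float() / 255.0

# Example usage (paths is a list of image files):
# loader = DataLoader(ImageDataset(paths), batch_size=16, num_workers=4)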


Copyright notice
Author [Why]. Please include a link to the original when reprinting, thank you.
https://en.pythonmana.com/2022/01/202201311542017465.html
