Exploratory Data Analysis on RSNA Pneumonia Dataset

Rahul S
Nov 19, 2022

The dataset consists of a CSV file of labels and a set of chest radiograph (CXR) images. Each CSV row holds a patient ID together with the x and y coordinates of the upper-left corner of a bounding box and the height and width of the box. The CSV file also contains the class label (target variable) indicating whether the patient has pneumonia.
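As a minimal sketch of how such a label file can be inspected, the snippet below parses a tiny in-memory CSV with the Python standard library. The column names (patientId, x, y, width, height, Target) are assumed, and the three sample rows are made up for illustration rather than taken from the dataset.

```python
import csv
import io
from collections import Counter

# Hypothetical sample: one negative patient and one positive patient with two boxes.
SAMPLE_CSV = """patientId,x,y,width,height,Target
p001,,,,,0
p002,100.0,150.0,200.0,300.0,1
p002,600.0,160.0,220.0,310.0,1
"""


def summarize_labels(csv_text):
    """Count rows, unique patients, repeated rows and positive rows."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    per_patient = Counter(row["patientId"] for row in rows)
    positives = sum(1 for row in rows if row["Target"] == "1")
    return {
        "rows": len(rows),
        "unique_patients": len(per_patient),
        # Rows beyond the first one for each patient ID.
        "repeated_rows": len(rows) - len(per_patient),
        "positive_rows": positives,
        "positive_share": positives / len(rows) if rows else 0.0,
    }


summary = summarize_labels(SAMPLE_CSV)
print(summary)
```

The same function could be pointed at the contents of the real labels file, provided its header matches the assumed column names.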

SUMMARY

• The RSNA CXR dataset stores X-ray images in DICOM format; the train labels file has 30227 records (26684 unique patients plus 3543 repeated entries).

• There are three classes: lung opacity (31.61%), no lung opacity / not normal (39.11%), and normal (29.28%).

• For the binary target, 31.61% of the entries are pneumonia and 68.39% are non-pneumonia.

• Bounding boxes for patients with pneumonia are defined in the train labels file, which has 9555 positive entries. Each X-ray image also carries metadata describing the patient, the view position, and so on. The 3543 repeated patient IDs correspond to patients with more than one bounding box, since each row describes a single box and each patient ID maps to a single image.

• Number of images in the train set: 26684
Number of unique patient IDs in the CSV file: 26684
Number of images in the test set: 3000
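The figures in this summary can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers quoted above and assumes that 30227 counts the entries in the train labels file and that the percentages are taken over those entries; the variable names are illustrative.

```python
# Figures quoted in the summary above.
TOTAL_ENTRIES = 30227     # assumed to be entries in the train labels file
TRAIN_IMAGES = 26684      # also the number of unique patient IDs
TEST_IMAGES = 3000
POSITIVE_ENTRIES = 9555

# Entries beyond the first one for each patient ID.
repeated_entries = TOTAL_ENTRIES - TRAIN_IMAGES

# Share of entries labelled as pneumonia, in percent.
positive_pct = round(100 * POSITIVE_ENTRIES / TOTAL_ENTRIES, 2)

# Images across the train and test sets.
total_images = TRAIN_IMAGES + TEST_IMAGES

print(repeated_entries)  # 3543
print(positive_pct)      # 31.61
print(total_images)      # 29684
```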

TRAIN LABELS FILE

1) Each record in the train_labels table contains:

1. patientId — A patientId. Each patientId corresponds to a unique image.

2. x — the upper-left x coordinate of the bounding box.
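Because a box is anchored at its upper-left corner, other representations follow from simple arithmetic. The following is a minimal sketch: the `BoundingBox` class is a hypothetical helper, the `y`, `width` and `height` fields are assumed to accompany `x` in each record, and the sample values are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    """Box stored as upper-left corner plus size (hypothetical helper)."""
    x: float       # upper-left x coordinate
    y: float       # upper-left y coordinate (assumed same convention as x)
    width: float
    height: float

    def corners(self):
        """Return (x_min, y_min, x_max, y_max)."""
        return (self.x, self.y, self.x + self.width, self.y + self.height)

    def center(self):
        """Return the (x, y) coordinates of the box center."""
        return (self.x + self.width / 2, self.y + self.height / 2)


box = BoundingBox(x=100.0, y=150.0, width=200.0, height=300.0)
print(box.corners())  # (100.0, 150.0, 300.0, 450.0)
print(box.center())   # (200.0, 300.0)
```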
