Exploratory Data Analysis on RSNA Pneumonia Dataset

Rahul S
6 min read · Nov 19, 2022

The dataset consists of labelled data in a CSV file and chest radiograph (CXR) images. The CSV lists patient IDs with the x and y coordinates of the upper-left corner of each bounding box, along with the box's width and height. It also contains the class label (target variable) indicating whether the patient has pneumonia.


• The RSNA CXR dataset has 30227 labelled records, and its X-ray images are in DICOM format.

• There are three classes: 31.61% lung opacity, 39.11% no lung opacity (not normal), and 29.28% normal.

• For the binary target, 31.61% of records are in the pneumonia class and 68.39% are in the non-pneumonia class.

• Bounding boxes for patients with pneumonia are defined in the train labels file, which holds 9555 positive entries. Each X-ray image also has associated metadata describing the patient, the view position, and so on. The file has 3543 entries that repeat a patient ID already listed; because each patientId corresponds to a single image, these repeats most likely mark several bounding boxes on the same image rather than different X-ray views of the same patient.

• Number of images in the train set: 26684
• Number of unique patient IDs in the CSV file: 26684
• Number of images in the test set: 3000
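The proportions and counts in the list above can be reproduced with a few pandas calls. The sketch below is only illustrative: it uses a small made-up table in place of the real label file, so the values and printed figures are assumptions. Only the column names patientId and Target come from the dataset description.

```python
import pandas as pd

# Toy stand-in for the train labels CSV (values invented for illustration).
labels = pd.DataFrame(
    {
        "patientId": ["p1", "p2", "p2", "p3", "p4"],
        "Target": [0, 1, 1, 0, 0],
    }
)

# Share of each Target value across all rows, as percentages.
target_pct = labels["Target"].value_counts(normalize=True).mul(100).round(2)

# Rows versus distinct patients; the difference is the number of repeated IDs.
n_rows = len(labels)
n_patients = labels["patientId"].nunique()
n_repeated = n_rows - n_patients

print(target_pct.to_dict())
print(n_rows, n_patients, n_repeated)
```

On the real file, the same calls would be run on the DataFrame returned by pd.read_csv for the labels CSV.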


1) Each record in the train_labels table contains:

1. patientId — A patientId. Each patientId corresponds to a unique image.

2. x — the upper-left x coordinate of the bounding box.

3. y — the upper-left y coordinate of the bounding box.

4. width — the width of the bounding box.

5. height — the height of the bounding box.

6. Target — the binary Target, indicating whether this sample has evidence of pneumonia. (Either 0 or 1 for absence or presence of pneumonia, respectively)
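Because a box is stored as an upper-left corner plus a size, other quantities such as the lower-right corner, the center, or the area have to be derived. The following is a minimal sketch of that arithmetic; the BoundingBox class and the sample values are hypothetical and are not part of the dataset or of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    """Hypothetical helper: a box stored as upper-left corner plus size."""

    x: float       # upper-left x coordinate
    y: float       # upper-left y coordinate
    width: float
    height: float

    @property
    def lower_right(self) -> tuple[float, float]:
        # Move from the upper-left corner by the full width and height.
        return (self.x + self.width, self.y + self.height)

    @property
    def center(self) -> tuple[float, float]:
        # Move from the upper-left corner by half the width and height.
        return (self.x + self.width / 2, self.y + self.height / 2)

    @property
    def area(self) -> float:
        return self.width * self.height


box = BoundingBox(x=100.0, y=200.0, width=50.0, height=80.0)  # made-up values
print(box.lower_right)  # (150.0, 280.0)
print(box.center)       # (125.0, 240.0)
print(box.area)         # 4000.0
```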

2) There are no fully duplicated records, although a patientId can appear in more than one row.
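A duplicate check can mean two different things here: rows that match in every column, or rows that merely share a patientId. The sketch below shows both checks on a made-up table (the values are assumptions; the column names follow the list above).

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_labels (values invented for illustration).
labels = pd.DataFrame(
    {
        "patientId": ["p1", "p2", "p2", "p3"],
        "x": [np.nan, 10.0, 300.0, np.nan],
        "y": [np.nan, 20.0, 40.0, np.nan],
        "width": [np.nan, 50.0, 60.0, np.nan],
        "height": [np.nan, 80.0, 90.0, np.nan],
        "Target": [0, 1, 1, 0],
    }
)

# Rows that repeat an earlier row in every column.
full_row_duplicates = int(labels.duplicated().sum())

# Rows whose patientId has already appeared, regardless of the other columns.
repeated_patient_rows = int(labels.duplicated(subset="patientId").sum())

print(full_row_duplicates, repeated_patient_rows)
```

In this toy table the first count is zero while the second is not, because patient p2 has two distinct boxes.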

3) Four columns contain many NaN values, but they follow a pattern: in every record where any value of the tuple (x, y, width, height) is NaN, the other three are NaN as well.
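One way to verify this pattern is to count the missing box values per row and confirm that the count is always 0 or 4. The sketch below does this on a small made-up table; the data values are assumptions, and only the column names come from the text.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_labels (values invented for illustration).
labels = pd.DataFrame(
    {
        "patientId": ["p1", "p2", "p2", "p3"],
        "x": [np.nan, 10.0, 300.0, np.nan],
        "y": [np.nan, 20.0, 40.0, np.nan],
        "width": [np.nan, 50.0, 60.0, np.nan],
        "height": [np.nan, 80.0, 90.0, np.nan],
    }
)

box_cols = ["x", "y", "width", "height"]

# Number of missing box values in each row.
missing_per_row = labels[box_cols].isna().sum(axis=1)

# True only if every row has either no missing box values or all four missing.
all_or_none = bool(missing_per_row.isin([0, len(box_cols)]).all())

# Rows with no bounding box at all.
rows_without_box = int((missing_per_row == len(box_cols)).sum())

print(all_or_none, rows_without_box)
```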

4) This is plausible. x and y form a pair that gives a location, and width and height form a pair that gives a size. Together, the four values define the bounds of the abnormality in the lung, so either all of them are NaN or none is. If the values in the tuple (x, y, width, height) are NaN, then the value in the…