📊 Data description

Dataset overview

The REG² challenge dataset consists of whole slide images (WSIs) and chain-of-thought (CoT) data paired with each image. Participants are tasked with generating pathology reports based on WSIs, and their models are evaluated on the reasoning process underlying report generation.


🔬 Modalities & formats

Figure 1. Dataset example: a WSI paired with its chain-of-thought data.

WSIs are provided as TIFF files and have been preprocessed to anonymize patient information and retain only 20x magnification images. The WSIs were sourced from institutions in Korea, Türkiye, Japan, India, and Germany. The original scanners used by each institution are listed below.

| Institution | Scanner |
| --- | --- |
| Korea University Medical Center | Aperio & Generic TIFF |
| Kameda | Generic TIFF |
| Memorial Health Group | Aperio |
| All India Institute of Medical Sciences (AIIMS) | Hamamatsu NanoZoomer |
| University Hospital Cologne | Aperio |
| TUM Hospital | Leica GT450DX |
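
Since the WSIs are distributed as TIFF files, a quick header check can confirm that a downloaded file is intact before it is passed to a slide reader. The sketch below uses only the Python standard library and the TIFF and BigTIFF magic bytes. The function names and the example path are hypothetical, and whether the challenge files are classic TIFF or BigTIFF is not stated here, so both are accepted.

```python
def classify_tiff_header(header: bytes):
    """Classify the first four bytes of a file as classic TIFF, BigTIFF, or neither."""
    # Classic TIFF: byte-order mark ("II" little-endian, "MM" big-endian) plus magic number 42.
    if header[:4] in (b"II*\x00", b"MM\x00*"):
        return "classic"
    # BigTIFF: same byte-order marks, magic number 43.
    if header[:4] in (b"II+\x00", b"MM\x00+"):
        return "bigtiff"
    return None


def tiff_flavor(path: str):
    """Read a file's header and report its TIFF flavor (hypothetical helper)."""
    with open(path, "rb") as f:
        return classify_tiff_header(f.read(4))


# Example usage (the path is a placeholder, not an official file name):
# print(tiff_flavor("slides/example_case.tiff"))
```

A result of `None` indicates that the file is not a TIFF at all, which usually points to a truncated or failed download.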

Pathology reports are structured texts derived from actual pathological diagnostic records. Each report includes fields such as organ, procedure, histologic type, and histologic grade. Reports are standardized according to the College of American Pathologists (CAP) protocol, and diagnoses follow the histologic type and subtype nomenclature defined by the WHO Classification of Tumours.

CoT data consists of a series of question-answer (Q&A) pairs modeled on the way pathologists actually write reports. The questions follow a logical sequence, with each Q&A pair leading to the next question, and the series concludes with a question whose answer is the final report. Both the pathology reports and the CoT data are provided as JSON objects.
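
To make the Q&A structure concrete, here is a minimal sketch of how one CoT record might be parsed and split into its intermediate reasoning steps and its final report. The schema is an assumption made for illustration: the field names (`case_id`, `cot`, `question`, `answer`) and the sample values are hypothetical and are not taken from the official data, so check them against the released JSON files.

```python
import json

# Hypothetical record: the field names and values below are illustrative
# assumptions, not the official REG² schema or real case data.
raw = """
{
  "case_id": "example_0001",
  "cot": [
    {"question": "Which organ is shown?", "answer": "Stomach"},
    {"question": "Which procedure was performed?", "answer": "Endoscopic biopsy"},
    {"question": "What is the histologic type?", "answer": "Tubular adenocarcinoma"},
    {"question": "What is the final report?",
     "answer": "Stomach, endoscopic biopsy; Tubular adenocarcinoma"}
  ]
}
"""


def split_cot(record):
    """Return (intermediate Q&A pairs, final report text) for one CoT record."""
    steps = record["cot"]
    if not steps:
        raise ValueError("CoT record has no Q&A pairs")
    # The last Q&A pair is assumed to hold the final report; the rest are reasoning steps.
    *reasoning, final = steps
    return reasoning, final["answer"]


record = json.loads(raw)
reasoning, report = split_cot(record)
for i, qa in enumerate(reasoning, start=1):
    print(f"Step {i}: {qa['question']} -> {qa['answer']}")
print("Final report:", report)
```

Keeping the reasoning steps separate from the final answer makes it easier to inspect the two independently, which matters here because models are evaluated on the reasoning process as well as the report.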


📂 Data characteristics & splits

The total dataset comprises about 12K cases spanning 7 organs and covering a diverse range of diagnostic categories, including malignant, pre-malignant, benign, and non-neoplastic entities.

| Organ | Representative Diagnostic Categories |
| --- | --- |
| Breast | Invasive breast carcinoma of no special type, Invasive lobular carcinoma, Ductal carcinoma in situ, Fibroepithelial tumor, Papillary neoplasm, etc. |
| Colon | Adenocarcinoma, Tubular adenoma with low grade dysplasia, Tubular adenoma with high grade dysplasia, Hyperplastic polyp, Chronic active colitis, etc. |
| Lung | Adenocarcinoma, Squamous cell carcinoma, Non-small cell carcinoma NOS, Small cell carcinoma, Chronic granulomatous inflammation, etc. |
| Prostate | Acinar adenocarcinoma, Small cell carcinoma, Malignant lymphoma, Chronic granulomatous inflammation, Normal, etc. |
| Stomach | Tubular adenocarcinoma, Poorly cohesive carcinoma, Mixed adenocarcinoma, Tubular adenoma with low grade dysplasia, Chronic gastritis, etc. |
| Urinary Bladder | Invasive urothelial carcinoma, Non-invasive papillary urothelial carcinoma, Urothelial carcinoma in situ, Chronic granulomatous inflammation, etc. |
| Uterus | Squamous cell carcinoma, Adenocarcinoma, Squamous intraepithelial lesion, Endometrioid carcinoma, Leiomyoma, etc. |

Table 1. Diagnostic Categories and Number of Cases by Organ in the REG² Dataset

The dataset is divided into three splits: Training, Test Phase 1, and Test Phase 2.

| Split | Cases |
| --- | --- |
| Training | about 12,000 |
| Test Phase 1 | 350 |
| Test Phase 2 | 70 |

The exact number of cases is subject to change during the challenge.

Data license

The dataset is released under the CC-BY-NC-SA license (Creative Commons Attribution-NonCommercial-ShareAlike).