📊 Data description
Dataset overview
The REG² challenge dataset consists of whole slide images (WSIs) and chain-of-thought (CoT) data paired with each image. Participants are tasked with generating pathology reports based on WSIs, and their models are evaluated on the reasoning process underlying report generation.
🔬 Modalities & formats
Figure 1. Dataset example showing a paired WSI and its chain-of-thought (CoT) data.
WSIs are provided as TIFF files and have been preprocessed to anonymize patient information and retain only 20x magnification images. The WSIs were sourced from institutions in Korea, Türkiye, Japan, India, and Germany. The original scanners used by each institution are listed below.
| Institution | Scanner |
|---|---|
| Korea University Medical Center | Aperio & Generic TIFF |
| Kameda | Generic TIFF |
| Memorial Health Group | Aperio |
| All India Institute of Medical Sciences (AIIMS) | Hamamatsu NanoZoomer |
| University Hospital Cologne | Aperio |
| TUM Hospital | Leica GT450DX |
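Because the slides come from several scanners but are all delivered as TIFF, a quick header check on downloaded files can catch truncated or mislabeled files early. The sketch below relies only on the TIFF file header (byte-order mark followed by the magic number 42 for classic TIFF or 43 for BigTIFF). The function name `tiff_variant` is hypothetical and not part of any challenge tooling.

```python
import struct

def tiff_variant(path):
    """Return 'TIFF', 'BigTIFF', or None based on the 4-byte file header."""
    with open(path, "rb") as f:
        header = f.read(4)
    if len(header) < 4:
        return None
    # Bytes 0-1 give the byte order: 'II' is little-endian, 'MM' is big-endian.
    if header[:2] == b"II":
        fmt = "<H"
    elif header[:2] == b"MM":
        fmt = ">H"
    else:
        return None
    # Bytes 2-3 hold the magic number: 42 for classic TIFF, 43 for BigTIFF.
    (magic,) = struct.unpack(fmt, header[2:4])
    return {42: "TIFF", 43: "BigTIFF"}.get(magic)
```

A return value of `None` means the file does not start with a valid TIFF header and is worth downloading again before any further processing.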
Pathology reports are structured texts derived from actual pathological diagnostic records. Each report includes fields such as organ, procedure, histologic type, and histologic grade. Reports are standardized according to the College of American Pathologists (CAP) protocol, and diagnoses follow the histologic type and subtype nomenclature defined by the WHO Classification of Tumours.
CoT data consists of a series of question-answer (Q&A) pairs modeled on the way pathologists actually write reports. Each Q&A pair is followed by the next question in a logical sequence, and the series concludes with a question whose answer is the final report. Both the pathology reports and the CoT data are provided as JSON objects.
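To make the Q&A structure concrete, here is a minimal sketch of walking one CoT record in order and taking the last answer as the report. The JSON schema is not specified in this description, so every key name (`case_id`, `cot`, `question`, `answer`) and the record content itself are illustrative assumptions rather than the challenge's actual format.

```python
import json

# Hypothetical record: every key name here ("case_id", "cot", "question",
# "answer") is an illustrative assumption, not the challenge's real schema.
RAW = """
{
  "case_id": "example-0001",
  "cot": [
    {"question": "Which organ is shown?", "answer": "Breast"},
    {"question": "What procedure was performed?", "answer": "Core needle biopsy"},
    {"question": "What is the histologic type?",
     "answer": "Invasive breast carcinoma of no special type"},
    {"question": "What is the final report?",
     "answer": "Breast, core needle biopsy: invasive breast carcinoma of no special type"}
  ]
}
"""

def qa_steps(record):
    """Return the ordered (question, answer) pairs of one case."""
    return [(step["question"], step["answer"]) for step in record["cot"]]

def final_report(record):
    """Return the answer to the last question, which yields the report."""
    steps = qa_steps(record)
    if not steps:
        raise ValueError("record contains no Q&A pairs")
    return steps[-1][1]

record = json.loads(RAW)
for number, (question, answer) in enumerate(qa_steps(record), start=1):
    print(f"{number}. {question} -> {answer}")
print("Report:", final_report(record))
```

Keeping the steps as an ordered list preserves the logical sequence of questions. The released files may use a different layout, so the key names should be adjusted once the real schema is known.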
📂 Data characteristics & splits
The total dataset comprises about 12K cases spanning 7 organs and covering a diverse range of diagnostic categories, including malignant, pre-malignant, benign, and non-neoplastic entities.
| Organ | Representative Diagnostic Categories |
|---|---|
| Breast | Invasive breast carcinoma of no special type, Invasive lobular carcinoma, Ductal carcinoma in situ, Fibroepithelial tumor, Papillary neoplasm, etc. |
| Colon | Adenocarcinoma, Tubular adenoma with low grade dysplasia, Tubular adenoma with high grade dysplasia, Hyperplastic polyp, Chronic active colitis, etc. |
| Lung | Adenocarcinoma, Squamous cell carcinoma, Non-small cell carcinoma NOS, Small cell carcinoma, Chronic granulomatous inflammation, etc. |
| Prostate | Acinar adenocarcinoma, Small cell carcinoma, Malignant lymphoma, Chronic granulomatous inflammation, Normal, etc. |
| Stomach | Tubular adenocarcinoma, Poorly cohesive carcinoma, Mixed adenocarcinoma, Tubular adenoma with low grade dysplasia, Chronic gastritis, etc. |
| Urinary Bladder | Invasive urothelial carcinoma, Non-invasive papillary urothelial carcinoma, Urothelial carcinoma in situ, Chronic granulomatous inflammation, etc. |
| Uterus | Squamous cell carcinoma, Adenocarcinoma, Squamous intraepithelial lesion, Endometrioid carcinoma, Leiomyoma, etc. |
Table 1. Representative diagnostic categories by organ in the REG² dataset
The dataset is divided into three splits: Training, Test Phase 1, and Test Phase 2.
| Split | Cases |
|---|---|
| Training | about 12,000 |
| Test Phase 1 | 350 |
| Test Phase 2 | 70 |
The exact number of cases is subject to change during the challenge.
Data license
The dataset is released under the CC-BY-NC-SA license (Creative Commons Attribution-NonCommercial-ShareAlike).

