DiGSeg

Overview

What is DiGSeg?

TL;DR: DiGSeg repurposes a pretrained diffusion model into a single generalist segmenter, driven by text prompts for semantic and open-vocabulary segmentation — reaching state-of-the-art on standard benchmarks and transferring across domains (medical, remote sensing, agriculture) with no domain-specific architecture changes.

State-of-the-art segmentation

SOTA on standard semantic segmentation, with strong open-vocabulary and cross-domain transfer.

Diffusion priors for segmentation

Repurposes a pretrained diffusion U-Net, conditioned on image latents and a CLIP-aligned text pathway.

Generation meets understanding

One diffusion backbone generalizes across tasks and domains—no per-domain architecture changes.

Figure (teaser): DiGSeg turns a pretrained diffusion model into one generalist segmenter covering semantic, open-vocabulary, and cross-domain segmentation.

Abstract

Diffusion models are primarily trained for image synthesis, yet their denoising trajectories encode rich, spatially aligned visual priors. In this paper, we demonstrate that these priors can be utilized for text-conditioned semantic and open-vocabulary segmentation, and this approach can be generalized to various downstream tasks to make a general-purpose diffusion segmentation framework. Concretely, we introduce DiGSeg (Diffusion Models as a Generalist Segmentation Learner), which repurposes a pretrained diffusion model into a unified segmentation framework. Our approach encodes the input image and ground-truth mask into the latent space and concatenates them as conditioning signals for the diffusion U-Net. A parallel CLIP-aligned text pathway injects language features across multiple scales, enabling the model to align textual queries with evolving visual representations. This design transforms an off-the-shelf diffusion backbone into a universal interface that produces structured segmentation masks conditioned on both appearance and arbitrary text prompts. Extensive experiments demonstrate state-of-the-art performance on standard semantic segmentation benchmarks, as well as strong open-vocabulary generalization and cross-domain transfer to medical, remote sensing, and agricultural scenarios—without domain-specific architectural customization. These results indicate that modern diffusion backbones can serve as generalist segmentation learners rather than pure generators, narrowing the gap between visual generation and visual understanding.
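
The conditioning scheme described in the abstract (image and mask encoded into latent space, then concatenated for the U-Net) can be illustrated with simple shape bookkeeping. The sketch below is not the authors' code. It assumes a Stable Diffusion v1/v2 style autoencoder with 4 latent channels and 8x spatial downsampling, which the page does not confirm, and the helper names `latent_shape` and `concat_channel_shape` are hypothetical.

```python
def latent_shape(height, width, channels=4, factor=8):
    """Shape (C, H, W) of a VAE latent for an image of the given size.

    ASSUMPTION: 4 latent channels and 8x downsampling, as in the
    Stable Diffusion v1/v2 autoencoder; the page does not state which
    variant DiGSeg uses.
    """
    if height % factor or width % factor:
        raise ValueError("image size must be divisible by the VAE factor")
    return (channels, height // factor, width // factor)


def concat_channel_shape(*shapes):
    """Shape after concatenating (C, H, W) tensors along the channel axis."""
    spatial = {s[1:] for s in shapes}
    if len(spatial) != 1:
        raise ValueError("spatial sizes must match to concatenate")
    h, w = spatial.pop()
    return (sum(s[0] for s in shapes), h, w)


image_latent = latent_shape(512, 512)  # encoded RGB image
mask_latent = latent_shape(512, 512)   # encoded segmentation mask
unet_input = concat_channel_shape(image_latent, mask_latent)
print(unet_input)  # (8, 64, 64)
```

Under these assumptions, a 512² input yields an 8-channel, 64×64 tensor at the U-Net input, so only the first convolution has to widen to accept the extra channels.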

Capabilities

See it in action

Semantic & open-vocabulary segmentation on ADE20K-150/847, PASCAL-Context and COCO — the colored masks are direct DiGSeg outputs.

Cross-domain transfer to BDD100K & Cityscapes driving scenes — dense urban street semantics with no driving-specific architecture changes.

Road & structure extraction on DeepGlobe aerial / satellite imagery — a domain far from the diffusion backbone's training data.

Medical image segmentation on REFUGE2 and MoNuSeg — optic disc / cup and nuclei delineation across retinal fundus and histopathology imagery.

Crop & plant phenotyping on PhenoBench — fine-grained agricultural segmentation under heavy occlusion.

Results

Benchmarks

A single diffusion backbone sets a new state of the art on open-vocabulary and standard semantic segmentation, and is the strongest generalist method on the remote-sensing (DeepGlobe) and medical (MoNuSeg) benchmarks. The headline comparisons summarize the main wins; full per-method tables follow. Higher is better unless noted.
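
Most numbers in this section are mean Intersection-over-Union (mIoU). As a reminder of what the metric measures, here is a minimal pure-Python sketch that builds a confusion matrix from flattened label maps and averages per-class IoU. It is a generic illustration with toy labels, not the evaluation code behind the reported results; the function names are hypothetical.

```python
def confusion_matrix(pred, target, num_classes):
    """cm[t][p] counts pixels with ground-truth class t predicted as p."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for p, t in zip(pred, target):
        cm[t][p] += 1
    return cm


def mean_iou(cm):
    """Average IoU over classes that appear in prediction or ground truth."""
    ious = []
    for c in range(len(cm)):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                  # class-c pixels that were missed
        fp = sum(row[c] for row in cm) - tp   # pixels wrongly labelled c
        denom = tp + fp + fn
        if denom:  # skip classes absent from both maps
            ious.append(tp / denom)
    return sum(ious) / len(ious)


# Toy example: six pixels, three classes.
target = [0, 0, 1, 1, 2, 2]
pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(pred, target, num_classes=3)
print(round(mean_iou(cm), 3))  # 0.5
```

Per-class IoUs here are 1/3, 2/3 and 1/2, which average to 0.5; benchmark tables report the same quantity scaled to a percentage.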

Headline comparisons

Open-vocabulary, ADE20K-150 (mIoU ↑)

Method	mIoU
CAT-Seg	31.5
Mask-Adapter	38.2
HyperCLIP	38.2
ESC-Net	41.8
DiGSeg (Ours)	43.2

CLIP ViT-L/14; +1.4 over best prior

Open-vocabulary, PASCAL-Context-59 (mIoU ↑)

Method	mIoU
CAT-Seg	62.0
DPSeg	62.3
HyperCLIP	64.2
ESC-Net	65.6
DiGSeg (Ours)	68.4

CLIP ViT-L/14; +2.8 over best prior

Semantic, COCO-Stuff (mIoU ↑)

Method	mIoU
SegFormer-B5	46.7
VWFormer-B5	48.0
SegMAN	48.2
EoMT	48.7
DiGSeg (Ours)	50.8

512² input; +2.1 over best prior

Semantic, ADE20K (mIoU ↑)

Method	mIoU
VWFormer-B5	54.7
OneFormer	57.0
EoMT	57.1
Mask2Former-Swin-L	57.3
DiGSeg (Ours)	58.6

512² input; +1.3 over best prior

Full benchmark tables

Improvement rows show DiGSeg's gain over the best comparable prior method. Rows marked task-specific (shown in gray on the project page) are specialist models listed for reference, not part of the generalist comparison.
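
The Improvement rows are a simple difference against the best prior score in each column. A small sketch of that arithmetic, using the A-150 column of Table 1a as input (the helper name `improvement` is hypothetical):

```python
# A-150 mIoU of prior methods, copied from Table 1a.
a150_priors = {
    "ODISE": 28.7, "OVSeg": 29.6, "SAN": 29.9, "SCAN": 33.5,
    "CAT-Seg": 31.5, "MAFT+": 36.1, "SED": 35.2, "Mask-Adapter": 38.2,
    "Seg4Diff": 35.2, "HyperCLIP": 38.2, "OVSNet": 37.1, "DPSeg": 37.1,
    "SemLA": 36.9, "ESC-Net": 41.8,
}
digseg_a150 = 43.2


def improvement(ours, priors):
    """Return (best prior method, gain of `ours` over it), rounded to 0.1."""
    best = max(priors, key=priors.get)
    return best, round(ours - priors[best], 1)


print(improvement(digseg_a150, a150_priors))  # ('ESC-Net', 1.4)
```

The same calculation reproduces the other Improvement entries when the column is restricted to the comparable methods, for example generalist methods only in Table 3.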

Table 1a. Open-vocabulary segmentation benchmarks (mIoU, %), CLIP ViT-L/14 block

Method	VLM	Backbone	Training Dataset	A-847	PC-459	A-150	PC-59	Cityscapes
ODISE	CLIP ViT-L/14	Stable Diffusion	COCO-Panoptic	11.0	13.8	28.7	55.3	–
OVSeg	CLIP ViT-L/14	Swin-B	COCO-Stuff	9.0	12.4	29.6	55.7	–
SAN	CLIP ViT-L/14	Side Adapter	COCO-Stuff	12.4	15.7	29.9	51.8	–
SCAN	CLIP ViT-L/14	Swin-B	COCO-Stuff	14.0	16.7	33.5	59.3	–
CAT-Seg	CLIP ViT-L/14	–	COCO-Stuff	16.0	23.8	31.5	62.0	–
MAFT+	CLIP ConvNeXt-L	–	COCO-Stuff	15.1	21.6	36.1	59.4	–
SED	CLIP ConvNeXt-L	–	COCO-Stuff	13.9	22.6	35.2	60.6	–
Mask-Adapter	CLIP ConvNeXt-L	–	COCO-Stuff	16.2	22.7	38.2	60.4	37.9
Seg4Diff	CLIP ViT-L/14	Stable Diffusion	COCO-Stuff	–	–	35.2	51.2	26.0
HyperCLIP	CLIP ViT-L/14	–	COCO-Stuff	16.3	24.1	38.2	64.2	–
OVSNet	CLIP ViT-L/14	ResNet-101c	COCO-Stuff	16.2	23.5	37.1	62.0	–
DPSeg	CLIP ConvNeXt-L	–	COCO-Stuff	15.7	24.1	37.1	62.3	–
SemLA	CLIP ConvNeXt-L	–	COCO-Stuff	–	–	36.9	62.2	–
ESC-Net	CLIP ViT-L/14	–	COCO-Stuff	18.1	27.0	41.8	65.6	–
DiGSeg (ours)	CLIP ViT-L/14	Stable Diffusion	COCO-Stuff	19.9	29.2	43.2	68.4	38.5
Improvement	–	–	–	+1.8	+2.2	+1.4	+2.8	+0.6

Table 1b. Open-vocabulary segmentation benchmarks (mIoU, %), CLIP ViT-B/16 block

Method	VLM	Backbone	Training Dataset	A-847	PC-459	A-150	PC-59	Cityscapes
OVSeg	CLIP ViT-B/16	ResNet-101c	COCO-Stuff	7.1	11.0	24.8	53.3	–
SCAN	CLIP ViT-B/16	Swin-B	COCO-Stuff	10.8	13.2	30.8	58.4	–
EBSeg	CLIP ViT-B/16	SAM ViT-B	COCO-Stuff	11.1	17.3	30.0	56.7	–
SED	CLIP ConvNeXt-B	–	COCO-Stuff	11.4	18.6	31.6	57.3	–
CAT-Seg	CLIP ViT-B/16	–	COCO-Stuff	12.0	19.0	31.8	57.5	–
OPMapper	CLIP ViT-B/16	Swin-B	COCO-Stuff	–	–	31.0	58.3	–
ESC-Net	CLIP ViT-B/16	–	COCO-Stuff	13.3	21.1	35.6	59.0	–
HyperCLIP	CLIP ViT-B/16	–	COCO-Stuff	12.3	19.2	32.1	58.5	–
Mask-Adapter	CLIP ConvNeXt-B	–	COCO-Stuff	14.2	17.9	35.6	58.4	35.2
DPSeg	CLIP ConvNeXt-B	–	COCO-Stuff	12.5	20.1	33.3	58.4	–
DiGSeg (ours)	CLIP ViT-B/16	Stable Diffusion	COCO-Stuff	17.5	23.1	37.2	62.7	36.5
Improvement	–	–	–	+3.3	+2.0	+1.6	+3.7	+1.3

Table 2. Semantic segmentation

Method	COCO input size	COCO mIoU	ADE20K input size	ADE20K mIoU
SegFormer-B5	512^2	46.7	640^2	51.8
Mask2Former-Swin-L	–	–	640^2	57.3
OneFormer	–	–	640^2	57.0
LDMSeg	–	–	512^2	52.2
PEM	–	–	512^2	45.5
FeedFormer-B2	–	–	512^2	48.0
CGRSeg-L	512^2	46.0	512^2	48.3
VWFormer-B5	512^2	48.0	512^2	54.7
SegMAN	512^2	48.2	512^2	53.2
EoMT	512^2	48.7	512^2	57.1
OffSeg-L	512^2	46.0	512^2	48.5
MambaVision-B	–	–	512^2	49.1
DiGSeg (ours)	512^2	50.8	512^2	58.6
Improvement	–	+2.1	–	+1.3

Table 3. DeepGlobe road segmentation

Method	IoU_road	Precision	Recall	F1
DDCTNet (task-specific)	64.27	79.02	78.10	78.24
U-Net (task-specific)	62.94	–	–	–
D-LinkNet (task-specific)	63.00	–	–	–
CoANet (task-specific)	60.65	76.98	72.34	74.61
SGCN (task-specific)	53.92	72.95	68.25	72.31
DeepLabv3 (task-specific)	61.97	77.26	74.09	75.64
Segroad (task-specific)	66.23	–	–	–
CGC-Net (task-specific)	68.80	82.67	80.39	81.51
SegMAN	48.12	70.25	68.10	69.16
EoMT	52.52	72.88	71.35	72.10
OffSeg	51.75	72.10	70.85	71.46
MambaVision	57.28	75.32	74.50	74.91
DiGSeg (ours)	65.78	79.93	78.92	78.79
Improvement	+8.50	+4.61	+4.42	+3.88

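Rows marked task-specific (gray on the project page) are methods designed specifically for road or remote-sensing extraction and are not part of the generalist comparison. DiGSeg leads all generalist methods, with +8.50 IoU over the best one (MambaVision, 57.28), and exceeds most specialists, although CGC-Net (68.80) and Segroad (66.23) remain higher.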
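
For readers less familiar with the four columns of Table 3, the sketch below shows how IoU, precision, recall and F1 follow from pixel counts of a binary road mask. The counts are made up for illustration, and the function name is hypothetical. Reported benchmark numbers may be averaged per image, so they need not satisfy these identities exactly at the dataset level.

```python
def binary_metrics(tp, fp, fn):
    """IoU, precision, recall and F1 for the positive (road) class.

    tp: road pixels predicted as road
    fp: background pixels predicted as road
    fn: road pixels predicted as background
    """
    iou = tp / (tp + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return iou, precision, recall, f1


# Illustrative counts only, not taken from any benchmark.
iou, precision, recall, f1 = binary_metrics(tp=80, fp=20, fn=20)
print(round(iou, 4), precision, recall, round(f1, 4))
```

With these counts precision, recall and F1 are all 0.8 while IoU is about 0.667, which is why IoU values in the table sit well below the corresponding F1 scores.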

Table 4. Ablation on E-Step scheduling

E-Step	COCO mIoU	ADE20K mIoU	FPS
1x1	48.2	56.8	11.27
1x2	48.5	57.1	10.61
4x2	50.5	58.5	5.82
8x2	50.8	58.6	3.15
20x50 (w/o trailing)	50.9	58.8	0.12
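
Table 4 is an accuracy-versus-throughput trade-off. The sketch below only re-expresses the table's own numbers as per-image latency and mIoU gain relative to the fastest setting; it makes no assumption about what an E-Step setting does internally, and the variable names are hypothetical.

```python
# (setting, COCO mIoU, ADE20K mIoU, FPS), copied from Table 4.
rows = [
    ("1x1", 48.2, 56.8, 11.27),
    ("1x2", 48.5, 57.1, 10.61),
    ("4x2", 50.5, 58.5, 5.82),
    ("8x2", 50.8, 58.6, 3.15),
    ("20x50 (w/o trailing)", 50.9, 58.8, 0.12),
]

base_name, base_coco, base_ade, base_fps = rows[0]
summary = {}
for name, coco, ade, fps in rows:
    summary[name] = {
        "latency_ms": 1000.0 / fps,           # time per image
        "slowdown": base_fps / fps,           # relative to the 1x1 setting
        "coco_gain": round(coco - base_coco, 1),
        "ade_gain": round(ade - base_ade, 1),
    }

for name, s in summary.items():
    print(f"{name}: {s['latency_ms']:.0f} ms, x{s['slowdown']:.1f} slower, "
          f"+{s['coco_gain']} COCO, +{s['ade_gain']} ADE20K")
```

Read this way, 8x2 costs roughly 3.6x the latency of 1x1 for +2.6 COCO and +1.8 ADE20K mIoU, while the 20x50 setting adds at most 0.2 mIoU on top of 8x2 at about 26x the latency of 8x2.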

Table 5. MoNuSeg segmentation

Method	Dice	mIoU
MedT (task-specific)	79.55	66.17
SegDiff (task-specific)	81.59	69.00
cDAL (task-specific)	82.94	70.96
SAMPO (task-specific)	81.83	69.25
SegMAN	74.17	58.94
EoMT	74.82	59.77
SAM2	44.51	29.81
SAM3	51.56	45.75
DiGSeg	78.27	64.29
DiGSeg w/ threshold search	79.84	66.44
DiGSeg w/ BiomedCLIP	84.56	73.25

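Rows marked task-specific (gray on the project page) are medical-specialist models. Among generalist methods DiGSeg is strongest (78.27 Dice vs. 74.82 for EoMT), and with a BiomedCLIP encoder it surpasses the specialists as well (84.56 Dice vs. 82.94 for cDAL).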
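
Table 5 reports Dice alongside IoU and includes a "threshold search" variant. The page does not describe how that search is done; one common reading, assumed here, is to pick the binarization threshold for the predicted foreground probabilities that maximizes Dice on held-out data. The sketch below illustrates that reading with toy values and hypothetical function names. It is not the authors' procedure.

```python
def dice_and_iou(pred, target):
    """Dice and IoU for two binary masks given as flat 0/1 sequences."""
    tp = sum(1 for p, t in zip(pred, target) if p and t)
    fp = sum(1 for p, t in zip(pred, target) if p and not t)
    fn = sum(1 for p, t in zip(pred, target) if not p and t)
    if tp + fp + fn == 0:  # both masks empty: treat as a perfect match
        return 1.0, 1.0
    return 2 * tp / (2 * tp + fp + fn), tp / (tp + fp + fn)


def best_threshold(probs, target, candidates):
    """ASSUMED procedure: choose the candidate threshold with highest Dice."""
    def dice_at(th):
        return dice_and_iou([int(p >= th) for p in probs], target)[0]
    return max(candidates, key=dice_at)


# Toy foreground probabilities and ground truth for six pixels.
probs = [0.9, 0.7, 0.45, 0.35, 0.2, 0.1]
target = [1, 1, 1, 0, 0, 0]

th = best_threshold(probs, target, candidates=[0.3, 0.4, 0.5, 0.6])
print(th)  # 0.4
print(dice_and_iou([int(p >= 0.5) for p in probs], target))
```

For a single mask the two metrics are tied by Dice = 2·IoU / (1 + IoU), so a threshold that improves one also improves the other. In the toy case the default 0.5 cut misses one foreground pixel, and lowering the threshold to 0.4 recovers it.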
