Contamination-Aware Differential Gene Expression Analysis#

Background#

Imaging-based spatial transcriptomics methods like Xenium In Situ enable single-molecule RNA measurement but face a critical challenge: transcript spill-over due to imperfect cell segmentation. Transcripts from neighboring cells can be incorrectly assigned to target cells, particularly in densely packed tissues.

Problem: When performing differential gene expression (DGE) analysis, we cannot distinguish whether detected genes are:

  • Genuinely upregulated in target cells, or

  • Artifacts from misassigned transcripts from neighbors

Solution: Compare cells not only against each other but also against their spatial neighborhoods to identify contamination artifacts.

%load_ext autoreload
%autoreload 2
from pathlib import Path
from insitupy import InSituData, CACHE
from insitupy.tools import dge
import pandas as pd

Load Data#

insitupy_project = Path(CACHE / "out/demo_insitupy_project")
xd = InSituData.read(insitupy_project)

xd.load_images()
xd.load_cells()
xd.load_annotations()

Contamination-Aware DGE Workflow#

The dge() function performs contamination-aware differential gene expression analysis in a single step. It:

  1. Compares target vs reference cell types (standard DGE)

  2. Assesses neighborhood contamination for both groups

  3. Integrates results to identify potential artifacts

Parameters:

  • consider_neighbors=True: Enables contamination analysis

  • exclude_ambiguous_assignments=True: Filters genes with conflicting signals

# Run contamination-aware DGE analysis
dge_results = dge(
    target=xd,
    target_cell_type_tuple=("cell_type_dc_sub_final", "Breast cancer subtype 4"),
    ref_cell_type_tuple=("cell_type_dc_sub_final", "Breast cancer subtype 1"),
    ref=None,
    method="wilcoxon",
    consider_neighbors=True,
    exclude_ambiguous_assignments=True
)
Calculate differentially expressed genes with Scanpy's `rank_genes_groups` using 'wilcoxon'.
Building spatial graph (radius=20, weighted=False)...
Selected 6445 'Breast cancer subtype 4' cells as targets
Excluding same-celltype neighbors from adjacency matrix...
Computing mean neighbor expression...
Computing gex_diff = gex_target - gex_neighbor ...
Running paired tests (wilcoxon) per gene and computing effect sizes...
Applying multiple-testing correction...

============================================================
ANALYSIS SUMMARY
============================================================
Target cell type: 'Breast cancer subtype 4'
Target cells: 6445/156447 (4.1%)
Strategy: mean
Genes tested: 297
Target cells without neighbors: 1340 (20.8%)
Median neighbors per target cell: 3.0
============================================================

Building spatial graph (radius=20, weighted=False)...
Selected 31016 'Breast cancer subtype 1' cells as targets
Excluding same-celltype neighbors from adjacency matrix...
Computing mean neighbor expression...
Computing gex_diff = gex_target - gex_neighbor ...
Running paired tests (wilcoxon) per gene and computing effect sizes...
Applying multiple-testing correction...

============================================================
ANALYSIS SUMMARY
============================================================
Target cell type: 'Breast cancer subtype 1'
Target cells: 31016/156447 (19.8%)
Strategy: mean
Genes tested: 297
Target cells without neighbors: 13137 (42.4%)
Median neighbors per target cell: 2.0
============================================================
# Display results
print("\n=== DGE Results Summary ===")
print(f"Total genes: {len(dge_results.main)}")
print(f"Significant upregulated:")
print(f"\t- Target vs. reference: {((dge_results.main['padj'] < 0.05) & (dge_results.main['log2foldchange'] > 0)).sum()}")
print(f"\t- Target vs. neighborhood: {((dge_results.target_neighborhood['padj'] < 0.05) & (dge_results.target_neighborhood['log2foldchange'] > 0)).sum()}")
print(f"\t- Reference vs. neighborhood: {((dge_results.ref_neighborhood['padj'] < 0.05) & (dge_results.ref_neighborhood['log2foldchange'] > 0)).sum()}")
=== DGE Results Summary ===
Total genes: 297
Significant upregulated:
	- Target vs. reference: 57
	- Target vs. neighborhood: 79
	- Reference vs. neighborhood: 85

Understanding the Results#

The dge() function returns a DiffExprResults object containing:

Main attributes:

  • main: Standard DGE results (target vs reference)

  • target_neighborhood: Target cells vs their spatial neighborhood

  • ref_neighborhood: Reference cells vs their spatial neighborhood

  • config: Analysis configuration and metadata

Key columns in results DataFrames:

Column

Description

log2foldchange

Log2 fold change between conditions

padj

FDR-adjusted p-value

Accessing results:

# Main DGE results
main_results = dge_results.main

# Neighborhood comparisons (if consider_neighbors=True)
if dge_results.has_neighbors():
    target_nb = dge_results.target_neighborhood  # Target vs neighbors
    ref_nb = dge_results.ref_neighborhood        # Reference vs neighbors

# Get all results as dictionary
all_results = dge_results.get_all_results()
print(all_results)
{'main':           log2foldchange           padj      scores  neg_log10_pvals
gene                                                                
SERPINA3        5.463246  1.000000e-300  101.212189            300.0
CEACAM6         3.637490  1.000000e-300   94.099846            300.0
AGR3            4.773612  1.000000e-300   84.277008            300.0
TACSTD2         1.096257  1.000000e-300   64.926903            300.0
ESR1            2.567329  1.000000e-300   50.270954            300.0
...                  ...            ...         ...              ...
TCIM           -1.277700  1.000000e-300  -49.620869            300.0
MDM2           -0.940504  1.000000e-300  -50.664825            300.0
FASN           -0.892925  1.000000e-300  -60.591518            300.0
KRT7           -0.575407  1.000000e-300  -61.039478            300.0
ABCC11         -1.645688  1.000000e-300  -62.645016            300.0

[297 rows x 4 columns], 'target_neighborhood':         n_target_cells  n_cells_used  n_cells_expressed  mean_target  \
gene                                                                   
ABCC11            6445          5105               2477     0.565558   
ACTA2             6445          5105               2680     0.600571   
ACTG2             6445          5105               3729     0.922865   
ADAM9             6445          5105               3792     0.962086   
ADGRE5            6445          5105                129     0.018626   
...                ...           ...                ...          ...   
VWF               6445          5105                245     0.037620   
WARS              6445          5105               2004     0.358330   
ZEB1              6445          5105                176     0.027682   
ZEB2              6445          5105                344     0.051499   
ZNF562            6445          5105               2836     0.558369   

        mean_neighbor  log2foldchange  log2foldchange_unpaired         pvalue  \
gene                                                                            
ABCC11       0.463177        0.147704                 0.368266   4.354693e-21   
ACTA2        0.895618       -0.425663                -0.815663  4.306899e-123   
ACTG2        0.968523       -0.065870                -0.107717   1.485075e-04   
ADAM9        0.741608        0.318083                 0.556858   2.947663e-81   
ADGRE5       0.071199       -0.075846                -1.972721   2.542418e-72   
...               ...             ...                      ...            ...   
VWF          0.091385       -0.077567                -1.319651   2.434527e-44   
WARS         0.371862       -0.019522                -0.063830   1.270472e-01   
ZEB1         0.159486       -0.190153                -2.622956  5.557927e-163   
ZEB2         0.248829       -0.284686                -2.418438  1.280223e-203   
ZNF562       0.472224        0.124282                 0.309204   1.206054e-18   

                 padj  contamination_score  
gene                                        
ABCC11   8.290665e-21            -0.221925  
ACTA2   3.197873e-122             0.607594  
ACTG2    1.876882e-04             0.064398  
ADAM9    1.435174e-80            -0.299489  
ADGRE5   1.094345e-71             0.639369  
...               ...                  ...  
VWF      6.573223e-44             0.563628  
WARS     1.382161e-01             0.042594  
ZEB1    9.170579e-162             1.489268  
ZEB2    3.456602e-202             1.879130  
ZNF562   2.170897e-18            -0.188773  

[297 rows x 10 columns], 'ref_neighborhood':         n_target_cells  n_cells_used  n_cells_expressed  mean_target  \
gene                                                                   
ABCC11           31016         17879              14751     1.211318   
ACTA2            31016         17879               9639     0.525261   
ACTG2            31016         17879              13011     0.858179   
ADAM9            31016         17879              12893     0.852588   
ADGRE5           31016         17879                446     0.018370   
...                ...           ...                ...          ...   
VWF              31016         17879               1830     0.079330   
WARS             31016         17879               8557     0.440672   
ZEB1             31016         17879               1116     0.049014   
ZEB2             31016         17879               2163     0.095106   
ZNF562           31016         17879              11201     0.646623   

        mean_neighbor  log2foldchange  log2foldchange_unpaired         pvalue  \
gene                                                                            
ABCC11       0.665242        0.787822                 1.319181   0.000000e+00   
ACTA2        1.005323       -0.692584                -1.326549   0.000000e+00   
ACTG2        0.854385        0.005473                 0.009515   3.502186e-01   
ADAM9        0.799903        0.076008                 0.135203   3.145965e-19   
ADGRE5       0.143971       -0.181204                -3.062200   0.000000e+00   
...               ...             ...                      ...            ...   
VWF          0.277080       -0.285294                -1.951254   0.000000e+00   
WARS         0.463612       -0.033096                -0.091005   2.581637e-06   
ZEB1         0.330657       -0.406326                -2.963661   0.000000e+00   
ZEB2         0.623493       -0.762302                -3.116667   0.000000e+00   
ZNF562       0.482131        0.237311                 0.553255  1.573691e-189   

                 padj  contamination_score  
gene                                        
ABCC11   0.000000e+00            -0.600786  
ACTA2    0.000000e+00             1.107672  
ACTG2    3.586721e-01            -0.005712  
ADAM9    3.893131e-19            -0.079791  
ADGRE5   0.000000e+00             1.530831  
...               ...                  ...  
VWF      0.000000e+00             1.590893  
WARS     2.808594e-06             0.061213  
ZEB1     0.000000e+00             2.726766  
ZEB2     0.000000e+00             3.907118  
ZNF562  3.595280e-189            -0.317846  

[297 rows x 10 columns]}

Interpretation patterns:

When neighborhood analysis is enabled (consider_neighbors=True):

Target vs Reference (.main)

Target vs Neighbors (.target_neighborhood)

Reference vs Neighbors (.ref_neighborhood)

Interpretation

Upregulated (sig.)

Upregulated or NS

Any

Genuine upregulation

Upregulated (sig.)

Downregulated (sig.)

Any

Possible contamination in target cells

Downregulated (sig.)

Any

Downregulated (sig.)

Possible contamination in reference cells

Downregulated (sig.)

Any

Upregulated or NS

Genuine downregulation

Saving and loading results:

from insitupy.dataclasses.results import DiffExprResults

# Save complete analysis
dge_results.save("out/dge_results", overwrite=True)

# Load saved results
dge_results_reloaded = DiffExprResults.read("out/dge_results")

# Quick summary
print(dge_results_reloaded.summary())
Warning: Overwriting existing directory 'out\dge_results'.
Main DGE results: 297 genes
Neighbor comparison (A): 297 genes
Neighbor comparison (B): 297 genes
Configuration:
<bound method DiffExprConfigCollector.__repr__ of DiffExprConfigCollector(
  General:
    mode: single-cell
    method_params: {'groupby': 'DGE_COMPARISON_COLUMN', 'reference': 'REFERENCE', 'method': 'wilcoxon', 'use_raw': False, 'corr_method': 'benjamini-hochberg'}
    cells_layer: None
    exclude_ambiguous_assignments: True
  Target:
    annotation: None
    cell_type: Breast cancer subtype 4
    region: None
    cell_number: 6445
    name: None
    metadata: None
  Reference:
    annotation: None
    cell_type: Breast cancer subtype 1
    region: None
    cell_number: 31016
    name: None
    metadata: None
)>

Visualization#

Volcano Plot#

Standard visualization showing significance vs fold change.

from insitupy.plotting import dual_foldchange_plot, volcano
volcano(dge_results)
../../_images/1f5f6ddcb3ef0fafca23cdb93d51c0f238abc36375d534cdcdc332d46f54bb0f.png

Dual Fold Change Plot - Key visualization for contamination-aware differential gene expression analysis#

Plots standard DGE (x-axis) against neighborhood contamination (y-axis) for both target and reference populations:

Target panel (left):

  • Upper half: Target cells express more than their neighbors → genuine expression

  • Lower half: Target cells express less than their neighbors → potential contamination from neighbors

Reference panel (right):

  • Upper half: Reference cells express more than their neighbors → genuine expression

  • Lower half: Reference cells express less than their neighbors → potential contamination from neighbors

Interpretation:

  • Genes in upper-right of target panel + positive main DGE: Genuine upregulation in target

  • Genes in lower-right of target panel + positive main DGE: Likely contamination artifact in target

  • Genes in upper-right of reference panel + negative main DGE: Genuine upregulation in reference (= downregulation in target)

dual_foldchange_plot(
    dge_results,
    foldchange_threshold=2,
    adjust_labels=True,
    size_by_pvalue=True,
    show_nonsignificant=True
)
../../_images/b51579a2f9059793f3a28fc667a23c2435c641524cc52cf7985f7f5ef163070c.png

Using show_config one can also display more information on the cells used for the differential gene expression analysis. The savepath argument can be used to save the plot.

dual_foldchange_plot(
    dge_results,
    foldchange_threshold=2,
    adjust_labels=True,
    size_by_pvalue=True,
    show_nonsignificant=True,
    show_config=True,
    savepath="figures/dual_foldchange_config.pdf"
)
../../_images/c90a3ade894bdfd94b9e2d3df91aa7cebf665aa9177ff5c87ead705cfbed1090.png