Tutorial: Exporting and Importing AnnData with InSituExperiment#

This tutorial demonstrates how to:

  1. Export multiple samples from an InSituExperiment to a single AnnData object using to_anndata()

  2. Perform batch correction and integration analysis using standard scanpy/scvi-tools workflows

  3. Re-import the results back into the InSituExperiment using import_from_anndata()

  4. Visualize and analyze the integrated results spatially

Workflow Overview#

InSituExperiment → to_anndata() → AnnData → Analysis → import_from_anndata() → InSituExperiment

This workflow allows you to:

  • Integrate multiple spatial samples

  • Apply external analysis functions that rely on a single AnnData object (e.g. batch correction, clustering, niche analysis)

  • Transfer results back for spatial visualization and further processing within the InSituPy ecosystem.

Setup and Data Loading#

## The following code ensures that all functions and init files are reloaded before executions.
%load_ext autoreload
%autoreload 2
from insitupy._core.data import InSituData, CACHE
from insitupy import InSituExperiment
from pathlib import Path

Load Xenium Data#

Load preprocessed Xenium data containing spatial transcriptomics information with cell segmentation and annotations.

# Update this path to your data location
data_path = Path(CACHE / "out/demo_insitupy_project")
xd = InSituData.read(data_path)
xd.load_all(skip="transcripts")  # Load all data except transcript coordinates
xd
InSituData
Method:		Xenium
Slide ID:	0001879
Sample ID:	Replicate 1
Path:		C:\Users\ge37voy\.cache\InSituPy\out\demo_insitupy_project

    ➤ images
       CD20:	(25778, 35416)
       HE:	(25778, 35416, 3)
       HER2:	(25778, 35416)
       nuclei:	(25778, 35416)
    ➤ cells
       MultiCellData with main layer 'main'
           matrix
               AnnData object with n_obs × n_vars = 156447 × 297
               obs: 'transcript_counts', 'control_probe_counts', 'control_codeword_counts', 'total_counts', 'cell_area', 'nucleus_area', 'n_genes_by_counts', 'n_genes', 'leiden', 'cell_type_dc', 'cell_type_dc_sub', 'cell_type_tacco', 'cell_type_publ_x', 'cell_type_publ_y', 'cell_type_dc_sub_final', 'cell_type_publ'
               var: 'gene_ids', 'feature_types', 'genome', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells'
               uns: 'cell_type_dc_colors', 'cell_type_dc_sub', 'cell_type_dc_sub_colors', 'cell_type_dc_sub_final_colors', 'cell_type_publ_colors', 'cell_type_tacco_colors', 'counts_location', 'leiden', 'leiden_colors', 'log1p', 'neighbors', 'pca', 'umap'
               obsm: 'OT', 'X_pca', 'X_umap', 'annotations', 'ora_estimate', 'ora_pvals', 'regions', 'spatial'
               varm: 'OT', 'PCs'
               layers: 'counts', 'norm_counts'
               obsp: 'connectivities', 'distances'
           boundaries
               BoundariesData object with 2 entries:
                   cells
                   nuclei annotations
       TestKey:	5 annotations, 2 classes ('TestClass', 'test') ✔
       demo:	28 annotations, 2 classes ('Stroma', 'Tumor cells') ✔
       demo2:	5 annotations, 3 classes ('Negative', 'Other', 'Positive') ✔
       demo3:	7 annotations, 5 classes ('Immune cells', 'Necrosis', 'Stroma', 'Tumor', 'unclassified') ✔
       Demo:	28 annotations, 2 classes ('Stroma', 'Tumor cells') ✔
       Janesick:	18 annotations, 3 classes ('DCIS #1', 'DCIS #2', 'Invasive') 
       Katja:	18 annotations, 4 classes ('DCIS', 'DCIS intermediate', 'DCIS with stromal reaction', 'Invasive') ✔
    ➤ regions
       demo_regions:	3 regions, 3 classes ('Region1', 'Region2', 'Region3') ✔
       TMA:	6 regions, 6 classes ('A-1', 'A-2', 'A-3', 'B-1', 'B-2', 'B-3') ✔
       Demo:	3 regions, 3 classes ('Region 1', 'Region 2', 'Region 3') ✔
       Katja:	4 regions, 4 classes ('Region 1', 'Region 2', 'Region 3', 'Region 4') ✔

Create InSituExperiment from dummy TMA cores.

exp = InSituExperiment.from_regions(
    data=xd,
    region_key="TMA"  # Column in annotations containing region IDs
)
A-1
A-2
A-3
B-1
B-2
B-3
exp
InSituExperiment with 6 samples:
           uid  CITAR slide_id    sample_id region_key region_name
0     3f5f4e9d  ++-++  0001879  Replicate 1        TMA         A-1
1     c7db290b  ++-++  0001879  Replicate 1        TMA         A-2
2     28bd6cfc  ++-++  0001879  Replicate 1        TMA         A-3
3     2583f2e0  ++-++  0001879  Replicate 1        TMA         B-1
4     66bd0a34  ++-++  0001879  Replicate 1        TMA         B-2
5     2fdb42a4  ++-++  0001879  Replicate 1        TMA         B-3
# View metadata to understand your samples
exp.metadata
You are accessing a copy of the metadata. Changes to this DataFrame will not affect the internal metadata. Use `add_metadata_column()` or `append_metadata()` to add new information to the metadata.
uid slide_id sample_id region_key region_name
0 3f5f4e9d 0001879 Replicate 1 TMA A-1
1 c7db290b 0001879 Replicate 1 TMA A-2
2 28bd6cfc 0001879 Replicate 1 TMA A-3
3 2583f2e0 0001879 Replicate 1 TMA B-1
4 66bd0a34 0001879 Replicate 1 TMA B-2
5 2fdb42a4 0001879 Replicate 1 TMA B-3

Export to AnnData#

We’ll export all samples into a single AnnData object. This allows us to perform integrated analysis across samples.

adata = exp.to_anndata(
    cells_layer=None,  # Use default layer
    label_col="uid",  # Use UID to label samples
    obs_keys="all",  # Export all cell-level annotations
    var_keys="all",  # Export all gene-level annotations
    varm_keys="all",
    layer_keys=["counts"],  # Export counts layer
)

adata
AnnData object with n_obs × n_vars = 17695 × 297
    obs: 'transcript_counts', 'control_probe_counts', 'control_codeword_counts', 'total_counts', 'cell_area', 'nucleus_area', 'n_genes_by_counts', 'n_genes', 'leiden', 'cell_type_dc', 'cell_type_dc_sub', 'cell_type_tacco', 'cell_type_publ_x', 'cell_type_publ_y', 'cell_type_dc_sub_final', 'cell_type_publ', 'leiden_integrated', 'uid'
    var: 'gene_ids', 'feature_types', 'genome', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells'
    varm: 'OT', 'PCs'
    layers: 'counts'
# Verify the sample labels are present
print("Sample distribution:")
print(adata.obs['uid'].value_counts())
Sample distribution:
uid
3f5f4e9d    4847
2583f2e0    3569
2fdb42a4    3405
66bd0a34    2301
28bd6cfc    1807
c7db290b    1766
Name: count, dtype: int64

Dimensionality Reduction (Before Batch Correction)#

Let’s first see how the samples look without batch correction.

import scanpy as sc
import matplotlib.pyplot as plt
sc.pp.highly_variable_genes(adata=adata, batch_key="uid")
# PCA
sc.tl.pca(adata, use_highly_variable=True)

# Visualize PCA colored by sample
sc.pl.pca(adata, color='uid', title='PCA before batch correction')
../../_images/d098019394c0e682f7ad84fadc0e8fda3741219e5bd7d138d3ffd6096bcf5b08.png
# UMAP (uncorrected)
sc.pp.neighbors(adata, n_neighbors=15, n_pcs=30)
sc.tl.umap(adata)

# Store uncorrected UMAP
adata.obsm['X_umap_uncorrected'] = adata.obsm['X_umap'].copy()
sc.pl.umap(adata, color=['uid', 'cell_type_dc_sub_final'])
../../_images/6eecb6014f9c286225b6e941fe826b56ad13dea746a84140f1c3bf800fa9d245.png

Batch Correction#

Now we’ll perform batch correction. Here we show multiple methods - choose the one that works best for your data.

Fast method for demonstration: Harmony#

# Install harmony if needed: pip install harmonypy
import harmonypy as hm

# Run Harmony on PCA
harmony_out = hm.run_harmony(
    adata.obsm['X_pca'],
    adata.obs,
    'uid',  # Batch key
    max_iter_harmony=20
)

# Store Harmony-corrected embeddings
adata.obsm['X_pca_harmony'] = harmony_out.Z_corr.T

print("Harmony correction completed!")
Harmony correction completed!
# Compute UMAP on Harmony-corrected data
sc.pp.neighbors(adata, n_neighbors=15, n_pcs=30, use_rep='X_pca_harmony')
sc.tl.umap(adata)

# Store Harmony UMAP
adata.obsm['X_umap_harmony'] = adata.obsm['X_umap'].copy()
sc.pl.umap(adata, color=['uid', 'cell_type_dc_sub_final'])
../../_images/3cd7aaa66f51e15c93974dad7d9a0fd1924de99e23078b2f87d706621c2d943b.png

Clustering on Batch-Corrected Data#

# Leiden clustering on corrected data
sc.tl.leiden(adata, resolution=0.5, key_added='leiden_integrated')

print(f"Number of clusters: {adata.obs['leiden_integrated'].nunique()}")
Number of clusters: 8
# Visualize corrected UMAP
fig, axes = plt.subplots(1, 3, figsize=(18, 4))

sc.pl.umap(adata, color='uid', title='After Harmony correction (by sample)',
           ax=axes[0], show=False)
sc.pl.umap(adata, color='cell_type_dc_sub_final', title='After Harmony correction (by original cell type)',
           ax=axes[1], show=False, #legend_loc='on data', legend_fontsize=6
           )
sc.pl.umap(adata, color='leiden_integrated', title='After Harmony correction (integrated clusters)',
           ax=axes[2], show=False, #legend_loc='on data', legend_fontsize=6
           )

plt.tight_layout()
plt.show()
../../_images/8470710dfbbb8cc1f5bde312f94f751838b6fdb3f04a8a7214423c25d7539a4f.png

Compare Before and After Batch Correction#

# Side-by-side comparison
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Before correction
sc.pl.embedding(adata, basis='X_umap_uncorrected', color='uid',
                title='Before correction - by sample', ax=axes[0, 0], show=False)
sc.pl.embedding(adata, basis='X_umap_uncorrected', color='cell_type_dc_sub_final',
                title='Before correction - by cell type', ax=axes[0, 1], show=False,
                #legend_loc='on data', legend_fontsize=6
                )

# After correction
sc.pl.embedding(adata, basis='X_umap_harmony', color='uid',
                title='After correction - by sample', ax=axes[1, 0], show=False)
sc.pl.embedding(adata, basis='X_umap_harmony', color='leiden_integrated',
                title='After correction - integrated clusters', ax=axes[1, 1], show=False,
                #legend_loc='on data', legend_fontsize=6
                )

plt.tight_layout()
plt.show()
../../_images/0d22d353810e7cd20aa114d38156b4a4c81fc01ccf2406aaad547c6b4dd27f16.png

Re-import Results into InSituExperiment#

Now we’ll transfer the integrated results back to the original InSituExperiment for spatial visualization.

# Define what to import back
obs_columns_to_import = [
    'leiden_integrated',  # New cluster assignments
    # Add any other new annotations you want to transfer
]

obsm_keys_to_import = [
    'X_pca_harmony',      # Harmony-corrected PCA
    'X_umap_harmony',     # Harmony-corrected UMAP
    'X_umap_uncorrected', # Uncorrected UMAP for comparison
    # 'X_scvi',           # If using scVI
]
# Import back into InSituExperiment
exp.import_from_anndata(
    adata=adata,
    uid_column='uid',              # Column in exp.metadata
    uid_column_adata='uid',        # Column in adata.obs (the label_col from to_anndata)
    obs_columns_to_transfer=obs_columns_to_import,
    obsm_keys_to_transfer=obsm_keys_to_import,
    cells_layer=None,              # Import to default layer
    overwrite=True                 # Overwrite if keys already exist
)

print("✓ Successfully imported results back into InSituExperiment!")
✓ Successfully imported results back into InSituExperiment!

Visualize Integrated Results#

Now we can visualize the batch-corrected results in their spatial context.

# Synchronize colors for consistent visualization across samples
exp.sync_colors(
    keys=['leiden_integrated', 'cell_type'],
    cells_layer=None,
    overwrite=True
)
Synchronized colors for key 'leiden_integrated' and palette 'tab20_mod'.
# Plot UMAPs across all samples
exp.plot_umaps(
    cells_layer=None,
    color='leiden_integrated',
    title_column='uid',
    max_cols=4,
    figsize=(6, 5)
)
../../_images/75fe3f726b31a5c7a0e88b66c3cdae935e42a4a7b0689aeae0d2974366e14938.png
# Compare corrected vs uncorrected UMAP for a specific sample
sample_idx = 0  # Change this to view different samples
xd = exp.data[sample_idx]

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Uncorrected UMAP
sc.pl.embedding(
    xd.cells.matrix,
    basis='X_umap_uncorrected',
    color='cell_type_dc_sub_final',
    title=f'Sample {sample_idx}: Before correction',
    ax=axes[0],
    show=False
)

# Corrected UMAP
sc.pl.embedding(
    xd.cells.matrix,
    basis='X_umap_harmony',
    color='cell_type_dc_sub_final',
    title=f'Sample {sample_idx}: After correction',
    ax=axes[1],
    show=False
)

plt.tight_layout()
plt.show()
../../_images/eefda29e6e027c54f2b16c57f144c7628b882bca3c13fce4115c611ab25cf421.png
# Spatial visualization of integrated clusters
exp.show(0)