Convert data from SpatialData into InSituPy

Setup and Imports#

# Enable autoreload for development
%load_ext autoreload
%autoreload 2

Make sure `SpatialData` is installed#

If it is not installed yet, install it with:

pip install "spatialdata[extra]"

Make sure the installed version is >=0.7.2. For more information on installing SpatialData, see the SpatialData installation documentation.
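To confirm the version requirement programmatically, a small check like the following can be run before the imports. It is a minimal sketch using only the standard library. The helper names `parse_version` and `meets_minimum` are hypothetical and not part of SpatialData or InSituPy, and the parser only looks at the leading numeric release segments, so a pre-release such as `0.7.2rc1` is treated like `0.7.2`.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_version(text):
    """Turn '0.7.2' or '0.7.2.post1' into a tuple of ints, e.g. (0, 7, 2)."""
    parts = []
    for segment in text.split("."):
        digits = ""
        for char in segment:
            if not char.isdigit():
                break
            digits += char
        if not digits:
            break  # stop at the first non-numeric segment such as 'post1'
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum="0.7.2"):
    """Return True if the installed version is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)


try:
    installed = version("spatialdata")
except PackageNotFoundError:
    print("spatialdata is not installed")
else:
    status = "OK" if meets_minimum(installed) else "too old, please upgrade"
    print(f"spatialdata {installed}: {status}")
```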

from pathlib import Path

from insitupy import InSituData, CACHE
from insitupy.spatialdata import convert_from_spatialdata

Load SpatialData#

First, let’s load a SpatialData object. We’ll use the spatialdata_io package to read data from common spatial transcriptomics platforms like Xenium or MERSCOPE.

In this tutorial, we’ll use a Xenium dataset as an example. If you have not downloaded the demo datasets yet, check out the demo dataset tutorial to learn how to do so.

from spatialdata_io import xenium

# Path to your Xenium output folder
datapath = CACHE / "demo_datasets/xenium_hbreastcancer/output-XETG00000__slide_id__hbreastcancer"

# Load the Xenium data as a SpatialData object
sdata = xenium(datapath)

WARNING  The `feature_key` column feature_name is categorical with unknown categories. Please ensure the categories
         are known before calling `PointsModel.parse()` to avoid significant performance implications due to the   
         need for dask to compute the categories. If you did not use PointsModel.parse() explicitly in your code   
         (e.g. this message is coming from a reader in `spatialdata_io`), please report this finding.

# Display the SpatialData object to see available elements
sdata

SpatialData object
├── Images
│     ├── 'morphology_focus': DataTree[cyx] (1, 25778, 35416), (1, 12889, 17708), (1, 6444, 8854), (1, 3222, 4427), (1, 1611, 2213)
│     └── 'morphology_mip': DataTree[cyx] (1, 25778, 35416), (1, 12889, 17708), (1, 6444, 8854), (1, 3222, 4427), (1, 1611, 2213)
├── Labels
│     ├── 'cell_labels': DataTree[yx] (25778, 35416), (12889, 17708), (6444, 8854), (3222, 4427), (1611, 2213)
│     └── 'nucleus_labels': DataTree[yx] (25778, 35416), (12889, 17708), (6444, 8854), (3222, 4427), (1611, 2213)
├── Points
│     └── 'transcripts': DataFrame with shape: (<Delayed>, 8) (3D points)
├── Shapes
│     ├── 'cell_boundaries': GeoDataFrame shape: (167780, 1) (2D shapes)
│     ├── 'cell_circles': GeoDataFrame shape: (167780, 2) (2D shapes)
│     └── 'nucleus_boundaries': GeoDataFrame shape: (167780, 1) (2D shapes)
└── Tables
      └── 'table': AnnData (167780, 313)
with coordinate systems:
    ▸ 'global', with elements:
        morphology_focus (Images), morphology_mip (Images), cell_labels (Labels), nucleus_labels (Labels), transcripts (Points), cell_boundaries (Shapes), cell_circles (Shapes), nucleus_boundaries (Shapes)

For details on the SpatialData structure, see the SpatialData documentation.

Convert to InSituData#

The convert_from_spatialdata() function converts a SpatialData object into an InSituData object. You need to specify which elements from the SpatialData object should be mapped to InSituPy’s data structure.

Key Parameters#

Parameter	Description
`sdata`	The SpatialData object to convert
`image_data`	Dictionary mapping new image names to `(sdata_key, pixel_size)` tuples; a tuple may carry an additional third element, as for `"nuclei"` in the example below
`cells_key`	Key for cell shapes in SpatialData (e.g., `"cell_circles"`)
`table_key`	Key for the expression table (default: `"table"`)
`cell_boundaries_data`	Tuple of `(label_key, pixel_size)` for cell segmentation masks
`nucleus_boundaries_data`	Tuple of `(label_key, pixel_size)` for nucleus segmentation masks
`transcripts_key`	Key for transcript points (default: `"transcripts"`)
`slide_id`	Identifier for the slide
`sample_id`	Identifier for the sample
`method_name`	Name of the spatial method (e.g., `"Xenium"`)
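Most of these parameters refer to element names inside the SpatialData object, so a typo in a key is an easy mistake to make. The sketch below checks a planned set of keys against the element names shown in the SpatialData printout above. It uses plain Python containers as stand-ins. `find_missing_keys` is a hypothetical helper, not an InSituPy function, and with a real object the available names would have to be read from the SpatialData object itself.

```python
# Element names copied from the SpatialData printout (stand-in for the real object)
available = {
    "images": {"morphology_focus", "morphology_mip"},
    "labels": {"cell_labels", "nucleus_labels"},
    "points": {"transcripts"},
    "shapes": {"cell_boundaries", "cell_circles", "nucleus_boundaries"},
    "tables": {"table"},
}

# Keys we plan to pass to the converter, grouped by the element type they refer to
planned = {
    "images": ["morphology_mip", "morphology_focus"],
    "labels": ["cell_labels", "nucleus_labels"],
    "points": ["transcripts"],
    "shapes": ["cell_circles"],
    "tables": ["table"],
}


def find_missing_keys(planned, available):
    """Return (element_type, key) pairs that are not present in `available`."""
    return [
        (element_type, key)
        for element_type, keys in planned.items()
        for key in keys
        if key not in available.get(element_type, set())
    ]


missing = find_missing_keys(planned, available)
print(missing or "All keys found")
```

An empty result means every planned key matches an existing element. A misspelled key such as `"cell_circle"` would be reported together with its element type.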

# Define the pixel size (in micrometers per pixel)
# For Xenium, this is typically 0.2125 µm/pixel
pixel_size = 0.2125

# Convert SpatialData to InSituData
xd = convert_from_spatialdata(
    sdata=sdata,
    # Map images: {new_name: (spatialdata_key, pixel_size)}
    image_data={
        "nuclei": ("morphology_mip", pixel_size, True),
        "mip": ("morphology_focus", pixel_size)
    },
    # Cell data configuration
    cells_key="cell_circles",
    table_key="table",
    # Boundary masks
    cell_boundaries_data=("cell_labels", pixel_size),
    nucleus_boundaries_data=("nucleus_labels", pixel_size),
    # Transcripts
    transcripts_key="transcripts",
    # Metadata
    slide_id="slide_demo",
    sample_id="sample_demo",
    method_name="Xenium"
)

2026-02-24 10:01:31 | [INFO] Using 'global' coordinate system for pixel size extraction.
Adding images...
Adding cell data...
2026-02-24 10:01:31 | [WARNING] Spatial coordinates in `obsm['spatial']` are overwritten using centroids from `'cell_circles'`.
2026-02-24 10:01:31 | [WARNING] For the segmentation mask values of the boundaries, it is assumed that the order of the cells matches the ascending values of the segmentation mask.
Adding transcripts...
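
The second warning is worth understanding. A label mask stores one integer per cell, and the converter assumes that sorting these integers in ascending order gives the same order as the cells in the table. The toy example below illustrates that assumption with plain Python lists. It is not InSituPy's implementation, and the mask values and cell IDs are made up for illustration.

```python
# Toy label mask: 0 is background, every other integer marks one cell
mask = [
    [0, 0, 3, 3],
    [1, 1, 3, 0],
    [1, 0, 0, 7],
]

# Cell IDs in table order (made-up names)
cell_ids = ["cell_a", "cell_b", "cell_c"]

# Unique non-background labels in ascending order
labels = sorted({value for row in mask for value in row if value != 0})

# Under the assumption, the i-th smallest label belongs to the i-th cell in the table
if len(labels) != len(cell_ids):
    raise ValueError("number of mask labels does not match number of cells")
label_to_cell = dict(zip(labels, cell_ids))
print(label_to_cell)
```

If the table was reordered or filtered after segmentation, this positional pairing would no longer hold, so it is worth checking the order before converting.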

# Display the converted InSituData object
xd

InSituData
Method:		Xenium
Slide ID:	slide_demo
Sample ID:	sample_demo
Path:		None

    ➤ images
       'nuclei':   (25778, 35416)
       'mip':      (25778, 35416)
    ➤ cells
       MultiCellData with main layer 'main'
           table
               AnnData object with n_obs × n_vars = 167780 × 313
               obs: 'cell_id', 'transcript_counts', 'control_probe_counts', 'control_codeword_counts', 'total_counts', 'cell_area', 'nucleus_area', 'region'
               var: 'gene_ids', 'feature_types', 'genome'
               uns: 'spatialdata_attrs'
               obsm: 'spatial'
           boundaries
               BoundariesData object with 2 entries:
                   cells
                   nuclei
    ➤ transcripts
       DataFrame with shape <dask_expr.expr.Scalar: expr=(Assign(frame=RenameFrame(frame=Assign(frame=Assign(frame=Assign(frame=Assign(frame=Assign(frame=Assign(frame=ColumnsSetter(frame=Assign(frame=ReadParquetFSSpec(5462333))[['x_location', 'y_location', 'z_location']], columns=['x', 'y', 'z']))))))), columns={'x': 'x_location', 'y': 'y_location', 'z': 'z_location'}))).size() // 8, dtype=int64> x 8

Explore the Converted Data#

Let’s verify that all data modalities were correctly converted.

# Check available images
xd.images

'nuclei':   (25778, 35416)
'mip':      (25778, 35416)

# Check cell data and spatial coordinates
xd.cells.table.obsm['spatial']

array([[ 847.25991211,  326.19136505],
       [ 826.34199524,  328.03182983],
       [ 848.76691895,  331.74318695],
       ...,
       [7470.15942383, 5119.13205566],
       [7477.73720703, 5128.71281738],
       [7489.3765625 , 5123.19777832]], shape=(167780, 2))
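
These values are consistent with coordinates in micrometers rather than pixels. Multiplying the image shape by the pixel size used above gives the physical extent of the slide, and the largest centroids fall inside it. The sketch below works through that arithmetic with the numbers from this tutorial. `um_to_pixel` is a hypothetical helper, and the assumption that `obsm['spatial']` stores `(x, y)` in micrometers is inferred from the values, not taken from the InSituPy documentation.

```python
pixel_size = 0.2125           # µm per pixel, as used in the conversion
image_shape = (25778, 35416)  # (rows, columns) = (y, x) in pixels

# Physical extent of the image in micrometers
height_um = image_shape[0] * pixel_size
width_um = image_shape[1] * pixel_size
print(f"extent: {width_um:.1f} µm (x) by {height_um:.1f} µm (y)")


def um_to_pixel(coord_um, pixel_size):
    """Convert a micrometer coordinate to the nearest pixel index."""
    return round(coord_um / pixel_size)


# Last centroid from the array above, assumed to be (x, y) in micrometers
x_um, y_um = 7489.3765625, 5123.19777832
x_px, y_px = um_to_pixel(x_um, pixel_size), um_to_pixel(y_um, pixel_size)

# The centroid should land inside the image if the units are as assumed
inside = 0 <= x_px < image_shape[1] and 0 <= y_px < image_shape[0]
print(x_px, y_px, inside)
```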

# Check cell boundaries
xd.cells.boundaries

BoundariesData object with 2 entries:
    cells
    nuclei

# Check transcripts
xd.transcripts

Dask DataFrame Structure:

	x_location	y_location	z_location	feature_name	cell_id	qv	overlaps_nucleus	transcript_id
npartitions=8
	float32	float32	float32	object	int32	float32	uint8	uint64
	...	...	...	...	...	...	...	...
...	...	...	...	...	...	...	...	...
	...	...	...	...	...	...	...	...
	...	...	...	...	...	...	...	...

Dask Name: assign, 25 expressions

Saving the Converted Data#

Once converted, you can save the InSituData object to disk in InSituPy’s native format for efficient storage and future use.

# Define output path
outpath = CACHE / "out/from_spatialdata_demo"

# Save the InSituData object
xd.saveas(outpath, overwrite=True)
print(f"Saved to: {outpath}")

Saving data to C:\Users\ge37voy\.cache\InSituPy\out\from_spatialdata_demo
Saved.
Saved to: C:\Users\ge37voy\.cache\InSituPy\out\from_spatialdata_demo

Loading Saved Data#

The saved data can be loaded back using InSituPy’s standard reading functions.

# Load the saved data
xd_loaded = InSituData.read(outpath)
xd_loaded.load_all()

# Display the loaded data
xd_loaded

InSituData
Method:		Xenium
Slide ID:	slide_demo
Sample ID:	sample_demo
Path:		C:\Users\ge37voy\.cache\InSituPy\out\from_spatialdata_demo

    ➤ images
       'mip':      (25778, 35416)
       'nuclei':   (25778, 35416)
    ➤ cells
       MultiCellData with main layer 'main'
           table
               AnnData object with n_obs × n_vars = 167780 × 313
               obs: 'cell_id', 'transcript_counts', 'control_probe_counts', 'control_codeword_counts', 'total_counts', 'cell_area', 'nucleus_area', 'region'
               var: 'gene_ids', 'feature_types', 'genome'
               uns: 'spatialdata_attrs'
               obsm: 'spatial'
           boundaries
               BoundariesData object with 2 entries:
                   cells
                   nuclei
    ➤ transcripts
       DataFrame with shape <dask_expr.expr.Scalar: expr=ReadParquetFSSpec(269c66f).size() // 8, dtype=int64> x 8

# Visualize the data
xd_loaded.show()

2026-02-24 12:53:35 | [INFO] Extracting unique gene names from Dask DataFrame...
2026-02-24 12:53:37 | [INFO] Found 541 unique genes
2026-02-24 12:53:45 | [INFO] Loading coordinates for gene 'ACTA2'...
2026-02-24 12:53:47 | [INFO] Loaded 439233 coordinates for gene 'ACTA2'

Alternative: Using the Xenium Reader with SpatialData Backend#

InSituPy also provides a convenient way to read Xenium data directly using SpatialData as the backend. This combines the loading and conversion into a single step.

from insitupy.io import read_xenium

# Read Xenium data using SpatialData backend
xd_direct = read_xenium(datapath, backend="spatialdata")

2026-02-24 13:13:36 | [INFO] Reading Xenium data with spatialdata-io backend...
WARNING  The `feature_key` column feature_name is categorical with unknown categories. Please ensure the categories
         are known before calling `PointsModel.parse()` to avoid significant performance implications due to the   
         need for dask to compute the categories. If you did not use PointsModel.parse() explicitly in your code   
         (e.g. this message is coming from a reader in `spatialdata_io`), please report this finding.              
2026-02-24 13:15:21 | [INFO] Using 'global' coordinate system for pixel size extraction.
Adding images...
Adding cell data...
2026-02-24 13:15:21 | [WARNING] Spatial coordinates in `obsm['spatial']` are overwritten using centroids from `'cell_circles'`.
2026-02-24 13:15:21 | [WARNING] For the segmentation mask values of the boundaries, it is assumed that the order of the cells matches the ascending values of the segmentation mask.
Adding transcripts...

# Display the result
xd_direct

InSituData
Method:		Xenium
Slide ID:	slide_id
Sample ID:	sample_id
Path:		None

    ➤ images
       'nuclei':   (25778, 35416)
       'mip':      (25778, 35416)
    ➤ cells
       MultiCellData with main layer 'main'
           table
               AnnData object with n_obs × n_vars = 167780 × 313
               obs: 'cell_id', 'transcript_counts', 'control_probe_counts', 'control_codeword_counts', 'total_counts', 'cell_area', 'nucleus_area', 'region'
               var: 'gene_ids', 'feature_types', 'genome'
               uns: 'spatialdata_attrs'
               obsm: 'spatial'
           boundaries
               BoundariesData object with 2 entries:
                   cells
                   nuclei
    ➤ transcripts
       DataFrame with shape <dask_expr.expr.Scalar: expr=(Assign(frame=RenameFrame(frame=Assign(frame=Assign(frame=Assign(frame=Assign(frame=Assign(frame=Assign(frame=ColumnsSetter(frame=Assign(frame=ReadParquetFSSpec(5462333))[['x_location', 'y_location', 'z_location']], columns=['x', 'y', 'z']))))))), columns={'x': 'x_location', 'y': 'y_location', 'z': 'z_location'}))).size() // 8, dtype=int64> x 8