Dataset: Xenium Human Lung Preview — Non-diseased FFPE
Source & Download: 10x Genomics — Xenium Human Lung Preview (standard)
Licensed under CC BY 4.0
This repository implements a transparent and flexible pipeline for processing Xenium spatial transcriptomics data from raw output to spatial visualization.
Unlike workflows that depend on Zarr or aggregated formats, this approach uses step-by-step raw data processing for reproducibility and educational clarity.
- Reproducibility & clarity: You manually stream and filter transcripts (e.g., Q ≥ 20) and build the cell×gene matrix, making every step clear and auditable.
- Robustness to changes: The Xenium data format may change over time; this pipeline handles schema differences gracefully (e.g., variations in column names such as `x_location` vs `x_centroid`).
- Scalability & memory efficiency: The two-pass Arrow/Parquet batching reads the data in chunks rather than loading it whole, so large datasets fit in memory.
- Customizable & extensible: Users can easily adjust quality thresholds or extend the pipeline to other spatial platforms (e.g., CosMx, MERFISH).
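
The streaming, schema-tolerance, and two-pass ideas above can be sketched as follows. This is a minimal illustration, not the repository's actual code: the column names `cell_id`, `feature_name`, and `qv`, the `UNASSIGNED` / `-1` markers for transcripts outside any cell, and every function name here are assumptions. `resolve_column` shows one way to tolerate renamed columns; `count_transcripts` makes one pass to index cells and genes and a second pass to accumulate counts, so only one batch is held in memory at a time.

```python
from collections import Counter


def resolve_column(available, candidates):
    """Return the first candidate column name that is present in `available`."""
    for name in candidates:
        if name in available:
            return name
    raise KeyError(f"none of {candidates} found in {sorted(available)}")


def iter_parquet_batches(path, columns, batch_size=500_000):
    """Yield column dicts from a Parquet file without loading it whole."""
    import pyarrow.parquet as pq  # lazy import: only needed for real files

    parquet_file = pq.ParquetFile(path)
    for batch in parquet_file.iter_batches(batch_size=batch_size, columns=columns):
        yield batch.to_pydict()


def count_transcripts(make_batches, min_qv=20.0, unassigned=("UNASSIGNED", -1)):
    """Two passes over the batches: index cells and genes, then count.

    `make_batches` is a zero-argument callable returning a fresh iterator of
    dicts with 'cell_id', 'feature_name' and 'qv' columns (assumed names).
    """

    def kept(batch):
        # Keep transcripts that pass the quality threshold and belong to a cell.
        rows = zip(batch["cell_id"], batch["feature_name"], batch["qv"])
        for cell, gene, qv in rows:
            if qv >= min_qv and cell not in unassigned:
                yield cell, gene

    # Pass 1: discover the row (cell) and column (gene) labels.
    cells, genes = set(), set()
    for batch in make_batches():
        for cell, gene in kept(batch):
            cells.add(cell)
            genes.add(gene)
    cells, genes = sorted(cells), sorted(genes)
    row = {c: i for i, c in enumerate(cells)}
    col = {g: j for j, g in enumerate(genes)}

    # Pass 2: accumulate one count per (cell, gene) occurrence.
    counts = Counter()
    for batch in make_batches():
        for cell, gene in kept(batch):
            counts[row[cell], col[gene]] += 1
    return cells, genes, counts
```

With a real file, `make_batches` could be `lambda: iter_parquet_batches(path, ["cell_id", "feature_name", "qv"])`, and the returned `(row, col) -> count` triplets can be passed to `scipy.sparse.coo_matrix` to obtain the sparse cell×gene matrix. The per-row Python loop favors readability over speed; a vectorized filter on each Arrow batch would be faster on full datasets.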
- scripts/unpack_all.py — Extracts Xenium data from zipped output bundles.
- notebooks/preview_quickstart.ipynb — Walk-through notebook to:
  - Load raw `cells.parquet` and `transcripts.parquet`, plus image and metrics files
  - Filter and build a sparse count matrix
  - Construct and QC an `AnnData` object
  - Normalize, cluster, and visualize spatial patterns
  - Compute neighborhood enrichment and save results
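
To make the QC and normalization steps concrete, here is a small NumPy sketch of the underlying arithmetic. It is an illustration under stated assumptions rather than the notebook's code: the notebook presumably relies on scanpy for these steps, the thresholds `min_counts` and `min_genes` are hypothetical, and a dense array is used only for brevity (the pipeline itself builds a sparse matrix). Scaling each cell to the median depth mirrors the default behavior of scanpy's `normalize_total`.

```python
import numpy as np


def qc_and_normalize(counts, min_counts=10, min_genes=5):
    """Filter low-quality cells, then depth-normalize and log-transform.

    counts: dense (cells x genes) array of raw transcript counts.
    Returns (normalized matrix of kept cells, boolean keep mask, QC metrics).
    """
    counts = np.asarray(counts, dtype=float)
    total_counts = counts.sum(axis=1)       # transcripts per cell
    n_genes = (counts > 0).sum(axis=1)      # detected genes per cell
    keep = (total_counts >= min_counts) & (n_genes >= min_genes)

    kept = counts[keep]
    depth = kept.sum(axis=1, keepdims=True)
    target = np.median(depth)               # scale every cell to the median depth
    normalized = np.log1p(kept / depth * target)

    metrics = {"total_counts": total_counts, "n_genes": n_genes}
    return normalized, keep, metrics
```

The `total_counts` and `n_genes` vectors are the quantities usually plotted as per-cell QC distributions before choosing thresholds.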
- Clone this repository: `git clone https://github.com/jrs-orellana/xenium2anndata-analysis-workflow`
- Download the Xenium dataset ZIP from the link above and place it in `data/`.
- Run `python scripts/unpack_all.py` to extract the dataset.
- Install dependencies (see below).
- Open and run the notebooks to process the data:
  - `01_xenium_raw2anndata.ipynb`: conversion from raw Parquet to AnnData
  - `02_xenium_downstream.ipynb`: QC, clustering, marker detection, spatial plots
- Explore results in `results/figures/` and the processed `.h5ad` file.
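
For readers curious what the extraction step involves, the following is a hypothetical sketch using only the standard library. It is not the contents of `scripts/unpack_all.py`; it assumes the ZIP bundles sit directly in `data/` and that each should be unpacked into a folder named after the archive.

```python
import zipfile
from pathlib import Path


def unpack_all(data_dir="data"):
    """Extract every *.zip in data_dir into a sibling folder named after it."""
    extracted = []
    for archive in sorted(Path(data_dir).glob("*.zip")):
        target = archive.with_suffix("")    # data/foo.zip -> data/foo/
        if target.exists():
            continue                        # already unpacked, skip it
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(target)
        extracted.append(target)
    return extracted
```

Skipping folders that already exist makes the step safe to rerun after a partial download.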
Main packages required (see `requirements.txt` for exact versions):
- scanpy — single-cell analysis and visualization
- squidpy — spatial transcriptomics analysis
- spatialdata-io — Xenium/Visium data import
- pyarrow — efficient batch processing of Parquet files
- numpy and pandas — data handling
- matplotlib and seaborn — visualization
Install via:
pip install -r requirements.txt
Output figures (see `results/figures/`):
- Total Counts, Genes per Cell, Counts vs Genes
- Post-QC Density, PCA Scree, UMAP (Leiden)
- UMAP: n Genes by Cell, UMAP: Total Counts, Spatial Leiden
- Cell-Type Scores, Compartments (High)
- Marker Dotplot, Marker Heatmap, Neighborhood Enrichment
If you use this repository or adapt parts of the workflow, please cite it as:
APA style:
Orellana-Montes, J. (2025). xenium2anndata-analysis-workflow: Transparent pipeline for Xenium spatial transcriptomics. GitHub. Available at: https://github.com/jrs-orellana/xenium2anndata-analysis-workflow
BibTeX:
@misc{xenium2anndata2025,
author = {Julio Orellana-Montes},
title = {xenium2anndata-analysis-workflow: Transparent pipeline for Xenium spatial transcriptomics},
year = {2025},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/jrs-orellana/xenium2anndata-analysis-workflow}}
}
Dataset: Xenium Human Lung Preview — Non-diseased FFPE, 10x Genomics (licensed under CC BY 4.0).
Please cite per 10x Genomics citation guidelines.
- Name: `xenium2anndata-analysis-workflow`
- Purpose: Detailed, manual parsing and processing of Xenium raw data
- Strengths: Transparency, flexibility, reproducibility over convenience
Planned extensions for this repository include:
- Integration with additional spatial transcriptomics platforms (e.g., CosMx, MERFISH).
- Adding batch correction and cross-sample integration modules.
- Enhanced visualization (interactive dashboards with napari or Bokeh).
- Tutorials for exporting and sharing processed data in standard formats (e.g., `.loom`, `.h5ad`).
This project is released under the MIT License. See LICENSE for details.
Dataset belongs to 10x Genomics and is licensed under CC BY 4.0.
Author: Julio Orellana-Montes
For questions, suggestions, or collaborations, open an issue or pull request on GitHub, or contact me at julio.orellana@upch.pe.