Skip to content

Full Pipeline

--workflow full

The full pipeline downloads Seurat objects from LabKey, exports per-sample count matrices, harmonizes genes across species, and trains a scMODAL variational autoencoder to produce a joint multi-species latent embedding with Leiden clustering.

GPU required

This workflow must run on an HPC cluster via -profile slurm. It requires at least one NVIDIA GPU for the SCMODAL_INTEGRATE step. It cannot run on a local Mac or CPU-only Linux host.

The only CPU path is the GitHub Actions smoke-test stub, enabled with --scmodal_use_cpu true together with -stub-run.


Stage-by-stage dataflow

flowchart TD
    SS["**samplesheet.csv**
    sample_id · output_file_id · species"]

    INGEST["**INGEST**
    (Rdiscvr)
    Downloads full Seurat object from LabKey"]

    EXPORT_COUNTS["**EXPORT_COUNTS**
    (CellMembrane)
    Extracts raw counts → 10x-like matrix dir"]

    COLLECT["collect()
    Gathers all sample count dirs"]

    GENE_HARMONIZE["**GENE_HARMONIZE**
    (scmodal-cuda)
    Ortholog mapping + AnnData per species"]

    SCMODAL["**SCMODAL_INTEGRATE**
    (scmodal-cuda · GPU)
    scMODAL VAE → latent embedding + Leiden clustering"]

    OUT_RDS["outputs/ingest/{id}/{id}.rds"]
    OUT_COUNTS["outputs/counts/{id}/{id}_counts/
    matrix.mtx · features.tsv · barcodes.tsv · obs_meta.csv"]
    OUT_HARM["outputs/harmonized/harmonized_outputs/
    *_harmonized.h5ad · integration_manifest.csv
    shared_genes.csv · ortholog_mapping.csv"]
    OUT_MODEL["outputs/scmodal/model_outputs/
    latent_clustered.h5ad · ckpt.pth
    training_history.csv · run_summary.json"]

    SS --> INGEST
    INGEST -->|"tuple(meta, .rds)"| OUT_RDS
    INGEST -->|"tuple(meta, .rds)"| EXPORT_COUNTS
    EXPORT_COUNTS -->|"tuple(meta, counts_dir/)"| OUT_COUNTS
    EXPORT_COUNTS -->|"all samples"| COLLECT
    COLLECT -->|"[counts_dir/, ...]"| GENE_HARMONIZE
    GENE_HARMONIZE --> OUT_HARM
    GENE_HARMONIZE -->|"harmonized_dir/"| SCMODAL
    SCMODAL --> OUT_MODEL

Inputs

Samplesheet

Path: --input (default data/samplesheet.csv)

See Data Formats → Samplesheet for the schema. All three columns are required for the full pipeline.

Required parameters

Parameter Description
--labkey_base_url LabKey server base URL
--labkey_folder LabKey folder path
--species_order Comma-separated species list (default human,macaque,mouse)

Outputs at each stage

INGEST → outputs/ingest/{sample_id}/

File Description
{sample_id}.rds Full Seurat object (counts + metadata) downloaded from LabKey

EXPORT_COUNTS → outputs/counts/{sample_id}/{sample_id}_counts/

A 10x-like matrix directory (one per sample):

File Description
matrix.mtx Sparse count matrix (genes × cells, Market Exchange format)
features.tsv Gene / feature names, one per row
barcodes.tsv Cell barcode identifiers, one per row
obs_meta.csv Cell-level metadata exported from the Seurat object

GENE_HARMONIZE → outputs/harmonized/harmonized_outputs/

File Description
{idx}_{species}_harmonized.h5ad One AnnData per species. Cells × shared ortholog genes, log-normalised.
integration_manifest.csv Maps each species file to its integration order index.
shared_genes.csv Shared ortholog gene list used across all species.
ortholog_mapping.csv Full HomoloGene-based ortholog mapping table (all species).
n_shared.txt Count of shared genes (read by SCMODAL_INTEGRATE).

SCMODAL_INTEGRATE → outputs/scmodal/model_outputs/

File Description
latent_clustered.h5ad Concatenated AnnData with obsm["X_scmodal"] latent coords, UMAP, and Leiden cluster labels.
ckpt.pth Trained scMODAL model checkpoint (PyTorch).
training_history.csv Per-run summary: n_cells, n_genes, n_latent, training time, device.
run_summary.json JSON summary of integration parameters and results.
gpu_info.txt Output of nvidia-smi captured during the run.

Synthetic input snapshot

The seeded metadata fixture used by docs and CI gives a safe preview of the cell-level composition that flows into harmonization and scMODAL integration.

Synthetic immune-class composition

This plot is derived from the synthetic sample_metadata.csv bundle and is useful for validating that broad immune-class distributions look sensible before the GPU stage.

For the generated code-level reference, see API Reference → Workflows.


Running on HPC

# Sync the repo (recommended before each run)
sbatch slurm_sync_repo.sh

# Submit the full pipeline
sbatch slurm_nextflow.sh \
  --workflow full \
  --labkey_base_url https://labkey.example.org \
  --labkey_folder /My/Project/Folder

Optionally, sync and launch in one step:

sbatch --export=ALL,SYNC_REPO_BEFORE_RUN=true slurm_nextflow.sh \
  --workflow full \
  --labkey_base_url https://labkey.example.org \
  --labkey_folder /My/Project/Folder


Resource profile (SLURM)

Step CPUs Memory Wall time GPU
INGEST 4 32 GB 4 h
EXPORT_COUNTS 4 32 GB 4 h
GENE_HARMONIZE 4 32 GB 8 h
SCMODAL_INTEGRATE 8 64 GB 24 h 1× (--gres=gpu:1 --qos=gpu)

Key parameters

Parameter Default Description
--scmodal_latent 20 Latent embedding dimensions
--scmodal_training_steps 10000 VAE training steps
--scmodal_batch_size 500 Mini-batch size
--scmodal_neighbors 30 KNN graph neighbours
--leiden_resolution 0.5 Leiden clustering resolution

See the full Parameter reference for all options.