Full Pipeline¶

Select this workflow with `--workflow full`.

The full pipeline downloads Seurat objects from LabKey, exports per-sample count matrices, harmonizes genes across species, and trains a scMODAL variational autoencoder to produce a joint multi-species latent embedding with Leiden clustering.

GPU required

This workflow must run on an HPC cluster via `-profile slurm`. The `SCMODAL_INTEGRATE` step requires at least one NVIDIA GPU, so the workflow cannot run on a local Mac or a CPU-only Linux host.

The only CPU path is the GitHub Actions smoke-test stub, enabled by passing `--scmodal_use_cpu true` together with `-stub-run`.
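
A quick preflight check on the target node can catch a missing GPU before a long job is queued. The sketch below is not part of the pipeline: the function name `nvidia_gpu_visible` is hypothetical, and it assumes the standard `nvidia-smi -L` listing (one `GPU <n>: ...` line per device) is available on the host.

```python
import shutil
import subprocess


def nvidia_gpu_visible(which=shutil.which, run=subprocess.run):
    """Return True if `nvidia-smi -L` runs and lists at least one GPU.

    `which` and `run` are injectable so the check can be exercised
    on machines without NVIDIA drivers.
    """
    exe = which("nvidia-smi")
    if exe is None:
        return False  # driver tools are not installed on this host
    try:
        result = run([exe, "-L"], capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.SubprocessError):
        return False
    # `nvidia-smi -L` prints one "GPU <n>: ..." line per visible device.
    return result.returncode == 0 and any(
        line.startswith("GPU ") for line in result.stdout.splitlines()
    )


if __name__ == "__main__":
    print("GPU visible:", nvidia_gpu_visible())
```

On a SLURM cluster this is only meaningful inside a GPU allocation, since login nodes often have no GPU attached.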

Stage-by-stage dataflow¶

flowchart TD
    SS["**samplesheet.csv**
    sample_id · output_file_id · species"]

    INGEST["**INGEST**
    (Rdiscvr)
    Downloads full Seurat object from LabKey"]

    EXPORT_COUNTS["**EXPORT_COUNTS**
    (CellMembrane)
    Extracts raw counts → 10x-like matrix dir"]

    COLLECT["collect()
    Gathers all sample count dirs"]

    GENE_HARMONIZE["**GENE_HARMONIZE**
    (scmodal-cuda)
    Ortholog mapping + AnnData per species"]

    SCMODAL["**SCMODAL_INTEGRATE**
    (scmodal-cuda · GPU)
    scMODAL VAE → latent embedding + Leiden clustering"]

    OUT_RDS["outputs/ingest/{id}/{id}.rds"]
    OUT_COUNTS["outputs/counts/{id}/{id}_counts/
    matrix.mtx · features.tsv · barcodes.tsv · obs_meta.csv"]
    OUT_HARM["outputs/harmonized/harmonized_outputs/
    *_harmonized.h5ad · integration_manifest.csv
    shared_genes.csv · ortholog_mapping.csv"]
    OUT_MODEL["outputs/scmodal/model_outputs/
    latent_clustered.h5ad · ckpt.pth
    training_history.csv · run_summary.json"]

    SS --> INGEST
    INGEST -->|"tuple(meta, .rds)"| OUT_RDS
    INGEST -->|"tuple(meta, .rds)"| EXPORT_COUNTS
    EXPORT_COUNTS -->|"tuple(meta, counts_dir/)"| OUT_COUNTS
    EXPORT_COUNTS -->|"all samples"| COLLECT
    COLLECT -->|"[counts_dir/, ...]"| GENE_HARMONIZE
    GENE_HARMONIZE --> OUT_HARM
    GENE_HARMONIZE -->|"harmonized_dir/"| SCMODAL
    SCMODAL --> OUT_MODEL

Inputs¶

Samplesheet¶

Path: `--input` (default `data/samplesheet.csv`)

See Data Formats → Samplesheet for the schema. All three columns (`sample_id`, `output_file_id`, `species`) are required for the full pipeline.
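
Checking the samplesheet before submission avoids a failed job at the INGEST step. The following is a minimal sketch, not the pipeline's own validator: `check_samplesheet` is a hypothetical helper, and the uniqueness check on `sample_id` is an assumption based on the per-sample output directories rather than a documented rule.

```python
import csv

# Column names as shown in the dataflow diagram for samplesheet.csv.
REQUIRED_COLUMNS = ("sample_id", "output_file_id", "species")


def check_samplesheet(path):
    """Return a list of problems; an empty list means the sheet looks usable."""
    problems = []
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        header = reader.fieldnames or []
        missing = [c for c in REQUIRED_COLUMNS if c not in header]
        if missing:
            return [f"missing column(s): {', '.join(missing)}"]
        seen = set()
        # Data rows start on line 2 because line 1 is the header.
        for line_no, row in enumerate(reader, start=2):
            for column in REQUIRED_COLUMNS:
                if not (row[column] or "").strip():
                    problems.append(f"line {line_no}: empty {column}")
            sample_id = (row["sample_id"] or "").strip()
            # Assumption: sample_id must be unique across the sheet.
            if sample_id and sample_id in seen:
                problems.append(f"line {line_no}: duplicate sample_id {sample_id!r}")
            seen.add(sample_id)
    return problems


if __name__ == "__main__":
    for problem in check_samplesheet("data/samplesheet.csv"):
        print(problem)
```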

Required parameters¶

| Parameter | Description |
| --- | --- |
| `--labkey_base_url` | LabKey server base URL |
| `--labkey_folder` | LabKey folder path |
| `--species_order` | Comma-separated species list (default `human,macaque,mouse`) |
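
The `--species_order` value is a plain comma-separated string, so it is easy to compare against the `species` column of the samplesheet. The sketch below assumes, without confirmation from the pipeline source, that every samplesheet species should appear in the list and that matching is case-sensitive. Both function names are hypothetical.

```python
def parse_species_order(value):
    """Split a comma-separated --species_order value into a clean list."""
    species = [item.strip() for item in value.split(",") if item.strip()]
    if len(set(species)) != len(species):
        raise ValueError(f"duplicate species in --species_order: {value!r}")
    return species


def unlisted_species(samplesheet_species, species_order):
    """Return samplesheet species that are absent from the species order."""
    allowed = set(parse_species_order(species_order))
    return sorted(set(samplesheet_species) - allowed)


if __name__ == "__main__":
    # The default from the table above, checked against a made-up species column.
    print(unlisted_species(["human", "mouse", "rat"], "human,macaque,mouse"))
```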

Outputs at each stage¶

INGEST → `outputs/ingest/{sample_id}/`¶

| File | Description |
| --- | --- |
| `{sample_id}.rds` | Full Seurat object (counts + metadata) downloaded from LabKey |

EXPORT_COUNTS → `outputs/counts/{sample_id}/{sample_id}_counts/`¶

A 10x-like matrix directory (one per sample):

| File | Description |
| --- | --- |
| `matrix.mtx` | Sparse count matrix (genes × cells, Matrix Market format) |
| `features.tsv` | Gene / feature names, one per row |
| `barcodes.tsv` | Cell barcode identifiers, one per row |
| `obs_meta.csv` | Cell-level metadata exported from the Seurat object |
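
Because the matrix is stored genes × cells, its row count should equal the number of lines in `features.tsv` and its column count the number of lines in `barcodes.tsv`. The sketch below checks that using only the Matrix Market header. It assumes the files are uncompressed and that the two `.tsv` files carry no header row. The helper names are hypothetical.

```python
from pathlib import Path

EXPECTED_FILES = ("matrix.mtx", "features.tsv", "barcodes.tsv", "obs_meta.csv")


def read_mtx_shape(mtx_path):
    """Return (n_rows, n_cols, n_nonzero) from a Matrix Market coordinate file."""
    with open(mtx_path) as handle:
        banner = handle.readline()
        if not banner.startswith("%%MatrixMarket"):
            raise ValueError(f"{mtx_path}: not a Matrix Market file")
        for line in handle:
            if line.startswith("%") or not line.strip():
                continue  # skip comment and blank lines before the size line
            n_rows, n_cols, n_nonzero = (int(x) for x in line.split()[:3])
            return n_rows, n_cols, n_nonzero
    raise ValueError(f"{mtx_path}: size line not found")


def count_lines(path):
    """Count non-empty lines in a text file."""
    with open(path) as handle:
        return sum(1 for line in handle if line.strip())


def check_counts_dir(counts_dir):
    """Return a list of problems found in one {sample_id}_counts directory."""
    counts_dir = Path(counts_dir)
    missing = [name for name in EXPECTED_FILES if not (counts_dir / name).is_file()]
    if missing:
        return [f"missing file: {name}" for name in missing]
    # Rows are genes and columns are cells, per the table above.
    n_genes, n_cells, _ = read_mtx_shape(counts_dir / "matrix.mtx")
    problems = []
    if count_lines(counts_dir / "features.tsv") != n_genes:
        problems.append("features.tsv row count does not match matrix rows")
    if count_lines(counts_dir / "barcodes.tsv") != n_cells:
        problems.append("barcodes.tsv row count does not match matrix columns")
    return problems
```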

GENE_HARMONIZE → `outputs/harmonized/harmonized_outputs/`¶

| File | Description |
| --- | --- |
| `{idx}_{species}_harmonized.h5ad` | One AnnData per species. Cells × shared ortholog genes, log-normalised. |
| `integration_manifest.csv` | Maps each species file to its integration order index. |
| `shared_genes.csv` | Shared ortholog gene list used across all species. |
| `ortholog_mapping.csv` | Full HomoloGene-based ortholog mapping table (all species). |
| `n_shared.txt` | Count of shared genes (read by SCMODAL_INTEGRATE). |
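
The `{idx}_{species}_harmonized.h5ad` naming makes it possible to recover the integration order from the filenames alone. The sketch below does this under the assumption that `{idx}` is a non-negative integer. `integration_manifest.csv` remains the authoritative record, but its column names are not documented here, so the sketch does not read it. The function name is hypothetical.

```python
import re
from pathlib import Path

# Assumption: {idx} is a non-negative integer prefix.
HARMONIZED_NAME = re.compile(r"^(?P<idx>\d+)_(?P<species>.+)_harmonized\.h5ad$")


def ordered_species_files(harmonized_dir):
    """Return [(idx, species, path), ...] sorted by integration index."""
    entries = []
    for path in Path(harmonized_dir).glob("*_harmonized.h5ad"):
        match = HARMONIZED_NAME.match(path.name)
        if match:
            entries.append((int(match["idx"]), match["species"], path))
    # Sort numerically so that index 10 follows 9 rather than 1.
    return sorted(entries, key=lambda entry: entry[0])


if __name__ == "__main__":
    for idx, species, path in ordered_species_files("outputs/harmonized/harmonized_outputs"):
        print(idx, species, path.name)
```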

SCMODAL_INTEGRATE → `outputs/scmodal/model_outputs/`¶

| File | Description |
| --- | --- |
| `latent_clustered.h5ad` | Concatenated AnnData with `obsm["X_scmodal"]` latent coords, UMAP, and Leiden cluster labels. |
| `ckpt.pth` | Trained scMODAL model checkpoint (PyTorch). |
| `training_history.csv` | Per-run summary: n_cells, n_genes, n_latent, training time, device. |
| `run_summary.json` | JSON summary of integration parameters and results. |
| `gpu_info.txt` | Output of `nvidia-smi` captured during the run. |
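
After a run finishes, a short completeness check confirms that every file in the table above was written. The sketch below is a hypothetical helper, not pipeline code. It loads `run_summary.json` as a generic dictionary and makes no assumption about its keys, which are not listed on this page.

```python
import json
from pathlib import Path

# File names taken from the SCMODAL_INTEGRATE output table.
MODEL_OUTPUTS = (
    "latent_clustered.h5ad",
    "ckpt.pth",
    "training_history.csv",
    "run_summary.json",
    "gpu_info.txt",
)


def missing_model_outputs(model_dir):
    """Return the expected output files that are absent from model_dir."""
    model_dir = Path(model_dir)
    return [name for name in MODEL_OUTPUTS if not (model_dir / name).is_file()]


def load_run_summary(model_dir):
    """Load run_summary.json as a plain dict."""
    with open(Path(model_dir) / "run_summary.json") as handle:
        return json.load(handle)


if __name__ == "__main__":
    out_dir = "outputs/scmodal/model_outputs"
    missing = missing_model_outputs(out_dir)
    if missing:
        print("missing:", ", ".join(missing))
    else:
        print(json.dumps(load_run_summary(out_dir), indent=2))
```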

Synthetic input snapshot¶

The seeded metadata fixture used by docs and CI gives a safe preview of the cell-level composition that flows into harmonization and scMODAL integration.

Figure: Synthetic immune-class composition

This plot is derived from the synthetic `sample_metadata.csv` bundle. Use it to check that broad immune-class distributions look sensible before the GPU stage.

For the generated code-level reference, see API Reference → Workflows.

Running on HPC¶

# Sync the repo (recommended before each run)
sbatch slurm_sync_repo.sh

# Submit the full pipeline
sbatch slurm_nextflow.sh \
  --workflow full \
  --labkey_base_url https://labkey.example.org \
  --labkey_folder /My/Project/Folder

Optionally, sync and launch in one step:

sbatch --export=ALL,SYNC_REPO_BEFORE_RUN=true slurm_nextflow.sh \
  --workflow full \
  --labkey_base_url https://labkey.example.org \
  --labkey_folder /My/Project/Folder

Resource profile (SLURM)¶

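| Step | CPUs | Memory | Wall time | GPU |
| --- | --- | --- | --- | --- |
| INGEST | 4 | 32 GB | 4 h | none |
| EXPORT_COUNTS | 4 | 32 GB | 4 h | none |
| GENE_HARMONIZE | 4 | 32 GB | 8 h | none |
| SCMODAL_INTEGRATE | 8 | 64 GB | 24 h | 1× (`--gres=gpu:1 --qos=gpu`) |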

Key parameters¶

| Parameter | Default | Description |
| --- | --- | --- |
| `--scmodal_latent` | `20` | Latent embedding dimensions |
| `--scmodal_training_steps` | `10000` | VAE training steps |
| `--scmodal_batch_size` | `500` | Mini-batch size |
| `--scmodal_neighbors` | `30` | KNN graph neighbours |
| `--leiden_resolution` | `0.5` | Leiden clustering resolution |

See the full Parameter reference for all options.
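
When sweeping these parameters, it can help to generate the submission command programmatically rather than editing it by hand. The sketch below assembles the same `sbatch slurm_nextflow.sh` invocation shown under Running on HPC and appends only the parameters that differ from the defaults in the table. `build_submit_command` is a hypothetical helper, and it knows only the five parameters listed on this page.

```python
import shlex

# Defaults copied from the Key parameters table.
DEFAULTS = {
    "scmodal_latent": 20,
    "scmodal_training_steps": 10000,
    "scmodal_batch_size": 500,
    "scmodal_neighbors": 30,
    "leiden_resolution": 0.5,
}


def build_submit_command(labkey_base_url, labkey_folder, **overrides):
    """Return the sbatch argument list for a full-pipeline run."""
    unknown = sorted(set(overrides) - set(DEFAULTS))
    if unknown:
        raise ValueError(f"unknown parameter(s): {', '.join(unknown)}")
    args = [
        "sbatch", "slurm_nextflow.sh",
        "--workflow", "full",
        "--labkey_base_url", labkey_base_url,
        "--labkey_folder", labkey_folder,
    ]
    for name, value in overrides.items():
        # Skip values equal to the default to keep the command short.
        if value != DEFAULTS[name]:
            args += [f"--{name}", str(value)]
    return args


if __name__ == "__main__":
    command = build_submit_command(
        "https://labkey.example.org",
        "/My/Project/Folder",
        scmodal_latent=30,
        leiden_resolution=0.5,
    )
    # shlex.join quotes arguments so the line can be pasted into a shell.
    print(shlex.join(command))
```

Returning a list rather than a string keeps the result safe to pass to `subprocess.run` without shell quoting concerns.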