Queries the Human Cell Atlas • cellNexus

cellNexus is a query interface for programmatic exploration and retrieval of harmonised, curated, and reannotated CELLxGENE human-cell-atlas data.

cellNexus abstract overview

This standalone documentation website provides:

Detailed data-processing information (quality control, harmonisation, and expression representations).
Complete explanations of metadata columns used in filtering and interpretation.
Guided examples for metadata-first exploration and gene-expression analysis.

Data Processing Overview

The harmonisation pipeline standardises data across datasets so queries are consistent across studies:

Metadata are retrieved from cloud-hosted harmonised tables.
Standardised quality control removes empty droplets, dead/damaged cells, and likely doublets.
Cell-level data are served through common assay layers (counts, cpm, sct, pseudobulk).
Outputs are returned in analysis-ready R formats such as SingleCellExperiment and Seurat.

Quality control steps

The QC flags used throughout cellNexus are computed using HPCell. In brief: - Empty droplets (empty_droplet): - Computed from a SingleCellExperiment. - Excludes mitochondrial genes and ribosomal genes before scoring: - Computes, per cell, the number of expressed genes and flags a cell as an empty droplet when (n_ <) RNA_feature_threshold (by default 200, except for targeted panels such as Rhapsody technology).

Alive cells (alive):
- Filters out empty droplets.
- Computes per-cell QC metrics from raw counts using scuttle::perCellQCMetrics(..., subsets=list(Mito=...)), where the mitochondrial subset is defined by ^MT in the feature names.
- Determines high mitochondrial content via scater::isOutlier(subsets_Mito_percent, type="higher"). Outlier calling is performed within each cell-type group using our harmonised label cell_type_unified_ensemble.
- Alive cells are labelled as those without high mitochondrial content (!high_mitochondrion).
Doublets (scDblFinder.class):
- Filters out empty droplets.
- scDblFinder::scDblFinder() default parameters are used. For cells that cannot be classified by scDblFinder, the class is set to "Unknown" to avoid dropping cells.

RNA abundance

RNA counts:
- RNA count distributions per sample are annotated from cellxgenedp, using the x_approximate_distribution column.
CPM:
- Counts-per-million normalisation computed from the raw counts assay via scuttle::calculateCPM().
Rank:
- Per-cell gene-expression ranks computed with singscore::rankGenes().
- Implemented in column chunks (default 1000 cells per slice) to handle very large datasets; slices are written to disk as an HDF5Array-backed sparse integer matrix and then column-bound.
SCT:
- Variance-stabilising normalisation computed with Seurat SCTransform() (v2), with regression of cell-level covariates (subsets_Mito_percent and subsets_Ribo_percent).
- QC filtering is applied first.
- The median common scale across the whole resource is applied (scale_factor=2186)
Pseudobulk:
- All low-quality cells flagged by QC are removed before aggregation.
- Aggregates counts across cells using scuttle::aggregateAcrossCells(), aggregating sample_id and the harmonised cell type (cell_type_harmonised_ensemble).

Metadata Explore

Detailed field definitions for the CELLxGENE schema follow the CELLxGENE schema 5.1.0, and CELLxGENE Census schema

Dataset-level columns

Column	Description
`dataset_id`	Primary dataset identifier in the atlas.
`collection_id`	Parent collection grouping related datasets.
`citation`	Citation or reference for the dataset.
`dataset_version_id`	Versioned identifier for a dataset release.
`explorer_url`	URL to browse the dataset in CELLxGENE explorer.
`filesize`	Reported size of source data files.
`filetype`	Storage format of source expression files.
`published_at`	Publication timestamp for dataset release.
`raw_data_location`	Location of raw data upstream.
`revised_at`	Last revision timestamp for dataset metadata.
`schema_version`	CELLxGENE schema version for this record.
`suspension_type`	Suspension type (e.g. cell vs nucleus).
`title`	Dataset publication title.
`url`	Dataset H5AD URL.
`donor_id`	Donor identifier.

Sample-level columns

Column	Description
`assay`	Sequencing technology label.
`assay_ontology_term_id`	Ontology identifier for `assay`.
`development_stage`	Developmental stage label.
`development_stage_ontology_term_id`	Ontology identifier for `development_stage`.
`self_reported_ethnicity`	Self-reported ancestry or ethnicity label.
`self_reported_ethnicity_ontology_term_id`	Ontology identifier for self-reported ethnicity.
`experiment___`	Upstream experiment grouping variable.
`organism`	Organism name (e.g. human).
`organism_ontology_term_id`	Ontology identifier for `organism`.
`sex`	Recorded biological sex label.
`sex_ontology_term_id`	Ontology identifier for `sex`.
`tissue`	Tissue label.
`tissue_type`	Tissue class (e.g. tissue vs organoid).
`tissue_ontology_term_id`	Ontology identifier for `tissue`.
`disease`	Disease annotation.
`disease_ontology_term_id`	Ontology identifier for `disease`.
`is_primary_data`	Whether observations are marked as primary (avoid duplicated cells).

Cell-level columns

Column	Description
`cell_id`	Cell identifier.
`cell_type`	Cell-type label from CELLxGENE Census API.
`cell_type_ontology_term_id`	Ontology identifier for `cell_type`.
`observation_joinid`	Cell ID join key linking metadata.

Harmonised and custom columns (cellNexus)

Through harmonisation and curation, cellNexus adds columns that are not present in the original CELLxGENE metadata alone.

Column	Description
`cell_count`	Number of cells in a dataset.
`feature_count`	Number of genes in a dataset.
`age_days`	Donor age in days.
`tissue_groups`	Coarse tissue grouping for analysis.
`empty_droplet`	Quality-control flag for empty droplets.
`alive`	Quality-control flag for viable cells (e.g. mitochondrial signal).
`scDblFinder.class`	Quality-control flag for doublet classification from `scDblFinder`.
`cell_type_unified_ensemble`	Consensus immune identity from Azimuth and `SingleR` (Blueprint, Monaco).
`cell_annotation_azimuth_l2`	Azimuth cell annotation.
`cell_annotation_blueprint_singler`	`SingleR` annotation (Blueprint).
`cell_annotation_blueprint_monaco`	`SingleR` annotation (Monaco).
`sample_heuristic`	Internal sample subdivision helper.
`file_id_cellNexus_single_cell`	Internal file id for single-cell layers.
`file_id_cellNexus_pseudobulk`	Internal file id for pseudobulk layers.
`sample_id`	Harmonised sample identifier.
`nCount_RNA`	Total RNA counts per cell (sample-aware).
`nFeature_expressed_in_sample`	Number of expressed features per cell.
`ethnicity_flagging_score`	Supporting score for ethnicity imputation.
`low_confidence_ethnicity`	Supporting flag for low-confidence ethnicity calls.
`.aggregated_cells`	Post-QC cells combined into each pseudobulk sample.
`imputed_ethnicity`	Imputed ethnicity label.
`atlas_id`	cellNexus atlas release identifier (internal use).
`is_immune`	Curated flag for immune-cell context.

Client Usage Examples

R client (`cellNexus`)

library(cellNexus)
library(dplyr)
library(stringr)

metadata <- get_metadata() |>
  join_census_table()

metadata <- metadata |>
  keep_quality_cells()

query <- metadata |>
  filter(
    self_reported_ethnicity == "African",
    str_like(assay, "%10x%"),
    tissue == "lung parenchyma",
    str_like(cell_type, "%CD4%")
  )

sce <- get_single_cell_experiment(query, assays = c("counts", "cpm"))
pb <- get_pseudobulk(query)

Python client (`cellNexusPy`)

Python support is available in the companion repository: MangiolaLaboratory/cellNexusPy.

from cellnexuspy import get_metadata, get_anndata

sample_dataset = "https://object-store.rc.nectar.org.au/v1/AUTH_06d6e008e3e642da99d806ba3ea629c5/cellNexus-metadata/sample_metadata.1.3.0.parquet"
conn, table = get_metadata(parquet_url=sample_dataset)

table = table.filter("""
    empty_droplet = 'false'
    AND alive = 'true'
    AND "scDblFinder.class" != 'doublet'
    AND feature_count >= 5000
""")

query = table.filter("""
    self_reported_ethnicity = 'African'
    AND assay LIKE '%10%'
    AND tissue = 'lung parenchyma'
    AND cell_type LIKE '%CD4%'
""")

adata = get_anndata(query, assay="cpm")
pb = get_anndata(query, aggregation="pseudobulk")
conn.close()

For other implementation details and code examples, see cellNexus README

cellNexus