Experimental Group Setup
Before comparisons between biological states can be performed, samples must be aggregated into Sample Groups.
A Sample Group is defined as a collection of samples sharing specific characteristics (e.g., “Female patients over 50 with Glioblastoma”). Defining these groups is the foundational step required to build Comparisons.

Sample Groups can be created on the Dataset Page. This page also lists essential study details, including the organism, disease, platform technology, experimental design summary, and the total counts for genes and samples.
How to Create a Sample Group
  • Click the "Create New" button in the Sample Groups section to launch the data selection interface.
  • The view displays all samples in rows, with metadata attributes (such as disease stage, tissue, and age) organized in columns, resembling a spreadsheet.
  • Click the Filter icon on a relevant column header (e.g., "Disease") and select specific values (e.g., "breast sarcoma"). Clicking "Apply" immediately updates the table to display only the samples that match the selected values. Additional filters may be applied as necessary.
  • Once the selection is finalized, click the "Create Group" button. The new group will appear in the left panel. A default name is generated based on the selected filters, but this can be renamed for clarity.
  • Continue creating additional groups (such as a Control group) or click "Complete" to close the interface and return to the main dataset view.
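
The filtering logic behind these steps can be sketched in a few lines of Python. This is only an illustration of the idea, not PandaOmics code: the sample records, the column names, and the helper functions `apply_filters` and `default_group_name` are all hypothetical, and the naming scheme for the default group name is an assumption.

```python
# Hypothetical sample metadata; column names and values are illustrative only.
samples = [
    {"id": "S1", "Disease": "breast sarcoma", "Tissue": "breast", "Age": 61},
    {"id": "S2", "Disease": "breast sarcoma", "Tissue": "breast", "Age": 47},
    {"id": "S3", "Disease": "healthy", "Tissue": "breast", "Age": 55},
    {"id": "S4", "Disease": "healthy", "Tissue": "breast", "Age": 39},
]


def apply_filters(rows, filters):
    """Keep rows whose value in every filtered column is among the allowed values."""
    return [
        row for row in rows
        if all(row[col] in allowed for col, allowed in filters.items())
    ]


def default_group_name(filters):
    """Build a default group name from the selected filters (assumed scheme)."""
    return "; ".join(
        f"{col}: {', '.join(sorted(allowed))}" for col, allowed in filters.items()
    )


# One filter set per group: a case group and a control group.
case_filters = {"Disease": {"breast sarcoma"}}
control_filters = {"Disease": {"healthy"}}

# Map each default group name to the IDs of the samples that pass its filters.
groups = {
    default_group_name(f): [row["id"] for row in apply_filters(samples, f)]
    for f in (case_filters, control_filters)
}
print(groups)
```

Adding a second column to a filter dictionary (for example, a set of allowed tissues) narrows the group further, mirroring how additional column filters combine in the table view.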
Single-Cell Datasets and Pseudobulking
If you are browsing a single-cell RNA sequencing (scRNA-seq) dataset, the group creation process can additionally involve a step called pseudobulking.
What is Pseudobulking?
Single-cell datasets are noisy and massive, making cell-by-cell comparison difficult and slow. Pseudobulking addresses this by combining similar cells into a single "virtual sample" that behaves like a sample from a standard bulk dataset. For example, instead of analyzing 5,000 individual T cells separately, pseudobulking might combine them into just ten samples representing ten different donors. This aggregation greatly reduces the computational workload and dampens cell-level noise, resulting in more reliable downstream analysis.
When creating a group in a single-cell dataset, a grouping variable (e.g., Donor ID or Cell Type) must be specified alongside standard filters. This variable determines the aggregation strategy: gene expression values are summed for all cells sharing the same grouping value. Valid grouping variables must be categorical (non-numeric) and contain at least two distinct values.
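
As a rough illustration of the aggregation described above, the sketch below sums counts across all cells that share a grouping value. The cell records, gene names, and the functions `is_valid_grouping_variable` and `pseudobulk` are hypothetical, and treating "categorical" as "all values are strings" is a simplifying assumption rather than the platform's actual rule.

```python
from collections import defaultdict

# Hypothetical single-cell data: one record per cell with raw counts per gene.
cells = [
    {"donor": "D1", "cell_type": "T-cell", "counts": {"CD3E": 4, "GAPDH": 10}},
    {"donor": "D1", "cell_type": "T-cell", "counts": {"CD3E": 2, "GAPDH": 7}},
    {"donor": "D2", "cell_type": "T-cell", "counts": {"CD3E": 1, "GAPDH": 12}},
    {"donor": "D2", "cell_type": "B-cell", "counts": {"CD3E": 0, "GAPDH": 9}},
]


def is_valid_grouping_variable(cells, key):
    """Assumed check: values are strings (categorical) with >= 2 distinct values."""
    values = [cell[key] for cell in cells]
    categorical = all(isinstance(v, str) for v in values)
    return categorical and len(set(values)) >= 2


def pseudobulk(cells, key):
    """Sum gene counts over all cells that share the same grouping value."""
    if not is_valid_grouping_variable(cells, key):
        raise ValueError(f"'{key}' is not a valid grouping variable")
    totals = defaultdict(lambda: defaultdict(int))
    for cell in cells:
        for gene, count in cell["counts"].items():
            totals[cell[key]][gene] += count
    return {group: dict(genes) for group, genes in totals.items()}


# Standard filter first (keep T-cells), then aggregate by donor.
t_cells = [c for c in cells if c["cell_type"] == "T-cell"]
virtual_samples = pseudobulk(t_cells, "donor")
print(virtual_samples)
```

Here three T-cells collapse into two virtual samples, one per donor. Grouping the same filtered cells by `cell_type` would be rejected, because only one distinct value remains after filtering.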
Cluster Analysis
Standard sample grouping relies on metadata attributes, where you explicitly filter samples based on pre-assigned labels like "Tumor" or "Normal." In contrast, Cluster Analysis enables you to define groups based on the -omics data itself. This data-driven approach groups samples that share similar gene expression profiles, revealing biological relationships that may not be captured in the metadata. 

To access this feature, click the "Cluster Analysis" button to open the dedicated visualization window.
Note: This option is currently available only for bulk datasets.

Cluster Analysis is a critical tool for two primary workflows:

  • Discovering Novel Subtypes
    Identify distinct molecular subpopulations within a dataset that otherwise looks uniform (e.g., finding two distinct tumor subtypes within a single "Cancer" label).
  • Quality Control
    Detect technical artifacts. Use the "Data to color" dropdown to overlay metadata (such as Batch ID or Sequencing Site) on the plot and inspect it visually. If samples cluster by technical factors rather than biological conditions, batch effects are likely present.
PandaOmics initializes the cluster analysis with default parameters for visualization and clustering to provide an immediate data overview. We encourage you to modify these settings to explore the relationships between samples from different perspectives. Any cluster identified in the analysis can be selected and saved as a new Sample Group for downstream comparison.
Visualization Techniques (Plotting)
  • PCA (Principal Component Analysis)
    A linear method that preserves global distances between samples. It is computationally efficient and deterministic, making it the standard choice for initial quality control.
  • t-SNE (t-Distributed Stochastic Neighbor Embedding)
    A non-linear method that excels at revealing local structure and separating clusters in heterogeneous data (e.g., distinct cell types). However, results vary slightly between runs, and the relative distances between distant clusters may not be meaningful.
  • UMAP (Uniform Manifold Approximation and Projection)
    A non-linear technique that balances the separation of local clusters with the preservation of the global data structure. This is the default setting.
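
To make the "linear and deterministic" property of PCA concrete, here is a minimal sketch that finds the first principal component with power iteration on the covariance matrix. It is not how PandaOmics computes its plots; the expression values and the function `first_principal_component` are hypothetical, and a real analysis would normally use an established numerical library. The sketch assumes at least two samples and a start vector that is not orthogonal to the leading component.

```python
import math


def first_principal_component(matrix, iterations=200):
    """Return (direction, scores) of the first PC for samples-in-rows data."""
    n, p = len(matrix), len(matrix[0])
    # Center each gene (column) on its mean.
    means = [sum(row[j] for row in matrix) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in matrix]
    # Sample covariance matrix between genes.
    cov = [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
        for i in range(p)
    ]
    # Power iteration from a fixed start vector, so every run gives the same result.
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Project each centered sample onto the component (a linear map).
    scores = [sum(r[j] * v[j] for j in range(p)) for r in centered]
    return v, scores


# Hypothetical expression matrix: 4 samples (rows) x 3 genes (columns).
expression = [
    [1.0, 2.0, 0.5],
    [1.2, 1.9, 0.4],
    [5.0, 6.1, 0.6],
    [5.3, 5.8, 0.5],
]
direction, scores = first_principal_component(expression)
print(direction, scores)
```

In this toy matrix the first two samples receive negative scores and the last two positive scores, so a single axis already separates the two apparent groups. t-SNE and UMAP have no equivalent closed-form projection, which is why their output can differ between runs.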
Clustering Algorithms (Grouping)
  • Spectral (Default)
    Robustly handles complex cluster shapes and is effective for most biological datasets.
  • HDBSCAN
    Density-based; automatically excludes outliers as "noise" to ensure cleaner groups.
  • K-Means
    Fast and simple; best used when groups are expected to be distinct and roughly equal in size.
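
As an illustration of the simplest option above, the following sketch runs a basic K-Means loop (Lloyd's algorithm) on two-dimensional coordinates and turns each resulting cluster into a named group of sample IDs. The coordinates, sample IDs, group names, and the functions `squared_distance` and `kmeans` are hypothetical, and the deterministic initialization is a simplification chosen for reproducibility, not a description of the platform's implementation.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Basic Lloyd's algorithm with a deterministic (evenly spaced) initialization."""
    centroids = [list(points[i * len(points) // k]) for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c, p=p: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical 2-D coordinates (e.g., from a projection), one per sample.
sample_ids = ["S1", "S2", "S3", "S4", "S5", "S6"]
coords = [(0.1, 0.2), (0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]

labels, centroids = kmeans(coords, k=2)

# Save each cluster as a named group of sample IDs.
cluster_groups = {}
for sample_id, label in zip(sample_ids, labels):
    cluster_groups.setdefault(f"Cluster {label}", []).append(sample_id)
print(cluster_groups)
```

K-Means assigns every sample to a cluster, so unlike HDBSCAN it never sets outliers aside as noise. This is one reason it suits data where the groups are already distinct and similar in size.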