Pipeline & Preprocessing

The RadiomicsPipeline is the core engine of Pictologics for executing reproducible, standardized radiomic feature extraction workflows. It manages the entire lifecycle from preprocessing to feature extraction and logging.

Why Use the Pipeline?

Reproducibility: Define a configuration once and apply it consistently to every image.
State Management: The pipeline tracks the image and masks (morphological and intensity) through every step.
Standardisation: Built-in configurations follow IBSI standards.
Batch Processing: Run multiple configurations (e.g., different binning strategies) on the same image in a single pass.
Flexibility: Steps execute linearly, so you can arrange them in any order or repeat steps.

Getting Started

from pictologics import RadiomicsPipeline, format_results, save_results

# 1. Initialize the pipeline
pipeline = RadiomicsPipeline()

# 2. Run a predefined configuration
results = pipeline.run(
    image="path/to/image.nii.gz",
    mask="path/to/mask.nii.gz",
    subject_id="Subject_001",
    config_names=["standard_fbn_32"],
)

# 3. Format and save results
row = format_results(results, fmt="wide", meta={"subject_id": "Subject_001"})
save_results([row], "results.csv")

Masks Are Optional

RadiomicsPipeline.run(...) accepts an optional mask argument:

Pass a mask path or Image object → used as the ROI (standard workflow).
Omit mask (or pass mask=None / mask="") → Pictologics generates a full (all-ones) ROI mask, treating the entire image as the initial ROI.

When you pass a mask path, the mask is loaded with the already-loaded image as reference_image, so DICOM SEG objects and cropped masks can be aligned to image geometry before extraction. When you pass an in-memory Image mask, Pictologics validates that shape, spacing, origin, and direction already match the image.

Mask values use nonzero membership semantics by default: values 1, 2, 3, etc. all mean "inside the ROI". Label numbers are not treated as weights for volume, texture, or intensity calculations. Add a binarize_mask step when you need to select a specific label or label range from a multi-label mask.

Complete Feature Sets Guaranteed

Every configuration always returns a pandas.Series with the full set of expected feature names — even when extraction fails partially or entirely.

Empty ROI (e.g., too strict resegment thresholds): all features are NaN.
Partial failures (e.g., mesh generation error, PCA with ≤3 voxels): successfully computed features retain their values; only the missing features are NaN.
Unexpected errors: all features are NaN.

Other configurations in the same run() call continue normally. The processing log records errors and the step that caused them.

This design ensures that multi-configuration batch runs always produce a complete, predictable result dictionary — downstream formatting, concatenation, and CSV export work without missing columns or unexpected exceptions.

See Result Guarantees for details.

results = pipeline.run(image, mask, config_names=["strict", "lenient"])
# If "strict" empties the ROI:
#   results["strict"]  -> Series of NaN (all expected feature names present)
#   results["lenient"] -> Series of computed values

Morphology with Whole-Image ROI

In a maskless run the initial ROI is the entire image, so morphology features describe whatever remains of that ROI after mask-refining steps (e.g., resegment, keep_largest_component). The values are computationally valid, but they may not be scientifically meaningful for every study.

Predefined Configurations

Pictologics includes 6 standard configurations designed for common radiomics workflows. All share:

Resampling: 0.5mm × 0.5mm × 0.5mm isotropic spacing with linear image interpolation
Feature Families: intensity, morphology, texture, histogram, and IVH
Performance: Spatial/local intensity disabled by default

Configuration	Method	Parameters
`standard_fbn_8`	Fixed Bin Number	`n_bins=8`
`standard_fbn_16`	Fixed Bin Number	`n_bins=16`
`standard_fbn_32`	Fixed Bin Number	`n_bins=32`
`standard_fbs_8`	Fixed Bin Size	`bin_width=8.0`
`standard_fbs_16`	Fixed Bin Size	`bin_width=16.0`
`standard_fbs_32`	Fixed Bin Size	`bin_width=32.0`

# Run a single configuration
results = pipeline.run(image, mask, config_names=["standard_fbn_32"])

# Run all 6 standard configurations
all_results = pipeline.run(image, mask, config_names=["all_standard"])

Configuration Management

For detailed documentation on FBN vs FBS guidance, export/import, YAML/JSON formats, schema versioning, and sharing configurations, see the Configuration & Reproducibility guide.

Linear Step Execution

Steps are applied one after another in the exact sequence you define. You can repeat steps, arrange steps in any order, and implement complex multi-stage preprocessing:

# Example: Complex workflow with repeated steps
complex_config = [
    {"step": "resample", "params": {"new_spacing": (2.0, 2.0, 2.0)}},
    {"step": "keep_largest_component", "params": {"apply_to": "morph"}},
    {"step": "resegment", "params": {"range_min": -1000, "range_max": 400}},
    {"step": "filter_outliers", "params": {"sigma": 3.0}},
    {"step": "round_intensities", "params": {}},
    {"step": "discretise", "params": {"method": "FBN", "n_bins": 32}},
    {"step": "extract_features", "params": {"families": ["texture", "histogram"]}},
]
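
The ordered-execution model can be sketched as a plain loop over step records. This is an illustrative stand-in, not the Pictologics implementation: `run_steps`, the `HANDLERS` table, the toy `clip` and `scale` steps, and the dict-based state are hypothetical names introduced here to show why step order matters.

```python
from typing import Any, Callable

State = dict[str, Any]
Handler = Callable[[State, dict[str, Any]], State]


def clip(state: State, params: dict[str, Any]) -> State:
    """Toy step: clamp every value to [lo, hi]."""
    lo, hi = params["lo"], params["hi"]
    return {**state, "values": [min(max(v, lo), hi) for v in state["values"]]}


def scale(state: State, params: dict[str, Any]) -> State:
    """Toy step: multiply every value by a factor."""
    return {**state, "values": [v * params["factor"] for v in state["values"]]}


# Hypothetical handler table; the real pipeline has its own step names.
HANDLERS: dict[str, Handler] = {"clip": clip, "scale": scale}


def run_steps(config: list[dict[str, Any]], state: State) -> State:
    """Apply each step in list order; a repeated step simply runs again."""
    for record in config:
        handler = HANDLERS[record["step"]]
        state = handler(state, record.get("params", {}))
    return state


start = {"values": [1, 5, 10]}
clip_step = {"step": "clip", "params": {"lo": 0, "hi": 4}}
scale_step = {"step": "scale", "params": {"factor": 2}}

print(run_steps([clip_step, scale_step], start)["values"])  # [2, 8, 8]
print(run_steps([scale_step, clip_step], start)["values"])  # [2, 4, 4]
```

The same two steps give different results in a different order, which is why a configuration is an ordered list rather than a set of options.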

Intelligent Image Routing

After discretisation, the pipeline maintains both the original (raw) image and the discretised image, ensuring each feature type gets the appropriate input automatically:

Feature Family	Image Used	Why
Intensity	Raw image	Statistics require original continuous values
Morphology	Raw image	Volume/surface calculations use original geometry
Histogram	Discretised	Bin-based statistics require integer bins
Texture (GLCM, GLRLM, etc.)	Discretised	Co-occurrence matrices require discrete grey levels
IVH	Configurable	Can use raw (continuous) or discretised values
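
The routing table above can be read as a simple lookup. The sketch below is a simplified illustration, not the library's code: `ROUTING` and `pick_image` are hypothetical names, images are represented by placeholder strings, and the IVH branch models only the `ivh_use_continuous` flag (not `ivh_discretisation`).

```python
# Which image each family reads, mirroring the table above.
ROUTING = {
    "intensity": "raw",
    "morphology": "raw",
    "histogram": "discretised",
    "texture": "discretised",
}


def pick_image(family, images, ivh_use_continuous=False):
    """Return the image a family should read from a {'raw', 'discretised'} dict."""
    if family == "ivh":
        # IVH is configurable: raw values in continuous mode, bins otherwise.
        key = "raw" if ivh_use_continuous else "discretised"
    else:
        key = ROUTING[family]
    return images[key]


images = {"raw": "raw-image", "discretised": "binned-image"}
print(pick_image("texture", images))                       # binned-image
print(pick_image("ivh", images, ivh_use_continuous=True))  # raw-image
```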

Available Preprocessing Steps

1. `resample`

Resamples the image and mask to a new voxel spacing.

Parameter	Type	Default	Description
`new_spacing`	`tuple`	(required)	Target spacing (x, y, z) in mm
`interpolation`	`str`	`"linear"`	Image interpolation: `"linear"`, `"cubic"`, `"nearest"`
`mask_interpolation`	`str`	`"nearest"`	Mask interpolation: `"nearest"`, `"linear"`
`mask_threshold`	`float`	`0.5`	Threshold for non-nearest mask interpolation
`round_intensities`	`bool`	`False`	Round intensities to nearest integer after resampling
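
To see what `mask_threshold` does with non-nearest mask interpolation, consider a 1D toy case: linear interpolation turns a binary mask into fractional values, and the threshold turns them back into a binary mask. This is a minimal sketch under stated assumptions, not the Pictologics resampler: `resample_mask_1d` is a hypothetical helper, and the `>=` comparison is an assumption about how the boundary value is treated.

```python
def resample_mask_1d(mask, factor, threshold=0.5):
    """Linearly interpolate a 1D binary mask onto a grid `factor` times finer,
    then binarise the fractional values with `threshold`."""
    n_out = (len(mask) - 1) * factor + 1
    out = []
    for i in range(n_out):
        pos = i / factor                 # position in source-index units
        lo = int(pos)
        hi = min(lo + 1, len(mask) - 1)  # clamp at the last sample
        frac = pos - lo
        value = (1 - frac) * mask[lo] + frac * mask[hi]
        # Assumption: a voxel exactly at the threshold counts as inside.
        out.append(1 if value >= threshold else 0)
    return out


print(resample_mask_1d([0, 1, 1, 0], factor=2))                  # [0, 1, 1, 1, 1, 1, 0]
print(resample_mask_1d([0, 1, 1, 0], factor=2, threshold=0.75))  # [0, 0, 1, 1, 1, 0, 0]
```

A higher threshold shrinks the resampled ROI at its boundary; a lower one grows it.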

2. `resegment`

Refines ROI masks based on intensity thresholds, excluding voxels outside the specified range from feature extraction. By default, resegmentation applies to both the morphology mask and the intensity mask, so morphology volumes and shape features describe the selected compartment, not the original geometric ROI. Set apply_to="intensity" only when morphology should remain anchored to the original ROI extent.

Memory Usage Alert

If your image has a background that resamples to 0, and 0 is within your resegment range, you must use source_mode="auto". Otherwise, resegment will include the entire background in the ROI, causing memory exhaustion during texture calculation. source_mode="auto" ensures the background remains excluded.

Parameter	Type	Default	Description
`range_min`	`float`	`None`	Minimum intensity value
`range_max`	`float`	`None`	Maximum intensity value
`apply_to`	`str`	`"both"`	`"both"`, `"morph"`, or `"intensity"`
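
Conceptually, resegmentation intersects the current mask with an intensity range. The following is a minimal sketch on flat lists, assuming inclusive bounds and treating a `None` bound as "no limit"; the `resegment` function here is a hypothetical stand-in and does not model `apply_to` or source masks.

```python
def resegment(intensities, mask, range_min=None, range_max=None):
    """Keep a voxel only if it is already in the mask AND inside the range."""
    out = []
    for value, inside in zip(intensities, mask):
        keep = bool(inside)
        if range_min is not None and value < range_min:
            keep = False
        if range_max is not None and value > range_max:
            keep = False
        out.append(1 if keep else 0)
    return out


intensities = [-1200, -500, 0, 600]
mask = [1, 1, 0, 1]
print(resegment(intensities, mask, range_min=-1000, range_max=400))  # [0, 1, 0, 0]
```

The third voxel has an in-range intensity of 0 but stays excluded because it was never part of the mask; resegmentation only removes voxels, it does not add them.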

3. `filter_outliers`

Removes outliers from ROI masks based on standard deviations from the mean. Like resegment, this defaults to both masks so compartment morphology reflects the filtered voxel set. Use apply_to="intensity" to remove outliers only from intensity, texture, histogram, and IVH calculations.

Parameter	Type	Default	Description
`sigma`	`float`	`3.0`	Number of standard deviations
`apply_to`	`str`	`"both"`	`"both"`, `"morph"`, or `"intensity"`
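
The outlier rule can be illustrated on a flat list of ROI intensities: compute the ROI mean and standard deviation, then drop voxels outside mean ± sigma × std. This is a sketch under assumptions, not the library's code: `filter_outliers` below is a hypothetical stand-in, and the use of the population standard deviation (`statistics.pstdev`) is an assumption.

```python
from statistics import fmean, pstdev


def filter_outliers(intensities, mask, sigma=3.0):
    """Remove ROI voxels further than `sigma` standard deviations from the ROI mean."""
    roi = [v for v, m in zip(intensities, mask) if m]
    if not roi:
        return list(mask)  # nothing to filter in an empty ROI
    mu = fmean(roi)
    sd = pstdev(roi)       # assumption: population standard deviation
    lo, hi = mu - sigma * sd, mu + sigma * sd
    return [1 if (m and lo <= v <= hi) else 0 for v, m in zip(intensities, mask)]


values = [10.0] * 20 + [1000.0]
filtered = filter_outliers(values, [1] * 21, sigma=3.0)
print(sum(filtered), filtered[-1])  # 20 0
```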

4. `keep_largest_component`

Restricts the mask to the largest connected component.

Parameter	Type	Default	Description
`apply_to`	`str`	`"both"`	`"both"`, `"morph"`, or `"intensity"`
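
For intuition, here is a self-contained sketch of largest-component selection on a small 2D mask using breadth-first search. It is illustrative only: the function below is a hypothetical stand-in, works on nested lists rather than 3D images, and assumes 4-connectivity, which may differ from the connectivity Pictologics uses.

```python
from collections import deque


def keep_largest_component(mask):
    """Return a copy of a 2D binary mask containing only its largest
    4-connected component (the first one found wins a tie)."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    best = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # Flood-fill one component starting at (r, c).
            component, queue = [], deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                component.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(component) > len(best):
                best = component
    out = [[0] * cols for _ in range(rows)]
    for y, x in best:
        out[y][x] = 1
    return out


mask = [
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
]
print(keep_largest_component(mask))
# [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 0]]
```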

5. `round_intensities`

Rounds image intensities to the nearest integer. Useful before discretisation when interpolation has turned integer-valued data (e.g., CT in Hounsfield Units) into non-integer values.

No parameters.

6. `binarize_mask`

Creates a binary mask from a multi-label mask. Without this step, all nonzero labels are treated as one combined ROI. Use mask_values to select specific segments before feature extraction.

Parameter	Type	Default	Description
`threshold`	`float`	`0.5`	Threshold value for binarization
`mask_values`	`int`, `list`, or `tuple`	`None`	Specific label(s) to select. Tuple `(min, max)` selects a range
`apply_to`	`str`	`"both"`	`"both"`, `"morph"`, or `"intensity"`
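
The difference between default nonzero membership and explicit label selection can be sketched on a flat list of labels. `select_labels` is a hypothetical helper written for this illustration; it assumes that a `(min, max)` tuple is an inclusive range and does not model the `threshold` parameter.

```python
def select_labels(labels, mask_values=None):
    """Turn a multi-label mask into binary membership.

    mask_values=None      -> every nonzero label is inside the ROI
    mask_values=int       -> only that label
    mask_values=list      -> any of the listed labels
    mask_values=(lo, hi)  -> labels in the inclusive range (assumption)
    """
    if mask_values is None:
        return [1 if v != 0 else 0 for v in labels]
    if isinstance(mask_values, tuple):
        lo, hi = mask_values
        return [1 if lo <= v <= hi else 0 for v in labels]
    wanted = {mask_values} if isinstance(mask_values, int) else set(mask_values)
    return [1 if v in wanted else 0 for v in labels]


labels = [0, 1, 2, 3, 2]
print(select_labels(labels))          # [0, 1, 1, 1, 1]
print(select_labels(labels, 2))       # [0, 0, 1, 0, 1]
print(select_labels(labels, [1, 3]))  # [0, 1, 0, 1, 0]
print(select_labels(labels, (2, 3)))  # [0, 0, 1, 1, 1]
```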

7. `discretise`

Discretises image intensities into bins. Required before extracting texture and histogram features, which read the discretised image.

Parameter	Type	Default	Description
`method`	`str`	(required)	`"FBN"` (Fixed Bin Number) or `"FBS"` (Fixed Bin Size)
`n_bins`	`int`	`None`	Number of bins (for FBN)
`bin_width`	`float`	`None`	Width of each bin (for FBS)
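
The two binning methods differ in what they hold fixed. The sketch below applies the commonly used IBSI-style formulas to a flat list of ROI intensities; it is not the Pictologics implementation. The function names are hypothetical, the `min_val` argument of the FBS helper is an assumption, and returning bin 1 for a constant ROI in FBN is an arbitrary choice made here to avoid division by zero.

```python
import math


def discretise_fbn(values, n_bins):
    """Fixed Bin Number: split [min, max] of the ROI into n_bins equal bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1 for _ in values]  # arbitrary choice for a constant ROI
    out = []
    for v in values:
        if v == hi:
            out.append(n_bins)      # the maximum falls into the last bin
        else:
            out.append(math.floor(n_bins * (v - lo) / (hi - lo)) + 1)
    return out


def discretise_fbs(values, bin_width, min_val=None):
    """Fixed Bin Size: bins of constant width starting at min_val."""
    lo = min(values) if min_val is None else min_val
    return [math.floor((v - lo) / bin_width) + 1 for v in values]


print(discretise_fbn([0, 2.5, 5, 7.5, 10], n_bins=4))
# [1, 2, 3, 4, 4]
print(discretise_fbs([-1000, -976, -975, 400], bin_width=25.0, min_val=-1000))
# [1, 1, 2, 57]
```

With FBN the number of grey levels is fixed and the bin width adapts to each ROI; with FBS the bin width is fixed in physical units and the number of grey levels depends on the intensity range.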

8. `filter`

Applies an IBSI 2 image filter. See the Image Filtering guide for detailed documentation.

Parameter	Type	Default	Description
`type`	`str`	(required)	`"mean"`, `"log"`, `"laws"`, `"gabor"`, `"wavelet"`, `"simoncelli"`, `"riesz"`
`boundary`	`str`	`"mirror"`	Boundary condition

Filter-specific parameters:

Filter	Required Params	Optional Params
`mean`	`support`	`boundary`
`log`	`sigma_mm`	`truncate`, `boundary`
`laws`	`kernel`	`rotation_invariant`, `pooling`, `compute_energy`, `energy_distance`, `boundary`
`gabor`	`sigma_mm`, `lambda_mm`, `gamma`	`rotation_invariant`, `delta_theta`, `pooling`, `boundary`
`wavelet`	`wavelet`, `level`, `decomposition`	`rotation_invariant`, `pooling`, `boundary`
`simoncelli`	`level`	—
`riesz`	`order`	`variant`, `sigma_mm`, `level`

Automatic Spacing Injection

For filters requiring physical spacing (log, gabor), the pipeline uses the image's voxel spacing automatically.
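
Spacing injection matters because a kernel width given in millimetres corresponds to a different number of voxels along each axis when spacing is anisotropic. A minimal sketch of that conversion, using a hypothetical helper name:

```python
def sigma_in_voxels(sigma_mm, spacing_mm):
    """Convert a physical kernel width (mm) into per-axis voxel units."""
    return tuple(sigma_mm / s for s in spacing_mm)


# 1.5 mm spans 3 voxels in-plane but only half a voxel across 3 mm slices.
print(sigma_in_voxels(1.5, (0.5, 0.5, 3.0)))  # (3.0, 3.0, 0.5)
```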

9. `extract_features`

Calculates radiomic features from the current state.

Parameter	Type	Default	Description
`families`	`list[str]`	(required)	Feature families to extract (see table below)
`include_spatial_intensity`	`bool`	`False`	Include Moran's I / Geary's C
`include_local_intensity`	`bool`	`False`	Include local intensity peaks
`ivh_params`	`dict`	`None`	Parameters for IVH: `bin_width`, `min_val`, `max_val`, etc.
`ivh_discretisation`	`dict`	`None`	Temporary discretisation for IVH only
`ivh_use_continuous`	`bool`	`False`	Use raw values for IVH
`texture_matrix_params`	`dict`	`None`	E.g., `{"ngldm_alpha": 1}`

Available feature families:

Family	Description
`"intensity"`	First-order statistics (Mean, Skewness, etc.)
`"spatial_intensity"`	Moran's I / Geary's C only
`"local_intensity"`	Local/global intensity peak features only
`"morphology"`	Shape and size features (Volume, Sphericity, etc.)
`"texture"`	GLCM, GLRLM, GLSZM, GLDZM, NGTDM, NGLDM
`"glcm"`, `"glrlm"`, `"glszm"`, `"gldzm"`, `"ngtdm"`, `"ngldm"`	Individual texture subfamilies
`"texture_glcm"`, `"texture_glrlm"`, etc.	Explicit texture-subfamily aliases
`"histogram"`	Intensity histogram features
`"ivh"`	Intensity-Volume Histogram features

Working with Results

The format_results() function converts pipeline output into different formats for analysis or export.

Format Options

Wide Format

One row per subject with all features as columns. Column names use the pattern {config}__{feature}.

row = format_results(results, fmt="wide", meta={"subject_id": "case1"})
# Returns: {"subject_id": "case1", "standard_fbn_32__mean_intensity_Q4LE": 123.4, ...}

Long Format

Tidy data with one row per feature. Config name is automatically included.

df = format_results(results, fmt="long", meta={"subject_id": "case1"}, output_type="pandas")
# Returns DataFrame: [subject_id, config, feature_name, value]
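
The two layouts can be pictured with plain dictionaries. The helpers below are hypothetical stand-ins for illustration and are not `format_results()` itself; they take a dict of dicts where the real pipeline returns a `pandas.Series` per configuration.

```python
def to_wide(results, meta=None):
    """One flat row: metadata plus one '{config}__{feature}' key per value."""
    row = dict(meta or {})
    for config, features in results.items():
        for name, value in features.items():
            row[f"{config}__{name}"] = value
    return row


def to_long(results, meta=None):
    """One row per (config, feature) pair, with metadata repeated on each row."""
    rows = []
    for config, features in results.items():
        for name, value in features.items():
            rows.append({**(meta or {}), "config": config,
                         "feature_name": name, "value": value})
    return rows


toy_results = {
    "standard_fbn_8": {"mean_intensity_Q4LE": 123.4},
    "standard_fbn_16": {"mean_intensity_Q4LE": 123.4},
}
print(to_wide(toy_results, {"subject_id": "case1"}))
print(to_long(toy_results, {"subject_id": "case1"})[0])
```

Wide rows suit one-row-per-subject modelling tables; long rows suit grouping and filtering by configuration or feature.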

Output Types

Type	Returns
`"dict"` (default)	Python dictionary (wide) or list of dicts (long)
`"pandas"`	`pandas.DataFrame`
`"json"`	JSON string

Batch Processing Pattern

all_rows = []
for file in image_files:
    res = pipeline.run(image=file, ...)
    all_rows.append(format_results(res, fmt="wide", meta={"filename": file.name}))

# Save everything at once
save_results(all_rows, "full_study_results.csv")

Result Guarantees

Every configuration in a run() call always returns a pandas.Series with a complete, predictable set of feature names — regardless of whether extraction succeeded, partially succeeded, or failed entirely. This guarantee holds at three levels:

Failure Level	Example	Behaviour
Whole-configuration failure	Empty ROI after `resegment`	All features set to `NaN`
Partial feature failure	Mesh generation error in morphology, PCA with ≤3 voxels, empty texture matrix	Successfully computed features retain their values; only the missing features are set to `NaN`
Unexpected runtime error	Uncaught exception during extraction	All features set to `NaN`

In every case:

No missing columns. The feature names in the returned Series are identical to those a fully successful extraction would have produced.
Other configurations continue. A failure in one configuration does not prevent subsequent configurations from running.
Errors are logged. The processing log records the error message and the step that caused the failure, accessible via pipeline.save_log().

This design makes batch processing safe: when you collect rows from many subjects with format_results() and merge them with save_results(), every row has the same columns. There are no ragged rows, no missing columns, and no unexpected exceptions.

Name-Based Merging

format_results() and save_results() always merge results by column name, never by position. Every configuration returns its full expected set of feature names, but the merging logic does not rely on ordering: columns are identified by their {config}__{feature} key (wide format) or feature_name value (long format), so values stay aligned even when rows list their features in a different order.
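
Name-based alignment can be demonstrated with the standard library alone. The sketch below is not `save_results()`; `rows_to_csv` is a hypothetical helper that collects the union of keys and lets `csv.DictWriter` place each value under its column name.

```python
import csv
import io


def rows_to_csv(rows):
    """Write dict rows to CSV text, aligning values by key rather than position."""
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)  # first-seen order defines column order
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames,
                            restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = [
    {"subject_id": "a", "cfg__f1": 1.0, "cfg__f2": 2.0},
    {"cfg__f2": 4.0, "subject_id": "b", "cfg__f1": 3.0},  # different key order
]
print(rows_to_csv(rows))
# subject_id,cfg__f1,cfg__f2
# a,1.0,2.0
# b,3.0,4.0
```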

How It Works Internally

The pipeline uses a static FEATURE_NAMES registry (in pictologics.features) that enumerates every feature name produced by each family. Three mechanisms ensure completeness:

Empty ROI: When EmptyROIMaskError is raised during preprocessing, the pipeline builds a full NaN Series directly from the registry without attempting extraction.
Partial failures: After each extract_features call, a backfill step compares the returned feature keys against the registry and inserts NaN for any missing keys. This catches edge cases where individual features cannot be computed (e.g., mesh fails → surface-based morphology features are NaN, but volume from voxel counting is preserved).
Unexpected errors: If an unhandled exception interrupts extraction, the general error handler backfills all expected feature names with NaN using the same registry.
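
The backfill idea behind these three mechanisms fits in a few lines. This is a sketch under assumptions: `TOY_REGISTRY` and its feature names are invented for illustration and do not reflect the contents of the real registry, and plain dicts stand in for `pandas.Series`.

```python
import math

# Hypothetical registry: family -> feature names it is expected to produce.
TOY_REGISTRY = {
    "intensity": ["mean", "variance"],
    "morphology": ["volume", "surface_area"],
}


def expected_names(families):
    """All feature names the requested families should produce, in order."""
    return [name for family in families for name in TOY_REGISTRY[family]]


def backfill(computed, families):
    """Keep computed values and insert NaN for every expected name that is missing."""
    return {name: computed.get(name, math.nan) for name in expected_names(families)}


# Partial failure: the mesh failed, so only the voxel-count volume was computed.
partial = backfill({"volume": 12.0}, ["morphology"])
# Whole-configuration failure: nothing was computed.
failed = backfill({}, ["intensity", "morphology"])

print(partial)       # {'volume': 12.0, 'surface_area': nan}
print(list(failed))  # ['mean', 'variance', 'volume', 'surface_area']
```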

Example: Empty ROI in a Multi-Configuration Run

pipeline.add_config("strict", [
    {"step": "resegment", "params": {"range_min": 100, "range_max": 200}},
    {"step": "extract_features", "params": {"families": ["intensity"]}},
])
pipeline.add_config("lenient", [
    {"step": "resegment", "params": {"range_min": -1000, "range_max": 3000}},
    {"step": "extract_features", "params": {"families": ["intensity"]}},
])

results = pipeline.run(image, mask, config_names=["strict", "lenient"])

# If the strict range empties the ROI:
#   results["strict"]  -> 18 intensity features, all NaN
#   results["lenient"] -> 18 intensity features, computed values
#
# format_results() works normally — the NaN row merges cleanly with other rows:
row = format_results(results, fmt="wide", meta={"subject_id": "case1"})

Detecting Failed Configurations

After a batch run, inspect the processing log to find which configurations failed or produced partial results:

for entry in pipeline._log:
    if "error" in entry:
        print(f"{entry['config_name']}: {entry['error']}")
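
Because failed configurations still return every feature name, the results themselves can also be screened by counting NaN values. The helper below is a hypothetical sketch that uses plain dicts in place of `pandas.Series`; note that an individual feature can be NaN for numerical reasons, so treat "partial" as a prompt to check the log rather than as proof of an error.

```python
import math


def classify(features):
    """Label one configuration's features as 'complete', 'partial' or 'failed'."""
    values = list(features.values())
    n_missing = sum(1 for v in values if isinstance(v, float) and math.isnan(v))
    if values and n_missing == len(values):
        return "failed"
    if n_missing:
        return "partial"
    return "complete"


toy_results = {
    "strict": {"mean": math.nan, "variance": math.nan},
    "lenient": {"mean": 41.5, "variance": 3.2},
    "mesh_issue": {"volume": 12.0, "surface_area": math.nan},
}
for name, features in toy_results.items():
    print(name, classify(features))
# strict failed
# lenient complete
# mesh_issue partial
```

With real `pandas.Series` results, `series.isna().all()` and `series.isna().any()` express the same checks.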

Feature Catalog

The describe_features() method returns a DataFrame cataloguing every feature the pipeline will produce before you run it. Each row is one (configuration, feature) pair with columns describing the feature identity, family membership, and the preprocessing state at the extract_features step that produces that feature.

The catalog follows the same ordered step model as configuration files. If a configuration repeats a preprocessing step, the corresponding *_params cell contains a compact JSON array with one entry per occurrence, including the original 1-based step_index. This keeps CSV exports readable while preserving a machine-readable audit trail.

pipeline = RadiomicsPipeline()
catalog = pipeline.describe_features()
catalog.head()

Column	Description
`config`	Configuration name
`feature_key`	Full feature key as it appears in the output
`feature_name`	Human-readable name (IBSI code stripped)
`ibsi_code`	3–4 character IBSI identifier
`family`	Granular family (e.g. `glcm`, `ivh`)
`family_group`	Broad category: Intensity, Morphology, or Texture
`requires_discretisation`	Whether the family needs discretised input
`uses_morph_mask` / `uses_intensity_mask`	Which runtime mask(s) the feature row depends on
`source_mode` / `sentinel_value`	Source-mask configuration metadata
`feature_extraction_step_index` / `feature_extraction_params`	Which `extract_features` step produced the row and its parameters
`preprocessing_sequence`	Ordered preprocessing steps before extraction, e.g. `1:resample > 2:resegment > 3:discretise`
`preprocessing_steps`	Full ordered preprocessing step records as compact JSON
`is_discretised` / `discretisation_method` / `discretisation_param`	Discretisation details
`is_resampled` / `resampling_spacing` / `interpolation`	Resampling details
`is_resegmented` / `resegment_apply_to` / `resegment_params`	Resegmentation details and effective mask target
`is_outlier_filtered` / `filter_outliers_apply_to` / `filter_outliers_params`	Outlier filtering details and effective mask target
`is_intensity_rounded` / `round_intensities_params`	Intensity rounding details
`keeps_largest_component` / `keep_largest_component_apply_to` / `keep_largest_component_params`	Largest-component mask processing details and effective mask target
`is_mask_binarized` / `binarize_mask_apply_to` / `binarize_mask_params`	Mask binarization details and effective mask target
`is_filtered` / `filter_type` / `filter_params`	Response-map filter details

The step-parameter columns (resample_params, resegment_params, discretise_params, filter_params, etc.) contain only parameters explicitly present in the configuration. Effective summary columns such as interpolation and discretisation_method include runtime defaults when a step omits them.
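
The `preprocessing_sequence` format shown in the table can be reproduced from a configuration list, which is handy when cross-checking a catalog against a config file. This is a sketch with a hypothetical helper name; it assumes the sequence covers the steps before the first `extract_features` step and uses 1-based indices, as in the example above.

```python
def preprocessing_sequence(config):
    """Render the steps before the first extract_features as '1:a > 2:b > ...'."""
    parts = []
    for index, record in enumerate(config, start=1):
        if record["step"] == "extract_features":
            break
        parts.append(f"{index}:{record['step']}")
    return " > ".join(parts)


config = [
    {"step": "resample", "params": {"new_spacing": (1.0, 1.0, 1.0)}},
    {"step": "resegment", "params": {"range_min": -1000, "range_max": 400}},
    {"step": "discretise", "params": {"method": "FBN", "n_bins": 32}},
    {"step": "extract_features", "params": {"families": ["texture"]}},
]
print(preprocessing_sequence(config))  # 1:resample > 2:resegment > 3:discretise
```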

Typical Use Cases

Export a data dictionary alongside study results:

catalog.to_csv("feature_catalog.csv", index=False)

Filter features for downstream analysis:

# Only texture features from FBN configs
texture_fbn = catalog[
    (catalog["family_group"] == "Texture")
    & (catalog["discretisation_method"] == "FBN")
]

Deduplication (Performance Optimization)

When running multiple configurations that share preprocessing steps, the pipeline automatically avoids redundant computation.

Enabled by Default

Deduplication is enabled by default (deduplicate=True). Just run multiple configs to benefit.

How It Works

The system analyzes your configurations and identifies reusable features:

Feature Family	Depends On	Independent Of
Morphology	Mask geometry (resample, resegment, filter_outliers, binarize_mask, keep_largest_component)	Response-map filters, discretization
Intensity	Intensity preprocessing (resample, resegment, filter_outliers, filter)	Discretization
Texture / Histogram	All of the above plus discretization	—
IVH	Same as texture/histogram unless `ivh_use_continuous=True`	Discretization in continuous mode

When configs share preprocessing but differ only in discretization:

Morphology and intensity are computed once and reused
Texture, histogram, and discretized IVH are computed per configuration
Cache reuse is scoped by feature family as well as preprocessing signature; texture, histogram, and IVH do not reuse each other's cached values.
Preprocessing order is part of the signature. The same steps in a different order are computed independently because they can produce different ROIs and intensities.
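
One way to picture family-scoped, order-sensitive caching is a signature built from only the steps a family depends on. The sketch below is a simplified reading of the dependency table above, not the library's `PreprocessingSignature`: `DEPENDS_ON`, `signature`, and the hashing scheme are hypothetical.

```python
import hashlib
import json

MASK_STEPS = {"resample", "resegment", "filter_outliers",
              "binarize_mask", "keep_largest_component"}
INTENSITY_STEPS = {"resample", "resegment", "filter_outliers", "filter"}

# Simplified dependency sets per family (assumption for this sketch).
DEPENDS_ON = {
    "morphology": MASK_STEPS,
    "intensity": INTENSITY_STEPS,
    "texture": MASK_STEPS | INTENSITY_STEPS | {"discretise"},
}


def signature(config, family):
    """Hash the ordered steps that can influence `family`; ignore the rest."""
    relevant = [
        {"step": r["step"], "params": r.get("params", {})}
        for r in config
        if r["step"] in DEPENDS_ON[family]
    ]
    # The family is part of the payload, so families never share cache keys.
    payload = json.dumps({"family": family, "steps": relevant}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


resample = {"step": "resample", "params": {"new_spacing": (1.0, 1.0, 1.0)}}
reseg = {"step": "resegment", "params": {"range_min": -1000, "range_max": 400}}
fbn16 = {"step": "discretise", "params": {"method": "FBN", "n_bins": 16}}
fbn32 = {"step": "discretise", "params": {"method": "FBN", "n_bins": 32}}

cfg_a = [resample, reseg, fbn16]
cfg_b = [resample, reseg, fbn32]
cfg_c = [reseg, resample, fbn16]  # same steps, different order

print(signature(cfg_a, "morphology") == signature(cfg_b, "morphology"))  # True
print(signature(cfg_a, "texture") == signature(cfg_b, "texture"))        # False
print(signature(cfg_a, "morphology") == signature(cfg_c, "morphology"))  # False
```

Configurations that differ only in discretisation share a morphology signature and can reuse that result, while a reordered configuration gets a new signature and is recomputed.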

Checking Statistics

stats = pipeline.deduplication_stats
print(f"Cache hit rate: {stats['cache_hit_rate']:.1%}")
print(f"Reused: {stats['reused_families']} families")
print(f"Computed: {stats['computed_families']} families")

Results Are Always Complete

Deduplication does not affect the completeness guarantee. Reused features are deep copied into each configuration's results, and every config still returns its complete set of feature names, with NaN only where extraction failed.

Configuration

Parameter	Type	Default	Description
`deduplicate`	`bool`	`True`	Enable/disable deduplication
`deduplication_rules`	`str` or `DeduplicationRules`	`"1.0.0"`	Rules version for reproducibility

API Reference

For detailed documentation of ConfigurationAnalyzer, DeduplicationPlan, PreprocessingSignature, and DeduplicationRules, see the Deduplication API reference.

Logging

The pipeline maintains a detailed log of every step executed, including parameters and errors.

# Save log after running
pipeline.save_log("pipeline_execution_log.json")

# Clear log between runs
pipeline.clear_log()

The saved JSON is self-describing. It contains a log schema version, pipeline schema version, Pictologics package version when available, the mask ROI semantics used for the run, and an entries array. Each entry records:

Timestamp, subject ID, image source, and mask source
Configuration name and a full configuration snapshot
Source mode, sentinel detection status, and effective sentinel value
Deduplication settings and whether a deduplication plan was used
Mask repositioning settings used when loading mask paths
List of executed steps with serialized parameters
Final configuration status, error text, failed step, and feature count when applicable

Examples

Standard Suite (Fast Baseline)

Run all 6 built-in configurations:

from pictologics import RadiomicsPipeline

pipeline = RadiomicsPipeline()
results = pipeline.run(
    image="path/to/image.nii.gz",
    mask="path/to/mask.nii.gz",
    config_names=["all_standard"],
)

Enable Spatial/Local Intensity Extras

cfg = [
    {"step": "resample", "params": {"new_spacing": (0.5, 0.5, 0.5)}},
    {"step": "discretise", "params": {"method": "FBN", "n_bins": 32}},
    {
        "step": "extract_features",
        "params": {
            "families": ["intensity", "morphology", "texture", "histogram", "ivh"],
            "include_spatial_intensity": True,  # Moran's I / Geary's C
            "include_local_intensity": True,    # Local intensity peaks
        },
    },
]

pipeline = RadiomicsPipeline().add_config("with_extras", cfg)
results = pipeline.run("image.nii.gz", "mask.nii.gz", config_names=["with_extras"])

IVH with Physical-Unit Mapping

cfg = [
    {"step": "resample", "params": {"new_spacing": (1.0, 1.0, 1.0)}},
    {"step": "discretise", "params": {"method": "FBS", "bin_width": 25.0, "min_val": -1000}},
    {
        "step": "extract_features",
        "params": {
            "families": ["ivh"],
            "ivh_params": {"bin_width": 25.0, "min_val": -1000, "target_range_max": 400},
        },
    },
]

Custom CT Pipeline

custom_config = [
    {"step": "resample", "params": {"new_spacing": (1.0, 1.0, 1.0)}},
    {"step": "resegment", "params": {"range_min": -150, "range_max": 250}},
    {"step": "discretise", "params": {"method": "FBN", "n_bins": 64}},
    {"step": "extract_features", "params": {
        "families": ["intensity", "morphology", "texture", "histogram", "ivh"]
    }},
]

pipeline = RadiomicsPipeline().add_config("my_custom_ct", custom_config)
results = pipeline.run(image, mask, config_names=["my_custom_ct"])

LoG Filtered Features

log_config = [
    {"step": "resample", "params": {"new_spacing": (1.0, 1.0, 1.0), "interpolation": "cubic"}},
    {"step": "round_intensities", "params": {}},
    {"step": "resegment", "params": {"range_min": -1000, "range_max": 400}},
    {"step": "filter", "params": {"type": "log", "sigma_mm": 1.5, "truncate": 4.0}},
    {"step": "extract_features", "params": {"families": ["intensity", "morphology", "histogram"]}},
]

Manual Step-by-Step Extraction

If you need granular control, call the preprocessing and feature functions directly, without the pipeline:

from pictologics import load_image
from pictologics.preprocessing import (
    resample_image, resegment_mask, filter_outliers,
    discretise_image, apply_mask
)
from pictologics.features.intensity import calculate_intensity_features
from pictologics.features.morphology import calculate_morphology_features
from pictologics.features.texture import calculate_all_texture_features

# Load and preprocess
image = load_image("image.nii.gz")
mask = load_image("mask.nii.gz")
image = resample_image(image, new_spacing=(1.0, 1.0, 1.0))
mask = resample_image(mask, new_spacing=(1.0, 1.0, 1.0), interpolation="nearest")
mask = resegment_mask(image, mask, range_min=-1000, range_max=400)

# Discretise for texture
disc_image = discretise_image(image, method="FBN", n_bins=32, roi_mask=mask)

# Extract features
morph = calculate_morphology_features(mask, image=image, intensity_mask=mask)
intensity = calculate_intensity_features(apply_mask(image, mask))
texture = calculate_all_texture_features(disc_image.array, mask.array, n_bins=32)

all_features = {**morph, **intensity, **texture}
print(f"Extracted {len(all_features)} features")

Use the Pipeline Instead

The RadiomicsPipeline accomplishes the same workflow with automatic image routing, logging, deduplication, and configuration export. Manual extraction is mainly useful for debugging or understanding the underlying process.

Performance Tips

Spatial/local intensity can be extremely slow on large ROIs. Keep them disabled unless needed.
Texture requires discretisation. Without a discretise step, the pipeline raises an error.
For large 3D images, consider coarser spacing for exploratory work.
For CT in Hounsfield Units, FBS (bin_width) is often more interpretable; for MRI/PET, FBN (n_bins) may be preferable.