Deduplication API

The deduplication module avoids redundant computation in multi-configuration radiomic feature extraction. When several configurations share preprocessing steps and differ only in discretization, it identifies which feature families can be computed once and reused.

Overview

The deduplication system consists of four main components:

  1. DeduplicationRules: Defines which preprocessing steps affect which feature families
  2. PreprocessingSignature: Creates hashable representations of preprocessing states
  3. ConfigurationAnalyzer: Analyzes pipeline configurations to identify optimization opportunities
  4. DeduplicationPlan: Generates optimized execution plans

Quick Start

Enabled by Default

Deduplication is enabled by default (deduplicate=True). You don't need to enable it explicitly; just create a pipeline and run multiple configurations.

from pictologics import RadiomicsPipeline

# Deduplication is enabled by default
pipeline = RadiomicsPipeline()  # deduplicate=True is the default

# Add multiple configurations with shared preprocessing
# ... (morphology/intensity computed once, reused across configs)

# image and mask are assumed to be loaded already
results = pipeline.run(image, mask, config_names=["config1", "config2", "config3"])

# Check performance statistics
print(pipeline.deduplication_stats)

For complete usage examples, see Case 7: Multi-configuration batch with deduplication.


How Results Are Handled

When deduplication reuses features from a previous configuration, the features are copied into the reusing configuration's results, so they are never empty or missing.

Result Behavior

| Scenario | Behavior |
| --- | --- |
| deduplicate=True (default) | Features computed once, then copied to all configs with matching preprocessing. All configs receive complete feature sets. |
| deduplicate=False | Features computed independently for each config. Same results, but slower. |
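
The copy behavior can be pictured with a small standalone sketch. It does not use pictologics: `compute_morphology`, the feature names, and the `sources` mapping are hypothetical stand-ins chosen for illustration, and the real pipeline may organize this differently.

```python
import copy

# Hypothetical stand-in for an expensive feature computation
# (not a pictologics function).
call_count = 0


def compute_morphology() -> dict[str, float]:
    global call_count
    call_count += 1
    return {"volume": 12.5, "surface_area": 30.1}


# None means "compute here"; a name means "copy from that config".
sources = {"fbn_8": None, "fbn_16": "fbn_8", "fbn_32": "fbn_8"}

results: dict[str, dict[str, float]] = {}
for config_name, source in sources.items():
    if source is None:
        results[config_name] = compute_morphology()
    else:
        # Copy so every config owns a complete, independent feature set.
        results[config_name] = copy.deepcopy(results[source])

print(call_count)  # 1
```

Every config ends up with the same keys and values, but the expensive function runs only once.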

Example: Results Structure

results = pipeline.run(image, mask, config_names=["fbn_8", "fbn_16", "fbn_32"])

# All configs have IDENTICAL morphology values (computed once, copied to others)
assert results["fbn_8"]["volume_mesh_ml_HTUR"] == results["fbn_16"]["volume_mesh_ml_HTUR"]
assert results["fbn_8"]["volume_mesh_ml_HTUR"] == results["fbn_32"]["volume_mesh_ml_HTUR"]

# Texture features DIFFER (depend on discretization)
assert results["fbn_8"]["glcm_joint_avg_d1_HTUR"] != results["fbn_32"]["glcm_joint_avg_d1_HTUR"]

Data Tables and Concatenation

When you concatenate results into a single DataFrame (e.g., for machine learning), every configuration row is complete, with no missing values due to deduplication:

import pandas as pd
from pictologics import format_results

# Format results for each config
rows = []
for config_name, features in results.items():
    row = format_results({config_name: features}, fmt="wide", meta={"config": config_name})
    rows.append(row)

# Concatenate into single DataFrame - NO missing values!
df = pd.DataFrame(rows)
print(df.shape)  # (3, N) - all rows complete
print(df.isna().sum().sum())  # 0 - no NaN values

DeduplicationRules

pictologics.deduplication.DeduplicationRules dataclass

Defines which preprocessing steps affect each feature family.

This is a frozen (immutable) dataclass that specifies the dependencies between preprocessing steps and feature families. Rules are versioned to ensure reproducibility when sharing configurations.

Attributes:

version (str): Semantic version string for this rules definition.
family_dependencies (dict[str, frozenset[str]]): Mapping of feature family names to the set of preprocessing step names that affect their output.
ivh_discretization_dependent_unless (str): Condition under which IVH becomes independent of discretization (e.g., "ivh_use_continuous=True").
comparison_mode (str): How to compare preprocessing parameters ("exact_params").

Source code in pictologics/deduplication.py
@dataclass(frozen=True)
class DeduplicationRules:
    """
    Defines which preprocessing steps affect each feature family.

    This is a frozen (immutable) dataclass that specifies the dependencies
    between preprocessing steps and feature families. Rules are versioned
    to ensure reproducibility when sharing configurations.

    Attributes:
        version: Semantic version string for this rules definition.
        family_dependencies: Mapping of feature family names to the set of
            preprocessing step names that affect their output.
        ivh_discretization_dependent_unless: Condition under which IVH becomes
            independent of discretization (e.g., "ivh_use_continuous=True").
        comparison_mode: How to compare preprocessing parameters ("exact_params").
    """

    version: str
    family_dependencies: dict[str, frozenset[str]]
    ivh_discretization_dependent_unless: str
    comparison_mode: str

    def to_dict(self) -> dict[str, Any]:
        """Serialize rules to a dictionary."""
        return {
            "version": self.version,
            "family_dependencies": {
                k: sorted(v) for k, v in self.family_dependencies.items()
            },
            "ivh_discretization_dependent_unless": self.ivh_discretization_dependent_unless,
            "comparison_mode": self.comparison_mode,
        }

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "DeduplicationRules":
        """Deserialize rules from a dictionary."""
        return cls(
            version=data["version"],
            family_dependencies={
                k: frozenset(v) for k, v in data["family_dependencies"].items()
            },
            ivh_discretization_dependent_unless=data["ivh_discretization_dependent_unless"],
            comparison_mode=data["comparison_mode"],
        )

    @classmethod
    def get_version(cls, version: str) -> "DeduplicationRules":
        """
        Get rules for a specific version from the registry.

        Args:
            version: Version string (e.g., "1.0.0").

        Returns:
            The DeduplicationRules for that version.

        Raises:
            ValueError: If the version is not in the registry.
        """
        if version not in RULES_REGISTRY:
            raise ValueError(
                f"Unknown deduplication rules version: {version}. "
                f"Available versions: {list(RULES_REGISTRY.keys())}"
            )
        return RULES_REGISTRY[version]
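
The to_dict()/from_dict() pair converts each frozenset to a sorted list and back, which keeps the serialized form JSON-compatible and stable across runs. A minimal sketch of that round trip follows, using `MiniRules`, a hypothetical reduced stand-in with only two of the four fields rather than the real class:

```python
import json
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class MiniRules:
    """Hypothetical, reduced stand-in for DeduplicationRules."""

    version: str
    family_dependencies: dict[str, frozenset[str]]

    def to_dict(self) -> dict[str, Any]:
        # frozensets are not JSON-serializable; sorted lists are, and they
        # give the same output regardless of set iteration order.
        return {
            "version": self.version,
            "family_dependencies": {
                k: sorted(v) for k, v in self.family_dependencies.items()
            },
        }

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "MiniRules":
        return cls(
            version=data["version"],
            family_dependencies={
                k: frozenset(v) for k, v in data["family_dependencies"].items()
            },
        )


rules = MiniRules("1.0.0", {"morphology": frozenset({"resample", "binarize_mask"})})
payload = json.dumps(rules.to_dict())
restored = MiniRules.from_dict(json.loads(payload))

print(restored == rules)  # True
```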


PreprocessingSignature

pictologics.deduplication.PreprocessingSignature dataclass

A hashable signature representing a preprocessing configuration.

Contains both a hash for fast comparison and the full JSON representation for human-readable debugging and logging.

Attributes:

hash (str): SHA256 hash of the normalized preprocessing steps.
json_repr (str): Full JSON string of the preprocessing steps.

Source code in pictologics/deduplication.py
@dataclass(frozen=True)
class PreprocessingSignature:
    """
    A hashable signature representing a preprocessing configuration.

    Contains both a hash for fast comparison and the full JSON representation
    for human-readable debugging and logging.

    Attributes:
        hash: SHA256 hash of the normalized preprocessing steps.
        json_repr: Full JSON string of the preprocessing steps.
    """

    hash: str
    json_repr: str

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, PreprocessingSignature):
            return NotImplemented
        return self.hash == other.hash

    def __hash__(self) -> int:
        return hash(self.hash)

    @classmethod
    def from_steps(
        cls, steps: list[tuple[str, dict[str, Any]]]
    ) -> "PreprocessingSignature":
        """
        Create a signature from a list of (step_name, params) tuples.

        Args:
            steps: List of (step_name, params_dict) tuples, sorted by step name.

        Returns:
            A PreprocessingSignature with deterministic hash and JSON.
        """
        # Create deterministic JSON representation
        json_repr = json.dumps(
            {step_name: _normalize_params(params) for step_name, params in steps},
            sort_keys=True,
            separators=(",", ":"),
        )

        # Compute SHA256 hash
        hash_value = hashlib.sha256(json_repr.encode("utf-8")).hexdigest()

        return cls(hash=hash_value, json_repr=json_repr)

    def to_dict(self) -> dict[str, str]:
        """Serialize to dictionary."""
        return {"hash": self.hash, "json": self.json_repr}

    @classmethod
    def from_dict(cls, data: dict[str, str]) -> "PreprocessingSignature":
        """Deserialize from dictionary."""
        return cls(hash=data["hash"], json_repr=data["json"])

from_steps(steps) classmethod

Create a signature from a list of (step_name, params) tuples.

Parameters:

steps (list[tuple[str, dict[str, Any]]], required): List of (step_name, params_dict) tuples, sorted by step name.

Returns:

PreprocessingSignature: A PreprocessingSignature with deterministic hash and JSON.
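
The determinism comes from serializing with sorted keys and fixed separators before hashing. The sketch below reproduces that idea with the standard library only. `signature_hash` is a hypothetical helper, the parameter names `spacing` and `order` are invented for illustration, and the private `_normalize_params` step is omitted because its behavior is not shown in this page.

```python
import hashlib
import json
from typing import Any


def signature_hash(steps: list[tuple[str, dict[str, Any]]]) -> str:
    """Hash (step_name, params) pairs via canonical JSON, as from_steps does."""
    json_repr = json.dumps(
        {name: params for name, params in steps},
        sort_keys=True,          # key order no longer matters
        separators=(",", ":"),   # no whitespace variation
    )
    return hashlib.sha256(json_repr.encode("utf-8")).hexdigest()


# Same parameters, different key order: identical hash.
a = signature_hash([("resample", {"spacing": [1.0, 1.0, 1.0], "order": 3})])
b = signature_hash([("resample", {"order": 3, "spacing": [1.0, 1.0, 1.0]})])
# Different parameter value: different hash.
c = signature_hash([("resample", {"spacing": [2.0, 2.0, 2.0], "order": 3})])

print(a == b, a == c)  # True False
```

Because the steps are collected into a dict keyed by step name, a step name that appears twice keeps only its last parameters; this holds for the sketch and for the from_steps source shown above.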

ConfigurationAnalyzer

pictologics.deduplication.ConfigurationAnalyzer

Analyzes multiple configurations to create a deduplication plan.

Compares preprocessing steps across configurations for each feature family and identifies which config/family pairs produce identical results.

Parameters:

configs (dict[str, list[dict[str, Any]]], required): Dict mapping config names to lists of step dicts.
rules (DeduplicationRules | None, default None): The DeduplicationRules to use (defaults to current version).

Source code in pictologics/deduplication.py
class ConfigurationAnalyzer:
    """
    Analyzes multiple configurations to create a deduplication plan.

    Compares preprocessing steps across configurations for each feature family
    and identifies which config/family pairs produce identical results.

    Args:
        configs: Dict mapping config names to lists of step dicts.
        rules: The DeduplicationRules to use (defaults to current version).
    """

    def __init__(
        self,
        configs: dict[str, list[dict[str, Any]]],
        rules: DeduplicationRules | None = None,
    ):
        self.configs = configs
        self.rules = rules or get_default_rules()

    def analyze(self) -> DeduplicationPlan:
        """
        Analyze configurations and create a deduplication plan.

        Returns:
            A DeduplicationPlan mapping each config/family to its source.
        """
        plan = DeduplicationPlan(
            rules=self.rules,
            configs_hash=_hash_configs(self.configs),
        )

        # Get all feature families from rules
        all_families = set(self.rules.family_dependencies.keys())

        # Track first occurrence of each signature per family
        # signature_hash -> (first_config_name, signature)
        first_occurrence: dict[str, dict[str, tuple[str, PreprocessingSignature]]] = {
            family: {} for family in all_families
        }

        # Process each config
        for config_name, steps in self.configs.items():
            # Determine which families this config extracts
            families_in_config = self._get_families_in_config(steps)

            for family in families_in_config:
                if family not in all_families:
                    # Unknown family, skip
                    continue

                # Extract relevant preprocessing steps
                relevant_steps = extract_relevant_steps(steps, family, self.rules)

                # Create signature
                signature = PreprocessingSignature.from_steps(relevant_steps)
                plan.signatures[(config_name, family)] = signature

                # Check if this signature was seen before
                if signature.hash in first_occurrence[family]:
                    # Reuse from first config with this signature
                    source_config, _ = first_occurrence[family][signature.hash]
                    plan.sources[(config_name, family)] = source_config
                else:
                    # First occurrence - compute fresh
                    first_occurrence[family][signature.hash] = (config_name, signature)
                    plan.sources[(config_name, family)] = None

        return plan

    def _get_families_in_config(self, steps: list[dict[str, Any]]) -> set[str]:
        """
        Determine which feature families a config will extract.
        """
        families: set[str] = set()

        for step in steps:
            if step.get("step") == "extract_features":
                params = step.get("params", {})
                # Get explicitly listed families
                family_list = params.get(
                    "families", ["intensity", "morphology", "texture", "histogram", "ivh"]
                )
                families.update(family_list)

                # Check for texture sub-families
                if "texture" in families:
                    # Texture expands to all texture families
                    families.update(["glcm", "glrlm", "glszm", "gldzm", "ngtdm", "ngldm"])

                # Check optional intensity features
                if params.get("include_spatial_intensity", False):
                    families.add("spatial_intensity")
                if params.get("include_local_intensity", False):
                    families.add("local_intensity")

        return families

analyze()

Analyze configurations and create a deduplication plan.

Returns:

DeduplicationPlan: A DeduplicationPlan mapping each config/family to its source.
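
The core of analyze() is a first-occurrence lookup: within one feature family, the first config with a given signature computes, and later configs with the same signature point back to it. A simplified single-family sketch, where `plan_sources` and the `spacing` parameter are hypothetical and the signature is a plain SHA256 over sorted JSON:

```python
import hashlib
import json
from typing import Any, Optional


def plan_sources(inputs: dict[str, dict[str, Any]]) -> dict[str, Optional[str]]:
    """Map each config to None (compute) or to the config it can copy from."""
    first_seen: dict[str, str] = {}  # signature hash -> first config name
    sources: dict[str, Optional[str]] = {}
    for config_name, relevant_params in inputs.items():
        digest = hashlib.sha256(
            json.dumps(relevant_params, sort_keys=True).encode("utf-8")
        ).hexdigest()
        if digest in first_seen:
            sources[config_name] = first_seen[digest]  # reuse
        else:
            first_seen[digest] = config_name           # first occurrence
            sources[config_name] = None
    return sources


# Preprocessing relevant to one family (say, morphology) for three configs.
morphology_inputs = {
    "fbn_8": {"resample": {"spacing": 1.0}},
    "fbn_16": {"resample": {"spacing": 1.0}},
    "coarse": {"resample": {"spacing": 2.0}},
}
sources = plan_sources(morphology_inputs)

print(sources)  # {'fbn_8': None, 'fbn_16': 'fbn_8', 'coarse': None}
```

The outcome depends on iteration order: whichever matching config comes first in the dict becomes the source for the others.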

DeduplicationPlan

pictologics.deduplication.DeduplicationPlan dataclass

A plan describing which config/family pairs should compute vs. reuse.

Attributes:

rules (DeduplicationRules): The DeduplicationRules used to create this plan.
signatures (dict[tuple[str, str], PreprocessingSignature]): Mapping of (config_name, family) to PreprocessingSignature.
sources (dict[tuple[str, str], str | None]): Mapping of (config_name, family) to source config name (or None if first).
configs_hash (str): Hash of the configs dict to detect modifications.

Source code in pictologics/deduplication.py
@dataclass
class DeduplicationPlan:
    """
    A plan describing which config/family pairs should compute vs. reuse.

    Attributes:
        rules: The DeduplicationRules used to create this plan.
        signatures: Mapping of (config_name, family) to PreprocessingSignature.
        sources: Mapping of (config_name, family) to source config name (or None if first).
        configs_hash: Hash of the configs dict to detect modifications.
    """

    rules: DeduplicationRules
    signatures: dict[tuple[str, str], PreprocessingSignature] = field(
        default_factory=dict
    )
    sources: dict[tuple[str, str], str | None] = field(default_factory=dict)
    configs_hash: str = ""

    def should_compute(self, config_name: str, family: str) -> bool:
        """
        Check if this config/family should be computed fresh.

        Returns True if this is the first occurrence of this signature,
        False if it can be copied from another config.
        """
        return self.sources.get((config_name, family)) is None

    def get_source(self, config_name: str, family: str) -> str | None:
        """
        Get the source config to copy from, or None if should compute.
        """
        return self.sources.get((config_name, family))

    def is_stale(self, current_configs: dict[str, list[dict[str, Any]]]) -> bool:
        """
        Check if this plan is stale due to config modifications.
        """
        current_hash = _hash_configs(current_configs)
        return current_hash != self.configs_hash

    def to_dict(self) -> dict[str, Any]:
        """Serialize the plan to a dictionary."""
        return {
            "rules": self.rules.to_dict(),
            "configs_hash": self.configs_hash,
            "signatures": [
                {
                    "config": config,
                    "family": family,
                    **sig.to_dict(),
                }
                for (config, family), sig in self.signatures.items()
            ],
            "sources": [
                {
                    "config": config,
                    "family": family,
                    "source": source,
                }
                for (config, family), source in self.sources.items()
            ],
        }

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "DeduplicationPlan":
        """Deserialize a plan from a dictionary."""
        rules = DeduplicationRules.from_dict(data["rules"])

        signatures = {}
        for item in data.get("signatures", []):
            key = (item["config"], item["family"])
            signatures[key] = PreprocessingSignature.from_dict(item)

        sources = {}
        for item in data.get("sources", []):
            key = (item["config"], item["family"])
            sources[key] = item["source"]

        return cls(
            rules=rules,
            signatures=signatures,
            sources=sources,
            configs_hash=data.get("configs_hash", ""),
        )

    def get_summary(self) -> dict[str, int]:
        """
        Get a summary of the deduplication plan.

        Returns:
            Dict with counts of computed vs reused families.
        """
        computed = sum(1 for s in self.sources.values() if s is None)
        reused = sum(1 for s in self.sources.values() if s is not None)
        return {"computed": computed, "reused": reused, "total": computed + reused}
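
The plan is keyed by (config_name, family) tuples, which JSON cannot represent as object keys. That is why to_dict() flattens each mapping into a list of records. The standalone sketch below mirrors that flattening and the get_summary() counts on a hand-written `sources` mapping; the config and family names are illustrative only.

```python
import json
from typing import Optional

sources: dict[tuple[str, str], Optional[str]] = {
    ("fbn_8", "morphology"): None,
    ("fbn_16", "morphology"): "fbn_8",
    ("fbn_8", "glcm"): None,
    ("fbn_16", "glcm"): None,
}

# Tuple keys are not valid JSON object keys, so flatten to records first.
records = [
    {"config": config, "family": family, "source": source}
    for (config, family), source in sources.items()
]
payload = json.dumps(records)

# Rebuild the tuple-keyed mapping from the records.
restored = {(r["config"], r["family"]): r["source"] for r in json.loads(payload)}

# Same counting logic as get_summary().
computed = sum(1 for s in restored.values() if s is None)
reused = sum(1 for s in restored.values() if s is not None)
summary = {"computed": computed, "reused": reused, "total": computed + reused}

print(summary)  # {'computed': 3, 'reused': 1, 'total': 4}
```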

Rules Registry

The RULES_REGISTRY provides versioned deduplication rules for reproducibility:

pictologics.deduplication.RULES_REGISTRY = {'1.0.0': DEDUPLICATION_RULES_V1_0_0} module-attribute

Available Versions

| Version | Description |
| --- | --- |
| "1.0.0" | Initial rules defining feature family dependencies |

Helper Functions

pictologics.deduplication.get_default_rules()

Get the current default deduplication rules.

Source code in pictologics/deduplication.py
def get_default_rules() -> DeduplicationRules:
    """Get the current default deduplication rules."""
    return RULES_REGISTRY[CURRENT_RULES_VERSION]
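
The registry lookup can be sketched without pictologics as a plain dict keyed by version string. In this sketch the registry values are placeholder dicts rather than DeduplicationRules instances, and `CURRENT_RULES_VERSION = "1.0.0"` is an assumption based on the only version listed above.

```python
# Placeholder registry: the real one maps versions to DeduplicationRules objects.
RULES_REGISTRY: dict[str, dict[str, str]] = {
    "1.0.0": {"comparison_mode": "exact_params"},
}
CURRENT_RULES_VERSION = "1.0.0"  # assumed value


def get_version(version: str) -> dict[str, str]:
    """Return the rules for a version, or raise with the available versions."""
    if version not in RULES_REGISTRY:
        raise ValueError(
            f"Unknown deduplication rules version: {version}. "
            f"Available versions: {list(RULES_REGISTRY.keys())}"
        )
    return RULES_REGISTRY[version]


def get_default_rules() -> dict[str, str]:
    return RULES_REGISTRY[CURRENT_RULES_VERSION]


# An unknown version fails loudly instead of silently falling back.
try:
    get_version("9.9.9")
except ValueError as exc:
    message = str(exc)

print(message)
```

Pinning an explicit version when sharing configurations means that a later change of the default does not alter which families are treated as reusable.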

Feature Family Dependencies

The deduplication system understands which preprocessing steps affect which feature families:

| Feature Family | Relevant Preprocessing Steps |
| --- | --- |
| morphology | resample, binarize_mask, keep_largest_component |
| intensity | resample, resegment, filter_outliers, filter |
| spatial_intensity | Same as intensity |
| local_intensity | Same as intensity |
| histogram | resample, resegment, filter_outliers, filter, binarize_mask, keep_largest_component, discretise |
| ivh | Same as histogram (unless ivh_use_continuous=True, which removes discretise dependency) |
| texture (all subfamilies) | Same as histogram |

Filters Affect Intensity Features

When using image filters (LoG, Gabor, Wavelets, Laws, etc.), intensity features are computed from the filtered response map, not the original image. Therefore, different filter configurations will produce different intensity features and cannot be deduplicated.

Morphology features are not affected by filters since they are computed from the mask geometry, not intensity values. This means morphology can be computed once and reused across all filter configurations.

When two configurations share identical values for the relevant preprocessing steps of a feature family, that family is computed once and the result is reused.
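
To make the rule concrete, the sketch below filters each config's steps down to those relevant for a family and hashes only that subset. The step names and the {"step", "params"} layout follow this page; `family_signature`, `make_config`, and the parameter names `spacing` and `n_bins` are hypothetical, and the real `extract_relevant_steps` helper may behave differently.

```python
import hashlib
import json
from typing import Any

# Two rows of the dependency table above.
FAMILY_DEPENDENCIES = {
    "morphology": frozenset({"resample", "binarize_mask", "keep_largest_component"}),
    "histogram": frozenset({
        "resample", "resegment", "filter_outliers", "filter",
        "binarize_mask", "keep_largest_component", "discretise",
    }),
}


def family_signature(steps: list[dict[str, Any]], family: str) -> str:
    """Hash only the steps that affect the given family."""
    relevant = {
        s["step"]: s.get("params", {})
        for s in steps
        if s["step"] in FAMILY_DEPENDENCIES[family]
    }
    blob = json.dumps(relevant, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


def make_config(n_bins: int) -> list[dict[str, Any]]:
    # Two configs built this way differ only in discretisation.
    return [
        {"step": "resample", "params": {"spacing": 1.0}},
        {"step": "discretise", "params": {"n_bins": n_bins}},
    ]


fbn_8, fbn_32 = make_config(8), make_config(32)
same_morphology = family_signature(fbn_8, "morphology") == family_signature(fbn_32, "morphology")
same_histogram = family_signature(fbn_8, "histogram") == family_signature(fbn_32, "histogram")

print(same_morphology, same_histogram)  # True False
```

Morphology ignores the discretise step, so its signature matches across both configs and it can be reused; histogram depends on discretise, so it is computed for each.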


Integration with RadiomicsPipeline

The RadiomicsPipeline class integrates deduplication through these parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| deduplicate | bool | True | Enable/disable deduplication |
| deduplication_rules | str or DeduplicationRules | "1.0.0" | Rules version for reproducibility |

These settings are preserved during serialization (to_dict(), save_configs(), etc.) and restored during deserialization.

from pictologics import RadiomicsPipeline

# Access pipeline deduplication settings
pipeline = RadiomicsPipeline(deduplicate=True)

print(pipeline.deduplication_enabled)    # True
print(pipeline.deduplication_stats)      # Statistics after run()