Deduplication API
The deduplication module avoids redundant computation in multi-configuration radiomic feature extraction. When multiple configurations share preprocessing steps and differ only in discretization, the system identifies which feature families can be computed once and reused.
Overview
The deduplication system consists of four main components:
- DeduplicationRules: Defines which preprocessing steps affect which feature families
- PreprocessingSignature: Creates hashable representations of preprocessing states
- ConfigurationAnalyzer: Analyzes pipeline configurations to identify optimization opportunities
- DeduplicationPlan: Generates optimized execution plans
Quick Start
Enabled by Default
Deduplication is enabled by default (deduplicate=True), so there is nothing to switch on: create a pipeline and run multiple configurations.
from pictologics import RadiomicsPipeline
# Deduplication is enabled by default!
pipeline = RadiomicsPipeline() # deduplicate=True is the default
# Add multiple configurations with shared preprocessing
# ... (morphology/intensity computed once, reused across configs)
results = pipeline.run(image, mask, config_names=["config1", "config2", "config3"])
# Check performance statistics
print(pipeline.deduplication_stats)
For complete usage examples, see Case 7: Multi-configuration batch with deduplication.
How Results Are Handled
When deduplication reuses features from a previous configuration, the features are copied into the reusing configuration's results, so they are never empty or missing.
Result Behavior
| Scenario | Behavior |
|---|---|
| deduplicate=True (default) | Features computed once, then copied to all configs with matching preprocessing. All configs receive complete feature sets. |
| deduplicate=False | Features computed independently for each config. Same results, but slower. |
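The two rows above can be illustrated with a small standard-library sketch. It is not the pictologics implementation: `run_with_reuse`, the `compute` callable, and the string keys standing in for preprocessing state are all hypothetical names chosen for the illustration.

```python
from copy import deepcopy


def run_with_reuse(configs, compute, deduplicate=True):
    """Return (results, number_of_compute_calls).

    configs maps a config name to a hashable preprocessing key (hypothetical);
    compute is any callable that turns a key into a feature dict.
    """
    cache = {}
    results = {}
    calls = 0
    for name, key in configs.items():
        if deduplicate and key in cache:
            features = cache[key]  # reuse the earlier computation
        else:
            features = compute(key)
            calls += 1
            cache[key] = features
        # Every config receives its own full copy of the features.
        results[name] = deepcopy(features)
    return results, calls


configs = {"fbn_8": "shared", "fbn_16": "shared", "fbn_32": "shared"}
compute = lambda key: {"volume": 12.5}  # stand-in for a real feature computation

fast, fast_calls = run_with_reuse(configs, compute, deduplicate=True)
slow, slow_calls = run_with_reuse(configs, compute, deduplicate=False)
```

In this sketch both modes return equal results; only the number of calls to `compute` differs.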
Example: Results Structure
results = pipeline.run(image, mask, config_names=["fbn_8", "fbn_16", "fbn_32"])
# All configs have IDENTICAL morphology values (computed once, copied to others)
assert results["fbn_8"]["volume_mesh_ml_HTUR"] == results["fbn_16"]["volume_mesh_ml_HTUR"]
assert results["fbn_8"]["volume_mesh_ml_HTUR"] == results["fbn_32"]["volume_mesh_ml_HTUR"]
# Texture features DIFFER (depend on discretization)
assert results["fbn_8"]["glcm_joint_avg_d1_HTUR"] != results["fbn_32"]["glcm_joint_avg_d1_HTUR"]
Data Tables and Concatenation
When you concatenate results into a single DataFrame (e.g., for machine learning), every configuration row is complete, with no missing values caused by deduplication:
import pandas as pd
from pictologics import format_results
# Format results for each config
rows = []
for config_name, features in results.items():
    row = format_results({config_name: features}, fmt="wide", meta={"config": config_name})
    rows.append(row)
# Concatenate into single DataFrame - NO missing values!
df = pd.DataFrame(rows)
print(df.shape) # (3, N) - all rows complete
print(df.isna().sum().sum()) # 0 - no NaN values
DeduplicationRules
pictologics.deduplication.DeduplicationRules
dataclass
Defines which preprocessing steps affect each feature family.
This is a frozen (immutable) dataclass that specifies the dependencies between preprocessing steps and feature families. Rules are versioned to ensure reproducibility when sharing configurations.
Attributes:
| Name | Type | Description |
|---|---|---|
| version | str | Semantic version string for this rules definition. |
| family_dependencies | dict[str, frozenset[str]] | Mapping of feature family names to the set of preprocessing step names that affect their output. |
| ivh_discretization_dependent_unless | str | Condition under which IVH becomes independent of discretization (e.g., "ivh_use_continuous=True"). |
| comparison_mode | str | How to compare preprocessing parameters ("exact_params"). |
Source code in pictologics/deduplication.py
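As an illustration of the attribute table above (and not the library's actual source), a frozen dataclass with the same four fields might look like the following. The class name `SketchRules` and the default values are assumptions made for the sketch.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class SketchRules:
    """Hypothetical stand-in mirroring the documented attributes."""

    version: str
    family_dependencies: dict[str, frozenset[str]]
    ivh_discretization_dependent_unless: str = "ivh_use_continuous=True"  # assumed default
    comparison_mode: str = "exact_params"  # assumed default


rules = SketchRules(
    version="1.0.0",
    family_dependencies={
        "morphology": frozenset({"resample", "binarize_mask", "keep_largest_component"}),
    },
)

# frozen=True rejects attribute reassignment after construction.
try:
    rules.version = "2.0.0"
    mutated = True
except FrozenInstanceError:
    mutated = False
```

Note that `frozen=True` only blocks reassigning attributes; the dict held in `family_dependencies` is itself still a mutable object in this sketch, which is one reason to store frozensets as its values.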
PreprocessingSignature
pictologics.deduplication.PreprocessingSignature
dataclass
A hashable signature representing a preprocessing configuration.
Contains both a hash for fast comparison and the full JSON representation for human-readable debugging and logging.
Attributes:
| Name | Type | Description |
|---|---|---|
| hash | str | SHA256 hash of the normalized preprocessing steps. |
| json_repr | str | Full JSON string of the preprocessing steps. |
from_steps(steps)
classmethod
Create a signature from a list of (step_name, params) tuples.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| steps | list[tuple[str, dict[str, Any]]] | List of (step_name, params_dict) tuples, sorted by step name. | required |
Returns:
| Type | Description |
|---|---|
| PreprocessingSignature | A PreprocessingSignature with deterministic hash and JSON. |
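One plausible way to obtain a deterministic hash and JSON from a list of steps is to serialize with sorted keys and hash the resulting string. The sketch below assumes that normalization strategy; the class name `SketchSignature` and the exact JSON layout are hypothetical and may differ from what pictologics does.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class SketchSignature:
    """Hypothetical stand-in with the two documented attributes."""

    hash: str
    json_repr: str

    @classmethod
    def from_steps(cls, steps: list[tuple[str, dict[str, Any]]]) -> "SketchSignature":
        # sort_keys makes the JSON independent of dict insertion order.
        payload = [[name, params] for name, params in steps]
        json_repr = json.dumps(payload, sort_keys=True, separators=(",", ":"))
        digest = hashlib.sha256(json_repr.encode("utf-8")).hexdigest()
        return cls(hash=digest, json_repr=json_repr)


a = SketchSignature.from_steps([("resample", {"spacing": [1, 1, 1], "order": 3})])
b = SketchSignature.from_steps([("resample", {"order": 3, "spacing": [1, 1, 1]})])
c = SketchSignature.from_steps([("resample", {"spacing": [2, 2, 2], "order": 3})])
```

Here `a` and `b` differ only in parameter order and therefore share a hash, while `c` uses a different spacing and gets a different one.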
ConfigurationAnalyzer
pictologics.deduplication.ConfigurationAnalyzer
Analyzes multiple configurations to create a deduplication plan.
Compares preprocessing steps across configurations for each feature family and identifies which config/family pairs produce identical results.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| configs | dict[str, list[dict[str, Any]]] | Dict mapping config names to lists of step dicts. | required |
| rules | DeduplicationRules \| None | The DeduplicationRules to use (defaults to current version). | None |
__init__(configs, rules=None)
analyze()
Analyze configurations and create a deduplication plan.
Returns:
| Type | Description |
|---|---|
| DeduplicationPlan | A DeduplicationPlan mapping each config/family to its source. |
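The analysis can be pictured as a "first config seen wins" pass over every config/family pair. The following is a minimal sketch under stated assumptions, not the library's algorithm: the step dict shape `{"step": ..., "params": ...}`, the function name `sketch_analyze`, and the two-family dependency table are hypothetical.

```python
import json

# Hypothetical two-family dependency table used only by this sketch.
SKETCH_DEPENDENCIES = {
    "morphology": frozenset({"resample", "binarize_mask", "keep_largest_component"}),
    "texture": frozenset({
        "resample", "resegment", "filter_outliers", "filter",
        "binarize_mask", "keep_largest_component", "discretise",
    }),
}


def sketch_analyze(configs):
    """Map (config_name, family) to the config it can reuse, or None."""
    first_seen = {}  # (family, canonical steps) -> first config with that state
    sources = {}
    for config_name, steps in configs.items():
        for family, deps in SKETCH_DEPENDENCIES.items():
            # Keep only the steps that can change this family's output.
            relevant = [s for s in steps if s["step"] in deps]
            key = json.dumps(relevant, sort_keys=True)
            owner = first_seen.setdefault((family, key), config_name)
            sources[(config_name, family)] = None if owner == config_name else owner
    return sources


resample = {"step": "resample", "params": {"spacing": [1, 1, 1]}}
configs = {
    "fbn_8": [resample, {"step": "discretise", "params": {"n_bins": 8}}],
    "fbn_32": [resample, {"step": "discretise", "params": {"n_bins": 32}}],
}
sources = sketch_analyze(configs)
```

In this example `fbn_32` reuses morphology from `fbn_8` but computes its own texture features, because only the `discretise` parameters differ.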
DeduplicationPlan
pictologics.deduplication.DeduplicationPlan
dataclass
A plan describing which config/family pairs should compute vs. reuse.
Attributes:
| Name | Type | Description |
|---|---|---|
| rules | DeduplicationRules | The DeduplicationRules used to create this plan. |
| signatures | dict[tuple[str, str], PreprocessingSignature] | Mapping of (config_name, family) to PreprocessingSignature. |
| sources | dict[tuple[str, str], str \| None] | Mapping of (config_name, family) to source config name (or None if first). |
| configs_hash | str | Hash of the configs dict to detect modifications. |
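To make the `sources` and `configs_hash` attributes more concrete, here is a small sketch that works on plain dicts rather than on a real DeduplicationPlan. The helper names `summarize_sources` and `configs_fingerprint`, and the choice of SHA256 over sorted-key JSON for the fingerprint, are assumptions for illustration only.

```python
import hashlib
import json


def summarize_sources(sources):
    """Count pairs that compute (source is None) versus pairs that reuse."""
    computed = sum(1 for src in sources.values() if src is None)
    return {"computed": computed, "reused": len(sources) - computed}


def configs_fingerprint(configs):
    """Hash a configs dict so that later modifications can be detected."""
    canonical = json.dumps(configs, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


sources = {
    ("fbn_8", "morphology"): None,
    ("fbn_32", "morphology"): "fbn_8",
    ("fbn_8", "texture"): None,
    ("fbn_32", "texture"): None,
}
summary = summarize_sources(sources)

configs = {"fbn_8": [{"step": "discretise", "params": {"n_bins": 8}}]}
before = configs_fingerprint(configs)
configs["fbn_8"][0]["params"]["n_bins"] = 16  # modify a config after planning
after = configs_fingerprint(configs)
```

A plan whose stored fingerprint no longer matches the current configs would be stale and would need to be rebuilt.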
Rules Registry
The RULES_REGISTRY provides versioned deduplication rules for reproducibility:
pictologics.deduplication.RULES_REGISTRY = {'1.0.0': DEDUPLICATION_RULES_V1_0_0}
module-attribute
Available Versions
| Version | Description |
|---|---|
| "1.0.0" | Initial rules defining feature family dependencies |
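A versioned registry is typically consulted through a small resolver that accepts either a version string or a ready-made rules object. The sketch below shows that pattern with stand-in data; `SKETCH_REGISTRY`, `SKETCH_DEFAULT_VERSION`, and `resolve_rules` are hypothetical names, and the error behaviour for unknown versions is an assumption rather than documented pictologics behaviour.

```python
# Stand-in registry: plain dicts play the role of rules objects here.
SKETCH_REGISTRY = {"1.0.0": {"version": "1.0.0"}}
SKETCH_DEFAULT_VERSION = "1.0.0"


def resolve_rules(spec=None):
    """Return a rules object for None, a version string, or a rules object."""
    if spec is None:
        return SKETCH_REGISTRY[SKETCH_DEFAULT_VERSION]
    if isinstance(spec, str):
        if spec not in SKETCH_REGISTRY:
            known = ", ".join(sorted(SKETCH_REGISTRY))
            raise ValueError(f"Unknown rules version {spec!r}; known versions: {known}")
        return SKETCH_REGISTRY[spec]
    # Anything else is assumed to already be a rules object.
    return spec


default_rules = resolve_rules()
pinned_rules = resolve_rules("1.0.0")
```

Pinning an explicit version string in a shared configuration keeps reuse decisions reproducible even if a newer rules version later becomes the default.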
Helper Functions
pictologics.deduplication.get_default_rules()
Feature Family Dependencies
The deduplication system understands which preprocessing steps affect which feature families:
| Feature Family | Relevant Preprocessing Steps |
|---|---|
| morphology | resample, binarize_mask, keep_largest_component |
| intensity | resample, resegment, filter_outliers, filter |
| spatial_intensity | Same as intensity |
| local_intensity | Same as intensity |
| histogram | resample, resegment, filter_outliers, filter, binarize_mask, keep_largest_component, discretise |
| ivh | Same as histogram (unless ivh_use_continuous=True, which removes discretise dependency) |
| texture (all subfamilies) | Same as histogram |
Filters Affect Intensity Features
When using image filters (LoG, Gabor, Wavelets, Laws, etc.), intensity features are computed from the filtered response map, not the original image. Therefore, different filter configurations will produce different intensity features and cannot be deduplicated.
Morphology features are not affected by filters since they are computed from the mask geometry, not intensity values. This means morphology can be computed once and reused across all filter configurations.
When two configurations share identical values for the relevant preprocessing steps of a feature family, that family is computed once and the result is reused.
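The dependency table and the reuse rule can be expressed as a lookup plus a comparison of the relevant steps. This is a sketch under simplifying assumptions: each config is reduced to a hypothetical dict of step name to parameters, and `relevant_steps` and `can_reuse` are illustrative names, not pictologics functions.

```python
INTENSITY_STEPS = frozenset({"resample", "resegment", "filter_outliers", "filter"})
HISTOGRAM_STEPS = INTENSITY_STEPS | {"binarize_mask", "keep_largest_component", "discretise"}

FAMILY_STEPS = {
    "morphology": frozenset({"resample", "binarize_mask", "keep_largest_component"}),
    "intensity": INTENSITY_STEPS,
    "spatial_intensity": INTENSITY_STEPS,
    "local_intensity": INTENSITY_STEPS,
    "histogram": HISTOGRAM_STEPS,
    "ivh": HISTOGRAM_STEPS,
    "texture": HISTOGRAM_STEPS,
}


def relevant_steps(family, ivh_use_continuous=False):
    """Steps that can change a family's output, per the table above."""
    steps = FAMILY_STEPS[family]
    if family == "ivh" and ivh_use_continuous:
        steps = steps - {"discretise"}  # continuous IVH ignores discretization
    return steps


def can_reuse(config_a, config_b, family, ivh_use_continuous=False):
    """True when both configs agree on every relevant step (missing == missing)."""
    steps = relevant_steps(family, ivh_use_continuous)
    return all(config_a.get(step) == config_b.get(step) for step in steps)


base = {"resample": {"spacing": [1, 1, 1]}, "discretise": {"n_bins": 8}}
more_bins = {"resample": {"spacing": [1, 1, 1]}, "discretise": {"n_bins": 32}}
filtered = {**base, "filter": {"type": "log", "sigma": 1.5}}
```

With these inputs, `base` and `more_bins` can share morphology and intensity but not texture, while `base` and `filtered` can share morphology only.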
Integration with RadiomicsPipeline
The RadiomicsPipeline class integrates deduplication through these parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| deduplicate | bool | True | Enable/disable deduplication |
| deduplication_rules | str or DeduplicationRules | "1.0.0" | Rules version for reproducibility |
These settings are preserved during serialization (to_dict(), save_configs(), etc.) and restored during deserialization.