
Commit d6e9b9b

Lazy-loading during attribute access (#1882)
This PR changes the lazy-loading behaviour so that lazy-loading happens when a sample attribute is accessed rather than when the sample is fetched. This allows a more fine-grained approach where attributes such as images are fetched only if they are actually used. For instance, if a piece of code only needs to access the sample labels, the sample image will not be loaded, thus improving performance. E.g.:

```python
sample = dataset[0]  # The old behaviour would load the image here.
sample.label  # At this point, the old behaviour would already have loaded sample.image
              # even if it is never used. The new behaviour returns the label without
              # having to load the image, thus improving performance.
sample.image  # The new behaviour loads the image here.
```

To do so, we track which converters are needed for which attribute; when an attribute is accessed, we run the required converters. We distinguish between direct attributes (like labels), which require no converters, and lazy attributes. Direct attributes are available as plain object attributes, no magic. A lazy attribute behaves like a class `@property`, calling a function to evaluate the attribute value. Once a lazy attribute has been evaluated, the computed value is stored on the sample and from then on it behaves like a direct attribute (see the sketch after the change summary below).

The existing implementation of attribute renaming was problematic because the renaming converter was inserted as the last converter and every attribute had a dependency on it, which prevented per-attribute lazy loading. To fix this, this PR reworks attribute renaming to insert the converter as the first converter. Since this converter comes first, it can also be executed as a batch converter.

**Refactoring of converter pathfinding and lazy converter handling:**

* The logic for attribute renaming and deletion during schema conversion has been reworked. An initial attribute remapping step is now introduced at the start of the conversion process, and the post-processing steps for renaming/deletion have been removed for clarity and correctness (`_create_initial_renaming_converter` replaces `_create_post_processing_for_semantic`, with related changes in `_find_conversion_path_for_semantic`).
* The handling of lazy converters has been improved: instead of a flat list, lazy converters are now tracked as a dictionary mapping output attribute names to lists of converters, allowing more precise and efficient application of lazy conversions. This affects both the return type of `ConversionPaths` and the internal logic for separating batch and lazy converters.
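To make the caching mechanism concrete, here is a minimal, self-contained sketch. All names (`LazySample`, `decode_image`) are hypothetical stand-ins, not the actual `Sample` implementation; the real class wires the per-attribute chains up from the converter registry and Polars dataframes:

```python
from typing import Any, Callable, Dict, List


def decode_image(row: Dict[str, Any]) -> str:
    # Stand-in for an expensive image decode (hypothetical).
    return f"pixels-of-{row['image_path']}"


class LazySample:
    """Minimal sketch of per-attribute lazy loading (not the real Sample class)."""

    def __init__(self, row: Dict[str, Any], lazy: Dict[str, List[Callable]]) -> None:
        self._row = row
        self._lazy = lazy
        for name, value in row.items():
            setattr(self, name, value)  # direct attributes: plain values, no magic

    def __getattr__(self, name: str) -> Any:
        # Reached only when `name` is not already a plain instance attribute.
        if name.startswith("_") or name not in self._lazy:
            raise AttributeError(name)
        value: Any = self._row
        for converter in self._lazy[name]:  # run only this attribute's chain
            value = converter(value)
        setattr(self, name, value)  # cache: behaves like a direct attribute from now on
        return value


sample = LazySample({"label": 3, "image_path": "img0.png"}, {"image": [decode_image]})
print(sample.label)  # 3 -- no converter runs
print(sample.image)  # "pixels-of-img0.png" -- the chain runs once, then the result is cached
```

The key property is that an attribute's converter chain runs at most once; afterwards the attribute is an ordinary instance attribute with no lookup overhead.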
**Dataset and Sample class improvements:**

* The `Sample` and `Dataset` classes have been updated to support the new lazy converter structure, storing and exposing lazy converters as dictionaries keyed by attribute name. The initialization and `from_dataframe` logic have been updated accordingly.
* The `Sample` class now tracks applied converters and stores schema and dataframe references for dynamic property loading, enabling more flexible and efficient lazy evaluation.

**Other improvements and minor changes:**

* `AttributeRemapperConverter` now uses `rename()` to rename columns and keeps all other columns, instead of selecting only the mapped columns.
* The image converter now creates output columns with an explicit dtype derived from schema information, improving compatibility and correctness.
* Minor code cleanup, such as more precise `__repr__` output for `Sample` and updated imports.

### Checklist

- [ ] I have added tests to cover my changes or documented any manual tests.
- [ ] I have added the description of my changes into [CHANGELOG](https://github.com/open-edge-platform/datumaro/blob/develop/CHANGELOG.md).
- [ ] I have updated the [documentation](https://github.com/open-edge-platform/datumaro/tree/develop/docs) accordingly.

---------

Signed-off-by: Grégoire Payen de La Garanderie <[email protected]>
1 parent 0df650a commit d6e9b9b

File tree

6 files changed: +234 -197 lines changed


src/datumaro/experimental/converter_registry.py

Lines changed: 92 additions & 89 deletions
```diff
@@ -51,7 +51,7 @@ class ConversionPaths(NamedTuple):
     """

     batch_converters: List["Converter"]
-    lazy_converters: List["Converter"]
+    lazy_converters: Dict[str, List["Converter"]]


 @dataclass(frozen=True)
```
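For illustration, the new shape of the return value can be pictured with stub converters (hypothetical names; `Converter` and `ConversionPaths` below are local stand-ins mirroring the NamedTuple above):

```python
from typing import Dict, List, NamedTuple


class Converter:  # stub standing in for datumaro's Converter class
    def __init__(self, name: str) -> None:
        self.name = name


class ConversionPaths(NamedTuple):  # local mirror of the NamedTuple above
    batch_converters: List[Converter]
    lazy_converters: Dict[str, List[Converter]]


decode = Converter("decode_image")
paths = ConversionPaths(
    batch_converters=[Converter("remap_labels")],  # cheap, applied to the whole batch
    lazy_converters={"image": [decode], "image_info": [decode]},  # run on attribute access
)
print([c.name for c in paths.lazy_converters["image"]])  # ['decode_image']
```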
```diff
@@ -408,22 +408,16 @@ def get_to_types(self) -> dict[str, Type[Field]]:

     def convert(self, df: pl.DataFrame) -> pl.DataFrame:
         """
-        Select and rename columns according to column_map.
-        Columns not in the mapping are dropped.
+        Rename columns according to column_map and keep all other columns.

         Args:
             df: Input DataFrame

         Returns:
-            DataFrame with only the selected/renamed columns
+            DataFrame with renamed columns
         """
-        # Build selection expressions for all columns we want to keep
-        select_exprs = {}
-
-        for old_name, new_name in self.column_map.items():
-            select_exprs[new_name] = pl.col(old_name)
-
-        return df.select(**select_exprs)
+        # Apply all renames
+        return df.rename(self.column_map)

     def filter_output_spec(self) -> bool:
         """Always return True as renaming is always applicable."""
```
```diff
@@ -592,9 +586,9 @@ def _get_applicable_converters(
             output_field = field_type(semantic=semantic)
             output_categories = None

-            # Add the iteration count at the end to ensure uniqueness
-            # and avoid any conflict with existing attribute names
-            output_name = f"{output_name}_temp_{iteration}"
+        # Add the iteration count at the end to ensure uniqueness
+        # and avoid any conflict with existing attribute names
+        output_name = f"{output_name}_temp_{iteration}"

         output_attr_spec = AttributeSpec(
             name=output_name, field=output_field, categories=output_categories
```
```diff
@@ -655,63 +649,52 @@ def _group_fields_by_semantic(schema: Schema) -> dict[Semantic, _SchemaState]:
     }


-def _create_post_processing_for_semantic(
-    final_state: _SchemaState, target_state: _SchemaState
-) -> Tuple[List[Converter], _SchemaState]:
+def _create_initial_renaming_converter(
+    start_state: _SchemaState, target_state: _SchemaState
+) -> Tuple[Optional[AttributeRemapperConverter], _SchemaState]:
     """
-    Create post-processing converters for a single semantic group.
+    Create an initial AttributeRemapperConverter to handle renaming at the beginning.

     Args:
-        final_state: Final state reached after conversions
+        start_state: Starting state of the schema
         target_state: Target state for this semantic group

     Returns:
-        Tuple of (list of post-processing converters, updated_state_after_processing)
-        where updated_state_after_processing reflects the state after renaming/deletion
+        Tuple of (optional AttributeRemapperConverter, updated_start_state after renaming)
     """
-    # Build attribute mappings: include only attributes that should be kept
-    attr_mappings = []

-    # Determine if a converter is needed (i.e. at least one attribute has been renamed or deleted)
+    attr_mappings = []
     converter_needed = False
+    updated_field_to_attr_spec = dict(start_state.field_to_attr_spec)

-    for field_type, target_attr_spec in target_state.field_to_attr_spec.items():
-        if field_type in final_state.field_to_attr_spec:
-            final_attr_spec = final_state.field_to_attr_spec[field_type]
+    # Used to check for conflicts when renaming
+    used_names = set(
+        updated_field_to_attr_spec[field_type].name for field_type in updated_field_to_attr_spec
+    )

-            if final_attr_spec.name != target_attr_spec.name:
+    for field_type, start_attr_spec in start_state.field_to_attr_spec.items():
+        if field_type in target_state.field_to_attr_spec:
+            target_attr_spec = target_state.field_to_attr_spec[field_type]
+            if start_attr_spec.name != target_attr_spec.name:
                 converter_needed = True
+                new_name = target_attr_spec.name
+                # If the new name would conflict with another attribute in the target_state, use a temporary name
+                if new_name in used_names:
+                    new_name = f"{new_name}_temp_{id(start_attr_spec)}"
+                renamed_attr_spec = AttributeSpec(
+                    name=new_name,
+                    field=start_attr_spec.field,
+                    categories=start_attr_spec.categories,
+                )
+                attr_mappings.append((start_attr_spec, renamed_attr_spec))
+                updated_field_to_attr_spec[field_type] = renamed_attr_spec

-            # Add the mapping from final to target attribute spec
-            attr_mappings.append((final_attr_spec, target_attr_spec))
-
-    # Check if any fields need to be deleted (exist in final but not in target)
-    for field_type, final_attr_spec in final_state.field_to_attr_spec.items():
-        if field_type not in target_state.field_to_attr_spec:
-            # If the field is not in the target state, it should be deleted
-            converter_needed = True
-
-    # Create a single remapper converter that handles both renaming and deletion
     if converter_needed:
-        # Create the updated state after processing by applying the attr_mappings to final_state
-        # This preserves the categories from final_state but with the target names/structure
-        updated_field_to_attr_spec = {}
-
-        for final_attr_spec, target_attr_spec in attr_mappings:
-            # Use the target name and field type, but preserve categories from final state
-            updated_field_to_attr_spec[type(target_attr_spec.field)] = AttributeSpec(
-                name=target_attr_spec.name,
-                field=target_attr_spec.field,
-                categories=final_attr_spec.categories,  # Preserve inferred categories
-            )
-
-        updated_state_after_processing = _SchemaState(updated_field_to_attr_spec)
-        return [
-            AttributeRemapperConverter(attr_mappings=attr_mappings)
-        ], updated_state_after_processing
-
-    # No converter needed, return the final state as-is
-    return [], final_state
+        converter = AttributeRemapperConverter(attr_mappings=attr_mappings)
+        updated_start_state = _SchemaState(updated_field_to_attr_spec)
+        return converter, updated_start_state
+    else:
+        return None, start_state


 def _find_conversion_path_for_semantic(
```
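The temporary-name branch above guards against renaming collisions, for example when two attributes swap names. A small Polars illustration of the collision and the workaround (illustrative only; in the PR, subsequent conversion steps carry the temporary name forward in the schema state):

```python
import polars as pl

df = pl.DataFrame({"a": [1], "b": [2]})

# Swapping `a` and `b` one attribute at a time would collide: renaming `a` to `b`
# while `b` still exists is an error. Routing the first rename through a unique
# temporary name (as in f"{new_name}_temp_{id(...)}" above) avoids the clash:
tmp = f"b_temp_{id(df)}"
swapped = df.rename({"a": tmp}).rename({"b": "a"}).rename({tmp: "b"})
print(swapped.columns)  # ['b', 'a'] -- values swapped without a collision
```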
```diff
@@ -731,19 +714,25 @@ def _find_conversion_path_for_semantic(
     Raises:
         ConversionError: If no conversion path is found for this semantic
     """
-    # If we already have all required fields, check if we need renaming/deletion
-    if start_state == target_state:
-        return _create_post_processing_for_semantic(start_state, target_state)
+    # Apply initial renaming at the beginning if needed
+    initial_converter, effective_start_state = _create_initial_renaming_converter(
+        start_state, target_state
+    )
+    initial_converters = [initial_converter] if initial_converter else []
+
+    # If we already have all required fields after initial renaming, we might be done
+    if effective_start_state == target_state:
+        return initial_converters, target_state

-    # Initialize A* search
+    # Initialize A* search from the effective start state
     open_set: List[_SearchNode] = []
     closed_set: Set[_SchemaState] = set()

     start_node = _SearchNode(
-        state=start_state,
-        path=[],
-        g_cost=0,
-        h_cost=_heuristic_cost(start_state, target_state),
+        state=effective_start_state,
+        path=initial_converters,  # Add initial converters to the start node path
+        g_cost=len(initial_converters),  # Account for initial converters in cost
+        h_cost=_heuristic_cost(effective_start_state, target_state),
     )

     heapq.heappush(open_set, start_node)
```
```diff
@@ -758,11 +747,8 @@ def _find_conversion_path_for_semantic(

         # Check if we've reached the goal - all target fields must match exactly
         if _heuristic_cost(current_node.state, target_state) == 0:
-            # Add post-processing converters for final renaming and deletion
-            post_processing, final_state = _create_post_processing_for_semantic(
-                current_node.state, target_state
-            )
-            return current_node.path + post_processing, final_state
+            # We've reached the goal, return the path
+            return current_node.path, current_node.state

         # Explore neighbors
         for converter, new_state in _get_applicable_converters(
```
```diff
@@ -786,7 +772,7 @@ def _find_conversion_path_for_semantic(

     # No path found
     missing_fields = set(target_state.field_to_attr_spec.keys()) - set(
-        start_state.field_to_attr_spec.keys()
+        effective_start_state.field_to_attr_spec.keys()
     )
     raise ConversionError(
         f"No conversion path found for semantic {semantic}. " f"Missing fields: {missing_fields}"
```
```diff
@@ -820,8 +806,6 @@ def find_conversion_path(
     # Collect all converters needed across all semantic groups
     all_converters: List[Converter] = []

-    attr_mappings = []
-
     # Process each semantic group in the target schema
     for semantic, target_state in target_groups.items():
         # Get corresponding source state for this semantic (if any)
```
```diff
@@ -835,16 +819,8 @@ def find_conversion_path(
         # Update the target state with any inferred categories
         target_groups[semantic] = updated_target_state

-        # Merge all the attribute remappers into a single one. If any, the remapper is always the last step.
-        if semantic_converters and isinstance(semantic_converters[-1], AttributeRemapperConverter):
-            attr_mappings += semantic_converters[-1].attr_mappings
-            semantic_converters = semantic_converters[:-1]
-
         all_converters.extend(semantic_converters)

-    if attr_mappings:
-        all_converters.append(AttributeRemapperConverter(attr_mappings=attr_mappings))
-
     # Reconstruct the updated schema with inferred categories
     # Use the list of attributes from to_schema rather than just the target_groups
     # because the target_groups may include attributes which are deleted in the final to_schema.
```
```diff
@@ -869,6 +845,7 @@ def _separate_batch_and_lazy_converters(
     Separate converters into batch and lazy lists based on dependencies.

     If a converter is lazy, all converters that depend on its output must also be lazy.
+    Also tracks which lazy converters are required for each output attribute.

     Args:
         conversion_path: The complete conversion path from A* search
```
```diff
@@ -877,7 +854,7 @@ def _separate_batch_and_lazy_converters(
         ConversionPaths with separated batch and lazy converter lists
     """
     if not conversion_path:
-        return ConversionPaths(batch_converters=[], lazy_converters=[])
+        return ConversionPaths(batch_converters=[], lazy_converters={})

     # Track which converters must be lazy
     lazy_indices: Set[int] = set()
```
```diff
@@ -908,14 +885,40 @@ def _separate_batch_and_lazy_converters(
         for attr_spec in output_specs:
             lazy_fields[attr_spec.name] = True

-    # Separate into batch and lazy lists
+    # Collect batch converters (non-lazy ones)
     batch_converters: List[Converter] = []
-    lazy_converters: List[Converter] = []
-
     for i, converter in enumerate(conversion_path):
-        if i in lazy_indices:
-            lazy_converters.append(converter)
-        else:
+        if i not in lazy_indices:
             batch_converters.append(converter)

-    return ConversionPaths(batch_converters=batch_converters, lazy_converters=lazy_converters)
+    # Collect lazy converters by output attribute
+    lazy_converters_by_output: Dict[str, List[Converter]] = defaultdict(list)
+
+    # Iterate through converters in reverse to propagate output dependencies
+    # (the sets hold attribute names, hence Set[str])
+    dependents_by_output: Dict[str, Set[str]] = defaultdict(set)
+
+    for i, converter in reversed(list(enumerate(conversion_path))):
+        if i in lazy_indices:
+            # This is a lazy converter - track its outputs
+            dependents = set()
+
+            output_specs = converter.get_output_attr_specs()
+            for attr_spec in output_specs:
+                dependents.update(dependents_by_output.get(attr_spec.name, []))
+                dependents.add(attr_spec.name)
+
+            for dependent in dependents:
+                lazy_converters_by_output[dependent].append(converter)
+
+            # Propagate dependencies from outputs to inputs
+            input_specs = converter.get_input_attr_specs()
+            for input_spec in input_specs:
+                dependents_by_output[input_spec.name].update(dependents)
+
+    # Reverse all chains to get dependencies-first order
+    for output_name, chain in lazy_converters_by_output.items():
+        lazy_converters_by_output[output_name] = list(reversed(chain))
+
+    return ConversionPaths(
+        batch_converters=batch_converters, lazy_converters=lazy_converters_by_output
+    )
```
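The reverse pass can be traced on a toy chain. The sketch below re-runs the same bookkeeping with string stand-ins for converters: `c1` decodes `image_path` into `image`, `c2` derives `thumbnail` from `image`, both are lazy, so the chain for `thumbnail` must include `c1` before `c2` (names are hypothetical):

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Stub converters as (name, inputs, outputs); both are lazy in this example.
chain: List[Tuple[str, List[str], List[str]]] = [
    ("c1", ["image_path"], ["image"]),
    ("c2", ["image"], ["thumbnail"]),
]

lazy_by_output: Dict[str, List[str]] = defaultdict(list)
dependents_by_output: Dict[str, Set[str]] = defaultdict(set)

# Walk backwards: each lazy converter registers under its own outputs plus
# every attribute downstream of those outputs, then propagates to its inputs.
for name, inputs, outputs in reversed(chain):
    dependents: Set[str] = set()
    for out in outputs:
        dependents.update(dependents_by_output.get(out, set()))
        dependents.add(out)
    for dependent in dependents:
        lazy_by_output[dependent].append(name)
    for inp in inputs:
        dependents_by_output[inp].update(dependents)

# Reverse each chain into dependencies-first order.
for out, converters in lazy_by_output.items():
    lazy_by_output[out] = list(reversed(converters))

print(dict(lazy_by_output))
# {'thumbnail': ['c1', 'c2'], 'image': ['c1']}
```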

src/datumaro/experimental/converters.py

Lines changed: 8 additions & 4 deletions
```diff
@@ -321,19 +321,23 @@ def convert(self, df: pl.DataFrame) -> pl.DataFrame:

             # Convert to numpy array
             img_array = np.array(img, dtype=np.uint8)
-            image_data.append(img_array.flatten().tolist())
+            image_data.append(img_array.flatten())
             image_shapes.append(list(img_array.shape))

             # Create image info with just width and height
             image_infos.append(ImageInfo(width=img_array.shape[1], height=img_array.shape[0]))

         # Create output DataFrame
+        image_schema = self.output_image.field.to_polars_schema("image")
+        image_info_schema = self.output_image.field.to_polars_schema("image_info")
+
         result_df = df.clone()
+
         result_df = result_df.with_columns(
             [
-                pl.Series(output_col, image_data),
-                pl.Series(output_col + "_shape", image_shapes),
-                pl.Series(output_info_col, image_infos),
+                pl.Series(output_col, image_data, dtype=image_schema["image"]),
+                pl.Series(output_col + "_shape", image_shapes, dtype=image_schema["image_shape"]),
+                pl.Series(output_info_col, image_infos, dtype=image_info_schema["image_info"]),
             ]
         )
```
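The explicit `dtype` matters because Polars otherwise infers the column type from the Python values, which can disagree with the schema, most visibly for an empty batch. A minimal illustration (the dtypes here are examples, not datumaro's actual schema types):

```python
import polars as pl

# Inferred dtype depends on the data: an empty batch comes out as Null...
print(pl.Series("image", []).dtype)  # Null (in recent Polars versions)

# ...while an explicit dtype keeps the column schema-stable regardless of contents.
print(pl.Series("image", [], dtype=pl.List(pl.UInt8)).dtype)  # List(UInt8)
```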
