[BUG] Error preprocessing sample.save #116

@ccruizm

Describe the bug
I am trying to run Segger on CosMx data (6k genes). I created the nuclei masks and exported the tx_file in .parquet format, then followed the preprocessing script (https://github.com/EliHei2/segger_dev/blob/main/scripts/create_data_cosmx.py) to prepare the data before training.
When I tried to save the object, I got the following error:

/hpc/compgen/users/cruiz/miniconda3/envs/segger/lib/python3.11/site-packages/torch_geometric/transforms/random_link_split.py:245: UserWarning: There are not enough negative edges to satisfy the provided sampling ratio. The ratio will be adjusted to 0.43.
  warnings.warn(
/hpc/compgen/users/cruiz/miniconda3/envs/segger/lib/python3.11/site-packages/torch_geometric/transforms/random_link_split.py:245: UserWarning: There are not enough negative edges to satisfy the provided sampling ratio. The ratio will be adjusted to 0.64.
  warnings.warn(
/hpc/compgen/users/cruiz/miniconda3/envs/segger/lib/python3.11/site-packages/torch_geometric/transforms/random_link_split.py:245: UserWarning: There are not enough negative edges to satisfy the provided sampling ratio. The ratio will be adjusted to 0.68.
  warnings.warn(
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[7], line 7
      1 # Parameters:
      2 # - k_bd/dist_bd: Control nucleus boundary point connections
      3 # - k_tx/dist_tx: Control transcript neighborhood connections
      4 # - tile_width/height: Size of spatial tiles for processing
      5 # - neg_sampling_ratio: Ratio of negative to positive samples
      6 # - val_prob: Fraction of data for validation
----> 7 sample.save(
      8     data_dir=SEGGER_DATA_DIR,
      9     k_bd=3,  # Number of boundary points to connect
     10     dist_bd=15,  # Maximum distance for boundary connections
     11     k_tx=10,  # Use calculated optimal transcript neighbors
     12     dist_tx=10,  # Use calculated optimal search radius
     13     tile_width=200,  # Tile size for processing,
     14     tile_height=200,  # Tile size for processing
     15     neg_sampling_ratio=10.0,  # 5:1 negative:positive samples
     16     frac=1.0,  # Use all data
     17     val_prob=0.3,  # 30% validation set
     18     test_prob=0,  # No test set
     19 )

File /hpc/compgen/users/cruiz/miniconda3/envs/segger/lib/python3.11/site-packages/segger/data/parquet/sample.py:471, in STSampleParquet.save(self, data_dir, k_bd, dist_bd, k_tx, dist_tx, tile_size, tile_width, tile_height, neg_sampling_ratio, frac, val_prob, test_prob)
    469 outs = []
    470 for region in regions:
--> 471     outs.append(func(region))
    472 return outs

File /hpc/compgen/users/cruiz/miniconda3/envs/segger/lib/python3.11/site-packages/segger/data/parquet/sample.py:453, in STSampleParquet.save.<locals>.func(region)
    448 data_type = np.random.choice(
    449     a=["train_tiles", "test_tiles", "val_tiles"],
    450     p=[1 - (test_prob + val_prob), test_prob, val_prob],
    451 )
    452 xt = STTile(dataset=xm, extents=tile)
--> 453 pyg_data = xt.to_pyg_dataset(
    454     k_bd=k_bd,
    455     dist_bd=dist_bd,
    456     k_tx=k_tx,
    457     dist_tx=dist_tx,
    458     neg_sampling_ratio=neg_sampling_ratio,
    459 )
    460 if pyg_data is not None:
    461     if pyg_data["tx", "belongs", "bd"].edge_index.numel() == 0:
    462         # this tile is only for testing

File /hpc/compgen/users/cruiz/miniconda3/envs/segger/lib/python3.11/site-packages/segger/data/parquet/sample.py:1238, in STTile.to_pyg_dataset(self, neg_sampling_ratio, k_bd, dist_bd, k_tx, dist_tx, area, convexity, elongation, circularity)
   1235     polygons = gpd.GeoSeries(self.boundaries[geometry_column], index=self.boundaries.index)
   1236 else:
   1237     # Fallback: compute polygons
-> 1238     polygons = utils.get_polygons_from_xy(
   1239         self.boundaries,
   1240         x=self.settings.boundaries.x,
   1241         y=self.settings.boundaries.y,
   1242         label=self.settings.boundaries.label,
   1243         scale_factor=self.settings.boundaries.scale_factor,
   1244     )
   1246 # Ensure self.boundaries is a GeoDataFrame with correct geometry
   1247 self.boundaries = gpd.GeoDataFrame(self.boundaries.copy(), geometry=polygons)

File /hpc/compgen/users/cruiz/miniconda3/envs/segger/lib/python3.11/site-packages/segger/data/parquet/_utils.py:189, in get_polygons_from_xy(boundaries, x, y, label, scale_factor)
    186 part_offset = np.arange(len(np.unique(ids)) + 1)
    188 # Convert to GeoSeries of polygons
--> 189 polygons = shapely.from_ragged_array(
    190     shapely.GeometryType.POLYGON,
    191     coords=boundaries[[x, y]].values.copy(order="C"),
    192     offsets=(geometry_offset, part_offset),
    193 )
    194 gs = gpd.GeoSeries(polygons, index=np.unique(ids))
    196 # print(gs)

File /hpc/compgen/users/cruiz/miniconda3/envs/segger/lib/python3.11/site-packages/shapely/_ragged_array.py:467, in from_ragged_array(geometry_type, coords, offsets)
    465     return _linestring_from_flatcoords(coords, *offsets)
    466 elif geometry_type == GeometryType.POLYGON:
--> 467     return _polygon_from_flatcoords(coords, *offsets)
    468 elif geometry_type == GeometryType.MULTIPOINT:
    469     return _multipoint_from_flatcoords(coords, *offsets)

File /hpc/compgen/users/cruiz/miniconda3/envs/segger/lib/python3.11/site-packages/shapely/_ragged_array.py:400, in _polygon_from_flatcoords(coords, offsets1, offsets2)
    397 offsets2 = np.asarray(offsets2, dtype="int64")
    399 # recreate polygons
--> 400 result = _from_ragged_array_multi_linear(
    401     coords, offsets1, offsets2, geometry_type=GeometryType.POLYGON
    402 )
    403 return result

File shapely/_geometry_helpers.pyx:511, in shapely._geometry_helpers._from_ragged_array_multi_linear()

File shapely/_geometry_helpers.pyx:537, in shapely._geometry_helpers._from_ragged_array_multi_linear()

File shapely/_geometry_helpers.pyx:123, in shapely._geometry_helpers._create_simple_geometry_raise_error()

ValueError: A linearring requires at least 4 coordinates.

Expected behavior
`sample.save` should write the tiled data into the train, test, and validation folders used for training.
When I previously read CosMx data into a spatialdata object and saved the polygon data, incomplete polygons broke the code, and skipping them fixed the problem. Is this a similar issue, or something else?
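
To check whether incomplete polygons are the cause here as well, one option is to count the vertices per boundary label before calling `sample.save`. The snippet below is a minimal sketch, not Segger's own code: it assumes the boundaries are a table with one row per vertex, and the column names `cell_id`, `x`, and `y` as well as the helper `drop_short_boundaries` are hypothetical. The threshold of 4 comes from the shapely error message above.

```python
import pandas as pd

# Minimum number of coordinates per ring, taken from the shapely error message.
MIN_RING_COORDS = 4


def drop_short_boundaries(boundaries, label="cell_id", min_coords=MIN_RING_COORDS):
    """Drop every boundary whose label has fewer than `min_coords` vertex rows.

    Hypothetical helper: assumes one row per vertex and a label column
    identifying which boundary each vertex belongs to.
    """
    # Number of vertex rows per label, aligned back to each row.
    counts = boundaries[label].map(boundaries[label].value_counts())
    return boundaries[counts >= min_coords].reset_index(drop=True)


# Toy data: boundary 2 has only two vertices and cannot form a ring.
toy = pd.DataFrame(
    {
        "cell_id": [1, 1, 1, 1, 1, 2, 2, 3, 3, 3, 3],
        "x": [0.0, 1.0, 1.0, 0.0, 0.0, 5.0, 6.0, 2.0, 3.0, 3.0, 2.0],
        "y": [0.0, 0.0, 1.0, 1.0, 0.0, 5.0, 5.0, 2.0, 2.0, 3.0, 2.0],
    }
)

kept = drop_short_boundaries(toy)
dropped = sorted(set(toy["cell_id"]) - set(kept["cell_id"]))
print(f"dropped labels: {dropped}")
```

Because Segger builds polygons per tile, a boundary that is complete in the full table could still end up with too few vertices inside a single tile, so a check on the full table may not catch every case.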

Environment:

  • OS: Linux (HPC)
  • Python version: 3.11.13
  • Package version: 0.1.0
