Open
Description
Describe the bug
I am trying to run Segger on CosMx data (6k genes). I created the nuclei masks, exported the tx_file in .parquet format, and preprocessed the data before training by following the script at https://github.com/EliHei2/segger_dev/blob/main/scripts/create_data_cosmx.py.
When I tried to save the object with sample.save(), I got the following error:
/hpc/compgen/users/cruiz/miniconda3/envs/segger/lib/python3.11/site-packages/torch_geometric/transforms/random_link_split.py:245: UserWarning: There are not enough negative edges to satisfy the provided sampling ratio. The ratio will be adjusted to 0.43.
warnings.warn(
/hpc/compgen/users/cruiz/miniconda3/envs/segger/lib/python3.11/site-packages/torch_geometric/transforms/random_link_split.py:245: UserWarning: There are not enough negative edges to satisfy the provided sampling ratio. The ratio will be adjusted to 0.64.
warnings.warn(
/hpc/compgen/users/cruiz/miniconda3/envs/segger/lib/python3.11/site-packages/torch_geometric/transforms/random_link_split.py:245: UserWarning: There are not enough negative edges to satisfy the provided sampling ratio. The ratio will be adjusted to 0.68.
warnings.warn(
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[7], line 7
1 # Parameters:
2 # - k_bd/dist_bd: Control nucleus boundary point connections
3 # - k_tx/dist_tx: Control transcript neighborhood connections
4 # - tile_width/height: Size of spatial tiles for processing
5 # - neg_sampling_ratio: Ratio of negative to positive samples
6 # - val_prob: Fraction of data for validation
----> 7 sample.save(
8 data_dir=SEGGER_DATA_DIR,
9 k_bd=3, # Number of boundary points to connect
10 dist_bd=15, # Maximum distance for boundary connections
11 k_tx=10, # Use calculated optimal transcript neighbors
12 dist_tx=10, # Use calculated optimal search radius
13 tile_width=200, # Tile size for processing,
14 tile_height=200, # Tile size for processing
15 neg_sampling_ratio=10.0, # 5:1 negative:positive samples
16 frac=1.0, # Use all data
17 val_prob=0.3, # 30% validation set
18 test_prob=0, # No test set
19 )
File /hpc/compgen/users/cruiz/miniconda3/envs/segger/lib/python3.11/site-packages/segger/data/parquet/sample.py:471, in STSampleParquet.save(self, data_dir, k_bd, dist_bd, k_tx, dist_tx, tile_size, tile_width, tile_height, neg_sampling_ratio, frac, val_prob, test_prob)
469 outs = []
470 for region in regions:
--> 471 outs.append(func(region))
472 return outs
File /hpc/compgen/users/cruiz/miniconda3/envs/segger/lib/python3.11/site-packages/segger/data/parquet/sample.py:453, in STSampleParquet.save.<locals>.func(region)
448 data_type = np.random.choice(
449 a=["train_tiles", "test_tiles", "val_tiles"],
450 p=[1 - (test_prob + val_prob), test_prob, val_prob],
451 )
452 xt = STTile(dataset=xm, extents=tile)
--> 453 pyg_data = xt.to_pyg_dataset(
454 k_bd=k_bd,
455 dist_bd=dist_bd,
456 k_tx=k_tx,
457 dist_tx=dist_tx,
458 neg_sampling_ratio=neg_sampling_ratio,
459 )
460 if pyg_data is not None:
461 if pyg_data["tx", "belongs", "bd"].edge_index.numel() == 0:
462 # this tile is only for testing
File /hpc/compgen/users/cruiz/miniconda3/envs/segger/lib/python3.11/site-packages/segger/data/parquet/sample.py:1238, in STTile.to_pyg_dataset(self, neg_sampling_ratio, k_bd, dist_bd, k_tx, dist_tx, area, convexity, elongation, circularity)
1235 polygons = gpd.GeoSeries(self.boundaries[geometry_column], index=self.boundaries.index)
1236 else:
1237 # Fallback: compute polygons
-> 1238 polygons = utils.get_polygons_from_xy(
1239 self.boundaries,
1240 x=self.settings.boundaries.x,
1241 y=self.settings.boundaries.y,
1242 label=self.settings.boundaries.label,
1243 scale_factor=self.settings.boundaries.scale_factor,
1244 )
1246 # Ensure self.boundaries is a GeoDataFrame with correct geometry
1247 self.boundaries = gpd.GeoDataFrame(self.boundaries.copy(), geometry=polygons)
File /hpc/compgen/users/cruiz/miniconda3/envs/segger/lib/python3.11/site-packages/segger/data/parquet/_utils.py:189, in get_polygons_from_xy(boundaries, x, y, label, scale_factor)
186 part_offset = np.arange(len(np.unique(ids)) + 1)
188 # Convert to GeoSeries of polygons
--> 189 polygons = shapely.from_ragged_array(
190 shapely.GeometryType.POLYGON,
191 coords=boundaries[[x, y]].values.copy(order="C"),
192 offsets=(geometry_offset, part_offset),
193 )
194 gs = gpd.GeoSeries(polygons, index=np.unique(ids))
196 # print(gs)
File /hpc/compgen/users/cruiz/miniconda3/envs/segger/lib/python3.11/site-packages/shapely/_ragged_array.py:467, in from_ragged_array(geometry_type, coords, offsets)
465 return _linestring_from_flatcoords(coords, *offsets)
466 elif geometry_type == GeometryType.POLYGON:
--> 467 return _polygon_from_flatcoords(coords, *offsets)
468 elif geometry_type == GeometryType.MULTIPOINT:
469 return _multipoint_from_flatcoords(coords, *offsets)
File /hpc/compgen/users/cruiz/miniconda3/envs/segger/lib/python3.11/site-packages/shapely/_ragged_array.py:400, in _polygon_from_flatcoords(coords, offsets1, offsets2)
397 offsets2 = np.asarray(offsets2, dtype="int64")
399 # recreate polygons
--> 400 result = _from_ragged_array_multi_linear(
401 coords, offsets1, offsets2, geometry_type=GeometryType.POLYGON
402 )
403 return result
File shapely/_geometry_helpers.pyx:511, in shapely._geometry_helpers._from_ragged_array_multi_linear()
File shapely/_geometry_helpers.pyx:537, in shapely._geometry_helpers._from_ragged_array_multi_linear()
File shapely/_geometry_helpers.pyx:123, in shapely._geometry_helpers._create_simple_geometry_raise_error()
ValueError: A linearring requires at least 4 coordinates.
Expected behavior
The tiled data should be saved into the train, test, and validation folders for the training step.
When I previously read CosMx data into a spatialdata object and saved the polygon data, incomplete polygons broke the code, and skipping them fixed the problem. Is this a similar issue, or something else?
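A minimal sketch of the kind of skip described above, in case it helps narrow this down. It assumes the boundary table has one row per vertex with a label column identifying the nucleus. The column name `cell_id`, the helper `drop_short_boundaries`, and the threshold of 4 rows are illustrative assumptions (the threshold is chosen conservatively to match the error message); none of this is part of Segger's API.

```python
import pandas as pd


def drop_short_boundaries(boundaries, label="cell_id", min_vertices=4):
    """Drop every boundary that has fewer than `min_vertices` vertex rows.

    Hypothetical helper: `label` and `min_vertices` are assumptions, not
    Segger settings. Returns the filtered table and the dropped labels.
    """
    # Number of vertex rows per label, broadcast back onto each row.
    counts = boundaries[label].map(boundaries[label].value_counts())
    keep = counts >= min_vertices
    dropped = boundaries.loc[~keep, label].unique().tolist()
    return boundaries.loc[keep].reset_index(drop=True), dropped


# Toy table: cell 1 is a closed square (5 rows), cell 2 has only 2 vertices.
toy = pd.DataFrame(
    {
        "cell_id": [1, 1, 1, 1, 1, 2, 2],
        "x": [0.0, 1.0, 1.0, 0.0, 0.0, 5.0, 6.0],
        "y": [0.0, 0.0, 1.0, 1.0, 0.0, 5.0, 5.0],
    }
)
clean, dropped = drop_short_boundaries(toy)
print(dropped)
```

If the dropped list is non-empty for the real boundary file, that would suggest the same incomplete-polygon problem as in the spatialdata case.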
Environment (please complete the following information):
- OS: Linux (HPC)
- Python version: 3.11.13
- Package version: 0.1.0