Commit 4d26c49
BWMac and BryanFauble authored
[SYNPY-1578] DatasetCollection OOP Model (#1189)
* adds initial DatasetCollection implementation
* adds unit tests
* pre-commit
* updates docstrings
* adds integration tests
* adds docs pages
* removes example script section from dataset documentation
* adds dataset collection tutorial
* fixes tutorial script
* adds tutorial path to mkdocs.yml
* bullet points
* fixes tutorial code lines
* fixes tutorial references
* test doc format fix
* fixes dataset docs
* fixes sync integration tests
* fixes DatasetCollection docstrings
* refactors entity factory
* fixes argument error
* updates test strings
* pre-commit
* Update docs/tutorials/python/dataset_collection.md

  Co-authored-by: BryanFauble <[email protected]>
* updates tutorials
* removes elif block
* pre-commit
* removes unused cleanup
* updates version handling and tests
* fix async tests
* addresses comments
* fixes docstrings
* adds retry logic for uncaught async jobs
* set max on timeout
* addresses comments
* updates unit test for version num
* fixes incorrect line number
* adds missing snapshot tests
* corrects type hint

---------

Co-authored-by: BryanFauble <[email protected]>
1 parent c70c634 commit 4d26c49

21 files changed: +3037 -456 lines changed

Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
1+
# Dataset Collection
2+
3+
Contained within this file are experimental interfaces for working with the Synapse Python
4+
Client. Unless otherwise noted these interfaces are subject to change at any time. Use
5+
at your own risk.
6+
7+
## API reference
8+
9+
::: synapseclient.models.DatasetCollection
10+
options:
11+
inherited_members: true
12+
members:
13+
- add_item_async
14+
- remove_item_async
15+
- store_async
16+
- get_async
17+
- delete_async
18+
- update_rows_async
19+
- snapshot_async
20+
- query_async
21+
- query_part_mask_async
22+
- add_column
23+
- delete_column
24+
- reorder_column
25+
- rename_column
26+
- get_permissions
27+
- get_acl
28+
- set_permissions
29+
---
30+
::: synapseclient.models.EntityRef
31+
---

docs/reference/experimental/sync/dataset.md

Lines changed: 0 additions & 10 deletions
@@ -4,16 +4,6 @@ Contained within this file are experimental interfaces for working with the Syna
44
Client. Unless otherwise noted these interfaces are subject to change at any time. Use
55
at your own risk.
66

7-
## Example Script:
8-
9-
<details class="quote">
10-
<summary>Working with Synapse datasets</summary>
11-
12-
```python
13-
{!docs/scripts/object_orientated_programming_poc/oop_poc_dataset.py!}
14-
```
15-
</details>
16-
177
## API reference
188

199
::: synapseclient.models.Dataset
Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
1+
# Dataset Collection
2+
3+
Contained within this file are experimental interfaces for working with the Synapse Python
4+
Client. Unless otherwise noted these interfaces are subject to change at any time. Use
5+
at your own risk.
6+
7+
## API reference
8+
9+
::: synapseclient.models.DatasetCollection
10+
options:
11+
inherited_members: true
12+
members:
13+
- add_item
14+
- remove_item
15+
- store
16+
- get
17+
- delete
18+
- update_rows
19+
- snapshot
20+
- query
21+
- query_part_mask
22+
- add_column
23+
- delete_column
24+
- reorder_column
25+
- rename_column
26+
- get_permissions
27+
- get_acl
28+
- set_permissions
29+
---
30+
::: synapseclient.models.EntityRef
31+
---

docs/tutorials/python/dataset.md

Lines changed: 12 additions & 12 deletions
@@ -1,7 +1,7 @@
11
# Datasets
22
Datasets in Synapse are a way to organize, annotate, and publish sets of files for others to use. Datasets behave similarly to Tables and EntityViews, but provide some default behavior that makes it easy to put a group of files together.
33

4-
This tutorial will walk through basics of working with datasets using the Synapse Python client.
4+
This tutorial will walk through the basics of working with datasets using the Synapse Python Client.
55

66
# Tutorial Purpose
77
In this tutorial, you will:
@@ -29,15 +29,15 @@ In this tutorial, you will:
2929
Let's get started by authenticating with Synapse and retrieving the ID of your project.
3030
3131
```python
32-
{!docs/tutorials/python/tutorial_scripts/dataset.py!lines=17-23}
32+
{!docs/tutorials/python/tutorial_scripts/dataset.py!lines=3-24}
3333
```
3434

3535
## 2. Create your Dataset
3636

3737
Next, we will create the dataset. We will use the project ID to tell Synapse where we want the dataset to be created. After this step, we will have a Dataset object with all of the needed information to start building the dataset.
3838

3939
```python
40-
{!docs/tutorials/python/tutorial_scripts/dataset.py!lines=27-28}
40+
{!docs/tutorials/python/tutorial_scripts/dataset.py!lines=29-30}
4141
```
4242

4343
Because we haven't added any files to the dataset yet, it will be empty, but if you view the dataset's schema in the UI, you will notice that datasets come with default columns that help to describe each file that we add to the dataset.
@@ -50,20 +50,20 @@ Let's add some files to the dataset now. There are three ways to add files to a
5050

5151
1. Add an Entity Reference to a file with its ID and version
5252
```python
53-
{!docs/tutorials/python/tutorial_scripts/dataset.py!lines=32-34}
53+
{!docs/tutorials/python/tutorial_scripts/dataset.py!lines=34-36}
5454
```
5555
2. Add a File with its ID and version
5656
```python
57-
{!docs/tutorials/python/tutorial_scripts/dataset.py!lines=36-38}
57+
{!docs/tutorials/python/tutorial_scripts/dataset.py!lines=38-40}
5858
```
5959
3. Add a Folder. When adding a folder, all child files inside of the folder are added to the dataset recursively.
6060
```python
61-
{!docs/tutorials/python/tutorial_scripts/dataset.py!lines=40-42}
61+
{!docs/tutorials/python/tutorial_scripts/dataset.py!lines=42-44}
6262
```
6363

6464
Whenever we make changes to the dataset, we need to call the `store()` method to save the changes to Synapse.
6565
```python
66-
{!docs/tutorials/python/tutorial_scripts/dataset.py!lines=44}
66+
{!docs/tutorials/python/tutorial_scripts/dataset.py!lines=46}
6767
```
6868

6969
And now we are able to see our dataset with all of the files that we added to it.
@@ -75,37 +75,37 @@ And now we are able to see our dataset with all of the files that we added to it
7575
Now that we have a dataset with some files in it, we can retrieve the dataset from Synapse the next time we need to use it.
7676

7777
```python
78-
{!docs/tutorials/python/tutorial_scripts/dataset.py!lines=48-50}
78+
{!docs/tutorials/python/tutorial_scripts/dataset.py!lines=50-52}
7979
```
8080

8181
## 5. Query the dataset
8282

8383
Now that we have a dataset with some files in it, we can query the dataset to find files that match certain criteria.
8484

8585
```python
86-
{!docs/tutorials/python/tutorial_scripts/dataset.py!lines=54-57}
86+
{!docs/tutorials/python/tutorial_scripts/dataset.py!lines=56-59}
8787
```
8888

8989
## 6. Add a custom column to the dataset
9090

9191
We can also add a custom column to the dataset. This will allow us to annotate files in the dataset with additional information.
9292

9393
```python
94-
{!docs/tutorials/python/tutorial_scripts/dataset.py!lines=61-67}
94+
{!docs/tutorials/python/tutorial_scripts/dataset.py!lines=63-69}
9595
```
9696

9797
Our custom column isn't all that useful empty, so let's update the dataset with some values.
9898

9999
```python
100-
{!docs/tutorials/python/tutorial_scripts/dataset.py!lines=70-78}
100+
{!docs/tutorials/python/tutorial_scripts/dataset.py!lines=72-80}
101101
```
102102

103103
## 7. Save a snapshot of the dataset
104104

105105
Finally, let's save a snapshot of the dataset. This creates a read-only version of the dataset that captures the current state of the dataset and can be referenced later.
106106

107107
```python
108-
{!docs/tutorials/python/tutorial_scripts/dataset.py!lines=82-86}
108+
{!docs/tutorials/python/tutorial_scripts/dataset.py!lines=84-88}
109109
```
110110

111111
## Source Code for this Tutorial
Lines changed: 112 additions & 0 deletions
@@ -0,0 +1,112 @@
1+
# Dataset Collections
2+
Dataset Collections are a way to organize, annotate, and publish sets of datasets for others to use. Dataset Collections behave similarly to Tables and EntityViews, but provide some default behavior that makes it easy to put a group of datasets together.
3+
4+
This tutorial will walk through the basics of working with Dataset Collections using the Synapse Python Client.
5+
6+
# Tutorial Purpose
7+
In this tutorial, you will:
8+
9+
- Create a Dataset Collection
10+
- Add datasets to the collection
11+
- Add a custom column to the collection
12+
- Update the collection with new annotations
13+
- Query the collection
14+
- Save a snapshot of the collection
15+
16+
# Prerequisites
17+
* This tutorial assumes that you have a project in Synapse and have already created datasets that you would like to add to a Dataset Collection.
18+
* If you need help creating datasets, you can refer to the [dataset tutorial](./dataset.md).
19+
* Pandas must be installed as shown in the [installation documentation](../installation.md).
20+
21+
## 1. Get the ID of your Synapse project
22+
23+
Let's get started by authenticating with Synapse and retrieving the ID of your project.
24+
25+
```python
26+
{!docs/tutorials/python/tutorial_scripts/dataset_collection.py!lines=3-16}
27+
```
28+
29+
## 2. Create your Dataset Collection
30+
31+
Next, we will create the Dataset Collection using the project ID to tell Synapse where we want the Dataset Collection to be created. After this step, we will have a Dataset Collection object with all of the necessary information to start building the collection.
32+
33+
```python
34+
{!docs/tutorials/python/tutorial_scripts/dataset_collection.py!lines=25-33}
35+
```
36+
37+
Because we haven't added any datasets to the collection yet, it will be empty, but if you view the Dataset Collection's schema in the UI, you will notice that Dataset Collections come with default columns.
38+
39+
![Dataset Collection Default Schema](./tutorial_screenshots/dataset_collection_default_schema.png)
40+
41+
## 3. Add Datasets to the Dataset Collection
42+
43+
Now, let's add some datasets to the collection. We will loop through our dataset IDs and add each dataset to the collection using the `add_item` method.
44+
45+
```python
46+
{!docs/tutorials/python/tutorial_scripts/dataset_collection.py!lines=37-38}
47+
```
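
The include directive above pulls specific lines out of the tutorial script, so each range has to line up with the script exactly. The following is a minimal sketch of that selection, assuming ranges are 1-indexed and inclusive (consistent with `lines=37-38` matching the `for` loop at lines 37 and 38 of `dataset_collection.py` in this commit). `select_lines` and the sample text are hypothetical and are not part of the documentation tooling.

```python
def select_lines(text: str, start: int, end: int) -> list[str]:
    """Return lines start..end of text, counting from 1 and including both ends."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    # Python slices are 0-indexed and exclude the end, hence start - 1 and end.
    return text.splitlines()[start - 1 : end]


# Hypothetical 40-line stand-in for a tutorial script.
script = "\n".join(f"line {n}" for n in range(1, 41))

print(select_lines(script, 37, 38))
```

Running a check like this against the real script after editing it can catch ranges that have drifted when lines were added or removed.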
48+
49+
Whenever we make changes to the Dataset Collection, we need to call the `store()` method to save the changes to Synapse.
50+
51+
```python
52+
{!docs/tutorials/python/tutorial_scripts/dataset_collection.py!lines=40}
53+
```
54+
55+
And now we are able to see our Dataset Collection with all of the datasets that we added to it.
56+
57+
![Dataset Collection with Datasets](./tutorial_screenshots/dataset_collection_with_datasets.png)
58+
59+
## 4. Retrieve the Dataset Collection
60+
61+
Now that our Dataset Collection has been created and we have added some Datasets to it, we can retrieve the Dataset Collection from Synapse the next time we need to use it.
62+
63+
```python
64+
{!docs/tutorials/python/tutorial_scripts/dataset_collection.py!lines=44-46}
65+
```
66+
67+
## 5. Add a custom column to the Dataset Collection
68+
69+
In addition to the default columns, you may want to annotate items in your Dataset Collection using custom columns.
70+
71+
```python
72+
{!docs/tutorials/python/tutorial_scripts/dataset_collection.py!lines=50-56}
73+
```
74+
75+
Our custom column isn't all that useful empty, so let's update the Dataset Collection with some values.
76+
77+
```python
78+
{!docs/tutorials/python/tutorial_scripts/dataset_collection.py!lines=59-67}
79+
```
80+
81+
## 6. Query the Dataset Collection
82+
83+
If you want to query your Dataset Collection for items that match certain criteria, you can do so using the `query` method.
84+
85+
```python
86+
{!docs/tutorials/python/tutorial_scripts/dataset_collection.py!lines=71-74}
87+
```
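
The query passed to the `query` method is an ordinary string built from the collection's ID. As a rough sketch, the string can be assembled and inspected before it is sent to Synapse. `build_annotation_query` is a hypothetical helper and `syn123` is a placeholder ID; neither comes from the Synapse Python Client.

```python
def build_annotation_query(table_id: str, column: str, value: str) -> str:
    """Build a SELECT that returns id and one column, filtered on that column."""
    return f"SELECT id, {column} FROM {table_id} WHERE {column} = '{value}'"


# "syn123" is a placeholder; use your Dataset Collection's ID instead.
query = build_annotation_query("syn123", "my_annotation", "good dataset")
print(query)
# SELECT id, my_annotation FROM syn123 WHERE my_annotation = 'good dataset'
```

The helper does no escaping, so it is only suitable for trusted, hard-coded values like the one used in this tutorial.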
88+
89+
## 7. Save a snapshot of the Dataset Collection
90+
91+
Finally, let's save a snapshot of the Dataset Collection. This creates a read-only version that captures the collection's current state and can be referenced later.
92+
93+
```python
94+
{!docs/tutorials/python/tutorial_scripts/dataset_collection.py!lines=77}
95+
```
96+
97+
## Source Code for this Tutorial
98+
99+
<details class="quote">
100+
<summary>Click to show me</summary>
101+
102+
```python
103+
{!docs/tutorials/python/tutorial_scripts/dataset_collection.py!}
104+
```
105+
</details>
106+
107+
## References
108+
- [DatasetCollection](../../reference/experimental/sync/dataset_collection.md)
109+
- [Dataset](../../reference/experimental/sync/dataset.md)
110+
- [Project](../../reference/experimental/sync/project.md)
111+
- [Column][synapseclient.models.Column]
112+
- [syn.login][synapseclient.Synapse.login]
[Two binary image files added: 146 KB and 121 KB]

docs/tutorials/python/tutorial_scripts/dataset.py

Lines changed: 4 additions & 2 deletions
@@ -17,9 +17,11 @@
1717
syn = Synapse()
1818
syn.login()
1919

20-
project = Project(name="My Testing Project").get() # Replace with your project name
20+
project = Project(
21+
    name="My uniquely named project about Alzheimer's Disease"
22+
).get() # Replace with your project name
2123
project_id = project.id
22-
print(project_id)
24+
print(f"My project ID is {project_id}")
2325

2426
# Next, let's create the dataset. We'll use the project id as the parent id.
2527
# To begin, the dataset will be empty, but if you view the dataset's schema in the UI,
Lines changed: 77 additions & 0 deletions
@@ -0,0 +1,77 @@
1+
"""Here is where you'll find the code for the DatasetCollection tutorial."""
2+
3+
import pandas as pd
4+
5+
from synapseclient import Synapse
6+
from synapseclient.models import Column, ColumnType, Dataset, DatasetCollection, Project
7+
8+
# First, let's get the project that we want to create the DatasetCollection in
9+
syn = Synapse()
10+
syn.login()
11+
12+
project = Project(
13+
    name="My uniquely named project about Alzheimer's Disease"
14+
).get() # Replace with your project name
15+
project_id = project.id
16+
print(f"My project ID is {project_id}")
17+
18+
# This tutorial assumes that you have already created datasets that you would like to add to a DatasetCollection.
19+
# If you need help creating datasets, you can refer to the dataset tutorial.
20+
21+
# For this example, we will be using datasets already created in the project.
22+
# Let's create the DatasetCollection. We'll use the project id as the parent id.
23+
# At first, the DatasetCollection will be empty, but if you view the DatasetCollection's schema in the UI,
24+
# you will notice that DatasetCollections come with default columns.
25+
DATASET_IDS = [
26+
    "syn65987017",
27+
    "syn65987019",
28+
    "syn65987020",
29+
]  # Replace with your dataset IDs
30+
test_dataset_collection = DatasetCollection(
31+
    parent_id=project_id, name="test_dataset_collection"
32+
).store()
33+
print(f"My DatasetCollection's ID is {test_dataset_collection.id}")
34+
35+
# Now, let's add some datasets to the collection. We will loop through our dataset ids and add each dataset to the
36+
# collection using the `add_item` method.
37+
for dataset_id in DATASET_IDS:
38+
    test_dataset_collection.add_item(Dataset(id=dataset_id).get())
39+
# Our changes won't be persisted to Synapse until we call the `store` method on our DatasetCollection.
40+
test_dataset_collection.store()
41+
42+
# Now that our DatasetCollection with all of our datasets has been created, the next time we want to use it,
43+
# we can retrieve it from Synapse.
44+
my_retrieved_dataset_collection = DatasetCollection(id=test_dataset_collection.id).get()
45+
print(f"My DatasetCollection's ID is still {my_retrieved_dataset_collection.id}")
46+
print(f"My DatasetCollection has {len(my_retrieved_dataset_collection.items)} items")
47+
48+
# In addition to the default columns, you may want to annotate items in your DatasetCollection using
49+
# custom columns.
50+
my_retrieved_dataset_collection.add_column(
51+
    column=Column(
52+
        name="my_annotation",
53+
        column_type=ColumnType.STRING,
54+
    )
55+
)
56+
my_retrieved_dataset_collection.store()
57+
58+
# Now that our custom column has been added, we can update the DatasetCollection with new annotations.
59+
modified_data = pd.DataFrame(
60+
    {
61+
        "id": DATASET_IDS,
62+
        "my_annotation": ["good dataset"] * len(DATASET_IDS),
63+
    }
64+
)
65+
my_retrieved_dataset_collection.update_rows(
66+
    values=modified_data, primary_keys=["id"], dry_run=False
67+
)
68+
69+
# If you want to query your DatasetCollection for items that match certain criteria, you can do so
70+
# using the `query` method.
71+
rows = my_retrieved_dataset_collection.query(
72+
    query=f"SELECT id, my_annotation FROM {my_retrieved_dataset_collection.id} WHERE my_annotation = 'good dataset'"
73+
)
74+
print(rows)
75+
76+
# Create a snapshot of the DatasetCollection
77+
my_retrieved_dataset_collection.snapshot(comment="test snapshot")
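
The `my_annotation` column in the script needs one value per entry in `DATASET_IDS`, so that both DataFrame columns have the same length. The following is a small sketch in plain Python (no Synapse or pandas calls) of how the position of `*` changes the result. The variable names `one_long_value` and `one_value_per_id` are illustrative only.

```python
DATASET_IDS = ["syn65987017", "syn65987019", "syn65987020"]

# String repetition inside a list: a single element whose text is repeated.
one_long_value = ["good dataset" * len(DATASET_IDS)]

# List repetition: one "good dataset" entry for each dataset ID.
one_value_per_id = ["good dataset"] * len(DATASET_IDS)

print(len(one_long_value))    # 1
print(len(one_value_per_id))  # 3
```

Only the second form yields a list whose length matches `DATASET_IDS`.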
