Commit 1fbc248

BWMac and BryanFauble authored
[SYNPY-1571] Adds Dataset Model & Introduces Composition Model for Table/View-like Classes (#1175)
* add and remove item logic
* fixes docstrings
* adds default column logic
* update default column handling
* adds ViewOperator and improved default column handling
* updates Dataset
* addresses comments
* adds integration test outline
* adds WIP demo script for testing
* [SYNPY-1575] Support waiting for eventually consistent view after changes are made (#1176)
* fix table test docstring
* initial composition approach
* updates type hint and bit mask handling
* update to `Base` naming
* updates to allow none bit masks/view types
* adds decorators to base classes
* adds delete
* updates demo
* leave=None for tqdm bar
* adds snapshot functionality
* adds ViewUpdateMixin
* adds reimplementation example
* patch for flipped eventual consistency check
* support downloading query results to CSV instead of returning as DataFrame
* updates async_to_sync so that we don't have to use protocols
* adds sync Dataset interface
* updates demo script
* [SYNPY-1571] Migrate table to new mixin structure (#1179)
* pull functions out into new mixin model for table/views
* updates demo script
* hide upsert method from showing in view type entities and finish sync side of Table
* patch isort flagged issue
* update ref
* updates docstrings
* pre-commit fix
* adds Dataset docs pages
* updates mkdocs.yml
* Update synapseclient/models/mixins/table_components.py (review suggestions; co-authored by BryanFauble)
* fixes dataset docstrings
* adds dataset creation tests
* fixes dataset.to_synapse_request
* splits off dataset protocol
* adds integration tests
* updates docs for methods
* adds isSearchEnabled to Dataset
* start of table_component unit tests
* isort pre-commit
* adds mixin unit tests
* adds dataset unit tests
* pre-commit fix
* cleans up unit test script
* adds demo to doc page
* removes unused imports
* try new ubuntu version
* bump gh runner version
* fix docs order
* adds missing example imports + async running
* removes redundant query method definitions
* adds dataset tutorial
* pre-commit
* try latest OSes
* updates pytest-xdist version
* Revert "updates pytest-xdist version" (reverts commit 24a6524)
* bumps pytest and pytest-asyncio versions
* revert OSes
* test pytest_asyncio fixture
* bump dependencies version cache
* disable failing test
* make syn fixture asyncio safe
* debug failing test
* fixes failing unit test
* lower xdist concurrency
* adds Dataset tutorial to mkdocs
* fixes misaligned numbers
* adds missing comment
* try concurrency n=6
* revert to 4
* patch async scope
* adds async to failing tests

Co-authored-by: BryanFauble <[email protected]>
1 parent: b254f2a · commit: 1fbc248

35 files changed: +8,326 additions, −3,207 deletions

.github/workflows/build.yml

Lines changed: 6 additions & 6 deletions
```diff
@@ -50,7 +50,7 @@ jobs:

     strategy:
       matrix:
-        os: [ubuntu-20.04, macos-13, windows-2022]
+        os: [ubuntu-22.04, macos-13, windows-2022]

         # if changing the below change the run-integration-tests versions and the check-deploy versions
         # Make sure that we are running the integration tests on the first and last versions of the matrix
@@ -83,7 +83,7 @@ jobs:
           path: |
             ${{ steps.get-dependencies.outputs.site_packages_loc }}
             ${{ steps.get-dependencies.outputs.site_bin_dir }}
-          key: ${{ runner.os }}-${{ matrix.python }}-build-${{ env.cache-name }}-${{ hashFiles('setup.py') }}-v20
+          key: ${{ runner.os }}-${{ matrix.python }}-build-${{ env.cache-name }}-${{ hashFiles('setup.py') }}-v21

       - name: Install py-dependencies
         if: steps.cache-dependencies.outputs.cache-hit != 'true'
@@ -212,7 +212,7 @@ jobs:
       - name: Upload coverage report
         id: upload_coverage_report
         uses: actions/upload-artifact@v4
-        if: ${{ contains(fromJSON('["3.13"]'), matrix.python) && contains(fromJSON('["ubuntu-20.04"]'), matrix.os)}}
+        if: ${{ contains(fromJSON('["3.13"]'), matrix.python) && contains(fromJSON('["ubuntu-22.04"]'), matrix.os)}}
         with:
           name: coverage-report
           path: coverage.xml
@@ -221,7 +221,7 @@ jobs:
     needs: [test]
     if: ${{ always() && !cancelled()}}
     name: SonarCloud
-    runs-on: ubuntu-20.04
+    runs-on: ubuntu-22.04
     steps:
       - uses: actions/checkout@v4
         with:
@@ -256,7 +256,7 @@ jobs:
   package:
     needs: [test,pre-commit]

-    runs-on: ubuntu-20.04
+    runs-on: ubuntu-22.04

     if: github.event_name == 'release'

@@ -404,7 +404,7 @@ jobs:

     strategy:
       matrix:
-        os: [ubuntu-20.04, macos-13, windows-2022]
+        os: [ubuntu-24.04, macos-13, windows-2022]

         # python versions should be consistent with the strategy matrix and the runs-integration-tests versions
         python: ['3.9', '3.10', '3.11', '3.12', '3.13']
```

Pipfile.lock

Lines changed: 967 additions & 692 deletions
Generated file; diff not rendered.
docs/reference/experimental/async/dataset.md

Lines changed: 31 additions & 0 deletions

New file:

# Dataset

Contained within this file are experimental interfaces for working with the Synapse Python
Client. Unless otherwise noted these interfaces are subject to change at any time. Use
at your own risk.

## API reference

::: synapseclient.models.Dataset
    options:
        inherited_members: true
        members:
        - add_item_async
        - remove_item_async
        - store_async
        - get_async
        - delete_async
        - update_rows_async
        - snapshot_async
        - query_async
        - query_part_mask_async
        - add_column
        - delete_column
        - reorder_column
        - rename_column
        - get_permissions
        - get_acl
        - set_permissions
---
::: synapseclient.models.EntityRef
---

docs/reference/experimental/async/table.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -46,4 +46,4 @@ at your own risk.
 ::: synapseclient.models.UploadToTableRequest
 ::: synapseclient.models.TableUpdateTransaction
 ::: synapseclient.models.CsvTableDescriptor
-::: synapseclient.models.mixins.table_operator.csv_to_pandas_df
+::: synapseclient.models.mixins.table_components.csv_to_pandas_df
```
docs/reference/experimental/sync/dataset.md

Lines changed: 41 additions & 0 deletions

New file:

# Dataset

Contained within this file are experimental interfaces for working with the Synapse Python
Client. Unless otherwise noted these interfaces are subject to change at any time. Use
at your own risk.

## Example Script:

<details class="quote">
<summary>Working with Synapse datasets</summary>

```python
{!docs/scripts/object_orientated_programming_poc/oop_poc_dataset.py!}
```
</details>

## API reference

::: synapseclient.models.Dataset
    options:
        inherited_members: true
        members:
        - add_item
        - remove_item
        - store
        - get
        - delete
        - update_rows
        - snapshot
        - query
        - query_part_mask
        - add_column
        - delete_column
        - reorder_column
        - rename_column
        - get_permissions
        - get_acl
        - set_permissions
---
::: synapseclient.models.EntityRef
---
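The async and sync reference pages list the same Dataset members, with the blocking variants (`store`, `get`, ...) corresponding to `*_async` coroutines; the commit notes that `async_to_sync` was updated so protocols are no longer needed. The following is a rough, hypothetical sketch of that pattern only, not the client's actual implementation (the class name `ExampleDataset` and the decorator body are invented for illustration):

```python
import asyncio


def async_to_sync(cls):
    """Sketch: for each *_async coroutine method on the class, generate a
    blocking counterpart without the "_async" suffix. Hypothetical
    simplification of the async-to-sync pattern, not synapseclient's code."""
    for name in list(vars(cls)):
        attr = getattr(cls, name)
        if name.endswith("_async") and asyncio.iscoroutinefunction(attr):
            sync_name = name[: -len("_async")]
            if not hasattr(cls, sync_name):
                def make(async_name):
                    def sync_method(self, *args, **kwargs):
                        # Run the coroutine to completion on a fresh event loop.
                        return asyncio.run(getattr(self, async_name)(*args, **kwargs))
                    return sync_method
                setattr(cls, sync_name, make(name))
    return cls


@async_to_sync
class ExampleDataset:
    async def get_async(self):
        return "dataset"


print(ExampleDataset().get())  # the generated blocking wrapper
```

One benefit of generating the sync surface this way is that the two method sets cannot drift apart: only the `*_async` implementations carry logic.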

docs/reference/experimental/sync/table.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -57,4 +57,4 @@ at your own risk.
 ::: synapseclient.models.UploadToTableRequest
 ::: synapseclient.models.TableUpdateTransaction
 ::: synapseclient.models.CsvTableDescriptor
-::: synapseclient.models.mixins.table_operator.csv_to_pandas_df
+::: synapseclient.models.mixins.table_components.csv_to_pandas_df
```

docs/tutorials/python/dataset.md

Lines changed: 125 additions & 2 deletions
The placeholder page ("# Datasets in Synapse" with an under-construction image) is replaced by the following tutorial:

# Datasets

Datasets in Synapse are a way to organize, annotate, and publish sets of files for others to use. Datasets behave similarly to Tables and FileViews, but provide some default behavior that makes it easy to put a group of files together.

This tutorial will walk through the basics of working with datasets using the Synapse Python client.

## Tutorial Purpose

In this tutorial, you will:

1. Create a dataset
2. Add files to the dataset
3. Query the dataset
4. Add a custom column to the dataset
5. Save a snapshot of the dataset

## Prerequisites

* This tutorial assumes that you have a project in Synapse with one or more files in it. To test all of the ways to add files to a dataset, you will need at least 3 files in your project. A structure like this is recommended:

```
Project
├── File 1
├── File 2
├── Folder 1
│   ├── File 4
│   ├── ...
```

* Pandas must be installed as shown in the [installation documentation](../installation.md)

## 1. Get the ID of your Synapse project

Let's get started by authenticating with Synapse and retrieving the ID of your project.

```python
{!docs/tutorials/python/tutorial_scripts/dataset.py!lines=17-23}
```

## 2. Create your Dataset

Next, we will create the dataset. We will use the project ID to tell Synapse where the dataset should be created. After this step, we will have a Dataset object with all of the information needed to start building the dataset.

```python
{!docs/tutorials/python/tutorial_scripts/dataset.py!lines=27-28}
```

Because we haven't added any files to the dataset yet, it will be empty, but if you view the dataset's schema in the UI, you will notice that datasets come with default columns that help to describe each file added to the dataset.

![Dataset Default Schema](./tutorial_screenshots/dataset_default_schema.png)

## 3. Add files to the dataset

Let's add some files to the dataset now. There are three ways to add files to a dataset:

1. Add an Entity Reference to a file with its ID and version
```python
{!docs/tutorials/python/tutorial_scripts/dataset.py!lines=32-34}
```
2. Add a File with its ID and version
```python
{!docs/tutorials/python/tutorial_scripts/dataset.py!lines=36-38}
```
3. Add a Folder. When adding a folder, all child files inside of the folder are added to the dataset recursively.
```python
{!docs/tutorials/python/tutorial_scripts/dataset.py!lines=40-42}
```

Whenever we make changes to the dataset, we need to call the `store()` method to save the changes to Synapse.

```python
{!docs/tutorials/python/tutorial_scripts/dataset.py!lines=44}
```

And now we are able to see our dataset with all of the files that we added to it.

![Dataset with Files](./tutorial_screenshots/dataset_with_files.png)

## 4. Retrieve the dataset

Now that we have a dataset with some files in it, we can retrieve it from Synapse the next time we need to use it.

```python
{!docs/tutorials/python/tutorial_scripts/dataset.py!lines=48-50}
```

## 5. Query the dataset

With files in the dataset, we can query it to find files that match certain criteria.

```python
{!docs/tutorials/python/tutorial_scripts/dataset.py!lines=54-57}
```

## 6. Add a custom column to the dataset

We can also add a custom column to the dataset. This allows us to annotate files in the dataset with additional information.

```python
{!docs/tutorials/python/tutorial_scripts/dataset.py!lines=61-67}
```

Our custom column isn't all that useful empty, so let's update the dataset with some values.

```python
{!docs/tutorials/python/tutorial_scripts/dataset.py!lines=70-78}
```

## 7. Save a snapshot of the dataset

Finally, let's save a snapshot of the dataset. This creates a read-only version of the dataset that captures its current state and can be referenced later.

```python
{!docs/tutorials/python/tutorial_scripts/dataset.py!lines=82-86}
```

## Source Code for this Tutorial

<details class="quote">
<summary>Click to show me</summary>

```python
{!docs/tutorials/python/tutorial_scripts/dataset.py!}
```
</details>

## References

- [Dataset](../../reference/experimental/sync/dataset.md)
- [Column][synapseclient.models.Column]
- [syn.login][synapseclient.Synapse.login]
- [Project](../../reference/experimental/sync/project.md)
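Step 6's `update_rows` call matches existing dataset rows against the supplied DataFrame using the `primary_keys` you pass. As a rough, library-free illustration of that matching idea (a hypothetical simplification using plain pandas, not the client's actual implementation):

```python
import pandas as pd

# Current state of the dataset's annotation column (illustrative data).
existing = pd.DataFrame(
    {
        "id": ["syn1", "syn2", "syn3"],
        "my_annotation": [None, None, None],
    }
)
# The update frame: one row, keyed on "id", as in the tutorial.
updates = pd.DataFrame({"id": ["syn2"], "my_annotation": ["excellent data"]})

# Left-merge on the primary key, then prefer the updated value where present.
merged = existing.merge(updates, on="id", how="left", suffixes=("", "_new"))
merged["my_annotation"] = merged["my_annotation_new"].combine_first(
    merged["my_annotation"]
)
merged = merged.drop(columns=["my_annotation_new"])
print(merged)
```

Only the row whose `id` matches receives the new annotation; all other rows are left untouched, which is the behavior the `primary_keys` argument is designed for.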
docs/tutorials/python/tutorial_screenshots/

Two binary image files added (161 KB and 332 KB): the screenshots referenced by the dataset tutorial.
docs/tutorials/python/tutorial_scripts/dataset.py

Lines changed: 86 additions & 0 deletions

New file:

```python
"""Here is where you'll find the code for the dataset tutorial."""

import pandas as pd

from synapseclient import Synapse
from synapseclient.models import (
    Column,
    ColumnType,
    Dataset,
    EntityRef,
    File,
    Folder,
    Project,
)

# First, let's get the project that we want to create the dataset in
syn = Synapse()
syn.login()

project = Project(name="My Testing Project").get()  # Replace with your project name
project_id = project.id
print(project_id)

# Next, let's create the dataset. We'll use the project id as the parent id.
# To begin, the dataset will be empty, but if you view the dataset's schema in the UI,
# you will notice that datasets come with default columns.
my_new_dataset = Dataset(parent_id=project_id, name="My New Dataset").store()
print(f"My Dataset's ID is {my_new_dataset.id}")

# Now, let's add some files to the dataset. There are three ways to add files to a dataset:
# 1. Add an Entity Reference to a file with its ID and version
my_new_dataset.add_item(
    EntityRef(id="syn51790029", version=1)
)  # Replace with the ID of the file you want to add
# 2. Add a File with its ID and version
my_new_dataset.add_item(
    File(id="syn51790028", version_label=1)
)  # Replace with the ID of the file you want to add
# 3. Add a Folder. In this case, all child files of the folder are added to the dataset recursively.
my_new_dataset.add_item(
    Folder(id="syn64893446")
)  # Replace with the ID of the folder you want to add
# Our changes won't be persisted to Synapse until we call the store() method.
my_new_dataset.store()

# Now that our Dataset with all of our files has been created, the next time
# we want to use it, we can retrieve it from Synapse.
my_retrieved_dataset = Dataset(id=my_new_dataset.id).get()
print(f"My Dataset's ID is {my_retrieved_dataset.id}")
print(len(my_retrieved_dataset.items))

# If you want to query your dataset for files that match certain criteria, you can do so
# using the query method.
rows = Dataset.query(
    query=f"SELECT * FROM {my_retrieved_dataset.id} WHERE name like '%test%'"
)
print(rows)

# In addition to the default columns, you may want to annotate items in your dataset using
# custom columns.
my_retrieved_dataset.add_column(
    column=Column(
        name="my_annotation",
        column_type=ColumnType.STRING,
    )
)
my_retrieved_dataset.store()

# Now that our custom column has been added, we can update the dataset with new values.
modified_data = pd.DataFrame(
    {
        "id": "syn51790028",  # The ID of one of our Files
        "my_annotation": ["excellent data"],
    }
)
my_retrieved_dataset.update_rows(
    values=modified_data, primary_keys=["id"], dry_run=False
)

# Finally, let's save a snapshot of the dataset.
snapshot_info = my_retrieved_dataset.snapshot(
    comment="My first snapshot",
    label="My first snapshot",
)
print(snapshot_info)
```
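As the script shows, dataset queries are table-style SQL strings keyed on the dataset's Synapse ID. A tiny helper for composing such query strings can keep f-string plumbing out of tutorial code; the function below is a hypothetical convenience for illustration, not part of synapseclient:

```python
def dataset_query(dataset_id: str, where: str = "") -> str:
    """Compose a table-style SQL query against a Synapse dataset.

    Hypothetical helper: builds the same kind of string passed to
    Dataset.query() in the tutorial script above.
    """
    query = f"SELECT * FROM {dataset_id}"
    if where:
        query += f" WHERE {where}"
    return query


print(dataset_query("syn123"))
# → SELECT * FROM syn123
print(dataset_query("syn123", "name like '%test%'"))
# → SELECT * FROM syn123 WHERE name like '%test%'
```

The composed string would then be passed as the `query` argument, exactly as the script does inline.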
