pages/Community and Best Practices/Data and Workflow Best Practices/Workflows/workflow-best-practices.md
Lines changed: 25 additions & 65 deletions
@@ -12,17 +12,17 @@ A well-organized project directory makes your workflow easier for others (and yo
 Consider using separate directories for distinct components of your project:

-* `code/` or `src/`: For primary source code files, including scripts (`.py`) and notebooks (`.ipynb`).
-* `data/`: For input data files. Note that large data files should generally not be committed to version control (see next section). This directory might contain small sample datasets or scripts to download larger inputs.
-* `docs/`: For detailed documentation, figures, or reports.
-* `environment/`: For files defining the software environment (e.g., `environment.yml`, `Dockerfile`).
+- `code/` or `src/`: For primary source code files, including scripts (`.py`) and notebooks (`.ipynb`).
+- `data/`: For input data files. Note that large data files should generally not be committed to version control (see next section). This directory might contain small sample datasets or scripts to download larger inputs.
+- `docs/`: For detailed documentation, figures, or reports.
+- `environment/`: For files defining the software environment (e.g., `environment.yml`, `Dockerfile`).

 At the root level of your project, always include a `README.md` file. This file serves as the entry point and should clearly explain:

-* The purpose of the project and workflow.
-* The contents of the repository and the directory structure.
-* Instructions on how to set up the environment and run the workflow.
-* Information about required input data and how to obtain it.
+- The purpose of the project and workflow.
+- The contents of the repository and the directory structure.
+- Instructions on how to set up the environment and run the workflow.
+- Information about required input data and how to obtain it.
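
The directory layout and `README.md` described above could be scaffolded with a short script. This is only a minimal sketch: the `scaffold_project` helper, the choice of `src` over `code`, and the placeholder README headings are illustrative assumptions, not part of the guide.

```python
from pathlib import Path

# Hypothetical layout mirroring the directories named above; adjust as needed.
PROJECT_DIRS = ("src", "data", "docs", "environment")


def scaffold_project(root: Path) -> list[Path]:
    """Create the project skeleton under `root` and return the skeleton paths."""
    paths = []
    for name in PROJECT_DIRS:
        directory = root / name
        # exist_ok=True makes the script safe to re-run on an existing project.
        directory.mkdir(parents=True, exist_ok=True)
        paths.append(directory)

    readme = root / "README.md"
    if not readme.exists():
        # Placeholder headings matching the four points a README should cover.
        readme.write_text(
            "# Project title\n\n"
            "## Purpose\n\n"
            "## Repository contents\n\n"
            "## Setup and usage\n\n"
            "## Input data\n",
            encoding="utf-8",
        )
    paths.append(readme)
    return paths
```

Calling `scaffold_project(Path("my-project"))` would create the four directories and a README stub without overwriting an existing README.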
 ## Use Version Control Effectively
@@ -41,23 +41,23 @@ git push -u origin main
```
 Version control provides several key benefits:
-* It acts as a complete history log, allowing you to track every change and revert to previous versions if needed.
-* It facilitates collaboration by allowing multiple people to work on the same codebase simultaneously using branches and merging.
-* It serves as a reliable backup mechanism for your code and project history.
-* Crucially, it allows you to link specific versions of your code (via commits or tags) to the results generated, which is fundamental for reproducibility.
+- It acts as a complete history log, allowing you to track every change and revert to previous versions if needed.
+- It facilitates collaboration by allowing multiple people to work on the same codebase simultaneously using branches and merging.
+- It serves as a reliable backup mechanism for your code and project history.
+- Crucially, it allows you to link specific versions of your code (via commits or tags) to the results generated, which is fundamental for reproducibility.

 When setting up your repository, carefully consider what should and should not be tracked, following common recommendations:
 - Documentation files (README.md, other .md or text files).

-* Do Not Track These:
-* Large data files. Data should be stored separately and accessed via links or download scripts.
-* Credentials, API keys, or any sensitive information (secrets).
-* Generated outputs like plots, figures, or intermediate/final data files.
+- Do Not Track These:
+- Large data files. Data should be stored separately and accessed via links or download scripts.
+- Credentials, API keys, or any sensitive information (secrets).
+- Generated outputs like plots, figures, or intermediate/final data files.

 Use a `.gitignore` file to explicitly tell Git which files and directories to ignore. To use version control effectively:
 - Make frequent, small commits. Each commit should represent a single logical change.
@@ -369,7 +369,7 @@ Once your notebook runs reliably, think about making it even more reusable with
 ## Implement Basic Testing

-Adding checks to your code helps ensure it behaves as expected and increases confidence in your results ✔️. Even simple tests can catch errors early, saving significant debugging time later.
+Adding checks to your code helps ensure it behaves as expected and increases confidence in your results. Even simple tests can catch errors early, saving significant debugging time later.

 A straightforward way to add checks directly within your notebook is using `assert` statements. These statements test whether a condition is true; if it's false, the code will stop and raise an error, immediately alerting you to a problem. Use them to verify assumptions about your data or the results of calculations.
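
As a rough illustration of the `assert` pattern described above, the sketch below checks assumptions about some input values and then the result of a calculation. The `temperatures_kelvin` data and the variable names are hypothetical and stand in for whatever a notebook has loaded.

```python
import math

# Hypothetical sample values standing in for data loaded earlier in a notebook.
temperatures_kelvin = [271.3, 284.9, 290.1, 268.7]

# Verify assumptions about the data before using it.
assert len(temperatures_kelvin) > 0, "Expected at least one temperature value"
assert all(t > 0 for t in temperatures_kelvin), "Kelvin temperatures must be positive"

# Verify the result of a calculation.
mean_kelvin = sum(temperatures_kelvin) / len(temperatures_kelvin)
mean_celsius = mean_kelvin - 273.15

# A mean must lie between the smallest and largest input values.
assert min(temperatures_kelvin) <= mean_kelvin <= max(temperatures_kelvin)
# Converting back should recover the original value (within float tolerance).
assert math.isclose(mean_celsius + 273.15, mean_kelvin)
```

If any condition fails, Python raises an `AssertionError` carrying the message given after the comma, which points directly at the violated assumption.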
@@ -421,48 +421,8 @@ A crucial step for ensuring true reproducibility is explicitly connecting the sp
 In EarthCODE, this vital link is captured within the **Experiment** metadata record. When you publish a data **Product**, its metadata should reference the **Workflow** that created it and the details of the code run are in the metadata of an **Experiment**. The Experiment record, in turn, contains precise references to:

-* The specific **Workflow** version used (e.g., a Git commit hash or tag).
-* The exact **Input Data** consumed.
-* The **Configuration** parameters applied during that run.
+- The specific **Workflow** version used (e.g., a Git commit hash or tag).
+- The exact **Input Data** consumed.
+- The **Configuration** parameters applied during that run.

 This creates a complete, traceable chain from the final data product back to the exact code and conditions that generated it. By formally linking the code version to the results via an Experiment, you provide the necessary provenance for others to verify your findings and confidently reproduce your work.
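
To make the three references above concrete, here is a minimal sketch of a run record that captures a workflow version, input data checksums, and configuration. It is an assumption for illustration only: the field names and the helpers `file_sha256`, `build_run_record`, and `save_run_record` are hypothetical and do not represent the EarthCODE Experiment schema.

```python
import hashlib
import json
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, identifying the exact input consumed."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def build_run_record(workflow_version: str, input_files: list[Path], config: dict) -> dict:
    """Assemble an illustrative provenance record (NOT the EarthCODE Experiment schema)."""
    return {
        "workflow_version": workflow_version,  # e.g. a Git commit hash or tag
        "input_data": [
            {"path": str(p), "sha256": file_sha256(p)} for p in input_files
        ],
        "configuration": config,
    }


def save_run_record(record: dict, destination: Path) -> None:
    """Write the record as JSON so it can be stored next to the outputs it describes."""
    destination.write_text(
        json.dumps(record, indent=2, sort_keys=True), encoding="utf-8"
    )
```

The workflow version passed in could be, for example, the output of `git rev-parse HEAD` for the commit that produced the results.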
-
-
-
-
-
-
-
-
-
-
-
-
-
-<!--
-# Workflow Best Practices
-## Plan for Reproducibility from day 1
-
-## Best Practices for high-quality Code, Data and Workflows
-
-Maintaining high-quality code and data throughout your project ensures that your outputs are reusable, trustworthy, and easier to publish. Below are tips and recommended practices to support quality assurance and reproducibility:
-
-- Code Quality
-- Use Version Control: Track your development using Git and a shared repository (e.g., GitHub or GitLab).
-- Automate Testing: Implement unit tests and integration tests using tools like pytest, unittest, or CI/CD workflows.
-- Follow Coding Standards: Adopt a consistent style (e.g., PEP8 for Python) and use linters (e.g., flake8, black) to maintain code clarity.
-- Write Documentation: Provide clear usage instructions and inline comments. Consider using Jupyter Notebooks or Markdown README files to explain workflows.
-- Data Quality
-- Validate Your Data: Apply automated checks for data formats, missing values, and schema consistency.
-- Document Your Data: Create or maintain metadata alongside your datasets, including descriptions of variables, units, and collection methods.
-- Use Standard Formats: Choose interoperable, machine-readable formats (e.g., NetCDF, GeoTIFF, Zarr) and community-agreed standards (such as CF-Conventions).
-- Track Data Changes: when needed, version datasets as they evolve and log processing steps to support reproducibility.
-- Integration with EarthCODE
-- Use EarthCODE-Compatible Tools: When possible, rely on tools and environments that are natively supported within EarthCODE platforms.
-- Test Workflows in EarthCODE Early: Validate your workflows in the target platform before final publication to avoid integration issues.
-- Publish Intermediate Outputs: Store and document intermediate results to help others understand and reuse your work incrementally.
-- Regularly revisiting these practices during the project lifecycle will reduce last-minute issues and make your results easier to share and build upon. -->