Commit fedd4cb

refactoring
1 parent b9c272b commit fedd4cb

1 file changed

+25
-65
lines changed

pages/Community and Best Practices/Data and Workflow Best Practices/Workflows/workflow-best-practices.md

Lines changed: 25 additions & 65 deletions
@@ -12,17 +12,17 @@ A well-organized project directory makes your workflow easier for others (and yo
1212

1313
Consider using separate directories for distinct components of your project:
1414

15-
* `code/` or `src/`: For primary source code files, including scripts (`.py`) and notebooks (`.ipynb`).
16-
* `data/`: For input data files. Note that large data files should generally not be committed to version control (see next section). This directory might contain small sample datasets or scripts to download larger inputs.
17-
* `docs/`: For detailed documentation, figures, or reports.
18-
* `environment/`: For files defining the software environment (e.g., `environment.yml`, `Dockerfile`).
15+
- `code/` or `src/`: For primary source code files, including scripts (`.py`) and notebooks (`.ipynb`).
16+
- `data/`: For input data files. Note that large data files should generally not be committed to version control (see next section). This directory might contain small sample datasets or scripts to download larger inputs.
17+
- `docs/`: For detailed documentation, figures, or reports.
18+
- `environment/`: For files defining the software environment (e.g., `environment.yml`, `Dockerfile`).
1919

2020
At the root level of your project, always include a `README.md` file. This file serves as the entry point and should clearly explain:
2121

22-
* The purpose of the project and workflow.
23-
* The contents of the repository and the directory structure.
24-
* Instructions on how to set up the environment and run the workflow.
25-
* Information about required input data and how to obtain it.
22+
- The purpose of the project and workflow.
23+
- The contents of the repository and the directory structure.
24+
- Instructions on how to set up the environment and run the workflow.
25+
- Information about required input data and how to obtain it.
2626

2727
## Use Version Control Effectively
2828

@@ -41,23 +41,23 @@ git push -u origin main
4141
```
4242

4343
Version control provides several key benefits:
44-
* It acts as a complete history log, allowing you to track every change and revert to previous versions if needed.
45-
* It facilitates collaboration by allowing multiple people to work on the same codebase simultaneously using branches and merging.
46-
* It serves as a reliable backup mechanism for your code and project history.
47-
* Crucially, it allows you to link specific versions of your code (via commits or tags) to the results generated, which is fundamental for reproducibility.
44+
- It acts as a complete history log, allowing you to track every change and revert to previous versions if needed.
45+
- It facilitates collaboration by allowing multiple people to work on the same codebase simultaneously using branches and merging.
46+
- It serves as a reliable backup mechanism for your code and project history.
47+
- Crucially, it allows you to link specific versions of your code (via commits or tags) to the results generated, which is fundamental for reproducibility.
4848

4949
When setting up your repository, carefully consider what should and should not be tracked, following common recommendations:
5050

51-
* Track These:
52-
* Source code files (.ipynb, .py).
53-
* Configuration files.
54-
* Environment definition files (environment.yml, requirements.txt, Dockerfile).
55-
* Documentation files (README.md, other .md or text files).
51+
- Track These:
52+
- Source code files (.ipynb, .py).
53+
- Configuration files.
54+
- Environment definition files (environment.yml, requirements.txt, Dockerfile).
55+
- Documentation files (README.md, other .md or text files).
5656

57-
* Do Not Track These:
58-
* Large data files. Data should be stored separately and accessed via links or download scripts.
59-
* Credentials, API keys, or any sensitive information (secrets).
60-
* Generated outputs like plots, figures, or intermediate/final data files.
57+
- Do Not Track These:
58+
- Large data files. Data should be stored separately and accessed via links or download scripts.
59+
- Credentials, API keys, or any sensitive information (secrets).
60+
- Generated outputs like plots, figures, or intermediate/final data files.
6161

6262
Use a `.gitignore` file to explicitly tell Git which files and directories to ignore. To use version control effectively:
6363
- Make frequent, small commits. Each commit should represent a single logical change.
@@ -369,7 +369,7 @@ Once your notebook runs reliably, think about making it even more reusable with
369369

370370
## Implement Basic Testing
371371

372-
Adding checks to your code helps ensure it behaves as expected and increases confidence in your results ✔️. Even simple tests can catch errors early, saving significant debugging time later.
372+
Adding checks to your code helps ensure it behaves as expected and increases confidence in your results. Even simple tests can catch errors early, saving significant debugging time later.
373373

374374
A straightforward way to add checks directly within your notebook is using `assert` statements. These statements test whether a condition is true; if it's false, the code will stop and raise an error, immediately alerting you to a problem. Use them to verify assumptions about your data or the results of calculations.
375375
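A minimal sketch of the `assert` checks described above. The function name `check_temperatures`, the sample readings, and the valid range are hypothetical illustrations, not taken from the original guide:

```python
def check_temperatures(values_celsius):
    """Validate assumptions about a list of temperature readings and return their mean."""
    # Assumption: an empty input indicates a loading problem upstream.
    assert len(values_celsius) > 0, "no temperature values provided"
    # Assumption: illustrative bounds for plausible surface air temperatures.
    assert all(-90.0 <= v <= 60.0 for v in values_celsius), "temperature out of range"
    mean = sum(values_celsius) / len(values_celsius)
    # The mean of in-range values must itself be in range; this guards the calculation.
    assert -90.0 <= mean <= 60.0, "mean temperature out of range"
    return mean


mean_temp = check_temperatures([12.5, 14.0, 13.5])
```

If any condition fails, Python raises an `AssertionError` carrying the message, so the notebook stops at the cell where the assumption broke rather than producing a silently wrong result.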

@@ -421,48 +421,8 @@ A crucial step for ensuring true reproducibility is explicitly connecting the sp
421421

422422
In EarthCODE, this vital link is captured within the **Experiment** metadata record. When you publish a data **Product**, its metadata should reference the **Workflow** that created it, while the details of the specific code run are recorded in the metadata of an **Experiment**. The Experiment record, in turn, contains precise references to:
423423

424-
* The specific **Workflow** version used (e.g., a Git commit hash or tag).
425-
* The exact **Input Data** consumed.
426-
* The **Configuration** parameters applied during that run.
424+
- The specific **Workflow** version used (e.g., a Git commit hash or tag).
425+
- The exact **Input Data** consumed.
426+
- The **Configuration** parameters applied during that run.
427427

428428
This creates a complete, traceable chain from the final data product back to the exact code and conditions that generated it. By formally linking the code version to the results via an Experiment, you provide the necessary provenance for others to verify your findings and confidently reproduce your work.
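The three references listed above could be gathered into a single record along these lines. This is only a sketch: the function `build_experiment_record`, the field names, the tag `v1.0.0`, and the example URL are all hypothetical and do not reflect EarthCODE's actual Experiment metadata schema:

```python
import json


def build_experiment_record(workflow_version, input_data, configuration):
    """Bundle workflow version, inputs, and configuration into one dictionary.

    Hypothetical layout for illustration; not the EarthCODE schema.
    """
    # A record without a code version cannot be traced back to the code.
    assert workflow_version, "a Git commit hash or tag is required"
    return {
        "workflow_version": workflow_version,
        "input_data": list(input_data),
        "configuration": dict(configuration),
    }


record = build_experiment_record(
    workflow_version="v1.0.0",  # placeholder tag
    input_data=["https://example.com/data/input.nc"],  # placeholder URL
    configuration={"resolution_km": 10, "start_year": 2020},  # placeholder parameters
)

# Serialize the record so it can be stored alongside the published results.
print(json.dumps(record, indent=2, sort_keys=True))
```

Storing such a record next to each output keeps the code version, inputs, and parameters together, which is the traceable chain the paragraph above describes.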
429-
430-
431-
432-
433-
434-
435-
436-
437-
438-
439-
440-
441-
442-
<!--
443-
# Workflow Best Practices
444-
## Plan for Reproducibility from day 1
445-
446-
## Best Practices for high-quality Code, Data and Workflows
447-
448-
Maintaining high-quality code and data throughout your project ensures that your outputs are reusable, trustworthy, and easier to publish. Below are tips and recommended practices to support quality assurance and reproducibility:
449-
450-
- Code Quality
451-
- Use Version Control: Track your development using Git and a shared repository (e.g., GitHub or GitLab).
452-
- Automate Testing: Implement unit tests and integration tests using tools like pytest, unittest, or CI/CD workflows.
453-
- Follow Coding Standards: Adopt a consistent style (e.g., PEP8 for Python) and use linters (e.g., flake8, black) to maintain code clarity.
454-
- Write Documentation: Provide clear usage instructions and inline comments. Consider using Jupyter Notebooks or Markdown README files to explain workflows.
455-
- Data Quality
456-
- Validate Your Data: Apply automated checks for data formats, missing values, and schema consistency.
457-
- Document Your Data: Create or maintain metadata alongside your datasets, including descriptions of variables, units, and collection methods.
458-
- Use Standard Formats: Choose interoperable, machine-readable formats (e.g., NetCDF, GeoTIFF, Zarr) and community-agreed standards (such as CF-Conventions).
459-
- Track Data Changes: when needed, version datasets as they evolve and log processing steps to support reproducibility.
460-
- Integration with EarthCODE
461-
- Use EarthCODE-Compatible Tools: When possible, rely on tools and environments that are natively supported within EarthCODE platforms.
462-
- Test Workflows in EarthCODE Early: Validate your workflows in the target platform before final publication to avoid integration issues.
463-
- Publish Intermediate Outputs: Store and document intermediate results to help others understand and reuse your work incrementally.
464-
- Regularly revisiting these practices during the project lifecycle will reduce last-minute issues and make your results easier to share and build upon. -->
465-
466-
467-
468-