| 1 | +# Repository Guidelines |
| 2 | + |
| 3 | +Use this guide to make concise, high-signal contributions to the generalized k-means clustering library. |
| 4 | + |
| 5 | +## Project Structure & Module Organization |
| 6 | +- Scala sources live in `src/main/scala` (DataFrame/ML API under `com.massivedatascience.clusterer.ml`), with version-specific shims in `src/main/scala-2.12` and `src/main/scala-2.13`. Legacy RDD code remains in `com.massivedatascience.clusterer`. |
| 7 | +- Tests use ScalaTest under `src/test/scala` with Spark-local fixtures; shared data is in `src/test/resources`. Executable examples sit in `src/main/scala/examples`. |
| 8 | +- Python wrapper lives in `python/` (`massivedatascience` package, examples, and tests). Docs and release notes are in `docs/`, `release-notes/`, `ARCHITECTURE.md`, and `DATAFRAME_API_EXAMPLES.md`. |
| 9 | + |
| 10 | +## Build, Test, and Development Commands |
| 11 | +- `sbt compile`: compile against the default Scala/Spark versions; prefix the task with `++2.13.14` or `++2.12.18` (e.g., `sbt ++2.13.14 compile`) to pin the Scala version. |
| 12 | +- `sbt test`: run the full JVM suite (ScalaTest on Spark `local[2]`); CI runs the same suite across multiple Scala/Spark combinations. |
| 13 | +- `sbt scalafmtAll`, then `sbt scalastyle`: required format and lint gates (configured in `.scalafmt.conf` and `scalastyle-config.xml`). |
| 14 | +- `sbt coverage test coverageReport`: generate a coverage report; keep kernels and persistence paths covered. |
| 15 | +- Python: `cd python && pip install -e ".[dev]" && pytest` (see `python/TESTING.md`); the quotes keep shells such as zsh from expanding `[dev]` as a glob. |
| 16 | + |
| 17 | +## Coding Style & Naming Conventions |
| 18 | +- Scalafmt enforces 2-space indentation and a 100-column limit; keep trailing commas and aligned parameters. Prefer immutable `val`s, small helpers, and the Spark ML `Estimator`/`Model` pattern with `set*`/`get*` accessors. |
| 19 | +- Naming: PascalCase classes/objects, camelCase methods/vals/params. Document public APIs with Scaladoc and mirror existing parameter docs. |
| 20 | +- In tests, disable the Spark UI and keep partitions small (follow existing suites) to avoid flakiness. |
| 21 | + |
| 22 | +## Testing Guidelines |
| 23 | +- Add ScalaTest `AnyFunSuite` cases under `src/test/scala`; keep seeds deterministic and compare divergence values against explicit numerical tolerances rather than exact equality. Reuse existing fixtures/utilities. |
| 24 | +- Include persistence round-trips when adding models/params; CI validates cross-version save/load. |
| 25 | +- For Python changes, update `python/tests/test_generalized_kmeans.py` and run `pytest --cov=massivedatascience tests/`. |
| 26 | + |
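
The seed-and-tolerance advice in the testing guidelines can be illustrated with a minimal, standard-library-only Python sketch. It is not the library's API: `kl_divergence` and `random_distribution` are hypothetical helpers invented for this example, and the tolerance values are assumptions rather than project policy.

```python
import math
import random


def kl_divergence(p, q):
    """KL divergence D(p || q) for two discrete distributions (hypothetical helper)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def random_distribution(rng, size):
    """Draw a strictly positive probability vector from a seeded generator."""
    weights = [rng.random() + 1e-3 for _ in range(size)]
    total = sum(weights)
    return [w / total for w in weights]


def test_kl_divergence_is_reproducible_and_bounded():
    # A fixed seed makes the fixture identical on every run.
    p = random_distribution(random.Random(42), 8)
    p_again = random_distribution(random.Random(42), 8)
    assert p == p_again

    q = random_distribution(random.Random(7), 8)

    # Compare against a tolerance instead of exact floating-point equality.
    assert math.isclose(kl_divergence(p, p), 0.0, abs_tol=1e-12)
    # Gibbs' inequality: the divergence is non-negative, up to rounding.
    assert kl_divergence(p, q) >= -1e-12
```

The same two habits (a fixed seed for fixtures, an explicit tolerance for floating-point comparisons) carry over to the ScalaTest suites.
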
| 27 | +## Commit & Pull Request Guidelines |
| 28 | +- Use conventional commits (`feat|fix|docs|style|refactor|perf|test|build|ci|chore`, optional scope): `type(scope): subject`. |
| 29 | +- PRs should summarize behavior changes, list executed commands (e.g., `sbt ++2.13.14 test`, `sbt scalafmtAll`, `pytest`), and link issues (`Closes #123`). Provide before/after snippets for API or doc updates; screenshots only when user-facing outputs change. |
| 30 | +- CI runs lint, the Scala/Spark test matrix, a Python smoke test, and CodeQL; run the same checks locally before pushing to reduce CI round-trips. |
| 31 | + |
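
To make the `type(scope): subject` shape concrete, here is a small Python sketch that checks a commit subject line against the types listed in the commit guidelines. The function name `is_conventional` and the exact regular expression are assumptions made for illustration; whatever check the repository's CI applies, if any, may differ (for example, this sketch does not accept the `!` breaking-change marker).

```python
import re

# Commit types taken from the guideline's list.
COMMIT_TYPES = "feat|fix|docs|style|refactor|perf|test|build|ci|chore"

# Shape: type, optional "(scope)", a colon, one space, then a non-empty subject.
COMMIT_PATTERN = re.compile(rf"^(?:{COMMIT_TYPES})(?:\([^()\s]+\))?: \S.*$")


def is_conventional(subject_line: str) -> bool:
    """Return True if the subject line matches `type(scope): subject`."""
    return COMMIT_PATTERN.match(subject_line) is not None
```

For instance, `is_conventional("fix: handle empty partitions")` is accepted, while `is_conventional("Update stuff")` is rejected because it has no recognized type prefix.
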
| 32 | +## Security & Configuration Notes |
| 33 | +- Target Java 17; avoid committing large datasets or credentials. Report vulnerabilities via `SECURITY.md`. |
| 34 | +- When modifying dependencies or persistence formats, consult `DEPENDENCY_MANAGEMENT.md` and `PERSISTENCE_COMPATIBILITY.md` to preserve cross-version compatibility. |