If you spot a problem with the docs, first search whether an issue already exists. If a related issue doesn't exist, you can open a new one.
Scan through our existing issues to find one that interests you. If you find an issue to work on, make sure that no one else is already working on it so that you can get assigned. After that, you are welcome to open a PR with a fix.
You can find a list of good first issues, which can help you get familiar with the project's code base.
We have a workflow that automatically assigns issues to users who comment 'take' on an issue. It is configured in the .github/workflows/assign-on-comment.yml file. When a user comments 'take' on an issue, a GitHub Action runs and assigns the issue to that user if it is not already assigned.
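For reference, a workflow of this general shape can implement that behavior. This is an illustrative sketch, not necessarily the repository's exact file:

```yaml
# Illustrative sketch of an assign-on-comment workflow; the real file may differ.
name: Assign on comment
on:
  issue_comment:
    types: [created]
jobs:
  assign:
    # Only react to the literal comment 'take'.
    if: github.event.comment.body == 'take'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/github-script@v7
        with:
          script: |
            const issue = context.payload.issue;
            // Assign the commenter only if the issue is still unassigned.
            if (!issue.assignees || issue.assignees.length === 0) {
              await github.rest.issues.addAssignees({
                ...context.repo,
                issue_number: issue.number,
                assignees: [context.payload.comment.user.login],
              });
            }
```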
To start contributing, you should fork this repository and then clone your fork. If you accidentally cloned the original repository instead of your fork, you can fix the remote at any time with one of these commands:

```shell
# for HTTPS
git remote set-url origin https://github.com/your-github-name/quinn.git
# for SSH
git remote set-url origin git@github.com:your-github-name/quinn.git
```

After cloning the project you should install all the dependencies. We are using poetry as a build tool. You can install poetry by following these instructions.
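If you want to double-check the remote fix above, you can rehearse it in a throwaway repository; all URLs here are placeholders:

```shell
# Rehearse the remote fix in a temporary repo; URLs are placeholders.
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git remote add origin https://github.com/some-upstream/quinn.git   # the wrong remote
git remote set-url origin https://github.com/your-github-name/quinn.git
git remote get-url origin   # now prints your fork's URL
```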
You can create a virtualenv with poetry. The recommended version of Python is 3.9:

```shell
poetry env use python3.9
```

After that, you should install all the dependencies, including the development ones:

```shell
make install_deps
```

To run the Spark tests you need properly configured Java. Apache Spark currently mainly supports Java 8 (1.8). You can find instructions on how to set up Java here. When you run the Spark tests, the JAVA_HOME environment variable should point to your Java 8 installation.
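A quick way to sanity-check the Java version is to parse the `java -version` banner. The snippet below uses a hard-coded example banner so it is self-contained; in practice you would capture the output of `"$JAVA_HOME/bin/java" -version` instead:

```shell
# Check that a `java -version` banner reports Java 8 (1.8).
# The banner is a hard-coded example; normally you'd capture it with:
#   banner=$("$JAVA_HOME/bin/java" -version 2>&1 | head -n 1)
banner='java version "1.8.0_292"'
major_minor=$(printf '%s' "$banner" | sed -n 's/.*"\([0-9]*\.[0-9]*\).*/\1/p')
if [ "$major_minor" = "1.8" ]; then
    echo "Java 8 detected"
else
    echo "expected Java 8, got: $banner" >&2
fi
```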
We use pre-commit hooks to ensure code quality. The configuration for pre-commit hooks is in the .pre-commit-config.yaml file. To install pre-commit, run:
```shell
poetry shell
poetry run pre-commit install
```

To run the pre-commit hooks manually, use:

```shell
pre-commit run --all-files
```

This project uses pytest and chispa for running the Spark tests. Please run all the tests before creating a pull request. When you are working on new functionality, you should also add new tests for it.
You can run the tests as follows:

```shell
make test
```

You can run GitHub Actions locally using the act tool. The configuration for GitHub Actions is in the .github/workflows/ci.yml file. To install act, follow the instructions here. To run a specific job, use:

```shell
act -j <job-name>
```

For example, to run the test job, use:

```shell
act -j test
```

If you need help with act, use:

```shell
act --help
```

For MacBooks with M1 processors, you might have to add the --container-architecture flag:

```shell
act -j <job-name> --container-architecture linux/arm64
```

To run the Spark-Connect tests locally, follow the steps below. Please note that this only works on Mac/UNIX-based systems.
- Set up the required environment variables: the following variables need to be set so that the shell script that installs the Spark-Connect binary and starts the server picks up the right version. The version can be either 3.5.1 or 3.4.3, as those are the ones used in our CI:

  ```shell
  export SPARK_VERSION=3.5.1
  export SPARK_CONNECT_MODE_ENABLED=1
  ```

- Check that the required environment variables are set:

  ```shell
  echo $SPARK_VERSION
  echo $SPARK_CONNECT_MODE_ENABLED
  ```

- Install the required system packages: run the command below to install wget.

  For Mac users:

  ```shell
  brew install wget
  ```

  For Ubuntu users:

  ```shell
  sudo apt-get install wget
  ```

- Execute the shell script: run the command below to execute the shell script that installs Spark-Connect and starts the server:

  ```shell
  sh scripts/run_spark_connect_server.sh
  ```

- Run the tests: run the command below to execute the tests using Spark-Connect:

  ```shell
  make test
  ```

- Clean up: after running the tests, you can stop the Spark-Connect server and unset the environment variables:

  ```shell
  unset SPARK_VERSION
  unset SPARK_CONNECT_MODE_ENABLED
  ```
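Since only 3.5.1 and 3.4.3 are exercised in our CI, a small guard like the following (illustrative, not part of the repository) can catch an unsupported SPARK_VERSION before you start the server:

```shell
# Illustrative guard: accept only the Spark versions used in CI.
check_spark_version() {
    case "$1" in
        3.5.1|3.4.3) echo "ok: $1" ;;
        *) echo "unsupported SPARK_VERSION: $1" >&2; return 1 ;;
    esac
}

# Falls back to 3.5.1 if SPARK_VERSION is unset.
check_spark_version "${SPARK_VERSION:-3.5.1}"
```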
This project follows the PySpark style guide. All public functions and methods should be documented in README.md and should also have docstrings in Sphinx format:

```python
"""[Summary]

:param [ParamName]: [ParamDescription], defaults to [DefaultParamVal]
:type [ParamName]: [ParamType](, optional)
...
:raises [ErrorType]: [ErrorDescription]
...
:return: [ReturnDescription]
:rtype: [ReturnType]
"""
```

We are using isort and ruff as linters. You can find instructions on how to set up and use these tools here:
- Install the Ruff extension by Astral Software from the VSCode marketplace (Extension ID: charliermarsh.ruff).
- Open the command palette (Ctrl+Shift+P) and select `Preferences: Open Settings (JSON)`.
- Add the following configuration to your settings.json file:

  ```json
  {
      "python.linting.ruffEnabled": true,
      "python.linting.enabled": true,
      "python.formatting.provider": "none",
      "editor.formatOnSave": true
  }
  ```

The above settings enable linting with Ruff and format your code with Ruff on save.
To set up Ruff in PyCharm using poetry, follow these steps:
- Find the path to your `poetry` executable:
  - Open a terminal.
  - For macOS/Linux, use the command `which poetry`.
  - For Windows, use the command `where poetry`.
  - Note down the path returned by the command.
- Open the `Preferences` window (Cmd+, on macOS).
- Navigate to `Tools` > `External Tools`.
- Click the `+` icon to add a new external tool.
- Fill in the following details:
  - Name: `Ruff`
  - Program: the path to your `poetry` executable that you noted earlier.
  - Arguments: `run ruff check --fix $FilePathRelativeToProjectRoot$`
  - Working directory: `$ProjectFileDir$`
- Click `OK` to save the configuration.
- To run Ruff, right-click on a file or directory in the project view, select `External Tools`, and then select `Ruff`.
When you're finished with the changes, create a pull request, also known as a PR.
- Don't forget to link the PR to the issue if you are solving one.
- As you update your PR and apply changes, mark each conversation as resolved.
- If you run into any merge issues, check out this git tutorial to help you resolve merge conflicts and other issues.