
Commit e57fbda

Merge pull request #10 from databricks-industry-solutions/feat/mqttv1
Initial inclusion of the mqtt library into the python data source connector
2 parents: d2a1abb + 85b21d5

File tree

9 files changed: +878 −0 lines changed

mqtt/.gitignore

Lines changed: 178 additions & 0 deletions
@@ -0,0 +1,178 @@
# Databricks-specific Zone
.DS_Store
.python-version

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# UV
# Similar to Pipfile.lock, it is generally recommended to include uv.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
#uv.lock

# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock

# pdm
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
# in version control.
# https://pdm.fming.dev/latest/usage/project/#working-with-version-control
.pdm.toml
.pdm-python
.pdm-build/

# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

# PyCharm
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/

# Ruff stuff:
.ruff_cache/

# PyPI configuration file
.pypirc

mqtt/LICENSE.md

Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
# DB license
**Definitions.**

Agreement: The agreement between Databricks, Inc., and you governing the use of the Databricks Services, as that term is defined in the Master Cloud Services Agreement (MCSA) located at www.databricks.com/legal/mcsa.

Licensed Materials: The source code, object code, data, and/or other works to which this license applies.

**Scope of Use.** You may not use the Licensed Materials except in connection with your use of the Databricks Services pursuant to the Agreement. Your use of the Licensed Materials must comply at all times with any restrictions applicable to the Databricks Services, generally, and must be used in accordance with any applicable documentation. You may view, use, copy, modify, publish, and/or distribute the Licensed Materials solely for the purposes of using the Licensed Materials within or connecting to the Databricks Services. If you do not agree to these terms, you may not view, use, copy, modify, publish, and/or distribute the Licensed Materials.

**Redistribution.** You may redistribute and sublicense the Licensed Materials so long as all use is in compliance with these terms. In addition:

- You must give any other recipients a copy of this License;
- You must cause any modified files to carry prominent notices stating that you changed the files;
- You must retain, in any derivative works that you distribute, all copyright, patent, trademark, and attribution notices, excluding those notices that do not pertain to any part of the derivative works; and
- If a "NOTICE" text file is provided as part of its distribution, then any derivative works that you distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the derivative works.

You may add your own copyright statement to your modifications and may provide additional license terms and conditions for use, reproduction, or distribution of your modifications, or for any such derivative works as a whole, provided your use, reproduction, and distribution of the Licensed Materials otherwise complies with the conditions stated in this License.

**Termination.** This license terminates automatically upon your breach of these terms or upon the termination of your Agreement. Additionally, Databricks may terminate this license at any time on notice. Upon termination, you must permanently delete the Licensed Materials and all copies thereof.

**DISCLAIMER; LIMITATION OF LIABILITY.**

THE LICENSED MATERIALS ARE PROVIDED “AS-IS” AND WITH ALL FAULTS. DATABRICKS, ON BEHALF OF ITSELF AND ITS LICENSORS, SPECIFICALLY DISCLAIMS ALL WARRANTIES RELATING TO THE LICENSED MATERIALS, EXPRESS AND IMPLIED, INCLUDING, WITHOUT LIMITATION, IMPLIED WARRANTIES, CONDITIONS AND OTHER TERMS OF MERCHANTABILITY, SATISFACTORY QUALITY OR FITNESS FOR A PARTICULAR PURPOSE, AND NON-INFRINGEMENT. DATABRICKS AND ITS LICENSORS TOTAL AGGREGATE LIABILITY RELATING TO OR ARISING OUT OF YOUR USE OF OR DATABRICKS’ PROVISIONING OF THE LICENSED MATERIALS SHALL BE LIMITED TO ONE THOUSAND ($1,000) DOLLARS. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE LICENSED MATERIALS OR THE USE OR OTHER DEALINGS IN THE LICENSED MATERIALS.

mqtt/Makefile

Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
.PHONY: all clean dev test style check

all: clean style test

clean: ## Remove build artifacts and cache files
	rm -rf build/
	rm -rf dist/
	rm -rf *.egg-info/
	rm -rf htmlcov/
	rm -rf .coverage
	rm -rf coverage.xml
	rm -rf .pytest_cache/
	rm -rf .mypy_cache/
	rm -rf .ruff_cache/
	find . -type d -name __pycache__ -delete
	find . -type f -name "*.pyc" -delete

test:
	pip install -r requirements.txt
	pytest .

dev:
	pip install -r requirements.txt

style:
	pre-commit run --all-files

check: style test

mqtt/README.md

Lines changed: 138 additions & 0 deletions
@@ -0,0 +1,138 @@
# MQTT Data Source Connectors for PySpark
[![Unity Catalog](https://img.shields.io/badge/Unity_Catalog-Enabled-00A1C9?style=for-the-badge)](https://docs.databricks.com/en/data-governance/unity-catalog/index.html)
[![Serverless](https://img.shields.io/badge/Serverless-Compute-00C851?style=for-the-badge)](https://docs.databricks.com/en/compute/serverless.html)

# Databricks Python Data Sources

Introduced in Spark 4.x, the Python Data Source API lets you create PySpark data sources that leverage long-standing Python libraries for handling unique file types or specialized interfaces, exposed through the Spark read, readStream, write, and writeStream APIs.
| Data Source Name | Purpose |
| --- | --- |
| [MQTT](https://pypi.org/project/paho-mqtt/) | Read MQTT messages from a broker |
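
For orientation, here is a minimal sketch of a custom source built on this API. It is a toy batch source, not this repo's MQTT implementation; the `FixedRowsDataSource` class and `fixed_rows` format name are invented for the example.

```python
from pyspark.sql.datasource import DataSource, DataSourceReader

class FixedRowsDataSource(DataSource):
    """Toy batch source that returns two hard-coded rows."""

    @classmethod
    def name(cls):
        # The string used with spark.read.format(...)
        return "fixed_rows"

    def schema(self):
        return "id INT, value STRING"

    def reader(self, schema):
        return FixedRowsReader()

class FixedRowsReader(DataSourceReader):
    def read(self, partition):
        # Yield tuples matching the declared schema.
        yield (1, "hello")
        yield (2, "world")

# Usage:
# spark.dataSource.register(FixedRowsDataSource)
# spark.read.format("fixed_rows").load().show()
```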
---

## Configuration Options

The MQTT data source supports the following configuration options, which can be set via Spark options or environment variables:

| Option | Description | Required | Default |
|--------|-------------|----------|---------|
| `broker_address` | Hostname or IP address of the MQTT broker | Yes | - |
| `port` | Port number of the MQTT broker | No | 8883 |
| `username` | Username for broker authentication | No | "" |
| `password` | Password for broker authentication | No | "" |
| `topic` | MQTT topic to subscribe/publish to | No | "#" |
| `qos` | Quality of Service level (0, 1, or 2) | No | 0 |
| `require_tls` | Enable SSL/TLS (true/false) | No | true |
| `keepalive` | Keep-alive interval in seconds | No | 60 |
| `clean_session` | Clean-session flag (true/false) | No | false |
| `conn_time` | Connection timeout in seconds | No | 1 |
| `ca_certs` | Path to CA certificate file | No | - |
| `certfile` | Path to client certificate file | No | - |
| `keyfile` | Path to client key file | No | - |
| `tls_disable_certs` | Disable certificate verification | No | - |

You can set these options in your PySpark code, for example:

```python
display(
    spark.readStream.format("mqtt_pub_sub")
    .option("topic", "#")
    .option("broker_address", "host")
    .option("username", "secret_user")
    .option("password", "secret_password")
    .option("qos", 2)
    .option("require_tls", False)
    .load()
)
```
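
In a real workspace you would typically avoid hardcoding credentials as above. Here is a sketch of the same read pulling credentials from a Databricks secret scope via `dbutils.secrets` (the `mqtt-scope` scope and its key names are hypothetical; `dbutils` is only available in Databricks notebooks):

```python
# Hypothetical secret scope/key names -- substitute your own.
username = dbutils.secrets.get(scope="mqtt-scope", key="user")
password = dbutils.secrets.get(scope="mqtt-scope", key="pass")

df = (
    spark.readStream.format("mqtt_pub_sub")
    .option("topic", "#")
    .option("broker_address", "host")
    .option("username", username)
    .option("password", password)
    .option("qos", 2)
    .load()
)
```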

---

## Building and Running Tests

* Clone the repo
* Create a virtual environment (Python 3.11)
* Ensure Docker/Podman is installed and properly configured
* Spin up a Docker container for a local MQTT server (a broker smoke-test sketch follows this list):
```yaml
version: "3.7"
services:
  mqtt5:
    userns_mode: keep-id
    image: eclipse-mosquitto
    container_name: mqtt5
    ports:
      - "1883:1883" # default MQTT port
      - "9001:9001" # default MQTT-over-WebSockets port
    volumes:
      - ./config:/mosquitto/config:rw
      - ./data:/mosquitto/data:rw
      - ./log:/mosquitto/log:rw
    restart: unless-stopped
```

* Create a `.env` file in the project root directory:

```dotenv
MQTT_BROKER_HOST=
MQTT_BROKER_PORT=
MQTT_USERNAME=
MQTT_PASSWORD=
MQTT_BROKER_TOPIC_PREFIX=
```

* Run tests from the project root directory:

```shell
make test
```

* Build the package:

```shell
python -m build
```
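
Before running the test suite, you can smoke-test the local broker directly with paho-mqtt, the same library this connector wraps. A minimal sketch, assuming paho-mqtt ≥ 2.0 and that the Mosquitto container above is running on localhost:1883 with anonymous connections allowed:

```python
import paho.mqtt.client as mqtt

# paho-mqtt >= 2.0 requires an explicit callback API version.
client = mqtt.Client(mqtt.CallbackAPIVersion.VERSION2)
client.connect("localhost", 1883, keepalive=60)
client.loop_start()  # run the network loop in a background thread
info = client.publish("test/smoke", payload="hello", qos=1)  # arbitrary topic
info.wait_for_publish(timeout=5)
client.loop_stop()
client.disconnect()
```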
---

## Example Usage

```python
spark.dataSource.register(MqttDataSource)

df = (
    spark.readStream.format("mqtt_pub_sub")
    .option("topic", "#")
    .option("broker_address", "host")
    .option("username", "secret_user")
    .option("password", "secret_password")
    .option("qos", 2)
    .option("require_tls", False)
    .load()
)

df.writeStream.format("console").start().awaitTermination()
```
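
For durable output, you would usually write the stream to a table with a checkpoint instead of the console sink. A sketch, with hypothetical table and checkpoint names:

```python
# Hypothetical table name and checkpoint location -- substitute your own.
(
    df.writeStream.format("delta")
    .option("checkpointLocation", "/tmp/checkpoints/mqtt_demo")
    .toTable("main.default.mqtt_messages")
)
```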
---

## Project Support

The code in this project is provided **for exploration purposes only** and is **not formally supported** by Databricks under any Service Level Agreements (SLAs). It is provided **AS-IS**, without any warranties or guarantees.

Please **do not submit support tickets** to Databricks for issues related to the use of this project.

The source code provided is subject to the Databricks [LICENSE](https://github.com/databricks-industry-solutions/python-data-sources/blob/main/LICENSE.md). All third-party libraries included or referenced are subject to their respective licenses set forth in the project license.

Any issues or bugs found should be submitted as **GitHub Issues** on the project repository. While these will be reviewed as time permits, there are **no formal SLAs** for support.

## 📄 Third-Party Package Licenses

© 2025 Databricks, Inc. All rights reserved. The source in this project is provided subject to the Databricks License [https://databricks.com/db-license-source]. All included or referenced third-party libraries are subject to the licenses set forth below.

| Datasource | Package | Purpose | License | Source |
| ---------- | ---------- | --------------------------------- | ----------- | ------------------------------------ |
| paho-mqtt | paho-mqtt | Python API for MQTT | EPL-2.0 & EDL-1.0 | https://pypi.org/project/paho-mqtt/ |

## References

- [Paho MQTT Python Client](https://pypi.org/project/paho-mqtt/)
- [Eclipse Mosquitto](https://mosquitto.org/)
- [Databricks Python Data Source API](https://docs.databricks.com/en/data-engineering/data-sources/python-data-sources.html)
