@sjyango
Contributor

@sjyango sjyango commented Jan 4, 2026

This PR references ByteDance's implementation. Co-Author: @WencongLiu

Description

This PR introduces native support for Python User-Defined Functions (UDF), User-Defined Aggregate Functions (UDAF), and User-Defined Table Functions (UDTF) in Doris, enabling users to extend SQL capabilities with custom Python logic for complex data processing scenarios.

Key Features

🚀 Three Function Types

  • UDF: Scalar functions with row-by-row or vectorized execution (3-10x performance gain with Pandas/Arrow mode)
  • UDAF: Snowflake-style stateful aggregation with distributed merge support
  • UDTF: Table-valued functions that generate multiple output rows from a single input row
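As a rough illustration of the UDTF bullet above, the sketch below expands each input row into several output rows. The function name `split_words` and the driver `apply_udtf` are hypothetical; the registration syntax and exact handler signature that Doris expects are not shown in this PR description, so treat this only as a model of the one-row-in, many-rows-out behavior.

```python
from typing import Iterator, List, Tuple


def split_words(line: str) -> Iterator[Tuple[str, int]]:
    """Hypothetical UDTF body: emit one (word, position) row per word."""
    for position, word in enumerate(line.split()):
        yield word, position


def apply_udtf(rows: List[str]) -> List[Tuple[str, int]]:
    """Stand-in for the engine: expand every input row into its output rows."""
    output = []
    for row in rows:
        output.extend(split_words(row))
    return output


result = apply_udtf(["hello doris", "python udtf works"])
```

An input row with no words produces no output rows, which is the usual table-function behavior for empty results.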

🔧 Production-Grade Architecture

  • High-Performance Communication: Arrow Flight RPC over Unix sockets with zero-copy columnar data transfer
  • Multi-Version Support: Flexible environment management via Conda or venv, allowing different UDFs to use different Python versions (e.g., 3.9, 3.10, 3.12)

🎯 Deep Integration

  • Seamless integration with Doris vectorized execution engine
  • Native support for Doris data types (including complex types like ARRAY, MAP, STRUCT)
  • Automatic conversion between Doris types and Python/Arrow types

Architecture Highlights

┌──────────────────────────────────────────────────────┐
│  Doris BE (C++)                                      │
│  ┌────────────────────────────────────────────────┐  │
│  │  PythonServerManager (Process Pool)            │  │
│  │  ├─ Health Check Thread (60s interval)         │  │
│  │  ├─ Load Balancing (min ref count)             │  │
│  │  └─ Auto Recovery (dead process detection)     │  │
│  └────────────────────────────────────────────────┘  │
│  ┌────────────────────────────────────────────────┐  │
│  │  PythonClient (Arrow Flight RPC)               │  │
│  │  ├─ UDF: Scalar/Vectorized evaluation          │  │
│  │  ├─ UDAF: Stateful aggregation with merge      │  │
│  │  └─ UDTF: ListArray batch processing           │  │
│  └────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────┘
                      ↕ Unix Socket
┌──────────────────────────────────────────────────────┐
│  Python Process (python_server.py)                   │
│  ┌────────────────────────────────────────────────┐  │
│  │  FlightServer (Arrow Flight bidirectional)     │  │
│  │  ├─ AdaptivePythonUDF (auto mode selection)    │  │
│  │  ├─ UDAFStateManager (Snowflake interface)     │  │
│  │  └─ UDFLoader (inline/module code execution)   │  │
│  └────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────┘

Configuration

Add to be.conf:

# Enable Python UDF support
enable_python_udf_support = true

# Choose one environment management mode (conda or venv)
python_env_mode = conda

# For Conda mode
python_conda_root_path = /path/to/miniconda3

# For venv mode, use these settings instead of the Conda ones above
# python_env_mode = venv
# python_venv_root_path = /doris/python_envs
# python_venv_interpreter_paths = /opt/python3.9/bin/python3.9:/opt/python3.12/bin/python3.12

# Process pool size (0 = use CPU core count)
max_python_process_num = 0
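To make the semantics of these settings concrete, here is a minimal Python sketch that reads `key = value` lines, resolves a pool size of 0 to the CPU core count, and splits the colon-separated interpreter list. The BE parses be.conf in C++; the helper names below are hypothetical, and treating a missing `max_python_process_num` as 0 is an assumption made for the example.

```python
import os


def parse_be_conf(text: str) -> dict:
    """Parse `key = value` lines, skipping blank lines and `#` comments."""
    conf = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        conf[key.strip()] = value.strip()
    return conf


def effective_pool_size(conf: dict) -> int:
    """Resolve 0 to the CPU core count (missing key treated as 0: an assumption)."""
    configured = int(conf.get("max_python_process_num", "0"))
    return configured if configured > 0 else (os.cpu_count() or 1)


def interpreter_paths(conf: dict) -> list:
    """Split the colon-separated interpreter list used in venv mode."""
    value = conf.get("python_venv_interpreter_paths", "")
    return [path for path in value.split(":") if path]


SAMPLE = """
# Enable Python UDF support
enable_python_udf_support = true
python_env_mode = venv
python_venv_root_path = /doris/python_envs
python_venv_interpreter_paths = /opt/python3.9/bin/python3.9:/opt/python3.12/bin/python3.12
max_python_process_num = 0
"""

conf = parse_be_conf(SAMPLE)
```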

Technical Highlights

1. Environment Management

  • Multi-version support: Each UDF can specify its own Python version
  • Two modes: Conda (full environment isolation) or venv (lightweight)
  • Automatic discovery: Scans available Python environments at BE startup

2. Process Pool Management

  • Shared pool: One pool per Python version, shared across all threads
  • Load balancing: Distributes requests to processes with minimum load
  • Health monitoring: Background thread checks process health every 60 seconds
  • Auto recovery: Automatically recreates dead processes
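The pool behavior described above (pick the process with the fewest active references, replace dead processes) can be modeled in a few lines. This is a simplified Python sketch, not the C++ `PythonServerManager`: the `Worker` and `ProcessPool` classes are hypothetical, there is no locking, and the health check is called directly rather than from a background thread on a 60-second interval.

```python
class Worker:
    """Hypothetical stand-in for one Python server process."""

    def __init__(self, worker_id: int):
        self.worker_id = worker_id
        self.ref_count = 0   # number of clients currently using this worker
        self.alive = True


class ProcessPool:
    def __init__(self, size: int):
        self._next_id = 0
        self.workers = [self._spawn() for _ in range(size)]

    def _spawn(self) -> Worker:
        worker = Worker(self._next_id)
        self._next_id += 1
        return worker

    def acquire(self) -> Worker:
        """Pick the live worker with the fewest active references."""
        live = [w for w in self.workers if w.alive]
        if not live:
            raise RuntimeError("no live Python worker available")
        chosen = min(live, key=lambda w: w.ref_count)
        chosen.ref_count += 1
        return chosen

    def release(self, worker: Worker) -> None:
        worker.ref_count = max(0, worker.ref_count - 1)

    def health_check(self) -> int:
        """Replace dead workers with fresh ones; return how many were replaced."""
        replaced = 0
        for index, worker in enumerate(self.workers):
            if not worker.alive:
                self.workers[index] = self._spawn()
                replaced += 1
        return replaced


pool = ProcessPool(3)
first, second, third = pool.acquire(), pool.acquire(), pool.acquire()
```

With three idle workers, three consecutive `acquire()` calls land on three different workers, because each call raises the chosen worker's reference count.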

3. Communication Protocol

  • Arrow Flight RPC: High-performance, language-agnostic RPC framework
  • Unix Socket: Local IPC for minimal latency and enhanced security
  • Bidirectional streaming: Efficient batch data transfer

4. Execution Modes

  • Scalar mode: Process one value at a time (simple functions)
  • Vectorized mode: Process entire columns with NumPy/Pandas (10-100x faster)
  • Adaptive selection: Automatically chooses mode based on function signature
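One way to picture signature-based mode selection is the sketch below. It is an assumption-laden model rather than the `AdaptivePythonUDF` logic from this PR: plain `list` annotations stand in for Pandas/NumPy column types so the example runs with the standard library only, and `select_mode` and `evaluate` are hypothetical names.

```python
import inspect
from typing import get_origin


def add_one(x: int) -> int:
    """Scalar style: called once per value."""
    return x + 1


def add_one_batch(xs: list[int]) -> list[int]:
    """Vectorized style: called once per column batch."""
    return [x + 1 for x in xs]


def select_mode(func) -> str:
    """Return 'vectorized' if any parameter is annotated as a list, else 'scalar'."""
    for param in inspect.signature(func).parameters.values():
        annotation = param.annotation
        if annotation is list or get_origin(annotation) is list:
            return "vectorized"
    return "scalar"


def evaluate(func, column: list) -> list:
    """Dispatch a column to the function according to the selected mode."""
    if select_mode(func) == "vectorized":
        return func(column)
    return [func(value) for value in column]


scalar_result = evaluate(add_one, [1, 2, 3])
batch_result = evaluate(add_one_batch, [1, 2, 3])
```

Both calls return the same values; the difference is that the vectorized function crosses the call boundary once per batch instead of once per row.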

5. UDAF State Management (Snowflake Style)

  • 5 lifecycle members: __init__, accumulate, merge, finish, and aggregate_state (a property in Snowflake's interface)
  • Distributed aggregation: Serialization/deserialization for shuffle operations
  • Efficient state handling: Place-based mapping avoids redundant transfers
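A Snowflake-style aggregate handler with the five members listed above might look like the following sketch. The `AvgUDAF` class and the two driver functions are hypothetical, and `pickle` is used only as an assumed stand-in for whatever serialization the PR uses to move partial states through a shuffle.

```python
import pickle


class AvgUDAF:
    """Average aggregate following the Snowflake-style handler shape."""

    def __init__(self):
        self._sum = 0.0
        self._count = 0

    @property
    def aggregate_state(self):
        # Partial state that can be serialized and sent to another node.
        return (self._sum, self._count)

    def accumulate(self, value):
        # Fold one input value into the local state; NULLs are skipped.
        if value is not None:
            self._sum += value
            self._count += 1

    def merge(self, other_state):
        # Combine a partial state produced elsewhere into this one.
        other_sum, other_count = other_state
        self._sum += other_sum
        self._count += other_count

    def finish(self):
        return self._sum / self._count if self._count else None


def partial_aggregate(values) -> bytes:
    """Aggregate one partition and serialize its state (pickle is an assumption)."""
    agg = AvgUDAF()
    for value in values:
        agg.accumulate(value)
    return pickle.dumps(agg.aggregate_state)


def final_aggregate(blobs):
    """Merge serialized partial states and produce the final result."""
    agg = AvgUDAF()
    for blob in blobs:
        agg.merge(pickle.loads(blob))
    return agg.finish()


result = final_aggregate([
    partial_aggregate([1.0, 2.0]),
    partial_aggregate([3.0, None, 6.0]),
])
```

Keeping the state as a small `(sum, count)` tuple is what makes the merge step cheap: only the partial state crosses the network, never the raw rows.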

Limitations

  1. Performance: Python UDFs are slower than native C++ built-in functions. Best suited for complex logic that's difficult to implement in SQL.
  2. Type support: Special Doris types like HLL and Bitmap are not yet supported.
  3. Concurrency: Parallelism is limited by the max_python_process_num setting.

Related Documentation

  • Python UDF User Guide
  • Python UDAF User Guide
  • Python UDTF User Guide
  • Python Environment Configuration Guide

This PR enables users to leverage the rich Python ecosystem (NumPy, Pandas, scikit-learn, etc.) directly within Doris SQL queries, significantly expanding the platform's data processing capabilities.

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@Thearas
Contributor

Thearas commented Jan 4, 2026

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@sjyango force-pushed the python_udf branch 2 times, most recently from fba1d68 to 58258db on January 6, 2026 05:51
@sjyango
Contributor Author

sjyango commented Jan 6, 2026

run buildall

@hello-stephen
Contributor

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 79.57% (1780/2237)
Line Coverage 64.91% (31561/48624)
Region Coverage 65.46% (15701/23986)
Branch Coverage 56.06% (8345/14886)

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 0.00% (0/437) 🎉
Increment coverage report
Complete coverage report

@sjyango force-pushed the python_udf branch 2 times, most recently from 5ca8e8d to 511c8ad on January 8, 2026 14:37
@sjyango
Contributor Author

sjyango commented Jan 8, 2026

run buildall

@doris-robot

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 79.57% (1784/2242)
Line Coverage 64.75% (31728/48997)
Region Coverage 65.43% (15782/24120)
Branch Coverage 56.02% (8385/14968)

@doris-robot

TPC-H: Total hot run time: 32292 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
TPC-H sf100 test result on commit fd27a83e92c9b225b457c2281ed688f086663e88, data reload: false

------ Round 1 ----------------------------------
query	cold run (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot run (ms)
q1	17628	4186	4081	4081
q2	2033	371	248	248
q3	10511	1284	726	726
q4	10371	901	333	333
q5	9689	2063	1990	1990
q6	246	171	143	143
q7	964	813	674	674
q8	9288	1444	1181	1181
q9	5106	4668	4585	4585
q10	6880	1807	1401	1401
q11	544	312	289	289
q12	726	745	605	605
q13	17810	3928	3108	3108
q14	291	301	289	289
q15	598	520	513	513
q16	706	709	652	652
q17	676	841	495	495
q18	6987	6523	7475	6523
q19	1177	1036	641	641
q20	451	430	249	249
q21	3174	2560	2589	2560
q22	1127	1091	1006	1006
Total cold run time: 106983 ms
Total hot run time: 32292 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4429	4273	4354	4273
q2	333	433	317	317
q3	2303	2830	2444	2444
q4	1526	1948	1633	1633
q5	4439	4468	4609	4468
q6	220	170	131	131
q7	1974	1904	1755	1755
q8	2662	2472	2425	2425
q9	7176	7303	7048	7048
q10	2489	2732	2283	2283
q11	527	464	446	446
q12	673	751	620	620
q13	3399	3867	3107	3107
q14	285	284	260	260
q15	536	484	489	484
q16	623	663	626	626
q17	1135	1378	1376	1376
q18	7453	7325	7200	7200
q19	858	833	836	833
q20	1919	1963	1820	1820
q21	4609	4301	4150	4150
q22	1071	1013	971	971
Total cold run time: 50639 ms
Total hot run time: 48670 ms

@doris-robot

TPC-DS: Total hot run time: 172159 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit fd27a83e92c9b225b457c2281ed688f086663e88, data reload: false

query	cold run (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot run (ms)
query5	4445	591	439	439
query6	339	237	219	219
query7	4206	458	262	262
query8	325	252	256	252
query9	8761	2669	2689	2669
query10	520	380	333	333
query11	15167	15124	14834	14834
query12	178	113	117	113
query13	1266	493	383	383
query14	6157	3038	2785	2785
query14_1	2666	2676	2699	2676
query15	199	196	175	175
query16	989	485	458	458
query17	1112	699	592	592
query18	2489	423	339	339
query19	226	222	192	192
query20	118	113	113	113
query21	209	135	125	125
query22	3795	3990	3738	3738
query23	15909	15576	15295	15295
query23_1	15324	15373	15450	15373
query24	7430	1535	1160	1160
query24_1	1164	1148	1167	1148
query25	529	439	392	392
query26	1257	263	152	152
query27	2777	458	286	286
query28	4549	2143	2133	2133
query29	760	547	434	434
query30	316	246	222	222
query31	827	616	563	563
query32	77	79	71	71
query33	540	339	290	290
query34	928	872	529	529
query35	713	754	670	670
query36	865	883	744	744
query37	125	93	74	74
query38	2739	2677	2710	2677
query39	781	757	734	734
query39_1	729	699	698	698
query40	220	134	116	116
query41	64	61	62	61
query42	104	102	105	102
query43	467	432	425	425
query44	1319	719	718	718
query45	182	185	177	177
query46	851	969	581	581
query47	1369	1458	1353	1353
query48	312	316	239	239
query49	606	417	323	323
query50	641	289	204	204
query51	3769	3774	3744	3744
query52	110	110	101	101
query53	292	327	272	272
query54	281	248	247	247
query55	77	75	71	71
query56	303	293	300	293
query57	1049	971	953	953
query58	303	255	259	255
query59	1963	2143	2103	2103
query60	323	320	297	297
query61	161	167	158	158
query62	395	363	307	307
query63	299	274	271	271
query64	4956	1306	1019	1019
query65	3754	3736	3709	3709
query66	1435	412	302	302
query67	14863	15061	14397	14397
query68	2820	1029	735	735
query69	448	347	313	313
query70	1023	902	925	902
query71	320	289	274	274
query72	6233	3659	3668	3659
query73	599	730	303	303
query74	8741	8763	8650	8650
query75	2773	2828	2457	2457
query76	2870	1061	656	656
query77	359	383	287	287
query78	9785	10071	9170	9170
query79	1065	900	595	595
query80	1425	569	474	474
query81	552	265	227	227
query82	989	144	111	111
query83	362	260	239	239
query84	255	117	101	101
query85	1209	538	470	470
query86	407	303	290	290
query87	2885	2841	2794	2794
query88	3268	2233	2229	2229
query89	396	343	335	335
query90	1911	154	145	145
query91	172	164	147	147
query92	66	68	64	64
query93	1166	901	524	524
query94	628	323	282	282
query95	562	371	309	309
query96	587	463	210	210
query97	2314	2363	2304	2304
query98	213	202	195	195
query99	583	592	538	538
Total cold run time: 248003 ms
Total hot run time: 172159 ms

4 participants