Description
Hi all, I have recently been debugging a SIGSEGV in Ray's OTEL integration. Ray is a library for building distributed Python applications; it dispatches users' Python functions to different machines for distributed processing. As a consequence, users can call setenv at any time from their own thread while the OTEL integration runs in a background thread.
The SIGSEGV comes from a race between the user's setenv and the getenv reached from OtlpGrpcClient::MakeChannel. Specifically, it happens inside grpc::CreateCustomChannel:
grpc::CreateCustomChannel(grpc_target, grpc::InsecureChannelCredentials(), grpc_arguments);
When grpc_arguments has not been configured with grpc_arguments->SetInt(GRPC_ARG_ENABLE_HTTP_PROXY, 0), grpc::CreateCustomChannel calls GetEnv("no_grpc_proxy"), which races with the concurrent setenv and causes the SIGSEGV.
We want to avoid the getenv call. Would it be possible to add an option to OtlpGrpcClientOptions and OtlpGrpcMetricExporterOptions that lets us set grpc_arguments->SetInt(GRPC_ARG_ENABLE_HTTP_PROXY, 0)? Thanks.
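To make the request concrete, here is a rough sketch of the shape such an option might take. It is only an illustration under assumptions: `SketchClientOptions`, its `disable_http_proxy` field, `FakeChannelArguments`, and `BuildChannelArguments` are hypothetical names invented for this sketch and are not part of opentelemetry-cpp or gRPC. `FakeChannelArguments` stands in for `grpc::ChannelArguments` so the snippet compiles without gRPC, and `"grpc.enable_http_proxy"` is the string that the `GRPC_ARG_ENABLE_HTTP_PROXY` macro expands to.

```cpp
#include <map>
#include <string>

// Stand-in for grpc::ChannelArguments so this sketch builds without gRPC.
// Only the SetInt call discussed in this issue is modeled.
struct FakeChannelArguments {
  std::map<std::string, int> ints;
  void SetInt(const std::string& key, int value) { ints[key] = value; }
};

// String value of gRPC's GRPC_ARG_ENABLE_HTTP_PROXY macro.
constexpr const char* kEnableHttpProxyArg = "grpc.enable_http_proxy";

// Hypothetical subset of OtlpGrpcClientOptions. `disable_http_proxy` is the
// proposed field; the name is an assumption made for this sketch.
struct SketchClientOptions {
  std::string endpoint;
  bool disable_http_proxy = false;
};

// Models the part of MakeChannel that would prepare the channel arguments.
FakeChannelArguments BuildChannelArguments(const SketchClientOptions& options) {
  FakeChannelArguments args;
  if (options.disable_http_proxy) {
    // With real gRPC: grpc_arguments.SetInt(GRPC_ARG_ENABLE_HTTP_PROXY, 0);
    args.SetInt(kEnableHttpProxyArg, 0);
  }
  return args;
}
```

Keeping the field off by default would preserve today's behavior for existing users, while callers such as Ray could opt in and skip the proxy lookup, and with it the getenv call.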
Here is the SIGSEGV backtrace for your reference:
Core was generated by `ray::IDLE '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 __pthread_kill_implementation (no_tid=0, signo=11, threadid=130592523916864) at ./nptl/pthread_kill.c:44
44 ./nptl/pthread_kill.c: No such file or directory.
[Current thread is 1 (Thread 0x76c5f0ff9640 (LWP 401864))]
(gdb) bt
#0 __pthread_kill_implementation (no_tid=0, signo=11, threadid=130592523916864) at ./nptl/pthread_kill.c:44
#1 __pthread_kill_internal (signo=11, threadid=130592523916864) at ./nptl/pthread_kill.c:78
#2 __GI___pthread_kill (threadid=130592523916864, signo=signo@entry=11) at ./nptl/pthread_kill.c:89
#3 0x000076c618f9d476 in __GI_raise (sig=11) at ../sysdeps/posix/raise.c:26
#4 <signal handler called>
#5 __pthread_kill_implementation (no_tid=0, signo=11, threadid=130592523916864) at ./nptl/pthread_kill.c:44
#6 __pthread_kill_internal (signo=11, threadid=130592523916864) at ./nptl/pthread_kill.c:78
#7 __GI___pthread_kill (threadid=130592523916864, signo=signo@entry=11) at ./nptl/pthread_kill.c:89
#8 0x000076c618f9d476 in __GI_raise (sig=11) at ../sysdeps/posix/raise.c:26
#9 <signal handler called>
#10 __GI_getenv (name=0x76c61840a863 "pc_proxy") at ./stdlib/getenv.c:84
#11 0x000076c61823e9ea in grpc_core::GetEnv(char const*) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#12 0x000076c6180fa0cf in grpc_core::HttpProxyMapper::MapName(std::basic_string_view<char, std::char_traits<char> >, grpc_core::ChannelArgs*) ()
from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#13 0x000076c61822315b in grpc_core::ProxyMapperRegistry::MapName(std::basic_string_view<char, std::char_traits<char> >, grpc_core::ChannelArgs*) const ()
from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#14 0x000076c6180e8e1a in grpc_core::ClientChannel::ClientChannel(grpc_channel_element_args*, absl::lts_20230802::Status*) ()
from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#15 0x000076c6180e9707 in grpc_core::ClientChannel::Init(grpc_channel_element*, grpc_channel_element_args*) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#16 0x000076c618134b01 in grpc_channel_stack_init(int, void (*)(void*, absl::lts_20230802::Status), void*, grpc_channel_filter const**, unsigned long, grpc_core::ChannelArgs const&, char const*, grpc_channel_stack*) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#17 0x000076c6181369be in grpc_core::ChannelStackBuilderImpl::Build() () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#18 0x000076c618164e28 in grpc_core::Channel::CreateWithBuilder(grpc_core::ChannelStackBuilder*) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#19 0x000076c6181656df in grpc_core::Channel::Create(char const*, grpc_core::ChannelArgs, grpc_channel_stack_type, grpc_transport*) ()
from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#20 0x000076c617ef8277 in grpc_channel_create () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#21 0x000076c617d8815e in grpc::(anonymous namespace)::InsecureChannelCredentialsImpl::CreateChannelWithInterceptors(std::string const&, grpc::ChannelArguments const&, std::vector<std::unique_ptr<grpc::experimental::ClientInterceptorFactoryInterface, std::default_delete<grpc::experimental::ClientInterceptorFactoryInterface> >, std::allocator<std::unique_ptr<grpc::experimental::ClientInterceptorFactoryInterface, std::default_delete<grpc::experimental::ClientInterceptorFactoryInterface> > > >) ()
from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#22 0x000076c617d88274 in grpc::(anonymous namespace)::InsecureChannelCredentialsImpl::CreateChannelImpl(std::string const&, grpc::ChannelArguments const&) ()
from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#23 0x000076c617d863a9 in grpc::CreateCustomChannel(std::string const&, std::shared_ptr<grpc::ChannelCredentials> const&, grpc::ChannelArguments const&) ()
from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#24 0x000076c617d12dbe in opentelemetry::v1::exporter::otlp::OtlpGrpcClient::MakeChannel(opentelemetry::v1::exporter::otlp::OtlpGrpcClientOptions const&) ()
from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#25 0x000076c617d135bf in opentelemetry::v1::exporter::otlp::OtlpGrpcClient::OtlpGrpcClient(opentelemetry::v1::exporter::otlp::OtlpGrpcClientOptions const&) ()
from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#26 0x000076c617d13923 in opentelemetry::v1::exporter::otlp::OtlpGrpcClientFactory::Create(opentelemetry::v1::exporter::otlp::OtlpGrpcClientOptions const&) ()
from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#27 0x000076c617d0a1c9 in opentelemetry::v1::exporter::otlp::OtlpGrpcMetricExporter::OtlpGrpcMetricExporter(opentelemetry::v1::exporter::otlp::OtlpGrpcMetricExporterOptions const&) ()
from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#28 0x000076c617d00f8f in ray::observability::OpenTelemetryMetricRecorder::Start(std::string const&, std::chrono::duration<long, std::ratio<1l, 1000l> >, std::chrono::duration<long, std::ratio<1l, 1000l> >) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#29 0x000076c61772a5a0 in ray::core::CoreWorkerProcessImpl::CoreWorkerProcessImpl(ray::core::CoreWorkerOptions const&)::{lambda(ray::Status const&)#2}::operator()(ray::Status const&) const () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#30 0x000076c617b247ca in auto ray::rpc::MetricsAgentClientImpl::WaitForServerReadyWithRetry(std::function<void (ray::Status const&)>, int, int, int)::{lambda(auto:1&, auto:2&&)#1}::operator()<ray::Status const, ray::rpc::HealthCheckReply>(ray::Status const&, ray::rpc::HealthCheckReply&&) const [clone .constprop.0] () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#31 0x000076c617b26a35 in ray::rpc::ClientCallImpl<ray::rpc::HealthCheckReply>::OnReplyReceived() () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#32 0x000076c61772c2b5 in std::_Function_handler<void (), ray::rpc::ClientCallManager::PollEventsFromCompletionQueue(int)::{lambda()#1}>::_M_invoke(std::_Any_data const&) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#33 0x000076c617cf906b in EventTracker::RecordExecution(std::function<void ()> const&, std::shared_ptr<StatsHandle>) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#34 0x000076c617cefb8b in std::_Function_handler<void (), instrumented_io_context::post(std::function<void ()>, std::string, long)::{lambda()#1}>::_M_invoke(std::_Any_data const&) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#35 0x000076c61782d08b in boost::asio::detail::executor_op<boost::asio::detail::binder0<std::function<void ()> >, std::allocator<void>, boost::asio::detail::scheduler_operation>::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#36 0x000076c61825462b in boost::asio::detail::scheduler::do_run_one(boost::asio::detail::conditionally_enabled_mutex::scoped_lock&, boost::asio::detail::scheduler_thread_info&, boost::system::error_code const&) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#37 0x000076c618255fc9 in boost::asio::detail::scheduler::run(boost::system::error_code&) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#38 0x000076c6182566d2 in boost::asio::io_context::run() () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#39 0x000076c617729994 in ray::core::CoreWorkerProcessImpl::CreateCoreWorker(ray::core::CoreWorkerOptions, ray::WorkerID const&)::{lambda()#1}::operator()() const () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#40 0x000076c617860270 in thread_proxy () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#41 0x000076c618fefac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#42 0x000076c6190818d0 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81