Fix Alluxio worker/master/job-server communication to enable S3 access #19
Co-authored-by: phrocker <[email protected]>
@copilot marc@pop-os:/mnt/ExtraDrive/repos/accumulo$ kubectl logs accumulo-alluxio-worker-hxkxs
2025-10-19 16:39:26,735 INFO main - Logging metrics and jstack on WORKER exit... marc@pop-os:/mnt/ExtraDrive/repos/accumulo$ kubectl logs accumulo-alluxio-master-6457b4cc59-5sg9x |
Co-authored-by: phrocker <[email protected]>
Fixed in commit 0c8daa5. The error occurred because Alluxio doesn't support |
@copilot worker not failing, but seeing this in master.out: WARNING: [Channel<32>: (accumulo-alluxio-worker-77zvt:29999)] Failed to resolve name. status=Status{code=UNAVAILABLE, description=Unable to resolve host accumulo-alluxio-worker-77zvt, cause=java.lang.RuntimeException: java.net.UnknownHostException: accumulo-alluxio-worker-77zvt and master.log: |
Co-authored-by: phrocker <[email protected]>
Fixed in commit c8b4f91. Three issues resolved:
Workers now register with resolvable FQDNs, and both job master and job worker processes run with auto-restart monitoring. |
@copilot can write to s3 but instance id isn't being initialized properly and 2025-10-19T17:03:08,693 [manager.Manager] INFO : Version 4.0.0-SNAPSHOT |
Good news that S3 writes are working! The Alluxio communication fixes are complete. The ZooKeeper initialization error is a separate Accumulo issue unrelated to the Alluxio configuration: it indicates that the Accumulo instance hasn't been initialized in ZooKeeper yet. You'll need to run Accumulo's initialization command first: `accumulo init`. Or, if using the Helm chart, ensure the init job runs successfully before starting the manager. This is outside the scope of the Alluxio configuration fixes.
Fix Alluxio Configuration Issues ✅ UPDATED (3rd commit)
Fixed critical Alluxio configuration issues preventing worker/master/job-server communication and S3 access for Accumulo.
Latest Fix (3rd commit - c8b4f91)
- `publishNotReadyAddresses: true` to worker service for immediate DNS registration
Previous Fixes
- `Invalid property key env.POD_IP` error (2nd commit)
Issues Fixed ✅
- `${env.POD_IP}` syntax in Alluxio properties file
Changes Applied ✅
- `$(POD_NAME).<service>.<namespace>.svc.cluster.local`
- `publishNotReadyAddresses: true` to worker service
- `job_master` process on master with monitoring
- `job_worker` process on workers with monitoring
- `ALLUXIO_WORKER_HOSTNAME` and `ALLUXIO_JOB_WORKER_HOSTNAME` environment variables
- `${env.POD_IP}` syntax from ConfigMap properties
- `{{include "accumulo.fullname" .}}` template
Port Assignments
Master: 19998 (RPC), 19999 (Web), 20001 (Job RPC)
Worker: 29999 (RPC), 29998 (Data), 30000 (Web), 30002-30004 (Job Worker)
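As a rough way to confirm that the ports listed above are actually reachable from another pod, a small TCP probe can be sketched as follows. This is only an illustration: the dictionary layout and the function names `can_connect` and `probe` are hypothetical and are not part of the chart or of Alluxio.

```python
import socket

# Master ports copied from the assignments listed above.
# The dictionary layout is illustrative, not taken from the chart.
MASTER_PORTS = {"rpc": 19998, "web": 19999, "job_rpc": 20001}


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def probe(host, ports, timeout=2.0):
    """Map each named port to whether it accepted a TCP connection."""
    return {name: can_connect(host, port, timeout) for name, port in ports.items()}
```

Calling `probe(master_host, MASTER_PORTS)` from inside the cluster, where `master_host` is whatever hostname the master service resolves to in your deployment, returns a dictionary showing which of the three ports accepted a connection. A `False` for `job_rpc` would correspond to the "Failed to connect to master:20001" symptom.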
DNS Resolution
Workers now use FQDN for hostname registration:
- `<pod-name>.<service-name>.<namespace>.svc.cluster.local`
- `accumulo-alluxio-worker-77zvt.accumulo-alluxio-worker.default.svc.cluster.local`
- `publishNotReadyAddresses: true` ensures DNS records exist immediately
Process Architecture
Master Pod:
- `AlluxioMaster` (main master process on port 19998)
- `AlluxioJobMaster` (job coordination on port 20001)
Worker Pod:
- `AlluxioWorker` (data worker on ports 29998-30000)
- `AlluxioJobWorker` (job worker on ports 30002-30004)
Files Modified
- `charts/accumulo/templates/alluxio-master-deployment.yaml` (+8 lines)
- `charts/accumulo/templates/alluxio-master-service.yaml` (+4 lines)
- `charts/accumulo/templates/alluxio-worker-daemonset.yaml` (+33 lines)
- `charts/accumulo/templates/alluxio-worker-service.yaml` (+17 lines)
- `charts/accumulo/templates/configmap.yaml` (+14 lines)
Total: 5 files changed, 76 insertions(+), 5 deletions(-)
Expected Impact 🎯
✅ Workers use resolvable FQDN hostnames
✅ Master can connect to workers via DNS
✅ Job master process runs and is accessible
✅ Job workers connect to job master successfully
✅ No more "Unable to resolve host" errors
✅ No more "Failed to connect to master:20001" errors
✅ Accumulo can now communicate with S3 via Alluxio
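To check the first two expectations by hand, the worker hostname can be rebuilt from the `<pod-name>.<service-name>.<namespace>.svc.cluster.local` pattern described in this PR and then resolved. The sketch below is a minimal illustration; `worker_fqdn` and `resolves` are hypothetical helper names, and the `cluster.local` default assumes the cluster uses the standard DNS domain.

```python
import socket


def worker_fqdn(pod_name, service, namespace, cluster_domain="cluster.local"):
    """Build the FQDN a worker is expected to register with.

    Assumes the default cluster domain unless another one is passed in.
    """
    return f"{pod_name}.{service}.{namespace}.svc.{cluster_domain}"


def resolves(hostname):
    """Return True if the resolver can look up hostname, False otherwise."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        # Same failure class that surfaces as UnknownHostException on the JVM side.
        return False
```

For the worker named in the log above, `worker_fqdn("accumulo-alluxio-worker-77zvt", "accumulo-alluxio-worker", "default")` yields the example hostname from the DNS Resolution section. Running `resolves(...)` on that value from inside the master pod's network would indicate whether the "Unable to resolve host" warning should still be expected.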
Fixes #18