Copilot AI commented Oct 19, 2025

Fix Alluxio Configuration Issues ✅ UPDATED (3rd commit)

Fixed critical Alluxio configuration issues preventing worker/master/job-server communication and S3 access for Accumulo.

Latest Fix (3rd commit - c8b4f91)

  • Fixed worker hostname DNS resolution issue
  • Changed worker hostname from POD_IP to FQDN for proper DNS resolution
  • Added publishNotReadyAddresses: true to worker service for immediate DNS registration
  • Started job_master process on master (was only configured but not started)
  • Started job_worker process on workers (was only configured but not started)
  • Added monitoring and auto-restart for job master and job worker processes

Previous Fixes

  • Fixed Invalid property key env.POD_IP error (2nd commit)
  • All original communication and port configuration issues (1st commit)

Issues Fixed ✅

  • Worker daemonset uses hardcoded hostname instead of templated value
  • Worker daemonset missing POD_NAME and POD_NAMESPACE environment variables
  • Worker hostname using static service name instead of POD_IP
  • Job master port (20001) not exposed in master deployment or service
  • Job worker configuration completely missing from worker ConfigMap
  • Worker data port incorrectly set to same value as RPC port (29999)
  • Worker bind host configuration preventing self-communication
  • Invalid ${env.POD_IP} syntax in Alluxio properties file
  • Worker hostname not resolvable via DNS (pod name not in DNS)
  • Job master process not started on master
  • Job worker process not started on workers

Changes Applied ✅

  • Added POD_NAME and POD_NAMESPACE environment variables to worker daemonset
  • Changed worker hostname to use FQDN: $(POD_NAME).<service>.<namespace>.svc.cluster.local
  • Added publishNotReadyAddresses: true to worker service
  • Started job_master process on master with monitoring
  • Started job_worker process on workers with monitoring
  • Changed to use ALLUXIO_WORKER_HOSTNAME and ALLUXIO_JOB_WORKER_HOSTNAME environment variables
  • Removed invalid ${env.POD_IP} syntax from ConfigMap properties
  • Fixed hardcoded master hostname to use {{include "accumulo.fullname" .}} template
  • Added job master RPC port (20001) to master deployment and service
  • Added complete job worker configuration to worker ConfigMap
  • Fixed worker port configuration: RPC=29999, Data=29998 (separate ports)
  • Updated worker bind host to 0.0.0.0 for proper networking
  • Added all job worker ports (30002-30004) to worker deployment and service

Port Assignments

Master: 19998 (RPC), 19999 (Web), 20001 (Job RPC)
Worker: 29999 (RPC), 29998 (Data), 30000 (Web), 30002-30004 (Job Worker)
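
One of the original problems was the worker data port sharing a value with the RPC port. The sketch below is a hypothetical pre-deploy check, not part of this PR: it takes a list of ports and reports any that appear more than once. The function name `ports_are_unique` is an assumption made for illustration.

```shell
# Hypothetical helper: succeeds only if every port argument is distinct.
ports_are_unique() {
  local dupes
  # sort numerically, then keep only values that occur more than once
  dupes=$(printf '%s\n' "$@" | sort -n | uniq -d)
  if [ -n "$dupes" ]; then
    echo "duplicate port(s): $(echo $dupes)" >&2
    return 1
  fi
  return 0
}

# Worker ports listed above: RPC, data, web, and the job worker range.
ports_are_unique 29999 29998 30000 30002 30003 30004 && echo "worker ports OK"
```

Passing the pre-fix values (29999 for both RPC and data) makes the function return non-zero and name the colliding port on stderr.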

DNS Resolution

Workers now use FQDN for hostname registration:

  • Format: <pod-name>.<service-name>.<namespace>.svc.cluster.local
  • Example: accumulo-alluxio-worker-77zvt.accumulo-alluxio-worker.default.svc.cluster.local
  • The headless service with publishNotReadyAddresses: true ensures DNS records exist immediately
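
As a minimal sketch of the naming scheme above, the following shell function assembles the FQDN from its three parts. The function name `worker_fqdn` is hypothetical, and the `cluster.local` suffix is taken from the example; clusters configured with a different DNS domain would need a different suffix.

```shell
# Hypothetical helper: build the worker FQDN from pod name, service, namespace.
worker_fqdn() {
  # $1 = pod name, $2 = headless service name, $3 = namespace
  printf '%s.%s.%s.svc.cluster.local\n' "$1" "$2" "$3"
}

worker_fqdn accumulo-alluxio-worker-77zvt accumulo-alluxio-worker default
# prints accumulo-alluxio-worker-77zvt.accumulo-alluxio-worker.default.svc.cluster.local
```

From inside the master pod, running `getent hosts` on the resulting name is one way to confirm that the record actually resolves; whether it does depends on how the cluster's DNS publishes pod records for the service.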

Process Architecture

Master Pod:

  • AlluxioMaster (main master process on port 19998)
  • AlluxioJobMaster (job coordination on port 20001)

Worker Pod:

  • AlluxioWorker (data worker on ports 29998-30000)
  • AlluxioJobWorker (job worker on ports 30002-30004)

Files Modified

  • charts/accumulo/templates/alluxio-master-deployment.yaml (+8 lines)
  • charts/accumulo/templates/alluxio-master-service.yaml (+4 lines)
  • charts/accumulo/templates/alluxio-worker-daemonset.yaml (+33 lines)
  • charts/accumulo/templates/alluxio-worker-service.yaml (+17 lines)
  • charts/accumulo/templates/configmap.yaml (+14 lines)

Total: 5 files changed, 76 insertions(+), 5 deletions(-)

Expected Impact 🎯

✅ Workers use resolvable FQDN hostnames
✅ Master can connect to workers via DNS
✅ Job master process runs and is accessible
✅ Job workers connect to job master successfully
✅ No more "Unable to resolve host" errors
✅ No more "Failed to connect to master:20001" errors
✅ Accumulo can now communicate with S3 via Alluxio

Original prompt

This section details the original issue you should resolve

<issue_title>Fix alluxio configuration</issue_title>
<issue_description>The Alluxio worker can't communicate with itself or with the master, and that's not allowing us to communicate with S3. Additionally, on the master we're not seeing communication with the job server. We need to make sure they all work so Accumulo can work.</issue_description>

Comments on the Issue (you are @copilot in this section)

Fixes #18



@phrocker

@copilot marc@pop-os:/mnt/ExtraDrive/repos/accumulo$ kubectl logs accumulo-alluxio-worker-hxkxs
Waiting for Alluxio master to be ready...
Waiting for master...
Waiting for master...
Waiting for master...
Waiting for master...
mount: only root can use "--options" option
Assuming NoMount by default.
Exception in thread "main" java.lang.RuntimeException: Invalid property key env.POD_IP
at alluxio.conf.InstancedConfiguration.lookupRecursively(InstancedConfiguration.java:442)
at alluxio.conf.InstancedConfiguration.lookup(InstancedConfiguration.java:413)
at alluxio.conf.InstancedConfiguration.isResolvable(InstancedConfiguration.java:152)
at alluxio.conf.InstancedConfiguration.isSet(InstancedConfiguration.java:162)
at alluxio.conf.AlluxioConfiguration.getOrDefault(AlluxioConfiguration.java:65)
at alluxio.cli.GetConf.getConfImpl(GetConf.java:189)
at alluxio.cli.GetConf.getConf(GetConf.java:146)
at alluxio.cli.GetConf.main(GetConf.java:267)
Exception in thread "main" java.lang.RuntimeException: Invalid property key env.POD_IP
at alluxio.conf.InstancedConfiguration.lookupRecursively(InstancedConfiguration.java:442)
at alluxio.conf.InstancedConfiguration.lookup(InstancedConfiguration.java:413)
at alluxio.conf.InstancedConfiguration.isResolvable(InstancedConfiguration.java:152)
at alluxio.conf.InstancedConfiguration.isSet(InstancedConfiguration.java:162)
at alluxio.conf.AlluxioConfiguration.getOrDefault(AlluxioConfiguration.java:65)
at alluxio.cli.GetConf.getConfImpl(GetConf.java:189)
at alluxio.cli.GetConf.getConf(GetConf.java:146)
at alluxio.cli.GetConf.main(GetConf.java:267)
Exception in thread "main" java.lang.RuntimeException: Invalid property key env.POD_IP
at alluxio.conf.InstancedConfiguration.lookupRecursively(InstancedConfiguration.java:442)
at alluxio.conf.InstancedConfiguration.lookup(InstancedConfiguration.java:413)
at alluxio.conf.InstancedConfiguration.isResolvable(InstancedConfiguration.java:152)
at alluxio.conf.InstancedConfiguration.isSet(InstancedConfiguration.java:162)
at alluxio.conf.AlluxioConfiguration.getOrDefault(AlluxioConfiguration.java:65)
at alluxio.cli.GetConf.getConfImpl(GetConf.java:189)
at alluxio.cli.GetConf.getConf(GetConf.java:146)
at alluxio.cli.GetConf.main(GetConf.java:267)
Starting worker @ accumulo-alluxio-worker-hxkxs. Logging to /opt/alluxio-2.9.4/logs
Exception in thread "main" java.lang.RuntimeException: Invalid property key env.POD_IP
at alluxio.conf.InstancedConfiguration.lookupRecursively(InstancedConfiguration.java:442)
at alluxio.conf.InstancedConfiguration.lookup(InstancedConfiguration.java:413)
at alluxio.conf.InstancedConfiguration.isResolvable(InstancedConfiguration.java:152)
at alluxio.conf.InstancedConfiguration.isSet(InstancedConfiguration.java:162)
at alluxio.util.network.NetworkAddressUtils.getConnectHost(NetworkAddressUtils.java:332)
at alluxio.util.network.NetworkAddressUtils.getConnectAddress(NetworkAddressUtils.java:273)
at alluxio.worker.AlluxioWorkerMonitor.main(AlluxioWorkerMonitor.java:50)
--- [ FAILED ] The worker @ accumulo-alluxio-worker-hxkxs is not serving requests after 120s. Please check if the process is running and the logs/ if necessary.
--- Printing the log tail for /opt/alluxio-2.9.4/logs/worker.log

BEGIN
at alluxio.conf.InstancedConfiguration.isSet(InstancedConfiguration.java:162)
at alluxio.util.network.NetworkAddressUtils.getLocalNodeName(NetworkAddressUtils.java:445)
at alluxio.network.TieredIdentityFactory.create(TieredIdentityFactory.java:97)
at alluxio.network.TieredIdentityFactory.localIdentity(TieredIdentityFactory.java:61)
at alluxio.worker.WorkerProcess$Factory.create(WorkerProcess.java:37)
at alluxio.worker.AlluxioWorker.main(AlluxioWorker.java:70)

2025-10-19 16:39:26,735 INFO main - Logging metrics and jstack on WORKER exit...
2025-10-19 16:39:26,842 ERROR main - Failed to dump metrics and jstacks
java.lang.ExceptionInInitializerError
at alluxio.ProcessUtils.dumpMetrics(ProcessUtils.java:222)
at alluxio.ProcessUtils.dumpInformationOnExit(ProcessUtils.java:166)
at alluxio.ProcessUtils.fatalError(ProcessUtils.java:121)
at alluxio.worker.AlluxioWorker.main(AlluxioWorker.java:72)
Caused by: java.lang.RuntimeException: Invalid property key env.POD_IP
at alluxio.conf.InstancedConfiguration.lookupRecursively(InstancedConfiguration.java:442)
at alluxio.conf.InstancedConfiguration.lookup(InstancedConfiguration.java:413)
at alluxio.conf.InstancedConfiguration.isResolvable(InstancedConfiguration.java:152)
at alluxio.conf.InstancedConfiguration.isSet(InstancedConfiguration.java:162)
at alluxio.metrics.MetricsSystem.constructSourceName(MetricsSystem.java:254)
at alluxio.metrics.MetricsSystem.lambda$static$0(MetricsSystem.java:95)
at alluxio.util.CommonUtils$2.firstTime(CommonUtils.java:808)
at alluxio.util.CommonUtils$2.get(CommonUtils.java:803)
at alluxio.metrics.MetricsSystem.getMetricNameWithUniqueId(MetricsSystem.java:488)
at alluxio.metrics.MetricsSystem.lambda$getWorkerMetricName$3(MetricsSystem.java:400)
at java.util.concurrent.ConcurrentHashMap.computeIfAbsent(ConcurrentHashMap.java:1660)
at alluxio.metrics.MetricsSystem.getWorkerMetricName(MetricsSystem.java:399)
at alluxio.metrics.MetricsSystem.getMetricName(MetricsSystem.java:359)
at alluxio.metrics.MetricsSystem.<clinit>(MetricsSystem.java:165)
... 4 more
<<< EOF
--- Printing the log tail for /opt/alluxio-2.9.4/logs/worker.out
--- EMPTY ---

marc@pop-os:/mnt/ExtraDrive/repos/accumulo$ kubectl logs accumulo-alluxio-master-6457b4cc59-5sg9x
Formatting Alluxio Master @ accumulo-alluxio-master-6457b4cc59-5sg9x
2025-10-19 16:38:59,317 INFO main - Formatting master journal: /opt/alluxio/journal/
2025-10-19 16:38:59,487 INFO main - Loading core jars from /opt/alluxio-2.9.4/lib
2025-10-19 16:38:59,567 INFO main - Loading extension jars from /opt/alluxio-2.9.4/extensions
2025-10-19 16:38:59,601 INFO main - Formatting /opt/alluxio/journal/BlockMaster/v1
2025-10-19 16:38:59,618 INFO main - Formatting /opt/alluxio/journal/JournalMaster/v1
2025-10-19 16:38:59,620 INFO main - Formatting /opt/alluxio/journal/FileSystemMaster/v1
2025-10-19 16:38:59,621 INFO main - Formatting /opt/alluxio/journal/ThrottleMaster/v1
2025-10-19 16:38:59,622 INFO main - Formatting /opt/alluxio/journal/MetaMaster/v1
2025-10-19 16:38:59,624 INFO main - Formatting /opt/alluxio/journal/MetricsMaster/v1
2025-10-19 16:38:59,625 INFO main - Formatting complete
Starting master @ accumulo-alluxio-master-6457b4cc59-5sg9x. Logging to /opt/alluxio-2.9.4/logs
--- [ OK ] The master service @ accumulo-alluxio-master-6457b4cc59-5sg9x is in a healthy state.

Copilot AI changed the title from "[WIP] Fix Alluxio worker communication issues" to "Fix Alluxio worker/master/job-server communication to enable S3 access" Oct 19, 2025
Copilot AI requested a review from phrocker October 19, 2025 16:41

Copilot AI commented Oct 19, 2025

@copilot marc@pop-os:/mnt/ExtraDrive/repos/accumulo$ kubectl logs accumulo-alluxio-worker-hxkxs
Waiting for Alluxio master to be ready...
Waiting for master...
...

Fixed in commit 0c8daa5. The error occurred because Alluxio doesn't support ${env.POD_IP} syntax in properties files. Changed to use Alluxio's native environment variable support with ALLUXIO_WORKER_HOSTNAME and ALLUXIO_JOB_WORKER_HOSTNAME set to $(POD_IP) in the worker DaemonSet.
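
To catch this class of error before deploying, a properties file can be scanned for `${env.` placeholders. The sketch below is a hypothetical lint step, not something this PR adds; the function name `check_no_env_placeholders` and the sample property line are assumptions for illustration only.

```shell
# Hypothetical lint: print offending lines (with line numbers) and return 1
# if the given properties file contains a ${env.NAME} placeholder.
check_no_env_placeholders() {
  # -F treats the pattern as a fixed string, so '$' and '{' are literal
  if grep -n -F '${env.' "$1"; then
    return 1
  fi
  return 0
}

# Demonstration with a throwaway file holding an illustrative bad line.
props=$(mktemp)
printf 'alluxio.worker.hostname=${env.POD_IP}\n' > "$props"
check_no_env_placeholders "$props" || echo "found unsupported placeholder"
rm -f "$props"
```

The check is deliberately crude: it also flags placeholders inside comment lines, which seems acceptable for a lint whose only job is to make a reviewer look.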

@phrocker

@copilot worker not failing, but seeing this in master.out:

WARNING: [Channel<32>: (accumulo-alluxio-worker-77zvt:29999)] Failed to resolve name. status=Status{code=UNAVAILABLE, description=Unable to resolve host accumulo-alluxio-worker-77zvt, cause=java.lang.RuntimeException: java.net.UnknownHostException: accumulo-alluxio-worker-77zvt
at io.grpc.internal.DnsNameResolver.resolveAddresses(DnsNameResolver.java:223)
at io.grpc.internal.DnsNameResolver.doResolve(DnsNameResolver.java:282)
at io.grpc.internal.DnsNameResolver$Resolve.run(DnsNameResolver.java:318)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.net.UnknownHostException: accumulo-alluxio-worker-77zvt
at java.net.InetAddress$CachedAddresses.get(InetAddress.java:764)
at java.net.InetAddress.getAllByName0(InetAddress.java:1291)
at java.net.InetAddress.getAllByName(InetAddress.java:1144)
at java.net.InetAddress.getAllByName(InetAddress.java:1065)
at io.grpc.internal.DnsNameResolver$JdkAddressResolver.resolveAddress(DnsNameResolver.java:632)
at io.grpc.internal.DnsNameResolver.resolveAddresses(DnsNameResolver.java:219)
... 5 more
}
[alluxio@accumulo-alluxio-master-6457b4cc59-g659j alluxio-2.9.4]$ ping accumulo-alluxio-worker
PING accumulo-alluxio-worker.default.svc.cluster.local (10.244.13.205) 56(84) bytes of data.
64 bytes from 10-244-13-205.accumulo-alluxio-worker.default.svc.cluster.local (10.244.13.205): icmp_seq=1 ttl=64 time=0.059 ms
64 bytes from 10-244-13-205.accumulo-alluxio-worker.default.svc.cluster.local (10.244.13.205): icmp_seq=2 ttl=64 time=0.092 ms
64 bytes from 10-244-13-205.accumulo-alluxio-worker.default.svc.cluster.local (10.244.13.205): icmp_seq=3 ttl=64 time=0.083 ms
^C
--- accumulo-alluxio-worker.default.svc.cluster.local ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2074ms
rtt min/avg/max/mdev = 0.059/0.078/0.092/0.013 ms
[alluxio@accumulo-alluxio-master-6457b4cc59-g659j alluxio-2.9.4]$ ping accumulo-alluxio-worker-77zvt
ping: accumulo-alluxio-worker-77zvt: Name or service not known
[alluxio@accumulo-alluxio-master-6457b4cc59-g659j alluxio-2.9.4]$

and master.log:
loadMetadataOnly: false
excludeMountInfo: false
}
, Error=alluxio.exception.FileDoesNotExistException: Path "/accumulo/instance_id" does not exist.
2025-10-19 16:47:37,387 WARN Master Persistence Scheduler - Run(jobConfig=PersistConfig{filePath=/LICENSE, mountId=1, overwrite=false, ufsPath=s3://accumulo-data/LICENSE}) exits with exception [alluxio.exception.status.UnavailableException: Failed to connect to master (GrpcServerAddress{HostName=accumulo-alluxio-master, SocketAddress=accumulo-alluxio-master:20001}) after 6 attempts.Please check if Alluxio master is currently running on "GrpcServerAddress{HostName=accumulo-alluxio-master, SocketAddress=accumulo-alluxio-master:20001}". Service="JobMasterClient"] in 10007 ms (>=10000ms)
2025-10-19 16:47:37,388 WARN Master Persistence Scheduler - Unexpected exception encountered when scheduling the persist job for file /LICENSE (id=33554431) : alluxio.exception.status.UnavailableException: Failed to connect to master (GrpcServerAddress{HostName=accumulo-alluxio-master, SocketAddress=accumulo-alluxio-master:20001}) after 6 attempts.Please check if Alluxio master is currently running on "GrpcServerAddress{HostName=accumulo-alluxio-master, SocketAddress=accumulo-alluxio-master:20001}". Service="JobMasterClient"
2025-10-19 16:47:49,458 WARN master-rpc-executor-TPE-thread-143 - Exit (Error): ListStatus: request=path: "/accumulo/instance_id"
options {
loadMetadataType: ONCE


Copilot AI commented Oct 19, 2025

@copilot worker not failing, but seeing this in master.out:

WARNING: [Channel<32>: (accumulo-alluxio-worker-77zvt:29999)] Failed to resolve name. status=Status{code=UNAVAILABLE, description=Unable...

Fixed in commit c8b4f91. Three issues resolved:

  1. DNS resolution: Changed worker hostname from POD_IP to FQDN $(POD_NAME).<service>.<namespace>.svc.cluster.local and added publishNotReadyAddresses: true to worker service
  2. Job master: Added startup of job_master process on master (was configured but never started)
  3. Job worker: Added startup of job_worker process on workers (was configured but never started)

Workers now register with resolvable FQDNs, and both job master and job worker processes run with auto-restart monitoring.
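
The auto-restart monitoring can be pictured as a small supervision loop. The following is a generic sketch under stated assumptions, not the script from the commit: `run_with_restart` is a hypothetical name, and the restart limit and delay are illustrative parameters.

```shell
# Hypothetical supervisor: run a command and restart it when it fails.
# $1 = maximum number of restarts, $2 = seconds to wait between restarts,
# remaining arguments = the command to run in the foreground.
run_with_restart() {
  local max_restarts="$1" delay="$2" attempts=0
  shift 2
  while :; do
    # A clean exit (status 0) ends supervision.
    "$@" && return 0
    attempts=$((attempts + 1))
    if [ "$attempts" -gt "$max_restarts" ]; then
      echo "giving up after $max_restarts restarts: $*" >&2
      return 1
    fi
    echo "process exited; restart $attempts of $max_restarts in ${delay}s" >&2
    sleep "$delay"
  done
}

# Demonstration with a command that exits cleanly on the first run.
run_with_restart 3 0 true && echo "exited cleanly"
```

In a pod entrypoint the supervised command would be whatever launches the job master or job worker in the foreground. Capping the restarts lets the container eventually fail so Kubernetes can take over, instead of looping forever inside a pod that looks healthy.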

@phrocker

@copilot can write to s3 but instance id isn't being initialized properly and

2025-10-19T17:03:08,693 [manager.Manager] INFO : Version 4.0.0-SNAPSHOT
2025-10-19T17:03:08,693 [manager.Manager] INFO : Instance instance_id
Thread 'manager' died.
java.lang.IllegalStateException: Instance may not have been initialized, provided root node path does not exist in ZooKeeper
at org.apache.accumulo.server.conf.store.impl.ZooPropStore.<init>(ZooPropStore.java:92)
at org.apache.accumulo.server.conf.store.impl.ZooPropStore.initialize(ZooPropStore.java:108)
at org.apache.accumulo.server.ServerContext.lambda$new$0(ServerContext.java:138)
at com.google.common.base.Suppliers$NonSerializableMemoizingSupplier.get(Suppliers.java:198)
at org.apache.accumulo.server.ServerContext.getPropStore(ServerContext.java:463)
at org.apache.accumulo.server.conf.ZooBasedConfiguration.<init>(ZooBasedConfiguration.java:63)
at org.apache.accumulo.server.conf.SystemConfiguration.<init>(SystemConfiguration.java:36)
at org.apache.accumulo.server.conf.ServerConfigurationFactory.lambda$new$0(ServerConfigurationFactory.java:86)
at com.google.common.base.Suppliers$NonSerializableMemoizingSupplier.get(Suppliers.java:198)
at org.apache.accumulo.server.conf.ServerConfigurationFactory.getSystemConfiguration(ServerConfigurationFactory.java:125)
at org.apache.accumulo.server.conf.ServerConfigurationFactory.lambda$new$1(ServerConfigurationFactory.java:92)
at com.google.common.base.Suppliers$NonSerializableMemoizingSupplier.get(Suppliers.java:198)
at org.apache.accumulo.server.conf.ServerConfigurationFactory.getResourceGroupConfiguration(ServerConfigurationFactory.java:120)
at org.apache.accumulo.server.ServerContext.getConfiguration(ServerContext.java:194)
at org.apache.accumulo.server.ServerContext.init(ServerContext.java:386)
at org.apache.accumulo.server.AbstractServer.<init>(AbstractServer.java:142)
at org.apache.accumulo.manager.Manager.<init>(Manager.java:492)
at org.apache.accumulo.manager.Manager.main(Manager.java:486)
at org.apache.accumulo.manager.ManagerExecutable.execute(ManagerExecutable.java:45)
at org.apache.accumulo.start.Main.lambda$execKeyword$0(Main.java:81)
at java.base/java.lang.Thread.run(Unknown Source)
2025-10-19T17:03:08,704 [start.Main] ERROR: Thread 'manager' died.
java.lang.IllegalStateException: Instance may not have been initialized, provided root node path does not exist in ZooKeeper
at org.apache.accumulo.server.conf.store.impl.ZooPropStore.<init>(ZooPropStore.java:92)
at org.apache.accumulo.server.conf.store.impl.ZooPropStore.initialize(ZooPropStore.java:108)
at org.apache.accumulo.server.ServerContext.lambda$new$0(ServerContext.java:138)
at com.google.common.base.Suppliers$NonSerializableMemoizingSupplier.get(Suppliers.java:198)
at org.apache.accumulo.server.ServerContext.getPropStore(ServerContext.java:463)
at org.apache.accumulo.server.conf.ZooBasedConfiguration.<init>(ZooBasedConfiguration.java:63)
at org.apache.accumulo.server.conf.SystemConfiguration.<init>(SystemConfiguration.java:36)
at org.apache.accumulo.server.conf.ServerConfigurationFactory.lambda$new$0(ServerConfigurationFactory.java:86)
at com.google.common.base.Suppliers$NonSerializableMemoizingSupplier.get(Suppliers.java:198)
at org.apache.accumulo.server.conf.ServerConfigurationFactory.getSystemConfiguration(ServerConfigurationFactory.java:125)
at org.apache.accumulo.server.conf.ServerConfigurationFactory.lambda$new$1(ServerConfigurationFactory.java:92)
at com.google.common.base.Suppliers$NonSerializableMemoizingSupplier.get(Suppliers.java:198)
at org.apache.accumulo.server.conf.ServerConfigurationFactory.getResourceGroupConfiguration(ServerConfigurationFactory.java:120)
at org.apache.accumulo.server.ServerContext.getConfiguration(ServerContext.java:194)
at org.apache.accumulo.server.ServerContext.init(ServerContext.java:386)
at org.apache.accumulo.server.AbstractServer.<init>(AbstractServer.java:142)
at org.apache.accumulo.manager.Manager.<init>(Manager.java:492)
at org.apache.accumulo.manager.Manager.main(Manager.java:486)
at org.apache.accumulo.manager.ManagerExecutable.execute(ManagerExecutable.java:45)
at org.apache.accumulo.start.Main.lambda$execKeyword$0(Main.java:81)
at java.base/java.lang.Thread.run(Unknown Source)


Copilot AI commented Oct 19, 2025

@copilot can write to s3 but instance id isn't being initialized properly and

2025-10-19T17:03:08,693 [manager.Manager] INFO : Version 4.0.0-SNAPSHOT
...

Good news that S3 writes are working! The Alluxio communication fixes are complete.

The ZooKeeper initialization error is a separate Accumulo issue unrelated to Alluxio configuration. This indicates Accumulo's instance hasn't been initialized in ZooKeeper yet. You'll need to run Accumulo's initialization command first:

accumulo init

Or if using the Helm chart, ensure the init job runs successfully before starting the manager. This is outside the scope of the Alluxio configuration fixes.

@phrocker marked this pull request as ready for review October 19, 2025 19:17
@phrocker merged commit 3ab2c90 into main Oct 19, 2025
1 check passed
@phrocker deleted the copilot/fix-alluxio-configuration branch October 19, 2025 19:17