This document outlines security best practices for deploying GSMLG EPMD, particularly when using the TLS auto-mesh mode with certificate-based trust groups.
- Security Model Overview
- Certificate Management
- CA Security
- Private Key Protection
- Group Isolation
- TLS Configuration
- Network Security
- Operational Security
- Common Pitfalls
- Security Checklist
GSMLG EPMD's TLS mode implements a certificate-based trust system with the following security layers:
- CA-Based Authentication: Only nodes with certificates signed by a trusted CA can connect
- Group Membership: Certificate OU (Organizational Unit) field defines trust groups
- Group Isolation: Different OU values = no connection, even with same CA
- Dynamic Cookies: 256-bit random cookies exchanged over TLS (no pre-shared secrets)
- Mutual TLS: Both client and server validate each other's certificates
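To give a feel for what a 256-bit cookie looks like, the sketch below reads 32 random bytes and prints them as 64 hexadecimal characters. It is only a size illustration: the function name `random_cookie_hex` is hypothetical, and this is not claimed to be how GSMLG EPMD generates or encodes its cookies.

```shell
# Illustrative only: produce 32 random bytes (256 bits) as 64 hex characters.
# This is NOT GSMLG EPMD's own cookie generator, just a size demonstration.
random_cookie_hex() {
  head -c 32 /dev/urandom | od -An -v -tx1 | tr -d ' \n'
}

cookie=$(random_cookie_hex)
echo "cookie length: ${#cookie} hex chars"
```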
What GSMLG EPMD TLS mode protects against:
- Unauthorized nodes joining the cluster (no valid certificate)
- Cross-group connections (different OU values)
- Man-in-the-middle attacks (mutual TLS authentication)
- Cookie compromise (dynamic generation, secure exchange)
- EPMD port mapper vulnerabilities (no EPMD daemon)
What it does NOT protect against:
- Compromised CA private key (attacker can issue valid certificates)
- Compromised node private keys (attacker can impersonate that node)
- Physical access to certificate files
- Insider threats with valid certificates
- Application-level vulnerabilities in Erlang code
DO:
- Use strong key sizes: 2048-bit RSA minimum, 4096-bit recommended, or ECDSA P-256+
- Set appropriate certificate validity periods (1 year recommended, max 2 years)
- Use unique CN (Common Name) for each node
- Set OU field to match trust group name exactly
- Include SAN (Subject Alternative Name) if using hostnames
DON'T:
- Reuse certificates across nodes
- Use overly long validity periods (>2 years)
- Use weak key sizes (<2048-bit RSA)
- Share private keys between nodes
Example: Secure certificate generation
# Use the provided script with strong defaults
./tools/generate_certs.sh production node1
# Verify certificate details
openssl x509 -in certs/production/node1/cert.pem -noout -text
File Permissions:
# CA private key: Most sensitive, never copy to nodes
chmod 400 certs/ca/ca-key.pem
chown root:root certs/ca/ca-key.pem
# Node private keys: Read-only by the Erlang process owner
chmod 400 certs/production/node1/key.pem
chown erlang:erlang certs/production/node1/key.pem
# Certificates and CA cert: Readable
chmod 444 certs/production/node1/cert.pem
chmod 444 certs/production/node1/ca-cert.pem
Storage Locations:
- Production: Use secrets management (Kubernetes Secrets, HashiCorp Vault, AWS Secrets Manager)
- Development: Local filesystem with proper permissions
- Never: Environment variables (too easy to leak in logs), version control, shared filesystems
Secure distribution methods:
- Configuration management tools (Ansible, Chef, Puppet) with encrypted vaults
- Secrets managers (Kubernetes Secrets with encryption at rest)
- Manual secure copy (scp with key-based auth)
Avoid:
- Copying via unencrypted channels
- Storing in Docker images (use volumes/secrets instead)
- Committing to git repositories
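One way to back up the "never commit keys" rule is a small check that looks for PEM private-key headers before files leave the machine. The sketch below is an assumption-laden example, not part of GSMLG EPMD: the function names `has_private_key` and `check_files` are hypothetical, and the check only recognizes PEM-encoded keys.

```shell
# Return 0 if the file contains a PEM private-key header, 1 otherwise.
has_private_key() {
  grep -q -e '-----BEGIN [A-Z ]*PRIVATE KEY-----' -- "$1"
}

# Hypothetical pre-commit helper: fail if any given file holds key material.
check_files() {
  status=0
  for f in "$@"; do
    if [ -f "$f" ] && has_private_key "$f"; then
      echo "refusing to commit private key: $f" >&2
      status=1
    fi
  done
  return "$status"
}
```

In a git pre-commit hook, `check_files` could be called with the list of staged files (for example from `git diff --cached --name-only`), bearing in mind that file names containing spaces need extra care.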
When to rotate:
- Before expiration (30-60 days in advance)
- After suspected compromise
- When employee/admin with access leaves
- Periodically (every 12-24 months)
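The "rotate before expiration" rule can be turned into a simple check on the `notAfter` date. The following is a minimal sketch that assumes GNU `date` (for the `-d` option); the function names `days_until_expiry` and `needs_rotation` and the 60-day default are illustrative choices, not part of GSMLG EPMD.

```shell
# Convert an OpenSSL "notAfter=..." line to whole days remaining.
# Assumes GNU date (the -d option); $2 is the current time as epoch seconds.
days_until_expiry() {
  end_date=${1#notAfter=}
  end_epoch=$(date -u -d "$end_date" +%s) || return 2
  echo $(( (end_epoch - $2) / 86400 ))
}

# Return 0 (true) when the certificate is inside the rotation window.
# $3 is the window in days and defaults to 60.
needs_rotation() {
  days=$(days_until_expiry "$1" "$2") || return 2
  [ "$days" -le "${3:-60}" ]
}
```

A typical call would pass the output of `openssl x509 -in cert.pem -noout -enddate` as the first argument and `$(date +%s)` as the second.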
Rotation process:
# 1. Generate new certificates with same OU
./tools/generate_certs.sh production node1-new
# 2. Deploy new certificates to nodes (blue-green or rolling)
# 3. Restart nodes with new certificates
# 4. Verify connections still work
# 5. Revoke old certificates (if using CRL/OCSP)
# 6. Delete old private keys securely
shred -vfz -n 10 old-key.pem
Current limitation: GSMLG EPMD does not currently support CRL (Certificate Revocation Lists) or OCSP (Online Certificate Status Protocol).
Workaround:
- Rotate CA certificate when compromise suspected
- Remove compromised node certificates from nodes
- Use short certificate validity periods (6-12 months)
Future enhancement: Add CRL/OCSP support (see PROJECT_STATUS.md)
The CA private key is the most critical secret. Compromise = attacker can issue valid certificates for any group.
Best practice:
- Generate CA on an air-gapped machine
- Store CA private key on encrypted USB drive
- Keep in physical safe
- Only connect to sign new certificates
- Never copy to production systems
# Generate CA offline
openssl genrsa -aes256 -out ca-key.pem 4096 # Password-protected
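# Sign certificates offline, then copy only the signed cert
# (-CAcreateserial creates the serial file on first use; -days sets a 1-year validity)
openssl x509 -req -in node.csr -CA ca-cert.pem -CAkey ca-key.pem -CAcreateserial -days 365 -out node-cert.pem
If you must use an online CA: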
- Store CA key in hardware security module (HSM) or key management service (KMS)
- Use strong encryption at rest
- Restrict access via IAM policies
- Enable audit logging for all CA operations
- Use separate CAs for dev/staging/production
Support for intermediate CAs:
Root CA (offline, long-lived)
  └─> Intermediate CA (online, shorter-lived)
       └─> Node certificates
Benefits:
- Root CA can be kept completely offline
- Intermediate CA compromise doesn't require re-trusting root
- Easier rotation (just rotate intermediate)
Configuration:
# Concatenate chain for cacertfile
cat intermediate-ca.pem root-ca.pem > ca-chain.pem
export GSMLG_EPMD_TLS_CACERTFILE=/path/to/ca-chain.pem
Each node has its own private key. Compromise = attacker can impersonate that specific node.
DO:
- Store on encrypted filesystems
- Use restrictive file permissions (chmod 400)
- Use secrets management in production
- Encrypt in transit when distributing
DON'T:
- Store in Docker images
- Commit to version control
- Share between nodes
- Store in environment variables
- Log private key paths/contents
Strong randomness:
# Good: Use OpenSSL with proper entropy
openssl genrsa -out key.pem 4096
# Better: Use hardware RNG if available
openssl genrsa -rand /dev/hwrng -out key.pem 4096
For extra security, encrypt private keys with a passphrase:
# Generate encrypted key
openssl genrsa -aes256 -out key-encrypted.pem 4096
# Erlang needs decrypted key at runtime (use encrypted volume or HSM instead)
Note: Erlang's SSL library requires access to decrypted keys at runtime, so encryption at rest is most effective with full-disk encryption or encrypted volumes.
The OU (Organizational Unit) field is critical for group isolation.
Correct:
# OU set to match trust group
./tools/generate_certs.sh production node1
# Result: OU=production
./tools/generate_certs.sh staging node2
# Result: OU=staging
Verify:
openssl x509 -in cert.pem -noout -subject
# Should show: OU=production
Threat: Attacker with CA access issues certificate with wrong OU
Mitigation:
- Strict CA access controls
- Audit all certificate issuance
- Use OU naming conventions (e.g., prod-us-east, not just production)
- Validate OU matches expected value on node startup
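The startup validation in the last mitigation could be scripted roughly as follows. This is a sketch under stated assumptions: it parses the subject line printed by `openssl x509 -noout -subject` as text, the function names `extract_ou` and `ou_matches_group` are hypothetical, and it does not handle subjects with more than one OU or OU values that contain commas or slashes.

```shell
# Pull the OU value out of a subject line such as the one printed by
# `openssl x509 -noout -subject`. Handles "OU = x, ..." and "/OU=x/..." styles.
extract_ou() {
  printf '%s\n' "$1" | sed -n 's|.*OU *= *\([^,/]*\).*|\1|p'
}

# Return 0 only when the certificate's OU equals the expected trust group.
ou_matches_group() {
  [ "$(extract_ou "$1")" = "$2" ]
}
```

For example, `ou_matches_group "$(openssl x509 -in cert.pem -noout -subject)" production || exit 1` placed in a start script would stop a node whose certificate belongs to a different group.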
Validation example:
% In sys.config
{gsmlg_epmd, [
{group, "production"}, % Explicit group check
...
]}.
Design guarantee: Nodes with different OU fields cannot connect, even if:
- Signed by same CA
- On same network
- Discovered via mDNS
- Have correct private keys
Verification:
% On production node
nodes().
% Should only show other production nodes, never staging
Recommended configuration in ssl_dist.config:
[
  {server, [
    {certfile, "/path/to/cert.pem"},
    {keyfile, "/path/to/key.pem"},
    {cacertfile, "/path/to/ca.pem"},
    {verify, verify_peer},
    {fail_if_no_peer_cert, true},
    % Strong cipher suites (TLS 1.2+)
    {versions, ['tlsv1.2', 'tlsv1.3']},
    {ciphers, [
      "TLS_AES_256_GCM_SHA384",          % TLS 1.3
      "TLS_AES_128_GCM_SHA256",          % TLS 1.3
      "ECDHE-RSA-AES256-GCM-SHA384",     % TLS 1.2
      "ECDHE-ECDSA-AES256-GCM-SHA384",   % TLS 1.2
      "ECDHE-RSA-AES128-GCM-SHA256",     % TLS 1.2
      "ECDHE-ECDSA-AES128-GCM-SHA256"    % TLS 1.2
    ]},
    % Additional security
    {honor_cipher_order, true},
    {secure_renegotiate, true}
  ]},
  {client, [
    % Same options as server
    {certfile, "/path/to/cert.pem"},
    {keyfile, "/path/to/key.pem"},
    {cacertfile, "/path/to/ca.pem"},
    {verify, verify_peer},
    {versions, ['tlsv1.2', 'tlsv1.3']},
    {server_name_indication, disable}  % Not needed for IP-based connections
  ]}
].
Key security options:
- verify_peer: Always verify certificate (never verify_none)
- fail_if_no_peer_cert: Reject connections without client cert
- versions: TLS 1.2 minimum, TLS 1.3 preferred
- honor_cipher_order: Prefer server's cipher order
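A quick text-level lint can catch the most damaging mistakes in ssl_dist.config before a node starts. The sketch below is an assumption, not a GSMLG EPMD tool: `lint_ssl_dist_config` is a hypothetical name, it strips `%` comments with `sed` and then greps for known-bad settings, and it does not parse Erlang terms, so it cannot tell which section an option belongs to.

```shell
# Return non-zero if the config (comments stripped) disables peer
# verification or enables TLS 1.0/1.1. A text-level check, not an Erlang parser.
lint_ssl_dist_config() {
  cfg=$(sed 's/%.*//' "$1") || return 2
  rc=0
  if printf '%s\n' "$cfg" | grep -q 'verify_none'; then
    echo "verify_none found in $1" >&2; rc=1
  fi
  if printf '%s\n' "$cfg" | grep -Eq "'tlsv1(\.1)?'"; then
    echo "deprecated TLS version found in $1" >&2; rc=1
  fi
  if ! printf '%s\n' "$cfg" | grep -q 'verify_peer'; then
    echo "verify_peer missing from $1" >&2; rc=1
  fi
  return "$rc"
}
```

Running `lint_ssl_dist_config /path/to/ssl_dist.config` in a deployment pipeline gives an early failure; the Erlang runtime remains the final authority on whether the file is valid.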
VM args for secure distribution:
-proto_dist inet_tls
-ssl_dist_optfile /path/to/ssl_dist.config
Security considerations:
- Distribution uses separate TLS connection from EPMD
- Cookie still used (but exchanged securely via gsmlg_epmd_cookie)
- All inter-node RPC encrypted via TLS
mDNS (_epmd._tcp.local) uses multicast and is unauthenticated.
Attack: Malicious node advertises fake service, attempts connection
Mitigations:
- Certificate validation (primary defense): Rogue node rejected during TLS handshake if no valid cert
- Network segmentation: Use VLANs/network policies to restrict multicast
- Group filtering: Only nodes with matching OU connect
Network security layers:
Layer 1: Network segmentation (mDNS limited to trusted VLAN)
Layer 2: TLS handshake (certificate validation)
Layer 3: OU verification (group membership check)
Layer 4: Cookie exchange (secure authentication)
Container environments (Docker, Kubernetes):
- Use bridge networks, not host networking
- Enable network policies to restrict multicast
- Consider service mesh (Istio, Linkerd) for additional security
Cloud environments:
- Use private VPCs/VNets
- Restrict security groups to cluster nodes only
- Enable VPC flow logs for auditing
Physical networks:
- Segment production/staging/dev networks
- Use 802.1X for node authentication
- Monitor multicast traffic for anomalies
Minimum required ports:
| Port | Protocol | Purpose | Access |
|---|---|---|---|
| 4369 | TCP | GSMLG EPMD TLS server | Inter-node only |
| 8001+ | TCP | Erlang distribution | Inter-node only |
| 5353 | UDP | mDNS | Multicast (cluster network) |
Example iptables rules:
# Allow mDNS from cluster network only
iptables -A INPUT -p udp --dport 5353 -s 10.0.0.0/24 -j ACCEPT
iptables -A INPUT -p udp --dport 5353 -j DROP
# Allow EPMD TLS from cluster nodes only
iptables -A INPUT -p tcp --dport 4369 -s 10.0.0.0/24 -j ACCEPT
iptables -A INPUT -p tcp --dport 4369 -j DROP
# Allow Erlang distribution (adjust range as needed)
iptables -A INPUT -p tcp --dport 8001:8999 -s 10.0.0.0/24 -j ACCEPT
iptables -A INPUT -p tcp --dport 8001:8999 -j DROP
Enable Erlang logging:
% In sys.config
[{kernel, [
  {logger_level, info},
  {logger, [
    {handler, default, logger_std_h, #{
      formatter => {logger_formatter, #{
        single_line => false,
        template => [time, " ", level, " ", msg, "\n"]
      }}
    }}
  ]}
]}].
Monitor for:
- TLS handshake failures (potential attacks or misconfigurations)
- Group mismatch errors (cross-group connection attempts)
- Certificate expiration warnings
- Unexpected node discoveries
- Failed cookie exchanges
Log retention:
- Keep logs for 90+ days for security audits
- Centralize logs (ELK, Splunk, CloudWatch)
- Alert on security events (failed TLS, group mismatches)
Regular audits:
- Review certificate expiration dates monthly
- Audit CA operations (who issued certificates, when)
- Review node connection patterns (unexpected connections?)
- Check file permissions on private keys
- Verify firewall rules haven't been weakened
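The private-key permission item in the audit list above lends itself to automation. The following sketch assumes GNU `find` (for the `-perm /mode` syntax) and that keys are named `key.pem` or end in `-key.pem`, as in the paths used in this document; `find_loose_keys` and `audit_key_permissions` are hypothetical names.

```shell
# List private keys under a directory that are accessible to group or others.
# Assumes GNU find ("-perm /mode") and keys named key.pem or *-key.pem.
find_loose_keys() {
  find "$1" -type f \( -name 'key.pem' -o -name '*-key.pem' \) -perm /077
}

# Return 0 when no loosely-permissioned key is found, 1 otherwise.
audit_key_permissions() {
  loose=$(find_loose_keys "$1")
  if [ -n "$loose" ]; then
    printf '%s\n' "$loose" | sed 's/^/overly permissive key: /' >&2
    return 1
  fi
}
```

Scheduling `audit_key_permissions /path/to/certs` from cron or a CI job and alerting on a non-zero exit status covers this item without manual review.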
Automated checks:
# Check certificate expiration
openssl x509 -in cert.pem -noout -enddate
# Verify file permissions
find /path/to/certs -name "*.pem" -exec ls -la {} \;
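# Check for certificates expiring soon (30 days)
shopt -s globstar  # bash only: lets ** match nested directories
for cert in certs/**/cert.pem; do
  # -checkend exits non-zero if the cert expires within the given seconds
  if ! openssl x509 -in "$cert" -noout -checkend $((86400 * 30)) >/dev/null; then
    expiry=$(openssl x509 -in "$cert" -noout -enddate | cut -d= -f2)
    echo "$cert expires soon: $expiry"
  fi
done
If private key compromised: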
- Immediately revoke certificate (if CRL/OCSP supported)
- Generate new certificate for affected node
- Deploy new certificate
- Restart affected node
- Monitor logs for unauthorized connections
- Review how compromise occurred
- Update security procedures
If CA key compromised:
- Critical incident - entire trust system compromised
- Generate new CA immediately
- Generate new certificates for ALL nodes
- Deploy new CA + certificates to all nodes
- Restart all nodes
- Audit all recent certificate issuances
- Investigate root cause
Preparation:
- Document incident response procedures
- Practice certificate rotation drills
- Maintain offline backups of CA (separate from compromised system)
- Have emergency contact list for security team
WRONG:
# Same certificate on multiple nodes
scp node1/cert.pem node2:/etc/certs/
scp node1/cert.pem node3:/etc/certs/
Why it's bad:
- Compromise of one node = all nodes compromised
- Cannot revoke individual nodes
- Violates certificate uniqueness
CORRECT:
# Unique certificate per node
./tools/generate_certs.sh production node1
./tools/generate_certs.sh production node2
./tools/generate_certs.sh production node3
WRONG:
chmod 644 key.pem # World-readable private key!
CORRECT:
chmod 400 key.pem
chown erlang:erlang key.pem
WRONG:
git add certs/
git commit -m "Add certificates"
git push
Why it's bad:
- Secrets exposed in git history forever
- Hard to revoke (need to change all certs)
- Visible to anyone with repo access
CORRECT:
# .gitignore
certs/
*.pem
*.key
# Generate certs outside repo or in CI/CD
WRONG:
# 10-year certificate
openssl x509 ... -days 3650
Why it's bad:
- Longer exposure window if compromised
- Harder to rotate (infrequent process)
- Industry standards moving to 1-year max
CORRECT:
# 1-year certificate (default in generate_certs.sh)
openssl x509 ... -days 365
WRONG:
- No monitoring of expiration dates
- Certificates expire, cluster breaks
CORRECT:
# Automated expiration monitoring
# -checkend exits non-zero if the cert expires within the given seconds
if ! openssl x509 -in cert.pem -noout -checkend $((86400 * 30)) >/dev/null; then
  echo "Certificate expires in <30 days!"
  # Alert ops team
fi
WRONG:
{ciphers, ["RC4-SHA", "DES-CBC3-SHA"]},  % Weak/broken ciphers
{versions, ['tlsv1', 'tlsv1.1']}         % Deprecated TLS versions
CORRECT:
{ciphers, ["ECDHE-RSA-AES256-GCM-SHA384", "TLS_AES_256_GCM_SHA384"]},
{versions, ['tlsv1.2', 'tlsv1.3']}
WRONG:
{verify, verify_none} % NEVER DO THIS
Why it's bad:
- Completely bypasses TLS security
- Allows any node to connect
- Defeats entire trust system
CORRECT:
{verify, verify_peer},
{fail_if_no_peer_cert, true}
WRONG:
# Copying CA private key to production nodes
scp ca-key.pem node1:/etc/certs/
Why it's bad:
- CA key compromise if node compromised
- No reason for nodes to have CA key
- Violates principle of least privilege
CORRECT:
- CA key stays on offline certificate-signing machine
- Nodes only get their own cert + public CA cert
- CA key never leaves signing machine
- CA private key generated on secure/offline machine
- CA private key stored encrypted with strong passphrase
- CA private key has restricted access (only cert admins)
- Unique certificates generated for each node
- Certificates use 2048-bit RSA minimum (4096-bit preferred)
- OU field correctly set for each trust group
- Certificate validity period ≤ 1 year
- Private keys have chmod 400 permissions
- Private keys owned by correct user (e.g., erlang:erlang)
- Certificates not committed to version control
- .gitignore includes certs/, *.pem, *.key
- verify_peer enabled (never verify_none)
- fail_if_no_peer_cert set to true
- TLS 1.2 minimum, TLS 1.3 preferred
- Strong cipher suites configured
- Weak ciphers disabled (RC4, DES, 3DES, MD5)
- ssl_dist.config has same security settings for client and server
- Firewall rules restrict EPMD port (4369) to cluster network
- Firewall rules restrict distribution ports to cluster network
- mDNS multicast restricted to cluster network (VLAN/network policy)
- Production network segmented from dev/staging
- VPC/VNet configured with private subnets
- Security groups/network policies implemented
- Certificate expiration monitoring enabled
- Alerting configured for expiring certificates (<30 days)
- Centralized logging enabled
- Security event alerts configured (TLS failures, group mismatches)
- Certificate rotation procedure documented
- Incident response plan documented
- Certificate rotation drills scheduled (quarterly)
- Certificates stored in Secrets (not ConfigMaps)
- Secrets encryption at rest enabled
- Secrets not mounted to unnecessary pods
- RBAC restricts Secret access
- Secrets not logged or printed
- Secrets not in container images
- Pod Security Standards (or PodSecurityPolicy on clusters older than Kubernetes 1.25) restrict privileged containers
- Trust group membership documented
- Certificate rotation procedure documented
- Emergency response procedures documented
- Security contact list maintained
- Architecture diagrams include security boundaries
- Runbooks include certificate troubleshooting
If you discover a security vulnerability in GSMLG EPMD:
- DO NOT open a public GitHub issue
- Email security details to: [email protected]
- Include:
- Description of vulnerability
- Steps to reproduce
- Impact assessment
- Suggested fix (if any)
We will respond within 48 hours and work with you to address the issue.
- Erlang/OTP SSL/TLS Documentation
- X.509 Certificate Best Practices
- TLS Best Practices
- OWASP Transport Layer Protection Cheat Sheet
Last Updated: 2025-10-26
Version: 1.0.0