Replies: 6 comments 7 replies
-
The work outlined above is in response to AEP-12 on the Akash Roadmap. The DOOOR team will present the proposal and give a live demo at the sig-providers monthly meeting on April 23rd, 2025. Please track the Akash Community Group calendar, and feel free to attend.
-
Hi @bruno353 - thanks for submitting this super detailed proposal and for the demo recordings. I'll start by saying that this is a very important piece of functionality that would bring a lot of value to the network, increase trust, and remove a common security-based friction point for users/companies adopting Akash, so I support the motivation and appreciate you taking the initiative on this. I reviewed the demos and have some questions (apologies if I'm unable to discern this from your writeup):
Thanks again!
-
Thank you for submitting your proposal to enhance the Akash Network. I appreciate the technical expertise and enthusiasm demonstrated by your team in addressing the challenges outlined here. It would be useful for the community to have clarity on some of the technical decisions behind your proposal and its advantages over existing solutions - for example, how does it compare to Kubernetes-native solutions? As stated previously by @anilmurty, this will require alignment with the Core Engineering team (@troian @chainzero @cloud-j-luna) on the impact of the required code changes. We'll have a clearer picture after the SIG Providers meeting, which will allow me to provide deeper feedback and better understand the proposal.
Related AEPs
This seems related to AEP-50. On the verifiable-compute side, AEP-29 aims to provide hardware verification, which could be relevant for the verifiable compute portion of the proposal. EDIT: Fixed the AEP links.
-
I support this proposal and am ready to help in any way I can.
-
This proposal looks great. I have also heard and seen some of the demos. As Anil mentioned above, this looks like it would add a "piece of functionality that would bring a lot of value to the network". I would be interested in testing this when possible and also support this proposal.
-
I've created this AEP to outline the overall plan/direction for Confidential Computing - FYI: https://akash.network/roadmap/aep-65/
-
1. Context of Proposal
DOOOR is pioneering a decentralized AI execution framework by integrating Trusted Execution Environments (TEEs) into the Akash Network, enabling secure and verifiable AI workloads. This initiative enhances Akash’s push into AI by ensuring privacy-preserving AI model execution and secure computing for LLMs (Large Language Models) and autonomous AI agents.
Our TEE integration SDK will enable AI developers and cloud providers on Akash to run workloads with cryptographic attestation, ensuring that sensitive AI models and computations are protected from unauthorized access. By supporting AMD SEV-SNP, NVIDIA H100, AWS Nitro Enclaves, and Azure Confidential Computing, DOOOR will provide an open-source framework for verifiable AI inference and execution within Akash.
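To illustrate what "cryptographic attestation" buys a workload owner, the sketch below shows the core check an SDK like this would perform before releasing a model to a provider: verify that an attestation report is authentically signed and that its enclave measurement matches a known-good runtime image. This is a simplified, stdlib-only Python sketch; all names are hypothetical, and HMAC stands in for the vendor certificate chains that real AMD SEV-SNP or NVIDIA H100 reports use.

```python
import hashlib
import hmac
import json

# Hypothetical digest of an approved AI runtime image (illustrative only).
APPROVED_IMAGE_DIGEST = hashlib.sha256(b"approved-ai-runtime-v1").hexdigest()
TRUSTED_MEASUREMENTS = {APPROVED_IMAGE_DIGEST}

def verify_attestation(report_json: bytes, signature: bytes, vendor_key: bytes) -> bool:
    """Simplified attestation check: (1) the report carries a valid signature
    under the hardware vendor's key, and (2) the reported enclave measurement
    is on the allow-list. Real SEV-SNP / H100 reports are verified against
    vendor certificate chains; HMAC keeps this sketch stdlib-only."""
    expected = hmac.new(vendor_key, report_json, hashlib.sha256).digest()
    if not hmac.compare_digest(expected, signature):
        return False  # report was not produced under the vendor key
    report = json.loads(report_json)
    return report.get("measurement") in TRUSTED_MEASUREMENTS
```

Only if both checks pass would the SDK hand encrypted model weights to the provider; a failed measurement means the enclave is running unexpected code, even if the signature is valid.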
2. Problem Statements
2.1 Lack of Automated TEE Management for AI Deployments
Problem: Complex and Error-Prone TEE Integration
Trusted Execution Environments (TEEs) provide an essential security layer for AI workloads, but their manual configuration and deployment on Akash are complex, inconsistent, and prone to errors. Current Akash deployments rely on standard Kubernetes-based container management, which lacks built-in attestation verification and automatic TEE enforcement. As a result, developers must manually set up, configure, and verify their workloads, leading to significant deployment overhead. This complexity discourages adoption and makes secure execution inaccessible to many AI developers.
Impact: Security Gaps and Slower Adoption
Without automated TEE integration, workloads cannot be easily verified, leading to security gaps where AI models may run on untrusted infrastructure. This discourages enterprises and privacy-conscious developers from using Akash, as they cannot ensure their data remains secure during execution. Furthermore, the time-consuming manual setup limits scalability, making it difficult for providers to offer TEE-backed services at scale.
2.2 Fragmented & Incompatible TEE Workflows Across Cloud Providers
Problem: No Unified Approach to Secure Compute on Akash
Akash aims to be a decentralized alternative to major cloud providers, but the lack of standardization in TEE workflows across different hardware and cloud environments creates fragmentation. Providers using Azure Confidential VMs, AWS Nitro Enclaves, IBM Cloud Hyper Protect, or self-hosted hardware each require different verification and attestation methods, making it difficult for developers to deploy across multiple providers. This lack of interoperability means AI workloads cannot easily migrate between providers, limiting the network’s efficiency.
Impact: Vendor Lock-in and Barriers to Adoption
Because each cloud provider requires different security protocols, developers are often locked into specific providers, reducing the decentralized value of Akash. Workloads that should be able to migrate between providers seamlessly are instead restricted to specific hardware, limiting availability, scalability, and pricing flexibility. The barriers to standardization also slow innovation, as developers must rewrite security protocols for each environment, delaying deployments and increasing costs.
2.3 No Verifiable AI Execution on Akash (GPU Attestation)
Problem: AI Inference Workloads Lack Security Guarantees
Machine learning and AI inference workloads on Akash lack any form of verifiable execution guarantees, meaning users cannot trust that their AI model is running on genuine, unmodified hardware. Unlike centralized cloud services, where security certifications and audits verify compute environments, Akash currently provides no cryptographic proof that an AI model is executing in a secure and unaltered GPU environment.
Impact: Risks of Model Theft, Data Manipulation, and AI Bias
Without hardware attestation and verifiable execution, AI model providers are at risk of model theft, where an untrusted provider extracts or manipulates a deployed model. Additionally, data processed on unverified GPUs could be compromised, leading to biased AI results, security breaches, or tampered outputs. This lack of trust deters high-value AI applications—such as financial modeling, autonomous vehicle AI, and healthcare AI—from using Akash due to security concerns.
2.4 Absence of Secure & Decentralized Key Management for AI Models
Problem: Centralized Key Storage Risks and No Encrypted Model Execution
AI workloads require secure storage and execution of private cryptographic keys, especially when handling proprietary models, encrypted datasets, or zero-knowledge computations. However, current Akash deployments lack decentralized private key management. Without an established framework for private key sharing, sealing, and attestation verification, AI models running on Akash are vulnerable to key exposure, unauthorized access, or centralized attack vectors.
Impact: Loss of Privacy and Security Vulnerabilities
Without trustless key management, sensitive AI workloads cannot be protected, making Akash an unsuitable environment for enterprise-grade AI applications. Existing cloud services use proprietary HSM (Hardware Security Modules) or centralized key vaults, but these introduce single points of failure and limit decentralization. A decentralized key bootstrap system, where AI models can securely share and verify encryption keys, is necessary to guarantee data integrity and confidentiality for sensitive computations.
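The core idea behind decentralized key management - that no single party ever holds the whole key - can be sketched with the simplest possible scheme: n-of-n XOR secret splitting, where every share is required to reconstruct the key. This is an illustrative stdlib-only Python sketch, not the proposal's design; a production system would more likely use threshold schemes (e.g. Shamir) or VetKeys as mentioned elsewhere in this proposal.

```python
import secrets
from functools import reduce

def split_key(key: bytes, n: int) -> list[bytes]:
    """Split `key` into n XOR shares: all n shares are required to rebuild
    the key, and any subset of fewer than n shares reveals nothing about it."""
    shares = [secrets.token_bytes(len(key)) for _ in range(n - 1)]
    # The final share is chosen so the XOR of all shares equals the key.
    last = bytes(kb ^ reduce(lambda a, b: a ^ b, rest)
                 for kb, *rest in zip(key, *shares))
    return shares + [last]

def combine_key(shares: list[bytes]) -> bytes:
    """XOR all shares back together to recover the original key."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*shares))
```

Distributing the shares across independent enclaves removes the single point of failure that a centralized key vault introduces, at the cost of requiring all share-holders to be available (which is why threshold variants are usually preferred).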
2.5 No Cross-Provider TEE Attestation Mechanism
Problem: Users Cannot Verify That Their Workloads Run in a Trusted Environment
Even if AI models and workloads are deployed on Akash, there is currently no built-in way for users to verify that their workloads are running inside a TEE-protected environment. Unlike traditional cloud providers that offer security compliance guarantees (e.g., ISO 27001, SOC 2), Akash lacks a decentralized attestation mechanism that allows users to check if their AI workload is actually executing inside a trusted enclave.
Impact: No Confidence in Data Confidentiality & Execution Integrity
Without a cross-provider attestation system, AI model owners and enterprises cannot trust that data confidentiality and execution integrity are maintained. This lack of transparency prevents privacy-sensitive industries from adopting Akash, such as healthcare AI (HIPAA compliance), finance (confidential trading algorithms), and defense AI applications. Establishing hardware-backed, remote attestation across Akash providers is essential to build trust and attract high-value workloads.
3. Implementation Plan & Feature Set
Below are the relevant demos, GitHub repositories, and additional technical documentation covering the features, their specs, and code snippets:
*Note: The Decentralized Key Sharing Management Phase is still in the design stage, and its full viability is yet to be assessed. At this stage, we commit to delivering a comprehensive report outlining the requirements, feasibility, and proposed design of the solution. Based on these findings, we will be able to better estimate the timeline, effort, and resources needed to implement the final scope effectively.
4. Projected allocation & budget
To successfully develop and deploy a fully verifiable TEE-based AI execution framework on Akash, we are requesting $331,000 in funding, carefully allocated to cover infrastructure, security audits, development, and operational costs. This budget ensures that we can build, test, and deploy the project over a period of 8 months, delivering a scalable, open-source solution that strengthens Akash’s security and AI capabilities.
**Note: This is an estimated amount, and costs may fluctuate based on audit firm rates and bug bounty program pricing.
Conclusion
The DOOOR protocol is committed to strengthening Akash’s ecosystem by building trusted execution environments (TEEs) and verifiable AI execution, ensuring secure, private, and decentralized computing. Akash was chosen because of its decentralized, cost-effective, and censorship-resistant infrastructure, making it the ideal platform for confidential AI computing. Our work will enhance security, attract enterprise AI users, and position Akash as the go-to network for privacy-first AI deployments. This project is not just something we want to build—it is something Akash needs to advance its security and attract high-value workloads.
Beyond technology, we are deeply invested in Akash’s community and want to ensure that our contributions bring direct, lasting value to providers, developers, and users. By making our TEE-based security tools open-source, providing educational resources, and incentivizing community-driven security testing, we ensure that Akash’s entire ecosystem gains long-term value. Additionally, we are committed to collaborating both on and off the field—not only by strengthening Akash’s infrastructure but also by actively engaging with its community. As we progress through testnet and mainnet launches, we will introduce initiatives that drive adoption, engagement, and long-term participation, ensuring that our success directly benefits the broader Akash community and ecosystem.
This proposal is about long-term commitment. We are dedicated builders in the Akash ecosystem, and this project is only the beginning. With this funding, we will develop the first fully verifiable TEE-based AI execution framework, setting the stage for future security innovations. We see this as an ongoing partnership where we continue to enhance Akash’s capabilities and expand privacy and security features in decentralized AI computing. Investing in this proposal is investing in Akash’s future as the leader in secure, decentralized AI infrastructure.
Appendix
Links & References
Full Demo TEE Mainnet (with explanation)
Pocket Demo TEE Mainnet
Detailed Technical Explanation with Code Snippets
GitHub Proposal Discussion
Team Background
Thiago Castroneves — CEO
Thiago de Castro Neves, Co-Founder and CEO of DOOOR, brings a unique blend of engineering, management, and blockchain expertise to the forefront of decentralized AI execution. With over 15 years of experience in industrial engineering, he served as an engineer, manager, and innovation lead for a global steel multinational, honing his skills in strategic leadership, infrastructure development, and operations management. Transitioning into the blockchain industry, Thiago played a pivotal role as Head of Business Development at Moonbeam, where he led the Grants Program, interviewed 700+ teams around the world, and drove ecosystem growth initiatives that today represent more than 60% of the overall transactions on the Moonbeam Network. His deep technical and strategic acumen now drives DOOOR's mission to redefine secure AI execution in decentralized networks.
Bruno Laureano — CTO
Bruno Laureano dos Santos, Co-Founder and CTO of DOOOR, brings extensive expertise in AI, blockchain technology, and full-stack development, with exceptional back-end and front-end engineering skills. His ability to lead and manage high-performing technical teams has been instrumental in driving innovation, adapting to the rapidly evolving AI and blockchain landscape, and building cutting-edge decentralized solutions. With a deep understanding of trustless computing and secure AI execution, Bruno is at the forefront of DOOOR’s mission to develop scalable, secure, and efficient AI frameworks for decentralized ecosystems.
Contact Information
Background on what is Dooor and how it works with Akash
By enabling a set of tenant providers to rent computing power on a decentralized marketplace, Akash serves as a crucial layer in the Dooor protocol, allowing LLMs to run on TEE GPUs while connecting with context and fine-tuning data from data stores. This integration is key to the autonomous execution of agents with the orchestration layer. Through HTTPS calls, secure key storage (currently under development with VetKeys), and the T-ECDSA protocol for signing multi-chain messages and transactions, canisters form a network of trustless oracles that underpin the orchestration layer for the Akash chain. This creates a true on-chain environment for AI without reliance on centralized institutions. Furthermore, the smart contracts in the Aggregation Layer are mapped to the data peer tools' canisters, ensuring synchronization of AI data across chains.
The Dooor architecture introduces an innovative infrastructure that unlocks unprecedented use cases for the Akash ecosystem. By leveraging canister utilization as key components in a multi-tenant LLM management system, Dooor enables the first truly decentralized AI ecosystem on Akash, complete with real-world sub-applications.
Currently implemented with the Azle TypeScript framework, the canisters maintain their own database relationships for critical, incorruptible data points. These canisters facilitate user interactions with agents, context, and LLMs on Akash. A multiset of canisters ensures data redundancy for corruption resistance, while a master canister, governed by the Dooor DAO, ensures decentralization throughout the workflow. This architecture enables on-chain management of a swarm of community-deployed agents without intermediaries, with LLM scripts stored and executed on the computing layer.
The Dooor backend, deployed on Akash, employs a load-balancer system to guarantee consistent LLM availability and fast responses for end users. The current implementation uses Flask to manage user credits and agent configurations. The developer workflow involves a Docker container that creates an image for upload, along with an SDL file and a `startup.sh` script to initialize the framework for running LLM instances. The backend server manages the business logic. For fine-tuning operations, Unsloth is used for intensive data handling on the backend, receiving commands and data directory authorizations from the ICP canister. For running LLMs, Ollama is employed to retrieve `.gguf` models from a shared Dooor database, with the business logic also controlled by the canister.
Why Not Use the Akash Computing Layer Directly?
The Akash deployment process includes a centralized step for the deployer, requiring a certificate generated locally on the user’s computer to interact with containers. This step creates a potential point of failure, as access to the certificate is necessary to provide the one-time SDL rules to the provider, which could allow unauthorized access to Dooor's provider configurations.
Using ICP canister management, this Akash certificate can be generated in-memory and the one-time SDL configuration sent to the provider without storing the certificate data. This approach ensures a fully decentralized flow. Once the configuration is completed, all communication between the canister and Akash can be securely managed using T-ECDSA.
Dooor's mission is to become an aggregation protocol for AI agents. By integrating with the Akash chain, the protocol ensures redundancy in its infrastructure resources, leveraging Akash's unique capabilities while offering competitive pricing for end users. With the load-balancer system under development by Dooor—a queue-cron system within the canister that coordinates providing bids and resources on Akash—the protocol replicates the complexity and reliability of infrastructures that are typically only achievable with centralized solutions.
How Does It Work?
EVM smart contracts handle user transactions and manage token pools. Mapping and queues allow users to create orders (such as uploading, patching, or deleting models, tools, and contexts). ICP canisters perform continuous HTTPS calls via a set of Base RPCs (acting as oracles) to gather transaction data, process it according to the defined business logic, and interact with the Akash-ICP SDK for provider management on the Akash layer.
Users or tenants specify deployment parameters, including data centers, requirements, and pricing, in a manifest file (`deploy.yaml`). This file, written in the declarative Stack Definition Language (SDL), simplifies the process of defining deployment attributes.
The system is designed with idempotence in mind, ensuring that even in extreme scenarios, redundant calls reliably converge on successful transaction execution without duplicate side effects.
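The idempotence property described above can be sketched as request-ID deduplication: a redundant oracle call with the same ID replays the cached result rather than re-executing the operation. This is an illustrative Python sketch with hypothetical names, not the canister's actual implementation (which would persist results in canister state rather than an in-memory dict).

```python
class IdempotentExecutor:
    """Deduplicate order execution by a caller-supplied request ID, so a
    redundant oracle call replays the cached result instead of re-running
    the operation (e.g., avoiding a duplicate deployment order)."""

    def __init__(self) -> None:
        self._results: dict[str, object] = {}

    def execute(self, request_id: str, operation):
        if request_id in self._results:
            return self._results[request_id]  # replay: no second side effect
        result = operation()
        self._results[request_id] = result
        return result
```

With this pattern, the canister's cron/queue system can safely retry any order until it observes a result, since at most one execution ever takes place per request ID.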
Dooor is developing streamlined functions to convert manifests from LLM developers into readable SDL for Akash, enabling efficient deployment of virtually any tenant. This ensures flexibility and simplifies the deployment process for a wide range of use cases.
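The manifest-to-SDL conversion described above might look like the following minimal Python sketch, which renders a tiny developer manifest into an SDL v2.0 document. The manifest field names are assumptions for illustration; the real SDL schema also covers placement profiles, pricing, and GPU attributes, which are omitted here.

```python
def manifest_to_sdl(manifest: dict) -> str:
    """Render a minimal SDL v2.0 document from a simplified developer
    manifest. Illustrative only: real SDL requires placement and pricing
    sections that this sketch leaves out."""
    name = manifest["name"]
    return f'''version: "2.0"
services:
  {name}:
    image: {manifest["image"]}
    expose:
      - port: {manifest["port"]}
        as: 80
        to:
          - global: true
profiles:
  compute:
    {name}:
      resources:
        cpu:
          units: {manifest["cpu"]}
        memory:
          size: {manifest["memory"]}
        storage:
          size: {manifest["storage"]}
deployment:
  {name}:
    akash:
      profile: {name}
      count: 1
'''
```

For example, a manifest like `{"name": "llm", "image": "ollama/ollama", "port": 11434, "cpu": 4, "memory": "16Gi", "storage": "50Gi"}` would expand into a deployable SDL skeleton for a single Ollama service.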
ZK, privacy and multichain operations
Dooor’s vision prioritizes privacy for developers and users, ensuring that no external entity—individual or company—can extract user code or content, or manipulate data in a malicious way (even when using TEE GPUs). To achieve this, Dooor developed the ECDSA-JWT inter-auth system. In this system, the canister generates a time-based expiration token via its T-ECDSA signature, enabling users to directly interact with agents and models running on the computing layer. This approach enhances the speed of data updates and retrievals while reducing the load on the canister system, as it no longer needs to approve every interaction.
The ECDSA-JWT inter-auth system moves part of the business logic to the computing layer, such as credit systems, view permissions, and ZK storage authentication. Only authorized users can access specific data, with the server provider verifying that the signature was derived and created by the canister's public key. This ensures secure and controlled access without compromising user privacy.
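The token flow described above can be sketched as follows: the canister mints a short-lived signed token, and the provider verifies the signature and expiry before serving the request. This is a hedged, stdlib-only Python sketch with hypothetical names; HMAC stands in for the canister's T-ECDSA signature (which the provider would verify against the canister's public key), and the token format is a simplified JWT-like `body.signature` pair.

```python
import base64
import hashlib
import hmac
import json
import time

def _b64(data: bytes) -> str:
    # URL-safe base64 without padding, as in JWT encoding.
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def issue_token(canister_key: bytes, user: str, ttl_s: int = 300) -> str:
    """Canister side: mint a short-lived token binding a user to an expiry."""
    body = _b64(json.dumps({"sub": user, "exp": int(time.time()) + ttl_s}).encode())
    sig = _b64(hmac.new(canister_key, body.encode(), hashlib.sha256).digest())
    return f"{body}.{sig}"

def verify_token(canister_key: bytes, token: str):
    """Provider side: accept a request only if the signature checks out and
    the token has not expired; otherwise return None."""
    body, sig = token.split(".")
    expected = _b64(hmac.new(canister_key, body.encode(), hashlib.sha256).digest())
    if not hmac.compare_digest(expected, sig):
        return None  # token was not minted under the canister's key
    claims = json.loads(base64.urlsafe_b64decode(body + "=" * (-len(body) % 4)))
    return claims if claims["exp"] >= time.time() else None
```

Because verification needs only the canister's key material and the current time, the provider can authorize each request locally, which is what lets the canister drop out of the per-interaction approval path.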
By enabling the provider to validate user identity and approve guard authentication to process requests, the system allows for responses to be hashed and copied at the computing layer. This ensures agent responses can be verified without requiring public access to the provider’s code.
To facilitate seamless integration with the Akash SDK, Dooor developed new interfaces for the canister. These interfaces ensure compatibility, as the default Akash SDK would otherwise disrupt canister execution.
ICP HTTPS: The HTTPS outcalls (https://internetcomputer.org/docs/current/references/https-outcalls-how-it-works) enabled by ICP are a crucial component of its integration with Akash, allowing canisters to interact with external environments. All the necessary properties to initiate bidding (https://akash.network/docs/akash-provider-service-and-associated-sub-services/bid-engine-overview/) can be implemented by the canister in a workflow that involves signing transactions and sending them to external systems. With an RPC swarm developed within the canister, the Dooor framework ensures connectivity with virtually any computing chain layer, serving as the intermediary between agents running on the computing layer and their mapped addresses on ICP. This setup allows each agent to have its own blockchain address, enabling them to perform truly autonomous operations through the canister.