feature: add compute provider to worker deployment #704

Open
jaypipes wants to merge 1 commit into temporalio:master from jaypipes:jaypipes/compute-provider

Conversation

@jaypipes jaypipes commented Jan 28, 2026

feature: introduce compute config

Introduces a new `compute.v1` package of proto messages that house the
various instructions and information needed by worker control plane
controllers.

`temporal.api.compute.v1.Config` has two fields, `provider` and `scaler`,
that store the configuration information for the server-owned worker
setup and how the worker control plane should treat worker lifecycle
events (such as when no pollers exist for a task queue that has received
a task).

The `temporal.api.compute.v1.Config.provider` field is of type
`temporal.api.compute.v1.Provider`. This message is a simple wrapper
around a blob of bytes that contains configuration settings for a worker
control plane controller.

The configuration settings for the worker control plane controller are
specific to different compute providers. The first compute provider we
are modeling is AWS Lambda, and the `ProviderDetailAWSLambda` message
houses the ARN of the AWS Lambda Function to be invoked and an optional
ARN of the IAM Role that should be assumed by a worker control plane
controller when invoking the function.

The `temporal.api.compute.v1.Config.scaler` field is of type
`temporal.api.compute.v1.Scaler`. This message contains instructions for
when the worker lifecycle controller should scale up or down the number
of workers in a WorkerDeployment or WorkerDeploymentVersion.

Adds a new field `compute_config` to `WorkerDeploymentInfo` and
`WorkerDeploymentVersionInfo` that contains a
`temporal.api.compute.v1.Config` message.

Both WorkerDeployment and WorkerDeploymentVersion have this field
in order to handle versioned workers properly. For example, if the user
wants to use PINNED temporal versioning, they will be creating a
WorkerDeploymentVersion that is specific to, say, a qualified Lambda
Function ARN. Therefore we need to place compute provider configuration
on a WorkerDeploymentVersion in addition to WorkerDeployment.

Adds three new gRPC API methods to WorkflowService:

  • `CreateWorkerDeployment` creates a new WorkerDeployment containing the
    supplied compute configuration.
  • `UpdateWorkerDeploymentComputeConfig` updates an existing
    WorkerDeployment's compute configuration.
  • `UpdateWorkerDeploymentVersionComputeConfig` updates an existing
    WorkerDeploymentVersion's compute configuration.

Signed-off-by: Jay Pipes <[email protected]>
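As a rough orientation, the package described above could be sketched like this. This is a hand sketch based on the commit message, not the PR diff: the field names inside `ProviderDetailAWSLambda` (`function_arn`, `role_arn`) and all field numbers are illustrative guesses; only `min_replicas`/`max_replicas` appear verbatim later in the thread.

```proto
syntax = "proto3";

package temporal.api.compute.v1;

// Top-level compute configuration attached to a WorkerDeployment or
// WorkerDeploymentVersion.
message Config {
  // Provider-specific settings for the worker control plane controller.
  Provider provider = 1;
  // Instructions for when the worker lifecycle controller should scale
  // workers up or down.
  Scaler scaler = 2;
}

// Simple wrapper around opaque, provider-specific configuration bytes.
message Provider {
  bytes provider_details = 1;
}

// Provider details for the AWS Lambda compute provider (hypothetical
// field names, inferred from the description above).
message ProviderDetailAWSLambda {
  // Qualified or unqualified ARN of the AWS Lambda Function to invoke.
  string function_arn = 1;
  // Optional ARN of the IAM Role to assume when invoking the function.
  string role_arn = 2;
}

// Scaling bounds for workers in a WorkerDeployment(Version).
message Scaler {
  int32 min_replicas = 1;
  int32 max_replicas = 2;
}
```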

@jaypipes jaypipes requested review from a team as code owners January 28, 2026 15:30
CLAassistant commented Jan 28, 2026

CLA assistant check
All committers have signed the CLA.

@jaypipes jaypipes force-pushed the jaypipes/compute-provider branch 3 times, most recently from 2efdaa8 to 5a5db55 on January 28, 2026 17:10
// on a TaskQueue that has no active pollers in the last five minutes, a
// serverless worker lifecycle controller might need to invoke an AWS Lambda
// Function that itself ends up calling the SDK's worker.New() function.
message ComputeProvider {


can this have a field like provider_type or similar to identify who should interpret the details?

Member


Mentioned in adjacent comment at https://github.com/temporalio/api/pull/704/changes#r2737966940, but I think the config about which control plane this goes to is missing atm

Contributor


google.protobuf.Any is made for this sort of situation, but it doesn't handle the encryption and other payload metadata. Maybe we should say the Payload contains an Any, which contains the underlying config plus a type url?

Member

@cretz cretz Jan 29, 2026


Payloads for proto messages actually do put their type in the metadata just like Any, so they work for this use case, but I am going to be making a comment down there right now concerning issues with using proto here.


is this change intentional?

Author


no... it happened when I ran `make buf-lint` locally :(

Member

@cretz cretz left a comment


Looking good (and simple, which is awesome). This cannot really be merged to main anytime soon, not until the implementation using it has basically been completely built, due to concerns about this being mutated as we work on the impl.

// Function via its ARN. This will start the Worker that is housed within that
// AWS Lambda Function and the Worker will begin listening to Tasks on the
// WorkerDeployment's configured TaskQueue.
message ComputeProviderAWSLambda {
Member


Suggested change
message ComputeProviderAWSLambda {
message ComputeProviderDetailsAWSLambda {

Because this is not a "compute provider" as we've now defined it since "compute provider" as we've now defined it is "compute provider details" + "scaling config" + "endpoint config".

Author


@cretz I don't think putting scaling config into ComputeProvider is appropriate. The scaling algorithm for how often a StartWorker API is called is orthogonal to configuration for how that StartWorker code is implemented for a particular compute platform.

I've added a different Scaler field to WorkerDeploymentVersionInfo that is used to store settings related to auto-scaling algorithms.

Member

@cretz cretz Jan 30, 2026


Makes sense, but we are back at the problem of "what do you call this collection of things needed"? Because you are still putting scaling config on a ComputeConfigRequest. Is it part of "compute config"? If so, IMO there should be an all encompassing ComputeConfig (even if that has just a field for scaler, a field for compute_provider, etc). If not, then it shouldn't be in a call about updating compute config. These two fields have no value independently and aren't independent based on the current modelling/RPC.
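The all-encompassing wrapper cretz describes might be sketched like this (hypothetical message shape, illustrating the suggestion rather than any merged definition):

```proto
// An explicit compute-config wrapper, so that the provider and scaler
// travel together instead of as loose fields that have no value
// independently of each other.
message ComputeConfig {
  // How to start workers on a particular compute platform.
  ComputeProvider compute_provider = 1;
  // When, and within what bounds, to scale workers up or down.
  Scaler scaler = 2;
}
```

This is essentially the shape the PR later adopts as `temporal.api.compute.v1.Config` with its `provider` and `scaler` fields.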

// on a TaskQueue that has no active pollers in the last five minutes, a
// serverless worker lifecycle controller might need to invoke an AWS Lambda
// Function that itself ends up calling the SDK's worker.New() function.
message ComputeProvider {
Member


I think even as placeholders, we should go ahead and add fields that represent scaling algo config and control plane Nexus endpoint config (even if for many they will use a "system" one that we run)


// Contains the new worker compute provider configuration for the
// WorkerDeployment.
temporal.api.deployment.v1.ComputeProvider compute_provider = 20;
Member

@cretz cretz Jan 28, 2026


I assume the absence/null of this represents removing compute provider?

Comment on lines 922 to 923
// Creates or replaces the compute provider configuration for a Worker
// Deployment.
Member


Needs to be pretty clear/obvious here or somewhere that this completely changes how task queues operate and completely changes the worker paradigm.

I fear the triviality of the API/CLI names and such don't convey the significance of this "switch flip" that completely changes how these task queues will work.

Contributor


Does it? Can these task queues not be polled like normal? (as a "base load" below burstable compute) I know we're not going to focus on that case at first but will we disable it?

Member

@cretz cretz Jan 29, 2026


I think for the proper server-side worker lifecycle management and clearer understanding by users, a task queue needs to use server-owned workers or user-owned workers. Having a hybrid can cause unpredictability and shift expected numbers, known workers, etc. If a base load is needed, why can't the server control the lifetimes of those workers too (or at least know about them)? If that means that you just tell the server that these may exist, that can make sense maybe. But to have the server not even know about workers on a task queue where it expects to control workers is going to cause confusion and unpredictability IMO.

I'd go as far as saying from a server POV a worker should not be allowed to poll on a serverless task queue unless it was a worker the server configured. This way you don't have silent processing workers out there the server can't account for in its scaling/versioning logic.


message ComputeProviderAWSLambda {
// The Qualified or Unqualified ARN of the AWS Lambda Function to be
// invoked.
string invoke = 1;
Contributor


maybe:

Suggested change
string invoke = 1;
string invoke_arn = 1;

without arn in the name I would have assumed this is an http url

Author


@dnr I've updated the comment above. It can actually be the function URL too :) Which is why I left the type suffix off the field name.



@jaypipes this needs to be the ARN - no plan for supporting function URLs at this point.


// An individual ComputeProviderAWSLambda,
// ComputeProviderGoogleCloudRun, etc will be encrypted and
// serialized into compute_provider_config.
temporal.api.common.v1.Payload provider_details = 1;
Member

@cretz cretz Jan 29, 2026


I wanted to add a note here based on internal discussions. This can only really be a payload for custom providers, because people have "encrypt every payload" proxies and cloud side needing to read these prevents people from encrypting this.

But there is a problem using proto at all here IMO, see comment below at https://github.com/temporalio/api/pull/704/changes#r2741776135.

So we should have provider type w/ a oneof I think:

Suggested change
temporal.api.common.v1.Payload provider_details = 1;
// Passed to provider Nexus endpoint to identify the contents of the provider details
string provider_type = 1;
oneof provider_details {
// Non-encoded provider details, used in situations where decoding cannot occur
bytes provider_details_open = 2;
// Encoded provider details
temporal.api.common.v1.Payload provider_details_payload = 3;
}

Note how provider_details_open is bytes and not google.protobuf.Any here

Author


@cretz I've given it another try... please see the updated struct.

// Function via its ARN. This will start the Worker that is housed within that
// AWS Lambda Function and the Worker will begin listening to Tasks on the
// WorkerDeployment's configured TaskQueue.
message ComputeProviderAWSLambda {
Member


IMO these should not be protos, these should be JSON schema defined as part of a Nexus IDL (https://github.com/nexus-rpc/nexus-rpc-gen). People should have to use proto to implement a compute provider accepting this (including us).

@jaypipes jaypipes force-pushed the jaypipes/compute-provider branch from 5a5db55 to add3118 on January 30, 2026 21:41
Author

@jaypipes jaypipes left a comment


@cretz @02strich @dnr updated in the following major ways:

  1. Made the compute configuration a field on WorkerDeploymentVersionInfo instead of WorkerDeploymentInfo. This was so we can handle the versioning use cases properly (we will be making the UX for unversioned transparently auto-create a "latest" version when one doesn't already exist, BTW).
  2. Added a Scaler message type with simple min_replicas/max_replicas fields to start off with. Like ComputeProvider, I've placed Scaler on WorkerDeploymentVersionInfo instead of WorkerDeploymentVersion.
  3. Updated the gRPC API call from UpdateWorkerDeploymentComputeProvider to UpdateWorkerDeploymentComputeConfig and added the (orthogonal) compute_provider and scaler fields to UpdateWorkerDeploymentComputeConfigRequest.
  4. Changed the ComputeProvider structure from the single Payload field to using a oneOf with either a google.protobuf.Any (for unencrypted) or temporal.api.common.v1.Payload (for encrypted) provider details.

Please let me know what you think.




@jaypipes jaypipes force-pushed the jaypipes/compute-provider branch from add3118 to 21d4b1d on January 30, 2026 22:30
Comment on lines 32 to 36
google.protobuf.Any object = 2;
// will be an encrypted, encoded bytestring containing
// provider-specific information. The implementation must understand
// how to decrypt the payload.
temporal.api.common.v1.Payload payload = 3;
Member


Pedantic, but would qualify these field names (e.g. provider_detail_any and provider_detail_payload), they are not implicitly qualified by their oneof parent in many cases

Author


updated.

// Function via its ARN. This will start the Worker that is housed within that
// AWS Lambda Function and the Worker will begin listening to Tasks on the
// WorkerDeployment's configured TaskQueue.
message ComputeProviderDetailAWSLambda {
Member


I do not believe this should be defined as proto but rather JSON schema or some other form. There is no server API use of this value, it should be blindly passed to the compute provider, and those consuming compute provider details should not have to know anything about our API or have to run protoc to consume this (or future specific provider detail models).

Comment on lines 2299 to 2304
// Contains the new worker compute provider configuration for the
// WorkerDeploymentVersion.
temporal.api.deployment.v1.ComputeProvider compute_provider = 6;

// Contains the new scaler configuration for the WorkerDeploymentVersion.
temporal.api.deployment.v1.Scaler scaler = 7;
Member


Mentioned as reply to other comment, but if we want compute_provider + scaler to be in something called "compute config" (assuming based on request name), IMO we should explicit model such a thing.

Author


updated.

Comment on lines +17 to +21
int32 min_replicas = 1;
// The upper limit for the number of Workers (in the WorkerDeployment) to
// which the Scaler can scale up. Must be greater than or equal to
// minReplicas.
int32 max_replicas = 2;
Member


Has it been decided that these are the knobs we want to expose for controlling scaling, or are these just PoC-style placeholders? (trying to confirm whether I should comment on our scaling knobs yet or not)

Author


@cretz POC style placeholders for now :)

@jaypipes jaypipes force-pushed the jaypipes/compute-provider branch from 21d4b1d to cb31098 on February 3, 2026 16:49
@jaypipes jaypipes force-pushed the jaypipes/compute-provider branch from cb31098 to a09df28 on February 3, 2026 16:57
Member

@cretz cretz left a comment


Looks great to me, mostly only pedantic things, but structure looks great (can't approve for obvious reasons, we don't want to merge this anytime soon)

// Lambda Function via its ARN. This will start the Worker that is housed
// within that AWS Lambda Function and the Worker will begin listening to Tasks
// on the WorkerDeployment's configured TaskQueue.
message ProviderDetailAWSLambda {
Member

@cretz cretz Feb 3, 2026


Mentioned in other comment at #704 (comment), but I think this should not be a (orphaned) proto, but rather defined in a more generic way that doesn't require protobuf/protoc to consume. If we don't want to define JSON schema currently, even just a Go struct w/ JSON tags and a document somewhere saying the definition of this JSON object I think would be good enough. But would like to not marry protobuf here.

Also, somewhere we need to define this compute provider service and what it may accept (but it won't be in this repo).

message Scaler {
// The lower limit for the number of Workers (in the WorkerDeployment) to
// which the Scaler can scale down.
int32 min_replicas = 1;
Member

@cretz cretz Feb 3, 2026


Pedantic a bit, but IMO this should be called min_instances. I am worried it may get confused with the use of the term "replicas" by some providers where I could see an instance wanting many replicas (but from server POV, it's an "instance" w/ an "instance key").

Member


Pedantic, but can just put everything in this proto namespace in the message.proto file, not a ton of benefit in the separation IMO

// on a TaskQueue that has no active pollers in the last five minutes, a
// serverless worker lifecycle controller might need to invoke an AWS Lambda
// Function that itself ends up calling the SDK's worker.New() function.
message Provider {
Member


This is also going to need to be configured with the target Nexus/implementation endpoint. For Lambda use on our built-in implementation, this can be a built-in "system" string or something, but users will need to be able to provide their own implementation endpoint.

IMO a simple nexus_endpoint field would be great. Don't even need nexus_service because it should be a well-known, pre-defined service that users (or us) would implement. Personally I would like this to be non-optional and we can set a certain system endpoint for wherever we host ours. For example, it is no problem for users to host our Lambda compute provider impl themselves on their side with their endpoint and their end-to-end encryption and such.
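The field proposed here might be sketched as follows (illustrative only; the reply just below argues the field should not be Nexus-specific, so treat this as a sketch of one option, not a settled design):

```proto
// Sketch of a Provider carrying the control-plane endpoint alongside the
// provider details. Field names and numbers are hypothetical.
message Provider {
  // Nexus endpoint implementing the well-known compute provider service.
  // A reserved value (e.g. a built-in "system" endpoint) could select the
  // implementation hosted by Temporal; users could instead point this at
  // their own endpoint hosting the same service.
  string nexus_endpoint = 1;
}
```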



I would be partial to an address field outside the details. That said, no fields in here should be nexus specific (as it is just one of multiple avenues)

}

// Creates a new WorkerDeployment.
message CreateWorkerDeploymentRequest {
Member

@cretz cretz Feb 3, 2026


Question - can I see the user experience of configuring compute config for workflows only (not activities) on task queue X? Specifically, can worker deployments be independent by task queue type? If they can't, that's probably ok, but we will want to make sure the worker instances the server tells the provider to spin up are by task queue type, and we arguably should be able to provide unique configuration by task queue type (or at least disallow other task queue types from appearing on the task queue at all).

Contributor

dnr commented Feb 3, 2026

we will be making the UX for unversioned transparently auto-create a "latest" version when one doesn't already exist

If we do that, why do we need a config at the deployment level at all? Can't it just be on that implicit "latest" version?



Comment on lines +219 to +221
// Contains information used by worker control plane controllers to handle
// scale events.
temporal.api.compute.v1.Config compute_config = 20;


when would this be used?

5 participants