feat(durable): Add durable session persistence layer for long-horizon agents #4351

caohy1988 · 2026-02-02T09:03:07Z

Summary

This PR implements a durable session persistence layer for ADK, enabling cross-process checkpoint-based recovery for long-running agent tasks. This addresses the "12-minute barrier" problem where agents lose state during long BigQuery jobs or other async operations.

Key Features

DurableSessionConfig: Configuration for durable cross-process checkpointing
BigQueryCheckpointStore: Two-phase commit checkpoint storage (BigQuery metadata + GCS blobs)
CheckpointableAgentState: Abstract interface for agents supporting durability
WorkspaceSnapshotter: GCS-based workspace directory snapshotting
Lease-based concurrency: Safe resume with optimistic locking

Implementation Highlights

Component	Description
Two-phase commit	GCS blob upload → BigQuery metadata insert (atomic visibility)
SHA-256 verification	Checkpoint integrity verification on read
Async-first API	All store methods are async for non-blocking I/O
Experimental decorators	All public classes marked `@experimental`

Files Added

Core Module (src/google/adk/durable/)

config.py - DurableSessionConfig
checkpointable_state.py - CheckpointableAgentState ABC
stores/base_checkpoint_store.py - DurableSessionStore ABC
stores/bigquery_checkpoint_store.py - BigQuery + GCS implementation
workspace_snapshotter.py - GCS workspace snapshots

Demo (contributing/samples/long_running_task/)

agent.py - Demo agent with durable config
demo_server.py - FastAPI server with checkpoint APIs
demo_ui.html - Real-time visualization UI
long_running_task_design.md - Detailed design document

Tests (tests/unittests/durable/)

Unit tests for all components

Live Demo

A fully functional demo is deployed on Cloud Run:

URL: https://durable-demo-201486563047.us-central1.run.app

The demo showcases:

Real-time checkpoint visualization
Task failure simulation
Checkpoint-based recovery
BigQuery metadata queries
Final task output display

Infrastructure:

BigQuery Dataset: test-project-0728-467323.adk_metadata
GCS Bucket: gs://test-project-0728-467323-adk-checkpoints

Test plan

Unit tests pass for all durable components
End-to-end checkpoint write/read verified against real BigQuery/GCS
Demo deployed and functional on Cloud Run
Failure simulation and recovery tested
SHA-256 integrity verification tested

Design Document

See contributing/samples/long_running_task/long_running_task_design.md for the full design including:

Architecture overview
API contracts
BigQuery schema definitions
Security considerations
Cost estimation
Monitoring recommendations

🤖 Generated with Claude Code

google-cla · 2026-02-02T09:03:13Z

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

adk-bot · 2026-02-02T09:04:08Z

Response from ADK Triaging Agent

Hello @caohy1988, thank you for creating this PR!

Before we can proceed with the review, could you please address the following items from our contribution guidelines:

Sign our Contributor License Agreement (CLA): It looks like this may be your first contribution, and the CLA has not been signed yet. You can sign it at https://cla.developers.google.com/.
Associate a GitHub Issue: For new features like this, we require an associated GitHub issue to track the work. If one doesn't exist, could you please create one and link it in the PR description?

This information will help us to review your PR more efficiently. Thanks!

gemini-code-assist · 2026-02-02T09:13:24Z

Summary of Changes

Hello @caohy1988, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances ADK's capabilities by introducing a robust durable session persistence layer. This new feature allows long-running agent tasks to maintain their state across process boundaries and system failures, ensuring continuity and reliability for complex, time-consuming operations. By leveraging BigQuery and Google Cloud Storage, it provides an auditable and scalable solution for managing agent progress, effectively overcoming limitations previously faced in cloud environments.

Highlights

Durable Session Persistence Layer: Introduces a durable session persistence layer for ADK, enabling cross-process checkpoint-based recovery for long-running agent tasks and addressing the '12-minute barrier' problem.
Key Components: Adds DurableSessionConfig for configuration, BigQueryCheckpointStore for two-phase commit checkpoint storage (BigQuery metadata + GCS blobs), CheckpointableAgentState as an abstract interface for agents, and WorkspaceSnapshotter for GCS-based workspace directory snapshotting.
Concurrency and Integrity: Implements lease-based concurrency for safe resume with optimistic locking and includes SHA-256 verification for checkpoint integrity on read.
Async-first API: All store methods are designed as async for non-blocking I/O, ensuring efficient operation.
Comprehensive Demo: Includes a fully functional demo with an agent, FastAPI server, and real-time visualization UI, showcasing task failure simulation, checkpoint-based recovery, and BigQuery metadata queries.
Design Document and Review: A detailed design document (long_running_task_design.md) is added, along with a review feedback document (REVIEW_FEEDBACK.md) that critically assesses the design against existing ADK capabilities.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request introduces a comprehensive and well-designed durable session persistence layer, which is a significant feature for enabling long-horizon agents. The use of BigQuery for metadata and GCS for blobs is a robust pattern, and the implementation correctly includes key features like two-phase commits and lease-based concurrency. The accompanying demo is excellent for showcasing the functionality. My review identifies a few important issues to address, primarily concerning security (a hardcoded API key and a potential path traversal vulnerability), a race condition in session creation, and several opportunities for code refinement and improved maintainability. Overall, this is a strong feature addition, and addressing these points will make it even more robust.

gemini-code-assist · 2026-02-02T09:15:04Z

contributing/samples/long_running_task/agent.py

+GOOGLE_CLOUD_API_KEY = os.environ.get(
+    "GOOGLE_CLOUD_API_KEY",
+    "AQ.Ab8RN6L12XpDo1x7Gf2w87EfspguWGrjZPW6XocNy2og_-z_jg",
+)


A default API key is hardcoded as a fallback value. This is a significant security risk, as it could be accidentally committed and exposed. Even for a demo, it's best practice to avoid hardcoding secrets. The application should fail explicitly if the key is not provided in the environment, rather than falling back to a hardcoded value.

GOOGLE_CLOUD_API_KEY = os.environ.get("GOOGLE_CLOUD_API_KEY") if not GOOGLE_CLOUD_API_KEY: raise ValueError("GOOGLE_CLOUD_API_KEY environment variable not set.")

gemini-code-assist · 2026-02-02T09:15:04Z

src/google/adk/durable/stores/bigquery_checkpoint_store.py

+    existing = await self.get_session(session_id=session_id)
+    if existing:
+      raise ValueError(f"Session {session_id} already exists")


There is a race condition here. Two concurrent requests could both check for an existing session, find none, and then both attempt to create it. Since BigQuery PRIMARY KEY constraints are not enforced, this could lead to duplicate session entries. The session creation logic should be made idempotent. One approach is to use a unique ID for the BigQuery insert job, which makes the insertion retryable and idempotent within a certain window.

gemini-code-assist · 2026-02-02T09:15:04Z

src/google/adk/durable/workspace_snapshotter.py

+      safe_members = [
+          m for m in tar.getmembers() if not m.name.startswith(("/", ".."))
+      ]


The check to prevent path traversal attacks (tar-slip) is insufficient. An attacker could craft a filename like a/../../etc/passwd which would bypass the current check. A more robust approach is to resolve the real path of each member and ensure it is within the intended destination directory before extraction. Using tar.extractall with a filtered list is risky; it's safer to iterate through members and extract them individually with proper path validation.

for member in tar.getmembers(): member_path = os.path.join(self._workspace_dir, member.name) # Resolve the absolute path and ensure it's within the workspace if os.path.realpath(member_path).startswith(os.path.realpath(self._workspace_dir)): tar.extract(member, self._workspace_dir) else: logger.warning("Skipping potentially unsafe path in tarball: %s", member.name)

gemini-code-assist · 2026-02-02T09:15:04Z

contributing/samples/long_running_task/demo_server.py

+async def list_sessions():
+    """List all sessions from BigQuery."""
+    try:
+        client = checkpoint_store._get_bq_client()


Accessing a "private" member _get_bq_client from outside the class is generally discouraged as it breaks encapsulation. If external access to the client is needed, consider providing a public property or method in the BigQueryCheckpointStore class.

gemini-code-assist · 2026-02-02T09:15:04Z

contributing/samples/long_running_task/demo_server.py

+    except Exception as e:
+        return {"sessions": [], "error": str(e)}


Catching a broad Exception can hide bugs and make debugging difficult. It's better to catch more specific exceptions that you expect to handle (e.g., exceptions from the BigQuery client). Additionally, returning a 200 OK status with an error message in the body for a failed API call is not standard practice. Consider raising an HTTPException with a 5xx status code to provide a more accurate API response.

Suggested change

except Exception as e:

return {"sessions": [], "error": str(e)}

except Exception as e:

raise HTTPException(status_code=500, detail=f"Failed to list sessions: {e}")

gemini-code-assist · 2026-02-02T09:15:04Z

contributing/samples/long_running_task/demo_server.py

+
+async def run_task_with_checkpoints(session_id: str, duration: int, resume: bool = False):
+    """Run a long-running task with periodic checkpoints."""
+    import random


For better code organization and to adhere to standard Python style (PEP 8), imports should be placed at the top of the file. Moving import random to the top module level will improve readability and consistency.

gemini-code-assist · 2026-02-02T09:15:04Z

contributing/samples/long_running_task/setup.py

+  with open("/tmp/lifecycle.json", "w") as f:
+    f.write(lifecycle_config)


Using a hardcoded path like /tmp/lifecycle.json can be problematic in environments where /tmp is not writable or has specific restrictions (e.g., some serverless environments). It's more robust to use Python's tempfile module to create temporary files in a secure and platform-independent manner.

import tempfile # ... with tempfile.NamedTemporaryFile(mode="w", delete=False, suffix=".json") as tmp_file: tmp_file.write(lifecycle_config) lifecycle_path = tmp_file.name run_command( [ "gsutil", "lifecycle", "set", lifecycle_path, f"gs://{GCS_BUCKET}", ], check=False, ) os.remove(lifecycle_path)

gemini-code-assist · 2026-02-02T09:15:05Z

src/google/adk/durable/stores/bigquery_checkpoint_store.py

+          active_lease_id=row.active_lease_id,
+          lease_expiry=row.lease_expiry,
+          ttl_expiry=row.ttl_expiry,
+          metadata=row.metadata if isinstance(row.metadata, dict) else (json.loads(row.metadata) if row.metadata else None),


This complex one-liner for handling JSON parsing from BigQuery is duplicated in get_session, read_checkpoint, and list_checkpoints. To improve maintainability and reduce redundancy, this logic should be extracted into a private helper method.

… agents This PR implements a durable session persistence layer for ADK, enabling cross-process checkpoint-based recovery for long-running agent tasks. ## Key Features - **DurableSessionConfig**: Configuration for durable cross-process checkpointing - **BigQueryCheckpointStore**: Two-phase commit checkpoint storage (BQ metadata + GCS blobs) - **CheckpointableAgentState**: Abstract interface for agents supporting durability - **WorkspaceSnapshotter**: GCS-based workspace directory snapshotting ## Implementation Details - Two-phase commit: GCS blob upload → BigQuery metadata insert - SHA-256 checkpoint integrity verification - Lease-based concurrency control for safe resume - Async-first API design for non-blocking I/O ## Demo A fully functional demo is deployed on Cloud Run showcasing: - Real-time checkpoint visualization - Task failure simulation and recovery - BigQuery metadata queries - Final task output display Demo URL: https://durable-demo-201486563047.us-central1.run.app ## Files Added - src/google/adk/durable/ - Core durable module - contributing/samples/long_running_task/ - Demo agent and UI - tests/unittests/durable/ - Unit tests Co-Authored-By: Claude Opus 4.5 <[email protected]>

Addresses the comment: "Session service is the durable session persistence" - Clarifies distinction between SessionService (conversation history) and CheckpointStore (execution state) - Provides three potential approaches (separate store, extend service, event type) - Recommends Option A (separate CheckpointStore) for v1 - Suggests specific updates to design doc sections - Lists action items and open questions for ADK team Co-Authored-By: Claude Opus 4.5 <[email protected]>

Addresses ADK team comment: "ArtifactService is designed for large blobs. Have you checked GcsArtifactService?" Response: - Reviewed GcsArtifactService capabilities and interface - Identified key gaps: two-phase commit, SHA-256 verification, key structure - Compared three approaches: adapt ArtifactService, direct GCS, extend interface - Recommends direct GCS client for v1 due to simpler implementation - Suggests design doc updates for Section 5.3 and Section 15 Co-Authored-By: Claude Opus 4.5 <[email protected]>

Addresses ADK team comment on Section 7.3: "This is not only applicable to resume. Runner.run_async also requires this. Leasing is a general requirement for app developers." Response: - Acknowledges leasing is a general ADK requirement, not durable-specific - Identifies scenarios: run_async, resume, Pub/Sub redelivery, horizontal scaling - Reviews current state: no built-in lease in Runner - Proposes three options: durable-only, Runner-level, SessionService-level - Recommends keeping durable-only for v1, consider SessionService for v2 - Suggests design doc updates for Section 7.3 and Section 18 Co-Authored-By: Claude Opus 4.5 <[email protected]>

Addresses ADK team comment: "Could you elaborate? Agent state is persisted in events which are persisted in session service." Response: - Clarifies what IS preserved: conversation history, tool call records - Clarifies what is NOT preserved: job ledgers, aggregated results, execution plans - Provides concrete example: 50-table PII scan recovery comparison - Distinguishes Session Events (LLM context) vs Checkpoint State (execution recovery) - Identifies when session events alone are sufficient vs when checkpoints add value - Suggests design doc revision to Section 1.2 with clarification table Co-Authored-By: Claude Opus 4.5 <[email protected]>

Adds comprehensive "Enterprise PII Compliance Audit" example showing: - 100-table scan across 5 datasets (~8 hour operation) - Process dies at hour 3:15 with 35 tables done, 2 jobs running - Path A (Events Only): LLM re-deduction, duplicate jobs, ~30 min recovery - Path B (Checkpoint): 5-second deterministic recovery, job reconciliation - Side-by-side comparison table - Cost impact analysis ($75.50 extra cost with events-only) - Five specific capabilities checkpoints enable that events cannot Co-Authored-By: Claude Opus 4.5 <[email protected]>

adk-bot added the services [Component] This issue is related to runtime services, e.g. sessions, memory, artifacts, etc label Feb 2, 2026

gemini-code-assist bot reviewed Feb 2, 2026

View reviewed changes

caohy1988 force-pushed the feature/durable-session-persistence branch from 99e7726 to 7d946ed Compare February 2, 2026 09:38

ryanaiagent self-assigned this Feb 2, 2026

caohy1988 and others added 5 commits February 2, 2026 22:29

		except Exception as e:
		return {"sessions": [], "error": str(e)}

		with open("/tmp/lifecycle.json", "w") as f:
		f.write(lifecycle_config)

feat(durable): Add durable session persistence layer for long-horizon agents #4351

Are you sure you want to change the base?

feat(durable): Add durable session persistence layer for long-horizon agents #4351

Uh oh!

Conversation

caohy1988 commented Feb 2, 2026

Summary

Key Features

Implementation Highlights

Files Added

Live Demo

Test plan

Design Document

Uh oh!

google-cla bot commented Feb 2, 2026

Uh oh!

adk-bot commented Feb 2, 2026

Uh oh!

gemini-code-assist bot commented Feb 2, 2026

Summary of Changes

Highlights

Footnotes

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Feb 2, 2026

Choose a reason for hiding this comment

Uh oh!

haiyuan-eng-google Feb 2, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Feb 2, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Feb 2, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Feb 2, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Feb 2, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Feb 2, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Feb 2, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Feb 2, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants