
[Bug]: GLM gateway sessions can undercount request size, overflow late, and persist guessed fallback context limits #2599

@paraddox

Description


Bug Description

Long, tool-heavy gateway sessions using glm-5-turbo can hit provider-side context overflow even when Hermes believes the request is still under the compaction threshold.

This appears to be a combination of three related problems:

  1. Hermes can undercount the real request size before the API call by reasoning mainly from the conversation transcript while the actual payload also includes large tool schemas.
  2. Z.AI returns only a generic overflow message, Prompt exceeds max length, which needs to be treated as a context-overflow signal.
  3. If Hermes steps down to a lower fallback tier after a generic overflow, that guessed lower tier can end up influencing future behavior more than it should unless it is clearly treated as a temporary fallback.

There is also a related gateway UX issue:

  • the post-compaction token number can be misleading if it reflects only the stripped transcript rather than the full next request payload.
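Problem 1 can be sketched as follows. This is a minimal illustration, not Hermes's actual estimator: the helper name and the 4-characters-per-token heuristic are assumptions, but it shows why counting only the transcript undercounts the real payload.

```python
import json
import math

def estimate_request_tokens(messages, tools, chars_per_token=4):
    """Rough full-payload token estimate (hypothetical helper).

    A transcript-only estimate misses the tool schemas that are also
    serialized into every request; counting both narrows the gap between
    the preflight estimate and what the provider actually sees. The
    chars_per_token=4 divisor is a coarse heuristic, not a tokenizer.
    """
    payload = json.dumps(messages) + json.dumps(tools)
    return math.ceil(len(payload) / chars_per_token)

messages = [{"role": "user", "content": "patch the gateway module"}]
tools = [{"name": "read_file",
          "parameters": {"type": "object",
                         "properties": {"path": {"type": "string"}}}}]

# The tool schemas ride along on every request, so the full-request
# estimate is strictly larger than the transcript-only one:
transcript_only = estimate_request_tokens(messages, [])
full_request = estimate_request_tokens(messages, tools)
```

With dozens of large tool schemas, the gap between `transcript_only` and `full_request` is exactly the headroom that evaporates in long sessions.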

Steps to Reproduce

  1. Configure Hermes to use:
    • provider: zai
    • model: glm-5-turbo
    • base URL: https://api.z.ai/api/coding/paas/v4
  2. Use Hermes via a gateway platform (I observed this on Discord DM) in one long session with many tool calls, file reads, patches, and searches.
  3. Keep working in the same session until Hermes starts reporting context pressure / compaction pressure.
  4. Continue the same session.
  5. Observe that the provider can still reject the request with:
    • Error code: 400 - {'error': {'code': '1261', 'message': 'Prompt exceeds max length'}}

Expected Behavior

  • Hermes should compact based on the full request shape it actually sends, including tool schemas.
  • Provider-specific overflow messages like Prompt exceeds max length should trigger context-overflow recovery.
  • Temporary fallback step-downs should not be treated as confirmed provider limits unless the provider actually reported a numeric limit.
  • Gateway post-compaction reporting should describe the real next-request estimate, not only the stripped transcript.
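The second bullet could look something like the sketch below. The marker table and function name are hypothetical, not Hermes's existing error-handling hooks; the Z.AI strings come from the traceback in this report.

```python
# Hypothetical provider-specific overflow detection. Markers are
# matched case-insensitively against the raw error body.
OVERFLOW_MARKERS = {
    # Z.AI / GLM: generic HTTP 400 with code 1261
    "zai": ("prompt exceeds max length", "'code': '1261'"),
}

def is_context_overflow(provider: str, status_code: int, error_body: str) -> bool:
    """Treat known provider-specific messages as context-overflow signals."""
    if status_code != 400:
        return False
    body = error_body.lower()
    return any(marker in body for marker in OVERFLOW_MARKERS.get(provider, ()))

body = "{'error': {'code': '1261', 'message': 'Prompt exceeds max length'}}"
# is_context_overflow("zai", 400, body) → True
```

Keying the table by provider keeps the generic GLM string from being misread as an overflow signal for providers that use the same wording for unrelated 400s.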

Actual Behavior

  • Hermes can believe a session is still below the compaction threshold, but the provider rejects the next request anyway.
  • A generic GLM overflow can push Hermes toward a lower fallback context tier.
  • Gateway compaction output can be misleading. Example shape:

    Session is large (171 messages, ~124,701 tokens). Auto-compressing...
    Compressed: 171 → 7 messages, ~124,701 → ~402 tokens

    That post-compaction ~402 tokens number does not reflect the full next request payload.
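A post-compaction report that reflects the next request rather than the stripped transcript might be built like this. The helper names and the chars-per-token heuristic are illustrative assumptions, not the gateway's actual reporting code.

```python
import json
import math

def compaction_report(before_msgs, after_msgs, tools, chars_per_token=4):
    """Build a post-compaction line describing the *next request*,
    not just the stripped transcript (hypothetical helper)."""
    def toks(msgs, extra):
        # Serialize both the messages and any extra payload (tool
        # schemas) so the estimate matches what will be sent.
        return math.ceil((len(json.dumps(msgs)) + len(json.dumps(extra)))
                         / chars_per_token)

    transcript = toks(after_msgs, [])
    next_request = toks(after_msgs, tools)  # tool schemas ride along
    return (f"Compressed: {len(before_msgs)} → {len(after_msgs)} messages; "
            f"next request ≈ {next_request} tokens "
            f"(transcript alone ≈ {transcript})")
```

Reporting both numbers makes it obvious when a tiny post-compaction transcript still produces a large next request.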

Affected Component

Gateway (Telegram/Discord/Slack/WhatsApp), Agent Core (conversation loop, context compression, memory)

Messaging Platform (if gateway-related)

Discord

Operating System

Ubuntu 25.10

Python Version

3.11.13

Hermes Version

0.4.0

Relevant Logs / Traceback

Error code: 400 - {'error': {'code': '1261', 'message': 'Prompt exceeds max length'}}

Root Cause Analysis (optional)

Observed contributing factors:

# | Issue | Area
1 | Real request size can be underestimated when tool schemas are large | run_agent.py preflight/request estimation
2 | Z.AI overflow string is generic and needs explicit handling | provider-specific context-overflow detection
3 | Fallback step-downs can be confused with confirmed provider metadata if not handled carefully | context-length caching / probing
4 | Gateway post-compaction reporting can describe transcript-only size instead of full request size | gateway session hygiene messaging

Proposed Fix (optional)

  • Estimate the full request payload for compaction decisions, not just the transcript.
  • Treat Z.AI's Prompt exceeds max length error as a context-overflow signal.
  • Only persist provider-confirmed numeric context limits.
  • Keep guessed fallback step-downs temporary unless later confirmed.
  • Make gateway post-compaction reporting use a full-request estimate.
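The third and fourth bullets amount to tagging every cached limit with where it came from. A minimal sketch, assuming a hypothetical cache record (field names are illustrative, not Hermes's actual schema):

```python
from dataclasses import dataclass

@dataclass
class ContextLimit:
    tokens: int
    source: str  # "provider" = numeric limit reported by the API;
                 # "fallback" = guessed step-down after a generic overflow

    def should_persist(self) -> bool:
        # Only provider-confirmed numeric limits survive the session;
        # guessed step-downs stay in memory and expire with it.
        return self.source == "provider"

def step_down_after_overflow(current_tokens: int) -> ContextLimit:
    """Guess a lower tier after a generic overflow, flagged as temporary.
    Halving is an arbitrary illustrative policy."""
    return ContextLimit(tokens=current_tokens // 2, source="fallback")
```

Because the fallback record can never pass `should_persist()`, a single generic overflow cannot silently rewrite the cached limit for all future sessions.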

Are you willing to submit a PR for this?

  • I'd like to fix this myself and submit a PR

Metadata

Assignees: no one assigned
Labels: bug (Something isn't working)
Milestone: no milestone