
Modernize #25

Draft

JarbasAl wants to merge 10 commits into dev from modernize

Conversation


@JarbasAl JarbasAl commented Mar 18, 2026

Summary by CodeRabbit

  • New Features

    • Added support for multiple API-compatible endpoints (Anthropic, Gemini, Cohere, HuggingFace TGI, AWS Bedrock).
    • Introduced A2A (Agent-to-Agent) endpoint for integration with other agents.
    • Added embeddings support across multiple providers.
    • Added /models endpoint for model listing.
    • Enabled manual workflow triggering via workflow_dispatch.
  • Documentation

    • Comprehensive API compatibility, A2A, streaming, embeddings, and deprecation guides.
    • Enhanced README with quick start and troubleshooting.
    • Added FAQ, audit, maintenance, and suggestions documentation.
  • Chores

    • Modernized build system (setup.py → pyproject.toml).
    • Consolidated CI/CD workflows.
  • Tests

    • Added A2A and compatibility router test suites.

JarbasAl and others added 4 commits March 17, 2026 23:04
- Add lint, build_tests, license_tests, pip_audit workflows (@dev)
- Add docs/, QUICK_FACTS.md, FAQ.md, AUDIT.md, SUGGESTIONS.md, MAINTENANCE_REPORT.md
- Migrate setup.py -> pyproject.toml

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds Anthropic Claude, Google Gemini, Cohere, HuggingFace TGI, and AWS
Bedrock-compatible endpoints, plus OpenAI /models & /embeddings stubs
and Ollama /api/show, /api/ps, /api/pull, /api/push, /api/embeddings.

New files:
- schemas/anthropic.py, schemas/gemini.py: Pydantic schemas with Field constraints
- anthropic.py: POST /v1/messages with SSE streaming (Anthropic format)
- gemini.py: POST /v1beta/models/{id}:generateContent + :streamGenerateContent
- cohere.py: POST /v1/chat, /v1/generate, /v1/embed
- huggingface_tgi.py: POST /generate, /generate_stream, GET /info, /health
- aws_bedrock.py: POST /model/{id}/invoke, /invoke-with-response-stream,
  /model/{id}/converse; multi-model response normalisation by model prefix

Modified files:
- chat.py: add GET /v1/models, POST /v1/embeddings; fix response_model=None
  on streaming routes; fix Python 3.11 f-string nested dict literals
- ollama.py: add GET /api/show, /api/ps; POST /api/pull, /api/push,
  /api/embeddings; typed Pydantic request schemas; response_model=None

test/unittests/test_compat_routers.py: 29 tests, all passing

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
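The OpenAI-compatible /v1/embeddings addition mentioned above returns the standard list-of-embeddings payload. A minimal sketch of that response shape (helper name and `model` default are hypothetical; the per-item dict matches the shape used in the review comments below):

```python
def embeddings_response(vectors, model="persona"):
    """Build an OpenAI-style embeddings payload from raw vectors."""
    return {
        "object": "list",
        "data": [
            # One entry per input text, index preserving request order
            {"object": "embedding", "embedding": vec, "index": i}
            for i, vec in enumerate(vectors)
        ],
        "model": model,
        "usage": {"prompt_tokens": 0, "total_tokens": 0},
    }

resp = embeddings_response([[0.1, 0.2], [0.3, 0.4]])
```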
Each compat router now mounts under a prefix that identifies the API vendor:
  /v1/...                OpenAI (chat_router — unchanged)
  /api/...               Ollama (ollama_router — unchanged)
  /anthropic/v1/...      Anthropic Claude  (was /v1 — conflicted with OpenAI)
  /gemini/v1beta/...     Google Gemini
  /cohere/v1/...         Cohere  (was /v1 — conflicted with OpenAI /v1/chat)
  /tgi/...               HuggingFace TGI
  /bedrock/model/...     AWS Bedrock  (was /model — now scoped)

Benefits: all routers coexist in one FastAPI app with no path conflicts;
Swagger UI groups endpoints clearly by vendor tag + prefix.

Fix Python 3.11 f-string nested dict literals in chat.py (used temp vars).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
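The temp-var workaround referenced in the commit addresses a Python 3.11 limitation: before 3.12 (PEP 701), an f-string expression cannot reuse the outer quote character, so dict literals with quoted keys cannot be inlined. A sketch of the pattern (payload contents are illustrative):

```python
import json

payload = {"type": "error", "error": {"message": "boom"}}

# On Python 3.11, something like
#   f"data: {{"type": "error"}}"
# is a SyntaxError because the inner double quotes terminate the f-string.
# Workaround: build the value in a temporary variable first.
body = json.dumps(payload)
line = f"data: {body}\n\n"
```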
…cy paths with deprecation

OpenAI and Ollama routers now mount under vendor-namespaced prefixes:
  /openai/v1/...   (was /v1/...)
  /ollama/api/...  (was /api/...)

Legacy paths (/v1/... and /api/...) are preserved via register_deprecated_routes()
which re-mounts the same route handlers at the old paths with deprecated=True.

An HTTP middleware (add_deprecation_middleware()) injects response headers on
every request to a legacy path:
  Deprecation: true
  Link: </openai/v1/...>; rel="successor-version"

This gives clients a clear migration signal without breaking existing integrations.
New file: ovos_persona_server/deprecated_routers.py

test: 31 passing (4 new tests for deprecated path behaviour)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
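The header-injection behaviour described in this commit can be sketched as a pure-Python helper (names are assumptions; in the real code, add_deprecation_middleware() applies these headers inside a FastAPI HTTP middleware):

```python
# Legacy-to-successor prefix map, mirroring the commit message
LEGACY_PREFIXES = {"/v1": "/openai/v1", "/api": "/ollama/api"}

def deprecation_headers(path: str) -> dict:
    """Return the headers a response on a legacy path should carry."""
    for old, new in LEGACY_PREFIXES.items():
        if path.startswith(old):
            successor = new + path[len(old):]
            return {
                "Deprecation": "true",
                "Link": f'<{successor}>; rel="successor-version"',
            }
    return {}  # non-legacy paths get no extra headers

headers = deprecation_headers("/v1/chat/completions")
```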
coderabbitai bot commented Mar 18, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 724f8bb7-4658-46b6-b918-8619ed4d65f6

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.



github-actions bot commented Mar 18, 2026

The bots have finished their work. Take a look! 🤖

I've aggregated the results of the automated checks for this PR below.

🔍 Lint

Another piece of the puzzle! 🧩

ruff: issues found — see job log

🔨 Build Tests

Everything is bolted down and ready to go. 🔩

✅ All versions pass

| Python | Build | Install |
| --- | --- | --- |
| 3.10 | ✅ | ✅ |
| 3.11 | ✅ | ✅ |
| 3.12 | ✅ | ✅ |
| 3.13 | ✅ | ✅ |
| 3.14 | ✅ | ✅ |

⚖️ License Check

Verifying the source of all binary files. 💾

✅ No license violations found (52 packages).

License distribution: 14× MIT, 11× MIT License, 6× Apache Software License, 5× Apache-2.0, 3× BSD-3-Clause, 2× ISC License (ISCL), 2× PSF-2.0, 2× Python Software Foundation License, +7 more

Full breakdown — 52 packages
| Package | Version | License | URL |
| --- | --- | --- | --- |
| annotated-doc | 0.0.4 | MIT | link |
| annotated-types | 0.7.0 | MIT License | link |
| anyio | 4.12.1 | MIT | link |
| audioop-lts | 0.2.2 | PSF-2.0 | link |
| build | 1.4.0 | MIT | link |
| certifi | 2026.2.25 | Mozilla Public License 2.0 (MPL 2.0) | link |
| charset-normalizer | 3.4.6 | MIT | link |
| click | 8.3.1 | BSD-3-Clause | link |
| combo_lock | 0.3.0 | Apache Software License | link |
| fastapi | 0.135.1 | MIT | link |
| filelock | 3.25.2 | MIT | link |
| idna | 3.11 | BSD-3-Clause | link |
| importlib_metadata | 8.7.1 | Apache-2.0 | link |
| json-database | 0.10.1 | MIT | link |
| kthread | 0.2.3 | MIT License | link |
| langcodes | 3.5.1 | MIT License | link |
| markdown-it-py | 4.0.0 | MIT License | link |
| mdurl | 0.1.2 | MIT License | link |
| memory-tempfile | 2.2.3 | MIT License | link |
| ovos-config | 2.1.1 | Apache-2.0 | link |
| ovos-persona-server | 0.5.1a2 | Apache-2.0 | link |
| ovos-plugin-manager | 2.2.0 | Apache-2.0 | link |
| ovos-utils | 0.8.5 | Apache-2.0 | link |
| ovos_bus_client | 1.5.0 | Apache Software License | link |
| ovos_openai_plugin | 2.0.6 | MIT | link |
| ovos_persona | 0.7.1 | Apache License 2.0 | link |
| ovos_solver_failure_plugin | 0.0.3 | MIT | link |
| packaging | 26.0 | Apache-2.0 OR BSD-2-Clause | link |
| pexpect | 4.9.0 | ISC License (ISCL) | link |
| ptyprocess | 0.7.0 | ISC License (ISCL) | link |
| pydantic | 2.12.5 | MIT | link |
| pydantic_core | 2.41.5 | MIT | link |
| pyee | 12.1.1 | MIT License | link |
| Pygments | 2.19.2 | BSD License | link |
| pyproject_hooks | 1.2.0 | MIT License | link |
| python-dateutil | 2.9.0.post0 | Apache Software License; BSD License | link |
| PyYAML | 6.0.3 | MIT License | link |
| quebra-frases | 0.3.7 | Apache Software License | link |
| regex | 2026.2.28 | Apache-2.0 AND CNRI-Python | link |
| requests | 2.32.5 | Apache Software License | link |
| rich | 13.9.4 | MIT License | link |
| rich-click | 1.9.7 | MIT License | link |
| six | 1.17.0 | MIT License | link |
| standard-aifc | 3.13.0 | Python Software Foundation License | link |
| standard-chunk | 3.13.0 | Python Software Foundation License | link |
| starlette | 0.52.1 | BSD-3-Clause | link |
| typing-inspection | 0.4.2 | MIT | link |
| typing_extensions | 4.15.0 | PSF-2.0 | link |
| urllib3 | 2.6.3 | MIT | link |
| watchdog | 6.0.0 | Apache Software License | link |
| websocket-client | 1.9.0 | Apache Software License | link |
| zipp | 3.23.0 | MIT | link |

Policy: Apache 2.0 (universal donor). StrongCopyleft / NetworkCopyleft / WeakCopyleft / Other / Error categories fail. MPL allowed.

🔒 Security (pip-audit)

I've checked the security of our API endpoints. 🌐

✅ No known vulnerabilities found (71 packages scanned).


The automation never sleeps, but I might reboot. 💤

@JarbasAl JarbasAl marked this pull request as draft March 18, 2026 16:34
Adds docs/api-compatibility.md (7 APIs + 2 deprecated, curl examples),
docs/deprecation.md, docs/streaming.md, docs/embeddings.md,
docs/bedrock-models.md. Updates FAQ.md (20+ Q&As), QUICK_FACTS.md,
AUDIT.md, SUGGESTIONS.md, MAINTENANCE_REPORT.md.

AI-Generated Change:
- Model: claude-sonnet-4-6
- Intent: document all seven canonical and two deprecated API layers exhaustively
- Impact: docs/ fully populated, FAQ.md 20+ entries
- Verified via: uv run pytest test/ -v (41 passed)
New TestOllamaRouter: tags, chat, show, ps, pull, embeddings (501).
New TestDeprecatedOllamaRouter: /api/... returns 200 + Deprecation header.
Additional Bedrock converse structure test.

AI-Generated Change:
- Model: claude-sonnet-4-6
- Intent: validate Ollama router endpoints and /api/... deprecation headers
- Impact: 31 → 41 tests
- Verified via: uv run pytest test/ -v (41 passed)
AI-Generated Change:
- Model: claude-sonnet-4-6
- Intent: Expose OVOS persona via A2A protocol; implement AgentExecutor that bridges Persona.stream() (sync) to async A2A event queue via asyncio.to_thread
- Impact: New a2a.py with OVOSPersonaAgentExecutor, _agent_card, create_a2a_application; a2a-sdk is optional
- Verified via: 13 unit tests, 95% coverage
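The sync-to-async bridge described in this change can be sketched as follows (a minimal illustration, assuming Persona.stream() is a blocking generator; the stand-in generator and chunk values are hypothetical):

```python
import asyncio

def blocking_stream():
    # Stand-in for the synchronous Persona.stream() generator
    yield from ["Hello", ", ", "world"]

async def stream_chunks(gen):
    """Yield from a blocking generator without stalling the event loop."""
    it = iter(gen)
    sentinel = object()
    while True:
        # Each next() call runs on a worker thread via asyncio.to_thread,
        # so the event loop stays responsive between chunks.
        chunk = await asyncio.to_thread(next, it, sentinel)
        if chunk is sentinel:
            break
        yield chunk

async def main():
    return [c async for c in stream_chunks(blocking_stream())]

chunks = asyncio.run(main())
```

This preserves per-chunk streaming, unlike draining the whole generator up front.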
AI-Generated Change:
- Model: claude-sonnet-4-6
- Intent: Mount A2A Starlette sub-app at /a2a when --a2a-base-url is provided; add optional [a2a] dependency
- Impact: create_persona_app gains a2a_base_url param; __main__.py exposes --a2a-base-url flag; pyproject.toml adds [a2a] and [dev] extras
… FAQ

AI-Generated Change:
- Model: claude-sonnet-4-6
- Intent: Full test coverage without live A2A server; document new A2A endpoint for operators
- Impact: 13 tests passing; docs/a2a.md and FAQ.md updated
- Verified via: pytest test/unittests/test_a2a.py -v --cov=ovos_persona_server.a2a
AI-Generated Change:
- Model: claude-sonnet-4-6
- Intent: Rewrite README with all 8 API surfaces, A2A section, persona config examples, troubleshooting; update docs/index.md with A2A entry; expand docs/a2a.md with endpoints, curl examples, streaming, multi-turn, architecture; update FAQ.md with A2A section
- Impact: README.md rewritten; docs/index.md and docs/a2a.md updated; FAQ.md expanded
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 7

Note

Due to the large number of review comments, Critical, Major severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
ovos_persona_server/ollama.py (1)

188-188: ⚠️ Potential issue | 🟠 Major

Change streaming response media type to application/x-ndjson.

The Ollama API specifies that streaming responses must use the application/x-ndjson content type (newline-delimited JSON). Using application/json misrepresents the actual stream format and can cause compatible clients to fail or buffer until EOF. (docs.ollama.com/api/streaming)

♻️ Proposed fix
-    return StreamingResponse(streaming_ollama_chat_response(), media_type="application/json")
+    return StreamingResponse(streaming_ollama_chat_response(), media_type="application/x-ndjson")
-    return StreamingResponse(streaming_ollama_generate_response(), media_type="application/json")
+    return StreamingResponse(streaming_ollama_generate_response(), media_type="application/x-ndjson")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ovos_persona_server/ollama.py` at line 188, The response uses the wrong
content type; update the StreamingResponse in the function that returns the
streaming Ollama chat (the return that calls streaming_ollama_chat_response())
to use media_type "application/x-ndjson" instead of "application/json" so the
newline-delimited JSON stream is correctly advertised to clients.
🟡 Minor comments (14)
ovos_persona_server/aws_bedrock.py-174-186 (1)

174-186: ⚠️ Potential issue | 🟡 Minor

Same sync-in-async streaming concern as TGI router.

The synchronous persona.stream(messages) iteration in an async generator (line 177) can block the event loop. Consider the same executor-based approach suggested for the TGI router.

Additionally, the blind Exception catch at line 181 would benefit from logging.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ovos_persona_server/aws_bedrock.py` around lines 174 - 186, The async
generator _stream currently iterates the synchronous persona.stream(messages)
directly (blocking) and swallows exceptions; fix by moving the synchronous
streaming into a short sync helper that collects chunks and returns them (e.g.,
def _collect_stream(persona, messages): accumulated = []; for chunk in
persona.stream(messages): if chunk: accumulated.append(chunk); return
accumulated), then call it inside the async _stream with
loop.run_in_executor(None, _collect_stream, persona, messages) to avoid blocking
the event loop; after receiving the collected chunks, yield the same data events
as before; replace the bare except Exception with logging (e.g.,
logger.exception or logger.error with exc info) and still yield the error event
so failures are both logged and sent to the client.
ovos_persona_server/huggingface_tgi.py-93-109 (1)

93-109: ⚠️ Potential issue | 🟡 Minor

Synchronous iteration in async generator may block event loop.

persona.stream(messages) appears to be a synchronous generator. Iterating it directly in an async generator (line 97) will block the event loop during each next() call, degrading concurrency under load.

Consider wrapping with asyncio.to_thread or running in an executor if persona.stream performs blocking I/O.

💡 Potential fix using run_in_executor
+import asyncio
+from functools import partial

 async def _stream() -> AsyncGenerator[str, None]:
     accumulated = []
     token_count = 0
+    loop = asyncio.get_event_loop()
     try:
-        for chunk in persona.stream(messages):
+        stream_iter = iter(persona.stream(messages))
+        while True:
+            chunk = await loop.run_in_executor(None, next, stream_iter, None)
+            if chunk is None:
+                break
             if chunk:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ovos_persona_server/huggingface_tgi.py` around lines 93 - 109, The async
generator _stream currently iterates persona.stream(messages) synchronously
which can block the event loop; change the iteration to run the blocking
generator on a worker thread or executor (e.g., use asyncio.to_thread or
loop.run_in_executor) so each blocking next() on persona.stream(messages) is
performed off the event loop, then await results and continue building
accumulated/token_count and yielding the same data events; ensure exception
handling for the worker task preserves the existing error yield behavior.
pyproject.toml-8-8 (1)

8-8: ⚠️ Potential issue | 🟡 Minor

Stale description: Flask → FastAPI.

The description says "simple flask server" but the server uses FastAPI (also noted in AUDIT.md). Update to reflect the actual framework.

📝 Proposed fix
-description = "simple flask server to host OpenVoiceOS persona plugins as a service"
+description = "FastAPI server to host OpenVoiceOS persona plugins as a service"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pyproject.toml` at line 8, Update the package description string in
pyproject.toml to reflect the actual framework: change the description field
value that currently reads "simple flask server to host OpenVoiceOS persona
plugins as a service" to mention FastAPI instead (e.g., "simple FastAPI server
to host OpenVoiceOS persona plugins as a service") so the description matches
the implementation and AUDIT.md; edit the description = "... " entry
accordingly.
.github/workflows/notify_matrix.yml-10-10 (1)

10-10: ⚠️ Potential issue | 🟡 Minor

Align workflow ref with other workflows: use @dev instead of @master.

This workflow uses @master while all other workflows in the repository use @dev (release_workflow.yml, publish_stable.yml, pip_audit.yml, lint.yml, license_tests.yml, build_tests.yml). Either update this to @dev for consistency or add a comment explaining why this workflow requires a different ref.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/notify_matrix.yml at line 10, Update the workflow action
reference to match other repo workflows by changing the uses ref in the
notify_matrix.yml step from
OpenVoiceOS/gh-automations/.github/workflows/notify-matrix.yml@master to ...@dev
(or, if there is a deliberate reason to keep `@master`, add a brief inline comment
above that uses line explaining why this workflow needs `@master`); specifically
edit the line containing the uses string so it ends with `@dev` or add the
explanatory comment directly above it.
QUICK_FACTS.md-3-4 (1)

3-4: ⚠️ Potential issue | 🟡 Minor

Include the optional A2A surface in this inventory.

This page reads like the package/API quick reference, but it currently omits the conditional /a2a mount, the a2a_base_url parameter on create_persona_app, the --a2a-base-url CLI flag, and the added test/unittests/test_a2a.py coverage. Either document the A2A surface here or explicitly scope the page to the seven compatibility routers only.

Also applies to: 17-18, 28-29, 30-64

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@QUICK_FACTS.md` around lines 3 - 4, Update the QUICK_FACTS.md inventory to
include the optional A2A surface: mention the conditional /a2a mount, document
the a2a_base_url parameter on create_persona_app and the --a2a-base-url CLI
flag, and note that test/unittests/test_a2a.py adds coverage for this surface;
alternatively, explicitly state that the page only documents the seven
compatibility routers and exclude A2A—be sure to reference the
create_persona_app option and CLI flag names so readers can find the conditional
behavior.
docs/embeddings.md-5-11 (1)

5-11: ⚠️ Potential issue | 🟡 Minor

This intro contradicts the table.

Line 5 says there are “three endpoints across two routers”, but Lines 9-11 list chat_router, ollama_router, and cohere_router. This should say three routers.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/embeddings.md` around lines 5 - 11, Update the first sentence in the
embeddings intro to say "three endpoints across three routers" (not "two
routers") so it matches the table listing chat_router, ollama_router, and
cohere_router; verify the surrounding text that references
persona.solvers.loaded_modules and the get_embeddings mechanism remains accurate
and consistent with ovos_persona_server/chat.py, ovos_persona_server/ollama.py
and ovos_persona_server/cohere.py.
docs/streaming.md-13-20 (1)

13-20: ⚠️ Potential issue | 🟡 Minor

Mark the example fences with a language.

markdownlint is already flagging these examples (MD040), and Line 13 also trips MD038 because of the trailing space inside the data: code span. Adding text/json fence labels and trimming that code span will keep the page lint-clean.

Also applies to: 37-47, 61-79, 93-96, 110-114, 128-132, 146-150

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/streaming.md` around lines 13 - 20, The fenced examples showing
streaming chunks use unlabeled code fences and include a trailing space in the
inline code span `data: `; update each example fence to include an explicit
language label (e.g., ```text or ```json) and remove the trailing space inside
the `data:` code span so it reads `data:` to satisfy markdownlint (MD040 and
MD038). Apply the same change to all similar examples in the document (the other
chunked examples referenced) so every fenced block is labeled and no `data:`
code spans contain trailing spaces.
docs/api-compatibility.md-3-3 (1)

3-3: ⚠️ Potential issue | 🟡 Minor

The intro over-generalizes model handling.

Line 3 says the request model is ignored and the persona name is used in all responses, but the Anthropic section below documents "model": "<requested model>". Please narrow the global claim or move model/auth behavior into provider-specific notes so the page does not contradict itself.

Also applies to: 118-127

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/api-compatibility.md` at line 3, The intro over-generalizes how the
request `model` and auth headers are handled; update the intro sentence about
the `model` field and auth headers so it no longer states they are universally
ignored, and instead say that persona-backed behavior varies by provider and
refer readers to provider-specific sections (e.g., the Anthropic section which
documents `"model": "<requested model>"`) for exact behavior; alternatively,
remove the blanket claim and move the note about `model`/auth handling into each
provider subsection (including the Anthropic subsection and the section covering
lines referenced previously) so the global intro does not contradict
provider-specific docs.
docs/deprecation.md-22-25 (1)

22-25: ⚠️ Potential issue | 🟡 Minor

Annotate this header example with a language.

This fence is currently tripping markdownlint MD040. Adding text is enough here.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/deprecation.md` around lines 22 - 25, The fenced code block in the
header example is missing a language tag and trips markdownlint MD040; update
the triple-backtick fence that contains "Deprecation: true\nLink:
</openai/v1/chat/completions>; rel=\"successor-version\"" by adding a language
identifier (e.g., text) after the opening ``` so the block becomes ```text,
leaving the content unchanged.
README.md-254-257 (1)

254-257: ⚠️ Potential issue | 🟡 Minor

Missing closing backticks for code block.

The code block starting at line 254 is missing its closing triple backticks before the "Then restart the server" line.

📝 Proposed fix
 **A2A endpoint not available after starting with `--a2a-base-url`**
 `a2a-sdk` is not installed. Install it:
 ```bash
 uv pip install 'ovos-persona-server[a2a]'
+```
 Then restart the server.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@README.md` around lines 254 - 257, The code block in README.md that begins
with "uv pip install 'ovos-persona-server[a2a]'" is missing its closing triple
backticks; update the README.md by adding a closing ``` after that line so the
"Then restart the server." text is rendered outside the code block and the
fenced code block is properly terminated.
ovos_persona_server/anthropic.py-38-43 (1)

38-43: ⚠️ Potential issue | 🟡 Minor

Potential AttributeError when flattening content blocks.

If msg.content is a list but contains blocks without a text attribute (e.g., image blocks in future expansions), the generator expression will raise AttributeError. Consider adding defensive handling.

🛡️ Proposed fix to handle missing text attribute
         else:
-            content = " ".join(block.text for block in msg.content)
+            content = " ".join(
+                block.text for block in msg.content if hasattr(block, "text") and block.text
+            )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ovos_persona_server/anthropic.py` around lines 38 - 43, The loop that
flattens message content (iterating request.messages and using the generator
expression to join block.text) can raise AttributeError if a block lacks a text
attribute; update the handling inside that loop (where msg.content is treated as
a list) to defensively iterate blocks and use a safe extraction (e.g., check
hasattr(block, "text") or isinstance(block, str) and skip or substitute an empty
string for non-text blocks) before joining, then append the resulting safe
content in messages.append({"role": msg.role, "content": content}) so missing
text attributes no longer raise.
ovos_persona_server/anthropic.py-105-107 (1)

105-107: ⚠️ Potential issue | 🟡 Minor

Exception details exposed in SSE error event.

The raw exception message is interpolated directly into the SSE payload. This could leak internal implementation details or sensitive information to clients.

🛡️ Proposed fix to sanitize error output
         except Exception as exc:
-            yield f"event: error\ndata: {{\"type\":\"error\",\"error\":{{\"message\":\"{exc}\"}}}}\n\n"
+            yield f"event: error\ndata: {{\"type\":\"error\",\"error\":{{\"message\":\"Internal server error\"}}}}\n\n"
             return

Alternatively, log the actual exception server-side for debugging while returning a generic message to clients.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ovos_persona_server/anthropic.py` around lines 105 - 107, The SSE error
handler currently interpolates the raw exception (exc) into the SSE payload in
the except block; instead, log the full exception server-side (use
logger.exception or similar in the same scope) and change the yielded SSE event
produced by the except handler (the yield that references exc) to a sanitized
generic error message (e.g., "Internal server error" or a short error code) that
does not expose internal details; keep the exception variable name exc and the
yield location so you update the exact generator/except block accordingly.
ovos_persona_server/cohere.py-85-87 (1)

85-87: ⚠️ Potential issue | 🟡 Minor

Unused variable gen_id in non-streaming chat path.

gen_id is generated on line 86 but never used in the non-streaming response (lines 94-105). Consider removing it or using it in the response if intended.

🧹 Proposed fix to remove unused variable
     msg_id = _new_id()
-    gen_id = _new_id()

     if not request.stream:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ovos_persona_server/cohere.py` around lines 85 - 87, Remove the unused gen_id
assignment or wire it into the non-streaming response; specifically, delete the
gen_id = _new_id() line in ovos_persona_server/cohere.py (or, if gen_id was
intended to identify the generated response, include gen_id in the response
payload/metadata where msg_id is used in the non-streaming path) so there are no
unused variables; update any response construction code that should carry a
generation id to reference gen_id instead of leaving it unused.
ovos_persona_server/chat.py-388-398 (1)

388-398: ⚠️ Potential issue | 🟡 Minor

Missing error handling for get_embeddings call.

If embed_solver.get_embeddings(t) raises an exception, it will propagate as an unhandled 500 error without a clear message. Consider wrapping in try/except for consistency with other endpoints.

🛡️ Proposed fix to add error handling
     texts = request_body.input if isinstance(request_body.input, list) else [request_body.input]
     data = []
-    for i, t in enumerate(texts):
-        vec = embed_solver.get_embeddings(t)
-        data.append({"object": "embedding", "embedding": vec, "index": i})
+    try:
+        for i, t in enumerate(texts):
+            vec = embed_solver.get_embeddings(t)
+            data.append({"object": "embedding", "embedding": vec, "index": i})
+    except Exception as exc:
+        raise HTTPException(
+            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
+            detail=f"Embeddings generation failed: {exc}",
+        ) from exc

     return JSONResponse({
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ovos_persona_server/chat.py` around lines 388 - 398, The call to
embed_solver.get_embeddings in the loop can raise and is unhandled; wrap the
per-text embedding call inside a try/except (around
embed_solver.get_embeddings(t) within the loop over texts) to catch exceptions,
log the error, and return a JSONResponse with an error message and appropriate
status code (e.g., 500) instead of letting the exception propagate; ensure you
still build and return the successful "object":"list" response when no error
occurs and reference the existing variables/functions
embed_solver.get_embeddings, texts, data, and the JSONResponse return path so
the handler consistently mirrors error handling used elsewhere.
🧹 Nitpick comments (16)
ovos_persona_server/aws_bedrock.py (2)

245-248: Token usage computed twice with same expression.

sum(len(m["content"].split()) for m in messages) is evaluated twice (lines 246 and 248). Extract to a local variable for clarity and efficiency.

♻️ Proposed fix
     text = persona.chat(messages)
+    input_tokens = sum(len(m["content"].split()) for m in messages)
+    output_tokens = len((text or "").split())
     return JSONResponse({
         "output": {
             "message": {
                 "role": "assistant",
                 "content": [{"text": text or ""}],
             }
         },
         "stopReason": "end_turn",
         "usage": {
-            "inputTokens": sum(len(m["content"].split()) for m in messages),
-            "outputTokens": len((text or "").split()),
-            "totalTokens": sum(len(m["content"].split()) for m in messages) + len((text or "").split()),
+            "inputTokens": input_tokens,
+            "outputTokens": output_tokens,
+            "totalTokens": input_tokens + output_tokens,
         },
     })
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ovos_persona_server/aws_bedrock.py` around lines 245 - 248, The token counts
for input are computed twice in the "usage" dict (the expression
sum(len(m["content"].split()) for m in messages) is duplicated); extract that
value to a local variable (e.g., input_tokens = sum(...) computed once before
building the dict) and then use input_tokens for "inputTokens" and to compute
"totalTokens" (add len((text or "").split()) to input_tokens) so the computation
is done once and reused in the creation of the "usage" object.

37-61: Format detection relies on body fields rather than model_id prefix.

As noted in SUGGESTIONS.md item 5, checking "messages" in body first (line 38) could cause misparsing if a request targeting a Titan model accidentally includes a messages key. The model_id parameter is available but not used for disambiguation.

Consider checking model_id prefix first for known families before falling back to body field heuristics.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ovos_persona_server/aws_bedrock.py` around lines 37 - 61, The parsing logic
currently assumes Converse format whenever "messages" is in body (using
variables body, msgs, result), which can misclassify requests for specific model
families; update the code that branches on request format to first inspect the
model_id parameter (e.g., check model_id.startswith for known prefixes like
"titan", "claude", "gpt" or your project's canonical family strings) to select
the correct format handler, and only fall back to the existing body-based
heuristics (the current messages/system parsing logic) if model_id is absent or
unrecognized; keep the existing handling of body["system"] and content list
merging (sys_text, text) but gate that block behind the model_id-based decision.
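A minimal sketch of the suggested dispatch order, checking the model family first and only then falling back to body-shape heuristics. The prefix strings and format names below are illustrative, not the actual Bedrock model catalogue:

```python
def detect_request_format(model_id: str, body: dict) -> str:
    """Pick a parser from the model_id prefix first; fall back to
    body-shape heuristics only when the model family is unknown."""
    known_prefixes = {
        "anthropic.": "anthropic",
        "amazon.titan": "titan",
        "meta.llama": "llama",
    }
    for prefix, fmt in known_prefixes.items():
        if model_id and model_id.startswith(prefix):
            return fmt
    # Fallback: the existing body-based heuristics
    if "messages" in body:
        return "converse"
    if "inputText" in body:
        return "titan"
    return "unknown"
```

With this ordering, a Titan request that accidentally carries a `messages` key is still routed by its model family rather than misclassified as Converse format.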
.github/workflows/publish_stable.yml (1)

9-16: Consider pinning reusable workflow to a stable ref for production releases.

Using @dev for the reusable workflow in a stable release pipeline is risky: breaking changes in the gh-automations dev branch could unexpectedly fail releases. Consider pinning to a tagged version or @master once the automations repository stabilizes.

The bot actor exclusion on line 9 and removal of setup_py input (now handled via pyproject.toml) are appropriate changes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/publish_stable.yml around lines 9 - 16, The reusable
workflow reference currently uses a floating dev ref ("uses:
OpenVoiceOS/gh-automations/.github/workflows/publish-stable.yml@dev"), which
risks breaking stable releases; update that "uses:" invocation to pin to a
stable ref (for example a specific tag like `@v1.2.3` or the repository's main
branch such as `@master` once it's stable) so the publish_stable workflow
consistently runs against a known-good version, and keep the other inputs
(branch, version_file, publish_pypi, sync_dev) unchanged.
ovos_persona_server/huggingface_tgi.py (2)

132-148: Hardcoded /info values may mislead clients.

The /info endpoint returns static placeholder values (e.g., max_input_length: 4096, version: "2.0.0") that don't reflect actual persona capabilities. Clients relying on these for context window management could encounter unexpected truncation.

Consider documenting that these are placeholders or deriving values from persona configuration if available.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ovos_persona_server/huggingface_tgi.py` around lines 132 - 148, The /info
response currently returns hardcoded placeholder fields (see the JSONResponse
returned where persona.name is used) which can mislead clients; update the
handler that returns this JSONResponse to populate fields like max_input_length,
max_total_tokens, version, sha, model_dtype, model_device_type, etc. from the
persona configuration or the actual model metadata if available (fall back to
clearly labeled placeholders only when no metadata exists), or add an explicit
"placeholder": true/notice field to the response so clients know values are not
authoritative. Ensure you reference the JSONResponse return in
huggingface_tgi.py and use persona or its metadata methods to derive the real
values.
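One way to implement the fallback-with-flag approach, as a sketch. The `config` attribute and key names are assumptions about the Persona object, not its documented interface:

```python
def build_info_payload(persona) -> dict:
    """Derive /info fields from persona configuration when available,
    and flag the payload as a placeholder otherwise."""
    config = getattr(persona, "config", None) or {}
    return {
        "model_id": getattr(persona, "name", "unknown"),
        "max_input_length": config.get("max_input_length", 4096),
        "max_total_tokens": config.get("max_total_tokens", 8192),
        "version": config.get("version", "2.0.0"),
        # tell clients when the limits are not authoritative
        "placeholder": "max_input_length" not in config,
    }
```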

107-109: Consider logging exceptions before returning error to client.

The blind Exception catch is acceptable for SSE resilience, but silently discarding the exception makes debugging difficult. Adding a log statement would aid troubleshooting.

🔧 Proposed fix
+import logging
+
+logger = logging.getLogger(__name__)
+
         except Exception as exc:
+            logger.exception("Error during TGI streaming")
             yield f"data:{json.dumps({'error': str(exc)})}\n\n"
             return
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ovos_persona_server/huggingface_tgi.py` around lines 107 - 109, The except
Exception as exc block currently yields the error to the SSE client but discards
the exception; update that block to log the exception before yielding by calling
the module/class logger (e.g. logger.exception(...) or logger.error(...,
exc_info=True)) so the stacktrace and message are recorded; specifically modify
the except Exception as exc in huggingface_tgi.py (the block that does yield
f"data:{json.dumps({'error': str(exc)})}\n\n") to log exc (using
logger.exception or logging.getLogger(__name__).exception) and then continue to
yield and return as before.
pyproject.toml (1)

14-21: Keywords appear to be from a different project.

The keywords (lang, detect, translate) suggest language detection/translation, but this is a persona/LLM server. Consider updating to relevant terms.

📝 Suggested keywords
 keywords = [
-    "plugin",
-    "lang",
-    "detect",
-    "translate",
+    "persona",
+    "llm",
+    "api",
+    "openai",
+    "ollama",
     "OVOS",
     "OpenVoiceOS",
 ]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pyproject.toml` around lines 14 - 21, The keywords array in pyproject.toml
currently contains unrelated language/translation terms; update the "keywords"
list to reflect this project (persona/LLM server) by replacing entries like
"lang", "detect", "translate" with relevant terms such as "persona", "llm",
"assistant", "server", "chatbot", "conversational-ai", "virtual-assistant",
"OpenVoiceOS" (i.e., edit the keywords list symbol in pyproject.toml to contain
project-appropriate tags).
docs/index.md (1)

33-47: Add language specifier to fenced code block.

The architecture diagram should have a language specifier to satisfy markdownlint (MD040). Use text or plaintext for ASCII diagrams.

📝 Proposed fix
-```
+```text
 HTTP request
   └─ FastAPI app  [__init__.py:create_persona_app]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/index.md` around lines 33 - 47, Add a language specifier to the fenced
code block that contains the ASCII architecture diagram in docs/index.md (the
block showing "HTTP request └─ FastAPI app [__init__.py:create_persona_app] ..."
and nested routers/OVOSPersonaAgentExecutor/Persona.stream) so markdownlint
MD040 is satisfied; change the opening triple backticks to include the specifier
(e.g., ```text or ```plaintext) and keep the closing triple backticks unchanged.
docs/a2a.md (1)

139-148: Add language specifier to fenced code block.

The architecture diagram code block should specify a language (e.g., text or plaintext) for consistency and to satisfy markdown linters.

📝 Proposed fix
-```
+```text
 A2A client request
   └─ POST /a2a/
        └─ A2AStarletteApplication     [a2a-sdk]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/a2a.md` around lines 139 - 148, Add a language specifier (e.g., "text"
or "plaintext") to the fenced code block containing the architecture diagram so
markdown linters accept it; update the opening fence for the block that starts
with "A2A client request" (the block showing A2AStarletteApplication,
DefaultRequestHandler, OVOSPersonaAgentExecutor and Persona.stream) to use
```text (or ```plaintext) instead of bare ``` to mark it as plain text.
ovos_persona_server/deprecated_routers.py (1)

64-82: Deprecation warning logged on every request — consider rate-limiting.

The middleware logs a WARNING on every request to a legacy path. In high-traffic scenarios, this could flood logs. Consider logging only once per path (using a set to track logged paths) or at INFO level.

♻️ Optional: Log deprecation warning only once per path
+_warned_paths: set = set()
+
 @app.middleware("http")
 async def deprecation_header_middleware(request: Request, call_next: Callable) -> Response:
     """Inject Deprecation + Link headers for legacy endpoint paths."""
     path = request.url.path
     is_legacy = any(
         path.startswith(prefix + "/") or path == prefix
         for prefix in _LEGACY_PREFIX_MAP
     )
     response: Response = await call_next(request)
     if is_legacy:
         successor = _build_successor_path(path)
-        LOG.warning(
-            "Deprecated path accessed: %s — use %s instead",
-            path,
-            successor,
-        )
+        if path not in _warned_paths:
+            LOG.warning(
+                "Deprecated path accessed: %s — use %s instead",
+                path,
+                successor,
+            )
+            _warned_paths.add(path)
         response.headers["Deprecation"] = "true"
         response.headers["Link"] = f'<{successor}>; rel="successor-version"'
     return response
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ovos_persona_server/deprecated_routers.py` around lines 64 - 82, The
middleware deprecation_header_middleware currently emits LOG.warning for every
legacy request and should be rate-limited: add a module-level set (e.g.,
_logged_deprecated_paths = set()) and, inside deprecation_header_middleware,
check whether the accessed legacy path (use the same logic with
_LEGACY_PREFIX_MAP and _build_successor_path to form the key) has already been
logged; only call LOG.warning and add the path to _logged_deprecated_paths the
first time it is seen (or alternatively change LOG.warning to LOG.info if you
prefer lower verbosity). Ensure thread-safety if necessary for your runtime
(e.g., minimal locking) when updating _logged_deprecated_paths.
test/unittests/test_a2a.py (1)

29-29: PEP 484: Use explicit Optional for nullable default parameter.

The parts parameter defaults to None but the type hint is List[Any], not Optional[List[Any]].

📝 Proposed fix
+from typing import Any, List, Optional
+
 class _FakeMessage:
-    def __init__(self, text: str = "", parts: List[Any] = None, **kwargs: Any) -> None:
+    def __init__(self, text: str = "", parts: Optional[List[Any]] = None, **kwargs: Any) -> None:
         if parts is not None:
             self.parts = parts
         else:
             self.parts = [_FakePart(root=_FakeTextPart(text))]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/unittests/test_a2a.py` at line 29, The constructor signature declares
parts: List[Any] = None but should use an explicit nullable type; update the
__init__ method signature to annotate parts as Optional[List[Any]] and add
Optional to the typing imports (or use typing.Optional) so the default None is
type-correct; keep the default value None and no other behavior changes in
__init__.
test/unittests/test_compat_routers.py (3)

107-108: Use a more descriptive variable name instead of l.

The variable name l is ambiguous (can be confused with 1 or I). Consider using line or data_line.

♻️ Proposed fix
-        lines = [l for l in resp.text.splitlines() if l.startswith("data:")]
+        lines = [line for line in resp.text.splitlines() if line.startswith("data:")]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/unittests/test_compat_routers.py` around lines 107 - 108, Rename the
ambiguous loop variable `l` in the list comprehension that builds `lines` from
`resp.text.splitlines()` to a clearer name like `line` or `data_line` (e.g.,
`lines = [line for line in resp.text.splitlines() if line.startswith("data:")]`)
and update any references to that variable accordingly so the intent is clear
and avoids confusion with `1` or `I`.

198-199: Use a more descriptive variable name instead of l.

Same issue as above - use line instead of l.

♻️ Proposed fix
-        events = [l for l in resp.text.splitlines() if l.startswith("event:")]
+        events = [line for line in resp.text.splitlines() if line.startswith("event:")]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/unittests/test_compat_routers.py` around lines 198 - 199, Replace the
ambiguous loop variable `l` with a descriptive name `line` in the list
comprehensions that build `events` and `event_types` in test_compat_routers
(change `[l for l in resp.text.splitlines() if l.startswith("event:")]` to
`[line for line in resp.text.splitlines() if line.startswith("event:")]` and
similarly update the second comprehension `event_types = [e.split(": ", 1)[1]
for e in events]` if it uses a single-letter variable elsewhere), so both
comprehensions use `line` for clarity and consistency with surrounding code.

278-280: Use more descriptive variable names instead of l.

Same ambiguous variable name issue in both list comprehensions.

♻️ Proposed fix
-        lines = [l.strip() for l in resp.text.splitlines() if l.strip()]
-        events = [json.loads(l) for l in lines if l.startswith("{")]
+        lines = [line.strip() for line in resp.text.splitlines() if line.strip()]
+        events = [json.loads(line) for line in lines if line.startswith("{")]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/unittests/test_compat_routers.py` around lines 278 - 280, The two list
comprehensions use the ambiguous variable name `l`; update them to use a
descriptive name like `line` to improve readability: change the comprehension
that builds `lines` from `[l.strip() for l in resp.text.splitlines() if
l.strip()]` to use `line`, and likewise change `events = [json.loads(l) for l in
lines if l.startswith("{")]` to use `line` so it becomes `events =
[json.loads(line) for line in lines if line.startswith("{")]`; keep the
surrounding logic (resp, lines, events, assert) unchanged.
ovos_persona_server/schemas/anthropic.py (1)

15-19: Consider whether "system" role is needed in AnthropicMessage.

Anthropic's API typically handles system prompts via the top-level system field in the request, not as a message role. Including "system" in the allowed roles may accept requests that don't match the actual Anthropic API behavior. However, since _normalise_messages() in anthropic.py handles this by prepending the top-level system separately, this is likely intentional for flexibility.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ovos_persona_server/schemas/anthropic.py` around lines 15 - 19, The allowed
"system" role in AnthropicMessage is misleading for Anthropic's API; change the
role Literal on class AnthropicMessage to only allow "user" and "assistant"
(e.g., Literal["user","assistant"]) and update any code that constructs or
validates messages accordingly so system prompts are supplied via the top-level
system handling in _normalise_messages; ensure references to "system" messages
are removed from AnthropicMessage usages and rely on _normalise_messages to
prepend the top-level system content.
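If the narrower schema is adopted, the constraint amounts to rejecting any role outside `user`/`assistant` at validation time. A stdlib stand-in for the Pydantic `Literal["user", "assistant"]` field (the class name below is hypothetical):

```python
from dataclasses import dataclass

ALLOWED_ROLES = ("user", "assistant")

@dataclass
class StrictAnthropicMessage:
    """System prompts go through the top-level `system` field,
    so only conversational roles are valid on a message."""
    role: str
    content: str

    def __post_init__(self) -> None:
        if self.role not in ALLOWED_ROLES:
            raise ValueError(f"unsupported role: {self.role!r}")
```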
ovos_persona_server/cohere.py (2)

182-182: Same media type consideration for generate streaming.

Apply the same NDJSON media type here as on the chat streaming endpoint, for consistency.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ovos_persona_server/cohere.py` at line 182, The response currently returns
StreamingResponse(_stream(), media_type="application/json"); update it to use
the NDJSON media type (e.g., "application/x-ndjson") for consistency with the
other generate streaming implementation, i.e., change the media_type argument on
the StreamingResponse call that wraps _stream(), and verify that the _stream()
generator yields newline-delimited JSON records so the content-type matches the
payload.

129-129: Consider using NDJSON media type for streaming responses.

Cohere's streaming API returns newline-delimited JSON (NDJSON). While application/json works, application/x-ndjson or text/event-stream may be more accurate for client compatibility.

♻️ Proposed fix
-    return StreamingResponse(_stream(), media_type="application/json")
+    return StreamingResponse(_stream(), media_type="application/x-ndjson")

Note: Verify Cohere's actual content-type header to ensure compatibility.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ovos_persona_server/cohere.py` at line 129, The streaming response currently
returns media_type="application/json"; change it to use an NDJSON-compatible
media type by setting StreamingResponse(..., media_type="application/x-ndjson")
(or "text/event-stream" if you intend SSE), or better, detect and forward the
Cohere response content-type (e.g. read the upstream
response.headers.get('content-type') and pass it into StreamingResponse) so the
returned media type matches Cohere's actual streaming format; update the return
that wraps _stream() in ovos_persona_server/cohere.py accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@ovos_persona_server/a2a.py`:
- Around line 159-180: The code currently buffers all output by calling
list(self._persona.stream(messages)) in execute, preventing incremental SSE
delivery; change to drain persona.stream as chunks arrive by running the
synchronous Persona.stream in a background thread and streaming results into the
async path (e.g., push chunks into a thread-safe queue from a background
callable and consume that queue in execute), then create and await
event_queue.enqueue_event(TaskArtifactUpdateEvent(...)) for each chunk as it is
dequeued (use append/last_chunk logic based on streaming sequence), and handle
errors and a termination sentinel so streaming stops cleanly instead of waiting
for the full list.
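The producer/queue pattern described above can be sketched as follows; `make_gen` stands in for `lambda: persona.stream(messages)`, and a real implementation would consume the queue from the async side:

```python
import queue
import threading

def stream_through_queue(make_gen):
    """Drain a blocking generator on a worker thread and yield chunks
    as they arrive, instead of buffering with list(...)."""
    q: "queue.Queue" = queue.Queue()

    def _producer():
        try:
            for chunk in make_gen():
                q.put(("chunk", chunk))
        except Exception as exc:
            # surface producer errors to the consumer
            q.put(("error", exc))
        finally:
            # termination sentinel so streaming stops cleanly
            q.put(("done", None))

    threading.Thread(target=_producer, daemon=True).start()
    while True:
        kind, payload = q.get()
        if kind == "done":
            return
        if kind == "error":
            raise payload
        yield payload
```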

In `@ovos_persona_server/deprecated_routers.py`:
- Around line 114-117: The code incorrectly assumes route.path includes the
router prefix; instead of slicing route.path by canonical_prefix, set relative =
route.path directly and build legacy_path = new_prefix + relative (ensuring you
normalize slashes to avoid double or missing slashes). Update the logic around
the variables route.path, canonical_prefix, new_prefix, relative, and
legacy_path to use the route.path as-is when constructing the legacy route.
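The corrected join, with slash normalization so neither doubled nor missing separators appear, could look like this (helper name is illustrative):

```python
def build_legacy_path(new_prefix: str, route_path: str) -> str:
    """Join the legacy prefix with route.path as-is; route.path does
    not include the router prefix, so no slicing is needed."""
    return new_prefix.rstrip("/") + "/" + route_path.lstrip("/")
```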

In `@ovos_persona_server/gemini.py`:
- Around line 115-124: The async generator _stream currently calls the
synchronous persona.stream(messages) directly, blocking the event loop; change
_stream to offload the blocking call to a background thread (e.g., use
asyncio.to_thread or loop.run_in_executor) when iterating persona.stream so
chunks are produced without blocking; specifically, wrap the
persona.stream(messages) iteration (inside async def _stream) in an
asyncio.to_thread call or call persona.stream via to_thread and then iterate its
returned iterator, then keep using _build_response(chunk, model_id) and yielding
the JSON SSE lines as before; follow the same offloading pattern used in a2a.py
to avoid event-loop blocking.
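The offloading pattern amounts to advancing the blocking iterator with `asyncio.to_thread` so each `next()` call runs off the event loop. A self-contained sketch (the generator name is illustrative):

```python
import asyncio

async def iterate_offloaded(blocking_iterable):
    """Advance a blocking iterator via asyncio.to_thread so the
    event loop stays responsive between chunks."""
    it = iter(blocking_iterable)
    sentinel = object()
    while True:
        # next(it, sentinel) runs in a worker thread
        chunk = await asyncio.to_thread(next, it, sentinel)
        if chunk is sentinel:
            return
        yield chunk
```

In gemini.py the body of `_stream` would then become `async for chunk in iterate_offloaded(persona.stream(messages)): ...`, keeping the existing `_build_response(chunk, model_id)` and SSE formatting unchanged.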

In `@ovos_persona_server/ollama.py`:
- Around line 519-521: The response is missing the required model field in
OllamaEmbedResponse; set the model to the identifier used by the embedder before
returning — e.g., after computing vec from
embed_solver.get_embeddings(input_text) construct OllamaEmbedResponse with
embeddings=[vec] and model=embed_solver.model (or embed_solver.model_name if
that property is used by your embed_solver), falling back to a sensible string
if neither exists; update the return in the block that builds input_text, calls
embed_solver.get_embeddings and returns JSONResponse so schema validation
passes.
- Around line 486-521: In ollama_embeddings, don't join list inputs into one
string; instead detect if request_body.input is a list and for each item call
the chosen embed_solver.get_embeddings(item) to produce a list of vectors,
preserving input order and returning OllamaEmbedResponse(embeddings=[...]) where
embeddings is a List[List[float]] (for a single string input still wrap the
single vector in a list). Update the branch that currently sets input_text and
calls embed_solver.get_embeddings to handle both cases (string -> single vector
wrapped in list; list -> map/loop calling get_embeddings per element), keeping
the rest of the logic (solver selection via persona.solvers.loaded_modules)
unchanged.
- Around line 376-441: The /show handler is incorrectly a GET and both /show and
/ps are missing Ollama-required fields; change the decorator on the show
function from `@ollama_router.get` to `@ollama_router.post`, extend the OllamaModel
and OllamaModelDetails dataclasses to include the missing fields (for show:
parameters, license, template, modelfile, capabilities; for ps/details/runtime:
expires_at, size_vram, context_length or include a runtime sub-object returned
by ps), populate those fields with sensible defaults or derived values in show()
and ps() (use persona.name, parent_model_str, timestamp() for expires_at or null
semantics) and ensure model.model_dump() includes the new fields so the /show
and /ps responses match Ollama's API contract (refer to functions show, ps and
classes OllamaModel, OllamaModelDetails).
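For the embeddings change above, the string-vs-list handling can be sketched as a small helper; `get_embeddings(text) -> List[float]` is assumed from the solver interface described in the docs:

```python
def embed_inputs(embed_solver, inputs):
    """Return List[List[float]] for both a single string and a list
    of strings, preserving input order."""
    if isinstance(inputs, str):
        # single string -> single vector, still wrapped in a list
        return [embed_solver.get_embeddings(inputs)]
    # list -> one vector per element, in order
    return [embed_solver.get_embeddings(item) for item in inputs]
```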

In `@pyproject.toml`:
- Around line 22-26: The package runtime is missing uvicorn which is
imported/used by the console entry point; add "uvicorn" to the dependencies
array in pyproject.toml so the runtime dependency is installed. Specifically,
update the dependencies list that currently contains "fastapi", "ovos-persona",
and "pydantic" to include "uvicorn" so that __main__.py's main() (which calls
uvicorn.run()) and the ovos-persona-server entry point work correctly.
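After the fix, the dependency array might read (other entries unchanged):

```toml
dependencies = [
    "fastapi",
    "ovos-persona",
    "pydantic",
    "uvicorn",
]
```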

---

Outside diff comments:
In `@ovos_persona_server/ollama.py`:
- Line 188: The response uses the wrong content type; update the
StreamingResponse in the function that returns the streaming Ollama chat (the
return that calls streaming_ollama_chat_response()) to use media_type
"application/x-ndjson" instead of "application/json" so the newline-delimited
JSON stream is correctly advertised to clients.

---

Minor comments:
In @.github/workflows/notify_matrix.yml:
- Line 10: Update the workflow action reference to match other repo workflows by
changing the uses ref in the notify_matrix.yml step from
OpenVoiceOS/gh-automations/.github/workflows/notify-matrix.yml@master to ...@dev
(or, if there is a deliberate reason to keep `@master`, add a brief inline comment
above that uses line explaining why this workflow needs `@master`); specifically
edit the line containing the uses string so it ends with `@dev` or add the
explanatory comment directly above it.

In `@docs/api-compatibility.md`:
- Line 3: The intro over-generalizes how the request `model` and auth headers
are handled; update the intro sentence about the `model` field and auth headers
so it no longer states they are universally ignored, and instead say that
persona-backed behavior varies by provider and refer readers to
provider-specific sections (e.g., the Anthropic section which documents
`"model": "<requested model>"`) for exact behavior; alternatively, remove the
blanket claim and move the note about `model`/auth handling into each provider
subsection (including the Anthropic subsection and the section covering lines
referenced previously) so the global intro does not contradict provider-specific
docs.

In `@docs/deprecation.md`:
- Around line 22-25: The fenced code block in the header example is missing a
language tag and trips markdownlint MD040; update the triple-backtick fence that
contains "Deprecation: true\nLink: </openai/v1/chat/completions>;
rel=\"successor-version\"" by adding a language identifier (e.g., text) after
the opening ``` so the block becomes ```text, leaving the content unchanged.

In `@docs/embeddings.md`:
- Around line 5-11: Update the first sentence in the embeddings intro to say
"three endpoints across three routers" (not "two routers") so it matches the
table listing chat_router, ollama_router, and cohere_router; verify the
surrounding text that references persona.solvers.loaded_modules and the
get_embeddings mechanism remains accurate and consistent with
ovos_persona_server/chat.py, ovos_persona_server/ollama.py and
ovos_persona_server/cohere.py.

In `@docs/streaming.md`:
- Around line 13-20: The fenced examples showing streaming chunks use unlabeled
code fences and include a trailing space in the inline code span `data: `;
update each example fence to include an explicit language label (e.g., ```text
or ```json) and remove the trailing space inside the `data:` code span so it
reads `data:` to satisfy markdownlint (MD040 and MD038). Apply the same change
to all similar examples in the document (the other chunked examples referenced)
so every fenced block is labeled and no `data:` code spans contain trailing
spaces.

In `@ovos_persona_server/anthropic.py`:
- Around line 38-43: The loop that flattens message content (iterating
request.messages and using the generator expression to join block.text) can
raise AttributeError if a block lacks a text attribute; update the handling
inside that loop (where msg.content is treated as a list) to defensively iterate
blocks and use a safe extraction (e.g., check hasattr(block, "text") or
isinstance(block, str) and skip or substitute an empty string for non-text
blocks) before joining, then append the resulting safe content in
messages.append({"role": msg.role, "content": content}) so missing text
attributes no longer raise.
- Around line 105-107: The SSE error handler currently interpolates the raw
exception (exc) into the SSE payload in the except block; instead, log the full
exception server-side (use logger.exception or similar in the same scope) and
change the yielded SSE event produced by the except handler (the yield that
references exc) to a sanitized generic error message (e.g., "Internal server
error" or a short error code) that does not expose internal details; keep the
exception variable name exc and the yield location so you update the exact
generator/except block accordingly.
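The defensive flattening suggested for the message-content loop can be sketched as a helper that tolerates plain strings and blocks without a `.text` attribute (function name is illustrative):

```python
def flatten_content(content):
    """Flatten Anthropic-style content into a plain string without
    raising AttributeError on non-text blocks."""
    if isinstance(content, str):
        return content
    parts = []
    for block in content:
        if isinstance(block, str):
            parts.append(block)
        elif hasattr(block, "text"):
            parts.append(block.text)
        # non-text blocks (e.g. images) are skipped
    return "\n".join(parts)
```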

In `@ovos_persona_server/aws_bedrock.py`:
- Around line 174-186: The async generator _stream currently iterates the
synchronous persona.stream(messages) directly (blocking) and swallows
exceptions; fix by moving the synchronous streaming into a short sync helper
that collects chunks and returns them (e.g., def _collect_stream(persona,
messages): accumulated = []; for chunk in persona.stream(messages): if chunk:
accumulated.append(chunk); return accumulated), then call it inside the async
_stream with loop.run_in_executor(None, _collect_stream, persona, messages) to
avoid blocking the event loop; after receiving the collected chunks, yield the
same data events as before; replace the bare except Exception with logging
(e.g., logger.exception or logger.error with exc info) and still yield the error
event so failures are both logged and sent to the client.
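The collect-then-offload shape described above, as a sketch (error logging elided for brevity):

```python
import asyncio

def _collect_stream(persona, messages):
    """Synchronous helper: exhaust persona.stream, keeping truthy chunks."""
    accumulated = []
    for chunk in persona.stream(messages):
        if chunk:
            accumulated.append(chunk)
    return accumulated

async def collect_offloaded(persona, messages):
    """Run the blocking helper in the default executor so the event
    loop is not blocked while chunks are gathered."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, _collect_stream, persona, messages)
```

Note this still buffers the full response before yielding; it only fixes the event-loop blocking, not incremental delivery.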

In `@ovos_persona_server/chat.py`:
- Around line 388-398: The call to embed_solver.get_embeddings in the loop can
raise and is unhandled; wrap the per-text embedding call inside a try/except
(around embed_solver.get_embeddings(t) within the loop over texts) to catch
exceptions, log the error, and return a JSONResponse with an error message and
appropriate status code (e.g., 500) instead of letting the exception propagate;
ensure you still build and return the successful "object":"list" response when
no error occurs and reference the existing variables/functions
embed_solver.get_embeddings, texts, data, and the JSONResponse return path so
the handler consistently mirrors error handling used elsewhere.
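A sketch of the per-text try/except, returning an error message the handler can turn into a 500 JSONResponse (helper name is illustrative):

```python
def embed_texts(embed_solver, texts):
    """Embed each text; on the first failure return (None, error)
    instead of letting the exception propagate to the client."""
    data = []
    for i, t in enumerate(texts):
        try:
            vec = embed_solver.get_embeddings(t)
        except Exception as exc:
            return None, f"embedding failed for input {i}: {exc}"
        data.append({"object": "embedding", "index": i, "embedding": vec})
    return data, None
```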

In `@ovos_persona_server/cohere.py`:
- Around line 85-87: Remove the unused gen_id assignment or wire it into the
non-streaming response; specifically, delete the gen_id = _new_id() line in
ovos_persona_server/cohere.py (or, if gen_id was intended to identify the
generated response, include gen_id in the response payload/metadata where msg_id
is used in the non-streaming path) so there are no unused variables; update any
response construction code that should carry a generation id to reference gen_id
instead of leaving it unused.

In `@ovos_persona_server/huggingface_tgi.py`:
- Around line 93-109: The async generator _stream currently iterates
persona.stream(messages) synchronously which can block the event loop; change
the iteration to run the blocking generator on a worker thread or executor
(e.g., use asyncio.to_thread or loop.run_in_executor) so each call to
persona.stream/messages yields is performed off the event loop, then await
results and continue building accumulated/token_count and yielding the same data
events; ensure exception handling for the worker task preserves the existing
error yield behavior.

In `@pyproject.toml`:
- Line 8: Update the package description string in pyproject.toml to reflect the
actual framework: change the description field value that currently reads
"simple flask server to host OpenVoiceOS persona plugins as a service" to
mention FastAPI instead (e.g., "simple FastAPI server to host OpenVoiceOS
persona plugins as a service") so the description matches the implementation and
AUDIT.md; edit the description = "... " entry accordingly.

In `@QUICK_FACTS.md`:
- Around line 3-4: Update the QUICK_FACTS.md inventory to include the optional
A2A surface: mention the conditional /a2a mount, document the a2a_base_url
parameter on create_persona_app and the --a2a-base-url CLI flag, and note that
test/unittests/test_a2a.py adds coverage for this surface; alternatively,
explicitly state that the page only documents the seven compatibility routers
and exclude A2A—be sure to reference the create_persona_app option and CLI flag
names so readers can find the conditional behavior.

In `@README.md`:
- Around line 254-257: The code block in README.md that begins with "uv pip
install 'ovos-persona-server[a2a]'" is missing its closing triple backticks;
update the README.md by adding a closing ``` after that line so the "Then
restart the server." text is rendered outside the code block and the fenced code
block is properly terminated.

---

Nitpick comments:
In @.github/workflows/publish_stable.yml:
- Around line 9-16: The reusable workflow reference currently uses a floating
dev ref ("uses:
OpenVoiceOS/gh-automations/.github/workflows/publish-stable.yml@dev"), which
risks breaking stable releases; update that "uses:" invocation to pin to a
stable ref (for example a specific tag like `@v1.2.3` or the repository's main
branch such as `@master` once it's stable) so the publish_stable workflow
consistently runs against a known-good version, and keep the other inputs
(branch, version_file, publish_pypi, sync_dev) unchanged.

In `@docs/a2a.md`:
- Around line 139-148: Add a language specifier (e.g., "text" or "plaintext") to
the fenced code block containing the architecture diagram so markdown linters
accept it; update the opening fence for the block that starts with "A2A client
request" (the block showing A2AStarletteApplication, DefaultRequestHandler,
OVOSPersonaAgentExecutor and Persona.stream) to use ```text (or ```plaintext)
instead of bare ``` to mark it as plain text.

In `@docs/index.md`:
- Around line 33-47: Add a language specifier to the fenced code block that
contains the ASCII architecture diagram in docs/index.md (the block showing
"HTTP request └─ FastAPI app [__init__.py:create_persona_app] ..." and nested
routers/OVOSPersonaAgentExecutor/Persona.stream) so markdownlint MD040 is
satisfied; change the opening triple backticks to include the specifier (e.g.,
```text or ```plaintext) and keep the closing triple backticks unchanged.

In `@ovos_persona_server/aws_bedrock.py`:
- Around line 245-248: The token counts for input are computed twice in the
"usage" dict (the expression sum(len(m["content"].split()) for m in messages) is
duplicated); extract that value to a local variable (e.g., input_tokens =
sum(...) computed once before building the dict) and then use input_tokens for
"inputTokens" and to compute "totalTokens" (add len((text or "").split()) to
input_tokens) so the computation is done once and reused in the creation of the
"usage" object.
- Around line 37-61: The parsing logic currently assumes Converse format
whenever "messages" is in body (using variables body, msgs, result), which can
misclassify requests for specific model families; update the code that branches
on request format to first inspect the model_id parameter (e.g., check
model_id.startswith for known prefixes like "titan", "claude", "gpt" or your
project's canonical family strings) to select the correct format handler, and
only fall back to the existing body-based heuristics (the current
messages/system parsing logic) if model_id is absent or unrecognized; keep the
existing handling of body["system"] and content list merging (sys_text, text)
but gate that block behind the model_id-based decision.
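A minimal sketch of the model_id-first dispatch (the function name detect_bedrock_format and the exact family prefixes are illustrative, not the project's code):

```python
def detect_bedrock_format(model_id: str, body: dict) -> str:
    """Pick a request-format handler from the model family first,
    falling back to the body-shape heuristic only when the model id
    is absent or unrecognized."""
    families = {
        "anthropic.": "anthropic",
        "amazon.titan": "titan",
        "cohere.": "cohere",
    }
    for prefix, fmt in families.items():
        if model_id and model_id.startswith(prefix):
            return fmt
    # Fallback heuristic: a "messages" key suggests the Converse format.
    return "converse" if "messages" in body else "unknown"
```

The existing system/content merging (sys_text, text) would then run only inside the handler selected here.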

In `@ovos_persona_server/cohere.py`:
- Line 182: The response currently returns StreamingResponse(_stream(),
media_type="application/json"); update it to use the NDJSON media type (e.g.,
"application/x-ndjson") for consistency with the other generate streaming
implementation, i.e., change the media_type argument on the StreamingResponse
call that wraps _stream(), and verify that the _stream() generator yields
newline-delimited JSON records so the content-type matches the payload.
- Line 129: The streaming response currently returns
media_type="application/json"; change it to use an NDJSON-compatible media type
by setting StreamingResponse(..., media_type="application/x-ndjson") (or
"text/event-stream" if you intend SSE), or better, detect and forward the Cohere
response content-type (e.g. read the upstream
response.headers.get('content-type') and pass it into StreamingResponse) so the
returned media type matches Cohere's actual streaming format; update the return
that wraps _stream() in ovos_persona_server/cohere.py accordingly.
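For reference, an NDJSON stream is one JSON document per line, which is why the media type matters; a self-contained sketch (ndjson_stream is a hypothetical helper, not the project's _stream()):

```python
import json

def ndjson_stream(records):
    """Yield newline-delimited JSON records, the payload shape that
    matches an application/x-ndjson content type."""
    for record in records:
        yield json.dumps(record) + "\n"

# In the FastAPI route this generator would be wrapped as:
#   StreamingResponse(ndjson_stream(...), media_type="application/x-ndjson")
```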

In `@ovos_persona_server/deprecated_routers.py`:
- Around line 64-82: The middleware deprecation_header_middleware currently
emits LOG.warning for every legacy request and should be rate-limited: add a
module-level set (e.g., _logged_deprecated_paths = set()) and, inside
deprecation_header_middleware, check whether the accessed legacy path (use the
same logic with _LEGACY_PREFIX_MAP and _build_successor_path to form the key)
has already been logged; only call LOG.warning and add the path to
_logged_deprecated_paths the first time it is seen (or alternatively change
LOG.warning to LOG.info if you prefer lower verbosity). Ensure thread-safety if
necessary for your runtime (e.g., minimal locking) when updating
_logged_deprecated_paths.

In `@ovos_persona_server/huggingface_tgi.py`:
- Around line 132-148: The /info response currently returns hardcoded
placeholder fields (see the JSONResponse returned where persona.name is used)
which can mislead clients; update the handler that returns this JSONResponse to
populate fields like max_input_length, max_total_tokens, version, sha,
model_dtype, model_device_type, etc. from the persona configuration or the
actual model metadata if available (fall back to clearly labeled placeholders
only when no metadata exists), or add an explicit "placeholder": true/notice
field to the response so clients know values are not authoritative. Ensure you
reference the JSONResponse return in huggingface_tgi.py and use persona or its
metadata methods to derive the real values.
- Around line 107-109: The except Exception as exc block currently yields the
error to the SSE client but discards the exception; update that block to log the
exception before yielding by calling the module/class logger (e.g.
logger.exception(...) or logger.error(..., exc_info=True)) so the stacktrace and
message are recorded; specifically modify the except Exception as exc in
huggingface_tgi.py (the block that does yield f"data:{json.dumps({'error':
str(exc)})}\n\n") to log exc (using logger.exception or
logging.getLogger(__name__).exception) and then continue to yield and return as
before.
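A minimal sketch of the log-then-yield pattern (sse_stream is a stand-in for the generator in huggingface_tgi.py):

```python
import json
import logging

logger = logging.getLogger(__name__)

def sse_stream(chunks):
    """Yield SSE data lines; record the traceback server-side before
    surfacing the error to the client."""
    try:
        for chunk in chunks:
            yield f"data:{json.dumps({'token': chunk})}\n\n"
    except Exception as exc:
        # logger.exception attaches the active traceback automatically.
        logger.exception("TGI stream failed")
        yield f"data:{json.dumps({'error': str(exc)})}\n\n"
```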

In `@ovos_persona_server/schemas/anthropic.py`:
- Around line 15-19: The allowed "system" role in AnthropicMessage is misleading
for Anthropic's API; change the role Literal on class AnthropicMessage to only
allow "user" and "assistant" (e.g., Literal["user","assistant"]) and update any
code that constructs or validates messages accordingly so system prompts are
supplied via the top-level system handling in _normalise_messages; ensure
references to "system" messages are removed from AnthropicMessage usages and
rely on _normalise_messages to prepend the top-level system content.
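The real schema uses a Pydantic Literal; here is a dependency-free sketch of the same constraint plus the top-level system handling (normalise_messages is a simplified stand-in for the project's _normalise_messages):

```python
ALLOWED_ROLES = ("user", "assistant")  # "system" is not a message role in Anthropic's API

def normalise_messages(system: str, messages: list) -> list:
    """Reject non-Anthropic roles and prepend top-level system content."""
    for message in messages:
        if message["role"] not in ALLOWED_ROLES:
            raise ValueError(f"invalid role: {message['role']}")
    normalised = []
    if system:
        normalised.append({"role": "system", "content": system})
    normalised.extend(messages)
    return normalised
```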

In `@pyproject.toml`:
- Around line 14-21: The keywords array in pyproject.toml currently contains
unrelated language/translation terms; update the "keywords" list to reflect this
project (persona/LLM server) by replacing entries like "lang", "detect",
"translate" with relevant terms such as "persona", "llm", "assistant", "server",
"chatbot", "conversational-ai", "virtual-assistant", "OpenVoiceOS" (i.e., edit
the keywords list symbol in pyproject.toml to contain project-appropriate tags).

In `@test/unittests/test_a2a.py`:
- Line 29: The constructor signature declares parts: List[Any] = None but should
use an explicit nullable type; update the __init__ method signature to annotate
parts as Optional[List[Any]] and add Optional to the typing imports (or use
typing.Optional) so the default None is type-correct; keep the default value
None and no other behavior changes in __init__.
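The change amounts to this (FakeMessage is a hypothetical stand-in for the test double):

```python
from typing import Any, List, Optional

class FakeMessage:
    """Test double whose parts default is explicitly nullable."""

    def __init__(self, parts: Optional[List[Any]] = None) -> None:
        # Default None is now type-correct; normalise to an empty list.
        self.parts = parts or []
```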

In `@test/unittests/test_compat_routers.py`:
- Around line 107-108: Rename the ambiguous loop variable `l` in the list
comprehension that builds `lines` from `resp.text.splitlines()` to a clearer
name like `line` or `data_line` (e.g., `lines = [line for line in
resp.text.splitlines() if line.startswith("data:")]`) and update any references
to that variable accordingly so the intent is clear and avoids confusion with
`1` or `I`.
- Around line 198-199: Replace the ambiguous loop variable `l` with a
descriptive name `line` in the list comprehensions that build `events` and
`event_types` in test_compat_routers (change `[l for l in resp.text.splitlines()
if l.startswith("event:")]` to `[line for line in resp.text.splitlines() if
line.startswith("event:")]` and similarly update the second comprehension
`event_types = [e.split(": ", 1)[1] for e in events]` if it uses a single-letter
variable elsewhere), so both comprehensions use `line` for clarity and
consistency with surrounding code.
- Around line 278-280: The two list comprehensions use the ambiguous variable
name `l`; update them to use a descriptive name like `line` to improve
readability: change the comprehension that builds `lines` from `[l.strip() for l
in resp.text.splitlines() if l.strip()]` to use `line`, and likewise change
`events = [json.loads(l) for l in lines if l.startswith("{")]` to use `line` so
it becomes `events = [json.loads(line) for line in lines if
line.startswith("{")]`; keep the surrounding logic (resp, lines, events, assert)
unchanged.

Comment on lines +159 to +180
# Persona.stream is synchronous — run in a thread pool
chunks = await asyncio.to_thread(
lambda: list(self._persona.stream(messages))
)

for i, chunk in enumerate(chunks):
if not chunk:
continue
artifact = Artifact(
parts=[Part(root=TextPart(text=chunk))],
artifact_id=str(i),
name=f"chunk-{i}",
)
await event_queue.enqueue_event(
TaskArtifactUpdateEvent(
task_id=context.task_id,
context_id=context.context_id,
artifact=artifact,
append=(i > 0),
last_chunk=(i == len(chunks) - 1),
)
)

⚠️ Potential issue | 🟠 Major

Streaming is defeated by collecting all chunks before emitting events.

The current implementation calls list(self._persona.stream(messages)) which collects all chunks into memory before emitting any TaskArtifactUpdateEvent. This defeats the purpose of streaming — clients won't receive any events until the entire response is generated.

For true SSE streaming, chunks should be emitted as they arrive. However, since persona.stream() is synchronous and runs in asyncio.to_thread, achieving true streaming requires a different approach (e.g., using a queue or async generator with run_in_executor).

If true streaming is not required for the initial implementation, consider documenting this limitation. Otherwise, refactor to emit events incrementally.

💡 Potential approach for true streaming

One approach is to use a thread-safe queue:

import queue
from concurrent.futures import ThreadPoolExecutor

async def execute(self, context: "RequestContext", event_queue: "EventQueue") -> None:
    user_text = self._extract_user_text(context.message)
    messages = [{"role": "user", "content": user_text}]
    
    chunk_queue: queue.Queue = queue.Queue()
    
    def _stream_to_queue():
        try:
            for chunk in self._persona.stream(messages):
                chunk_queue.put(("chunk", chunk))
            chunk_queue.put(("done", None))
        except Exception as e:
            chunk_queue.put(("error", e))
    
    loop = asyncio.get_running_loop()
    executor = ThreadPoolExecutor(max_workers=1)
    loop.run_in_executor(executor, _stream_to_queue)
    
    i = 0
    while True:
        # Poll queue with small sleep to yield control
        try:
            msg_type, data = chunk_queue.get_nowait()
        except queue.Empty:
            await asyncio.sleep(0.01)
            continue
        
        if msg_type == "done":
            break
        elif msg_type == "error":
            raise data
        elif msg_type == "chunk" and data:
            # emit artifact event...
            i += 1

Alternatively, document the current behavior as "buffered streaming" where chunks are sentence-level but the full response is awaited before delivery.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ovos_persona_server/a2a.py` around lines 159 - 180, The code currently
buffers all output by calling list(self._persona.stream(messages)) in execute,
preventing incremental SSE delivery; change to drain persona.stream as chunks
arrive by running the synchronous Persona.stream in a background thread and
streaming results into the async path (e.g., push chunks into a thread-safe
queue from a background callable and consume that queue in execute), then create
and await event_queue.enqueue_event(TaskArtifactUpdateEvent(...)) for each chunk
as it is dequeued (use append/last_chunk logic based on streaming sequence), and
handle errors and a termination sentinel so streaming stops cleanly instead of
waiting for the full list.

Comment on lines +114 to +117
# route.path already contains the canonical prefix, e.g.
# '/openai/v1/chat/completions'. Replace it with the legacy prefix.
relative = route.path[len(canonical_prefix):] # '/chat/completions'
legacy_path = new_prefix + relative # '/v1/chat/completions'

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify what route.path contains in FastAPI routers

ast-grep --pattern $'@$_.post(
  $_,
  $$$
)'

Repository: OpenVoiceOS/ovos-persona-server

Length of output: 1814


🏁 Script executed:

#!/bin/bash
# Read the deprecated_routers.py file around lines 114-117
cat -n ovos_persona_server/deprecated_routers.py | sed -n '100,130p'

Repository: OpenVoiceOS/ovos-persona-server

Length of output: 1925


🏁 Script executed:

#!/bin/bash
# Check router definitions in chat.py and ollama.py
echo "=== chat.py router definition ==="
head -50 ovos_persona_server/chat.py | grep -A 5 "chat_router"

echo -e "\n=== ollama.py router definition ==="
head -70 ovos_persona_server/ollama.py | grep -A 5 "ollama_router"

Repository: OpenVoiceOS/ovos-persona-server

Length of output: 852


🏁 Script executed:

#!/bin/bash
# Check the _LEGACY_PREFIX_MAP and router registration to understand the flow
grep -A 10 "_LEGACY_PREFIX_MAP\|register_deprecated_routes" ovos_persona_server/deprecated_routers.py | head -40

Repository: OpenVoiceOS/ovos-persona-server

Length of output: 1341


🏁 Script executed:

#!/bin/bash
# Look for where register_deprecated_routes is called and what routers are passed
grep -r "register_deprecated_routes" ovos_persona_server/ --include="*.py" -B 2 -A 2

Repository: OpenVoiceOS/ovos-persona-server

Length of output: 1507


🏁 Script executed:

#!/bin/bash
# Check the full register_deprecated_routes function to see how it calls _legacy_router
sed -n '80,150p' ovos_persona_server/deprecated_routers.py

Repository: OpenVoiceOS/ovos-persona-server

Length of output: 2807


🏁 Script executed:

#!/bin/bash
# Verify the actual route.path by checking if there's any test or debug output
grep -r "route.path" ovos_persona_server/ --include="*.py" -B 2 -A 2

Repository: OpenVoiceOS/ovos-persona-server

Length of output: 1320


🏁 Script executed:

#!/bin/bash
# Create a minimal test to verify FastAPI route.path behavior
python3 << 'EOF'
from fastapi import APIRouter
from fastapi.routing import APIRoute

# Simulate the actual router setup
router = APIRouter(prefix="/openai/v1", tags=["test"])

@router.post("/chat/completions")
async def test_route():
    return {"msg": "test"}

# Check what route.path contains
for route in router.routes:
    if isinstance(route, APIRoute):
        print(f"route.path: {route.path}")
        print(f"route.methods: {route.methods}")
        
        # Simulate the buggy calculation
        canonical_prefix = "/openai/v1"
        relative = route.path[len(canonical_prefix):]
        print(f"\nWith buggy code:")
        print(f"  canonical_prefix: {canonical_prefix}")
        print(f"  route.path[len(canonical_prefix):]: {relative}")
        print(f"  Expected relative: /chat/completions")
        
        # Simulate the fixed calculation
        new_prefix = "/v1"
        legacy_path_buggy = new_prefix + relative
        legacy_path_fixed = new_prefix + route.path
        print(f"\nResult:")
        print(f"  Buggy legacy_path: {legacy_path_buggy}")
        print(f"  Fixed legacy_path: {legacy_path_fixed}")
        print(f"  Expected: /v1/chat/completions")
EOF

Repository: OpenVoiceOS/ovos-persona-server

Length of output: 194


🏁 Script executed:

#!/bin/bash
# Search for any tests or existing code that demonstrates route.path behavior
grep -r "\.path" ovos_persona_server/ --include="*.py" | grep -i route | head -20

Repository: OpenVoiceOS/ovos-persona-server

Length of output: 383


🏁 Script executed:

#!/bin/bash
# Check if there are any tests that verify the deprecated routes work correctly
find . -name "*test*" -type f -name "*.py" | head -10

Repository: OpenVoiceOS/ovos-persona-server

Length of output: 153


🌐 Web query:

FastAPI APIRoute path attribute what does route.path contain with router prefix

💡 Result:

In FastAPI, the route.path attribute of an APIRoute instance contains the declared path from the path operation decorator (e.g., "/items/{item_id}"), without the router prefix. The full path (including prefix) used for matching and OpenAPI generation is available via route.path_format. When an APIRouter with prefix="/api" has @router.get("/items/{item_id}"), then:

- route.path = "/items/{item_id}"
- route.path_format = "/api/items/{item_id}"

This is inherited from Starlette's Route class, where path is the template passed to the constructor and path_format is the compiled pattern with the prefix applied during router inclusion. Example from usage:

for route in request.app.routes:
    if isinstance(route, APIRoute):
        print(route.path)         # e.g. "/items/{item_id}"
        print(route.path_format)  # e.g. "/api/items/{item_id}"

Sources confirm route.path is used to list the base route path in middleware and iteration over app.routes (result 1), while prefixes are handled separately in APIRouter.include_router (results 6, 9, 10). Starlette source shows path as the input template (result 24) and path_format as the formatted version (inferred from routing.py snippets).


🏁 Script executed:

#!/bin/bash
# Check test_compat_routers.py to see if it tests the deprecated routes
cat -n test/unittests/test_compat_routers.py

Repository: OpenVoiceOS/ovos-persona-server

Length of output: 21171


🏁 Script executed:

#!/bin/bash
# Run the deprecated route tests to confirm they actually pass
cd test/unittests && python -m pytest test_compat_routers.py::TestDeprecatedOpenAIChatRouter::test_deprecated_chat_still_works -v 2>&1 || true

Repository: OpenVoiceOS/ovos-persona-server

Length of output: 1380


🏁 Script executed:

#!/bin/bash
# Let's check what routes are actually being generated by adding debug output
# First, let's see if there's a way to inspect the actual routes
grep -r "for route in" ovos_persona_server/ --include="*.py" -B 3 -A 10

Repository: OpenVoiceOS/ovos-persona-server

Length of output: 1576


🏁 Script executed:

#!/bin/bash
# Let's write a simple test script to verify the actual behavior
python3 << 'EOF'
import sys
sys.path.insert(0, '.')

# Simulate the logic to see what actually happens
canonical_prefix = "/openai/v1"
new_prefix = "/v1"
route_path = "/chat/completions"

print(f"canonical_prefix: {canonical_prefix}")
print(f"new_prefix: {new_prefix}")
print(f"route.path (what we expect): {route_path}")
print()

# Current buggy code
relative = route_path[len(canonical_prefix):]
legacy_path_buggy = new_prefix + relative

print(f"With current code:")
print(f"  relative = route.path[len(canonical_prefix):] = {route_path}[{len(canonical_prefix)}:] = '{relative}'")
print(f"  legacy_path = new_prefix + relative = '{new_prefix}' + '{relative}' = '{legacy_path_buggy}'")
print()

# Fixed code
legacy_path_fixed = new_prefix + route_path

print(f"With fixed code:")
print(f"  legacy_path = new_prefix + route.path = '{new_prefix}' + '{route_path}' = '{legacy_path_fixed}'")
print()

print(f"Test expects: /v1/chat/completions")
print(f"Current code produces: {legacy_path_buggy}")
print(f"Fixed code produces: {legacy_path_fixed}")
print()

if legacy_path_buggy == "/v1/chat/completions":
    print("❌ Current code works (but comment is still wrong about route.path)")
else:
    print("✓ Current code is buggy as claimed")
EOF

Repository: OpenVoiceOS/ovos-persona-server

Length of output: 605


Critical: route.path does not contain the router prefix — legacy routes will have malformed paths.

The comment incorrectly states that route.path already contains the canonical prefix (e.g., /openai/v1/chat/completions). In FastAPI, APIRoute.path contains only the path defined in the route decorator, not the router's prefix. For chat_router with prefix="/openai/v1" and @router.post("/chat/completions"), the route.path is just "/chat/completions".

This causes line 116 to produce incorrect paths:

  • "/chat/completions"[len("/openai/v1"):] evaluates to "letions" (truncates the first 10 characters)
  • legacy_path = "/v1" + "letions" evaluates to "/v1letions" (broken route)

Expected: /v1/chat/completions

The fix is to use route.path directly:

Proposed fix
-            # route.path already contains the canonical prefix, e.g.
-            # '/openai/v1/chat/completions'.  Replace it with the legacy prefix.
-            relative = route.path[len(canonical_prefix):]   # '/chat/completions'
-            legacy_path = new_prefix + relative              # '/v1/chat/completions'
+            # route.path is the path defined on the route (e.g., '/chat/completions'),
+            # NOT including the router's prefix. Use it directly with the legacy prefix.
+            legacy_path = new_prefix + route.path            # '/v1/chat/completions'
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ovos_persona_server/deprecated_routers.py` around lines 114 - 117, The code
incorrectly assumes route.path includes the router prefix; instead of slicing
route.path by canonical_prefix, set relative = route.path directly and build
legacy_path = new_prefix + relative (ensuring you normalize slashes to avoid
double or missing slashes). Update the logic around the variables route.path,
canonical_prefix, new_prefix, relative, and legacy_path to use the route.path
as-is when constructing the legacy route.

Comment on lines +115 to +124
async def _stream() -> AsyncGenerator[str, None]:
"""Yield SSE events with GeminiResponse chunks."""
try:
for chunk in persona.stream(messages):
if chunk:
resp = _build_response(chunk, model_id)
yield f"data: {json.dumps(resp.model_dump())}\n\n"
except Exception as exc:
yield f"data: {json.dumps({'error': str(exc)})}\n\n"


⚠️ Potential issue | 🟠 Major

Blocking synchronous call in async generator — will block the event loop.

persona.stream(messages) is a synchronous call (as confirmed by the A2A module using asyncio.to_thread), but here it's called directly inside an async generator without offloading to a thread. This will block the FastAPI event loop during streaming, degrading concurrency for other requests.

Consider using asyncio.to_thread or run_in_executor to avoid blocking, similar to the pattern in a2a.py.

🐛 Proposed fix using asyncio.to_thread
+import asyncio
+
 async def _stream() -> AsyncGenerator[str, None]:
     """Yield SSE events with GeminiResponse chunks."""
     try:
-        for chunk in persona.stream(messages):
+        # Run synchronous stream in thread pool to avoid blocking event loop
+        chunks = await asyncio.to_thread(lambda: list(persona.stream(messages)))
+        for chunk in chunks:
             if chunk:
                 resp = _build_response(chunk, model_id)
                 yield f"data: {json.dumps(resp.model_dump())}\n\n"
     except Exception as exc:
         yield f"data: {json.dumps({'error': str(exc)})}\n\n"

Note: This still collects all chunks before yielding (same limitation as A2A). For true streaming, a more complex approach with a thread-safe queue would be needed.

🧰 Tools
🪛 Ruff (0.15.6)

[warning] 122-122: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ovos_persona_server/gemini.py` around lines 115 - 124, The async generator
_stream currently calls the synchronous persona.stream(messages) directly,
blocking the event loop; change _stream to offload the blocking call to a
background thread (e.g., use asyncio.to_thread or loop.run_in_executor) when
iterating persona.stream so chunks are produced without blocking; specifically,
wrap the persona.stream(messages) iteration (inside async def _stream) in an
asyncio.to_thread call or call persona.stream via to_thread and then iterate its
returned iterator, then keep using _build_response(chunk, model_id) and yielding
the JSON SSE lines as before; follow the same offloading pattern used in a2a.py
to avoid event-loop blocking.

Comment on lines +376 to +441
@ollama_router.get("/show")
async def show(persona: Persona = Depends(get_default_persona)) -> JSONResponse:
"""Show model card (Ollama-compatible stub).

Args:
persona: Injected persona instance.

Returns:
Static model card for the loaded persona.
"""
solvers: List[str] = list(persona.solvers.loaded_modules.keys())
models_in_config: set = {persona.config.get(s, {}).get("model")
for s in persona.solvers.loaded_modules.keys()}
parent_model_str: str = "|".join(filter(None, models_in_config))

details: OllamaModelDetails = OllamaModelDetails(
format="json",
family="ovos-persona",
families=solvers,
parent_model=parent_model_str,
parameter_size="",
quantization_level="",
)
model: OllamaModel = OllamaModel(
name=persona.name,
model=persona.name,
digest="sha256:placeholder_digest",
size=0,
modified_at=timestamp(),
details=details,
)
return JSONResponse(model.model_dump())


@ollama_router.get("/ps")
async def ps(persona: Persona = Depends(get_default_persona)) -> JSONResponse:
"""List running models (Ollama-compatible stub).

Args:
persona: Injected persona instance.

Returns:
Ollama /api/ps format with the loaded persona listed as running.
"""
solvers: List[str] = list(persona.solvers.loaded_modules.keys())
models_in_config: set = {persona.config.get(s, {}).get("model")
for s in persona.solvers.loaded_modules.keys()}
parent_model_str: str = "|".join(filter(None, models_in_config))

details: OllamaModelDetails = OllamaModelDetails(
format="json",
family="ovos-persona",
families=solvers,
parent_model=parent_model_str,
parameter_size="",
quantization_level="",
)
model: OllamaModel = OllamaModel(
name=persona.name,
model=persona.name,
digest="sha256:placeholder_digest",
size=0,
modified_at=timestamp(),
details=details,
)
return JSONResponse({"models": [model.model_dump()]})

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

Ollama API /api/show endpoint POST request specification

💡 Result:

The Ollama /api/show endpoint is a POST request used to retrieve detailed information about a specific model.

Endpoint: POST http://localhost:11434/api/show (default base URL)

Request body (JSON):
- model: string (required) - the name of the model to show (e.g., "gemma3", "llama3.2")
- verbose: boolean (optional, default false) - if true, returns full data for verbose response fields like detailed tokenizer info

Example curl request:

curl http://localhost:11434/api/show -d '{ "model": "gemma3" }'

Or with verbose:

curl http://localhost:11434/api/show -d '{ "model": "gemma3", "verbose": true }'

The response is a JSON object containing model details such as parameters, license, template, modelfile, capabilities (e.g., completion, vision), modified_at timestamp, details (format, family, quantization_level, etc.), and model_info (architecture-specific parameters). This specification is from the official Ollama OpenAPI schema and documentation.


🌐 Web query:

Ollama API /api/ps endpoint runtime fields specification

💡 Result:

The Ollama API endpoint GET /api/ps returns a list of models currently loaded into memory. The response is a JSON object with a top-level "models" array, where each item is an object describing a running model. The fields for each model (Ps schema) are:

- name: string (e.g., "gemma3" or "mistral:latest") - the loaded model name.
- model: string (e.g., "gemma3") - the base model name.
- size: integer (e.g., 6591830464) - model size in bytes.
- digest: string (e.g., "a2af6cc3eb7fa8be8504abaf9b04e88f17a119ec3f04a3addf55f92841195f5a") - SHA256 digest.
- details: object containing parent_model (e.g., ""), format (e.g., "gguf"), family (e.g., "gemma3"), families (array of strings), parameter_size (e.g., "4.3B"), quantization_level (e.g., "Q4_K_M").
- expires_at: string (ISO 8601 timestamp, e.g., "2025-10-17T16:47:07.93355-07:00") - when the model will be unloaded.
- size_vram: integer (e.g., 5333539264) - VRAM usage in bytes.
- context_length: integer (e.g., 4096) - the model's context length.

Example response:

{
  "models": [
    {
      "name": "gemma3",
      "model": "gemma3",
      "size": 6591830464,
      "digest": "a2af6cc3eb7fa8be8504abaf9b04e88f17a119ec3f04a3addf55f92841195f5a",
      "details": {
        "parent_model": "",
        "format": "gguf",
        "family": "gemma3",
        "families": ["gemma3"],
        "parameter_size": "4.3B",
        "quantization_level": "Q4_K_M"
      },
      "expires_at": "2025-10-17T16:47:07.93355-07:00",
      "size_vram": 5333539264,
      "context_length": 4096
    }
  ]
}

This specification is from the official OpenAPI schema and examples in the documentation.


🏁 Script executed:

# Find the OllamaModel and OllamaModelDetails class definitions
rg -A 20 "class OllamaModel" ovos_persona_server/

Repository: OpenVoiceOS/ovos-persona-server

Length of output: 4050


🏁 Script executed:

# Also search for OllamaModelDetails
rg -A 20 "class OllamaModelDetails" ovos_persona_server/

Repository: OpenVoiceOS/ovos-persona-server

Length of output: 2153


🏁 Script executed:

# Check what the /tags endpoint returns to understand if show and ps reuse it
rg -B 5 -A 20 "@ollama_router.get\(\"/tags\"\)" ovos_persona_server/

Repository: OpenVoiceOS/ovos-persona-server

Length of output: 57


🏁 Script executed:

# Check the current decorator for /show
rg -B 2 "async def show" ovos_persona_server/

Repository: OpenVoiceOS/ovos-persona-server

Length of output: 274


🏁 Script executed:

# Check the /tags endpoint to understand the reuse claim
rg -B 5 -A 25 "async def tags" ovos_persona_server/ollama.py

Repository: OpenVoiceOS/ovos-persona-server

Length of output: 1541


Change /show from GET to POST and add missing Ollama API fields to both endpoints.

The /show endpoint is incorrectly decorated with @ollama_router.get but should be POST per Ollama's specification. Additionally, both /show and /ps are using the same model shape as /tags, which is incompatible with Ollama's actual API contracts:

  • /show response must include parameters, license, template, modelfile, and capabilities fields
  • /ps response must include runtime fields: expires_at, size_vram, and context_length

The current OllamaModel and OllamaModelDetails classes lack these fields, so Ollama-compatible clients will break on these endpoints.

🧰 Tools
🪛 Ruff (0.15.6)

[warning] 377-377: Do not perform function call Depends in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable

(B008)


[warning] 411-411: Do not perform function call Depends in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable

(B008)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ovos_persona_server/ollama.py` around lines 376 - 441, The /show handler is
incorrectly a GET and both /show and /ps are missing Ollama-required fields;
change the decorator on the show function from `@ollama_router.get` to
`@ollama_router.post`, extend the OllamaModel and OllamaModelDetails dataclasses
to include the missing fields (for show: parameters, license, template,
modelfile, capabilities; for ps/details/runtime: expires_at, size_vram,
context_length or include a runtime sub-object returned by ps), populate those
fields with sensible defaults or derived values in show() and ps() (use
persona.name, parent_model_str, timestamp() for expires_at or null semantics)
and ensure model.model_dump() includes the new fields so the /show and /ps
responses match Ollama's API contract (refer to functions show, ps and classes
OllamaModel, OllamaModelDetails).
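As a sketch of the extended /api/show payload (field names follow the Ollama spec quoted above; all values here are placeholders a real handler would derive from persona config or model metadata):

```python
from datetime import datetime, timezone

def build_show_response(name: str) -> dict:
    """Assemble an /api/show-shaped payload including the fields the
    current OllamaModel schema lacks."""
    return {
        "modelfile": f"# stub modelfile for {name}",
        "parameters": "",
        "template": "",
        "license": "",
        "capabilities": ["completion"],
        "modified_at": datetime.now(timezone.utc).isoformat(),
        "details": {
            "parent_model": "",
            "format": "json",
            "family": "ovos-persona",
            "families": [],
            "parameter_size": "",
            "quantization_level": "",
        },
    }
```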

Comment on lines +486 to +521
@ollama_router.post("/embeddings")
async def ollama_embeddings(
    request_body: OllamaEmbedRequest,
    persona: Persona = Depends(get_default_persona),
) -> JSONResponse:
    """Generate embeddings (Ollama-compatible stub).

    Delegates to persona's embeddings solver if available.

    Args:
        request_body: Ollama embed request with model and input.
        persona: Injected persona instance.

    Returns:
        Ollama-format embeddings response or 501 if not supported.

    Raises:
        HTTPException: 501 if no embeddings solver is configured.
    """
    solver_names = list(persona.solvers.loaded_modules.keys())
    embed_solver = None
    for name in solver_names:
        solver = persona.solvers.loaded_modules[name]
        if hasattr(solver, "get_embeddings"):
            embed_solver = solver
            break

    if embed_solver is None:
        raise HTTPException(
            status_code=status.HTTP_501_NOT_IMPLEMENTED,
            detail="No embeddings solver configured for this persona.",
        )

    input_text = request_body.input if isinstance(request_body.input, str) else " ".join(request_body.input)
    vec = embed_solver.get_embeddings(input_text)
    return JSONResponse(OllamaEmbedResponse(embeddings=[vec]).model_dump())
⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cat -n ovos_persona_server/ollama.py | sed -n '1,50p'

Repository: OpenVoiceOS/ovos-persona-server

Length of output: 1814


🏁 Script executed:

# Find the data model definitions for OllamaEmbedRequest and OllamaEmbedResponse
rg -n "class OllamaEmbedRequest|class OllamaEmbedResponse" ovos_persona_server/

Repository: OpenVoiceOS/ovos-persona-server

Length of output: 233


🏁 Script executed:

# Get broader context of the ollama.py file
wc -l ovos_persona_server/ollama.py

Repository: OpenVoiceOS/ovos-persona-server

Length of output: 108


🌐 Web query:

Ollama API embeddings endpoint POST /api/embed batch input response format

💡 Result:

The Ollama embeddings endpoint is POST /api/embed (the deprecated /api/embeddings uses "prompt" instead of "input"). The request is a JSON object with "model": string (required), "input": a string or an array of strings (array for batch), plus optional "options", "truncate", and "keep_alive". Example request: { "model": "embeddinggemma", "input": ["First sentence", "Second sentence", "Third sentence"] }. The response is a JSON object with "model": string, "embeddings": an array of arrays (one float array per input string), and timings such as "total_duration" and "load_duration", e.g. { "model": "embeddinggemma", "embeddings": [[0.010071029, -0.0017594862, ...], [...]], "total_duration": 14143917, ... }. Official docs confirm /api/embed supports batch via an "input" array, returning a parallel embeddings array. Early proposals for the deprecated /api/embeddings used "prompt_batch" / "embedding_batch", but the current standard is /api/embed.


🏁 Script executed:

cat -n ovos_persona_server/schemas/ollama.py | sed -n '204,240p'

Repository: OpenVoiceOS/ovos-persona-server

Length of output: 3174


Process each input separately to preserve batch semantics.

The Ollama embeddings endpoint accepts input as either a string or array and returns one embedding per input item. The current code joins array inputs into a single string, causing it to return one embedding instead of a per-item embedding array. This breaks compatibility with batch callers and changes both output cardinality and semantics.

Handle string and array inputs separately: when input is an array, call get_embeddings() for each item and collect the results into a list, matching the expected OllamaEmbedResponse structure of embeddings: List[List[float]] (one embedding vector per input).
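The string-vs-array handling described above can be factored into a small helper. A minimal sketch, assuming the solver exposes `get_embeddings(text) -> List[float]` as the surrounding handler does; the helper name is hypothetical:

```python
from typing import Callable, List, Union


def embed_batch(
    input_data: Union[str, List[str]],
    get_embeddings: Callable[[str], List[float]],
) -> List[List[float]]:
    """Return one embedding vector per input item.

    Mirrors the Ollama /api/embed contract: a single string yields one
    vector wrapped in a list; a list of strings yields one vector per
    element, preserving input order.
    """
    if isinstance(input_data, str):
        return [get_embeddings(input_data)]
    return [get_embeddings(item) for item in input_data]
```

With a stub solver, `embed_batch(["a", "bb"], lambda t: [float(len(t))])` yields `[[1.0], [2.0]]`, matching the `embeddings: List[List[float]]` shape of `OllamaEmbedResponse`.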

🧰 Tools
🪛 Ruff (0.15.6)

[warning] 489-489: Do not perform function call Depends in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable

(B008)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ovos_persona_server/ollama.py` around lines 486 - 521, in ollama_embeddings,
don't join list inputs into one string; instead detect if request_body.input is
a list and for each item call the chosen embed_solver.get_embeddings(item) to
produce a list of vectors, preserving input order and returning
OllamaEmbedResponse(embeddings=[...]) where embeddings is a List[List[float]]
(for a single string input still wrap the single vector in a list). Update the
branch that currently sets input_text and calls embed_solver.get_embeddings to
handle both cases (string -> single vector wrapped in list; list -> map/loop
calling get_embeddings per element), keeping the rest of the logic (solver
selection via persona.solvers.loaded_modules) unchanged.

Comment on lines +519 to +521
    input_text = request_body.input if isinstance(request_body.input, str) else " ".join(request_body.input)
    vec = embed_solver.get_embeddings(input_text)
    return JSONResponse(OllamaEmbedResponse(embeddings=[vec]).model_dump())
⚠️ Potential issue | 🔴 Critical

Populate the required model field before returning.

Line 521 constructs OllamaEmbedResponse with only embeddings, but model is required by the schema. Any successful embeddings request will fail validation on the happy path.

🐛 Proposed fix
-    vec = embed_solver.get_embeddings(input_text)
-    return JSONResponse(OllamaEmbedResponse(embeddings=[vec]).model_dump())
+    vec = embed_solver.get_embeddings(input_text)
+    return JSONResponse(
+        OllamaEmbedResponse(model=persona.name, embeddings=[vec]).model_dump()
+    )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ovos_persona_server/ollama.py` around lines 519 - 521, the response is
missing the required model field in OllamaEmbedResponse; set the model to the
identifier used by the embedder before returning — e.g., after computing vec
from embed_solver.get_embeddings(input_text) construct OllamaEmbedResponse with
embeddings=[vec] and model=embed_solver.model (or embed_solver.model_name if
that property is used by your embed_solver), falling back to a sensible string
if neither exists; update the return in the block that builds input_text, calls
embed_solver.get_embeddings and returns JSONResponse so schema validation
passes.

Comment on lines +22 to +26
dependencies = [
    "fastapi",
    "ovos-persona",
    "pydantic",
]
⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify uvicorn is used in the codebase
rg -n "import uvicorn|from uvicorn" ovos_persona_server/

Repository: OpenVoiceOS/ovos-persona-server

Length of output: 124


🏁 Script executed:

# Check __main__.py to see how uvicorn is used
cat -n ovos_persona_server/__main__.py

Repository: OpenVoiceOS/ovos-persona-server

Length of output: 1655


🏁 Script executed:

# Check the full pyproject.toml for uvicorn in all sections
cat -n pyproject.toml

Repository: OpenVoiceOS/ovos-persona-server

Length of output: 1511


Add uvicorn to runtime dependencies.

The console script entry point (ovos-persona-server) calls main() in __main__.py, which imports and uses uvicorn.run() at line 38, but uvicorn is not listed in the dependencies.

Fix
 dependencies = [
     "fastapi",
     "ovos-persona",
     "pydantic",
+    "uvicorn",
 ]
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
dependencies = [
    "fastapi",
    "ovos-persona",
    "pydantic",
]
dependencies = [
    "fastapi",
    "ovos-persona",
    "pydantic",
    "uvicorn",
]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pyproject.toml` around lines 22 - 26, the package runtime is missing uvicorn,
which is imported/used by the console entry point; add "uvicorn" to the
dependencies array in pyproject.toml so the runtime dependency is installed.
Specifically, update the dependencies list that currently contains "fastapi",
"ovos-persona", and "pydantic" to include "uvicorn" so that __main__.py's main()
(which calls uvicorn.run()) and the ovos-persona-server entry point work
correctly.
