Skip to content

Conversation

junrae6454
Copy link
Contributor

What does this PR do?

  • This PR modifies the apply_chat_template function to improve handling of non-ASCII characters in JSON output by setting ensure_ascii to False. This change ensures that characters such as "안녕?" and emojis are correctly rendered in their original form rather than as escaped sequences.
  • In the original tojson implementation of Jinja2, the characters "<", ">", "&", and "'" are escaped to ensure safe HTML output. However, in apply_chat_template, we use json.dumps directly to preserve the original content as much as possible.
    jinja2 default tojson

relative issue

#31041
#31079

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@CISC
Copy link
Contributor

CISC commented May 29, 2024

Yes, this makes sense, didn't catch the escaping of HTML characters.

May I suggest also removing the sort_keys parameter so it goes back to the default of not sorting keys, this way the caller has control over the order, which can improve inference.

@LysandreJik LysandreJik requested a review from Rocketknight1 May 31, 2024 10:29
@junrae6454
Copy link
Contributor Author

@Rocketknight1
When you have a moment, could you please review my latest pull request? Your feedback would be greatly appreciated.

@Rocketknight1
Copy link
Member

Sorry for the delay! This makes sense to me, however there is a risk of affecting other Jinja implementations. @xenova how does huggingface/jinja deal with |tojson?

@CISC
Copy link
Contributor

CISC commented Jun 12, 2024

@Rocketknight1 AFAICT huggingface.js/jinja doesn't support tojson filter.

@xenova
Copy link
Contributor

xenova commented Jun 12, 2024

No, not yet, but it should be really easy to implement!

@Rocketknight1
Copy link
Member

Okay! In that case, pinging @amyeroberts for core maintainer review.

Brief summary: We often want to flatten lists/dicts into a string in a chat template, especially when it deals with tools. Jinja was originally designed for templating in webpages, and as a result it escapes special HTML characters by default, but this is not what we want in a chat template. This PR swaps the default tojson filter in Jinja for Python's json.dumps() function, which doesn't have this awkward feature.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Copy link
Contributor

@amyeroberts amyeroberts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM - thanks for updating!

@Rocketknight1
Copy link
Member

@junrae6454 Thanks for the PR! I made one small change - I moved the lambda function out into a proper function, and exposed a couple of the other arguments, as I feel like they might be useful later for rendering JSON.

Will merge once tests pass

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants