Skip to content

Conversation

picnixz
Copy link
Member

@picnixz picnixz commented Sep 22, 2025

@hartwork I observed that the maximum allocation factor is only checked if we exceed the activation threshold. The online docs for Expat didn't explicitly mention it so I mentioned it in our docs.

I eventually decided to redo the float-check in case of an error so that we can give a better message to the user (as it relates to security practices, I think it's good to explain why the API call failed as much as we can).


📚 Documentation preview 📚: https://cpython-previews--139234.org.readthedocs.build/

@picnixz picnixz force-pushed the feat/xml/mitigation-api-90949 branch from 3fa5b10 to 9c7371f Compare September 22, 2025 15:23
@picnixz
Copy link
Member Author

picnixz commented Sep 22, 2025

I plan to also expose the billion laugh mitigation API, but in a separate PR.

@hartwork
Copy link
Contributor

@picnixz thanks for taking on this project — very much appreciated! 🙏

@hartwork I observed that the maximum allocation factor is only checked if we exceed the activation threshold. The online docs for Expat didn't explicitly mention it so I mentioned it in our docs.

@picnixz there is mention of the activation threshold in the final note of https://libexpat.github.io/doc/api/latest/#XML_SetAllocTrackerMaximumAmplification but I'm happy if the topic gets more attention in CPython docs.

I already started review and spotted a few things. I will likely submit a first round of review some time later today.

@picnixz
Copy link
Member Author

picnixz commented Sep 22, 2025

there is mention of the activation threshold in the final note of

I assume you are actually talking about this:

So if you do reduce the maximum allowed amplification, please make sure that the activation threshold is still big enough to not end up with undesired false positives (i.e. benign files being rejected).

I actually understood it only as "if you use a smaller factor then in practice, you might want to change the activation threshold" but I didn't understand it as "the amplification factor is only used if we exceed the threshold in the first place". I assumed from the note on XML_SetAllocTrackerActivationThreshold that the two values were totally independent and could have two different protections:

Note: For types of allocations that intentionally bypass tracking and limiting, please see XML_SetAllocTrackerMaximumAmplification above.

@picnixz
Copy link
Member Author

picnixz commented Sep 22, 2025

I already started review and spotted a few things. I will likely submit a first round of review some time later today.

Thank you! I'll hold the work on the billion laugh API as it'll be easier to add the API once this one is merged.

@picnixz picnixz force-pushed the feat/xml/mitigation-api-90949 branch from 4ba478d to d636685 Compare September 22, 2025 17:12
Copy link
Contributor

@hartwork hartwork left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@picnixz here's everything I found so far. Happy to learn what I missed in these findings 🍻

}
#endif

#if XML_COMBINED_VERSION >= 20702
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand that Expat >=2.7.2 is needed to offer this functionality, but maybe the the Python API should be available in any case and raise exceptions that the compile time Expat was not recent enough to support this feature. Is that what's happening? I remember we had this very question when introducing SetReparseDeferralEnabled.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Argument Clinic will handle this automatically by "not" having the function (an AttributeError will be raised). For SetReparseDeferralEnabled, it appears that the API call will be a no-op otherwise.

Maybe it's better to raise NotImplementedError instead of the default AttributeError, as we usually do in ssl when the libssl backend is not the latest.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@picnixz I'd like to vote for NotImplementedError or no-op rather than AttributeError for now. I'll need to sleep over this and potentially dig up past communication on the reparse deferall thing. There was past rationale on this somewhere.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer a NotImplementedError over a no-op, as otherwise people might assume that they are protected against some attacks while they are not.

Copy link
Contributor

@hartwork hartwork Sep 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@picnixz I should note that protection is active by default (or it would not have fixed the vulnerability). So a no-op in the tuning function would leave them protected under default settings, not unprotected.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the #error approach. It simplifies the checks. I think I'll do a separate PR for that one.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@picnixz please feel free to CC me for review there 👍

Copy link
Member Author

@picnixz picnixz Sep 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, if I include expat_config.h before, those symbols will anyway be defined (unless someone changes expat_config.h). But I don't think I need to assume that expat_config.h should be changed. Until now, it was not even used... (it wasn't included). I prefer not to overcomplicate the file (but at least now I know that it should be fine)

Copy link
Contributor

@hartwork hartwork Sep 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@picnixz file expat_config.h is auto-generated by the Expat build system (if systemwide Expat is used) and hardcoded by CPython when the bundle is used. So in the former case, XML_DTD and XML_GE could be defined or not: it's under the sysadmin's control.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh thanks for the explanation. I thought it was a python-only file.

@hartwork
Copy link
Contributor

I assume you are actually talking about this:

So if you do reduce the maximum allowed amplification, please make sure that the activation threshold is still big enough to not end up with undesired false positives (i.e. benign files being rejected).

@picnixz I confirm, yes, exactly 👍

I actually understood it only as "if you use a smaller factor then in practice, you might want to change the activation threshold" but I didn't understand it as "the amplification factor is only used if we exceed the threshold in the first place". I assumed from the note on XML_SetAllocTrackerActivationThreshold that the two values were totally independent and could have two different protections:

I see. I think that means that upstream docs should be adjusted some way to make it easier to read that (or them) as interconnected. How about I try a related pull request upstream and you take the review seat there?

I'll hold the work on the billion laugh API as it'll be easier to add the API once this one is merged.

Understood, good plan 👍

@hartwork
Copy link
Contributor

@picnixz here's what I had before for a 3.13 version making up for PyUnicodeWriter_Format, in case we need something like it again later elsewhere:

static PyObject *
format_xml_error(enum XML_Error code, XML_Size lineno, XML_Size column)
{
    const char *errmsg = XML_ErrorString(code);
    char asciiBuffer[512] = "";
    const int formatRet = snprintf(asciiBuffer, sizeof(asciiBuffer),
             "%s: line %zu, column %zu",
            errmsg, (size_t)lineno, (size_t)column);
    if (formatRet < 0 || formatRet >= (int)sizeof(asciiBuffer)) {
        return NULL;
    }

    _PyUnicodeWriter writer;
    _PyUnicodeWriter_Init(&writer);
    writer.overallocate = 1;

    if (_PyUnicodeWriter_WriteASCIIString(&writer, asciiBuffer,
                                          strlen(asciiBuffer)) < 0)
    {
        _PyUnicodeWriter_Dealloc(&writer);
        return NULL;
    }

    return _PyUnicodeWriter_Finish(&writer);
}

@bedevere-app
Copy link

bedevere-app bot commented Sep 26, 2025

GH-139367 is a backport of this pull request to the 3.13 branch.

@bedevere-app bedevere-app bot removed the needs backport to 3.13 bugs and security fixes label Sep 26, 2025
@hartwork
Copy link
Contributor

return PyUnicode_FromFormat("%s: line %zu, column %zu", message, line, column);

@picnixz thanks, applied in #139367 with size_t casts. What do you think?

picnixz added a commit that referenced this pull request Sep 26, 2025
)

Fix some typos left in f04bea4,
and simplify some internal functions to ease maintenance of future
mitigation APIs.
hartwork pushed a commit to hartwork/cpython that referenced this pull request Sep 26, 2025
…on API (python#139366)

Fix some typos left in f04bea4,
and simplify some internal functions to ease maintenance of future
mitigation APIs.

(cherry picked from commit 68a1778)
picnixz added a commit to picnixz/cpython that referenced this pull request Sep 28, 2025
…on API (python#139366)

Fix some typos left in f04bea4,
and simplify some internal functions to ease maintenance of future
mitigation APIs.
@hartwork
Copy link
Contributor

@picnixz I went to past related pull requests…

…to look for open questions or things that may be missing still here. (That exposed the missing version checks in the new tests as already commented above.) And it gave me these three open questions:

  • Do the sax and etree parsers using Expat need new knobs for this and if so, does it need to be in here? The related classes are:

    • xml.etree.ElementTree.XMLParser
    • xml.etree.ElementTree.XMLPullParser
    • xml.sax.expatreader.ExpatParser
  • Is extending the docs for CVE-2025-59375 needed and if so, does it have to be in here?

    • Should the docs suggest getattr hasattr for the new method added and should it be done in here?

What do you think?

@picnixz any ideas about what to do about xml.sax and xml.etree?

Regarding hasattr, I can look into a pull request like #116278 if you want (or be in the review seat)?

@picnixz
Copy link
Member Author

picnixz commented Sep 28, 2025

any ideas about what to do about xml.sax and xml.etree?

I think we could just add a method for tuning the protections? maybe a single method for all protections? or a configuration object for that? I don't really know which one would be the best here.

@picnixz
Copy link
Member Author

picnixz commented Sep 28, 2025

Note: I think having the user calling parser.<method-with-a-really-long-name> wouldn't really be "pretty". I'm sad that we can't have some parameter called tune_lolz_protection() (the most pythonic I can think of is set_thresholds(expansion=..., memory=...) so that users don't need to remember very long method names.

@hartwork
Copy link
Contributor

@picnixz it's a pity that etree and sax are not reusing pyexpat internally (as can be seen in Modules/_elementtree.c). Or else we could just ask for the underlying parser, give the user access, and we'd already be done.

@picnixz
Copy link
Member Author

picnixz commented Sep 28, 2025

it's a pity that etree and sax are not reusing pyexpat internally

But they do I think? at least there is some ExpatReader in the sax module and I think there is some import pyexpat in etree. Or did I misunderstand?

@hartwork
Copy link
Contributor

@picnixz maybe we need to look at sax and etree separately. Other than sax, etree has a C and a pure-Python version of the same module that — I think — need to satisfy the same interface. The C version of etree does not seem to use xmlparseobject of pyexpat internally (despite the #include "pyexpat.h"). I'm only vaguely familiar with this territory. Past related pull request #115623 showcases addition of the same feature to both the C and the pure-Python version.

@picnixz
Copy link
Member Author

picnixz commented Sep 28, 2025

Indeed, for etree, we have a pure Python implementation and a C implementation. The C implementation directly uses pyexpat, but the pure Python one doesn't. Implementing the logic for the pure Python interface seems an overkill. It would essentially ask to mimic the allocation limitations and we would then need to compute ourselves the size of objects being allocated, which is not realistic for the pure Python implementation.

I would instead suggest that we don't do anything for this one, and that by default, it's left untouched (namely make those methods a no-op). For the C implementation, we would add new methods at the level of _elementree.XMLParser. Note that we use libexpat by using the C API that is exposed by the pyexpat (C) module.

For SAX, it's also a pure Python implementation (no C), so I suggest leaving the API empty as well. The pure Python implementation however seems to only use pyexpat (see https://github.com/python/cpython/blob/main/Lib/xml/sax/expatreader.py#L284-L295) so we could directly use the underlying C API as well (I don't think it was needed to re-implement the logic in pure Python since, AFAICT, the parser was always C-based for SAX).

I think checking for the existence of the underlying C API through hasattr would be the best.

hartwork pushed a commit to hartwork/cpython that referenced this pull request Oct 2, 2025
CVE-2025-59375) (pythonGH-139234)

Expose the XML Expat 2.7.2 mitigation APIs to disallow use of
disproportional amounts of dynamic memory from within an Expat
parser (see CVE-2025-59375 for instance).

The exposed APIs are available on Expat parsers, that is,
parsers created by `xml.parsers.expat.ParserCreate()`, as:

- `parser.SetAllocTrackerActivationThreshold(threshold)`, and
- `parser.SetAllocTrackerMaximumAmplification(max_factor)`.
(cherry picked from commit f04bea4)

Co-authored-by: Bénédikt Tran <[email protected]>
hartwork pushed a commit to hartwork/cpython that referenced this pull request Oct 2, 2025
…on API (python#139366)

Fix some typos left in f04bea4,
and simplify some internal functions to ease maintenance of future
mitigation APIs.

(cherry picked from commit 68a1778)
@bedevere-app
Copy link

bedevere-app bot commented Oct 2, 2025

GH-139527 is a backport of this pull request to the 3.12 branch.

@bedevere-app bedevere-app bot removed the needs backport to 3.12 only security fixes label Oct 2, 2025
hartwork pushed a commit to hartwork/cpython that referenced this pull request Oct 2, 2025
…2025-59375) (python#139234)

Expose the XML Expat 2.7.2 mitigation APIs to disallow use of
disproportional amounts of dynamic memory from within an Expat
parser (see CVE-2025-59375 for instance).

The exposed APIs are available on Expat parsers, that is,
parsers created by `xml.parsers.expat.ParserCreate()`, as:

- `parser.SetAllocTrackerActivationThreshold(threshold)`, and
- `parser.SetAllocTrackerMaximumAmplification(max_factor)`.

(cherry picked from commit f04bea4)
hartwork pushed a commit to hartwork/cpython that referenced this pull request Oct 2, 2025
…on API (python#139366)

Fix some typos left in f04bea4,
and simplify some internal functions to ease maintenance of future
mitigation APIs.

(cherry picked from commit 68a1778)
@bedevere-app
Copy link

bedevere-app bot commented Oct 2, 2025

GH-139529 is a backport of this pull request to the 3.11 branch.

@bedevere-app bedevere-app bot removed the needs backport to 3.11 only security fixes label Oct 2, 2025
hartwork pushed a commit to hartwork/cpython that referenced this pull request Oct 3, 2025
…2025-59375) (python#139234)

Expose the XML Expat 2.7.2 mitigation APIs to disallow use of
disproportional amounts of dynamic memory from within an Expat
parser (see CVE-2025-59375 for instance).

The exposed APIs are available on Expat parsers, that is,
parsers created by `xml.parsers.expat.ParserCreate()`, as:

- `parser.SetAllocTrackerActivationThreshold(threshold)`, and
- `parser.SetAllocTrackerMaximumAmplification(max_factor)`.

(cherry picked from commit f04bea4)
hartwork pushed a commit to hartwork/cpython that referenced this pull request Oct 3, 2025
…on API (python#139366)

Fix some typos left in f04bea4,
and simplify some internal functions to ease maintenance of future
mitigation APIs.

(cherry picked from commit 68a1778)
@bedevere-app
Copy link

bedevere-app bot commented Oct 3, 2025

GH-139532 is a backport of this pull request to the 3.10 branch.

@bedevere-app bedevere-app bot removed the needs backport to 3.10 only security fixes label Oct 3, 2025
picnixz pushed a commit that referenced this pull request Oct 4, 2025
…139234 (#139558)

Fix a compiler warning `-Wunused-function` after f04bea4.

The `set_invalid_arg` function in `Modules/pyexpat.c` may be unused if the underlying Expat
version is less than 2.4.0.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants