INTPYTHON-676: Adding security and optimization to cache collections #343

Closed · wants to merge 6 commits

Changes from 3 commits
90 changes: 73 additions & 17 deletions django_mongodb_backend/cache.py
@@ -1,33 +1,61 @@
import pickle
from datetime import datetime, timezone
from hashlib import blake2b
from typing import Any, Optional, Tuple

from django.core.cache.backends.base import DEFAULT_TIMEOUT, BaseCache
from django.core.cache.backends.db import Options
from django.core.exceptions import SuspiciousOperation
from django.db import connections, router
from django.utils.functional import cached_property
from pymongo import ASCENDING, DESCENDING, IndexModel, ReturnDocument
from pymongo.errors import DuplicateKeyError, OperationFailure
from django.conf import settings


class MongoSerializer:
def __init__(self, protocol=None):
def __init__(self, protocol=None, signer=None):
self.protocol = pickle.HIGHEST_PROTOCOL if protocol is None else protocol
self.signer = signer

def dumps(self, obj):
# For better incr() and decr() atomicity, don't pickle integers.
# Using type() rather than isinstance() matches only integers and not
# subclasses like bool.
if type(obj) is int: # noqa: E721
return obj
return pickle.dumps(obj, self.protocol)
def _get_signature(self, data) -> Optional[bytes]:
Collaborator:

We currently don't have any type hints in this project (Django hasn't adopted them), so it's a bit out of place to add them in this PR.

Author:

I understand that; Django was primarily written prior to type hinting's standardization. I personally feel type hints are required here since I am breaking from the standard APIs for some functions. It felt more correct to add them everywhere than to add them only to certain functions. I can remove them, but I am worried it will reduce readability and maintainability.

if self.signer is None:
return None
s = self.signer.copy()
s.update(data)
return s.digest()

def loads(self, data):
try:
return int(data)
except (ValueError, TypeError):
return pickle.loads(data) # noqa: S301
def _get_pickled(self, obj: Any) -> bytes:
return pickle.dumps(obj, protocol=self.protocol) # noqa: S301

def dumps(self, obj) -> Tuple[Any, bool, Optional[str]]:
# Serialize the object to a format suitable for MongoDB storage.
# The return value is a tuple of (data, pickled, signature).
match obj:
case int() | str() | bytes():
return (obj, False, None)
Comment on lines +37 to +38

Collaborator:

I think this is what you refer to as "optimization". Any optimization should be done separately from security hardening to make each change clearer. However, I wonder if the benefit of not serializing str/bytes is mostly for the signing case. While it helps with CPU, it adds the extra "pickled" attribute for all keys in the cache. It might be better to limit this optimization to a new MongoSignedSerializer class and leave MongoSerializer unchanged. (As our code strays more and more from tried and tested patterns in Django, I feel less confident in its robustness. In this case, MongoSerializer is copied from django.core.cache.backends.redis.RedisSerializer. We have to ask why Django didn't make this decision and whether it is really better.)

Author:

In my testing, unpickled str and bytes are 4x faster during dumps and 20x faster during loads. We would see similar performance increases for dict, list, and tuple, but I found it slower to check that each element is JSON/BSON serializable than to simply pickle and sign them.

This is one of the optimizations that I made; however, it is the least impactful overall. The majority of the speed increase comes from switching to directly using Python's blake2b algorithm over Django's django.core.signing implementation. django.core.signing is slow by comparison, partly due to the algorithm implementation and partly because of string concatenation between the HMAC hash and the algorithm choice.

The pickled field adds 10 bytes to every cache request. I personally do not believe it to be a significant overhead for the performance difference, but if we change the field name to simply p we can reduce it to a mere 4 bytes if that is more palatable.

While I would normally agree that copying existing, tested code is better, I believe the redis code is severely limited by a traditional relational-database thought process. I do not believe it takes advantage of the inherent performance boosts we can gain from MongoDB's document model. While there are other performance limitations the Django API forces on us (colocation of cached headers and page data), this one is easy to remediate because all code paths are within our own code.
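For context, the kind of measurement the author describes can be reproduced with a rough sketch like the following. The payload size and iteration count are illustrative, not the PR's actual benchmark; it only shows why skipping pickling for str values saves CPU.

```python
import pickle
import timeit

# A 1 KiB string payload, mirroring the PR's optimization of storing
# str/bytes/int values raw instead of pickling them.
payload = "x" * 1024

def roundtrip_pickled():
    # What the unoptimized serializer does: dumps then loads.
    data = pickle.dumps(payload, protocol=pickle.HIGHEST_PROTOCOL)
    return pickle.loads(data)

def roundtrip_raw():
    # What the optimized path does for str: store and return as-is.
    return payload

pickled_t = timeit.timeit(roundtrip_pickled, number=10_000)
raw_t = timeit.timeit(roundtrip_raw, number=10_000)
print(f"pickled: {pickled_t:.4f}s  raw: {raw_t:.4f}s")
```

Exact ratios vary by machine and payload, which is presumably why the author reports ranges (4x for dumps, 20x for loads) rather than fixed numbers.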

case _:
pickled_data = self._get_pickled(obj)
return (pickled_data, True, self._get_signature(pickled_data) if self.signer else None)

def loads(self, data: Any, pickled: bool, signature: Optional[bytes] = None) -> Any:
if pickled:
try:
if self.signer is not None:
# constant time compare is not required due to how data is retrieved
Collaborator:

I'm familiar with the usage of django.utils.crypto.constant_time_compare() but I'm not sure what "how data is retrieved" means in this context.

Author:

The threat model for signing data is based on the assumption that a malicious actor has gained write access to the cache collection, but no access (or very very low privilege) to the server Django is running on.
A constant time comparison would prevent the actor from determining the correct hmac hash by measuring the length of time before an error is thrown in a side channel attack. It is my personal opinion that a constant time comparison here is not required due to a minimum of two network requests (client -> server -> database) which should introduce enough entropy to make a short circuit comparison sufficient.
During my testing and profiling of potential solutions, I found that use of constant_time_compare introduced a significant amount of latency, which is what prompted the removal.

If you feel like a side channel attack is a potential issue for this threat model, I am willing to reintroduce the constant time comparison.

Reviewer:

I would suggest (if only to avoid having to reason about these subtleties in the future) that you use the constant time comparison always when comparing cryptographic hashes. It is theoretically possible that using a non-constant time comparison for equality might allow someone to more easily craft a valid HMAC for a maliciously crafted payload.

Author:

Originally, I opted not to use django.utils.crypto.constant_time_compare() since it was roughly 4.5x slower than the short-circuit comparison ==. I just reran the tests and found the short-circuit comparison to have roughly a 5ns difference between matches and non-matches. Personally, I don't think that is a measurable speed difference when there are multiple network requests occurring and adding entropy, but I understand that I could be missing a potential threat model for this attack.

That said, I did also benchmark hmac.compare_digest, which came out 3x faster than Django's implementation. Since the main concern about these changes is performance, I think hmac's implementation is a good alternative and provides the side-channel attack safety that is currently lacking.
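The verification pattern under discussion can be sketched as follows. The key and payload values are illustrative; the point is that hmac.compare_digest() takes time dependent only on the length of its inputs, not on where the first differing byte occurs.

```python
import hmac
from hashlib import blake2b

# Illustrative key and payload, standing in for the cache's signing key
# and a pickled cache entry.
key = b"illustrative-signing-key"
data = b"pickled-cache-entry"

def sign(payload: bytes) -> bytes:
    # Keyed blake2b, as the PR uses for cache entry signatures.
    return blake2b(payload, key=key).digest()

good = sign(data)
tampered = sign(b"tampered-entry")

# Constant-time comparison: no timing side channel leaks the position
# of the first mismatching byte.
assert hmac.compare_digest(good, sign(data))
assert not hmac.compare_digest(tampered, good)
```

A short-circuit `==` would behave identically for correctness; the difference is only in timing behavior under an attacker who can measure verification latency.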

Collaborator:

On a related note, Django's constant_time_compare() is an alias of secrets.compare_digest() (which is an alias of hmac.compare_digest()) except that Django's version coerces the arguments using force_bytes(). This might be an artifact of the Python 2/3 transition. No Django tests fail when removing those force_bytes() calls. I'll suggest that Django deprecates constant_time_compare() in favor of hmac.compare_digest().

Author:

I implemented constant time compare using hmac.compare_digest. @timgraham, would you rather it be done with Django's implementation?

Collaborator:

No, my point was that Django's implementation is just an alias of hmac.compare_digest(). (Frankly, if the database is compromised, I think timing attacks would be the least of the problems.)

if signature and (signature == self._get_signature(data)):
return pickle.loads(data) # noqa: S301
else:
raise SuspiciousOperation("Pickled cache data has a missing or invalid signature")
else:
return pickle.loads(data)
except (ValueError, TypeError):
# ValueError: Invalid signature
# TypeError: Data wasn't a byte string
raise SuspiciousOperation(f'Invalid pickle signature: {{"signature": {signature}, "data": {data}}}')
else:
return data
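Taken together, dumps() and loads() implement a sign-then-verify round trip: primitives pass through unpickled, everything else is pickled and signed, and unpickling only happens after the signature checks out. A self-contained sketch of that contract (the key, salt, and person values are illustrative; the PR wires the signer from backend params):

```python
import hmac
import pickle
from hashlib import blake2b

# Illustrative signer: blake2b allows keys up to 64 bytes, salts up to
# 16 bytes, and personalization strings up to 16 bytes.
signer = blake2b(key=b"0" * 64, salt=b"cache-salt", person=b"cache")

def dumps(obj):
    # Primitives are stored as-is; everything else is pickled and signed.
    if type(obj) in (int, bool, str, bytes):
        return obj, False, None
    data = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    s = signer.copy()
    s.update(data)
    return data, True, s.digest()

def loads(data, pickled, signature):
    if not pickled:
        return data
    s = signer.copy()
    s.update(data)
    if signature is None or not hmac.compare_digest(signature, s.digest()):
        raise ValueError("pickled cache data has a missing or invalid signature")
    return pickle.loads(data)  # only reached after the signature verifies

value, pickled, sig = dumps({"a": 1})
assert loads(value, pickled, sig) == {"a": 1}
assert dumps("plain string") == ("plain string", False, None)
```

The essential property is that pickle.loads() is never reached for data whose signature fails, which is what contains the pickle-deserialization RCE risk discussed in this thread.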

class MongoDBCache(BaseCache):
pickle_protocol = pickle.HIGHEST_PROTOCOL

@@ -39,6 +67,17 @@ class CacheEntry:
_meta = Options(collection_name)

self.cache_model_class = CacheEntry
self._sign_cache = params.get("ENABLE_SIGNING", True)

self._key = params.get("KEY", settings.SECRET_KEY[:64])
if len(self._key) == 0:
self._key = settings.SECRET_KEY[:64]

Reviewer:

A bit of a subtle thing here. If I'm understanding correctly, SECRET_KEY comes from Django and is a value that is used in various places around Django for cryptographic operations. Since it is used in potentially other cryptographic operations, there are (admittedly, theoretical) "key-reuse" vulnerabilities when you use it as the key for different cryptographic operations (e.g. HMAC and whatever other usecase Django is already using it for).

The most correct thing to do is to derive a key from SECRET_KEY using a key derivation function. Python's hashlib contains pbkdf2_hmac which can be configured to do this for you.

PBKDF2 asks for an "iterations" parameter. If you look it up online, the conventional wisdom is to use a very large number of iterations, because PBKDF2 is generally used for converting human-chosen passwords (which have low entropy) into keys. However, if I'm right in assuming that SECRET_KEY is already a cryptographically secure random string (i.e. has high entropy), then a single iteration (iterations = 1) is sufficient to derive a strong key.

This might look something like this:

from hashlib import pbkdf2_hmac
purpose = b'mongodbcachekey'  # the salt doubles as a domain-separation label
iterations = 1  # SECRET_KEY is already high-entropy
# pbkdf2_hmac() requires bytes, so the str SECRET_KEY must be encoded first
self._key = pbkdf2_hmac('sha256', settings.SECRET_KEY[:64].encode(), purpose, iterations)

Collaborator:

@ZacharyEspiritu Interesting suggestion about SECRET_KEY! That's probably something we can do in the project template.

Author:

Interesting, I thought this might be solved by setting Blake2b's person field during the hasher initialization. Do you think it would still be useful to have the person field set if we transition to pbkdf2_hmac?
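The property the author is asking about can be demonstrated directly: blake2b's person parameter (up to 16 bytes) does provide domain separation on its own, in that the same key and message yield unrelated digests under different personalization strings. A minimal sketch, with illustrative values:

```python
from hashlib import blake2b

key = b"shared-secret"
msg = b"cache-payload"

# Same key and message, different personalization strings: the digests
# are unrelated, so a signature minted for one subsystem cannot be
# replayed against another.
d_cache = blake2b(msg, key=key, person=b"mongodbcache").digest()
d_other = blake2b(msg, key=key, person=b"othersubsystem").digest()
assert d_cache != d_other
```

Whether this is sufficient without a derived key is the open question in the thread: person separates uses of the same key, while pbkdf2_hmac derives a distinct key, so they address key reuse at different layers and can be combined.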

if isinstance(self._key, str):
self._key = self._key.encode()

self._salt = params.get("SALT", "")
if isinstance(self._salt, str):
self._salt = self._salt.encode()

def create_indexes(self):
expires_index = IndexModel("expires_at", expireAfterSeconds=0)
@@ -47,7 +86,10 @@ def create_indexes(self):

@cached_property
def serializer(self):
return MongoSerializer(self.pickle_protocol)
signer = None
if self._sign_cache:
signer = blake2b(key=self._key[:64], salt=self._salt[:16], person=self._collection_name[:16].encode())
return MongoSerializer(self.pickle_protocol, signer)

@property
def collection_for_read(self):
@@ -84,19 +126,30 @@ def get_many(self, keys, version=None):
with self.collection_for_read.find(
{"key": {"$in": tuple(keys_map)}, **self._filter_expired(expired=False)}
) as cursor:
return {keys_map[row["key"]]: self.serializer.loads(row["value"]) for row in cursor}
results = {}
for row in cursor:
try:
results[keys_map[row["key"]]] = self.serializer.loads(row["value"], row["pickled"], row["signature"])
except SuspiciousOperation as e:
self.delete(row["key"])
e.add_note(f"Cache entry with key '{row['key']}' was deleted due to suspicious data")
raise e
return results

def set(self, key, value, timeout=DEFAULT_TIMEOUT, version=None):
key = self.make_and_validate_key(key, version=version)
num = self.collection_for_write.count_documents({}, hint="_id_")
if num >= self._max_entries:
self._cull(num)
value, pickled, signature = self.serializer.dumps(value)
self.collection_for_write.update_one(
{"key": key},
{
"$set": {
"key": key,
"value": self.serializer.dumps(value),
"value": value,
"pickled": pickled,
"signature": signature,
Comment on lines +148 to +149

Collaborator:

I think it's not good to add "pickled" and "signature" keys to all cache data when signing is disabled.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pickled entry would be required regardless of the signing option due to the optimization of allowing unpickled byte strings.
I need to reevaluate whether removing the signature field would introduce any issues or vulnerabilities, but I think that is fine.

"expires_at": self.get_backend_timeout(timeout),
}
},
@@ -109,12 +162,15 @@ def add(self, key, value, timeout=DEFAULT_TIMEOUT, version=None):
if num >= self._max_entries:
self._cull(num)
try:
value, pickled, signature = self.serializer.dumps(value)
self.collection_for_write.update_one(
{"key": key, **self._filter_expired(expired=True)},
{
"$set": {
"key": key,
"value": self.serializer.dumps(value),
"value": value,
"pickled": pickled,
"signature": signature,
"expires_at": self.get_backend_timeout(timeout),
}
},
19 changes: 19 additions & 0 deletions docs/source/topics/cache.rst
@@ -32,6 +32,25 @@ In addition, the cache is culled based on ``CULL_FREQUENCY`` when ``add()``
or ``set()`` is called, if ``MAX_ENTRIES`` is exceeded. See
:ref:`django:cache_arguments` for an explanation of these two options.

Cache entries include an HMAC signature to ensure data integrity by default.
You can disable this by setting ``ENABLE_SIGNING`` to ``False``.
Signatures can also include an optional key and salt parameter by setting
Collaborator:

What is the use case for custom key and salt? Probably it should be explained that SECRET_KEY is the default.

Author:

As suggested by another reviewer, I removed the SALT option since it wasn't providing what I expected it to. I also added the documentation of SECRET_KEY as requested.
I don't have a valid use case for changing the key, but I feel like we would be limiting developers by arbitrarily removing the option when it doesn't really cost anything to keep it. Do you think it's fine as is, or should we brainstorm some reasons to keep/remove it?

``KEY`` and ``SALT`` respectively. Signatures are provided by the Blake2 hash
function, so key sizes are limited to 64 bytes and salt sizes to 16 bytes.
If a key is not provided, cache entries will be signed using the
``SECRET_KEY``.
Collaborator:

I think it's important to be explicit about what "data integrity" means. (i.e. if your database is compromised, it can lead to RCE.)

- It should also be made clear that this behavior differs from Django's built-in database backends.
- It would be helpful to give some guidance about the performance implications.

Author:

Do you have any examples of where performance implications are documented in Django? I would love to copy the format. It would be hard to invent a new format for describing a 2000ns slowdown due to signing while also showing that to be only a 3% difference from the original, without simply including a large list of numbers.


In this example, the cache collection is configured with a key and salt::

CACHES = {
"default": {
"BACKEND": "django_mongodb_backend.cache.MongoDBCache",
"LOCATION": "my_cache_collection",
"KEY": "my_secret_key",
"SALT": "my_salt",
},
}

Creating the cache collection
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

58 changes: 55 additions & 3 deletions tests/cache_/tests.py
@@ -28,6 +28,8 @@ def f():
class C:
def m(n):
return 24
def __eq__(self, other):
return isinstance(other, type(self))


class Unpicklable:
@@ -97,6 +99,7 @@ def caches_setting_for_tests(base=None, exclude=None, **params):
BACKEND="django_mongodb_backend.cache.MongoDBCache",
# Spaces are used in the name to ensure quoting/escaping works.
LOCATION="test cache collection",
ENABLE_SIGNING=False,
),
)
@modify_settings(
@@ -950,10 +953,59 @@ def test_collection_has_indexes(self):
)

def test_serializer_dumps(self):
self.assertEqual(cache.serializer.dumps(123), 123)
self.assertIsInstance(cache.serializer.dumps(True), bytes)
self.assertIsInstance(cache.serializer.dumps("abc"), bytes)
self.assertTupleEqual(cache.serializer.dumps(123), (123, False, None))
self.assertTupleEqual(cache.serializer.dumps(True), (True, False, None))
self.assertTupleEqual(cache.serializer.dumps("abc"), ("abc", False, None))
self.assertTupleEqual(cache.serializer.dumps(b"abc"), (b"abc", False, None))

c = C()
pickled_c = pickle.dumps(c, protocol=pickle.HIGHEST_PROTOCOL)
self.assertTupleEqual(cache.serializer.dumps(c), (pickled_c, True, None))

def test_serializer_loads(self):
self.assertEqual(cache.serializer.loads(123, False, None), 123)
self.assertEqual(cache.serializer.loads(True, False, None), True)
self.assertEqual(cache.serializer.loads("abc", False, None), "abc")
self.assertEqual(cache.serializer.loads(b"abc", False, None), b"abc")

c = C()
pickled_c = pickle.dumps(c, protocol=pickle.HIGHEST_PROTOCOL)
self.assertEqual(cache.serializer.loads(pickled_c, True, None), c)



@override_settings(
CACHES=caches_setting_for_tests(
BACKEND="django_mongodb_backend.cache.MongoDBCache",
# Spaces are used in the name to ensure quoting/escaping works.
LOCATION="test cache collection",
ENABLE_SIGNING=True,
SALT="test-salt",
),
)
class SignedCacheTests(CacheTests):
def test_serializer_dumps(self):
self.assertTupleEqual(cache.serializer.dumps(123), (123, False, None))
self.assertTupleEqual(cache.serializer.dumps(True), (True, False, None))
self.assertTupleEqual(cache.serializer.dumps("abc"), ("abc", False, None))
self.assertTupleEqual(cache.serializer.dumps(b"abc"), (b"abc", False, None))

c = C()
pickled_c = pickle.dumps(c, protocol=pickle.HIGHEST_PROTOCOL)
self.assertTupleEqual(cache.serializer.dumps(c), (pickled_c, True, cache.serializer._get_signature(pickled_c)))

def test_serializer_loads(self):
self.assertEqual(cache.serializer.loads(123, False, None), 123)
self.assertEqual(cache.serializer.loads(True, False, None), True)
self.assertEqual(cache.serializer.loads("abc", False, None), "abc")
self.assertEqual(cache.serializer.loads(b"abc", False, None), b"abc")

c = C()
pickled_c = pickle.dumps(c, protocol=pickle.HIGHEST_PROTOCOL)
self.assertEqual(cache.serializer.loads(pickled_c, True, cache.serializer._get_signature(pickled_c)), c)

with self.assertRaises(Exception):
cache.serializer.loads(pickled_c, True, "invalid-signature")

class DBCacheRouter:
"""A router that puts the cache table on the 'other' database."""