INTPYTHON-676: Adding security and optimization to cache collections #343
```diff
@@ -1,33 +1,61 @@
 import pickle
 from datetime import datetime, timezone
+from hashlib import blake2b
+from typing import Any, Optional, Tuple

 from django.core.cache.backends.base import DEFAULT_TIMEOUT, BaseCache
 from django.core.cache.backends.db import Options
+from django.core.exceptions import SuspiciousOperation
 from django.db import connections, router
 from django.utils.functional import cached_property
 from pymongo import ASCENDING, DESCENDING, IndexModel, ReturnDocument
 from pymongo.errors import DuplicateKeyError, OperationFailure
+from django.conf import settings


 class MongoSerializer:
-    def __init__(self, protocol=None):
+    def __init__(self, protocol=None, signer=None):
         self.protocol = pickle.HIGHEST_PROTOCOL if protocol is None else protocol
+        self.signer = signer

-    def dumps(self, obj):
-        # For better incr() and decr() atomicity, don't pickle integers.
-        # Using type() rather than isinstance() matches only integers and not
-        # subclasses like bool.
-        if type(obj) is int:  # noqa: E721
-            return obj
-        return pickle.dumps(obj, self.protocol)
+    def _get_signature(self, data) -> Optional[bytes]:
+        if self.signer is None:
+            return None
+        s = self.signer.copy()
+        s.update(data)
+        return s.digest()

-    def loads(self, data):
-        try:
-            return int(data)
-        except (ValueError, TypeError):
-            return pickle.loads(data)  # noqa: S301
+    def _get_pickled(self, obj: Any) -> bytes:
+        return pickle.dumps(obj, protocol=self.protocol)
+
+    def dumps(self, obj) -> Tuple[Any, bool, Optional[str]]:
+        # Serialize the object to a format suitable for MongoDB storage.
+        # The return value is a tuple of (data, pickled, signature).
+        match obj:
+            case int() | str() | bytes():
+                return (obj, False, None)
```
Comment on lines +37 to +38

Reviewer: I think this is what you refer to as "optimization". Any optimization should be done separately from the security hardening, to make that change clearer. However, I wonder if the benefit of not serializing str/bytes is mostly for the signing case. While it helps with CPU, it adds the extra "pickled" attribute for all keys in the cache. It might be better to limit this optimization to a new…

Author: In my testing, unpickled… This is one of the optimizations I made; however, it is the least impactful overall. The majority of the speed increase comes from switching to directly using Python's… The "pickled" field adds 10 bytes to every cache request. I do not believe that is a significant overhead for the performance difference, but if we change the field name to simply…

While I would normally agree that copying existing, tested code is better, I believe the Redis code is severely limited by a traditional relational-database thought process; it is not taking advantage of the inherent performance gains of MongoDB's document model. While there are other performance limitations we are forced into by the Django API (colocation of cached headers and page data), this one is easy to remediate because all code paths are within our own code.
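The comment carried over from the old `dumps()` explains why the code uses `type()` rather than `isinstance()`; a quick standalone sketch makes the `bool` subtlety concrete:

```python
# bool is a subclass of int, so isinstance() would match True/False too.
# Using type() keeps booleans on the pickle path, so they round-trip as
# booleans instead of being stored as the integers 1/0.
assert isinstance(True, int)   # bool IS-A int
assert type(True) is not int   # but its exact type is bool
assert type(5) is int          # plain ints pass the check and skip pickling
```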
```diff
+            case _:
+                pickled_data = self._get_pickled(obj)
+                return (pickled_data, True, self._get_signature(pickled_data) if self.signer else None)
+
+    def loads(self, data: Any, pickled: bool, signature=None) -> Any:
+        if pickled:
+            try:
+                if self.signer is not None:
+                    # A constant-time compare is not required due to how data is retrieved.
```
Reviewer: I'm familiar with the usage of…

Author: The threat model for signing data assumes a malicious actor has gained write access to the cache collection, but no access (or very low privilege) to the server Django is running on. If you feel a side-channel attack is a potential issue for this threat model, I am willing to reintroduce the constant-time comparison.

Reviewer: I would suggest (if only to avoid having to reason about these subtleties in the future) that you always use a constant-time comparison when comparing cryptographic hashes. It is theoretically possible that a non-constant-time equality comparison might allow someone to more easily craft a valid HMAC for a maliciously crafted payload.

Author: Originally, I opted not to use… That said, I did also benchmark…

Reviewer: On a related note, Django's…

Author: I implemented constant-time compare using…

Reviewer: No, my point was that Django's implementation is just an alias…
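A minimal sketch of the constant-time check the reviewers are debating, using the standard library's `hmac.compare_digest` (the key and payload below are illustrative, not the PR's values):

```python
import hmac
from hashlib import blake2b

key = b"example-signing-key"   # stand-in for the cache's signing key
data = b"pickled-payload"

expected = blake2b(data, key=key).digest()

# compare_digest's running time does not depend on where the inputs first
# differ, which closes the timing side channel discussed in this thread.
assert hmac.compare_digest(expected, blake2b(data, key=key).digest())
assert not hmac.compare_digest(expected, blake2b(b"tampered", key=key).digest())
```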
```diff
+                    if signature and (signature == self._get_signature(data)):
+                        return pickle.loads(data)  # noqa: S301
+                    else:
+                        raise SuspiciousOperation("Pickled cache data has a missing or invalid signature")
+                else:
+                    return pickle.loads(data)  # noqa: S301
+            except (ValueError, TypeError):
+                # ValueError: invalid signature
+                # TypeError: data wasn't a byte string
+                raise SuspiciousOperation(f'Invalid pickle signature: {{"signature": {signature}, "data": {data}}}')
+        else:
+            return data


 class MongoDBCache(BaseCache):
     pickle_protocol = pickle.HIGHEST_PROTOCOL
```
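The serializer's sign-then-verify flow can be exercised outside Django. The sketch below is a simplified stand-in (the class name, tuple layout, and `ValueError` in place of `SuspiciousOperation` are this sketch's choices, not the PR's code):

```python
import pickle
from hashlib import blake2b

class SignedSerializer:
    """Simplified stand-in for MongoSerializer's signing behaviour."""

    def __init__(self, signer):
        self.signer = signer  # a keyed blake2b instance used as a template

    def _signature(self, data):
        s = self.signer.copy()  # copy() preserves the key/salt/person state
        s.update(data)
        return s.digest()

    def dumps(self, obj):
        if type(obj) in (int, str, bytes):  # pass-through types, unsigned
            return obj, False, None
        data = pickle.dumps(obj)
        return data, True, self._signature(data)

    def loads(self, data, pickled, signature):
        if not pickled:
            return data
        if signature != self._signature(data):
            raise ValueError("missing or invalid signature")
        return pickle.loads(data)

signer = blake2b(key=b"secret", person=b"cache")
s = SignedSerializer(signer)
data, pickled, sig = s.dumps({"a": 1})
assert s.loads(data, pickled, sig) == {"a": 1}  # signed round-trip
assert s.dumps(42) == (42, False, None)         # ints pass through unsigned
```

Tampering with either the payload or the signature makes `loads` raise, which is the integrity property the PR is after.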
```diff
@@ -39,6 +67,17 @@ class CacheEntry:
         _meta = Options(collection_name)

         self.cache_model_class = CacheEntry
+        self._sign_cache = params.get("ENABLE_SIGNING", True)
+
+        self._key = params.get("KEY", settings.SECRET_KEY[:64])
+        if len(self._key) == 0:
+            self._key = settings.SECRET_KEY[:64]
```
Reviewer: A bit of a subtle thing here. If I'm understanding correctly, … The most correct thing to do is to derive a key from … PBKDF2 asks for an "iterations" parameter; if you look it up online, the conventional wisdom is to use a very large number of iterations, because PBKDF2 is generally used for converting human-chosen passwords (which have low entropy) into keys. However, if I'm right in assuming that … This might look something like this:

    from hashlib import pbkdf2_hmac

    purpose = b"mongodbcachekey"
    iterations = 1
    # SECRET_KEY is a str; pbkdf2_hmac wants bytes, so encode it first.
    self._key = pbkdf2_hmac("sha256", settings.SECRET_KEY[:64].encode(), purpose, iterations)

Maintainer: @ZacharyEspiritu Interesting suggestion about SECRET_KEY! That's probably something we can do in the project template.

Author: Interesting, I thought this might be solved by setting…
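The reviewer's derivation idea runs standalone like this (the secret value below is a made-up stand-in for `settings.SECRET_KEY`):

```python
from hashlib import pbkdf2_hmac

secret_key = "django-insecure-0123456789abcdef"  # stand-in for settings.SECRET_KEY
purpose = b"mongodbcachekey"  # domain-separation value, per the review comment

# One iteration suffices here because SECRET_KEY is already high-entropy,
# unlike the human-chosen passwords PBKDF2 is usually applied to.
derived = pbkdf2_hmac("sha256", secret_key[:64].encode(), purpose, 1)
assert len(derived) == 32  # SHA-256 digest length
```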
```diff
+        if isinstance(self._key, str):
+            self._key = self._key.encode()
+
+        self._salt = params.get("SALT", "")
+        if isinstance(self._salt, str):
+            self._salt = self._salt.encode()

     def create_indexes(self):
         expires_index = IndexModel("expires_at", expireAfterSeconds=0)
```
```diff
@@ -47,7 +86,10 @@ def create_indexes(self):

     @cached_property
     def serializer(self):
-        return MongoSerializer(self.pickle_protocol)
+        signer = None
+        if self._sign_cache:
+            signer = blake2b(key=self._key[:64], salt=self._salt[:16], person=self._collection_name[:16].encode())
+        return MongoSerializer(self.pickle_protocol, signer)

     @property
     def collection_for_read(self):
```
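The slicing in this hunk (`self._key[:64]`, `self._salt[:16]`, `[:16]` on the collection name) matches blake2b's hard parameter limits, which can be checked directly:

```python
from hashlib import blake2b

# blake2b enforces: key <= 64 bytes, salt <= 16 bytes, person <= 16 bytes.
# The [:64] / [:16] slices in the diff keep oversized settings values legal.
h = blake2b(b"payload", key=b"k" * 64, salt=b"s" * 16, person=b"p" * 16)
assert len(h.digest()) == 64  # default blake2b digest size

try:
    blake2b(key=b"k" * 65)  # one byte over the key limit
except ValueError:
    pass  # rejected, as expected
else:
    raise AssertionError("expected ValueError for oversized key")
```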
```diff
@@ -84,19 +126,30 @@ def get_many(self, keys, version=None):
         with self.collection_for_read.find(
             {"key": {"$in": tuple(keys_map)}, **self._filter_expired(expired=False)}
         ) as cursor:
-            return {keys_map[row["key"]]: self.serializer.loads(row["value"]) for row in cursor}
+            results = {}
+            for row in cursor:
+                try:
+                    results[keys_map[row["key"]]] = self.serializer.loads(row["value"], row["pickled"], row["signature"])
+                except SuspiciousOperation as e:
+                    self.delete(row["key"])
+                    e.add_note(f"Cache entry with key '{row['key']}' was deleted due to suspicious data")
+                    raise e
+            return results
```
```diff
     def set(self, key, value, timeout=DEFAULT_TIMEOUT, version=None):
         key = self.make_and_validate_key(key, version=version)
         num = self.collection_for_write.count_documents({}, hint="_id_")
         if num >= self._max_entries:
             self._cull(num)
+        value, pickled, signature = self.serializer.dumps(value)
         self.collection_for_write.update_one(
             {"key": key},
             {
                 "$set": {
                     "key": key,
-                    "value": self.serializer.dumps(value),
+                    "value": value,
+                    "pickled": pickled,
+                    "signature": signature,
```
Comment on lines +148 to +149

Reviewer: I think it's not good to add "pickled" and "signature" keys to all cache data when signing is disabled.

Author: The…
```diff
                     "expires_at": self.get_backend_timeout(timeout),
                 }
             },
```
```diff
@@ -109,12 +162,15 @@ def add(self, key, value, timeout=DEFAULT_TIMEOUT, version=None):
         if num >= self._max_entries:
             self._cull(num)
         try:
+            value, pickled, signature = self.serializer.dumps(value)
             self.collection_for_write.update_one(
                 {"key": key, **self._filter_expired(expired=True)},
                 {
                     "$set": {
                         "key": key,
-                        "value": self.serializer.dumps(value),
+                        "value": value,
+                        "pickled": pickled,
+                        "signature": signature,
                         "expires_at": self.get_backend_timeout(timeout),
                     }
                 },
```
```diff
@@ -32,6 +32,25 @@
 In addition, the cache is culled based on ``CULL_FREQUENCY`` when ``add()``
 or ``set()`` is called, if ``MAX_ENTRIES`` is exceeded. See
 :ref:`django:cache_arguments` for an explanation of these two options.

+Cache entries include an HMAC signature to ensure data integrity by default.
+You can disable this by setting ``ENABLE_SIGNING`` to ``False``.
+Signatures can also include an optional key and salt parameter by setting
```
Reviewer: What is the use case for a custom key and salt? It should probably be explained that SECRET_KEY is the default.

Author: As suggested by another reviewer, I removed the SALT option since it wasn't providing what I expected it to. I also added documentation of…
```diff
+``KEY`` and ``SALT`` respectively. Signatures are provided by the BLAKE2 hash
+function, so key sizes are limited to 64 bytes and salt sizes to 16 bytes.
+If a key is not provided, cache entries will be signed using the
+``SECRET_KEY``.
```
Reviewer: I think it's important to be explicit about what "data integrity" means (i.e., if your database is compromised, it can lead to RCE).

Author: Do you have any examples of where performance implications are documented in Django? I would love to copy the format. It would be hard to invent a new format for describing a 2000 ns slowdown due to signing while also showing that to be only a 3% difference from the original, without simply including a large list of numbers.
|
||
In this example, the cache collection is configured with a key and salt:: | ||
|
||
CACHES = { | ||
"default": { | ||
"BACKEND": "django_mongodb_backend.cache.MongoDBCache", | ||
"LOCATION": "my_cache_collection", | ||
"KEY": "my_secret_key", | ||
"SALT": "my_salt", | ||
}, | ||
} | ||
|
||
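Conversely, signing can be switched off via the ``ENABLE_SIGNING`` option this patch reads from ``params``; a sketch of such a settings fragment (the backend treats the option as ``True`` when absent):

```python
# Hypothetical settings.py fragment: disable cache-entry signing.
# ENABLE_SIGNING defaults to True in this patch, so it must be set
# explicitly to opt out of the HMAC check.
CACHES = {
    "default": {
        "BACKEND": "django_mongodb_backend.cache.MongoDBCache",
        "LOCATION": "my_cache_collection",
        "ENABLE_SIGNING": False,
    },
}
```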
```diff
 Creating the cache collection
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
```
Reviewer: We currently don't have any type hints in this project (Django hasn't adopted them), so it's a bit out of place to add them in this PR.

Author: I understand that; Django was primarily written prior to type hinting's standardization. I personally feel they are required here since I am breaking from the standard APIs for some functions. It felt more correct to add them everywhere than to only add them to certain functions. I can remove them, but I am worried that will reduce readability and maintainability.