-
Notifications
You must be signed in to change notification settings - Fork 1.1k
[DRIVERS-2926] [PYTHON-4577] BSON Binary Vector Subtype Support #1813
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 31 commits
245c869
031cd8c
8d4e8a2
2df0d6b
315a115
d74314d
27f13c8
263f8c7
f8bcdef
5435785
5c4d152
f86d040
adcb945
7986cc5
26b8398
28de28a
68235b8
e2a1a3c
bf9758a
e1590aa
c4c7af7
de5a245
43bcce4
41ee0bb
9d52aeb
0d34464
1d49656
2af0ca4
c93bae1
f374b5a
0db9866
0532803
3edeef6
d199597
abc7cd3
4550c20
99d44e1
001636d
637c474
2d511f6
17e1d33
ce5f3e3
edfe972
8aaa2f6
dfb322c
8946daf
9397129
913403b
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change | ||
---|---|---|---|---|
|
@@ -13,7 +13,10 @@ | |||
# limitations under the License. | ||||
from __future__ import annotations | ||||
|
||||
from typing import TYPE_CHECKING, Any, Tuple, Type, Union | ||||
import struct | ||||
from dataclasses import dataclass | ||||
from enum import Enum | ||||
from typing import TYPE_CHECKING, Any, Optional, Sequence, Tuple, Type, Union | ||||
from uuid import UUID | ||||
|
||||
"""Tools for representing BSON binary data. | ||||
|
@@ -191,21 +194,76 @@ class UuidRepresentation: | |||
""" | ||||
|
||||
|
||||
VECTOR_SUBTYPE = 9 | ||||
"""**(BETA)** BSON binary subtype for densely packed vector data. | ||||
|
||||
.. versionadded:: 4.10 | ||||
""" | ||||
|
||||
|
||||
USER_DEFINED_SUBTYPE = 128 | ||||
"""BSON binary subtype for any user defined structure. | ||||
""" | ||||
|
||||
|
||||
class BinaryVectorDtype(Enum): | ||||
"""**(BETA)** Datatypes of vector subtype. | ||||
|
||||
:param FLOAT32: (0x27) Pack list of :class:`float` as float32 | ||||
:param INT8: (0x03) Pack list of :class:`int` in [-128, 127] as signed int8 | ||||
:param PACKED_BIT: (0x10) Pack list of :class:`int` in [0, 255] as unsigned uint8 | ||||
|
||||
The `PACKED_BIT` value represents a special case where vector values themselves | ||||
can only be of two values (0 or 1) but these are packed together into groups of 8, | ||||
a byte. In Python, these are displayed as ints in range [0, 255] | ||||
|
||||
Each value is of type bytes with a length of one. | ||||
|
||||
.. versionadded:: 4.9 | ||||
""" | ||||
|
||||
INT8 = b"\x03" | ||||
FLOAT32 = b"\x27" | ||||
PACKED_BIT = b"\x10" | ||||
|
||||
|
||||
# Map from bytes to enum value, for decoding. | ||||
DTYPE_FROM_HEX = {key.value: key for key in BinaryVectorDtype} | ||||
|
||||
|
||||
@dataclass | ||||
class BinaryVector: | ||||
"""**(BETA)** Vector of numbers along with metadata for binary interoperability. | ||||
|
||||
:param data: Sequence of numbers representing the mathematical vector. | ||||
:param dtype: The data type stored in binary | ||||
:param padding: The number of bits in the final byte that are to be ignored | ||||
when a vector element's size is less than a byte | ||||
and the length of the vector is not a multiple of 8. | ||||
|
||||
.. versionadded:: 4.9 | ||||
""" | ||||
|
||||
data: Sequence[float | int] | ||||
|
__slots__ = ("__time", "__inc") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could pass slots=True
instead: https://docs.python.org/3/library/dataclasses.html#dataclasses.dataclass
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Strange. And you're cool with that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have to add them manually instead of using slots=True
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
class BinaryVector:
"""**(BETA)** Vector of numbers along with metadata for binary interoperability.
:param data (Sequence[float | int]): Sequence of numbers representing the mathematical vector.
:param dtype (:class:`bson.Binary.BinaryVectorDtype`): The data type stored in binary
:param padding (Optional[int] = 0): The number of bits in the final byte that are to be ignored
when a vector element's size is less than a byte
and the length of the vector is not a multiple of 8. Default is 0.
.. versionadded:: 4.10
"""
__slots__ = ("data", "dtype", "padding")
def __init__(self, data, dtype, padding=0):
self.data = data
self.dtype = dtype
self.padding = padding
Is this right? @blink1073
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I didn't realize __slots__
with dataclass was problematic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be sorted now
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Optional[int]
-> int
, unless padding=None is considered valid.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
padding=None is expected when it is not a packed type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it? I see the code using padding=0 in that case.
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Optional[int]
-> int
blink1073 marked this conversation as resolved.
Show resolved
Hide resolved
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment about assert.
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment about assert.
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Optional[bool]
-> bool
unless uncompressed=None
is valid.
blink1073 marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to validate self.subtype == 9
here before attempting to decode.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ShaneHarvey How do I do that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something like:
if self.subtype != VECTOR_SUBTYPE:
raise ValueError(...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Look at as_uuid
it should be the same as that validation:
if self.subtype not in ALL_UUID_SUBTYPES:
raise ValueError(f"cannot decode subtype {self.subtype} as a uuid")
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should not use assert
for validation data. If the argument type is incorrect we raise a TypeError, if the type is correct but the value is invalid we raise a ValueError.
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why add an uncompressed
option here at all? It looks like this option is irreversible because from_vector does not support the same option. Even if it did, is this option useful outside of test code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think we should leave out uncompressed
, it was a quality of life addition, but is not likely to be standardized.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. It's irreversible. We don't yet provide an API to go from a full vector of zeros and ones to a packed bit vector.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Going in, you'd be using something like numpy.packbits
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's omit it since it adds unneeded complexity at this point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,42 @@ | ||
{ | ||
"description": "Tests of Binary subtype 9, Vectors, with dtype FLOAT32", | ||
"test_key": "vector", | ||
"tests": [ | ||
{ | ||
"description": "Simple Vector FLOAT32", | ||
"valid": true, | ||
"vector": [127.0, 7.0], | ||
"dtype_hex": "0x27", | ||
"dtype_alias": "FLOAT32", | ||
"padding": 0, | ||
"canonical_bson": "1C00000005766563746F72000A0000000927000000FE420000E04000" | ||
}, | ||
{ | ||
"description": "Empty Vector FLOAT32", | ||
"valid": true, | ||
"vector": [], | ||
"dtype_hex": "0x27", | ||
"dtype_alias": "FLOAT32", | ||
"padding": 0, | ||
"canonical_bson": "1400000005766563746F72000200000009270000" | ||
}, | ||
{ | ||
"description": "Infinity Vector FLOAT32", | ||
"valid": true, | ||
"vector": ["-inf", 0.0, "inf"], | ||
"dtype_hex": "0x27", | ||
"dtype_alias": "FLOAT32", | ||
"padding": 0, | ||
"canonical_bson": "2000000005766563746F72000E000000092700000080FF000000000000807F00" | ||
}, | ||
{ | ||
"description": "FLOAT32 with padding", | ||
"valid": false, | ||
"vector": [127.0, 7.0], | ||
"dtype_hex": "0x27", | ||
"dtype_alias": "FLOAT32", | ||
"padding": 3 | ||
} | ||
] | ||
} | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,57 @@ | ||
{ | ||
"description": "Tests of Binary subtype 9, Vectors, with dtype INT8", | ||
"test_key": "vector", | ||
"tests": [ | ||
{ | ||
"description": "Simple Vector INT8", | ||
"valid": true, | ||
"vector": [127, 7], | ||
"dtype_hex": "0x03", | ||
"dtype_alias": "INT8", | ||
"padding": 0, | ||
"canonical_bson": "1600000005766563746F7200040000000903007F0700" | ||
}, | ||
{ | ||
"description": "Empty Vector INT8", | ||
"valid": true, | ||
"vector": [], | ||
"dtype_hex": "0x03", | ||
"dtype_alias": "INT8", | ||
"padding": 0, | ||
"canonical_bson": "1400000005766563746F72000200000009030000" | ||
}, | ||
{ | ||
"description": "Overflow Vector INT8", | ||
"valid": false, | ||
"vector": [128], | ||
"dtype_hex": "0x03", | ||
"dtype_alias": "INT8", | ||
"padding": 0 | ||
}, | ||
{ | ||
"description": "Underflow Vector INT8", | ||
"valid": false, | ||
"vector": [-129], | ||
"dtype_hex": "0x03", | ||
"dtype_alias": "INT8", | ||
"padding": 0 | ||
}, | ||
{ | ||
"description": "INT8 with padding", | ||
"valid": false, | ||
"vector": [127, 7], | ||
"dtype_hex": "0x03", | ||
"dtype_alias": "INT8", | ||
"padding": 3 | ||
}, | ||
{ | ||
"description": "INT8 with float inputs", | ||
"valid": false, | ||
"vector": [127.77, 7.77], | ||
"dtype_hex": "0x03", | ||
"dtype_alias": "INT8", | ||
"padding": 0 | ||
} | ||
] | ||
} | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,50 @@ | ||
{ | ||
"description": "Tests of Binary subtype 9, Vectors, with dtype PACKED_BIT", | ||
"test_key": "vector", | ||
"tests": [ | ||
{ | ||
"description": "Simple Vector PACKED_BIT", | ||
"valid": true, | ||
"vector": [127, 7], | ||
"dtype_hex": "0x10", | ||
"dtype_alias": "PACKED_BIT", | ||
"padding": 0, | ||
"canonical_bson": "1600000005766563746F7200040000000910007F0700" | ||
}, | ||
{ | ||
"description": "Empty Vector PACKED_BIT", | ||
"valid": true, | ||
"vector": [], | ||
"dtype_hex": "0x10", | ||
"dtype_alias": "PACKED_BIT", | ||
"padding": 0, | ||
"canonical_bson": "1400000005766563746F72000200000009100000" | ||
}, | ||
{ | ||
"description": "PACKED_BIT with padding", | ||
"valid": true, | ||
"vector": [127, 7], | ||
"dtype_hex": "0x10", | ||
"dtype_alias": "PACKED_BIT", | ||
"padding": 3, | ||
"canonical_bson": "1600000005766563746F7200040000000910037F0700" | ||
}, | ||
{ | ||
"description": "Overflow Vector PACKED_BIT", | ||
"valid": false, | ||
"vector": [256], | ||
"dtype_hex": "0x10", | ||
"dtype_alias": "PACKED_BIT", | ||
"padding": 0 | ||
}, | ||
{ | ||
"description": "Underflow Vector PACKED_BIT", | ||
"valid": false, | ||
"vector": [-1], | ||
"dtype_hex": "0x10", | ||
"dtype_alias": "PACKED_BIT", | ||
"padding": 0 | ||
} | ||
] | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An enum of bytes is a bit unusual. Can we change it to use ints? eg
INT8 = 3
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The naming convention that Geert used has meaning in both the first and last 4 bits, so I'd prefer it to stay as-is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see how the interpretation of the values here matters. Are you saying this is a bit flag?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The user doesn't see the value locally either. Given that the bson spec itself always uses integers I think we should use the integer form.