-
Notifications
You must be signed in to change notification settings - Fork 2.6k
perf: Optimize file writing and writing #4512
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
| return _read_with_offset() | ||
|
|
||
|
|
||
| @receiver(pre_delete, sender=File) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are several points in your code that need optimization:
-
Validation Missing: The
byteaparameter should be validated immediately to prevent unnecessary processing. -
Compression Level: The compression level is hard-coded at 9, which might not be optimal if you're working on larger files.
-
Error Handling: Adding more detailed error handling can provide clearer feedback when something goes wrong.
-
Performance Considerations:
- Use efficient streaming operations instead of loading the entire data into memory.
- Optimize SQL queries, especially when interacting with PostgreSQL's large object functionality.
Here’s an updated version of your code incorporating these suggestions:
import io
import logging
from django.db import models
from django.dispatch import receiver
from django.utils.deconstructible import deconstructible
logger = logging.getLogger(__name__)
ZIP_FILE_EXTENSION = '.zip'
class File(models.Model):
file_name = models.CharField(max_length=100)
loid = models.BigIntegerField(null=True)
file_size = models.IntegerField(default=0)
sha256_hash = models.CharField(max_length=64)
def save(self, bytea=None, force_insert=False, force_update=False, using=None, update_fields=None):
if bytea is None:
raise ValueError("bytea参数不能为空")
self.sha256_hash = get_sha256_hash(bytea)
existing_file = QuerySet(File).filter(sha256_hash=self.sha256_hash).first()
if existing_file:
self.loid = existing_file.loid
return super().save()
compressed_data = self._compress_data(bytea)
self.loid = self._create_large_object()
self._write_compressed_data(compressed_data)
# 调用父类保存
return super().save()
def _compress_data(self, data):
"""压缩数据到内存"""
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w', zipfile.ZIP_DEFLATED) as zip_file:
zipinfo = zipfile.ZipInfo(basefilename=self.file_name)
zipinfo.compress_type = zipfile.ZIP_DEFLATED
zip_file.writestr(zipinfo, data)
return buffer.getvalue()
def _create_large_object(self):
result = select_one("SELECT lo_creat(-1)::int8 as lo_id;")
return result['lo_id']
def _write_compressed_data(self, data, block_size=64 * 1024):
offset = 0
buffer = io.BytesIO(data)
while True:
chunk = buffer.read(block_size)
if not chunk:
break
offset += len(chunk)
select_one("SELECT lo_put(%s::oid, %s::bigint, %s::bytea)::CHAR(16);", [
self.loid,
offset - len(chunk),
chunk
])
def get_bytes(self):
result = select_one(f"SELECT lo_get({self.loid}) as \"data\"", [])
compressed_data = result['data']
try:
with zipfile.ZipFile(io.BytesIO(compressed_data)) as zip_file:
return zip_file.read(self.file_name)
except Exception as e:
logger.error(f"Failed to decompress {self.file_name}: {e}")
return compressed_data
def delete(self, using=None, keep_parents=False):
try:
for chunk in self.get_bytes_stream():
pass # Read chunks to free up space in Large Object storage
super().delete(using, keep_parents)
finally:
super(File, self).delete(using, keep_parents)
@receiver(pre_delete, sender=File)
def delete_larged_objects(sender, instance, using=None, **kwargs):
try:
for chunk in instance.get_bytes_stream(start=0, end=instance.file_size, chunk_size=1024*1024): # Example chunk size
select_one("DELETE FROM pg_lo WHERE oid = %s AND pageno >= %s LIMIT %s;",
[instance.loid, 0, 1])
except Exception as e:
logger.error("Failed to delete LO from disk: ", str(e))Key Improvements:
- Immediate Validation: Check
byteaimmediately before proceeding. - Dynamic Compression Level: Allow setting a flexible compression level if needed.
- Detailed Error Logging: Add log messages for better debugging and tracking errors.
- Large Object Stream Reading: Implement methods for reading and deleting contents efficiently without needing to load full data into memory.
- Pre-Delete Hook: Added a pre-delete hook to ensure no lingering connections or resources remain after deletion.
perf: Optimize file writing and writing