Commit Per Email During Gmail Sync #42

@mrrobot47

Description

Metadata

  • File(s): frappe_gmail_thread/frappe_gmail_thread/doctype/gmail_thread/gmail_thread.py:222,249,272,372
  • Category: Database / Background Jobs
  • Severity: High
  • Effort to Fix: Medium
  • Estimated Performance Gain: 50-70%

Problem Description

The Gmail sync function (sync()) calls frappe.db.commit() after processing each individual email message. For users with active inboxes receiving 50-100 new emails, a single sync operation will execute 50-100 separate database commits.

Each commit:

  • Forces the database to flush the transaction to disk
  • Acquires and releases database locks
  • Adds per-transaction bookkeeping overhead

This significantly slows down the sync process and increases database load.
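The cost is easy to demonstrate outside Frappe. The following standalone sketch (plain sqlite3; the table, file names, and row count are this sketch's inventions) times N inserts with a commit per row versus a single commit at the end:

```python
import sqlite3
import tempfile
import time

def insert_rows(db_path, n_rows, commit_each):
    """Insert n_rows rows, committing per row or once at the end.

    Hypothetical benchmark (plain sqlite3, not Frappe) showing why a
    commit per email dominates sync time: each commit forces a flush
    of the journal to disk.
    """
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS emails (id INTEGER, body TEXT)")
    start = time.perf_counter()
    for i in range(n_rows):
        conn.execute("INSERT INTO emails VALUES (?, ?)", (i, "x" * 100))
        if commit_each:
            conn.commit()  # one disk flush per row, as in the current sync()
    conn.commit()          # a single flush covers the batched case
    elapsed = time.perf_counter() - start
    conn.close()
    return elapsed

with tempfile.TemporaryDirectory() as tmp:
    per_row = insert_rows(f"{tmp}/per_row.db", 100, commit_each=True)
    batched = insert_rows(f"{tmp}/batched.db", 100, commit_each=False)
    print(f"per-row commits: {per_row:.4f}s, single commit: {batched:.4f}s")
```

On a disk-backed database the per-row variant is typically dramatically slower, because every commit is a synchronous flush.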

Code Location

Initial sync - commit per email (line 222):

gmail_thread.append("emails", email)
gmail_thread.save(ignore_permissions=True)
frappe.db.commit()  # nosemgrep  <-- Commit after EACH email
frappe.db.set_value(
    "Gmail Thread",
    gmail_thread.name,
    "modified",
    email.date_and_time,
    update_modified=False,
)

End of initial sync label loop (line 249):

gmail_account.reload()
gmail_account.last_historyid = max_history_id
gmail_account.save(ignore_permissions=True)
frappe.db.commit()  # nosemgrep

History not found error handling (line 272):

if error.get("reason") == "notFound":
    gmail_account.last_historyid = 0
    gmail_account.save(ignore_permissions=True)
    frappe.db.commit()
    return

End of incremental sync (line 372):

gmail_account.reload()
gmail_account.last_historyid = max_history_id
gmail_account.save(ignore_permissions=True)
frappe.db.commit()  # nosemgrep

Root Cause

The developer likely added per-email commits to ensure data persistence in case of mid-sync failures. However, this approach:

  1. Creates excessive I/O overhead
  2. Negates the benefits of batch processing
  3. Causes the sync to take much longer than necessary
  4. Places unnecessary load on the database during the 30-minute sync cycle

Proposed Solution

Batch commits - either commit once at the end of processing all emails for a label, or commit every N emails (e.g., 10-20) to balance durability with performance:

def sync(user=None):
    if user:
        frappe.set_user(user)
    user = frappe.session.user
    gmail_account = frappe.get_doc("Gmail Account", {"linked_user": user})
    # ... validation code ...

    BATCH_COMMIT_SIZE = 20  # commit every N emails
    emails_processed = 0

    for label_id in label_ids:
        try:
            if not last_history_id:
                # Initial sync
                threads = gmail.users().threads().list(userId="me", labelIds=label_id).execute()
                if "threads" not in threads:
                    continue
                for thread in threads["threads"][::-1]:
                    # ... process thread and emails ...
                    for message in thread_data["messages"]:
                        # ... process email ...
                        gmail_thread.append("emails", email)
                        gmail_thread.save(ignore_permissions=True)

                        emails_processed += 1
                        if emails_processed % BATCH_COMMIT_SIZE == 0:
                            # Persist the history ID in the same commit as the
                            # batch, so a crash only re-fetches the current batch
                            gmail_account.reload()
                            gmail_account.last_historyid = max_history_id
                            gmail_account.save(ignore_permissions=True)
                            frappe.db.commit()

                # Final commit covers any remaining partial batch
                gmail_account.reload()
                gmail_account.last_historyid = max_history_id
                gmail_account.save(ignore_permissions=True)
                frappe.db.commit()
            else:
                # Incremental sync - apply the same batching approach
                pass  # ...
        except Exception:
            frappe.db.rollback()  # discard the partially-applied batch
            frappe.log_error(frappe.get_traceback(), "Gmail Thread Sync Error")
            continue

Implementation Steps

  1. Add a batch commit size constant (e.g., 20 emails)
  2. Track the number of emails processed in a counter
  3. Only commit when counter reaches batch size
  4. After processing all emails, perform a final commit
  5. Update last_historyid during batch commits for crash recovery
  6. Add frappe.db.rollback() in exception handler to clean up failed transactions
  7. Test with accounts that have many new emails
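Steps 2-4 reduce to a small piece of counter logic, sketched here as a standalone helper (the `BatchCommitter` name and interface are inventions of this sketch, not part of Frappe):

```python
class BatchCommitter:
    """Counter helper for batched commits: call tick() once per processed
    email; it reports when a commit is due. flush() reports whether a final
    commit is needed for a trailing partial batch."""

    def __init__(self, batch_size=20):
        self.batch_size = batch_size
        self.count = 0    # emails processed so far
        self.commits = 0  # commits issued so far

    def tick(self):
        """Record one processed email; return True when a commit is due."""
        self.count += 1
        if self.count % self.batch_size == 0:
            self.commits += 1
            return True
        return False

    def flush(self):
        """Return True if uncommitted emails remain (final commit needed)."""
        if self.count % self.batch_size != 0:
            self.commits += 1
            return True
        return False
```

With 100 emails and a batch size of 20, `tick()` signals a commit 5 times instead of the current 100, matching the estimate in the notes below.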

Additional Notes

  • The # nosemgrep comment suggests the developer was aware this is a pattern to avoid but added it intentionally
  • Consider using frappe.db.savepoint() for finer-grained transaction control if needed
  • With 10 Gmail accounts syncing every 30 minutes, this optimization could reduce database commits from 1000+ to ~50 per sync cycle
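The savepoint suggestion above maps onto the SQL SAVEPOINT statement. A standalone sqlite3 illustration (not Frappe's API; table and savepoint names are this sketch's inventions) of how a failed per-email sub-transaction can be undone without losing earlier work in the same batch:

```python
import sqlite3

# Manage transactions manually so the example is deterministic.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE emails (id INTEGER PRIMARY KEY)")
conn.execute("BEGIN")
conn.execute("INSERT INTO emails VALUES (1)")      # earlier work in the batch
conn.execute("SAVEPOINT per_email")                # mark the start of one email
try:
    conn.execute("INSERT INTO emails VALUES (2)")
    conn.execute("INSERT INTO emails VALUES (2)")  # duplicate id -> IntegrityError
    conn.execute("RELEASE SAVEPOINT per_email")
except sqlite3.IntegrityError:
    # Undo only the failed email; row 1 is untouched.
    conn.execute("ROLLBACK TO SAVEPOINT per_email")
conn.execute("COMMIT")
rows = [r[0] for r in conn.execute("SELECT id FROM emails")]
print(rows)  # -> [1]
```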

Labels

high (High severity), performance (Performance optimization)
