Implementation of Queue-Based Processing Architecture: Carlos outlined their plan to implement a minimal viable product (MVP) using RabbitMQ and a REST API to improve processing efficiency, with Michael and Francis providing input on technical details and deployment strategies.
MVP Development Plan: Carlos described their approach to building a small MVP that integrates RabbitMQ for queuing, a REST API for communication, and batch processing to test if the new architecture is faster and more scalable than the current request-response model.
Performance Testing Strategy: Carlos explained their intention to evaluate the efficiency of the new system by measuring GPU and CPU utilization, testing with multiple workers for scalability, and monitoring how results are written back to the database.
Cloud Deployment and Student Involvement: Carlos shared that a student will join the project to assist with cloud deployment, focusing on provisioning resources and creating a blueprint for trial deployments, with a preference for AWS or Azure depending on available credits.
Cloud Infrastructure Preferences: Michael and Francis discussed preferences for using managed PostgreSQL databases and containerized deployments (e.g., Kubernetes) over VMs, and clarified the roles of Redis and RabbitMQ in the architecture.
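The batch-processing idea in the MVP plan can be illustrated with a small sketch. It uses Python's standard-library `queue.Queue` as a stand-in for a RabbitMQ queue, and the name `drain_batch` and its parameters are hypothetical, not part of the actual MVP. A worker collects up to a fixed number of queued items and processes them as one batch instead of handling one request at a time.

```python
import queue
from typing import Any, List


def drain_batch(q: "queue.Queue[Any]", max_size: int, timeout: float = 0.1) -> List[Any]:
    """Collect up to max_size items, waiting at most `timeout` seconds for each one.

    Assumption: `q` stands in for the broker queue; a real worker would
    consume from RabbitMQ instead of an in-process queue.
    """
    batch: List[Any] = []
    while len(batch) < max_size:
        try:
            batch.append(q.get(timeout=timeout))
        except queue.Empty:
            # Nothing more is waiting; return the partial batch.
            break
    return batch
```

A worker loop would call `drain_batch` repeatedly, pass each non-empty batch to the model, and write the results back to the database.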
Optimization and Extensibility of Processing Service: Carlos and Michael discussed making the processing service generic and extensible, focusing on PyTorch optimizations as a template for common use cases while ensuring compatibility with other pipelines and user needs.
Generic REST API Interface: Carlos emphasized the goal of keeping the processing service's dependency limited to the REST API, allowing integration with various environments such as R, and supporting both optimized and generic processing workflows.
PyTorch Optimization Blueprint: Michael suggested using Carlos's PyTorch optimizations as the primary template for classifier tasks, aiming to maximize GPU efficiency and provide recommendations for best practices.
Balancing Performance and Extensibility: Carlos and Michael acknowledged the need to balance performance optimizations with extensibility, agreeing to document and recommend effective techniques even if a fully generalized framework is not feasible.
Job Monitoring and Failure Handling Improvements: Michael described a new periodic meta task for monitoring unfinished jobs and marking failed tasks, requesting Carlos to review the implementation for integration into the evolving architecture.
Meta Task Implementation: Michael implemented a scheduled meta task that checks all unfinished jobs, marks those disconnected from Celery as failed, and considers retrying or updating statuses as appropriate.
Review and Integration Plans: Carlos agreed to review Michael's pull request and noted plans to provide their own work-in-progress PR for feedback by the end of the week.
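The core check of such a meta task might look roughly like the sketch below. It is not Michael's implementation: the `Job` class, the status names, and `mark_orphaned_jobs` are hypothetical, and the set of live task IDs is assumed to come from whatever the task backend reports.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set

# Assumed status names, for illustration only.
UNFINISHED = {"PENDING", "STARTED"}


@dataclass
class Job:
    job_id: int
    task_id: str
    status: str


def mark_orphaned_jobs(jobs: Iterable[Job], live_task_ids: Set[str]) -> List[Job]:
    """Mark unfinished jobs whose task is no longer known to the backend as FAILED.

    Returns the jobs that were changed so the caller can log or retry them.
    """
    failed: List[Job] = []
    for job in jobs:
        if job.status in UNFINISHED and job.task_id not in live_task_ids:
            job.status = "FAILED"
            failed.append(job)
    return failed
```

Returning the changed jobs rather than only mutating them leaves the retry-or-update decision to the caller.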
RabbitMQ Behavior and Troubleshooting: Carlos shared their experience with RabbitMQ's message dropping behavior under low disk space and discussed configuration options to improve error handling, with Michael requesting documentation of these findings.
Message Dropping Issue: Carlos discovered that RabbitMQ silently drops messages when disk space is low, only accepting messages once the issue is resolved, which can be problematic in Docker environments.
Error Handling Configuration: Carlos found a way to configure RabbitMQ to explicitly error out instead of dropping messages, improving reliability for the processing pipeline.
Documentation Request: Michael asked Carlos to document the troubleshooting steps and manual tests for future reference and team knowledge sharing.
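As an aid for the manual test, a preflight check of free disk space can help confirm whether the broker host is near the point where this behavior appears. According to the RabbitMQ documentation, the broker raises a disk alarm when free space falls below its `disk_free_limit` setting. The sketch below uses only the Python standard library; the function name and the 50 MB example threshold are assumptions for illustration, and this is not the configuration change Carlos found.

```python
import shutil

# Example threshold only; use the value of disk_free_limit configured for the broker.
EXAMPLE_MIN_FREE_BYTES = 50 * 1024 * 1024


def has_disk_headroom(path: str, min_free_bytes: int = EXAMPLE_MIN_FREE_BYTES) -> bool:
    """Return True if the filesystem containing `path` has at least min_free_bytes free."""
    return shutil.disk_usage(path).free >= min_free_bytes
```

In a Docker setup the check needs to run against the volume that holds the broker's data directory, since the host and the container may report different free space.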
Integration with External Processing Services: Michael and Carlos discussed the potential for integrating Kavi's independent processing service with the current antenna system, noting schema similarities and possible API-based connections.
Service Architecture Comparison: Michael explained that Kavi's processing service is independent, queues batches of images via scripts, and saves results to a database, but shares schema design with antenna.
Potential for Integration: Carlos and Michael agreed that, due to similar types and schemas, connecting Kavi's service to antenna via the new API should be straightforward once the architecture is more developed.
Data Availability for Testing: Carlos confirmed to Michael that the shared data bucket provides sufficient images for testing, with options to adjust job size and access additional validated test sets if needed.
Current Test Data Usage: Carlos reported configuring jobs with about 100 images from the shared bucket, which is adequate for short experimental runs, and noted the ability to scale up by adjusting filters.
Access to Validated Test Sets: Michael mentioned the availability of a small validated test set for future use, clarifying that current efforts focus on generating predictions rather than accuracy.
Follow-up tasks:
Cloud Resource Provisioning: Check availability of AWS credits and confirm whether Azure can be used for cloud deployment if AWS is not available. (Carlos)
Cloud Deployment Collaboration: Start a thread with the new student to coordinate cloud deployment tasks and ensure independent progress on antenna deployment and architecture changes. (Carlos)
Pull Request Review: Review Michael's PR implementing a periodic meta task for job status checking once it is ready for review. (Carlos)
Documentation of RabbitMQ Behavior: Document the manual test process for RabbitMQ message dropping due to low disk space and the configuration to force error reporting. (Carlos)