SRCH-5154 Bulk delete zombie records from Elastic Search #1743

Open

krbhavith wants to merge 12 commits into GSA:main from krbhavith:SRCH-5154/zombie_url_upload

Conversation

@krbhavith
Contributor

@krbhavith krbhavith commented Nov 26, 2024

Summary

  • Upload a list of zombie URLs through the Super Admin page to delete them from the SearchgovUrl table and from Kibana.
  • Added a BulkUploadHandler concern to be shared by all bulk upload controllers (see the sketch below).
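
For context, a minimal sketch of what such a shared concern might look like; only the module name comes from this PR, and the method name, validator/job parameters, and flash/redirect flow are assumptions for illustration, not the PR's exact code.

module BulkUploadHandler
  extend ActiveSupport::Concern

  # Shared flow for bulk upload controllers (hypothetical shape): validate the
  # uploaded file, enqueue the background job, and report the outcome.
  def handle_bulk_upload(job_class:, validator_class:)
    @file = params[:file]
    validator_class.new(@file).validate!
    job_class.perform_later(current_user, @file.original_filename, @file.tempfile.path)
    flash[:notice] = "Successfully uploaded #{@file.original_filename} for processing."
  rescue StandardError => e
    flash[:error] = "Could not process #{@file&.original_filename}: #{e.message}"
  ensure
    redirect_back(fallback_location: root_path) # assumed fallback route
  end
end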

Checklist

Please ensure you have addressed all concerns below before marking a PR "ready for review" or before requesting a re-review. If you cannot complete an item below, replace the checkbox with the ⚠️ emoji and explain why the step was not completed.

Functionality Checks

  • You have merged the latest changes from the target branch (usually main) into your branch.

  • Your primary commit message is of the format SRCH-#### <description> matching the associated Jira ticket.

  • PR title is either of the format SRCH-#### <description> matching the associated Jira ticket (i.e. "SRCH-123 implement feature X"), or Release - SRCH-####, SRCH-####, SRCH-#### matching the Jira ticket numbers in the release.

  • Automated checks pass. If Code Climate checks do not pass, explain reason for failures:

Process Checks

  • You have specified at least one "Reviewer".

@krbhavith krbhavith force-pushed the SRCH-5154/zombie_url_upload branch from e1c91c7 to 30e403c on November 26, 2024 at 17:24
Contributor

@stevenbarragan stevenbarragan left a comment


Please make sure the rspec, cucumber, and codeclimate checks pass.

@krbhavith
Contributor Author

Please make sure the rspec, cucumber, and codeclimate checks pass.

I have to make one more change: delete the record with the document_id in I14yDocument when the URL is not available in the searchgov_urls table.
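
A hedged sketch of the fallback being described; the helper name and especially the I14yDocument call and its keyword arguments are assumptions about that class's interface, not confirmed code.

# Hypothetical helper: prefer deleting through SearchgovUrl (which also removes
# the indexed document), and fall back to deleting from I14y by document_id
# when the URL is not in the searchgov_urls table.
def delete_zombie_record(url, document_id)
  searchgov_url = SearchgovUrl.find_by(url: url)
  if searchgov_url
    searchgov_url.destroy
  else
    I14yDocument.delete(handle: 'searchgov', document_id: document_id) # assumed API
  end
end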

@krbhavith krbhavith force-pushed the SRCH-5154/zombie_url_upload branch 4 times, most recently from be72efe to 52117bf on December 10, 2024 at 16:47
Contributor

@stevenbarragan stevenbarragan left a comment


@krbhavith don't forget to fix the test coverage

Contributor

@stevenbarragan stevenbarragan left a comment


Please make sure the file exists where the job will read it, and avoid using rescue as much as possible.

BulkZombieUrlUploaderJob.perform_later(
  current_user,
  @file.original_filename,
  @file.tempfile.path
)
Contributor


This approach won't work on staging or production. When the request comes in, the tempfile gets created on the app servers, but the workers will try to perform the upload from the crawl server.

This file needs to be sent to S3, and the job needs to read it from there.
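
A rough sketch of that hand-off, assuming the aws-sdk-s3 gem is available; the bucket environment variable, key scheme, and the job/uploader constructor signatures are assumptions, not the project's actual code.

# Controller side: persist the upload to S3 and pass the key, not a local path.
s3_key = "bulk_zombie_uploads/#{SecureRandom.uuid}/#{@file.original_filename}"
Aws::S3::Resource.new.bucket(ENV['AWS_BUCKET']).object(s3_key).upload_file(@file.tempfile.path)
BulkZombieUrlUploaderJob.perform_later(current_user, @file.original_filename, s3_key)

# Worker side: fetch the file from S3 so the job does not depend on a tempfile
# that only exists on the app server that handled the request.
class BulkZombieUrlUploaderJob < ApplicationJob
  def perform(user, filename, s3_key)
    contents = Aws::S3::Client.new.get_object(bucket: ENV['AWS_BUCKET'], key: s3_key).body.read
    BulkZombieUrlUploader.new(user, filename, contents).upload # assumed uploader interface
  end
end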

Contributor Author


Similar existing functionality at https://github.com/GSA/search-gov/blob/main/app/controllers/admin/bulk_url_upload_controller.rb#L33 is working fine in production now.

Contributor


How is the worker picking up the tempfile?

@krbhavith krbhavith force-pushed the SRCH-5154/zombie_url_upload branch from 0d18682 to c83b0ed on December 17, 2024 at 04:14
@krbhavith krbhavith force-pushed the SRCH-5154/zombie_url_upload branch from c83b0ed to 7b4929d on December 17, 2024 at 14:23
Contributor

@stevenbarragan stevenbarragan left a comment


We should not need to use .send to call methods; please change that. There are a few places where logging can be improved.

BulkZombieUrlUploaderJob.perform_later(
  current_user,
  @file.original_filename,
  @file.tempfile.path
)
Contributor


How is the worker picking up the tempfile?

let(:row) { { 'URL' => 'https://example.com', 'DOC_ID' => nil } }

it 'logs a missing document ID error' do
  uploader.send(:process_row, row)
Contributor


Do not use send to call methods. If the method is private, test it through the parent or make it public.

uploader.process_row(row)
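
One way to follow this, shown as a hedged sketch: exercise the behavior through the public upload method and assert on observable output instead of reaching into private methods. The CSV shape and the logger expectation below are assumptions about the uploader, not the PR's actual spec.

describe '#upload' do
  let(:file_contents) { "URL,DOC_ID\nhttps://example.com,\n" }

  it 'logs a missing document ID error for rows without a DOC_ID' do
    expect(Rails.logger).to receive(:error).with(/document.?id/i)
    uploader.upload
  end
end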

context 'when document ID is present' do
  it 'handles URL processing' do
    allow(uploader).to receive(:handle_url_processing)
    uploader.send(:process_row, row)
Contributor


Is there a need to use send here?


describe '#handle_processing_error' do
  subject(:handle_processing_error) do
    uploader.send(:handle_processing_error, error, url, document_id, row)
Contributor


Do not use send.


describe '#initialize_results' do
  it 'initializes the results object' do
    uploader.send(:initialize_results)
Contributor


We should not use send. Private methods do not need to be unit tested. The upload method here should have tests for what initialize_results is doing instead.
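
A sketch of what that could look like, assuming the uploader exposes a results reader; the accessor name is an assumption about the class's public interface.

describe '#upload' do
  it 'builds a results object as part of processing the file' do
    uploader.upload

    expect(uploader.results).to be_present # assumed public accessor
  end
end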

# frozen_string_literal: true

class BulkZombieUrls::FileValidator
MAXIMUM_FILE_SIZE = 4.megabytes
Contributor


Where does this restriction come from? I don't see it as part of the ticket.

Contributor Author


It's not a requirement from the ticket, but I am following the same limit as the other existing upload processes.
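
For reference, the kind of size check that limit implies, sketched with assumed error handling; only the class name and the MAXIMUM_FILE_SIZE constant come from the excerpt above.

class BulkZombieUrls::FileValidator
  MAXIMUM_FILE_SIZE = 4.megabytes

  def initialize(file)
    @file = file
  end

  # Raise before enqueueing the job if the uploaded CSV exceeds the limit,
  # mirroring the size guard used by the existing bulk URL upload flow.
  def validate!
    return if @file.size <= MAXIMUM_FILE_SIZE

    raise "#{@file.original_filename} is too large; please split it into files under 4MB."
  end
end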
