Exclude system tables and only restore user's schema#486
Exclude system tables and only restore user's schema#486kendrick-ren wants to merge 1 commit intoscylladb:masterfrom
Conversation
@kendrick-ren, please, split the patch into patches that change one thing, e.g. "Exclude system tables and only restore user's schema". Also, please, provide a meaningful and more detailed than in the current patch (don't confuse with the PR description). I'll also leave a few comments in the patch itself. |
vladzcloudius
left a comment
There was a problem hiding this comment.
As promised - there are more comments.
| tags: | ||
| - restore_token_ring | ||
| tasks: | ||
| - name: Stop all nodes serially |
There was a problem hiding this comment.
Nodes don't need to be stopped serially. Stopping nodes serially is only going to make this step take longer for no reason AFAICT.
There was a problem hiding this comment.
That is correct, this is a mistake. Let me move the step of stopping nodes out of this play.
There was a problem hiding this comment.
Fixed. Made change to stop all nodes in parallel in line 46 ~ 56.
| tags: | ||
| - upload_snapshot | ||
| tasks: | ||
| - name: Restore |
There was a problem hiding this comment.
What's the point of moving this block to be before the restoration of the original cluster's seeds configuration?
The original structure of the playbook that was, first, setting up the destination cluster (all the way) and only then restoring the data makes more sense the proposed version when the data restoration is stuck in between two parts of the cluster configuration steps.
There was a problem hiding this comment.
The intension is to make the cleanup as an always run action, so even the restoration or any previous step fails in between somehow, the cluster would still be able to come back.
While the actual change to make it always action comes as a new feature in the next PR.
There was a problem hiding this comment.
Fixed. Leave the structure as it was before.
There was a problem hiding this comment.
You did not restore the original structure - the "block" and the corresponding indentation are still part of the first commit.
Please, remove.
| # Cluster name to restore schema to. This is the name of the target cluster which | ||
| # is registered in ScyllaDB Manager | ||
| # | ||
| cluster_name: <target_cluster_name> |
There was a problem hiding this comment.
cluster_name is a mandatory value that must appear in scylla.yaml on every Scylla node.
Giving a variable that has a different semantics the same name is very confusing.
Please, rename it to something like scylla_manager_cluster_name.
There was a problem hiding this comment.
Agree, scylla_manager_cluster_name is better. Changed.
There was a problem hiding this comment.
Agree, scylla_manager_cluster_name is better. Changed.
| shell: | | ||
| scylla-manager-agent download-files -L {{ backup_location }} -n {{ host_id[inventory_hostname] }} -T {{ snapshot_tag }} -d /var/lib/scylla/data/ --dry-run | grep "^\s*\-" | cut -d"-" -f2 | cut -d"(" -f1 | ||
| become_user: scylla | ||
| scylla-manager-agent download-files -L {{ backup_location }} -n {{ host_id[inventory_hostname] }} -T {{ snapshot_tag }} --dump-manifest 2>/dev/null | jq -r '.index[] | [.keyspace,.table] | join(".")' |
There was a problem hiding this comment.
This hunk add a dependency on the 3d party jq tool.
Note that the old version on this command didn't have this requirement.
If you still believe that using jq is necessary, please, make sure to update the README.md by add adding the corresponding requirement to a "Prerequisites" section.
There was a problem hiding this comment.
@jpancier-scylla any specific reason on switching to use jq? Can you help elaborate?
There was a problem hiding this comment.
based on the discussion today, we'll keep using jq and update the readme correspondingly.
There was a problem hiding this comment.
Note that Ansible has a native support for JSON.
There was a problem hiding this comment.
Updated the readme to include jq as one prerequisite in the README.md
There was a problem hiding this comment.
Note that Ansible has a native support for JSON.
That is a good idea, let me convert the logic to use the built-in json parser instead of jq. Will submit another commit shortly.
There was a problem hiding this comment.
Switched to Ansible built-in json parser, took out jq from the code and from README.md
There was a problem hiding this comment.
Why not using the built-in Ansible JSON parser?
- name: Parse JSON from stdout
ansible.builtin.set_fact:
parsed_data: "{{ command_output.stdout | from_json }}"
| # - name: Let's see our new facts | ||
| # debug: | ||
| # msg: "{{ inventory_hostname }} old seeds list is {{ old_seeds }}" | ||
|
|
There was a problem hiding this comment.
Put any cleanup into a separate patch/commit.
There was a problem hiding this comment.
What about this comment?
There was a problem hiding this comment.
Excluded from this PR. Use a separate patch to get rid of it.
There was a problem hiding this comment.
I still see it in the first commit of this PR.
| gather_facts: false | ||
| serial: 1 | ||
| tags: | ||
| - restore_token_ring |
There was a problem hiding this comment.
Unless you wipe old data/commitlog/schema_commitlog/etc. directories like the original playbook did Scylla nodes are not going to bootstrap hence won't pick up new tokens.
This whole change makes very little sense.
Did you test that the playbook with your changes actually works?
There was a problem hiding this comment.
The new change was validated in both sanity test and impact test
There was a problem hiding this comment.
Fixed, restore to the original logic.
|
@vladzcloudius I think all comments are resolved. Please review and add new comments if any. Thanks. |
|
@kendrick-ren, please, don't resolve other people's comments. This makes it hard to go over changes that have been requested. |
| # - name: Let's see our new facts | ||
| # debug: | ||
| # msg: "{{ inventory_hostname }} old seeds list is {{ old_seeds }}" | ||
|
|
There was a problem hiding this comment.
What about this comment?
|
@vladzcloudius - name: Let's see our new factsdebug:msg: "{{ inventory_hostname }} old seeds list is {{ old_seeds }}"Member This would be resolved in another patch, instead of combining them into one, per your earlier another comment. |
| become_user: scylla | ||
| register: _tables_list | ||
| vars: | ||
| ansible_pipelining: true |
There was a problem hiding this comment.
Why is pipelining needed?
According to Ansible documentation using pipelining conflicts with a privilege escalation (become) which you use in this task.
There was a problem hiding this comment.
It is to avoid Ansible to write a temp file under a temp directory for the step, while it only conflicts when "become" is used along with requiretty enabled (which is usually disabled for Ansible automation on systems).
There was a problem hiding this comment.
Why do we care if Ansible is going to write a temp file?
| - name: Save tables names list as a fact | ||
| set_fact: | ||
| system_schema_tables: "{{ tables_to_restore | select('search', '^\\s*system_schema\\.') | list }}" | ||
| all_tables: "{{ _tables_list.stdout.split('\n') }}" |
There was a problem hiding this comment.
I don't get it: don't you already have it in _tables_list.stdout_lines ?
There was a problem hiding this comment.
oops, I meant to use that, which was why I generated stdout_lines, but forgot to change the step after.
Fixed now.
|
|
||
| - name: Save names of tables to restore as a fact | ||
| set_fact: | ||
| tables_to_restore: "{{ all_tables | reject('search', '^system(_|\\.)') | list }}" |
There was a problem hiding this comment.
What about audit keyspace?
There was a problem hiding this comment.
audit keyspace is also kept here, this variable is only used in node refresh step. In the steps later, audit keyspace table content is also restored.
| port: 9042 | ||
| host: "{{ listen_address }}" | ||
|
|
||
|
|
There was a problem hiding this comment.
Cleanups should be part of a separate commit.
There was a problem hiding this comment.
okay, isolated the cleanups related change to a separate commit.
There was a problem hiding this comment.
I still see it in the first commit.
| - restore_token_ring | ||
| tasks: | ||
| - name: Delete initial_token in scylla.yaml of {{ inventory_hostname }} | ||
| tags: cleanup |
There was a problem hiding this comment.
Adding tags (which allow you run specific "tagged" tasks) should put in a separate dedicated commit.
There was a problem hiding this comment.
okay, all cleanups related changes went into a separate commit.
| path: /etc/scylla/scylla.yaml | ||
| regexp: '^initial_token:' | ||
| line: "" | ||
| state: absent |
There was a problem hiding this comment.
This is a change in behavior - should sent as a separate dedicated commit with a description that has a motivation for a change.
There was a problem hiding this comment.
since the purpose is to remove initial_token, state: absent will delete the line, while the original line: "" would leave a blank line (which is an extra blank line compared with user's very original scylla.yaml file).
It went to a separate commit.
There was a problem hiding this comment.
Please, don't mix independent changes in the same commit: there should be one commit for tagging and one for this hunk.
Each commit with a corresponding description.
| tasks: | ||
| - name: Restore ScyllaDB schema from backup | ||
| tags: | ||
| - restore_schema |
There was a problem hiding this comment.
Why there are 2 different tags on the task of a play that has a single task?
There was a problem hiding this comment.
That is a good catch. Removed.
| backrefs: yes | ||
|
|
||
| - name: Restore schema from a backup snapshot {{ snapshot_tag }} | ||
| hosts: all |
There was a problem hiding this comment.
I don't get it: why do you run this play on hosts: all while in fact you are going to run a single command on a Scylla Manager host?
There was a problem hiding this comment.
I think that is inherited, and hosts:all is the default in Ansible any way.
But it would be overwritten by the "delegate_to" and "run_once" setting in the sub task under it.
There was a problem hiding this comment.
These are new lines - not inherited.
And you can use whatever host or a group of hosts with the hosts: ... parameter. Hence the question.
Here you were supposed to use scyllamgr_host as your hosts:
Something like (after you address the corresponding comment below):
...
hosts: scylla-manager
...
And drop all this delegate_to and run_once nonsense.
| when: item not in system_schema_tables | ||
| - name: Restore | ||
| tags: | ||
| - restore_data |
There was a problem hiding this comment.
Same question as before: why there are two tags for the same block of tasks?
And even if there is any reason for that - this must come in a separate commit.
In order to be able to set a tag on a block you had to put all those commands in a block and change their indentation which makes it hard to see what was the functional change in this commit that was supposed to only change the way we restore the schema.
Please, take all this syntactic sugar out of the functional commit.
There was a problem hiding this comment.
It makes sense, removed the unnecessary tag.
| # - name: Let's see our new facts | ||
| # debug: | ||
| # msg: "{{ inventory_hostname }} old seeds list is {{ old_seeds }}" | ||
|
|
There was a problem hiding this comment.
I still see it in the first commit of this PR.
|
@vladzcloudius Resolved. |
vladzcloudius
left a comment
There was a problem hiding this comment.
The first patch description seems to be misleading.
There wasn't any issue with tables UUIDs on the source cluster in the original procedure AFAIK.
If there wasn't please, refer the corresponding GH.
AFAIU the change is required simply because the SM API has changed and there is a way to restore the schema without having to upload system KS data.
Please, reference the corresponding API change in the commit message too.
Other than the API change, I was referring to this GH issue -> scylladb/scylla-manager#3019 (comment) for table UUIDs on the target cluster. Is the that no longer an issue? |
| become_user: scylla | ||
| register: _tables_list | ||
| vars: | ||
| ansible_pipelining: true |
There was a problem hiding this comment.
Why do we care if Ansible is going to write a temp file?
| port: 9042 | ||
| host: "{{ listen_address }}" | ||
|
|
||
|
|
There was a problem hiding this comment.
I still see it in the first commit.
| backrefs: yes | ||
|
|
||
| - name: Restore schema from a backup snapshot {{ snapshot_tag }} | ||
| hosts: all |
There was a problem hiding this comment.
These are new lines - not inherited.
And you can use whatever host or a group of hosts with the hosts: ... parameter. Hence the question.
Here you were supposed to use scyllamgr_host as your hosts:
Something like (after you address the corresponding comment below):
...
hosts: scylla-manager
...
And drop all this delegate_to and run_once nonsense.
| - name: Get names of the tables in the snapshot {{ snapshot_tag }} | ||
| shell: | | ||
| scylla-manager-agent download-files -L {{ backup_location }} -n {{ host_id[inventory_hostname] }} -T {{ snapshot_tag }} -d /var/lib/scylla/data/ --dry-run | grep "^\s*\-" | cut -d"-" -f2 | cut -d"(" -f1 | ||
| scylla-manager-agent download-files -L {{ backup_location }} -n {{ host_id[inventory_hostname] }} -T {{ snapshot_tag }} --dump-manifest |
There was a problem hiding this comment.
This (first) commit does a lot more than changes just the way schema is restored.
It refactors all places that use scylla-manager-agent because its API has change apparently.
The patch description should be adjusted accordingly.
| with_items: "{{ tables_to_restore }}" | ||
| when: item not in system_schema_tables | ||
| - name: Restore | ||
| block: |
| tags: | ||
| - upload_snapshot | ||
| tasks: | ||
| - name: Restore |
There was a problem hiding this comment.
You did not restore the original structure - the "block" and the corresponding indentation are still part of the first commit.
Please, remove.
| 3.66.25.100: aff05f79-7c69-4ecf-a827-5ea790a0fdc6 | ||
|
|
||
| # Host where Scylladb Manager is run from | ||
| scyllamgr_host: <scylla_manager_host_ip, e.g. localhost if the playbook is run on the Scylla Manager host> |
There was a problem hiding this comment.
This should be part of the inventory, not the vars.
Make it a new section in the inventory.
| path: /etc/scylla/scylla.yaml | ||
| regexp: '^initial_token:' | ||
| line: "" | ||
| state: absent |
There was a problem hiding this comment.
Please, don't mix independent changes in the same commit: there should be one commit for tagging and one for this hunk.
Each commit with a corresponding description.
| block: | ||
| - name: Download data | ||
| shell: | | ||
| scylla-manager-agent download-files -L {{ backup_location }} -n {{ host_id[inventory_hostname] }} -T {{ snapshot_tag }} -d {{ data_dir }} -K '*,!system*' --mode upload |
There was a problem hiding this comment.
Regular expression here and at the line 51 are not the same.
Hence, if the user has a KS with a name system_my_keyspace you will not download its data but will try to restore it and will report it as a success while in fact you wouldn't restore anything.
There was a problem hiding this comment.
Right, changed the regular expression in this one.
On the other hand, the regex in line 51 would also exclude the example you gave, line 51 basically excludes all "system_" and "system.".
Hence, I changed to '*,!system_*,!system.*'
|
|
||
| - name: refresh nodes with the restored data | ||
| shell: | | ||
| nodetool refresh {{ item.split('.') | join(' ') }} |
There was a problem hiding this comment.
This is not going to work event with the nodetool from 2025.1 since the syntax is now nodetool refresh --keyspace <KS> --table <Table name>:
$ nodetool help refresh
Load newly placed SSTables to the system without a restart
Add the files to the upload directory, by default it is located under
/var/lib/scylla/data/keyspace_name/table_name-UUID/upload.
Materialized Views (MV) and Secondary Indexes (SI) of the upload table, and if
they exist, they are automatically updated. Uploading MV or SI SSTables is not
required and will fail.
For more information, see: https://opensource.docs.scylladb.com/branch-2025.1/operating-scylla/nodetool-commands/refresh.html"
scylla-nodetool options:
-h [ --help ] show help message
--help-seastar show help message about seastar options
--help-loggers print a list of logger names and exit
-h [ --host ] arg (=localhost) the hostname or ip address of the ScyllaDB
node
-p [ --port ] arg (=10000) the port of the REST API of the ScyllaDB node
--rest-api-port arg the port of the REST API of the ScyllaDB node;
takes precedence over --port|-p
--password arg Remote jmx agent password (unused)
--password-file arg Path to the JMX password file (unused)
-u [ --username ] arg Remote jmx agent username (unused)
--print-port Operate in 4.0 mode with hosts disambiguated
by port number (unused)
--keyspace arg The keyspace to load sstable(s) into
--table arg The table to load sstable(s) into
refresh:
--load-and-stream Allows loading sstables that do not belong to
this node, in which case they are
automatically streamed to the owning nodes
--primary-replica-only Load the sstables and stream to primary
replica node that owns the data. Repair is
needed after the load and stream process
There was a problem hiding this comment.
I recommend considering to use REST API instead - its API is probably less likely to break unlike the nodetool.
Check if the corresponding REST API in 2022.2 is different from the one in 2025.1.
There was a problem hiding this comment.
hmm, the following doc: https://docs.scylladb.com/manual/branch-2025.1/operating-scylla/nodetool-commands/refresh.html
and https://enterprise.docs.scylladb.com/branch-2024.1/operating-scylla/nodetool-commands/refresh.html
Both shows that the same syntax is still supported, except 2025.1 and beyond has more options added. The test done before on both 2024.1 LTS and 2025.1 LTS/2025.2/2025.3 worked.
In addition, REST API is still marked as BETA in 2024.2, while it looks it is not exposed in 2024.1 LTS official doc at all.
We have customer still on 2024.1 LTS. It is probably better to still use nodetool at the moment.
| nodetool refresh {{ item.split('.') | join(' ') }} | ||
| with_items: "{{ tables_to_restore }}" | ||
|
|
||
| - name: Restart Scylla service to pick up the restored data |
There was a problem hiding this comment.
Why do you need to restart Scylla to pick up a restored data? The whole point of nodetool refresh to NOT require a restart.
There was a problem hiding this comment.
That is a good one. Took it out.
|
For this comment (#486 (comment)), scylla user may or may not have privilege to write the temp file under a temp folder for Ansible task, when it doesn't, this task will give a warning complaining the privilege, which may confuse customers even it doesn't harm. The ansible_pipelining option would avoid this. |
|
for #486 (comment), the cleanups related stuffs are now taken out from the PR. |
|
for #486 (comment), fixed. |
…uuids on the target cluster Use Scylla Manager command "restore --restore-schema" (refer to https://manager.docs.scylladb.com/stable/sctool/restore.html for the detail of the command) to restore user's table instead of using the previous approach of through restoring system table data. This avoids the issue discussed in scylladb/scylla-manager#3019. Refactor the task "Get names of the tables in the snapshot" with the new option --dump-manifest and also replace the previous JSON parsing logic by using Ansible built-in JSON parser to get more more reliable results. Refactor inventory file to have 2 separate sections, one for Scylla hosts and the other for Scylla Manager host which is required to run above Scylla Manager "restore --restore-schema" command.
This PR excludes system tables from the restore process, and only restores user's schema with scylla manager restore --restore-schema command.