SLA-aware SQL Server Maintenance Framework (Index rebuild, stats update, partial integrity checks)
The system was created to solve the problem of SLA violations during SQL Server database maintenance, which has often been the reason some of our customers gave up on database maintenance entirely. The system has the capabilities needed to solve this problem:
- Executes only the maintenance tasks that are actually needed, based on an evaluation of database state
- Prioritizes tasks so that the most important ones are executed first
- Predicts the execution time of each task
- Keeps track of time during execution and does not start the next task if it probably won't fit into the remaining timeframe
The current version of the system can perform various types of maintenance tasks (illustrative T-SQL for the statistics variants is shown after this list):
- Offline index rebuild
- Online index rebuild (when possible, given the column data types, SQL Server version, edition, and system settings)
- Index reorganization
- Statistics update with FULLSCAN
- Statistics update with a limited SAMPLE
- Statistics update with the previously used sample rate (RESAMPLE)
- Metadata checks (CHECKCATALOG)
- Allocation checks (CHECKALLOC)
- Single-table integrity checks (CHECKTABLE)
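For reference, the statistics update variants correspond to standard T-SQL like the following sketch (the object name is hypothetical; the exact commands the system generates may differ):
UPDATE STATISTICS Sales.SalesOrderDetail WITH FULLSCAN;           -- read every row
UPDATE STATISTICS Sales.SalesOrderDetail WITH SAMPLE 10 PERCENT;  -- limited sample
UPDATE STATISTICS Sales.SalesOrderDetail WITH RESAMPLE;           -- reuse the previously used sample rate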
- Fully modular execution model that can be adapted to almost any constraints
- Scalability – the most resource-intensive phases can be configured to run in several threads: fragmentation evaluation and execution of maintenance tasks (index maintenance, statistics updates, integrity checks)
- Prediction – the system uses an ML algorithm to predict how long each task will take, based on previous executions
- SLA awareness – the system skips tasks that probably won't fit into the remaining maintenance window
- Time budget – each phase (task generation, index maintenance, statistics update, and integrity checks) has its own time budget; if some time remains, the system redistributes it and uses it to retry failed or skipped tasks
- Locking control – uses the WAIT_AT_LOW_PRIORITY feature or LOCK_TIMEOUT to avoid long blocking (see the example after this list)
- Soft stop – execution can be stopped at any time, either softly (finish the current task and stop) or hard (stop immediately)
- Granular integrity checks – instead of running a full DBCC CHECKDB, more granular checks can be run (CHECKCATALOG, CHECKALLOC, CHECKTABLE), all with time control
- Monitoring of AG synchronization – if the redo queue exceeds the given threshold or the AG is unhealthy, index maintenance tasks for that database are skipped
- Monitoring of running backups – the system can be configured to skip index maintenance tasks for a database while a backup is running, to avoid overflowing the transaction log
- Log space control – the system can track the amount of space left for transaction logging (both in the ldf file and on the volume it resides on)
- Blacklisting of operations or objects (exclusion from maintenance)
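For reference, the underlying locking options are standard T-SQL; the values below are illustrative, not necessarily the ones the system generates:
-- Online rebuild that waits at low priority for 5 minutes and then gives up itself
-- instead of blocking the workload (index and table names are hypothetical):
ALTER INDEX [IX_Example] ON [dbo].[SomeTable]
REBUILD WITH (ONLINE = ON (WAIT_AT_LOW_PRIORITY (MAX_DURATION = 5 MINUTES, ABORT_AFTER_WAIT = SELF)));
-- For commands without WAIT_AT_LOW_PRIORITY support, a session-level lock timeout (in milliseconds):
SET LOCK_TIMEOUT 30000;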
The system consists of a SQL Server database and a set of SQL Server Agent jobs. The database name is VBMS (Value Based Maintenance System – the system evaluates every task and executes the most valuable ones first). The database name must not be changed, as it is used in code. The Agent jobs fall into three groups:
- Tasks generators (FillQueueAll)
- Workers (StartWorker)
- Killers (KillWorkers)
Task generation can be executed before the maintenance window, since it usually should not interrupt normal work. Test this on each deployment to confirm that the implied performance impact is acceptable.
Stores global parameters (the dbo.Parameters table). Values of different data types are stored in three separate columns (string_value, int_value, float_value). Each parameter has a brief description.
Stores the types of operations that system can perform.
The main table in the system (dbo.Tasks). Tasks are added here and then updated during execution. Each task is a T-SQL command plus a set of metadata, such as the operation type, the target object, and the dates when it was added, started, and completed.
Field | Data type | Description |
---|---|---|
[entry_id] | bigint | Primary key, increment is 1. |
[batch_id] | uniqueidentifier | Unique identifier of a group of tasks, generated during one execution of FillQueueAll |
[subsystem_id] | int | Id of type of work (references dbo.OperationTypes) |
[action_type_id] | int | Id of type of action (references dbo.OperationsTypes) |
[command] | nvarchar(4000) | T-SQL code to be executed |
[date_added] | datetime | Datetime when task was generated by FillQueueAll |
[date_started] | datetime | Datetime when task was taken by worker |
[date_completed] | datetime | Datetime when task was completed by worker |
[duration_s] | int | Time in seconds spent for execution |
[database_id] | int | Id of database affected by task |
[table_id] | int | Id of a table affected by task |
[index_id] | int | Id of an index or statistics affected by task |
[partition_n] | int | Number of partition affected by task |
[maxdop] | int | Max degree of parallelism used for index maintenance task execution |
[user_scans] | int | Number of scans performed on the index; taken from sys.dm_db_index_usage_stats |
[user_seeks] | int | No longer used. |
[user_updates] | int | No longer used. |
[rowcnt] | bigint | Number of rows stored in the partition or table. |
[size_mb] | float | Size of a table, partition, or index. |
[ix_frag] | float | Percent of fragmentation |
[time_factor] | float | Ratio of size_mb to duration_s; used to predict the duration of the next execution of this type of task on this object. |
[rowmod_factor] | float | Factor used to measure how outdated the statistic is. |
[checked_daysago] | int | Days since the table was last checked |
[result] | nvarchar(max) | Outcome of the operation execution: OK or a detailed error message. |
[time_prognosis_s] | int | Approximate time this task may take, in seconds. Calculated based on previous executions. |
[exit_code] | int | 1 if execution was successful, -1 if not enough time was left, -2 if not enough transaction log space was left, or the exact error number if an error occurred. |
[worker_name] | nvarchar(50) | Text identifier of the worker that executed this task. |
[execution_id] | uniqueidentifier | Unique identifier of a group of tasks executed by the same worker during one launch. |
[priority] | int | Set automatically to lower the priority of tasks that failed last time. 1 – normal priority, 2 – low priority |
[volume_mount_point] | nvarchar(255) | Volume or mount point where the object physically resides. |
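As a usage sketch (not part of the system), the columns above can be queried to review how a run went; the batch_id value is a placeholder:
-- Tasks from one generation batch that did not finish successfully, with the reason (see exit_code above)
SELECT entry_id, command, exit_code, result, duration_s, time_prognosis_s
FROM dbo.Tasks
WHERE batch_id = '00000000-0000-0000-0000-000000000000'  -- replace with your batch_id
  AND (exit_code IS NULL OR exit_code <> 1)
ORDER BY date_added;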
Task execution profiles are stored here (the dbo.Workers table). A profile is a set of parameters that control what the worker will execute and how.
Field | Data type | Description | Default value |
---|---|---|---|
worker_name | nvarchar(255) | Unique profile name | |
date_added | datetime | Date when the profile was added. Filled automatically. | getdate()
owner | nvarchar(150) | Who to ask about this profile | John Doe |
comment | nvarchar(500) | Description of profile. What is this profile for? ex: AW_Nightly_full_maint | Owner did not add any comment |
use_user_db | bit | Perform tasks for user DBs. | 1 |
use_system_db | bit | Perform tasks for system DBs (and VBMS too). | 0 |
dbname_list | varchar(500) | List of databases. No spaces allowed. Ex: 'DB1,DB2,DB3' | '' |
except_list | bit | Invert dbname_list. | 0 |
indexes | bit | Perform index maintenance tasks | 1 |
frag_eval_time_limit_s | bigint | Time limit (seconds) for index fragmentation analysis | 
stats | bit | Perform stats maintenance tasks | 1 |
stats_sample | nvarchar(50) | Sample to use when generating statistics maintenance tasks after index maintenance. Supported values: SAMPLE # PERCENT, SAMPLE # ROWS, FULLSCAN | FULLSCAN
checkall | bit | Perform any kind of checks | 1 |
checktable | bit | Perform CHECKTABLE | 1 |
checkalloc | bit | Perform CHECKALLOC | 1 |
checkcatalog | bit | Perform CHECKCATALOG | 1 |
online_only | bit | Do not pick the Offline index rebuild tasks | 1 |
check_backup_state | bit | Check whether a backup is currently running on this database | 
check_ag_state | bit | Check whether the AG is unhealthy | 
ag_max_queue | bit | Maximum redo queue size | 
afterparty | bit | At the end, if more than 5 minutes remain, the worker can retry skipped tasks of any kind. Likewise, if the index maintenance time limit was hit while index tasks remained and the later phases (statistics maintenance and integrity checks) finished early, the saved time can be spent on those index tasks. Enabled by default; do not turn it off unless you are sure. | 1
add_stats_runtime | bit | Add statistics maintenance tasks AFTER index operations. This can save time, because some statistics are already updated during index rebuilds; without this option that time would be wasted on redundant updates. | 1
totaltimemin | int | Total limit of worker execution time. Overrides TotalMaintenanceWindowMn global parameter. | |
indextime | int | Percent of totaltimemin that can be spent for index maintenance. Overrides IndexMaintenanceWindowPercent global parameter | |
stattime | int | Percent of totaltimemin that can be spent for stats maintenance. Overrides StatMaintenanceWindowPercent global parameter | |
checktime | int | Percent of totaltimemin that can be spent for integrity checks. Overrides CheckWindowPercent global parameter. | |
latched_spid | int | Flag used to prevent FillQueueAll from being started for this worker several times. When FillQueueAll starts for a particular worker, it places the current session_id here. | 0
stoplight | bit | Flag used to prevent the worker from taking the next task; a signal to stop work after the current task is completed (see the example after this table). | 0
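A minimal sketch of using the soft-stop flag described above, assuming it is toggled directly in the table (the worker name is hypothetical):
-- Ask the worker to finish its current task and then stop (soft stop):
UPDATE dbo.Workers SET stoplight = 1 WHERE worker_name = 'AWFullMaint';
-- Clear the signal again before the next run:
UPDATE dbo.Workers SET stoplight = 0 WHERE worker_name = 'AWFullMaint';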
This table (dbo.WorkerSession) tracks current worker activity. Each FillQueueAll run or worker adds a record here when it starts and updates it after each executed task. It is used mainly for killing workers when needed (see the query after this table).
Field | Data type | Description | Default value |
---|---|---|---|
record_date | datetime | Timestamp of last update to this record | getdate() |
worker_name | nvarchar(255) | Name of a worker, executing tasks | |
session_id | int | Session_id of a worker | |
program_name | nvarchar(255) | Program name, taken from sys.dm_exec_sessions | 
subsystem_id | int | subsystem_id of the current task. If 1 (index maintenance), killing may cause a rollback. | 
entry_id | bigint | entry_id from dbo.Tasks that is currently executing | 
is_afterparty | bit | Is worker in the afterparty phase. | 0 |
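As a monitoring sketch (not part of the system), current worker activity can be inspected by joining dbo.WorkerSession to dbo.Tasks:
-- What each worker is doing right now and for how long
SELECT ws.worker_name, ws.session_id, ws.subsystem_id, ws.is_afterparty,
       t.command, t.date_started,
       DATEDIFF(SECOND, t.date_started, GETDATE()) AS running_s
FROM dbo.WorkerSession AS ws
LEFT JOIN dbo.Tasks AS t ON t.entry_id = ws.entry_id;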
This table stores the output of DBCC checks. When a DBCC command finds errors, it writes them here. Most of the fields are inherited from DBCC CHECK[TABLE | ALLOC | CATALOG] WITH TABLERESULTS (see the example after this table).
Field | Data type | Description | Default value |
---|---|---|---|
rec_id | int | Primary key | |
execution_guid | uniqueidentifier | Unique identifier. Same for records generated during one DBCC launch. Not related to any fields in dbo.Tasks | |
log_date | datetime | Date and time when check has been executed | |
All other fields come from the standard DBCC output.
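For reference, the granular checks that feed this table are ordinary DBCC commands; a hand-run equivalent looks like this (the table name is illustrative):
-- TABLERESULTS makes DBCC return rows instead of messages, which is what allows the output to be captured:
DBCC CHECKTABLE ('Sales.SalesOrderDetail') WITH TABLERESULTS, ALL_ERRORMSGS, NO_INFOMSGS;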
This table (dbo.Blacklist) stores records of all actions that should not be taken. You can blacklist any kind of maintenance task for any kind of object. Each record represents a rule described by subsystem_id, actiontype_id, database_id, table_id, and index_id; NULL acts as a wildcard (*). To add blacklist items, use the stored procedure dbo.AddBlacklistItem. Rules can be enabled or disabled.
Field | Data type | Description | Default value |
---|---|---|---|
item_id | bigint | Primary key | |
date_added | datetime | Date when rule was added | getdate() |
database_id | int | Id of a database | |
table_id | int | Object_id of a table | |
index_id | int | Id of an index or a statistic | |
partition_n | int | Number of a partition | |
subsystem_id | int | Id of subsystem to blacklist (refers to dbo.OperationsTypes) | |
actiontype_id | int | Id of action type to blacklist (refers to dbo.OperationsTypes) | |
enabled | bit | Flag to enable/disable rule | 1 |
worker_name | nvarchar(255) | Used to map rules to specific workers. NULL = applies to every worker |
SP for maintenance task generation. It evaluates the chosen DBs and adds suitable maintenance tasks to the dbo.Tasks table. This SP is in fact a wrapper around a set of dbo.FillQueueXXX procedures.
Parameter | Data type | Description | Default value |
---|---|---|---|
worker_name | nvarchar(50) | Generate tasks only for a specific worker profile | NULL |
indexes | bit | Generate tasks for index maintenance | 1 |
async_frag_collection | bit | Used when fragmentation collection needs to be handed over to separate thread(s) for scaling. Advanced scenario. | 1
collect_frag_now | int | 0 – do not collect frag data, only prepare the index list; 1 – prepare the index list, collect frag data, and generate tasks in this session; 2 – the index list with frag data is already filled, only the tasks need to be prepared. Advanced scenario. | 1
stats | bit | Generate tasks for statistics maintenance | 0 |
stats_sample | nvarchar(50) | Sample size used, when generating statistics maintenance tasks | FULLSCAN |
checkall | bit | Generate all sorts of DBCC check tasks. | 1 |
checktable | bit | Generate DBCC CHECKTABLE tasks | 1 |
table_max_size_mb | int | Maximum size (Mb) of a table that can be checked by DBCC CHECKTABLE | 1000000000000 |
checkalloc | bit | Generate DBCC CHECKALLOC tasks | 1 |
checkcatalog | bit | Generate DBCC CHECKCATALOG tasks | 1 |
use_user_db | bit | Generate tasks for user databases | 1 |
use_system_db | bit | Generate tasks for system databases | 0 |
dbname_list | nvarchar(500) | Generate tasks for the listed databases | ''
except_list | bit | Turns dbname_list into exception list | 0 |
sortintempdb | bit | Use tempdb for sorting during index maintenance. Overrides values, stored in dbo.Parameters | NULL |
ix_maxdop | bit | Maximum number of cores used for index maintenance. Overrides the value stored in dbo.Parameters | NULL
The checkall flag combines the three flags checktable, checkalloc, and checkcatalog. Each of these three flags can override the checkall flag.
FillQueueAll can be used in several ways. The most common scenario is to configure a worker in the dbo.Workers table and then generate tasks for that particular worker:
Exec dbo.FillQueueAll @worker_name = 'AdventureWorks_FullMaint'
This generates tasks that can be executed by the previously configured AdventureWorks_FullMaint worker.
The other option is to manually select which types of tasks should be generated, regardless of whether any workers are configured to handle them:
exec dbo.FillQueueAll @indexes = 1, @stats = 1, @dbname_list = 'AdventureWorks,DB1', @checkall = 0, @checktable = 1
This generates index maintenance, statistics maintenance, and table integrity check tasks for the AdventureWorks and DB1 databases.
The procedure removes all incomplete tasks for the corresponding objects before generating new ones, to avoid duplicates and execution of outdated tasks.
The procedure returns a batch_id, which is the same for all tasks generated in the run.
SP for generation of CHECKALLOC tasks. Used by FillQueueAll. No direct usage is advised.
SP for generation of CHECKCATALOG tasks. Used by FillQueueAll. No direct usage is advised.
SP for generation of CHECKTABLE tasks. Used by FillQueueAll. No direct usage is advised.
SP for generation of index maintenance tasks. Used by FillQueueAll. No direct usage is advised.
Considers many factors to propose the best possible types of operations.
SP for generation of statistics maintenance tasks. Used by FillQueueAll. No direct usage is advised.
Uses a dynamic threshold similar to trace flag 2371 (for large tables the threshold shrinks as the table grows, instead of the fixed 20% + 500 rows).
This procedure is used inside the StartWorker SP to execute previously generated tasks.
Most of its parameters are taken from dbo.Workers.
The SP loops through dbo.Tasks and picks tasks one after another, filtered by the parameters provided.
There are four phases:
- Index maintenance
- Stats maintenance
- Integrity checks
- AfterParty, if enough time is left.
During each phase, tasks are sorted in a specific way. For example, during the index maintenance phase the first chosen task is for the index that is most heavily scanned, most fragmented, and largest.
For statistics, the rowmod_factor is used. It shows how outdated a statistic is relative to its dynamic threshold; for example, 200 means the statistic is twice as outdated as its threshold allows.
Integrity check tasks are sorted by days since the last known check ("last known" means the last one found in the dbo.Tasks table).
Before executing each task, the time left is evaluated. Previous executions of this kind of task against this object are used to predict how much time it will take this time. If the prognosis shows that the task may need more time than remains, it is skipped (see the sketch after this description).
Free space for the transaction log is also checked; see dbo.Parameters for details.
If a task was deadlocked, it is retried during the afterparty, along with any skipped tasks, provided enough time is left in that phase. That can happen when the statistics and integrity check tasks did not use all of their time.
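Conceptually, the time check works roughly like the following sketch (an illustration of the logic, not the actual procedure code; variable values are hypothetical):
DECLARE @phase_deadline datetime = DATEADD(MINUTE, 30, GETDATE());  -- example: 30 minutes left in this phase
DECLARE @entry_id bigint = 123456;                                  -- hypothetical task
DECLARE @time_prognosis_s int = (SELECT time_prognosis_s FROM dbo.Tasks WHERE entry_id = @entry_id);
-- Skip the task when its prognosis does not fit into what is left of the phase
IF @time_prognosis_s > DATEDIFF(SECOND, GETDATE(), @phase_deadline)
    UPDATE dbo.Tasks
    SET exit_code = -1, result = 'Skipped: not enough time left'  -- -1 means "not enough time left"
    WHERE entry_id = @entry_id;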
This SP starts the execution of the tasks in the queue. The @worker_name parameter is the name of a saved profile from the dbo.Workers table.
This SP stops the execution of the selected workers one by one. It is mainly an emergency mechanism.
It uses information from the dbo.WorkerSession table. If a worker is executing an index maintenance task (subsystem = 1), the SP skips to the next worker, since killing an index maintenance task may cause a long rollback. This behavior can be overridden with the @force parameter.
Parameters:
Parameter | Data type | Description | Default value |
---|---|---|---|
worker_names | nvarchar(500) | Worker names to be killed without spaces. Ex: 'Worker1,Worker2,Worker3' | '' |
except_list | bit | Reverse worker_names selection. | 0 |
force | bit | Kill workers even if there are index operations in progress | 0 |
This procedure adds new rules to the dbo.Blacklist table. A rule can be added either by specifying all the parameters that identify the blacklisted tasks, or by specifying just an entry_id from dbo.Tasks, which blacklists that kind of task. Blacklisted tasks are no longer generated. Rules can be enabled or disabled using the [enabled] column.
Parameter | Data type | Description | Default value |
---|---|---|---|
entryid | bigint | entry_id of an existing task in dbo.Tasks; blacklists that kind of task | NULL
database_id | int | Id of a database | NULL
table_id | int | Object_id of a table | NULL
index_id | int | Id of an index or a statistic | NULL
partition_n | int | Number of a partition | NULL
subsystem_id | int | Id of the subsystem to blacklist | NULL
action_type_id | int | Id of the action type to blacklist | NULL
worker_name | nvarchar(255) | Apply the rule only to this worker; NULL = every worker | NULL
Let's say that for some reason we do not want to perform any maintenance operations on the table Sales.SalesOrderDetail in the AdventureWorks2014 database. Then we run a query like this:
USE [AdventureWorks2014]
DECLARE @db int = (SELECT DB_ID()),
        @tbl int = (SELECT OBJECT_ID('Sales.SalesOrderDetail'))
EXEC VBMS.dbo.AddBlacklistItem
    @database_id = @db,
    @table_id = @tbl,
    @index_id = NULL,       -- all indexes
    @partition_n = NULL,    -- all partitions
    @subsystem_id = NULL,   -- all subsystems
    @action_type_id = NULL  -- all types of actions
Or, if we only need to block statistics updates on this table, and such a task was previously generated in dbo.Tasks under entry_id 123456, then we can simply use a query like this:
EXEC VBMS.dbo.AddBlacklistItem @entry_id = 123456
You will not see this kind of task anymore; dbo.AddBlacklistItem simply takes all the parameters from the specified record of dbo.Tasks.
This function produces the execution-time prognosis for tasks. It is not used directly.
Run the VBMS_create.sql script in SQLCMD mode (an example command line is shown below)
Deploy the data-tier application
Open the dbo.Parameters table and set all the parameters. The defaults may not work for you.
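One way to run the creation script is with the sqlcmd utility; the server name and file path below are placeholders:
sqlcmd -S YourServer -E -i "C:\path\to\VBMS_create.sql"
Here -S names the server, -E uses Windows authentication, and -i points at the input script.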
Maintenance itself consists of two phases – task generation and task execution – which can be combined or separated. Each of them can be run in multiple threads. For example, you can generate tasks for all DBs at once and execute them in 5 different jobs (workers).
Task generation is done by the dbo.FillQueueAll procedure. It evaluates the state of the DBs passed to it and generates tasks for further execution based on the parameters. There are two main ways to use it:
- Manually fill all the parameters, except @worker_name.
- Fill only @worker_name
The first scenario is used when one FillQueueAll run should generate tasks for several different workers.
The second one is much simpler and is used to create tasks for a specific worker.
Generated tasks are saved in the dbo.Tasks table. After execution, those records are simply updated with the execution stats, duration, etc.
Workers need to be configured to handle the generated tasks. Insert a record into the dbo.Workers table, filling in the parameters. Some of them have default values, so double-check that the result suits you.
Some of these parameters override the corresponding global values in dbo.Parameters where they overlap.
Then configure task generation: set up a SQL Server Agent job that executes dbo.FillQueueAll with the proper parameters.
Suppose we have one DB to maintain – AdventureWorks. Maintenance can be performed only from 01:00 to 03:00, and all sorts of operations are allowed in that window.
We configure FillQueueAll to run every day at 00:00 so that analysis does not consume maintenance time. Filling the queue and processing it are two different activities and are not tightly coupled: we can collect all sorts of tasks and then execute only those we want, for example with one FillQueueAll job and 5–6 StartWorker jobs, each performing different tasks from the queue.
The job step should look like:
EXEC [dbo].[FillQueueAll] @worker_name = 'AWFullMaint'
OR
EXEC dbo.FillQueueAll @indexes = 1, @stats = 1, @stats_sample = 'FULLSCAN', @checkall = 1, @use_user_db = 1, @dbname_list = 'AdventureWorks2016'
FillQueueAll returns a batch_id, which identifies all the tasks generated during this run. You can find them in dbo.Tasks with a query:
SELECT * FROM dbo.Tasks WHERE batch_id = 'YOUR BATCH_ID'
To configure a worker, add a record to the dbo.Workers table. This is where we choose which types of tasks the worker will execute and how. All the bit parameters have defaults.
In our case, we set the following (a sketch of the corresponding INSERT follows this list):
[worker_name] = 'AWFullMaint'
[owner] = 'John Doe'
[comment] = 'AW full maintenance at night only'
[use_user_db] = 1
[dbname_list] = 'AdventureWorks2016'
[online_only] = 0
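A minimal sketch of the corresponding INSERT, setting only the columns listed above and assuming the remaining columns are covered by their defaults:
INSERT INTO dbo.Workers (worker_name, owner, comment, use_user_db, dbname_list, online_only)
VALUES ('AWFullMaint', 'John Doe', 'AW full maintenance at night only', 1, 'AdventureWorks2016', 0);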
Now that the profile is created, we can refer to it by worker_name.
Set up a job that executes
EXEC dbo.StartWorker 'AWFullMaint'
every night at 01:00.
Since we have only 2 hours for maintenance tasks (01:00 – 03:00), we must limit the execution time to 2 hours. There are two options (a parameter-update sketch follows this list):
- Set the global parameters in dbo.Parameters (TotalMaintenanceWindowMn = 120, IndexMaintenanceWindowPercent = 30, StatMaintenanceWindowPercent = 30, CheckWindowPercent = 30),
- Set these parameters in the worker profile in dbo.Workers (similar columns with the same effect).
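A sketch for the first option; the name of the key column in dbo.Parameters is an assumption here (called parameter_name for illustration), so adjust it to the actual column name in your installation:
-- Assumes dbo.Parameters has a parameter_name column alongside the documented int_value column
UPDATE dbo.Parameters SET int_value = 120 WHERE parameter_name = 'TotalMaintenanceWindowMn';
UPDATE dbo.Parameters SET int_value = 30
WHERE parameter_name IN ('IndexMaintenanceWindowPercent', 'StatMaintenanceWindowPercent', 'CheckWindowPercent');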
Just to be sure we will not violate the SLA, we can configure a KillWorkers job.
Create a job that executes EXEC dbo.KillWorkers @worker_names = 'AWFullMaint'
The @force flag can be used to allow killing even when it may cause an index operation rollback; however, it is not recommended.
It is needed only as a safety net: all our tests show that after the first execution (once the system has gained some history) it is quite accurate in its predictions (unless someone starts a huge workload that makes the server twice as slow as usual). So, normally killing workers will not be needed.
That's it. We're good to go.