Skip to content
/ VBMS Public
forked from trootnev/VBMS

SLA-aware SQL Server Maintenance Framework (Index rebuild, stats update, partial integrity checks)

License

Notifications You must be signed in to change notification settings

NWiddup/VBMS

 
 

Repository files navigation

VBMS

SLA-aware SQL Server Maintenance Framework (Index rebuild, stats update, partial integrity checks)

System purpose

System has been created to solve the problem of SLA violation during SQL Server Database Maintenance, that often been a reason why some of our customers gave up database maintenance at all. System has necessary capabilities to solve this problem:

  • Executes only needed maintenance tasks, evaluating database state
  • Prioritizes tasks to ensure the most important tasks are executed
  • Predicts execution time
  • Keeps the track of time during execution and does not start next task if it probably won’t fit remaining timeframe.

Supported Maintenance Tasks

Current version of system can perform various types of maintenance tasks

Index maintenance:

  • Offline index rebuild,
  • Online index rebuild (if it is possible due to columns data type, SQL Server version, edition and system settings),
  • Index reorganization,

Statistics update

  • Update statistics with FULLSCAN,
  • With limited SAMPLE,
  • With previously used SAMPLE (RESAMPLE).

Integrity checks

  • Metadata checks (CHECKCATALOG),
  • Allocation checks (CHECKALLOC)
  • Single table integrity check (CHECKTABLE)

Key features

  • Fully modular execution model can help adapt to any constraints
  • Scalability – most resource-intensive tasks can be configured to run in several threads – Frag evaluation tasks, execution of maintenance tasks (index maintenance, stats update, integrity checks)
  • Prediction – system uses ML algorithm to predict how much time each task can take based on previous executions
  • SLA-awareness – system will skip tasks that probably won’t fit in the remaining maintenance window
  • Time budget – Each phase (Tasks generation, Index maintenance, Stats Update and Integrity Checks) has its separate time budget. If some time remains then system will redistribute it, use for retries of failed or skipped tasks.
  • Locking control – utilizes WAIT_AT_LOW_PRIORITY feature or LOCK_TIMEOUT to avoid long blocking.
  • Soft stopping feature – at any time execution can be stopped either soft (finish current task and stop) or hard (just stop).
  • Granular Integrity Checks – instead of running full DBCC CHECKDB more granular checks can be run (CHECK CATALOG, CHECKALLOC, CHECKTABLE). All with time control.
  • Monitoring of AG synchronization. If redo queue exceeds the given threshold or AG is unhealthy then index maintenance task is skipped for this database.
  • Monitoring of running backups – can configure to skip index maintenance tasks for a database if running backup is detected to avoid log overflowing.
  • Log space control – system can control the amount of space left for transaction logging (including in ldf file and the volume it resides on)
  • Blacklisting of some operations or objects (exclusion from maintenance)

System components Overview

System is SQL Server database and a set of SQL Server Agent jobs. Database name is VBMS (Value Based Maintenance System – because system evaluates every task and executes the most valuable ones first). Database name should not be changed as it is used in code.

SQL Server Agent Jobs can be of three types:

  • Tasks generators (FillQueueAll)
  • Workers (StartWorker)
  • Killers (KillWorkers)

Tasks generation can be executed before maintenance time as it probably may not interrupt normal work. Needs to be tested on each solution if implied performance impact is acceptable.

Tables (some)

dbo.Parameters

Stores global parameters. Different data types stored in three different columns (string_value, int_value, float_value). Each parameter has it's brief description.

dbo.OperationTypes

Stores the types of operations that system can perform.

dbo.Tasks

The main table in the system. Here the tasks are added and then updated during their execution. Each task is a T-SQL command and a set of metadata, such as operation type, target object, dates when it was added, began and completed, etc.

Fields:

Field Data type Description
[entry_id] bigint Primary key, increment is 1.
[batch_id] uniqueidentifier Unique indetifier of a group of tasks, generated during one execution of FillQueueAll
[subsystem_id] int Id of type of work (references dbo.OperationTypes)
[action_type_id] int Id of type of action (references dbo.OperationsTypes)
[command] nvarchar(4000) T-SQL code to be executed
[date_added] datetime Datetime when task was generated by FillQueueAll
[date_started] datetime Datetime when task was taken by worker
[date_completed] datetime Datetime when task was completed by worker
[duration_s] int Time in seconds spent for execution
[database_id] int Id of database affected by task
[table_id] int Id of a table affected by task
[index_id] int Id of an index or statistics affected by task
[partition_n] int Number of partition affected by task
[maxdop] int Max degree of parallelism used for index maintenance task execution
[user_scans] int Number of scans, performed on the index. Taken from sys.dm_db_index_usage_stats
[user_seeks] int No longer used.
[user_updates] int No longer used.
[rowcnt] bigint Number of records, that are stored in a partition or table.
[size_mb] float Size of a table, partition, or index.
[ix_frag] float Percent of fragmentation
[time_factor] float Ratio of size_mb to duration_s. used to predict next execution duration of particular type of task on particular object.
[rowmod_factor] float Factor, used to measure how outdated the statistics is.
[checked_daysago] int Time since the table has been checked
[result] nvarchar(max) Outcome of operation execution. Ok or detailed error message.
[time_prognosis_s] int Approximate time that this task can take. Calculated basing on previous executions.
[exit_code] int 1 if execution was successful, -1 if not enough time left, -2 if not enough transaction log space left, or exact error number if error occurred.
[worker_name] nvarchar(50) Text identifier of a worker, that executed this task.
[execution_id] uniqueidentifier Uniqueidentitifier of a group of tasks, executed by same worker during same at one launch.
[priority] int Number, sets automatically to lower priority of tasks, faild last time. 1 - normal priority, 2 - low priority
volume_mount_point nvarchar(255) volume or mountpoint where the object physically resides.

dbo.Workers

Here the task execution profiles are stored. Profile is a set of parameters that are used to control what and how the worker will execute.

Fields:

Field Data type Description Default value
worker_name nvarchar(255) Unique profile name
date_added datetime Date, when profile has been added. Filled automatically. getdate()
owner nvarchar(150) Who to ask about this profile John Doe
comment nvarchar(500) Description of profile. What is this profile for? ex: AW_Nightly_full_maint Owner did not add any comment
use_user_db bit Perform tasks for user DBs. 1
use_system_db bit Perform tasks for system DBs (and VBMS too). 0
dbname_list varchar(500) List of databases. No spaces allowed. Ex: 'DB1,DB2,DB3' ''
except_list bit Invert dbname_list. 0
indexes bit Perform index maintenance tasks 1
frag_eval_time_limit_s Bigint Time limit for index frag analysis
stats bit Perform stats maintenance tasks 1
stats_sample nvarchar(50) Sample to use when generate statistics maintenance tasks after index maintenance. Supported values are: SAMPLE # PERCENT, SAMPLE # ROWS, FULLSCAN FULLSCAN
checkall bit Perform any kind of checks 1
checktable bit Perform CHECKTABLE 1
checkalloc bit Perform CHECKALLOC 1
checkcatalog bit Perform CHECKCATALOG 1
online_only bit Do not pick the Offline index rebuild tasks 1
check_backup_state Bit Check if a backup is currently running on this database
check_ag_state Bit Check if AG is unhealthy
ag_max_queue Bit Maximum redo queue size
afterparty bit In the end, if we have more than 5 minutes left then we can try to execute skipped tasks of any kind. Or if index maintenance time limit was faced, some index tasks remains and during next phases (statistics maintenance and integrity checks) some time have been saved, then it can be used to perform those index tasks. By default, this feature is on. Do not set it off unless you are sure. 1
add_stats_runtime bit Add statistics maintenance tasks AFTER index operations. This might save up some time, because some statistics will be updated during index rebuild. Without this option a lot of time will be spent for nothing. 1
totaltimemin int Total limit of worker execution time. Overrides TotalMaintenanceWindowMn global parameter.
indextime int Percent of totaltimemin that can be spent for index maintenance. Overrides IndexMaintenanceWindowPercent global parameter
stattime int Percent of totaltimemin that can be spent for stats maintenance. Overrides StatMaintenanceWindowPercent global parameter
checktime int Percent of totaltimemin that can be spent for integrity checks. Overrides CheckWindowPercent global parameter.
latched_spid int Flag used to prevent starting FillQueueAll for this worker several times. When FillQueueAll started for a particular worker it places current session_id here. 0
stoplight bit Flag used to prevent worker from taking next task. It's a signal to stop work after current task is completed. 0

dbo.WorkerSession

Table is used to track current worker activities. Each FillQueueAll or Worker add a record here when started and after each task executed. It is used mainly for killing workers if it is needed.

Fields:

Field Data type Description Default value
record_date datetime Timestamp of last update to this record getdate()
worker_name nvarchar(255) Name of a worker, executing tasks
session_id int Session_id of a worker
program_name nvarchar(255) Program name, taked from sys.dm_exec_sessions
subsystem_id int subsystem_id of a current task. If 1 then killing may cause rollback.
entry_id bigint Entry_id from dbo.Tasks, that is currently executing
is_afterparty bit Is worker in the afterparty phase. 0

dbo.DBCCChecksLog

Table is used to store output of DBCC checks. When DBCC command found any errors it outputs them here. Most of the fields are inherited from DBCC CHECK[TABLE, ALLOC, CATALOG] WITH TABLERESULTS.

Fields:

Field Data type Description Default value
rec_id int Primary key
execution_guid uniqueidentifier Unique identifier. Same for records generated during one DBCC launch. Not related to any fields in dbo.Tasks
log_date datetime Date and time when check has been executed
All other fields are from standard DBCC output.

dbo.Blacklist

This table stores records of all actions that should not be taken. You can blacklist any kind of maintenance task for any kind of objects. Each record represents a rule, described as subsystem_id, actiontype_id, database_id, table_id and index_id. NULL is used as *. To add any blacklist items there is a stored procedure dbo.AddBlacklistItem. Rules can be enabled or disabled.

Fie__lds:

Field Data type Description Default value
item_id bigint Primary key
date_added datetime Date when rule was added getdate()
database_id int Id of a database
table_id int Object_id of a table
index_id int Id of an index or a statistic
partition_n int Number of a partition
subsystem_id int Id of subsystem to blacklist (refers to dbo.OperationsTypes)
actiontype_id int Id of action type to blacklist (refers to dbo.OperationsTypes)
enabled bit Flag to enable/disable rule 1
worker_name nvarchar(255) Used to map entries to specific workers. Null = for every worker

Stored Procedures

dbo.FillQueueAll

SP for maintenance tasks generation. It evaluates chosen DB's and adds suitable maintenance tasks to dbo.Tasks table. In fact this SP is wrapper for a set of procedures dbo.FillQueueXXX.

Parameters

Parameter Data type Description Default value
worker_name nvarchar(50) Generate tasks only for a specific worker profile NULL
indexes bit Generate tasks for index maintenance 1
async_frag_collection bit Used when need to handover frag collection to separate thread(s) for scaling. Advanced scenario. 1
collect_frag_now Int 0-dont collect frag data, only prepare indexes list, 1- prepare indexes list, collect frag data and generate tasks in this session. 2 - Indexes list with frag data has been filled already, need just to prepare the tasks. Advanced scenario 1
stats bit Generate tasks for statistics maintenance 0
stats_sample nvarchar(50) Sample size used, when generating statistics maintenance tasks FULLSCAN
checkall bit Generate all sorts of DBCC check tasks. 1
checktable bit Generate DBCC CHECKTABLE tasks 1
table_max_size_mb int Maximum size (Mb) of a table that can be checked by DBCC CHECKTABLE 1000000000000
checkalloc bit Generate DBCC CHECKALLOC tasks 1
checkcatalog bit Generate DBCC CHECKCATALOG tasks 1
use_user_db bit Generate tasks for user databases 1
use_system_db bit Generate tasks for system databases 0
dbname_list nvarchar(500) Generate tasks for list databases ''
except_list bit Turns dbname_list into exception list 0
sortintempdb bit Use tempdb for sorting during index maintenance. Overrides values, stored in dbo.Parameters NULL
ix_maxdop bit Maximum number cores, used for index maintenance. Overrides values, stored in dbo.Parameters NULL

CHECKALL

This flag combines the three flags – checktable, checkalloc and checkcatalog. Each of this three flags can override CHECKALL flag.

Usage

FillQueueAll can be used in several ways. The most common scenario is to configure a worker in dbo.Workers table and then generate tasks for that particular worker:

Exec dbo.FillQueueAll @worker_name = 'AdventureWorks_FullMaint'

This will generate tasks, that can be executed by previously configured AdventureWorks_FullMaint worker.

The other option is to manually select which type of tasks should be generated regardless of are there any workers configured that can handle these tasks.

exec dbo.FillQueueAll @indexes = 1, @stats = 1, @dbname_list = 'AdventureWorks,DB1', @checkall = 0, @checktable =1

This will generate index, stats maintenance and table integrity check tasks for AdventureWorks and DB1 databases.

Procedure removes all incomplete tasks for corresponding objects before generating new one to avoid duplication and execution of outdated tasks.

Procedure returns the batch_id which is same for all tasks generated.

dbo.FillQueueCheckAlloc

SP for generation of CHECKALLOC tasks. Used by FillQueueAll. No direct usage is advised.

dbo.FillQueueCheckCatalog

SP for generation of CHECKCATALOG tasks. Used by FillQueueAll. No direct usage is advised.

dbo.FillQueueCheckTable

SP for generation of CHECKTABLE tasks. Used by FillQueueAll. No direct usage is advised.

dbo.FillQueueIndex

SP for generation of index maintenance tasks. Used by FillQueueAll. No direct usage is advised.

Considers many factors to propose best types of operations possible.

dbo.FillQueueStat

SP for generation of statistics maintenance tasks. Used by FillQueueAll. No direct usage is advised.

Uses dynamic threshold as TF2371.

dbo.ProcessQueueAll

Procedure is used inside of SP StartWorker for execution of previously generated tasks.

Most of parameters are taken from dbo.Workers.

SP loops through dbo.Tasks and picks them one after another. Tasks are filtered by parameters provided.

Operation cycle:

There are four phases:

  1. Index maintenance
  2. Stats maintenance
  3. Integrity checks
  4. AfterParty if enough time left.

During each phase tasks are sorted specific way. For example – during index maintenance phase fist chosen task will be for the index that is most used for scan, heavily fragmented and largest.

For stats the rowmodfactor is used. It shows how hard stats is outdated based on dynamical threshold. For example, 200 means that the statistic is twice more outdated than threshold for it.

Integrity check tasks are sorted by days since last known check. Last known means last found in dbo.Tasks table.

Before executing each task, time left is evaluated. The previous executions of this kind of tasks against this object are used to predict how much time will be spent for this object this time. If prognosis shows that this task may need more time than we have then it is skipped.

Also the free space for transaction log is checked. See dbo.Parameters for details.

If task has been deadlocked it will be retried during afterparty as well as any skipped tasks if we have enough time during this phase. That may occur if stats and integrity check tasks haven't use all their time.

dbo.StartWorker

SP starts the execution of tasks in a queue. Parameter @worker_name is a name of saved profile from table dbo.Workers.

dbo.KillWorkers

SP stops the execution of selected workers one by one. It is mainly an emergency mechanism.

Uses info from table dbo.WorkerSession. If worker executes index maintenance task (subsystem = 1) then SP switches to the next worker, as killing index maintenance task may cause long rollback. This behavior is changed by @force parameter.

Parameters:

Parameter Data type Description Default value
worker_names nvarchar(500) Worker names to be killed without spaces. Ex: 'Worker1,Worker2,Worker3' ''
except_list bit Reverse worker_names selection. 0
force bit Kill workers even if there are index operations in progress 0

dbo.AddBlacklistItem

Procedure is used to add new rules to dbo.Blacklist table. Rules can be added through specifying all the parameters to identify blacklisted tasks or by specifying just entry_id from dbo.Tasks, which means to blacklist that kind of tasks. Blacklisted tasks won't be generated anymore. Rules can be enabled or disabled using [enabled] column.

Parameters:

Parameter Data type Description Default value
entryid bigint NULL
database_id int NULL
table_id int NULL
index_id int NULL
partition_n int NULL
subsystem_id int NULL
action_type_id int NULL
worker_name nvarchar(255) NULL

Usage:

Let's say, for some reasons we don't want to perform any kind of maintenance operations on a table Sales.SalesOrderDetail in database AdventureWorks2014. Then we need to perform such a query:

USE [AdventureWorks2014]

DECLARE @db int=(SELECTDB_ID()),

@tbl int=(SELECTobject_id('Sales.SalesOrderDetail'))

exec VBMS.dbo.AddBlacklistItem

@database_id = @db,

@table_id = @tbl,

@index_id =NULL,--all indexes

@partition_n =NULL,-- all partitions

@subsystem_id =NULL,-- all subsystems

@action_type_id =NULL-- all types of actions

Or, if we have to block only update statistics on this table and we have this kind of task in dbo.Tasks previously generated under entry_id 123456, then we can just use this kind of query:

exec VBMS.dbo.AddBlacklistItem@entry_id = 123456

And you will not see this kind of tasks anymore. dbo AddBlacklistItem will just take all the parameters from the specified record of dbo.Tasks.

Function

dbo.GetTimeFactor

Function is used to make prognosis of execution time for tasks. Not used directly.

How-to

Installation

SQL Server 2008:

Run script VBMS_create.sql in SQLCMD mode

SQL Server 2008R2 and higher

Deploy data-tier application

Setup and execution

Open dbo.Parameters table and set all the parameters. Defaults may not work for you.

Maintenance itself consist of two phases – task generation and tasks execution which can be combined or separated. Each of them can be run in multiple threads. For example you can generate tasks foe all DB's at once and execute them in 5 different jobs (workers).

Generation of tasks is done by dbo.FillQueueAll procedure. It evaluates the state of the DB's that are passed to it and generate tasks for further execution based on the parameters. There are two main ways to use it:

  1. Manually fill all the parameters, except @worker_name.
  2. Fill only @worker_name

The first scenario is used when we use one FillQueueAll to generate tasks for several different workers.

The second one is much easier and used to create tasks for a specific worker.

Generated tasks are saved in dbo.Tasks table. After the execution those records will be just updated with the execution stats, duration etc.

Workers need to be configured to handle the tasks generated. You need to insert a record into dbo.Workers table, filling all the parameters. Some of them have default values, so doublecheck if the result suits you.

Some of the parameters are overriding the global dbo.Parameter when overlap.

Then configure tasks generation. Set up a SQL Server job that will execute dbo.FillQueueAll with proper parameters.

Consider we have one DB for maintenance – AdventureWorks. Maintenance can be performed only from 01:00 till 3:00. All sorts of operations allowed at this time.

We configure FillQueueAll to run every day at 00:00 not to use maintenance time for analysis. Filling queue and processing it is two different tasks and they are not really connected. We might collect all sorts of tasks and then execute only those we want. We can have one FIllQueueAll job and 5-6 StartWorker jobs, that will perform different tasks from queue.

Step of job should look like:

EXEC [dbo].[FillQueueAll]@workername ='AWFullMaint'

OR

EXEC dbo.FillQueueAll@indexes = 1,@stats =1, @stats_sample ='FULLSCAN',@checkall = 1,@use_user_db =1, @dbname_list ='AdventureWorks2016'

FillQueueAll returns batch_id which identifies all the tasks generated during this run. You can find them in dbo.Tasks with a query:

select*from dbo.Tasks where batch_id = 'YOUR BATCH_ID

To configure a worker you need to add a record to dbo.Workers table. Here we chose what types of tasks and how will be executed by the worker. All the bit parameters have defaults.

In our case, we set:

[worker_name] = 'AWFullMaint'

[owner] = 'John Doe'

[comment] = 'AW full maintenance at night only'

[use_user_db] = 1

[dbname_list] = 'AdventureWorks2016'

[online_only] = 0

Now, when profile is created, we can use it by worker_name.

Set up a job, that executes

EXEC dbo.StartWorker 'AWFullMaint'

every night at 01:00.

As we have only 2 hours for maintenance tasks (01:00 – 03:00) we must limit the execution time to 2 hours. We have two options:

  1. Set the global parameters at dbo.Parameters (TotalMaintenanceWindowMm = 120, IndexMaintenanceWindowPercent = 30, StatMaintenanceWindowPercent = 30, CheckWindowPercent =30),
  2. Set these parameters in worker profile in dbo.Workers (similar columns with similar effect).

Just to be sure that we won't violate SLA we can configure KillWorkers job.

Create a job, that executes EXEC dbo.KillWorkers @worker_names = 'AWFullMaint'

@force flag can used to allow killing with possible index operations rollback. However, it is not recommended.

It's needed only for safety, all the test show that after the first execution (when the system got its first experience) system is pretty accurate in its predictions (unless someone starts some huge workload that makes server twice slower than usually). So…normally it won't be needed to kill workers.

That's it. We're good to go.

About

SLA-aware SQL Server Maintenance Framework (Index rebuild, stats update, partial integrity checks)

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages

  • PLpgSQL 96.9%
  • PLSQL 3.1%