This article is about SQL Cost Based Query Optimization that describes the factors at which query optimization chooses the best lowest cost plan to optimize the query.
Table of Contents
Query
Query is a statement or non-procedural language written in a high-level language, i.e., SQL.
Query Optimization
Query Optimization is one of the essential components of the query optimizer. It aims to choose the best efficient plan at the lowest possible cost based on the query and resources involved in the optimization process.
Cost-Based Optimization
For a given query and environment, the Optimizer allocates a cost in numerical form, which is related to each step of a possible plan, and then finds these values together to get a cost estimate for the project. After calculating the costs of all possible scenarios, the Optimizer tries to choose a plan with the lowest cost estimate. Therefore, the Optimizer may sometimes be referred to as the Cost-Based Optimizer.
Points to be remembered:
- The query optimizer is a vital system component of database systems.
- The tremendous commercial success of database systems is partly due to the development of sophisticated query optimization technology.
- This component’s responsibility is to translate the user-submitted query, usually written in a non-procedural language, into an efficient query evaluation plan which is then executed against the database.
- Database users pose queries in a declarative way using SQL, and the optimizer of the database system finds an excellent way to execute these queries.
- Cost-based query optimization searches for low-cost query execution plans using a cost estimate for the computer resources involved.
- Cost-Based Query Optimization or CBO Optimizer is an optimization technique that uses table statistics to determine the most efficient query execution plan of a structured query.
Features of the Cost Based Optimization
- The cost-based optimization is based on the cost of the query to be optimized.
- The query can use a lot of paths based on the value of indexes, available sorting methods, constraints, etc.
- Query optimization aims to choose the most efficient path of implementing the query at the lowest minimum cost in the form of an algorithm.
- The algorithm’s cost needs to be provided by the query Optimizer so that the most suitable query can be selected for an operation.
- The cost of an algorithm also depends upon the cardinality of the input.
Optimization rules under CBO
- Cost Based Join Reorder to optimize the logical plan of a structured query based on statistics.
- Cost-Based Optimization uses the statistics stored in a metastore.
Cost-Based Optimizations – Table Statistics
The table statistics can be computed for tables, partitions and columns and are as follows:
- Total size (in bytes) of a table or table partitions
- Row count of a table or table partitions
- Column statistics, i.e. min, max, num_nulls, distinct_count, histogram
Factors of cost of optimization of the query
These measures are related, and one is derived from another. In the end, the estimator’s goal is to estimate the overall cost of a given plan. If statistics are available, then the estimator uses them to compute the measures. The statistics improve the degree of accuracy of the measurements.
Cardinality
Cardinality is known to be the number of rows that are returned by performing the operations specified by the query execution plan. The cardinality estimates must be correct as it highly affects all the possibilities of the execution plan.
In other words, Cardinality represents the number of rows in a row set. Here, the row set can be a base table, a view, or the result of a join or GROUP BY operator.
Base cardinality – Base cardinality is the number of rows in a base table. The base cardinality can be captured by analyzing the table. If table statistics are unavailable, then the estimator uses the number of extents occupied by the table to estimate the base cardinality.
Effective cardinality – Effective cardinality is the number of rows selected from a base table. The effective cardinality depends on the predicates specified on different columns of a base table, with each predicate acting as a successive filter on the rows of the base table. The effective cardinality is computed as the product of the base cardinality and combined selectivity of all predicates specified on a table. When there is no predicate on a table, its effective cardinality equals its base cardinality.
Join cardinality – Join cardinality is the number of rows produced when two-row sets are joined together. A join is a Cartesian product of two-row groups, with the join predicate applied as a filter to the result. Therefore, the join cardinality is the product of the cardinalities of two-row sets multiplied by the selectivity of the join predicate.
Distinct cardinality – Distinct cardinality is the number of different values in a column of a row set. The different cardinality of a row set is based on the data in the column. For example, in a row set of 100 rows, if distinct column values are found in 20 rows, then the distinct cardinality is 20.
Group cardinality – Group cardinality is the number of rows produced from a row set after the GROUP BY operator is applied. The effect of the GROUP BY operator is to decrease the number of rows in a row set. The group cardinality depends on the distinct cardinality of each of the grouping columns and the number of rows in the row set. For an illustration of group cardinality.
Selectivity
Selectivity refers to the number of rows that are selected. The selectivity of any row from the table or any table from the database almost depends upon the condition. The state’s satisfaction takes us to that specific row’s selectivity. The requirement to be satisfied can be any, depending upon the situation.
Selectivity is the first measure, representing a fraction of rows from a row set.
The row set can be a base table, a view, or the result of a join or a GROUP BY operator. The selectivity is tied to a query predicate, such as last_name = ‘Smith’, or a combination of predicates, such as last_name = ‘Smith’ AND job_type = ‘Clerk.’ A predicate acts as a filter that filters a certain number of rows from a row set.
Therefore, the selectivity of a predicate indicates how many rows from a row set will pass the predicate test. Selectivity lies in a value range from 0.0 to 1.0. A selectivity of 0.0 means that no rows will be selected from a row set, and a selectivity of 1.0 standards that all rows will be selected. If no statistics are available, the estimator uses an internal default value for selectivity. Different internal defaults are used, depending on the predicate type.
For example, the internal default for an equality predicate (last_name = ‘Alex’) is lower than the internal default for a range predicate (last_name > ‘Alex’). The estimator makes this assumption because an equality predicate is expected to return a smaller fraction of rows than a range predicate.
Cost
Cost refers to the amount of money spent on the system to optimize the system. The cost measure entirely depends upon the work done or the number of resources used.
The cost represents units of work or resources used. The CBO uses disk I/O, CPU usage, and memory usage as units of work. So, the price used by the CBO represents an estimate of the number of disk I/Os and the amount of CPU and memory used in operating. The operation can scan a table, access rows from a table by using an index, join two tables together, or sort a row set. The cost of a query plan is the number of work units expected to be incurred when the query is executed, and its result produced.
ANALYZE TABLE COMPUTE STATISTICS
The first step is to use ANALYZE TABLE COMPUTE STATISTICS SQL command to compute table statistics. Use DESCRIBE EXTENDED SQL command to inspect the statistics.
Cost-Based Optimization uses the statistics stored in a meta store i.e., external catalog, using ANALYZE TABLE SQL command-
ANALYZE TABLE tableIdentifier partitionSpec;
COMPUTE STATISTICS (NOSCAN | FOR COLUMNS identifierSeq);
Depending on the variant, ANALYZE TABLE computes different statistics, i.e. of a table, partitions, or columns-
ANALYZE TABLE with neither PARTITION specification nor FOR COLUMNS clause.
ANALYZE TABLE with PARTITION specification (but no FOR COLUMNS clause).
ANALYZE TABLE with FOR COLUMNS clause (but no PARTITION specification).
DESCRIBE EXTENDED SQL Command
The statistics of a table can be viewed, partitions, or a column (stored in a meta store) using DESCRIBE EXTENDED SQL command-
(DESC | DESCRIBE) TABLE? (EXTENDED | FORMATTED);
tableIdentifier partitionSpec? describeColName;
Cost Component of Query Evaluation
The following are the cost components of the execution of a query-
Access cost to secondary storage
It could be the price of reading, searching, or writing a block of information initially on secondary storage, specifically on the disk. The cost of looking at records within the file also depends on the kind of access structure the file is equipped with.
Storage cost
It is also the price of the storage of any intermediate files(files that are the result of processing the input but aren’t exactly the outcome) created by the execution strategy used for the query.
Computation cost
This is the cost of performing the memory operations available on the record within the data buffers. Operations like searching for records, merging records, or sorting records. This can also be called the CPU cost.
Memory usage cost
The cost of memory usage can be calculated simply by using the number of memory buffers needed to execute the query.
Communication cost
This is the expense of transmitting or communicating the query results from one location to the next. This also covers the cost of transferring the results and table to different sites in evaluating the query.
Issues In Cost-Based Optimization
The following are the issues in cost-based optimization-
- In cost-based optimization, the number of execution strategies that can be considered is not really fixed. The number of execution strategies may vary based on the situation.
- Sometimes, this process is really very time-consuming because it does not always guarantee finding the best optimal strategy.
- It is an expensive process.
If you find anything incorrect in the above-discussed topic and have further questions, please comment below.
Connect on: