1. Theoretical background
A key feature of sql is that it is a programming language quantity-oriented
with set = amount) works. Each code block to be executed (batch via OSql, stored procedure, Bcp query string) is checked for syntactic correctness in a first step with a parser. However, the result cannot be executed directly. Because if a query intervenes to several tables and links selected rows via WHERE in a JOIN clause, it is not specified in which order the individual tables are accessed and which arrangement is selected for an ON section with certain conditions. Therefore, in a second step, a Execution plan
that determines all of these things. When creating the execution plan, the sizes of the tables involved, any existing indices and statistics for individual columns are taken into account and an attempt is made to determine the cheapest possible execution. Only this execution plan is then compiled and cached in the procedure cache for reuse. This reusability of execution plans, eliminating the need for recompilation, is essential for high performance. The same Sql query can, in contrast to many other programming languages, generate completely differently compiled code - depending on the size and data structure of the tables at the time of compilation. However, this logic can only really develop its strengths if lines are not individually but in blocks, as Sets of rows
are processed. Especially the abstract approach of the keywords WHERE, JOIN and GROUP BY, which - typical for a fourth generation language - only defines What
should be done, but not, how
If this is to be done, it enables the internal optimizer to find the relatively cheapest solution for the execution plan. Only the lack of certain commands / tools, which an inexperienced database programmer might perceive as a shortcoming of the language, allows the creation of an optimized layer, which is a variable execution plan between the sql code and the actual execution. However, if the language contains additional procedural elements such as cursors, variables, and loops, programmers who work primarily with procedural languages and rarely use Sql to access databases tend to overuse these tools in such a way that the code takes an unnecessarily long time to execute needed. The same problem can arise if the Sql dialect (e.g. mySql) itself knows a few procedural elements, but these are provided and used by the surrounding host language (here mostly PHP). The following examples illustrate this problem in more detail.
2. Obvious: WHERE is not done by hand
A subset of the data is required for an update. So you create a cursor and walk through it. Declare @id int, @info int Declare cs_Daten CURSOR LOCAL FAST_FORWARD FOR - Select all data Select A.Id, A.Kriterium From [important-data] Open cs_Daten Fetch Next From cs_Daten Into @id, @info While (@@ Fetch_Status = 0) Begin - test line by line If @info> 0 - search for current line - and update Update [important data] Set [additional info] = @info * 35 Where Id = @id - get next line Fetch Next From cs_Daten Into @id, @info End Close cs_Daten Deallocate cs_Daten Well - the following command delivers the same result: Update [important data] Set [additional info] = @info * 35 Where criterion> 0 - (*) However, on the one hand, this technique does not create an additional record set; on the other hand, the required rows are selected and changed in one step. If an analog solution is constructed in PHP, the difference is even more obvious, since the boundary between host language and Sql must first be crossed for the entire RecordSet and then for each line to be updated. Furthermore, the where condition (*) can use an existing index above the criterion column, whereas the select statement that creates the cursor forcibly processes all rows. What is striking about this example is that if the update affects only very few rows relative to the total number of rows, many rows are unnecessarily transported out and then discarded. If, on the other hand, the update affects most of the rows in the table [important data], almost every row must be searched for and updated again. So the performance is bad in all extreme cases.
3. Correct rank positions
The above example may seem obvious, as all update explanations refer to the fact that additional where restrictions can be used. An example that requires calculations with respect to the existing lines is therefore more interesting. A table is given, the ex. Contains athletes or books, one column shows the row's ranking position. With athletes one might think of a position according to performance, with books of the position according to sales figures. Now athletes are eliminated (moving away, doping, end of career) or books that are no longer available should be removed from the list. The primary keys and rank positions of the rows to be removed may be collected in the table [to be removed]. The ranking positions of the remaining entries must be corrected. Example: Table at the beginning:
The two entries marked with (*) should be removed so that the result is:
|6211||Model woman (*)||5|
3.1 First solution: Cursor based with rank calculation per line The first solution could look like this: Declare @id int, @Rang int Declare cs_Rangliste Cursor Local Fast_Forward For Select A.Id, A.Rang From Ranking list Open cs_Rangliste Fetch Next From cs_Rangliste Into @id, @Rang While (@@ Fetch_Status = 0) Begin - Determine whether record is to be removed If ((Select Count (*) From [to remove] As A Where A.Id = @id) = 0) Begin - No removal, instead new rank Select @ new_Rang = Count (*) From Ranking List As A Where A.Rang <= @Rang And A.Id Not In (Select B.Id From [to be removed] As B) - and Update Update Ranking List Set Rank = @neuer_Rang Where Id = @Id End Else - otherwise delete Delete From Ranking List Where Id = @Id Fetch Next From cs_Rangliste Into @id, @Rang End Close cs_Rangliste Deallocate cs_Rangliste Well - this is not a solution, but a job creation measure. After all lines have been copied out, a determination is made for each line whether it should be removed. If so, this is done, if not, the new rank is determined. This is achieved by determining the number of remaining data records whose rank is less than or equal to the rank of the current row. Finally, the row is updated with the new rank. Such a solution cannot be implemented within SQL. However, it is recommended as an intermediate step and was therefore listed here in order to break down the problem into its sub-problems and to put them in a meaningful order. Note that such a solution can be excellent within a procedural language with its own array structure, i.e. with direct access to the heap. Within SQL, however, the solution is inadequate because the internal optimizer, which can take into account the size and structure of the tables, now doesn't have too much to do. The code itself is 'too close' to the individual lines that there is no more space for a layer.
3.2 Second solution: Sorted-cursor-based, concurrent index and previous deletion A disadvantage of the previous solution is that when calculating the new rank position, reference must be made each time to the amount of remaining rows as the difference between the total table and the amount of rows to be deleted. This suggests to actually delete all lines to be removed in a preceding step. A second 'optimization' could be to process the remaining columns in order of their old rank. Since each rank is reduced at most, an additional variable can be set to 1 at the beginning and increased once in each step. This means that the rank is available without complex calculation and can be used directly for the update. The cursor only has to return the ID column, sorted according to ranks. Declare @id int, @new_Rang int Delete From Ranking List Where Id In (Select Id From [to-remove]) Declare cs_Rangliste Cursor Local Fast_Forward For Select A.Id From Ranking List As A Order By A.Rang Open cs_Rangliste Fetch Next From cs_Rangliste Into @id Set @new_Rang = 1 While (@@ Fetch_Status = 0) Begin Update Ranking List Set Rank = @new_Rang Where Id = @id Set @new_Rang = @new_Rang + 1 Fetch Next From cs_Rangliste Into @id End This solution affects the first Look already 'quite usable'. It represents a 'procedural optimization', since the introduction of the variable @new_range reduces the problem of determining the new rank to the sorting of the cursor. However, it insists on line-by-line processing and this is the only way to transport the rank in the @new_range variable. The improvement due to the renouncement of a determination of the new rank per line is bought with an increased dependency on working line by line. Really set-oriented would mean processing not one row, but all rows at once, so that the optimizer finds the most efficient solution and the table is only accessed a few times. This is because the instruction to determine the new ranking and the update condition implies repeated searches for rows. This logic forgets that the previous position was previously found in the loop.
3.3 Third solution: quantity-based and without cursor After deleting the lines to be removed, the Sql query contained in the first solution can be used directly to determine the new ranking for each line in one work step. This is done by the following self-link: Select B.Id, Count (*) From Ranking list As B Inner Join Ranking list As C On B.Rang> = C.Rang Group By B.Id Every line of A, all lines with a rank lower than or assigned equal to the rank of the row. The resulting link table is broken down into partial tables along the primary key with Group By and the number of rows is determined from each partial table. This value - the number of lines less than or equal to the current rank of this line - is the new rank. However, this calculation is not carried out line by line, but for the entire table at once, so that the partial tables are not created temporarily, but only the lines in the index are counted. Since the column contains as many different values as there are rows, it is ideal for creating an index that can be used to determine all ancestors. This means that only indices are accessed to calculate this intermediate table. The intermediate table can be used as a sub-query in a join and thus enables a direct update: Update Ranking List Set Rank = D.new_Rang From Ranking List As A Inner Join (Select B.Id, Count (*) As new_Rang From Ranking List As B Inner Join Ranking List As C On B.Rang> = C.Rang Group By B.Id) As D On A.Id = D.Id The actual update is also carried out line by line, as a new value must be determined for each line. However, no data has to be copied out of the table, all data remains internal. Each cursor, on the other hand, implicitly copies values, even if it only creates a list of primary keys. Likewise, when a host language is accessed, copies are created internally and transported externally. In addition, the calculation of the intermediate table can be optimized so that the calculation of an intermediate table for 1000 lines is more efficient than the calculation of a thousand individual items.
3.4 Fourth solution: Exclusion of the lines without a change in rank The previous solution still leads to unnecessary updates if the rank of a row does not change. In this case, too, the cell is updated. This can be prevented by either reducing the size of the inner table using a Having comparison between Count (*) and B.Rang or by also exporting the B.Rang column and performing the comparison in the outer Where clause. Having comparison in the subquery: Delete From Ranking List Where Id In (Select Id From [to remove]) Update Ranking List Set Rank = D.new_Rang From Ranking list As A Inner Join (Select B.Id, Count (*) As new_Rang From Ranking list As B Inner Join Ranking list As C On B.Rang> = C.Rang Group By B.Id, B.range Having Count (*) = C.Rang Group By B.Id) As D On A.Id = D.Id Where B.neuer_Rang , Count (*) should also only be calculated once in the first case. Similarly, in the second case, the where clause is likely to be implicitly dragged into the subquery so that both versions could effectively generate the same schedule.
3.5 Fifth solution: Direct reduction of the inner table If one takes a closer look at the above solution, it is noticeable that the minimum rank of the deleted lines marks a limit. All rows with a lower rank are not updated, all rows with a higher rank must be processed by the update. First of all, the additions under 3.4 could be dispensed with if the first deleted rank = 1. However, this condition can now also be added directly to the subquery so that it is restricted to the number of rows that actually need to be updated. Delete From Ranking List Where Id In (Select Id From [to remove]) Update Ranking List Set Rank = D.neuer_Rang From Ranking List As A Inner Join (Select B.Id, B.Rang, Count (*) As newer_Rang From Ranking List As B Inner Join Ranking As C On B.Rang> = C.Rang Where B.Rang> (Select Min (E.Rang) From [to-remove] As E) Group By B.Id) As D On A.Id = D .Id The new subquery in the where clause can also be brought forward and the minimum stored in a variable. If this is a one-time executed instruction, there should be no effect on the performance.
4. Summary Sql can only really exploit its capabilities when processing large amounts of data due to the reusability of flow charts if the Sql code is allowed to work in a quantity-based manner. This can be achieved by: Additional remark: The above problem of correcting rank positions appeared during the development of the main server projectD.aten: The web database for first class companies. When uploading a table, the external file contains some columns in a defined order / rank, this corresponds to a number of table rows with column name, data type and column position in the external file. If some columns are to be ignored, i.e. skipped during import, the above instruction provides the updated ranking positions for creating a new table without having to check each upload column individually and correct all successors individually.
Contact form: Write to me and we will build your new web database together!
© 2003-2021 Jürgen Auer, Berlin.