![]() |
|
Introduction¯Four Approaches to Surrogate Key GenerationSurrogate primary keys are the unique, system generated, row identifiers that are commonly used by database designers. The primary reasons for using surrogate primary keys are to create small, simple keys and to provide an audit trail. Small keys result in small indexes, narrow dependent tables having foreign keys, and joins that are easier to write and faster to execute. Additionally, surrogate primary keys can be used to provide an audit trail, such as in the case of invoice numbers.These keys are system-generated, and this TechNote deals with the question of how best to generate surrogate primary keys. The question of how to generate these keys is non-trivial because Sybase SQL Server's page-level locking causes the key generation algorithm to significantly affect overall SQL Server performance. This is especially true in an insert-intensive OLTP (online transaction processing) environment. The different algorithms for key generation can be grouped into four general approaches:
Monotonic SeriesThis section describes ways of generating monotonic keys. Monotonic keys are progressive in the collating sequence, for example:
1, 2, 3, 4, ... 1, 3, 4, 8, ... 15 March 1994, 21 March 1994, 1 April 1994, ... MethodsMethods which can be used to produce a monotonic series are:
Max Plus OnePerhaps the most apparent solution to surrogate primary key generation is to select the max from the data table and to add one to determine the next key value.
1> declare @next_key select @next_key = 1> insert into data_table (key_column, ...)However, this algorithm does have a serious problem. The transaction is not atomic. Two concurrent transactions could execute the select and obtain the same next key value before either one executes the insert. This algorithm could only be considered appropriate for a low-intensity environment.
Enhanced Max Plus OneThe first improvement to the "Max Plus One" algorithm is to include the holdlock keyword in the select statement.
1> declare @next_key select @next_key = 1> insert into data_table (key_column, ...)The SQL shown above will generate unique keys provided there is a clustered index on the key. However, this algorithm frequently causes deadlocks in a highly concurrent environment. If two concurrent processes execute this code, both acquire shared locks on the last page. One then attempts to obtain the exclusive lock on the last page required by the insert statement and becomes blocked by the shared lock of the other process. The second process then tries to acquire the exclusive lock required by the insert statement and a deadlock results. Even with the holdlock enhancement, the "Max Plus One" algorithm is suitable only in a low-intensity environment.
Next Key TableA more versatile approach is to introduce a separate table to store the next key value.
1> begin tranThe SQL is atomic and uniqueness is guaranteed. Notice that the update statement precedes the select statement. The update first acquires an exclusive lock on the table ensuring that no other processes can obtain a next key value from the table. In the above example, the transaction includes both the generation of the key and the insert into the data table. The transaction is completely single-threaded. Consequently, there are no gaps in the sequences of key values inserted into the data table.
Identity PropertyThe identity property, as supplied in Sybase System 10, provides a series of sequential numbers to use as primary keys. The implementation and issues regarding gaps in the key sequence are described in your System 10 SQL Server documentation.
Concurrency IssuesRegardless of the algorithm used to generate the keys, all monotonic key algorithms have similar effects on concurrency and cache hit.For a better understanding of the effects of the key generation scheme on concurrency, we must examine various indexing scenarios. These scenarios include the following:
No Index ConcurrencyIn the case where you have no index, the table acts as a heap and all inserts go to the last data page by default.When more than one process tries to concurrently insert, the first process will block the second. The first process acquires an exclusive lock on the last data page of the table. The second tries to acquire an exclusive lock on this same page but is blocked until the first transaction has either been committed or rolled back. This spot of contention is commonly referred to as a "hot spot."
Clustered Index ConcurrencyIf you create a clustered index on the surrogate primary key and there is still a similar concurrency situation. Because the series of key values generated is monotonic, all inserts are directed to the last page.If the maximum key value is 100, the next key value generated is 101. An insert with a primary key value of 101 is directed to the last data page of the table, as is an insert of the row with a primary key of 102. Only one of the two processes is able to have the necessary exclusive lock on the page at any one time. Consequently in this case as well, concurrent inserts must wait for an exclusive lock. Note that this situation applies for all algorithms that generate monotonic keys. When using the identity property, multiple processes can concurrently generate next key values without blocking each other. However, once two processes try to concurrently insert into the data table, they will block as described above.
Non-Clustered Index ConcurrencyIf you use a non-clustered index only on the surrogate primary key, the table acts as a heap. By default, all inserts go to the last data page. Consequently, there is still a troublesome hot spot and concurrent inserts must wait for an exclusive lock.
Clustered Index on Other Key ConcurrencyPerhaps the most frequently heard advice on the issue of page- level locking and surrogate primary key generation is to cluster on a different column.One possibility is to create a non-clustered index on the surrogate primary key and a clustered index on some other column or set of columns. For example, a table of employees could have a non-clustered index on the employee ID and a clustered index on last name, first name, which would be useful for sorting alphabetically. In this case, inserts are spread across the data pages based upon the distribution of the clustered index. For example, one process might be inserting employee Abbott while another is inserting employee Wilson. The first employee's row would be inserted toward the front of the table and the second employee's would be inserted toward the end. This is so even if Abbot has an employee ID of 101 and Wilson an employee ID of 102. At first glance, this method appears to solve the concurrency problem. However, while Abbott's and Wilson's rows are far apart on the data pages of the table, a single index page probably contains both employee identifiers and pointers to both of these data pages. That index page is the last leaf-level page of the non- clustered index. When the rows are inserted, the index page must be updated before the transaction is complete. This update to the index page requires an exclusive lock. Because both of these concurrent inserts must acquire an exclusive lock on the same index page, a hot spot exists. The last-page contention arguments still apply. The hot spot has not gone away. It has moved from the last data page to the last non-clustered index page. In summary, a monotonic approach might be effective only in an environment that rarely has concurrent processes inserting into the same table.
Pseudo-Random NumbersOne of the most common alternatives to the monotonic series is to use a pseudo-random key. For example:
1> create procedure get_next_key 1> select @next_key = 16807 * 1> if (@next_key <= 0)
Next-Seed TableTo use this technique, one must use a next-seed table to seed the random number generator. In the previous example, the seed must be in the range of four-byte integers. Each unique seed value will produce a unique surrogate primary key in the four-byte range.
ConcurrencyThe pseudo-random technique certainly is an improvement in concurrency. Inserts are now spread across the data pages and index pages. The inserts have a random chance of colliding. If the table is large, as is the case for most OLTP databases, inserts rarely block each other.While concurrency has been improved, the pseudo-random technique does have three significant disadvantages:
Comparing Monotonic and Pseudo-Random Cache Hit RatiosMonotonic CachingUnder the monotonic key approach, a good cache hit ratio is promoted because few additional pages must be hit with each new insert. While it is true that concurrent inserts contend for an exclusive lock, the pages that must be repeatedly hit are more than likely to already be in the data cache. Each insert descends this index down the same path. This one path down to the leaf page can easily remain in cache.
Pseudo-Random CachingAs for randomness, its strength is its weakness. Having blocking occur only randomly is an improvement. However, randomly hitting a page in data cache is not ideal.Each insert descends the index on the primary key randomly. Consequently, one insert will descend one path through the index while another insert will descend another path. While this prevents them from blocking, many more pages must be hit. Many large OLTP databases have a 1 to 100 data cache to database size ratio. An approximate one percent chance of a page being cache results in a poor cache hit ratio. Depending on the nature of the application, random keys might not result in an improvement in overall system performance. Therefore, a random approach would be most effective in an environment that is insert intensive but has a relatively large data cache.
Composite Key ApproachAre high concurrency and a high cache hit ratio mutually exclusive? Or is there a better approach?A third alternative to surrogate primary key generation is the "composite key" approach. The surrogate primary key is a single column composed of multiple parts. The first part of the key is used to control the spread of the inserts throughout the data table and the second part is used to ensure uniqueness. The following example is credited to Hal Spitz of Sybase:
1> create procedure get_next_key @next_key char(20)In this example, the first part of the key is the SQL Server user ID. Other possible values include the client host process ID and SQL Server process ID. The first part of the key is used to improve concurrency by controlling the spread of the inserts. Since this example uses the SQL Server user ID, all rows inserted by a given user would be grouped together. This grouping would occur on the data pages in the case of a clustered index on the key and on the leaf-level index pages in the case of a non-clustered index on the key. If each concurrent user logs in using a different SQL Server user ID, concurrent inserts are spread throughout the table, resulting in less contention. The second part of the key in this example is based upon datetime because it provides a practically unique value. In addition to providing uniqueness, the second part of the key keeps the cache hit ratio from deteriorating. Using the composite key approach, a given SQL Server user ID generates keys in a monotonic series. For example, SQL Server user ID 125 might generate the following values:
12519940411145439533This series causes each SQL Server user ID to consistently work in one area of the table or index. Consequently two consecutive inserts by the same SQL Server user ID will, more than likely, be directed to the same page in the database.
DisadvantagesThe primary disadvantage to this approach is that you must use a larger key. The example uses a twenty-byte string rather than a four-byte integer as used by the previous example. As previously mentioned, one of the primary reasons for using a surrogate primary key is to have a small key that results in a small index. Because this approach requires more space for the same amount of information, a lesser amount of valuable information can be stored on the pages in the data cache.A secondary disadvantage is that the datetime values are part of a continuous sequence¯not a discrete sequence. There is no way to look for or control gaps in the key sequence. A better solution for the insert-intensive OLTP environment would do all of the following:
Recycled Series ApproachIn the recycled series approach, each concurrent process has its own conceptual next-key table.For example, three concurrent processes can acquire next key values as follows: One process can work on the sequential series from 1 to 1,000,000, the second can work on the series from 1,000,001 to 2,000,000, and the third from 2,000,001 to 3,000,000. An analogy can be made to a diner and its five wait persons:
What happens if a wait person quits (one less concurrent process)? The manager takes his or her pad of order forms and saves it. What happens if the wait person is later replaced (a replaced concurrent process)? He or she is given the predecessor's pad of order forms to continue in the series. One implementation of this approach is included at the end of this TechNote. The included implementation is written completely in Transact-SQL. An alternative implementation that would require less overhead could be written as an Open Server application. In summary, a recycled series approach might be effective when both concurrency and cache hit ratio affect overall database server performance. This is usually the case in insert-intensive OLTP databases.
Trigger-Based ImplementationAny of the above approaches, except a monotonic approach using the identity property, could be implemented using triggers. However, such an implementation can adversely affect SQL Server performance. First, SQL Server must insert rows into the wrong area of the table and/or indexes, and the trigger relocate the rows to the correct place. Second, because all inserts will initially be inserted into the same area of the table and/or indices a hot spot will result.
Performance Comparison of the Four ApproachesRao Kanikicharla of Bell Atlantic Healthcare Systems tested comparative performance to measure the performance differences of these various approaches. With only one concurrently inserting process, the monotonic approach was found to be the most effective. As more concurrent processes were added, the overall throughput of the other approaches improved while the monotonic approach continued to process transactions at approximately the same rate. The break-even point was approximately four to five concurrent transactions performing simple insert statements. With ten concurrent transactions, the recycled series approach was found to be significantly faster. Additionally, it caused fewer deadlocks. The break-even point would be a smaller number of concurrent processes for larger transactions that hold the exclusive locks for a longer period of time.The various approaches are effective in different environments. The monotonic approach may provide best performance in a low activity environment. The pseudo-random approach may provide best performance in an insert-intensive environment with a large data cache. The composite approach may provide better performance than the previous two in an insert-intensive environment with a relatively small data cache. Finally, the recycled series approach also works well in an insert- intensive environment with a small data cache. However, the recycled series approach requires less overhead due to a smaller key and provides a way to manage gaps in the series of key values.
Recycled Series Implementation Example
create table next_key create unique clustered index next_key_spid /* Insert a firstrow */ insert into next_key create procedure get_next_key declare @error_flag int select
if (@@rowcount<=(@next_key-1)%1000000)
begin
else
update next_key select @error_flag = 0
set rowcount 1
if (@@rowcount = 1)
else begin
declare @max_next_key int select @next_key = (@max_next_key-1)/1000000+1
if not (@max_next_key is null)
else if (@error_flag = 1) go
update statistics next_key exec sp_recompile next_key by Gary Meyer
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||