Sybase Business Intelligence Solutions - Database Management, Data Warehousing Software, Mobile Enterprise Applications and Messaging
  Worldwide [Change] Contact Us  |  My Sybase  |  Shopping Cart - Buy Business Intelligence Products - Database Management, Data Warehousing and Mobile Software  

Search for    in all of Sybase.com
view all search results


Support & Services / Technical Documents / Document Types / White Paper-Technical / Adaptive Server Anywhere Data Storage  
Adaptive Server Anywhere Data Storage

Introduction

This whitepaper explains data storage within Adaptive Server Anywhere databases. Reading this whitepaper should help you to understand how the space within the database is used. Adaptive Server Anywhere is designed to store data efficiently, using a small amount of disk space and allowing database operations to be executed relatively quickly. However, there are some things that you can do to reduce the size of your database and to improve its performance. These include storing server files on different drives, choosing a proper page size, and using indexes appropriately. This whitepaper should give you some idea of the effects of these actions so that you can decide how they apply to your particular database.

Except where noted otherwise, this whitepaper refers to features of Adaptive Server Anywhere 8.0.0 and later.

Files used by the database server

Adaptive Server Anywhere uses four kinds of files when it is running a database. These are the database file, the transaction log, the transaction log mirror, and the temporary file.

  • The database file The database file holds all of the information in an Adaptive Server Anywhere database. It holds the data tables and indexes for the database as well as the system tables for the database, which contain information such as foreign key relationships and database characteristics such as page size and collation. You can create additional database files, called dbspaces. The database file and dbspaces have the suffix . db.
  • The transaction log file The transaction log holds a record of all of the changes that have been made to the database. It allows data recovery in the event of system failure and improves performance. The transaction log is also necessary for database replication using SQL Remote. It usually has the same name as the database file, except for the suffix, which is . log .
  • The transaction log mirror The transaction log mirror is an optional copy of the transaction log. Adaptive Server Anywhere writes the same information at the same time to both files. A transaction log mirror enables recovery if the transaction log becomes unusable due to media failure. It has the suffix . mlg. A transaction log mirror is not required for every database, but it is recommended for very critical applications.

¨ The temporary file As its name suggests, the temporary file is used to hold temporary information while Adaptive Server Anywhere is running. There is one temporary file for each of the databases running on the server; Adaptive Server Anywhere opens the temporary files when the server starts and closes it when the server is shut down. This file is used when the server requires more space than is available in the cache in order to perform certain operations. The database server assigns this file the suffix . tmp by default, but it is possible to assign a different extension.

Separate files onto different devices to improve performance. Dividing this information into different files has potential performance and recovery benefits for the database server. You can take advantage of these benefits by placing each of these files on a different physical device.

Placing the files on different physical devices improves performance by eliminating the need for disk head movement between the files. Separating the transaction log, the transaction log mirror, and the database file onto different devices has the further benefit of ensuring that media failure will affect only one of these files, allowing for full data recovery. Placing the files on different physical drives also helps prevent file fragmentation.

Database file pages

The information in the database file, the transaction log, and the temporary file is stored on fixed-size areas of disk called pages. The default page size is 2 kb, but pages can also be 1, 4, 8, 16, or 32 kb. All of the pages in the database must be the same size, and this size must be chosen when the database is created.

There are many types of pages used in the database file. These include table pages, index pages, and other pages used to increase performance or to aid in data recovery.

Table pages

Information in database tables is stored on table pages. Each table has one or more pages assigned to it, and each page is made up of three parts: a header, data rows, and a row offset table. Each table page can storedata from at most one table and at most 255 rows.

  • The header stores the object ID of the table that is using that page. It also holds pointers that link the current page to the next and the previous pages for that object.
  • The body of the page contains data rows for the object. These data rows contain the actual column data from the table and are stored contiguously on a page.
  • The row offset table stores pointers to the starting location for each data row on the page.

Adaptive Server Anywhere creates a bitmap for sufficiently large tables within databases that have at least a 2K page size. Each table’s bitmap reflects the position of each table page in the entire dbspace file. For databases of 2K, 4K, or 8K pages, the server utilizes the bitmap to read large blocks (64K) of table pages instead of single pages at a time, reducing the total number of I/O operations to disk and hence improving performance. Users cannot control the server’s criteria for bitmap creation or usage.

Note that bitmaps, also called page maps, are only available for databases created in version 8.0 and higher. If a database is upgraded from an older version, the database server will not create a bitmap for database tables. Bitmaps are not created for work tables or system tables.

Every new row that is smaller than the page size of the database file is always stored on a single page. If no present page has enough free space for the new row, Adaptive Server Anywhere writes the row to a new page. For example, if the new row requires 600 bytes of space, but only 500 bytes are available on a partially filled page, then Adaptive Server Anywhere places the row on a new page.

To make table pages more contiguous on the disk, Adaptive Server Anywhere allocates table pages as follows:

  • For 2 KB pages, table pages are allocated in blocks of 32.
  • For 4 KB pages, table pages are allocated in blocks of 16.
  • For 8 KB pages and above, table pages are allocated in blocks of 8.

For example in a database with 8 KB pages, when it needs to allocate a page, it allocates eight pages, inserts the page in the block, and then fills up the block with the next seven pages. In addition, it uses a free page bitmap to find contiguous blocks of pages within the dbspace, and performs sequential scans by reading groups of 64 KB, using the bitmap to find relevant pages. This leads to more efficient sequential scans.

Text and image pages

Character and binary data types that are larger than 255 characters are called binary large objects, or blobs. These objects are not stored in the table page in their entirety; instead, the page stores the first 255 characters of the object and a pointer to a page where the rest of the object is stored.

Storing blobs separately is more efficient The pages that store the remaining characters of these objects are maintained separately from the table pages. Because blobs take up a lot of space on a page, including them in the table with the rest of the columns would greatly increase the time required to scan the table sequentially. If you were searching sequentially for some data in the table, you would have to scan the blob data even if it was irrelevant to the information you were looking for. Since blob data is stored separately, it does not have to be scanned along with the rest of the table data. There are fewer pages to read, resulting in a more efficient scan.

Index pages

Index entries are stored on index pages, which are similar to the table pages described above. The main difference between the two types of pages is that the object being stored is a set of index entries rather than a set of data rows. In general, index entries are much smaller than data rows, so index pages usually contain many more entries per page than table pages. Like table pages, each index page can only include information for one index.

Indexes can form a large or a small percentage of the database, depending on the number of indexes created. Since primary and foreign keys are always indexed automatically, most databases contain indexes.

Rollback log pages

Each connection to the database server has a rollback log. The rollback log is a record of all of the changes that have been made to the database by a connection since the last COMMIT or ROLLBACK statement. This record allows uncommitted changes to be cancelled if a ROLLBACK is issued or in the case of system failure.

The rollback log consists of a linked list of pages that is stored in the database file. Each page contains a list of changes to the database, much like the transaction log. However, unlike the transaction log, which is involved in replication, the rollback log is only available internally to the database server.

Once a transaction has been committed or rolled back, the rollback log is erased and any resulting free pages are available for reuse within the database.

Checkpoint log pages

At any given time, the information in the database fileon disk is not guaranteed to be current or consistent with the transaction log. This is because pages are read into the cache and altered but are not immediately written to disk. Dirty pages, or pages that have been changed in the cache, are written to disk only when the cache becomes full or when a checkpoint is issued.

A checkpoint is a point at which all of the dirty pages in the cache are written to disk. After a checkpoint, the database file is up to date and accurately reflects the information in the transaction log. Frequent checkpoints reduce the time needed for data recovery, but this benefit should be balanced with the fact that executing checkpoints takes time. Adaptive Server Anywhere tries to execute checkpoints when there is not a lot of other activity taking place, according to a heuristic based on how long it has been since the last checkpoint and how long recovery would take in the event of a system failure. You can adjust the time between checkpoints using the CHECKPOINT_TIME option, which sets the maximum desired time between checkpoints, in minutes, and the RECOVERY_TIME option, which sets the maximum desired time for recovery in the event of system failure, in minutes.

The database server always executes a checkpoint during database shutdown, so that when the server is restarted, the state of the database file is known, up-to-date, and does not require recovery.

The checkpoint log If a system failure caused the database server to shut down without first executing a checkpoint, the databasefile may not be in a consistent state. That is, some changes that have been made since the last checkpoint may have been written to the database file, while others would not have been written and so would be lost. The database server maintains copies of all of the pages that have been modified since the last checkpoint, and uses these pages to recover from an improper shutdown. These copied pages are called rollback pages, and together they form the checkpoint log.

The checkpoint log is referenced through the free list, a linked list of pages similar to index pages that point to free pages within the database. When the database server requires a new page, it allocates a page from the free list. When a page is allocated as a rollback page, it is marked using a rollback bit to signify that it is part of the checkpoint log.

When the database is started, the checkpoint log is checked to see if it contains any rollback pages. If it does, the database server takes steps to recover from the system failure. It first restores the database file to its state at the last checkpoint by copying all the rollback pages over their corresponding pages in the database file. It then applies changes made since the last checkpoint by reapplying the operations in the transaction log. Finally, it rolls back any uncommitted changes, using the rollback logs. The entire checkpoint log, including all empty checkpoint log pages, is deleted at the end of each session.

As of Adaptive Server Anywhere 8, the checkpoint log is stored in consecutive pages at the end of the database file. This leads to improved performance by allowing sequential scans and multipage writes of the material in the checkpoint log. To gain the benefits of this enhancement on databases created before this release, you must upgrade the database file format by unloading and reloading the database.

Pages in the other files

Data within the transaction log, transaction log mirror, and temporary files is also stored in pages. These pages are the same size as the pages in the database file.

The transaction log

The transaction log and the transaction log mirror consist essentially of linked lists of pages that store all of the changes that have been made to the database. Each entry in the transaction log holds all of the information necessary to execute the SQL statement for one particularrow change. For example, the entry for a DELETE statement would contain information to accurately identify the row that was deleted.

Because the transaction log is used during recovery and during replication, it cannot use row IDs to refer to rows that have been modified. Instead, entries in the transaction log must always refer to a row either by primary key value, or, if there is no primary key, by copying the contents of the entire row.

Using a transaction log improves performance Although it may seem that using a transaction log would slow the performance of the database server, in fact the opposite is true. Without a transaction log, Adaptive Server Anywhere must perform a checkpoint at the end of each transaction. Checkpoints are time-consuming as they can involve writing a lot of pages to disk. Using a transaction log involves taking note of every operation, but it requires writing fewer pages to disk. Thus, using a transaction log actually improves performance.

Every update operation is written in the transaction log; however, in order to minimize I/O operations, the transaction log is only written to disk when a COMMIT statement occurs. Therefore, if a transaction log is used, the most recent transaction log pages are written to disk at each COMMIT statement. In contrast, if a transaction log is not used, a COMMIT statement forces a checkpoint. In this case, every page that has been changed in any way since the last COMMIT statement must be written to disk.

The size of the transaction log does not affect performance, because all changes are written to the end of the file, so the transaction log does not usually have to be scanned.

However, you may wish to limit the size of the transaction log in order to conserve disk space. Renaming the transaction log when a backup is performed keeps the transaction log small. You can also manage the growth of the transaction log file by ensuring that all tables have a primary key or a unique index that does not allow NULL. These columns allow the transaction log to identify a row by giving only one value. Without a primary key or a unique, non-NULL index, the transaction log must store a copy of the entire row in order to identify a row uniquely. Copying this extra information can also affect the performance of the database server.

The temporary file

The temporary file contains tables and indexes created by Adaptive Server Anywhere when needed for query processing. These objects are stored on pages identical in structure to the data and index pages within the database file. The temporary file is also used to store prepared statements, cursors, stored procedure definitions, view definitions, the lock table, and user-generated temporary tables.

Adaptive Server Anywhere uses temporary tables to perform operations such as sorts or complex unions that cannot be performed in a single step. For example, if you use a SELECT statement and ORDER BY a column that is not indexed, Adaptive Server Anywhere uses a temporary table to sort the result by the specified column.

Because the temporary file is used in order to solve certain queries, Adaptive Server Anywhere generally uses the temporary file intensively over short periods of time. For example, Adaptive Server Anywhere might use the temporary file to create a temporary table and execute a number of queries using that table, and then not use the temporary file again until the next temporary table is required.

Often, operations that require the use of the temporary file also require the retrieval of large amounts of data from the database file. In order to allow operations to take place in the database file and the temporary file simultaneously, you should place the temporary file on a fast device that is separate from the drive that holds the database file. You can do this by specifying an environment variable. Adaptive Server Anywhere places the temporary file in the location specified by ASTMP; if this variable is not defined, Adaptive Server Anywhere looks next at the variables TMP, TMPDIR, and TEMP, in that order.

Choosing a page size

The page size for a database can be 1, 2, 4, 8, 16, or 32 kb. The choice of page size can greatly affect the performance of the database and the size of the database file.

Tip
Both large and small pages have advantages, and there is a tradeoff to making pages larger or smaller. In general, performance suffers if very small or very large pages are chosen. Most databases perform best with 2, 4, or 8 kb pages.

When choosing a page size, you should consider the following factors:

  • Database server cache size
  • Table and row size
  • Index fan-out

Database server cache size

The advantages and disadvantages of different page sizes depend heavily on the way that you access the data, because of the importance of keeping useful data within the cache.

The cache The cache is an area of memory that holds pages that are currently being read and stores pages that are used frequently. Before information that is stored on disk can be accessed, it must first be read from disk into the cache. Because pages that are already in the cache can be scanned much faster than those that must be retrieved from disk, you should try to maximize the amount of useful information that can be stored in the cache.

Cache page size Like the files used by the database server, the cache is divided into pages. Information is moved between the cache and the disk one page at a time. Cache pages do not have to be the same size as database pages, but they cannot be smaller than the database pages of any database that will be run by the server. Each cache page can hold only one database page, so cache pages should not be made any larger than necessary.

The page size of the cache is set when you start the database server. By default, it is set to the largest page size of the database files opened when the server is started. Once the page size is set, you cannot open a database with pages larger than the cache pages. If you plan to open a database with large pages later, you can set the page size of the cache explicitly by using the -gp flag on the command line. For example, typing dbserv7 -gp 4K at the command line starts the server with 4 kb cache pages.

Caution
Using cache pages that are larger than your database pages wastes a lot of space in memory. If you are running multiple databases on one server, they should all use the same page size.

Sequential scans favor large pages It takes approximately the same amount of time to read a page into the cache regardless of the page size. For example, it takes much less time to read one 4 kb page into the cache than it takes to read four 1 kb pages into the cache. Therefore, if you frequently execute commands that require scanning an entire table, it is more efficient to use a large page size, such as 4 kb, which allows you to bring much more of the table into the cache more quickly.

Large page sizes also provide benefits when writing information to the transaction log and the temporary file. For example, a 4 kb transaction log page can hold more entries than a 1 kb page, and therefore transaction log pages have to be written to disk less often when using larger pages.

Smaller pages may improve performance with a small cache On the other hand, if your typical database commands are more likely to require reading small pieces of data from random locations, using large pages could result in wasted space in the cache. For example, if you use 4 kb pages, you always have to read 4 kb into the cache at a time, even if 1 kb would be adequate.

When a page is read into the cache, another page must usually be written to disk and discarded from the cache in order to make room for the new page. Therefore, if you need to access one row from a random location in your database, and you have 4 kb page sizes, you must discard 4 kb of possibly useful information from the cache in order to access that one row. Using smaller pages allows you to bring small pieces of many different tables into the cache at a time.

You may also want to use smaller database pages if you are running Adaptive Server Anywhere on a machine with limited resources. Small database pages allow a greater number of pages to fit into a smaller cache.

For more information on memory use and the effect of cache size, please see the paper entitled "SQL Anywhere 5.5 Memory Use", located at http://www.sybase.com/developer/ .

Table and row size

Each table requires at least one table page. Even empty tables have one table page assigned to them. Therefore, if your database consists mostly of tables that can fit on one page, you may want to choose a smaller page size, since larger pages could contain a lot of empty space. However, for a typical database, this is not a concern, because the majority of the database consists of large tables.

Inserted rows that do not fit on an existing page are added to a new page. Therefore, if you are using very large rows, you should be careful not to use a page size that is just under twice the average row size. In this case, almost half of each page is wasted, and you would probably benefit from a different page size.

On the other hand, if your rows are relatively small, you should be careful not to use a page size that is too large. This is because a maximum of 255 rows can be stored on each page, so if the page is too large, the rows do not fill the pages completely. For example, in order for a 32 kb page to be fully used, the average row size would have to be about 125 bytes.

Index fan-out

Index fan-out refers to the average number of index entries that can fit on an index page. Clearly, page size can significantly affect index fan-out. This can have a large effect on the performance of queries that use indexes. For more information, please see "Improving index performance" in the section "Improving database performance" in this paper.

Storage for data types

The number of rows that fit on a page depends upon the data types stored in those rows. Different data types have different storage requirements.

Note
Unlike most other database systems, Adaptive Server Anywhere stores all character and binary data types in the same way, as a variable length object. For example, there is no storage benefit to using a VARCHAR rather than a CHAR.

Data Type

Storage Requirement (in bytes)

CHAR

# of bytes in string + 1

(additional storage required if length is greater than 255 bytes)

VARCHAR

See CHAR

LONG VARCHAR

See CHAR

TEXT

See CHAR

BIGINT

8

DECIMAL

2 + the integer part of ((# of digits / 2) + 1)

DOUBLE

8

FLOAT

4 (single precision) 8 (double precision)

INT (INTEGER)

4

NUMERIC

See DECIMAL

REAL

4

SMALLINT

2

TINYINT

1

MONEY

See DECIMAL

SMALLMONEY

See DECIMAL

BIT

1

DATE

4

DATETIME

8

SMALLDATETIME

8

TIME

8

TIMESTAMP

8

BINARY

# of bytes + 1

(additional storage required if length is greater than 255 bytes)

LONG BINARY

See BINARY

IMAGE

See BINARY

VARBINARY

See BINARY

A case study

In order to look more carefully at some of the effects of page size on performance, we ran a number of tests on a sample database. These examples illustrate approaches you may find useful in assessing your own database needs.

The sample database comes from the Transaction Processing Performance Council (TPC) using the TPC-D Benchmark Specification. This database is designed to support a product inventory system, and contains information about orders, customers, and suppliers.

DBGEN, the benchmark's database population generator, was used to create 0.1 Gb of data that was used to fill the databases. The same data was used in three databases that had page sizes of 1 kb, 4 kb, and 16 kb. Queries were then executed on these three databases in order to determine which page size had the best performance results.

The queries used in this case study were executed in Interactive SQL, and the results were output to a file (using the UNLOAD statement) to ensure that all rows were fetched. The execution time reported in the Messages pane of Interactive SQL was the time used in comparisons.

Dynamic cache sizing was disabled during the execution of these queries; each query was run using a 12 Mb and a 60 Mb cache. In order to provide consistent and repeatable results, the cache was flushed between queries.

Since different page sizes often lead to different access plans being chosen by the optimizer, it was important to ensure that only queries using the same plan across page sizes were compared.

Note:
These tests were run on freshly loaded databases—no inserts, updates, or deletes were run on the databases before the queries were executed.

Results

As expected, both large and small page sizes showed advantages and disadvantages.

A single sequential scan of a table was always faster with a larger page size (16 kb was faster than 4 kb, which was faster than 1 kb.)

Queries that retrieved data from more random locations, using both sequential scans and index lookups, were sometimes faster with 4 kb pages and sometimes faster with 16 kb pages. The two major factors that affected these queries were index levels and cache size.

  • Index levels Since a single index lookup requires reading one page at each level of the index, plus the table page, increasing the number of index levels can greatly increase the number of page reads required for a single query. (See also "Improving index performance", in the "Improving database performance" section of this paper.)
  • Cache size When a large cache size is available, performance can be improved by having larger page sizes. However, if space is more limited, queries on databases with smaller page sizes can outperform those on databases with larger page sizes.

This is because an entire page of information is read into the cache even if only a single row on that page is needed. With larger pages, a small cache quickly fills, and each read into cache then means discarding an equivalent amount of information. If this information is needed later, it must be re-read into cache. With smaller pages, the cache does not fill up as quickly, and even once it is full, each read into the cache requires removing less information from it.

If there are many users accessing the database server simultaneously, this has the same effect as decreasing the cache size, since there is less cache space for each user, and pages are more likely to be swapped to disk.

These examples are designed to look at SELECT statements only. There are additional page size concerns that relate to inserts, updates, and deletes.

Note
This case study was run using Adaptive Server Anywhere 7. The results with
Adaptive Server Anywhere 8 would be different.

Example 1

The following query joins two tables in the TPC-D scheme, Lineitem and Partsupp, using a two-column foreign key relationship.

SELECT ps_supplycost, l_extendedprice
FROM partsupp, lineitem
WHERE l_extendedprice < 10000
AND ps_supplycost < 500
AND l_partkey = ps_partkey
AND l_suppkey = ps_suppkey

This query has the access plan "Scan Lineitem sequentially. Scan Partsupp using primary key for (ps_partkey, ps_suppkey) = (l_partkey, l_suppkey)." It requires using the index on the primary key of Partsupp.

There are 82 694 rows in Lineitem that have l_extendedprice < 10000. For each of these rows, the database server must do an index lookup in the primary key of Partsupp to find (ps_partkey, ps_suppkey) = (l_partkey, l_suppkey). For each index lookup, the number of pages read is the number of index levels plus one. The primary key of Partsupp has 4 levels using 1 kb pages and 2 levels using 16 kb pages; therefore, the server has to read about 400 000 pages using 1 kb pages and about 250 000 pages using 16 kb pages.

The number of pages that need to be read has a direct impact on the time taken to execute the query. The 16 kb page database had the highest performance for this query. Compared to its performance on the 16 kb database, the query took about 1.4 times as long on the 4 kb page database and about 2.5 times as long on the 1 kb page database.

Relative times are shown in the following table:

1 kb

4 kb

16 kb

2.5

1.4

1.0

Example 2

The following query joins a number of tables together, using three indexes.

SELECT l_orderkey, o_orderdate, o_shippriority
FROM order, customer, lineitem
WHERE c_mktsegment = 'building'
AND c_custkey = o_custkey
AND l_orderkey = o_orderkey
AND o_orderdate < '1995-03-15'
AND l_shipdate > '1995-03-15'

This query has the access plan "Scan Customer sequentially. Scan Order using foreign key Customer for o_custkey = c_custkey. Scan Lineitem using foreign key Order for l_orderkey = o_orderkey."

Even though the indexes used in this query also have more levels on the smaller page databases, in this case, the most important factor influencing execution time is cache size. With a 60 Mb cache, the databases on the larger page sizes have better results. However, with a 12 Mb cache, the 16 kb page database had the worst performance. The queries performed best on the 4 kb page database, followed by the 1 kb page database.

Conclusions

It can be difficult to predict which factors will have the most influence on a particular query and which page size will allow the best performance for your queries overall. For this reason, it is recommended that you do not use very large (16 and 32 kb) or very small page sizes (1 kb) unless you have carefully considered the performance and storage benefits. The page size that will provide optimal performance for your database depends on the specific operations performed by your applications.

Reducing database size

There are several factors that could cause your database to be larger than you expect. In particular, you should consider the number of empty pages in the database and the distribution of data within the database.

  • Empty pages There may be empty pages within the database that have not been reclaimed.
  • Empty space on the pages Data may not be stored compactly on the table pages.

Monitoring page usage You can display information about the database file by using the DBINFO command-line utility. Using DBINFO with the -u command-line flag allows you to see page usage statistics for each table in the database, including system tables. This command will display a chart showing the number of table and index pages assigned to each table and the percentage of these pages that the table is actually using. You can also see the number of free pages in the database. You can use this information to check how much empty space there is in your database.

Acceptable ranges of page usage vary widely from one database to the next, depending on the use of the database. In general, you should not be concerned about page usage unless you have noticed declining performance in your database.

The free page bitmap

Adaptive Server Anywhere databases never shrink in size. Pages that become empty are not removed from the database, but are instead kept to be reused the next time a new page is required. The free page bitmap, a linked list of pages that reference free pages within the database, is used to keep track of these free pages.

The checkpoint log is referenced through the free page bitmap. When the database server requires a new page, it allocates a page from the free page bitmap. When a page is allocated as a rollback page, it is marked using a rollback bit to signify that it is part of the checkpoint log.

Because free pages are not released from the database, deleting large amounts of information from the database does not cause the database to shrink. In fact, because every page on which an item is deleted needs to be copied to the checkpoint log, deleting large amounts of information may actually cause your database to grow.

If you are concerned with the size of the database file and you do not anticipate adding more information to your database in the near future, you can reclaim the free pages in your database by unloading and reloading the database. This process is known as rebuilding the database.

Caution:
Rebuilding a database is a time-consuming operation and requires a large amount of disk space. For this reason, rebuilding the database is not recommended unless you have a specific goal in mind.

Reducing empty space on the pages

Reading pages into the cache is more efficient when pages do not contain empty space, because more information can be obtained from each cache read. For example, reading two half-full pages into the cache takes longer than reading one full page, but does not provide any additional information. For this reason, Adaptive Server Anywhere attempts to fill pages as much as possible.

Keeping track of all of the empty space on each of the pages in the database file would not be an efficient use of time or space. Instead, Adaptive Server Anywhere keeps track of a small number of pages that have a lot of space available. The server maintains a list for each table in the database that indicates which pages belonging to that table have a lot of free space available and are good candidates for insertion.

The header of each table page stores information about space available on that page. Whenever a page is read into memory, the header is scanned to see how much space is available on that page. If the page has a lot of space available, it is added to the list of good candidates for insertion for that table.

When it is time to insert a row, this list can be scanned to see if one of the pages for the table where the row is to be inserted has enough free space to hold the row. If it is possible to put the row on an existing page, the database server does so in order to keep data as compactly as possible. As a result, rows are not necessarily stored in the table in the order that they were inserted.

When a row is inserted into the database, it is placed on one of the pages assigned to its table and allotted the space required to hold it. Once a row has been inserted on a specific page, it does not move from that page. Moving rows would be time-consuming because of the need to update indexes that refer to that row.

Because rows are not moved from one page to another, data is not rearranged to fill the empty space created when a row is deleted. This space can be filled only when inserting a row in the same table or if an update occurs on that page. Inefficiencies arise only when small rows are deleted and then rows that are too large to fit into the empty spaces left behind are inserted. The new rows are therefore placed on a new page, leaving behind empty space on the old page. Having data spread out in this way is not usually enough of a problem to affect performance, but if necessary, unloading and reloading the database can reclaim this empty space.

Using the REORGANIZE TABLE statement

The REORGANIZE TABLE statement, available in Adaptive Server Anywhere 8.0.0 and later, can be used as an alternative to unloading and reloading a database. It causes rows for a table to be allocated contiguously and reduces the amount of empty space on pages.

Adding data to inserted rows can be inefficient

Adaptive Server Anywhere provides a row with only enough space to hold the values that it contains at the time that it is inserted. It also leaves a little bit of extra room on each page to allow rows to grow slightly. If an update to a row causes it to grow in size beyond the extra space provided, then the row is split so that the original location of the row holds a pointer to another page where the entire row is stored. Accessing rows that are split in this way is less efficient, because two pages must be read instead of one.

While it may not always be possible to avoid making rows larger during routine updates, you should avoid entering empty rows and filling in data later using UPDATE statements. This practice can lead to severe row splitting. You should fill new rows with data at the time of insertion whenever possible. If data is not available at the time that the row is inserted, consider inserting default values to ensure that Adaptive Server Anywhere reserves enough space for the filled row.

Two types of fragmentation
The fragmentation discussed here is table fragmentation, which refers to the location of the rows of a table within the database file. If rows are split onto more than one page, or if rows are not stored contiguously, there could be space and performance effects on the database. You can determine the amount of table fragmentation that exists using the sa_table_fragmentation stored procedure.

As of Adaptive Server Anywhere 8.0.1, you can reduce table fragmentation by specifying the percentage of free space that should be reserved in table pages. To specify the percentage of space to be allocated on databases created before this release, you must upgrade the database file format by unloading and reloading the database.

Another kind of fragmentation is file fragmentation. A file (such as the database file) is fragmented when information from that file is split into several non-contiguous locations on disk. File fragmentation can also lead to performance problems, especially with a large, rapidly changing database. Running a disk defragmentation utility can eliminate file fragmentation problems.

Other tips

Here are some additional operations to consider if you are very concerned with database size.

  • Discard indexes that are not being used. Although indexes can improve database performance, they can take up a lot of space in the database file. You may have indexes whose performance benefits are slight. You can remove these indexes in order to decrease the size of the database file. This can also improve the performance of some operations because the database server will no longer have to update the index. The pages freed by deleting an index can then be reused within the database; however, the total size of the database will not decrease unless you unload and reload the database.
  • Choose a different page size. There may be a lot of wasted space on each page because of the relationship between the size of your database pages and the size of your tables. (See the section "Choosing a page size" for more information.)
  • Use TRUNCATE TABLE instead of DELETE. If you are planning to delete all the rows of a table, use TRUNCATE TABLE, not DELETE. TRUNCATE TABLE statements require only one entry to the transaction log, whereas DELETE statements log an entry to the transaction log for each row that is deleted. Furthermore, TRUNCATE TABLE does not cause pages to be copied to the checkpoint log, so it can save a lot of space.Finally, TRUNCATE TABLE is faster than a series of DELETE statements. Keep in mind, however, that TRUNCATE TABLE statements cannot be rolled back.
  • Execute frequent COMMIT statements, if appropriate. The rollback log stores all of the statements that have been executed since the last COMMIT. It will grow until the next COMMIT statement is executed, possibly causing new pages to be allocated to the database. While the execution of COMMIT statements is normally dictated by the logic of the application, there are some cases where you could use extra COMMIT statements to slow the growth of the rollback log. For example, if you are executing a number of INSERT or DELETE commands in a row, you may want to execute a COMMIT command after every few hundred rows that are inserted or deleted.

Improving database performance

There are a number of reasons that operations on your database may be going more slowly than expected. Of these, the following relate specifically to data storage and layout.

  • Improper use of indexes Each basic database operation can be greatly affected, either positively or negatively, by the use of indexes. You may want to consider adding more indexes, deleting indexes that are not useful, or changing some index features for better performance.
  • Efficient use of space If the database is not using space efficiently (see "Reducing database size", above), then the server may have to read many more pages than necessary into the cache for each operation, and this could be hindering performance.
  • Incorrect choice of data types Choosing an inappropriate data type can contribute to poor performance. For example, using CHAR to store numeric data can slow operations involving that data.

Improving operation performance

A single operation on a database requires a number of I/O operations to be performed. The following lists of steps indicate the approximate number of page reads and writes required for each basic data operation. The exact number of reads and writes required depends upon several factors such as the indexes used, the frequency of checkpoints, the space available in the cache, and the size of the table pages.

Inserting a row

  1. Find a page to insert the new row.
  2. For each table, the server maintains a list of pages that have some free space. Scan the list for the table where the row is to be inserted. If a page has enough space, read it into memory. If one of these pages has enough space, obtain and initialize a free page from the free list.

  3. If necessary, copy the page to the checkpoint log.
  4. If the chosen page is not a new page and if it has not been accessed since the last checkpoint, the server makes a copy of the page for the checkpoint log.

  5. Insert the row.
  6. Check referential integrity and uniqueness.
  7. Look at the index for each unique column in the table to ensure that a value has not been duplicated. If the inserted row contains any foreign keys, check that the corresponding primary keys exist. For each foreign key, this requires reading one page for every level of the related primary key index.

  8. Update any indexes associated with the table.
  9. The number of reads required to update an index depends on the number of levels in the index. If the index has not been modified since the last checkpoint, a copy of the index page is made for the checkpoint log before it is modified.

  10. Write the operation to the rollback log and to the transaction log.

These data structures are likely to be already in the cache, so a read from the disk may not be required. Changes to the transaction log are not written to disk until a COMMIT is executed. Changes to the rollback log are not flushed to disk unless the cache page is required.

Deleting a row

1 Locate the row to be deleted.

If an index can be used to locate the row, this requires only as many reads as the number of levels in the index. Otherwise, this requires a full table scan.

2 If necessary, copy the page to the checkpoint log.

3 Delete each index entry that exists for any of the values in the row.

If there are foreign keys that reference the primary keys in the table, check that no orphans (foreign keys that have no corresponding primary keys) have been created.

4 Mark the row as deleted.

The row is not actually deleted until the transaction is committed.

5 Write the operation to the rollback log and to the transaction log.

Updating a row

1 Locate the row to be updated.

2 If necessary, copy the page to the checkpoint log.

3 Delete each index entry that corresponds to the values that are about to be modified.

Keep track of any orphans that are created in the process.

4 Update each of the affected values.

If there is not enough room on the page for the updated row, this may involve splitting the row to another page.

5 Add new index entries if necessary.

Ensure that no orphans remain after the update has occurred.

6 Write the operation to the rollback log and to the transaction log.

Indexes can drastically reduce the number of pages that have to be read for queries on the column that is indexed. However, because indexes need to be updated every time an indexed value is added, deleted, or modified, using too many indexes can adversely affect the performance of update operations. Therefore, you should index only columns that are searched frequently.

The number of indexes that should be used may also depend on the way that the database is used. For example, if you need to access information to answer queries quickly, and it is not as important to have a small database, you may want to use more indexes. If, on the other hand, you make many inserts and updates, but execute fewer queries, you may want to use fewer indexes in order to minimize the time required for data modification.

Improving index performance

If you notice that queries that use indexes are taking more time than expected, you may want to consider choosing another page size, reorganizing your
composite indexes, or using clustered indexes. These measures are aimed at increasing index selectivity and index fan-out, as explained below.

  • Index selectivity Index selectivity refers to the ability of an index to locate a desired index entry without having to read additional data.

    If selectivity is low, additional information must be retrieved from the table page that the index references. These retrievals are called full compares, and they have a negative effect on index performance.

    The FullCompare property keeps track of the number of full compares that have occurred. You can also monitor this statistic using the Sybase Central
    Performance Monitor or the Windows Performance Monitor.
     
  • Index structure and index fan-out Indexes are organized in a number of levels, like a tree. The first page of an index, called the root page, branches
    into one or more pages at the next level, and each of those pages branches again, until the lowest level of the index is reached. These lowest level index pages are called leaf pages. To locate a specific row, an index with n levels requires n reads for index pages and one read for the data page containing the actual row. In general, fewer than n reads from disk are needed since index pages that are used frequently tend to be stored in cache.

    The index fan-out is the number of index entries stored on a page. Each index lookup requires one page read for each of the levels of the index plus one page read for the table page, and a single query can require several thousand index lookups. An index with a higher fan-out may have fewer levels than an index with a lower fan-out. Therefore, higher index fan-out generally means better index performance. For this reason, consider using a large page size, such as 4 KB, to improve index performance. You may also want to consider using a larger page size when you wish to index long string columns using compressed B-tree indexes, but the size limit on smaller page sizes is preventing their creation.

    For more information, see "Compressed B-tree indexes" below.

    You can display the number of levels used for the index on a table using the sa_index_levels system procedure.

Example

The following table lists the index depth and the average index fan-out for some sample indexes. Here, the index fan-out is calculated by dividing the number of leaf pages in the index by the total number of index entries. The four databases compared differ only in page size; all of the data contained within the databases is the same. (For more information about these databases, please see "Choosing a Page Size", in this document.)

Key

Table and index

# of rows

1 kb

Depth; Fan-out

2 kb

Depth; Fan-out

4 kb

Depth; Fan-out

8 kb

Depth; Fan-out

1

Supplier, primary key

1000

2
62.50

2
125.00

2
250.00

2
333.33

2

Supplier, foreign key Nation

1000

2
90.91

2
142.86

2
333.33

1
1000.0

3

Order, primary key

150 000

4
63.99

3
142.86

3
300.00

2
614.75

4

Order, idx_order_orderdate

150 000

4
83.89

3
164.11

3
313.81

2
585.94

5

Order, idx_order_clerk

150 000

4
76.18

3
160.77

3
328.23

2
595.24

6

Order, foreign key Customer

150 000

4
82.78

3
159.07

3
306.12

2
570.34

From this example, it is clear that the page size has a large impact on the average number of entries that can fit onto a page. Not surprisingly, the index fan-out essentially doubles as the page size doubles. A large fan-out often means that fewer index levels will be required. This has a considerable impact in terms of performance, since each index lookup requires one page read for each of the levels of the index plus one page read for the table page, and a single query can require several thousand index lookups.

Consider, for example, the indexes on the Order table (rows 3-6). Using 1 kb pages, these indexes require four levels; using 8 kb pages, they require only two levels.

This effect is not as pronounced with small tables, because small tables do not have enough entries to reach three or four levels at any page size. In this case, the interior levels of the indexes for these tables on larger page sizes are not completely full.

For example, the index on the primary key of the Supplier table (row 1) has only 1000 rows. Even though the number of leaf pages varies across page sizes, this index has the same number of levels for all of the page sizes.

Using Composite Indexes

An index can contain one, two, or more columns. An index on two or more columns is called a composite index. For example, the following statement creates a two-column composite index:

CREATE INDEX name
ON employee (emp_lname, emp_fname)

A composite index is useful if the first column alone does not provide high selectivity. For example, a composite index on emp_lname and emp_fname is
useful when many employees have the same last name. A composite index on emp_id and emp_lname would not be useful because each employee has a
unique ID, so the column emp_lname does not provide any additional selectivity.

Additional columns in an index can allow you to narrow down your search, but having a two-column index is not the same as having two separate indexes. A
composite index is structured like a telephone book, which first sorts people by their last names, and then all the people with the same last name by their first
names. A telephone book is useful if you know the last name, and even more useful if you know both the first name and the last name, but worthless if you only know the first name and not the last name.

When you create composite indexes, you should think carefully about the order of the columns. Composite indexes are useful for doing searches on all the columns in the index or on the first columns only; they are not useful for doing searches on any of the latter columns alone.

Primary keys that have more than one column are always automatically indexed as composite indexes with their columns in the order that they appear in the table definition, not in the order that they are specified in the primary key definition.

The compressed B-tree index type, introduced in Adaptive Server Anywhere 8, substantially improves performance for composite indexes.

Using Clustered Indexes

Although standard indexes can dramatically improve the performance of statements that search for a specific row or a specific subset of the rows, two rows
appearing sequentially in the index do not necessarily appear on the same page in the database.

However, you can further improve the performance of indexes by creating clustered indexes. Clustered indexes in Adaptive Server Anywhere store the
table rows in approximately the same order as they appear in the corresponding index. Clustered indexes are available in Adaptive Server Anywhere 8.0.2 and
later.

Using the clustered index feature increases the chance that the two rows will appear on the same page in the database. This can lead to performance benefits
by further reducing the number of times each page needs to be read into memory.

For example, in a case where you select two rows that appear sequentially in a clustered index, it is possible that you are retrieving two rows that appear
sequentially on the same page, thus reducing the number of pages to read into memory by half.

The clustering of indexes in Adaptive Server Anywhere is approximate. While the server attempts to preserve the key order, total clustering is not guaranteed. As well, the clustering degrades over time as more and more rows are inserted into your database.

You can implement clustered indexes using the following statements:

  • CREATE TABLE
  • ALTER TABLE
  • CREATE INDEX
  • DECLARE LOCAL TEMPORARY TABLE

Several statements work in conjunction with each other to allow you to maintain and restore the clustering effect:

  • The UNLOAD TABLE statement allows you to unload a table in the order of the index key.
  • The LOAD TABLE statement inserts rows into the table in the order of the index key.
  • The INSERT statement attempts to put new rows on the same table page as the one containing adjacent rows as per the primary key order.
  • The REORGANIZE table statement can restore the clustering by rearranging the rows according to the clustering index. On tables where clustering is not specified, tables are ordered using the primary key.

The Optimizer assumes that the table rows are stored in key order and costs index scans accordingly.

Types of Indexes

Adaptive Server Anywhere supports two types of indexes, and automatically chooses between them depending on the declared width of the indexed columns.
For a total column width that is less than 10 bytes, Adaptive Server Anywhere uses a B-tree index that contains an order-preserving encoding, or hash value,
that represents the indexed data. Hash B-tree indexes are also used when the index key length is longer than one-eighth of the page size for the database or 256 bytes (whichever is lower). For data values whose combined declared length is between these two bounds, Adaptive Server Anywhere uses a compressed B-tree index that stores each key in a compressed form.

Hash B-tree indexes

A hash B-tree index does not store the actual row value(s) from the table. Instead, a hash B-tree index stores an order-preserving encoding of the original data. The number of bytes in each index entry used to store this hash value is termed the hash size, and is automatically chosen by the server based on the declared width of all the indexed columns. The server compares these hashed values as it searches through an index to find a particular row.

When you index a small storage type, such as an integer, the hash value that Adaptive Server Anywhere creates takes approximately the same amount of
space as the original value. For example, the hash value for an integer is 5 bytes in size. If the total hash size is small enough, Adaptive Server Anywhere can tell whether two values are equal, or which one is greater by comparing their hash values. However, it can retrieve the actual value only by reading the row from the corresponding table.

When you index a column containing larger data types, the hash value will often be shorter than the size of the type. For example, if you index a column of string values, the hash value used is at most 9 bytes in length. Consequently, Adaptive Server Anywhere cannot always compare two strings using only the hash values. If the hash values are equal, Adaptive Server Anywhere must retrieve and compare the two actual values from their corresponding rows in the table.

For example, suppose you index the titles of movies, many of which are similar. If you wish to search for a particular title, the index may identify only a set of possible rows. In this case, Adaptive Server Anywhere must retrieve each of the candidate rows and examine the full title.

Compressed B-tree indexes

Compressed B-tree indexes store a compressed form of each indexed value in the index’s internal nodes. To do this, compressed B-tree indexes store the values using Patricia tries, an optimized form of a trie data structure that is augmented with a skip-count to compress its representation. As a result, compressed B-tree indexes offer substantial improvements over hash indexes when the overall data length is reasonably large. More significantly, the compaction algorithm efficiently handles index values that are identical (or nearly so), so that common sub strings within the indexed values have negligible impact on storage requirements and performance. Compressed B-tree indexes are chosen automatically if the sum of the declared width of the indexed columns is between 10 bytes and one-eighth of the database page size, to a maximum of 256 bytes.

Hardware considerations

Increasing the cache size can dramatically improve database performance. Adaptive Server Anywhere 7.0 uses dynamic cache sizing, which grows the cache if it needs more memory and if it can do so without affecting other applications on the system. This prevents database performance from being hindered because of inadequate cache allocation. However, the cache size is limited by the physical memory. You may even want to consider buying more RAM to allow for a bigger cache.

Disk striping can significantly improve the performance of the database because of improved I/O performance. However, because of the nature of disk striping, the failure of one disk can lead to the corruption of all of the database files. Therefore, the reliability of your database is compromised unless you have RAID above level 0. For this reason, you should be sure to implement a good backup plan along with disk striping.

Conclusion

This paper has provided a number of facts about data storage and tips to modify your database in order to achieve performance and storage benefits. The following table summarizes a number of these facts and tips.

Fact

Tip

A database server requires access to several files (database file, transaction log, temporary file).

Separate files onto different physical drives to obtain performance and recovery benefits.

Cache size can have a significant impact on the performance of your database.

  • A larger cache significantly improves performance.
  • If you have a small cache, using smaller page sizes may provide performance benefits.

Page size affects the index fan-out.

Small pages can decrease fan-out and increase the number of levels in an index, which can severely affect performance.

Both large and small page sizes have advantages and disadvantages.

You should choose 2, 4, or 8 kb pages unless you have done thorough research and have a good reason not to. Most large databases perform well with 4 kb pages.

The database never shrinks.

Empty pages remain in the database file to be reused. Rebuilding the database allows you to reclaim these pages, but this process is time-consuming and requires a lot of space.

Fragmented rows can drastically affect the performance of the database.

Avoid adding empty rows and inserting data later.

Index use has an enormous effect on database performance.

Indexes can greatly improve the performance of SELECT statements, but they also take up space and slow down inserts, updates, and deletes.


 


Powered By Sybase