Performance comparison of DW Data Structures
Rows versus Columns
Summary: This article discusses some of the improvements in computer technology over the last 20 years and how they should have changed the approach to the design of Data Warehouse Data Structures. In particular, it addresses the issue of which data structures make best use of disk drive technology.
Technology
Computer technology continues to evolve at such a rate that any discussion of current capabilities becomes obsolete even as it is written. Nevertheless, there are certain trends that have continued over time that determine the approaches that should be used to take maximum advantage of the improvements in technology.
By 2002 the business desktop processor is a 2+ GHz P4 or equivalent processor, with 256MB or more of DDRAM memory and a 6OGB or larger disk, connected to a 100Mb/s LAN. Such a configuration is almost 1000 times larger and faster than the time-sharing computers used to implement the first Data Warehouses, and infinitely larger in processing power than the character cell terminals used to connect to them. Today, even the smaller DW servers can be economical clusters of multiple such machines, with larger and faster disk drives arranged in RAID configurations. Despite this, the design of DW data structures remains much the same as in the early days. This lack of change should raise questions about whether the design approach still applies, or whether alternative approaches are more valid. Given the magnitude of the change in technology over 20 years, the answer should be obvious.
The Disk Drives
Disk drives are the main repository of DW data, and the main determinants of their performance. Today, disk drives are at 160+GB in a standard 3.5, in. drive, heading for 400GB in a single drive with the greater adoption of "pixie dust" technology. The increased storage density leads to an automatic increase in peak transfer rates, but the head motion speeds do not increase proportionately. Thus random access rates to sets of data scattered over the disk remain much the same over the years, and become an ever smaller fraction of the peak transfer rates and total disk capacity.
To illustrate the difference, accessing 500 byte records randomly located over a disk results in a data read rate of some tens of KB/s. This compares with peak transfer rates now in the tens of MB/s or 1000 times greater. Peak transfer rates go even higher with parallel disks as in some RAID configurations. While there are software techniques to minimize head travel and rearrange the records once read, they cannot overcome a 1000:1 disadvantage.
A typical 7200, rpm drive today has an average track seek time of 9 ms with a latency of some 4 ms. Peak transfer rate from the drive (as opposed to the buffer), is around 40MB/s. Even a less typical 15,000, rpm disk with a special high-speed actuator (3.9 ms. seek versus a more typical 9 ms. on 7200 rpm disks), the I/O rate is still only quoted as 140 I/O per sec. On the same disk, the sustainable peak transfer rate is 48MB/s.
Twenty years ago, a "main frame" disk drive was more likely to be 120MB, 33.3 ms average access time and latency of 8.33 ms on a 3600 rpm, 14, inch platter. Peak transfer rate was 1.3 MB/s. It was on these, or slightly faster drives, that the first DW were implemented. Ten years later, seek times were still in the 20-40 ms range, and peak transfer rates had increased only to the 2-4MB/s range.
Over twenty years, peak transfer rates have increased more than 10 times as much as have random access rates. Any technique oriented to taking advantage of peak transfer rates, rather than random seek times, thus has a major and continuously increasing advantage.
Despite the random access nature of a disk drive, random access is its worst performing feature. Yet, because it provides random access, software and application designers persist in using this feature to the detriment of overall application performance. It is easy, does not require much thought, and random access "seems" the obvious way to utilize a disk drive. Unfortunately, the 1000:1 performance difference between random and sequential access says it is often faster, much faster, to read in a whole lot of data and select from it the desired subset, discarding the rest, than it is to selective pick off the disk only the desired pieces.
The other technology constraint that made a "select from disk" approach essential, that of small main memory, no longer applies. Twenty years ago 32MB of main memory was a big, time-sharing system. Today, it is often not enough to store the OS, and memory capacities on even desktop machines are often at least 10 times greater. The physical constraints that required processing only one record at a time, no longer apply, yet such a design remains the preferred approach.
A further consideration in using a disk drive, is that a random read of a file via an index can read the same file section into the buffer many times, as first one, then later (after the buffer has been reused) another record is read. A poorly used index can result in reading all the data from the file many times, in getting at even a fraction of the records. (The author has observed systems taking hours to read all the records from a file, when reading through the whole file sequentially would take less than a minute.)
Column Orientation
What alternatives data structures exist that can make better use of the disk drive characteristics?
One option for a DW to make better use of a disk drive is to store data in columns, rather than the usual row or record oriented form. All values from a given field are stored in contiguous locations, rather than all values from a given record or row, as in most conventional relational data management systems.
Whether this approach will provide major benefits depends on the nature of the queries, for it is an unfortunate axiom that in evaluating alternative DBMS:
- "One can always formulate a data retrieval test that makes a given data structure perform poorly relative to a competitor."
Its corollary is also true, namely that:
- "One can always formulate a data retrieval test that makes a given data structure perform well relative to a competitor."
Those axioms should be kept in mind during the ensuing discussion. If the examples appear biased to a given data structure, one can always redo the numbers with alternative parameters.
In comparing performance between column-oriented and row-oriented data structures, there are two main determinants of performance for each. For row-oriented data, (all values of a record in adjacent physical locations), the determinant is the average seek time to a randomly located record. This is because both the query and the result extract is likely to refer to records scattered over the file. For column oriented data, (all values of a given field in adjacent physical locations), the major determinant is peak sustainable transfer rate. While a single track may lack the capacity to hold all the data of a really large column of data, switching heads, or switching to an adjacent track, is much faster than a random move. There are secondary factors that adjust the times in each case, but these two values have the major performance impact for each type of data structure.
[I am assuming in this discussion that the file is stored in physically contiguous locations on the disk. If the disk is highly fragmented, with the OS assigning space in small and scattered blocks, even reading a large chunk of data will physically result in many scattered reads, thus defeating the speed objective.]
Performance determinants
Disk actions on a query
For record oriented structures, random selection of records from the file, results in an average access rate determined by the combination of track seek time and latency reduced somewhat by the times the next record is already in the buffer.
On the other hand, access to large sets of data as for column oriented data, is to tracks in a cylinder (depending on the controller logic), or to adjacent tracks with access of 1, ms or less. In addition, since all of a track is usually wanted, one can perhaps assume the disk is smart enough to start reading blocks wherever it lands on the track and rearrange them into the correct sequence after the fact. Thus latency is not, or should not be, much of a consideration in large reads.
The above description indicates that reading chunks of 0.5 to several MB of data at a time (all values of a field) is 1-2 head moves, followed by almost continuous reads. The read of the record structure, on the other hand, is multiple moves, waits, and reads of small records, first in the index, then in the data, each time discarding the rest of the data in the buffer. This is why performance with the column-oriented structure is largely determined by peak transfer rates, while with record-oriented structures it is largely determined by random access times.
Disk actions on result extract
For both structures, the result of a query is pointers to record locations, which meet the search criteria. The next stage is to access the values in the records to perform some further analysis, or to present the results in a report.
In the case of the record oriented structure this means an essentially random access and read of all the flagged records.
For the column orientation it means random accesses to as many values as are required followed by an in memory extract of the flagged values from the columns.
In one case, the more records flagged for the result, the more accesses, while in the other, the more values required (usually many fewer than the number of records) the more accesses, almost independent of the number of records involved.
Even without doing the arithmetic, the type of query for which each structure is superior should be intuitively apparent.
The data content
For the comparisons, (and to simplify the arithmetic) let us assume a source file or table with 250 Byte records of some 30 fields. Three table sizes are assumed at 100K, 1M and 10M rows or records each.
In a heavily indexed relational structure 1M records means 250MB of data and, depending on DBMS, up to 4 times that (1GB) in index structure and buffers. Access involves several index reads followed by the data record reads needed for analysis or reporting.
In the column-oriented structure, the resulting size is between 60 and 250MB depending on compression techniques used and type of data. (See related articles on data compression.) Typically, a lightly compressed 1M record file might be 100MB of data at a compression of 2.5 times. At worst, the uncompressed file is 250MB.
[If queries usually result in an extract of very few records, it may be desirable, in the column-oriented version; to also retain an un-indexed record oriented structure. This second structure is only accessed when very few records are needed after a search, but in most cases the extra structure is not warranted.]
The results
The data table below attempts to represent the times for typical queries on the two structures for tables of 100K, 1M and 10M rows. For the column-oriented structures, numbers are given for extracting 5 or 15 sets of values. The first is typical of an analysis of a few values, while the 15 fields are more typical of a report. As discussed, the number or rows has little effect on times.
For the Row-oriented data, values are given for taking from 1 to 100,000 rows out of the total. The colored cells show where the transition occurs between the preferred processes. For the row orientation, problem types falling to the left of the color indicate rows are preferable to columns. To the right columns are faster. With time, the row orientation remains superior for an ever-smaller problem set as shown by the leftward trend of the colored cells with time. Note also that both row and column numbers change by factors of ten.
Disk access times for each case
| Age | File Size | Column Oriented Fields Extracted | Row Oriented Records Extracted |
| 5 | 15 | 1 | 10 | 100 | 1000 | 10000 | 100000 |
| -20 years |
10M | 1000.00 | 3000.00 | 0.50 | 0.95 | 5.45 | 50.45 | 500.45 | 5000.45 |
| 1M | 100.00 | 300.00 | 0.50 | 0.95 | 5.45 | 50.45 | 500.45 | 5000.45 |
| 100K | 10.00 | 30.00 | 0.50 | 0.95 | 5.45 | 50.45 | 500.45 | 5000.45 |
| -10 years |
10M | 200 | 600.00 | 0.40 | 0.76 | 4.36 | 40.36 | 400.36 | 4000.36 |
| 1M | 20.00 | 60.00 | 0.40 | 0.76 | 4.36 | 40.36 | 400.36 | 4000.36 |
| 100K | 2.00 | 6.00 | 0.40 | 0.76 | 4.36 | 40.36 | 400.36 | 4000.36 |
| Today |
10M | 6.00 | 18.00 | 0.15 | 0.29 | 1.64 | 15.14 | 150.13 | 1500.14 |
| 1M | 0.60 | 1.80 | 0.15 | 0.29 | 1.64 | 15.14 | 150.13 | 1500.14 |
| 100K | 0.06 | 0.18 | 0.15 | 0.29 | 1.64 | 15.14 | 150.13 | 1500.14 |
The three sets of data represent numbers for some 20 years ago, over 10 years ago, and current.
The three lines in each set of data are for tables of 10M rows down to 100K rows. In the very old times, files tended to be smaller to match the disk capacity, while today, disk size is not really a constraint and larger files are more common even though they could be partitioned.
The data table is unfortunately rather difficult to decipher and I have been unable to come up with a simple graph to better show the results. If you will bear with me I will try to explain the numbers.
As discussed, the main determinant of elapsed time is average seek time in the case of a record oriented structure, and maximum transfer rate in the case of the column oriented structure. While other parameters affect the numbers in each case, these have the largest impact and essentially determine the results.
What the numbers show is that with the old technology, row oriented structures tended to be faster when up to a few thousand records were involved in any analysis or report. The advantage was less with the more typical (at the time) smaller files - the cutoff was then at a few hundred records of extracted result data. For very large numbers of extracted rows, the inverted structures were a clear winner even then, as reading even all the column values and throwing 99% of them away, (particularly for smaller files), was faster than randomly accessing a large number of rows.
Twenty years ago the switch occurred at anything over 1% of the records in the file. If more than 1% of the records needed to be analyzed or used in a report, the column structure was faster. Ten years ago, it was closer to 0.1% and now is around 0.01%. In other words, if the analysis or report creation after the query needs more than a few hundred records out of a million records, the inverted column structure is faster. In the very near future, the numbers are even more in favor of the column structure.
With the maximum sustainable transfer rate having increased some 50 times over 20 years, while seek times have improved only a little over 3 times, such a result is to be anticipated.
Furthermore, these theoretical results were confirmed in multiple tests between competing DBMS, each of which used different internal structures. The tests also confirmed the axioms, for by understanding the internals, one could always formulate a query to show one or another system to have an advantage or disadvantage.
However, one should note that the column, oriented table is not a panacea. While the numbers indicate it is perfect for a WORM (Write Once Read Many) Data Warehouse structure, it is mostly pretty useless for updateable transaction data capture.
Joins
For a DW, the foreign key to primary key join can and should be static. Since any tables in the DW are a WORM type, the joins can be created once the tables are loaded. Since there is no record level updating, there is no need to create a new join at query time to ensure all the latest updates are included. Avoiding a dynamic join creation removes a major startup process load typical of most DBMS handling queries on live data.
VLM impact
The use of VLM or Very Large Memory to overcome the actuator speed problem is an expensive option, (although decreasingly so) which can equally be applied to caching the "hot" segments of a column-oriented file. Alternatively, other articles have shown that a compressed version of the column structure occupies about 1/3 the space of the original, while a fully indexed DBMS occupies about 3-5 times the space of the original raw data. Consequently, VLM applied to the column data would require only about 1/10 as much memory, or, alternatively, be able to store 10 times the amount of data.
Disk Design Constraints
It is interesting that over 20 years ago, disk drives would read data on all heads simultaneously. Thus the way to increase data capacity and rates was to add more platters and heads. Size alone dictated high mass and inertias, limiting the speed of head repositioning. Miniaturization and increased storage density has allowed large capacities with one to four small platters and 1-8 heads. The size reduction alone decreases head movement and mass of the actuators, allowing faster repositioning. There were no technical breakthroughs needed in that area for performance improvement, they were all byproducts of the increased bit density and miniaturization.
Improvements in the random access speed are restricted by the inertia of the actuator, while the higher transfer rate is promoted by what is now a doubling of storage density every 12-18 months. For mass storage, physical devices, such as disk drives, there is nothing on the technical horizon that will alter this situation.
Today, data is read from only one head at a time, even when a disk has 4 platters. The reason for this "less than optimal" transfer is a combination of economics and technology.
On the economics side, the target is a low cost PC where adding the control electronics for simultaneous reads adds to the cost. On the technology side, the resulting data transfer rate from 8 heads in parallel far exceeds the capacity of the associated controllers. Furthermore, a transfer rate approaching 1GB/s (from a 15,000 rpm disk with 4 platters) is of little value when accessing randomly scattered records of 500 bytes each, or even files of a few KB. The only technology that could benefit from such extreme transfer rates is a large Data Warehouse using partially inverted files (a column oriented structure). Since that is not the market norm, neither is there any availability of such disk drives.
Conclusion
Over time, the advantage of the column-oriented structure over the row-oriented one for WORM type Data warehouses, has increased markedly. There appears to be nothing looming on the technical horizon that is likely to change this situation - all the trends increase the advantage. Techniques such as RAID configurations or VLM also provide as much or more advantage to column structures as they do for the more traditional ones.
Copyright © 2003 K. I. Gordon. All rights reserved.