This index page provides summaries of, and pointers to, a number of articles on data issues, Data Warehouse implementations, standards, Technical Architectures, applications development using high-level languages, and other topics. The articles are listed from most popular to least popular, with the newest last.
The list was last updated in January 2003 with the addition of an article on the superiority of column-oriented data structures for WORM-type Data Warehouses. A number of the less popular or obsolete articles have been removed. The article on "The why of data standards" is retained; it continues to be used and referenced around the world, both by academics who assign it as a source for students and by practitioners interested in practical rather than theoretical data issues. In the university case, the message was no doubt "OK, you now have the theory. To understand what actually happens in the real world, read this article." It has also been copied (with permission) to US Government servers for easier access by their employees, so the counter on the article page likely underestimates its popularity by several thousand hits.
The why of data standards - Do you really know your data?
Summary: Much of the data "mess" is hidden from the business users by the filtering which occurs when a report is created. Even many senior IS managers have little idea of the real content of their files. The article provides results from the analysis of millions of values of actual data, illustrates what standards can be applied to correct the problems, and shows why any Data Warehouse implementation must convert the source data to standards.

Data Warehouse Implementation Plan
Summary: An objective for a Data Warehouse is to make all critical business data available in standard form for rapid inquiry, analysis and reporting. To achieve this, it is necessary to move data out of its non-standard form in existing applications by steps such as those outlined in the article. These steps were followed in a successful implementation which resulted in useful data being available within 3 months, and a full capability being available in 6. For most organizations, creating a Data Warehouse should not be a major undertaking.

Data Warehouse Implementation
Summary: Much effort in IS is currently going into creating "Data Warehouses". These are stores of data periodically extracted from older legacy applications, converted to common standards and made accessible for user analysis. The warehouse acts as WORM (Write Once, Read Many times) storage. Where the extract and transfer is performed nightly, they provide access to what is termed 'Near Operational' data and can be used to replace much of the existing reporting. In other cases they are used to store mostly historical data for analysis of trends, market impact, financial status and so on. While Data Warehouses are often implemented with a variety of different clean-up tools, languages, database products and query tools, this article describes an implementation done almost entirely with APL.

It includes a query capability termed "Query by Mail" which enables anyone with access to E-Mail to send queries to the Warehouse and receive responses or extracts of data by return mail. The "query" includes customized analysis of field content to allow identification of fields and records containing invalid data. Built upon a proprietary inverted file system, it provides rapid response to user queries and places little load on the server system.

Data Warehouse oriented Data Analysis Tools
Summary: Data Warehouses implemented from legacy applications data often encounter difficulty mapping data to standards because of the poor quality of the sources and the disparities between documentation and actuality. The tools described here analyze the actual content of the data fields in a flat file so as to provide the information needed for correct mapping of the source data to standards. The tools are generic enough that they can also support the initial warehouse data access capabilities and look-up of data by users. The tools are GUI based to allow interactive use during the analysis phase, and results can be written to text files for transfer to other tools or simply for documentation.

A management perspective of the "J" programming language - Updated 2004
Summary: Since the original article was written, there have been many changes and improvements to the J language. While the mathematically oriented users no doubt drove these, many of the changes have not only simplified the code, but also considerably sped up the operations typical of commercial processing. As a result I have updated the original article with discussion of these changes.

Array processing of commercial data - to drastically cut elapsed times
Summary: Commercial data processing has traditionally used scalar processing of fields in records. Taking advantage of today's high-speed processors, large main memory, and peak disk transfer rates requires a paradigm shift to processing data as arrays. The article illustrates how typical commercial processes can be re-designed for array processing to provide up to two orders of magnitude reduction in elapsed times. While optimally requiring new data structures as well as new processing techniques, the methods are generic enough to apply to many applications, from payroll to billings.

Compressed Data Structures for Data Warehouses
Summary: The increasing importance of the Data Warehouse concept implies a need for an alternative data structure optimized for read-only access. This objective also allows more emphasis on data compression techniques, both to reduce the total volume and, even more importantly, to reduce the disk transfer for faster processing on access.

Queries on Packed data
Summary: Relatively simple code can be used to compress or pack data for a WORM Data Warehouse, and even simpler code to unpack it. This can speed up data transfer rates from the disk by an average of 3 to 5 times (depending on data characteristics) with little increase in CPU times. In addition, many queries can be performed directly against the compressed data, reducing the CPU cycles relative to the uncompressed data and also reducing the amount of data movement in memory. Data movement is a major component of cycle times when dealing with very large arrays of data and can exceed the CPU time for the comparison operations.

New: Performance comparison of DW Data Structures - Rows versus Columns
Summary: This article discusses some of the improvements in computer technology over the last 20 years and how they should have changed the approach to the design of Data Warehouse Data Structures. In particular, it addresses the issue of which data structures make best use of current and future disk drive technology.
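
As a rough illustration of the idea summarized under "Queries on Packed data", the sketch below packs one column of values and then answers an equality query against the packed form without unpacking it. It is not taken from the articles, which describe APL and J implementations; it is a minimal Python sketch that assumes simple dictionary encoding as the packing method, and the names pack_column, unpack_column and rows_equal_to, as well as the sample "region" data, are hypothetical.

```python
from array import array


def pack_column(values):
    """Dictionary-encode one column.

    Returns (dictionary, codes): the sorted distinct values, and one small
    integer code per row that indexes into that dictionary.
    """
    dictionary = sorted(set(values))
    code_of = {value: code for code, value in enumerate(dictionary)}
    # 'H' stores unsigned 16-bit codes; this sketch assumes the column has
    # fewer than 65,536 distinct values.
    codes = array("H", (code_of[value] for value in values))
    return dictionary, codes


def unpack_column(dictionary, codes):
    """Rebuild the original column from the dictionary and the codes."""
    return [dictionary[code] for code in codes]


def rows_equal_to(dictionary, codes, wanted):
    """Return the row numbers whose value equals `wanted`.

    The wanted value is translated to its code once; the scan then compares
    integer codes only and never unpacks the column.
    """
    try:
        target = dictionary.index(wanted)
    except ValueError:
        return []  # value does not occur anywhere in the column
    return [row for row, code in enumerate(codes) if code == target]


# Hypothetical sample column.
region = ["EAST", "WEST", "EAST", "NORTH", "WEST", "EAST"]
dictionary, codes = pack_column(region)
matching_rows = rows_equal_to(dictionary, codes, "EAST")
```

Each row here occupies two bytes regardless of how long the original text value was, and the query touches only those two-byte codes. That is the general effect the summary describes: less data to transfer from disk and less data to move in memory during the comparison.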
Author's Bio - A very brief look at the author's background, work experience and academic qualifications.

Last updated 2003 01 23, Counters reset 1997 12 10.