OLAP ile ilgili yazilar:
An analysis of what the increasingly misused OLAP term is supposed to mean
One problem that confronted us when first researching The OLAP Report was deciding which products fell into the category. Deciding what is an OLAP has not got any easier since then, as more and more vendors claim to have 'OLAP compliant' products, whatever that may mean. It is not possible to rely on the vendors' own descriptions and membership of the defunct OLAP Council is not a reliable indicator of whether or not a company produces OLAP products. For example, several significant OLAP vendors were never members or resigned, and several members are not OLAP vendors. Membership of the replacement Analytical Solutions Forum is no more reliable a guide.
The Codd rules also turned out to be an unsuitable way of detecting 'OLAP compliance', so we were forced to create our own definition. It had to be simple, memorable and product-independent, and the resulting definition is the 'FASMI' test. The key thing that all OLAP products have in common is multidimensionality, but that is not the only requirement for an OLAP product.
We wanted to define the characteristics of an OLAP application in a specific way, without dictating how it should be implemented. As our research has shown, there are many ways of implementing OLAP compliant applications, and no single piece of technology should be officially required, or even recommended. Of course, we have studied the technologies used in commercial OLAP products and this report provides many such details. We have suggested in which circumstances one approach or another might be preferred, and have also identified areas where we feel that all the products currently fall short of what we regard as a technology ideal.
Our definition is designed to be short and easy to remember — 12 rules or 18 features are far too many for most people to carry in their heads; we are pleased that we were able to summarize the OLAP definition in just five key words: Fast Analysis of Shared Multidimensional Information — or, FASMI for short. This definition was first used in early 1995, and has not needed revision since.
FAST means that the system is targeted to deliver most responses to users within about five seconds, with the simplest analyses taking no more than one second and very few taking more than 20 seconds. Recent research in The Netherlands has shown that end-users assume that a process has failed if results are not received with 30 seconds, and they are apt to hit 'Alt+Ctrl+Delete' unless the system warns them that the report will take longer. Even if they have been warned that it will take significantly longer, users are likely to get distracted and lose their chain of thought, so the quality of analysis suffers. This speed is not easy to achieve with large amounts of data, particularly if on-the-fly and ad hoc calculations are required. Vendors resort to a wide variety of techniques to achieve this goal, including specialized forms of data storage, extensive pre-calculations and specific hardware requirements, but we do not think any products are yet fully optimized, so we expect this to be an area of developing technology. In particular, the full pre-calculation approach fails with very large, sparse applications as the databases simply get too large (the database explosion problem), whereas doing everything on-the-fly is much too slow with large databases, even if exotic hardware is used. Even though it may seem miraculous at first if reports that previously took days now take only minutes, users soon get bored of waiting, and the project will be much less successful than if it had delivered a near instantaneous response, even at the cost of less detailed analysis.
ANALYSIS means that the system can cope with any business logic and statistical analysis that is relevant for the application and the user, and keep it easy enough for the target user. Although some pre-programming may be needed, we do not think it acceptable if all application definitions have to be done using a professional 4GL. It is certainly necessary to allow the user to define new ad hoc calculations as part of the analysis and to report on the data in any desired way, without having to program, so we exclude products (like Oracle Discoverer 3.1) that do not allow adequate end-user oriented calculation flexibility. We do not mind whether this analysis is done in the vendor's own tools or in a linked external product such as a spreadsheet, simply that all the required analysis functionality be provided in an intuitive manner for the target users. This could include specific features like time series analysis, cost allocations, currency translation, goal seeking, ad hoc multidimensional structural changes, non-procedural modeling, exception alerting, data mining and other application dependent features. These capabilities differ widely between products, depending on their target markets.
SHARED means that the system implements all the security requirements for confidentiality (possibly down to cell level) and, if multiple write access is needed, concurrent update locking at an appropriate level. Not all applications need users to write data back, but for the growing number that do, the system should be able to handle multiple updates in a timely, secure manner. This is a major area of weakness in many OLAP products, which tend to assume that all OLAP applications will be read-only, with simplistic security controls. Even products with multi-user read-write often have crude security models; an example is Microsoft OLAP Services.
MULTIDIMENSIONAL is our key requirement. If we had to pick a one-word definition of OLAP, this is it. The system must provide a multidimensional conceptual view of the data, including full support for hierarchies and multiple hierarchies, as this is certainly the most logical way to analyze businesses and organizations. We are not setting up a specific minimum number of dimensions that must be handled as it is too application dependent and most products seem to have enough for their target markets. Again, we do not specify what underlying database technology should be used providing that the user gets a truly multidimensional conceptual view.
INFORMATION is all of the data and derived information needed, wherever it is and however much is relevant for the application. We are measuring the capacity of various products in terms of how much input data they can handle, not how many Gigabytes they take to store it. The capacities of the products differ greatly — the largest OLAP products can hold at least a thousand times as much data as the smallest. There are many considerations here, including data duplication, RAM required, disk space utilization, performance, integration with data warehouses and the like.
We think that the FASMI test is a reasonable and understandable definition of the goals OLAP is meant to achieve. We encourage users and vendors to adopt this definition, which we hope will avoid the controversies of previous attempts.
The techniques used to achieve it include many flavors of client/server architecture, time series analysis, object-orientation, optimized proprietary data storage, multi-threading and various patented ideas that vendors are so proud of. We have views on these as well, but we would not want any such technologies to become part of the definition of OLAP. Vendors who are covered in this report had every chance to tell us about their technologies, but it is their ability to achieve OLAP goals for their chosen application areas that impressed us most.
In 1993, E.F. Codd & Associates published a white paper, commissioned by Arbor Software (now Hyperion Solutions), entitled 'Providing OLAP (On-line Analytical Processing) to User-Analysts: An IT Mandate'. Dr Codd is, of course, very well known as a respected database researcher from the 1960s through to the late 1980s and is credited with being the inventor of the relational database model, but his OLAP rules proved to be controversial due to being vendor-sponsored, rather than mathematically based. It is also unclear how much involvement Dr Codd himself had with this activity, but it seems likely that his role was very limited. The white paper needs to be regarded as a vendor-published document (which it is) rather than as an academic paper (which it is not). Note that this paper was not published by Codd and Date.
The OLAP white paper included 12 rules, which are now well known (and available for download from vendors' Web sites). They were followed by another six (much less well known) rules in 1995 and Dr Codd also restructured the rules into four groups, calling them 'features'. The features are briefly described and evaluated here, but they are now rarely quoted and little used.
F1: Multidimensional Conceptual View (Original Rule 1). Few would argue with this feature; like Dr Codd, we believe this to be the central core of OLAP. Dr Codd included 'slice and dice' as part of this requirement.
F2: Intuitive Data Manipulation (Original Rule 10). Dr Codd prefers data manipulation to be done through direct actions on cells in the view, without recourse to menus or multiple actions. One assumes that this is to be by using a mouse (or equivalent), but Dr Codd does not actually say so. Many products fail on this, because they do not necessarily support double clicking or drag and drop. The vendors, of course, all claim otherwise. In our view, this feature adds little value to the evaluation process. We think that products should offer a choice of modes (at all times), because not all users like the same approach.
F3: Accessibility: OLAP as a Mediator (Original Rule 3). In this rule, Dr Codd essentially describes OLAP engines as middleware, sitting between heterogeneous data sources and an OLAP front-end. Most products can achieve this, but often with more data staging and batching than vendors like to admit.
F4: Batch Extraction vs Interpretive (New). This rule effectively requires that products offer both their own staging database for OLAP data as well as offering live access to external data. We agree with Dr Codd on this feature and are disappointed that only a minority of OLAP products properly comply with it, and even those products do not often make it easy or automatic. In effect, Dr Codd is endorsing multidimensional data staging plus partial pre-calculation of large multidimensional databases, with transparent reach-through to underlying detail. Today, this would be regarded as the definition of a hybrid OLAP, which is indeed becoming the most popular architecture, so Dr Codd has proved to be very perceptive in this area.
F5: OLAP Analysis Models (New). Dr Codd requires that OLAP products should support all four analysis models that he describes in his white paper (Categorical, Exegetical, Contemplative and Formulaic). We hesitate to simplify Dr Codd's erudite phraseology, but we would describe these as parameterized static reporting, slicing and dicing with drill down, 'what if' analysis and goal seeking models, respectively. All OLAP tools in this report support the first two (but some other claimants do not fully support the second), most support the third to some degree (but probably less than Dr Codd would like) and few support the fourth to any usable extent. Perhaps Dr Codd was anticipating data mining in this rule?
F6: Client Server Architecture (Original Rule 5). Dr Codd requires not only that the product should be client/server but that the server component of an OLAP product should be sufficiently intelligent that various clients can be attached with minimum effort and programming for integration. This is a much tougher test than simple client/server, and relatively few products qualify. We would argue that this test is probably tougher than it needs to be, and we prefer not to dictate architectures. However, if you do agree with the feature, then you should be aware that most vendors who claim compliance, do so wrongly. In effect, this is also an indirect requirement for openness on the desktop. Perhaps Dr Codd, without ever using the term, was thinking of what the Web would deliver? Or perhaps he was anticipating a widely accepted API standard, which OLE DB for OLAP is expected to become.
F7: Transparency (Original Rule 2). This test is also a tough but valid one. Full compliance means that a user of, say, a spreadsheet should be able to get full value from an OLAP engine and not even be aware of where the data ultimately comes from. To do this, products must allow live access to heterogeneous data sources and a full function spreadsheet add-in, with the OLAP server engine in between. Although all vendors claimed compliance, many did so by outrageously rewriing Dr Codd's words. Even Dr Codd's own analyses of Essbase and TM1 ignore part of the test. In fact, there are a few products that do fully comply with the test, including Acumate and Express, but neither Essbase nor TM1 do so (because they do not support live access, transparent access to external data), in spite of Dr Codd's apparent endorsement. Most fail to give either full spreadsheet access or live access to heterogeneous data sources. Like the previous feature, this is a tough test for openness.
F8: Multi-User Support (Original Rule 8). Dr Codd recognizes that OLAP applications are not all read-only and says that, to be regarded as strategic, OLAP tools must provide concurrent access (retrieval and update), integrity and security. We agree with Dr Codd, but are also aware that many OLAP applications are still read-only. Again, all the vendors claim compliance but, on a strict interpretation of Dr Codd's words, few are justified in so doing.
F9: Treatment of Non-Normalized Data (New). This refers to the integration between an OLAP engine and de-normalized source data. Dr Codd points out that any data updates performed in the OLAP environment should not be allowed to alter stored de-normalized data in feeder systems. He could also be interpreted as saying that data changes should not be allowed in what are normally regarded as calculated cells within the OLAP database. For example, Essbase allows this, and Dr Codd would perhaps disapprove.
F10: Storing OLAP Results: Keeping Them Separate from Source Data (New). This is really an implementation rather than a product issue, but few would disagree with it. In effect, Dr Codd is endorsing the widely held view that read-write OLAP applications should not be implemented directly on live transaction data, and OLAP data changes should be kept distinct from transaction data. The method of data write-back used in Microsoft OLAP Services is the best implementation of this, as it allows the effect of data changes even within the OLAP environment to be kept segregated from the base data.
F11: Extraction of Missing Values (New). All missing values are cast in the uniform representation defined by the Relational Model Version 2. We interpret this to mean that missing values are to be distinguished from zero values. In fact, in the interests of storing sparse data more compactly, a few OLAP tools do break this rule, without great loss of function.
F12: Treatment of Missing Values (New). All missing values to be ignored by the OLAP analyzer regardless of their source. This relates to Feature 11, and is probably an almost inevitable consequence of how multidimensional engines treat all data.
F13: Flexible Reporting (Original Rule 11). Dr Codd requires that the dimensions can be laid out in any way that the user requires in reports. We would agree, and most products are capable of this in their formal report writers. Dr Codd does not explicitly state whether he expects the same flexibility in the interactive viewers. We prefer that it is available, but note that relatively fewer viewers are capable of it. This is one of the reasons that we prefer that analysis and reporting facilities be combined in one module.
F14: Uniform Reporting Performance (Original Rule 4). Dr Codd requires that reporting performance be not significantly degraded by increasing the number of dimensions or database size. Curiously, nowhere does he mention that the performance must be fast, merely that it be consistent. In fact, our experience suggests that merely increasing the number of dimensions or database size does not affect performance significantly in fully pre-calculated databases, so Dr Codd could be interpreted as endorsing this approach, which may not be a surprise given that Arbor Software sponsored the paper. However, reports with more content do take longer (in the good products, performance is almost linearly dependent on the number of cells used to produce the report, which may be more than appear in the finished report) and some dimensional layouts will be slower than others, because more disk blocks will have to be read. There are differences between products, but the principal factor that affects performance is the degree to which the calculations are performed in advance and where live calculations are done (client, multidimensional server engine or RDBMS). This is far more important than database size, number of dimensions or report complexity.
F15: Automatic Adjustment of Physical Level (Supersedes Original Rule 7). Dr Codd requires that the OLAP system adjusts its physical schema automatically to adapt to the type of model, data volumes and sparsity. We agree with him, but are disappointed that most vendors fall far short of this noble ideal. We would like to see more progress in this area and also in the related area of determining the degree to which models should be pre-calculated (a major issue that Dr Codd ignores). The Panorama technology, acquired by Microsoft in October 1996, broke new ground here, and users can now benefit from it in Microsoft OLAP Services.
F16: Generic Dimensionality (Original Rule 6). Dr Codd takes the purist view that each dimension must be equivalent in both its structure and operational capabilities. However, he does allow additional operational capabilities to be granted to selected dimensions (presumably including time), but he insists that such additional functions should be grantable to any dimension. He does not want the basic data structures, formulae or reporting formats to be biased towards any one dimension. This has proven to be one of the most controversial of all the original 12 rules. Technology focused products tend to largely comply with it, so the vendors of such products support it. Application focused products usually make no effort to comply, and their vendors bitterly attack the rule. With a strictly purist interpretation, few products fully comply. We would suggest that if you are purchasing a tool for general purpose, multiple application use, then you want to consider this rule, but even then with a lower priority. If you are buying a product for a specific application, you may safely ignore the rule.
F17: Unlimited Dimensions & Aggregation Levels (Original Rule 12). Technically, no product can possibly comply with this feature, because there is no such thing as an unlimited entity on a limited computer. In any case, few applications need more than about eight or ten dimensions, and few hierarchies have more than about six consolidation levels. Dr Codd suggests that if a maximum must be accepted, it should be at least 15 and preferably 20; we believe that this is too arbitrary and takes no account of usage. You should ensure that any product you buy has limits that are greater than you need, but there are many other limiting factors in OLAP products that are liable to trouble you more than this one. In practice, therefore, you can probably ignore this requirement.
F18: Unrestricted Cross-dimensional Operations (Original Rule 9). Dr Codd asserts, and we agree, that all forms of calculation must be allowed across all dimensions, not just the 'measures' dimension. In fact, many products which use only relational storage are weak in this area. Most products with a multidimensional database are strong. These types of calculations are important if you are doing complex calculations, not just cross tabulations, and are particularly relevant in applications that analyse profitability.