Data Warehousing Concepts for AS/400Introduction AS/400 information warehouse architecture and components Data warehousing is an increasingly popular and powerful concept of applying information technology to solve business problems. Meta Group, Inc., a leading consultant in the data warehousing environment, suggests that over 900f the Fortune 2000 businesses will put into place a data warehouse by the end of 1996. But what is data warehousing, and why is it so important to so many businesses? How can IBM help move your company into the data warehousing environment? Why does AS/400 Advanced Series make an excellent data warehousing platform? This paper focuses on answers to these questions and describes the products and components that make up AS/400 data warehousing solutions. Data warehousing is a concept. It is not a product that you can buy off the shelf. It is a set of hardware and software components that can be used to better analyze the massive amounts of data that companies are accumulating to make better business decisions. The data you use to operate your business represents a wealth of knowledge that you may not be fully tapping into. It is an asset that is probably not being used to its fullest potential. Data warehousing can help you take advantage of the knowledge base you've created over time--your data! The data warehouse can be used to:
In short, data warehousing can be a solution for maintaining your company's competitive edge. A data warehouse customer example One AS/400 customer, a large supplier of consumer goods, found that their sales force lacked the kind of information needed to expand their business. Monthly sales reports simply were not timely or detailed enough to be very useful. To help make strategic business decisions, the company needed to analyze historical information to be able to spot trends. Furthermore, the company needed this analysis to be fast and on-demand. Their existing applications could not meet either of these requirements. With their new AS/400 data warehouse, the company's salespeople can formulate sales strategies based on the information the warehouse provides. This includes information about the success of previous promotions, regional trends, product profitability, and the effect of product packaging. Salespeople have access to immediate information in order to make on-the-spot decisions that are customized to particular clients. Other company departments, such as finance and operations, are also using the warehouse to identify and analyze product successes across regions, salespeople, and time. Personnel and machinery resources can be shifted based on demand or projections, to ensure that those resources are optimized for production purposes. Data warehousing can be a key differentiator in many different industries. Data warehouse applications include:
Data warehousing concepts and terms Operational versus informational data Operational data is the data you use to run your business. This data is what is typically stored, retrieved, and updated by your Online Transactional Processing (OLTP) system. An OLTP system may be, for example, a reservations system, an accounting application, or an order entry application. Operational data is typically stored in a relational database, but may be stored in legacy hierarchical or flat file formats as well. Some of the characteristics of operational data include:
Informational data is typically stored in a format that makes analysis much easier. Analysis can be in the form of decision support (queries), report generation, executive information systems, and more in-depth statistical analysis. Informational data is created from the wealth of operational data that exists in your business. Informational data is what makes up a data warehouse. Informational data is typically:
Transforming operational data into informational data Creating the informational data, that is, the data warehouse, from the operational systems is a key part of the overall data warehousing solution. Building the informational database is done with the use of transformation or propagation tools. These tools not only move the data from multiple operational systems, but often manipulate the data into a more appropriate format for the warehouse. This could mean:
Why is this transformation necessary? Consider operational data that contains every detail about every purchase made in a retail outlet. This data holds the answers to the questions:
But because the data is stored in a format appropriate for recording these transactions, querying this data to analyze it can be a very time consuming effort and can adversely affect the performance of the transactional system. A business analyst tracking sales may want to review data collected from every store by product to forecast inventories, determine profit margins, or track revenues compared to last year. The ability to generate summary level informational data from the operational data provides the performance benefits the analyst can take advantage of to quickly view the trends and problem areas affecting the business. Multi-dimensional analysis and OLAP Relational databases store data in a two dimensional format: tables of data represented by rows and columns. Multi-dimensional analysis solutions, commonly referred to as On-Line Analytical Processing (OLAP) (3) solutions, offer an extension to the relational model to provide a multi-dimensional view of the data. For example, in multi-dimensional analysis, data entities such as products, geographies, time periods, and sales channels may all represent different dimensions. Actual sales revenues and forecasted revenues may represent measurements to analyze. Multi-dimensional data structures provide both a mechanism to store the data and a way for a business analyst to, for example, view actual sales versus forecast numbers across the different dimensions in a very timely, very powerful fashion. The power of the data is given to the users who need it most, without the need for users to wait for complex reports to be generated. Multi-dimensional solutions provide the ability to:
Multi-dimensional data structures can be implemented in many different manners. Relational databases can support this structure through:
Another method of providing multi-dimensional analysis is through the use of specialized databases that form a "cube"-like structure for achieving high performance analysis. Data marts are workgroup or departmental warehouses, which are small in size, typically 1-10GB. The data mart contains informational data that is departmentalized, tailored to the needs of the specific departmental work group. Multi-dimensional data structures are one way of defining a data structure that can be used for data marts. Information about the data warehouse and the data that is contained in the data warehouse comes in two basic parts. There is technical data the warehouse administrator uses, and there is business data that is of use to the warehouse users. All of this data is referred to as metadata, data about the data. The technical data contains a description of the operational database and a description of the data warehouse. From these two descriptions, or schema, the data movement operation can be implemented. This data helps the data warehouse administrators maintain the data warehouse and know where all of the data is coming from. The business data helps users find information in the data warehouse without knowing the underlying implementation of the database. This information is presented in business terms, instead of the terms used by the programmers when the database was built. The business data also gives the user information about:
Business intelligence software and data mining Business intelligence software is a fairly new term referring to the tools that are used to analyze the data. This software can consist of:
These tools are typically easy to use and graphically oriented, with point-and-click functionality. The tools often can present the data to the user in tabular or graphical format, in a report, or with key performance indicators highlighted. A significant benefit of today's analysis tools is the ability to quickly and easily put the power of the data in the hands of the people who can best use it. Data mining tools, such as Intelligent Miner from IBM, can find potential gold mines of valuable information in the massive amounts of data that a company accumulates. These tools take decision support and executive information systems to a new level. With traditional DSS and EIS systems, the user forms a hypothesis and uses the query tools to verify or reject the hypothesis. With data mining, the system researches the data and determines patterns, classifications, and associations, while the analyst determines what to do with the results. Data mining has been quite useful in the retail industry to analyze consumer buying patterns and form marketing programs to take advantage of the analysis results. For instance, data mining can find patterns in your data to answer questions like:
Once the buying patterns have been discovered, the retailer can use this information to tailor a marketing strategy that appeals to each type of buyer, thereby maximizing profits or minimizing costs by optimizing inventory management. But the retail industry is not the only industry to take advantage of data mining. Other uses for data mining include: risk assessment and portfolio management in the finance industry; fraud detection and policy assessments in health and insurance; and optimization, scheduling, and visual inspection systems in manufacturing and process control. AS/400 information warehouse architecture and components IBM's approach to data warehousing is to provide solutions for building the warehouse and to assist the decision-making process through a set of products and services. This set of products and services, or components, defines the AS/400 information warehouse architecture. The information warehouse architecture provides the means to change raw data into information for making effective business decisions--the emphasis on information, not data. This architecture provides the overall structure for a comprehensive, flexible, and integrated data warehouse solution.
There are four main components of the AS/400 information warehouse architecture. These are:
Transformation/propagation tools A key component of the warehouse implementation is the loading of the warehouse. This component consists of building the informational database from the operational data in the format optimized for decision support. It also includes regularly updating the informational database as additional transactions to the operational system(s) occur. The complexity of loading the warehouse can vary depending on a number of factors, including compatibility of operational and informational systems, amount of data massaging that must occur, and timing requirements of the data. Moving data between homogeneous systems is always easier than moving data between heterogeneous systems, because of different database support, differences in (or lack of) support of industry standards for database and/or networks, different security implementations, and so on. Special programming may be required to ensure that the integrity of data is maintained in a heterogeneous environment.
Data warehouse database servers--the heart of the warehouse There are at least five requirements of a data warehouse server:
The AS/400 Advanced Server models, with 64-bit PowerPC technology, provide an excellent data warehouse solution. This line of AS/400 servers combined with the most widely installed multi-user database in the industry, DB2 for OS/400, provide the basis for meeting the five requirements noted above. Let's take a look at some of these individually. The AS/400 Advanced Server models are specifically tuned for the type of work loads used in data warehousing applications. For example, the DSS or EIS tools used to access the warehouse utilize a client/server database serving workload. The loading of the data warehouse, massaging of the data, and additional analysis tools take advantage of the high performance batch mode on the server models. Scalability within the server models provides data mart type systems, low-end servers that can handle 10 to 25 users, all the way to enterprise-wide warehouses supporting very large databases and large numbers of users. The PowerPC based server models, combined with the clustering technologies described in reference #1, provide an almost unlimited growth path for data warehouse implementations. Parallel processing techniques Storing and analyzing very large amounts of data require the database to take full advantage of the hardware it runs on, in order to get the most performance out of the system and meet the user's expectations. Several advances in the area of parallel computing have been announced for AS/400 and DB2 for OS/400, available in three stages. Parallel I/O processing: The first stage of the parallel technologies is parallel I/O processing. This feature, available in Version 3 Release 1 and subsequent releases of OS/400, provides parallel computing to occur at the I/O processor level for a single job. AS/400 business computing systems consist of multiple processors: the main CPU(s) and multiple I/O processors (microprocessor-based adapters) on which to offload the main processors and create a more efficient system. This has been an inherent feature of the AS/400 architecture since its inception. Parallel I/O processing also takes advantage of another inherent architectural feature of the AS/400: single level storage. Single level storage allows the system to automatically spread data across all available disk drives in the system for more efficient processing. No manual "data striping" is required on an AS/400 system.
Parallel I/O processing means that a single user can submit a query operation against the database, and multiple I/O processor tasks are initiated internally to process the request. This capability to have multiple resources in the system working to retrieve the query result data has produced, in some cases, up to a 700 0mprovement in I/O bound query performance. Symmetric multiprocessing: The second stage is symmetric multiprocessing, or SMP. Similar to parallel I/O processing, SMP will take advantage of multiple main CPUs for a single job/query. DB2 Symmetric Multiprocessing for OS/400 is required for this function. AS/400 business computing systems today can contain up to four CPUs, in addition to the multiple I/O processors. With SMP, all four CPUs and the I/O processors can work together to retrieve the query result data for a single job. DB2 for OS/400 uses industry-leading techniques in its symmetric multiprocessing implementation to actually break down the query (or SQL) statement across the multiple processors. SMP has resulted in performance increases of over 400 0.000000or certain queries. Even single processor configurations can make use of the SMP feature, and have shown significant performance improvements through enhanced optimization techniques. Loosely Coupled systems: (Ref #1.) The third stage provides an almost unlimited scalability option. Loosely coupled systems will provide the capability to attach multiple (up to 32) systems together in a "Shared Nothing" (4) cluster. Once connected, those loosely coupled systems appear as a single database to the users; the database can be spread across the storage unit on each coupled system. Whereas there is a limit of 520 gigabytes on the high end server models, with the loosely coupled environment you can expand your disk capacity to 16 terabytes. DB2 Multisystem for OS/400 must be installed to support the loosely coupled technology. These multiple systems can be coupled together using any form of communications transport. A typical environment might be through the AS/400 high-speed, fiber-optic connection product, OptiConnect for OS/400. The high transfer rate provided by OptiConnect (1 gigabit per second on PowerPC systems) prevents the connection method from being a bottleneck when passing large amounts of data across a coupled set of systems. The combination of SMP with loosely coupled systems effectively provides a massively parallel processing (MPP) environment. Extremely large data warehouse implementations can now be achieved. In a single query, it is feasible that you could have as many as 128 (32 coupled systems times four CPUs in each system) CPUs all working together with potentially hundreds of I/O processors to retrieve the query result data. A major focus area for AS/400 development has been to ensure compliance with industry and de facto standards for database access. Providing support for open standards allows AS/400 customers the most flexibility in choosing tools and inter-operating with non-AS/400 platforms and databases. Some of these standards include:
The benefit of this is that client/server tools that also support these standards can very easily use DB2 for OS/400 as their database server. Many decision support and EIS front-end tools have been successfully tested with the new AS/400 ODBC driver that is included in the Client Access product. For an extensive list of client/server tools, refer to the AS/400 Client/Server Applications Directory, G325-6202. Internet access and Lotus Notes: The Internet and collaborative computing are important components of today's information technology departments. AS/400 support of the Internet and the leading collaborative groupware solution, Lotus Notes, enhances AS/400 data warehouse capabilities. All that is needed to access the data warehouse by way of the Internet is a standard web browser on the client PC to allow the AS/400 HTTP World Wide Web server to access your data. One way to access the data warehouse using the Internet is through server-based applications. Any AS/400 application can be accessed through the Internet using the 5250/HTML Workstation Gateway. The Workstation Gateway enables the web browser to act as an AS/400 workstation by converting text-based data streams into HTML (hyper-text markup language) documents as the application runs. Therefore, an existing AS/400 application that accesses your data warehouse can be accessed through the Internet, and new applications can be created quickly with existing tools, libraries, and procedures. A second way to access your data through the Internet is by using the DB2 World Wide Web (WWW) offering that is part of Internet Connection for AS/400. DB2 WWW is a program that works from SQL and HTML macro files to set up DB2 for OS/400 queries, with no programming required. You use HTML forms on web browser displays to specify input values for the query. The input values are sent back to the server to query the data warehouse using SQL statements.
As a third option, you can write a program, using the Internet standard Common Gateway Interface (CGI), to collect input from the user with a web browser and to run a DB2 for OS/400 query. This provides the maximum flexibility for creating dialogs with the user to gather input and to display the results, perhaps using graphs and charts to show the information. The AS/400 World Wide Web server simplifies the accessibility of your data warehouse data to your company, customers, and potential markets. Lotus Notes is an industry leading groupware product. Its powerful collaborative computing and data replication features can be seamlessly integrated with the AS/400 data warehouse. Built-in replication technology is provided with the Lotus Notes implementation on AS/400, providing the ability to share information from a DB2 for OS/400-based warehouse and Notes applications. Support for multiple data structures AS/400 supports many different data structures appropriate for analytical processing. DB2 for OS/400 provides a leading edge relational database management system, with multiple interfaces for defining, loading, accessing, and managing the data warehouse. OLAP data structures are supported through IBM Business Partner products that take advantage of the scalability and performance that AS/400 provides. See Table 1 for a list of Business Partner solutions. Data analysis and end-user tools The open interfaces of DB2 for OS/400, combined with the high performance programming interfaces in Client Access, give you the base with which you can select any number of tools for analysis purposes. These are typically PC-based tools that access the AS/400 data warehouse through the Client Access product. The choice of tools includes report writers, query tools, EIS tools, spreadsheets, and client/server programming tools for highly customized implementations. The wealth of tools does not seem to be a problem; picking one from the hundreds on the market is not as easy! IBM and its Business Partners offer a wide range of end-user products to choose from for analyzing the data warehouse from the desktop. IBM can assist you in determining the right tool for you, based on your specific requirements. The IBM Client Series program defines a selected set of leading-edge products to help you through the selection process. These products include IBM Visualizer and Lotus Approach. The IBM Client Series end-user certified products are included in Table 1.
Managing information about the warehouse--metadata Factors such as the number of operational systems, diversity of platforms, and consistency and size of data across the platform can affect the need for metadata. Metadata can be found in database catalogs, data dictionaries, and repositories. Anything that contains information about the operational and data warehouse databases can be used to supplement this information. By far the largest source of this information, especially the business information, is the person who created it. The correlations between business terms and the data warehouse structure do not exist in any online facility and must be created by hand. With Version 3 of OS/400, the enhancements to the system-wide catalog may mean that all the metadata you need is already in the AS/400. The DB2 for OS/400 system-wide catalog contains information about your files--independent of whether they were created with data description specifications or using SQL. Information about tables, fields, referential integrity constraints, and cross-reference information are kept in the system-wide catalog. The IBM DataGuide set of products provide a more extensive information catalog that can include data from multiple platforms. Extraction utilities allow object descriptions of tables, columns, queries, spreadsheets, and so on to be gathered from a wide range of sources. The DataGuide administrator function provides a common point of control for gathering and organizing information for the catalog for all end-user environments. Business views allow users to access the data warehouse while shielding them from the complexities of the database. Many IBM Business Partner products offer their own layer of metadata for use with their product sets. For example, IBM Visualizer includes an administrative function that can provide descriptions of files or tables and fields or columns, and can even control join processing. Putting it all together (or where to start) Up until now we've addressed the various components that make up a data warehouse implementation. Particular requirements will dictate which components are appropriate for each individual situation. Do you need a multi-dimensional data structure? Is 80 percent of your operational data in legacy flat files? How sophisticated should the desktop presentation be for the users of the warehouse? Has someone already built the transformation programs to extract data from the software package that you run your business with today? Do you have the tools you need to help you manage the warehouse? Selecting the data warehouse platform is just one piece to the puzzle. Defining requirements, selecting the right components, and implementing the solution completes the process. Whether you build your own solution, or choose to have the solution built for you, IBM can help. Following is a list of IBM and Business Partner
solutions and their appropriate solution areas. Table 1. Data warehouse
solutions
IBM offers an extensive set of consulting services specializing in data warehouse and data mining solutions. Expert consultants can help in the planning, design, systems integration, product installation, and testing of a data warehouse implementation, as well as provide the follow-on support for managing and automating the data warehouse processes. Data warehousing implementations need not be complex. The AS/400 Advanced Server models can meet the demands of data warehousing while letting you focus your energies on running your business, and not on running your computer. These models build on the strengths of AS/400 business computing systems: ease of use, advanced architecture, open interfaces, scalability, and investment protection. IDC has completed a number of studies on cost of ownership showing AS/400 business computing systems as the least expensive system compared to LANs and UNIX environments. (5) Recent benchmarks by Client/Server Labs, an independent benchmarking firm, shows AS/400 beating 15 other competitive systems by at least 52%. (6) The affinity you have with your operational systems make the AS/400 Advanced Server models the most logical choice for the data warehouse. Not only is data transformation made simpler, but systems administration is significantly easier as well. You'll have the same backup procedures, security, and operations as your existing system, but also the ability to use the AS/400 server for operations such as batch offload, application development, or communications. One customer had this to say about an AS/400 warehouse solution: "It has allowed our users to analyze and spot trends we simply could not do previously. By implementing a data warehousing solution with the AS/400, we projected an improvement in our merchandising and category management areas by ten percent." Another: "I believe that it has changed our style of management throughout the whole company, since managers at all levels are getting the information they need at the right time and in the right format!" And still another: "We measure the performance of each SKU relative to plans, historical trends, and forecasts. This in turn helps us make management decisions--taking advantage of the more profitable business and controlling the less profitable." For more information about products or getting started on your data warehousing solution:
Footnotes: (1) Normalized data refers to data defined according to one of the key rules of a relational model. Multiple tables are built to represent one-to-many, many-to-many, or one-to-one relationships within the data. Normalization minimizes duplication of data and preserves integrity of the database by ensuring that changes to data fields are made once, and that all applications and views of the data are unaffected. (2) Cleansing the data in data warehousing terms means to ensure the integrity of data when it is loaded into the warehouse. For example, if data were pulled from two operational systems that use different field names for the same data, cleansing techniques would ensure that the data fields conform to the warehouse definition upon loading. (3) E.F. Codd, who defined the 12 rules of a relational database model, has developed 12 rules that define an OLAP model. Products that conform to a portion or all of Codd's rules label themselves OLAP products. Some of the key rules are multi-dimensionality and intuitive data manipulation (for example, drilling down to new levels of data). (4) Shared Nothing refers to each system in the coupled network owning and managing its own main memory and disk storage. In other words, the memory and disk are not shared. (5) "Cost-to-Use of Midrange and PC LAN Systems in the Networked Enterprise", International Data Corporation. (6) "Performance Profile", Aberdeen Group, April 1996. International Business Machines Corporation 1996 Printed in the United States of America Mail comments to: Fax comments to: This publication may refer to products that are not currently available in your country. IBM makes no commitment to make available any products referred to herein. IBM may have patents or pending patent applications covering subject matter in this document. The furnishing of this document does not give you any license to these patents. You can send license inquiries, in writing, to the IBM Director of Licensing, IBM Corporation, 500 Columbus Avenue, Thornwood, NY 10594, USA This publication could contain technical inaccuracies or typographical errors. AIX, AS/400, BrioQuery, Client Series, DataGuide, DataJoiner, DataPropagator, DB2, Distributed Relational Database Architecture, DRDA, IBM, IMS, OS/2, OS/400, PowerPC, and 400 are trademarks of the IBM Corporation in the United States or other countries or both. Apple is a registered trademark of Apple Computer, Inc. DAL is a trademark of Independent Technologies, Incorporated. Impromptu is a registered trademark of Cognos, Incorporated. Microsoft is a registered trademark of Microsoft Corporation. PowerPlay is a registered trademark of Cognos, Incorporated. ShowCase is a registered trademark of Rochester ShowCase Corporation. UNIX is a registered trademark in the United States and other countries licensed exclusively through X/Open Company Limited. X/Open is a trademark of X/Open Company Limited. Other company, product and service names may be trademarks or service marks of others. All AS/400 manufacturing and development sites meet the ISO 9000 standard for quality assurance. |
Source:http://www.as400.ibm.com/db2/dataware.htm