Print this page

White Paper

Enterprise Data Warehouse: Some Thoughts and Lessons Learned.

Enterprise Data: What's Useful and What's Junk?

If you want information for innovation, you can't try to capture all the data from an enterprise. Instead, you seek to capture only data that is designated for innovative uses that lead to bottom-line value to the enterprise. Not everyone is a fan of enterprise data warehouses. Many think they are not feasible—that it is not reasonable to place all data from an enterprise into a single, monolithic data warehouse. Our experience has taught us that those people are right.

Placing all data from an enterprise into a monolithic data warehouse leads to junk drawer mentality. The resulting data warehouse looks like an overstuffed drawer. The excessive amounts of data make it impossible to provide adequate attention to the details of data quality and the requisite transformations. Additionally, users feel overwhelmed and find the whole warehouse unusable.

The lesson to be learned is a version of the old "quality, not quantity" adage. If you want information for innovation, you can't try to capture all the data from an enterprise. Instead, you seek to capture only data that is designated for innovative uses that lead to bottom-line value to the enterprise. This leads to the obvious question: What kind of data is that? The beginning of the answer is that if you want data that leads to bottom-line value, you need to target it for that purpose. Unfortunately, not every business user has the insight into data to be able to selectively target data.

The ability to target the right data begins with data warehouse designers who seek out data-savvy business users with analytic skills and experience. They can identify the information that will truly lead to bottom-line improvements. They are the ones who can discuss the characteristics of success within various parts of the enterprise and identify the information that helps to drive that success.

Here's another way to consider what enterprise data should be: Think about a business function or use, and then work back to the data needed to perform that function optimally. For example, data warehouses are commonly used to deliver performance scorecards—high-level metrics that highlight enterprise performance. In order to drive bottom—line results, there needs to be enough supporting information to enable an analytic assessment of underlying causes and behaviors that might lead to innovations that will change future performance. Thus, the enterprise data for performance scorecards includes the underlying detail that supports analysis to drive future improvements in performance.

Consider data warehouses that support marketing analytics, such as the planning of marketing campaigns and the analysis of completed campaigns. As with scorecards, it is not enough to deliver high-level summarized results. The detailed information needed to change future performance of campaigns is the enterprise data: it includes demographic, behavioral, relationship and contact information. If such data has the potential to change future performance, then it is enterprise data.

Finally, data warehouses are commonly used to support supply chain information. This may include information about purchase orders and receiving that allows enhanced vendor performance or information on materials and supplies that allows improved product costing. Other relevant information may include market conditions to allow improved risk management or information so suppliers can manage on-hand quantities independently. To the extent that such information can support innovative analyses to improve bottom-line performance, it is enterprise data.

Thinking about the useful data in this way allows one to make some choices. A significant amount of data in enterprise databases cannot be used for the kinds of innovative analyses we're talking about here. Certainly, that data may be valuable within a certain domain, but it cannot contribute to the analyses that drive enterprise performance. Such data properly belongs in the operational databases that support its domain, and there it should stay.

One might argue that such data is merely waiting for the right clever analyst to figure out how to convert the data to innovative information. That's a valid argument. Good analysts will recognize the potential of data and will seek to store it in the warehouse, with the expectation of using the data soon. We need to tolerate and encourage such thinking within limits.

The challenge for us as information professionals is that we know that the drawer is large, but not infinitely large. Judgment and discrimination are necessary so that we use enterprise resources to store data that could be used more fruitfully, but don't waste those resources by storing too much "junk." Just having these kinds of discussions early with analytical business users can help everyone find and maintain the appropriate balance.

On-demand Data Marts

Data marts have become essential tools in today's business environment—repositories of data gathered from many different sources, which analyze and then present information to a particular group of users in terms relevant to their work. Take a manager who has just purchased some external data to help with an important marketing campaign. For that data to be meaningful, it first needs to be merged with the company's internal data. From that data pool, the company will then have what is needed to drive the campaign.

The typical solution today is to extract internal data and store it in SAS files along with the external data. While this solution is well accepted, SAS files are more difficult to manipulate than data in a relational database. Another limitation is that relational databases require skills (logical database design, physical database design, physical database generation and loading) not normally possessed by SAS programmers. Recruiting a database administrator to provide these services may introduce a several-month delay.

What if, instead, the marketing manager could combine the internal and external data and immediately populate a data mart that is automatically defined simply by the act of requesting it? A pipe dream? Not at all. Thanks to today's innovative uses of meta data, companies have the potential for real-time generation of data marts to meet specific, short-term needs of individuals who need the right combination of data for important business needs.

Three technologies are needed to generate on-demand Data Marts:

  • An accessible meta data repository that users can query.

  • An extract tool that can get to the data on demand.

  • A target database management system (DBMS) that can automatically generate a database.

The starting point for on-demand data marts is an accessible meta data repository that reveals what data is available. The repository also needs to assist the user in the selection of individual data elements needed to support a business need, such as the marketing campaign. Users need to be able to pull multiple data item collections as part of the same activity—for example, multiple items that define customer, product or sales. In logical database terms, this should translate into logical tables that reflect the collections of data items as defined by the user. Where can such meta data repositories be found? Several vendors have products that provide this functionality.

The extract tool should provide access to data in a variety of data sources. Data is typically stored in several places: a data warehouse, operational data stores, operational databases and flat files. The extract tool most likely will be a batch extractor, with specified extraction schedules. Some data stores can be extracted in near-real time, such as a data warehouse. Others have to wait so that key business functions are not disturbed. Either way, the extraction should be scheduled automatically, with clear communication to the user of the time schedule. The source should be relatively transparent to the user, while ensuring that the meta data descriptions provide enough context for the user to know that the correct source is being tapped.

What extract tools exist? Not surprisingly, because such tools must be tightly linked to the meta data repository just described, Informatica is an example of extract tool vendor.
The target database has been a major stumbling block for designers in the past. To provide easier data analysis, this database should be relational. However, substantial technical expertise is required to design and create a relational database that has acceptable performance characteristics. The major technology advance in this area has been the introduction of databases that permit the effective creation of in-memory databases. In turn, this technology delivers analysis performance without the need for physical database design in the traditional sense. This provides a great solution for the creation of short-lived databases that are needed quickly, but only for limited periods of time.

The delivery of this newer technology means that companies now have the capability to create very focused collections of data to meet specific analysis needs for a short period of time, eliminating the need to continue using an existing database once the original need has changed. This is the heart of on-demand data marts.

What databases exist that meet this need? One such product is the Sand Analytic Server from Sand Technology. Used in combination with Informatica or some other similar tool, current customers are generating data marts as needed and discarding them when the analysis need is complete.

Data warehouse managers may protest that analysis belongs on the data warehouse. In the broadest sense, we agree based on the work we have done at American CIT. However, there seems to be a growing consensus today that data warehouses and data marts serve two related but different purposes. In many situations today, business professionals have a short-term need for external data or customer data that must be combined with data warehouse data in order to be useful. However, the resulting collection should not then be stored on the data warehouse for various reasons. For these kinds of needs, on-demand data marts are an excellent solution.