Enterprise Data Warehouse: Some Thoughts and Lessons Learned.
Enterprise Data: What's Useful and What's Junk?
If you want information for innovation, you can't try to
capture all the data from an enterprise. Instead, you seek to capture
only data that is designated for innovative uses that lead to
bottom-line value to the enterprise. Not everyone is a fan of enterprise
data warehouses. Many think they are not feasible—that it is not
reasonable to place all data from an enterprise into a single,
monolithic data warehouse. Our experience has taught us that those
people are right.
Placing all data from an enterprise into a monolithic data warehouse
leads to junk drawer mentality. The resulting data warehouse looks like
an overstuffed drawer. The excessive amounts of data make it impossible
to provide adequate attention to the details of data quality and the
requisite transformations. Additionally, users feel overwhelmed and find
the whole warehouse unusable.
The lesson to be learned is a version of the old "quality, not quantity"
adage. If you want information for innovation, you can't try to capture
all the data from an enterprise. Instead, you seek to capture only data
that is designated for innovative uses that lead to bottom-line value to
the enterprise. This leads to the obvious question: What kind of data is
that? The beginning of the answer is that if you want data that leads to
bottom-line value, you need to target it for that purpose.
Unfortunately, not every business user has the insight into data to be
able to selectively target data.
The ability to target the right data begins with data warehouse
designers who seek out data-savvy business users with analytic skills
and experience. They can identify the information that will truly lead
to bottom-line improvements. They are the ones who can discuss the
characteristics of success within various parts of the enterprise and
identify the information that helps to drive that success.
Here's another way to consider what enterprise data should be: Think
about a business function or use, and then work back to the data needed
to perform that function optimally. For example, data warehouses are
commonly used to deliver performance scorecards—high-level metrics that
highlight enterprise performance. In order to drive bottom—line results,
there needs to be enough supporting information to enable an analytic
assessment of underlying causes and behaviors that might lead to
innovations that will change future performance. Thus, the enterprise
data for performance scorecards includes the underlying detail that
supports analysis to drive future improvements in performance.
Consider data warehouses that support marketing analytics, such as the
planning of marketing campaigns and the analysis of completed campaigns.
As with scorecards, it is not enough to deliver high-level summarized
results. The detailed information needed to change future performance of
campaigns is the enterprise data: it includes demographic, behavioral,
relationship and contact information. If such data has the potential to
change future performance, then it is enterprise data.
Finally, data warehouses are commonly used to support supply chain
information. This may include information about purchase orders and
receiving that allows enhanced vendor performance or information on
materials and supplies that allows improved product costing. Other
relevant information may include market conditions to allow improved
risk management or information so suppliers can manage on-hand
quantities independently. To the extent that such information can
support innovative analyses to improve bottom-line performance, it is
Thinking about the useful data in this way allows one to make some
choices. A significant amount of data in enterprise databases cannot be
used for the kinds of innovative analyses we're talking about here.
Certainly, that data may be valuable within a certain domain, but it
cannot contribute to the analyses that drive enterprise performance.
Such data properly belongs in the operational databases that support its
domain, and there it should stay.
One might argue that such data is merely waiting for the right clever
analyst to figure out how to convert the data to innovative information.
That's a valid argument. Good analysts will recognize the potential of
data and will seek to store it in the warehouse, with the expectation of
using the data soon. We need to tolerate and encourage such thinking
The challenge for us as information professionals is that we know that
the drawer is large, but not infinitely large. Judgment and
discrimination are necessary so that we use enterprise resources to
store data that could be used more fruitfully, but don't waste those
resources by storing too much "junk." Just having these kinds of
discussions early with analytical business users can help everyone find
and maintain the appropriate balance.
On-demand Data Marts
Data marts have become essential tools in today's
business environment—repositories of data gathered from many different
sources, which analyze and then present information to a particular
group of users in terms relevant to their work. Take a manager who has
just purchased some external data to help with an important marketing
campaign. For that data to be meaningful, it first needs to be merged
with the company's internal data. From that data pool, the company will
then have what is needed to drive the campaign.
The typical solution today is to extract internal data and store it in
SAS files along with the external data. While this solution is well
accepted, SAS files are more difficult to manipulate than data in a
relational database. Another limitation is that relational databases
require skills (logical database design, physical database design,
physical database generation and loading) not normally possessed by SAS
programmers. Recruiting a database administrator to provide these
services may introduce a several-month delay.
What if, instead, the marketing manager could combine the internal and
external data and immediately populate a data mart that is automatically
defined simply by the act of requesting it? A pipe dream? Not at all.
Thanks to today's innovative uses of meta data, companies have the
potential for real-time generation of data marts to meet specific,
short-term needs of individuals who need the right combination of data
for important business needs.
Three technologies are needed to generate on-demand Data
An accessible meta data repository that users can
An extract tool that can get to the data on demand.
A target database management system (DBMS) that can
automatically generate a database.
The starting point for on-demand data marts is an
accessible meta data repository that reveals what data is available. The
repository also needs to assist the user in the selection of individual
data elements needed to support a business need, such as the marketing
campaign. Users need to be able to pull multiple data item collections
as part of the same activity—for example, multiple items that define
customer, product or sales. In logical database terms, this should
translate into logical tables that reflect the collections of data items
as defined by the user. Where can such meta data repositories be found?
Several vendors have products that provide this functionality.
The extract tool should provide access to data in a variety of data
sources. Data is typically stored in several places: a data warehouse,
operational data stores, operational databases and flat files. The
extract tool most likely will be a batch extractor, with specified
extraction schedules. Some data stores can be extracted in near-real
time, such as a data warehouse. Others have to wait so that key business
functions are not disturbed. Either way, the extraction should be
scheduled automatically, with clear communication to the user of the
time schedule. The source should be relatively transparent to the user,
while ensuring that the meta data descriptions provide enough context
for the user to know that the correct source is being tapped.
What extract tools exist? Not surprisingly, because such tools must be
tightly linked to the meta data repository just described, Informatica
is an example of extract tool vendor.
The target database has been a major stumbling block for designers in
the past. To provide easier data analysis, this database should be
relational. However, substantial technical expertise is required to
design and create a relational database that has acceptable performance
characteristics. The major technology advance in this area has been the
introduction of databases that permit the effective creation of
in-memory databases. In turn, this technology delivers analysis
performance without the need for physical database design in the
traditional sense. This provides a great solution for the creation of
short-lived databases that are needed quickly, but only for limited
periods of time.
The delivery of this newer technology means that companies now have the
capability to create very focused collections of data to meet specific
analysis needs for a short period of time, eliminating the need to
continue using an existing database once the original need has changed.
This is the heart of on-demand data marts.
What databases exist that meet this need? One such product is the Sand
Analytic Server from Sand Technology. Used in combination with
Informatica or some other similar tool, current customers are generating
data marts as needed and discarding them when the analysis need is
Data warehouse managers may protest that analysis belongs on the data
warehouse. In the broadest sense, we agree based on the work we have
done at American CIT. However, there seems to be a growing consensus today
that data warehouses and data marts serve two related but different
purposes. In many situations today, business professionals have a
short-term need for external data or customer data that must be combined
with data warehouse data in order to be useful. However, the resulting
collection should not then be stored on the data warehouse for various
reasons. For these kinds of needs, on-demand data marts are an excellent