Which data belongs in the data warehouse?

April 10, 2024 Documentation

Everything?

“No” 😉

Data should have a purpose in the data warehouse (DW). Every source and its administration consume resources (time, money, computing power, development capacity, operating costs, …).

Having “everything” in the DW just for the sake of having it would raise resource costs without generating any direct benefit. For each source, report, or use case, consider how urgent and important it really is.

A few reasons for loading data into the DW are listed below. How they are prioritized depends heavily on the context of the company.

“Rules”

As usual, the following list is a suggestion and is intended as food for thought.

data that should be stored in an audit-proof manner
data that should be analyzed, either in reports or as an export (e.g. to third-party systems, external vendors, …)
data that needs to be historicized (possibly because the upstream source system does not support this)
data whose analysis would otherwise reduce the performance of the operational upstream system on which the queries are carried out
data from two or more sources that are to be “merged”: storing them centrally in the DW can make sense and can be extremely important for data quality and timeliness, especially if Excel or CSV files are used
all data that is to be used as a source for data science
all analytics data that is produced (results from data science)
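
To make the list easier to apply, it can be written down as a simple checklist per candidate source. The following Python sketch is only an illustration: the class `SourceProfile`, its field names, and the function `reasons_to_load` are hypothetical names chosen for this example and do not come from any tool. It does not weigh or prioritize the reasons, because that depends on the context of the company.

```python
from dataclasses import dataclass, fields


@dataclass
class SourceProfile:
    """Checklist answers for one candidate source (illustrative field names)."""
    name: str
    needs_audit_proof_storage: bool = False
    used_in_reports_or_exports: bool = False
    needs_history: bool = False
    analysis_slows_operational_system: bool = False
    merged_with_other_sources: bool = False
    feeds_data_science: bool = False
    is_data_science_result: bool = False


def reasons_to_load(source: SourceProfile) -> list[str]:
    """Return the names of all checklist criteria that are answered with True."""
    return [f.name for f in fields(source) if getattr(source, f.name) is True]


# Hypothetical sources, used only to show the idea.
crm = SourceProfile("crm", used_in_reports_or_exports=True, needs_history=True)
scratch = SourceProfile("scratch_files")

print(reasons_to_load(crm))      # ['used_in_reports_or_exports', 'needs_history']
print(reasons_to_load(scratch))  # []
```

A source with an empty list has no stated purpose in the DW yet. Whether one reason is enough, or how the reasons are weighted against the costs, remains a decision for each company.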

Note

Also keep in mind that the more data the data warehouse holds, the more attention must be paid to governance (security, housekeeping, performance, …).

The next parts of this article series will address these points and others in order to suggest ideas.