A look at the lifecycle of a data warehouse build, and why it is as enjoyable as it is frustrating
Some people are great at cooking, others can create wonderful art or write beautiful music. Some are leaders, some are teachers, some people can rub their tummies and pat their heads at the same time while reciting poetry.
I design and build data warehouses.
Why is this special?
Well, it might not be special to you, but I love it, I’ve built a career out of it and here’s why I think it’s special.
Building a data warehouse is a little bit like building an actual house.
There’s a lot of planning, a lot of pre-design/build cost and time analysis.
Then there’s designing the overall architecture, figuring out what it needs to accomplish.
I need to speak to the people who will be ‘consuming’ the data that will be held in the data warehouse, i.e. requesting and reading the reports and analyses, find out what they want from the data, how it will help them.
I need to understand what level of hand-holding they’ll need. Will they want to download the data and do their own analyses, or will they want to press a button and have it done for them?
There’s a big element of what I call ‘back-to-front’ design. Begin with the end in mind. If I know what I need it to accomplish now, and predict what it might need to accomplish in the future, then that will inform every decision I make from here on in.
Then there’s looking at the component parts, the data. Where’s it coming from? What will we need to do the data in order to make it fit for purpose?
Next, I’ll need to design a detailed table schema. This is my favourite part, designing the tables, the fields, the data types, the relationships and constraints, figuring out my primary keys, foreign keys, surrogate keys and natural keys.
This is a big part of the process, and involves close scrutiny of the data coming in from the various sources. Some people like to do this visually with an entity relationship diagram, others can visualise it in their heads and just do it in code.
I prefer to do it in code, but an ERD can help visualise the end product.
Then the build. No, THIS is my favourite part, getting up to my elbows in SQL. Creating the tables, defining those data types, keys and all those other things I just listed in the previous point. This is known as ‘the fun stuff’. I like it when someone walks past my desk and sees a screen full of SQL and says, “Blimey, you understand all that?”, and I say “No. You?” and we have a bloody good laugh about it.
When the basic database schema has been designed and built, that’s when we start to look at the ETL – Extract, Transform and Load. This is where we might use SSIS, or just SQL stored procedures (depends on whether the data is external to your SQL environment or not – could be CSV files, JSON, an API, Excel or another database such as Oracle, MySQL even MS Access (Yes people do still use that you know!).
Building the ETL can be fun, but more likely it will be time-consuming, boring, highly repetitive and often very frustrating. ETL processes fail. They fail because you’ve missed something out, they fail because you’ve left something in, they fail because you got a join wrong, because of a foreign key, a primary key, duplicate values (that’s a common one), NULL values (that’s a very common one), they fail because they just ran three times successfully and just want a change.
SSIS packages love to throw you errors. Errors you understand, errors you don’t and errors you’ll spend six hours Googling before you figure it out. Going back ten years or more we used a thing called BIML to create SSIS packages IN CODE. Who the hell thought that was a good idea? Going back even further we used DTS which is like comparing an electric car to a horse and cart (but at least the horse and cart still work when it’s freezing eh readers?).
Testing. Oh yes, testing. Testing is about as much fun as it sounds. In other words, none. Here’s where you write and run some queries. You’re testing your table designs, your dimensions, your joins, your relationships (foreign keys), your indexes (and many things can go wrong with indexes – the wrong ones, not enough or too many. Beware of index overload). And here’s also where you discover issues. Issues that’s can’t easily be resolved without tracing some processes back to the start, the very origin of the data and that initial grab from the data source. Then you have to change every step and anything that has touched it. It can be something as simple as too short a varchar field or a numeric with insufficient size or precision.
Another question with ETL is whether you want your processes to run concurrently, or in succession. You can save a LOT of time be running a load of non-dependent processes at the same time.
Monitoring is a big consideration. Performance monitoring and error monitoring.
Error monitoring is where you need to make some big decisions. Do you want one failed step to fail the whole process or simply take a note of it and let you know when it’s finished? How do you collect errors while at the same time ignoring them? TRY CATCH blocks, that’s how I like to do it. This is like auditing your process, and how much information do you want to collect? It’s very tempting to collect heaps of information, and then….. never ever look at it.
Performance monitoring is related to how quickly and efficiently the process runs. Where can you make savings? Minutes matter, even seconds and you need to make sure your queries and processes are the right tool for the job, and as optimised as is necessary. Only experience will tell you what’s the right move to make.
At some point we need to think about reporting and the actual building of reports and writing of report queries, whether that’s SSRS, Power BI, Tableau, Excel or something else. Now that’s more a job for BI Analysts so when the time’s right, they will get on with that, and I’ll have to make sure they have the information they need to do that. It can be done early in the process (especially if the DW is replacing a previous version), or it can be done as one of the final tasks. As with a lot of things, it depends.
And when you’ve done all that, you can sit back and admire your handywork.
Oh, but don’t forget your documentation… I’m kidding of course, no-one bothers with documentation. Shh, don’t tell anyone I said that.
I love data warehouses so much, that I think I’d like to go and live in one.