Lecture at ASU
There is an opportunity for me to give a lecture to a group of Computer Science Database Concentration students at ASU. I have been debating the topic. Many topics came to mind, but my recent experience working on a Verizon integration project with two other database developers (or data analysts) has inspired me to reflect on myself as a database developer. How to be a disciplined database developer may be a more practical topic than ETL Architecture, Best Practices, and the like.
In the IT field, new programming languages, tools, and methodologies come out quicker than we can name them, each followed by devotees singing its praises and by job descriptions demanding we know them all. They all promise increases in productivity and quality. But there is only one quality that all successful developers share: discipline. We may use different tools, different programming languages, different project methodologies, and different architectural designs, but discipline is the one trait that will make or break every project.
Here are some of the horrors I have seen that put the word “discipline” in my head, where it will not go away.
1) This is part of the Data Flow of Filter Servers. Based on this visual representation of the “Filter Servers”, can anybody write down the logic for “Filter Servers” in 10 minutes? There is no metadata from the original developer; there is no BRS; there is no field mapping available. There is no single staging point in a relational database or flat files for the entire ETL process. The only staging is done with raw files that are native only to SQL Server Integration Services.
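One remedy is to stage intermediate results in a plain relational table rather than SSIS raw files, so anyone can inspect the data mid-flow with an ordinary query. A minimal sketch follows; the schema, table, and column names are hypothetical, not from the original package.

```sql
-- Sketch only: stg.filter_servers and its columns are hypothetical.
-- A relational staging table is self-documenting and queryable,
-- unlike raw files that only SSIS itself can read.
CREATE TABLE stg.filter_servers (
    server_name         varchar(128)  NOT NULL,
    source_system       varchar(50)   NOT NULL,
    last_inventory_date smalldatetime NULL,
    loaded_at           datetime      NOT NULL DEFAULT (GETDATE())
);
```

With a table like this between steps, the “logic for Filter Servers” can at least be reverse-engineered from the data, even when the metadata and field mapping are missing.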
2) Compound the above problem with more similarly poorly designed SSIS packages, AND a manually run process on the production server. When I asked the reason for a manual process on the production server, I got the following answer:
I can set them up as a job but nothing is outputting the numbers from the run, so if you’re fine with not having those I can do it, otherwise I can just run them manually.
They run so quickly and if we are going to be replacing them anyways, I’m not sure what is to be gained by setting them up as jobs. Just my 2 cents.
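The objection here was that a scheduled job would not “output the numbers from the run.” That gap is easy to close with a small audit table that each step writes to. A sketch, with hypothetical table, package, and step names:

```sql
-- Sketch: dbo.etl_run_log, dbo.target_table, stg.source_table, and the
-- package/step names are hypothetical.
CREATE TABLE dbo.etl_run_log (
    run_id     int IDENTITY(1, 1) PRIMARY KEY,
    package    sysname      NOT NULL,
    step       varchar(100) NOT NULL,
    rows_moved int          NOT NULL,
    run_at     datetime     NOT NULL DEFAULT (GETDATE())
);

DECLARE @rows int;

-- The existing load statement, whatever it is:
INSERT INTO dbo.target_table (device_id, last_inventory_date)
SELECT device_id, last_inventory_date
FROM stg.source_table;

-- Capture @@ROWCOUNT immediately, before any other statement resets it:
SET @rows = @@ROWCOUNT;

INSERT INTO dbo.etl_run_log (package, step, rows_moved)
VALUES ('FilterServers', 'load_target_table', @rows);
```

Once the numbers land in a table, the process can run unattended as a scheduled job, and the run history becomes a queryable record instead of something someone watched scroll by.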
3) Data from hardware discovery tools needs a very rigorous data profiling process. One example: the record with the latest inventory date should be retained and the others should be deleted. When I sent the following question to the original developer, I never received an answer:
Which part of your SSIS package retains the device record with the latest inventory date and deletes the others?
Cleansing the source data is an important step.
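The rule itself fits in one statement. A sketch using a ranking window function, assuming hypothetical table and column names (dbo.device_inventory, device_id, last_inventory_date):

```sql
-- Sketch only: table and column names are hypothetical.
-- Keep the row with the latest inventory date per device;
-- delete every other row for that device.
WITH ranked AS (
    SELECT device_id,
           last_inventory_date,
           ROW_NUMBER() OVER (
               PARTITION BY device_id
               ORDER BY last_inventory_date DESC
           ) AS rn
    FROM dbo.device_inventory
)
DELETE FROM ranked
WHERE rn > 1;
```

When the cleansing rule lives in one visible statement like this, the question “which part of your package does this?” has an obvious answer.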
4) If you decide to get “incremental” daily data from the source based on the last update date, you need to be very careful to make sure a) you don’t miss any data since the last run; and b) the process is recoverable from the last run without losing data. Here is a mechanism I saw that uses a date filter at the table level to get the daily “incremental” data:
where last_inventory_date >= dateadd(dd, -1, cast(getdate() as smalldatetime))
Using the above date filter, you will miss data if your process
- didn’t run on certain days for some reason, or
- failed on certain days.
It fails the recoverability test, and it fails the test of not losing any data. The comparison needs to be made at the record level, not against the system date.
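One common way to make the comparison record-level is a stored high-water mark: persist the latest date actually loaded, and filter the source against that value instead of against getdate(). A sketch, with hypothetical table and column names:

```sql
-- Sketch: dbo.etl_watermark and src.device_inventory are hypothetical.
DECLARE @last_loaded smalldatetime;

SELECT @last_loaded = last_loaded_date
FROM dbo.etl_watermark
WHERE table_name = 'device_inventory';

-- Pull everything newer than the last successful load, however long
-- ago that was -- a skipped or failed day is picked up automatically.
SELECT device_id, last_inventory_date
FROM src.device_inventory
WHERE last_inventory_date > @last_loaded;

-- Only after the load succeeds, advance the watermark to the maximum
-- date that was actually loaded:
UPDATE dbo.etl_watermark
SET last_loaded_date = (SELECT MAX(last_inventory_date)
                        FROM src.device_inventory)
WHERE table_name = 'device_inventory';
```

Because the watermark only moves forward on success, a run that fails or never starts loses nothing; the next run simply resumes from the last good point.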
5) Compound the problem in number 4 by repeating the same code 10 times in 10 different SQL tasks in an SSIS package. The concepts of object orientation and reusability are well known in database design and modeling, but when it comes to writing SQL code and designing ETL processes, they require discipline and practice from us.
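In SQL, the reusable unit is usually a parameterized stored procedure: one definition replaces the ten copies, so a fix lands in exactly one place. A sketch, with hypothetical procedure, table, and column names:

```sql
-- Sketch only: all names here are hypothetical.
-- One parameterized procedure instead of the same filter
-- pasted into ten SQL tasks.
CREATE PROCEDURE dbo.get_incremental_inventory
    @since smalldatetime
AS
BEGIN
    SET NOCOUNT ON;

    SELECT device_id, device_name, last_inventory_date
    FROM src.device_inventory
    WHERE last_inventory_date > @since;
END;
```

Each SQL task then reduces to a one-line call such as `EXEC dbo.get_incremental_inventory @since = ?`, and the filter logic has a single owner.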
6) How much confidence would you have in your ETL process if you had worked on it on your laptop for 6 months and then, overnight, moved it to the production server to run manually? During those 6 months, the business requirements have changed, the database schema has changed, and the hardware architecture has changed.