How would a null value for ID zero hurt you?
Here is an example
I suggested that we update the null value to N/A. A co-worker, a very experienced Data Analyst responded this way:
I can go either way. This is a typical condition in Asset Manager so I’ve tried to “armor” my code with outer joins and isnulls.
I have to say that I am really impressed by the persistence of this candidate. I received several emails from him today. Of cause it has links to a couple of discussions on Internet with titles such as “How to UPDATE One Record From Two Records”.
Reading those online discussions is not a easy task. So I skipped it and sent him another very polite email.
Sorry I didn’t get back to you sooner.
I guess we didn’t really complete our conversation about the IP addresses over the phone. So it will be pointless for me to comment on your work.
I really appreciate your time!
Some interesting emails from him:
Of course, this is where meta-data logging would be important: We would need to let someone know what we did, and why… even if we did not do a commit…?
Thanks & comments welcome
Thanks – excellent advice. Data normalization is always interesting. –I used to joke that there needed to be a final rule – “de-normalize for speed.”
So I assume that we have a single table with all IP addresses in therein. What SQL would you use to merge two records?
Adding to the job description. Note the Deliverables at the end which include:
- Requires 5 or more years business intelligence experience
- In-depth knowledge of business intelligence concepts, practices, and procedures
- Advanced understating of relational database principles
- Advanced understating of multi-dimensional analysis and design, score carding, and dash boarding
- Ability to discover, analyze, synthesize, and document information needs for both analytical and decision support reporting and monitoring
- Experience designing and constructing metrics, standardized reports, and custom reports for business intelligence solutions
- Experience in creating testing methodologies and criteria resulting in the design and implementation of test plans
- Demonstrable ability documenting tools and applications and providing end user support
- Strong analytical and organizational skills
- Strong problem solving and conceptual thinking abilities
- Able to work independently and deliver high quality results when given a set of objective requirements
- Excellent written and verbal communication skills, with the ability to present ideas and concepts in a clear, concise manner to technical and non-technical audiences
- SQL Server Reporting Services and Analysis Services required
- SQL Server 2005/2008, including strong SQL programming skills
- Performance Point and Silverlight experience preferred
- SAP experience desirable
- Business Objects experience desirable
- Knowledge of SQL Server Integration Services and DataStage desirable
- Proficiency with VB Script, C#, Java, and XML
- Experience in the Utility industry preferred
- Visio, Excel, PowerPoint
- Agile Methodology experience preferred
- This position is responsible for the Microsoft Business Intelligence platform architecture, security, administration, and support; provide technical guidance related to performance, scalability, and best practices.
- Provide subject matter expertise in business intelligence analytic tools and technologies to support data warehouses, report development, and integration with web-based portals.
- This position is responsible for developing and deploying analytical solutions utilizing Silverlight, Performance Point, SQL Server Reporting and Analysis Services, and Excel Services.
- Should be an experienced, highly self-motivated, quick and creative thinker with the ability to conceptualize, plan, and implement business intelligence solutions.
- Should be comfortable working as part of a team with the ability to function as Project Manager when necessary.
- Architecture diagrams
- Requirements and technical design documents
- Implementation and administration policy, procedure, and standards documents
- Multi-dimensional cubes, complex reports, ad-hoc queries, dashboards, scorecards
Required Education/Experience: Bachelors Degree
Background Check Required: Yes
Drug Test Required: Yes
Overtime Required: Occasional
Travel Required: No
My current boss asked me to have a chat with somebody whom he is considering for a Data Analyst position.
He told me that he had DBA experience a few years ago, and he has done some SQL work through ODBC.
I knew right away that he has not done any analysis work using SQL as a query tool. He has never heard of ETL process for data integration/migration/data warehouse project.
In order to report back to my boss, I had to ask him a few questions about data analysis, SQL and ETL. I started with explaining ETL process, tasks we need to do, tools we would use etc. Then I tried to explain what Data Profiling means, and gave him two specific examples. Then I asked him how he would use SQL as a query tool to “discover” the data pattern.
My good intention seemed to irritate him so much that he told me that “I would not site here babbling about SQL with you. I can do the job and find answers. ” I asked him how would he find answer. The answer is “Google internet”.
The next day he sent me links to two web sites, one is “SQL Interview Questions and Answers”, and one is a discussion site where people were suggesting Cursors for “merging rows in a SQL query”. He also told me what he would do to “merge table to get IP address”:
… what I might also have done would be to (1) create a new record using our rules, (2) delete the other two records, then (3) record the reason for the change in the meta-data / system log. –Either that, or update + commit the new record with the new IP, then log + delete the other one? Depending upon how the row has been indexed, it might not be too bad of a hit … ?
I sent him a very polite email:
Thanks for digging into the problem.
One mystery about SQL is that it’s a set-based query (or analytic) tool, not primarily for record-by-record processing as you would do in C# via the ODBC layer.
This concept is a little hard to understand until you start o understand the set-based functions in SQL, such as max/min, row_number, identity etc. You will see that the functions are applied to the entire record set without you writing code to loop through each record.
This concept leads to another mystery about data analysis, which is to look for data patterns in a given record set, instead of each individual record.
To apply business logics to a record set, we would not look at each record individually, we would apply the logic to the entire set, then narrow down to the sub-set we need.
If you can keep the above concept in mind, and get some practices, you will be an expert soon.
There is an opportunity for me to give a lecture to a group of Computer Science Database Concentration students at ASU. I am debating on the topic. Many topics came to my mind. But my recent experience of working on a Verizon integration project with two other database developers (or data analysts) has inspired me to reflect upon myself as a database developer. How to be a disciplined database developer might be a more practical topic than topics such as ETL Architecture, Best Practices etc.
In IT field, new programming languages, new tools, and new methodologies come out quicker than we can name them, followed by devotees singing its praises and job descriptions demand we know them all. They all promise increases in productivity and quality. But there is only one common quality that all successful developers possess, discipline. We might be using different tools, different programming languages, different project methodologies, and different architectural designs, but discipline is one trait that will make or break every project.
Here are some of the horrors I see that put the word “discipline” in my head and the word will not go away.
1) This is part of the Data Flow of Filter Servers. Based on this visual representation of the “Filter Servers”, can anybody write down the logic for “Filter Servers” in 10 minutes? There is no metadata from the original developer; there is no BRS; there is no field mapping available. There is no single staging point in Relational database/flat files for the entire ETL process. The only staging is done with raw files that are native only to SQL Server Integration Services.
2) Compound the above problem by more similar poorly-design SSIS packages, AND a manually running process on production server. When I asked the reason for a manual process on the production server, I got the following answer:
I can set them up as a job but nothing is outputting the numbers from the run, so if you’re fine with not having those I can do it, otherwise I can just run them manually.
They run so quickly and if we are going to be replacing them anyways, I’m not sure what is to be gained by setting them up as jobs. Just my 2 cents.
3) Data from hardware discovery tools needs a very vigorous data profiling process. On example is this: The record with the last inventory date should be retained and the other should be deleted. When I sent the following question to the original developer, I never received any answer back.
Which part of your SSIS package retains the device record with the last inventory date and delete the other?
Cleansing the source data is an important step
4) If you decided to get “incremental” daily data from source based on last update date, you need to be very careful to make sure a) you don’t miss any data since last run; b) the process is recoverable from the last run without losing data. Here is a mechanism I saw that uses a date filter at the table level to get the daily “incremental” data:
where last_inventoory_date >= dateadd(dd, -1, cast(getdate() as smalldatetime))
Using the above date filter, you will miss many data, if your process
- Didn’t run on certain days for some reason, or
- Failed on certain days
It failed the recoverability test. It also failed the test of not losing any data. The Comparison needs to be at the record level, not a system date.
5) Compound the problem at number 4 by repeating the same code 10 times in 10 different SQL tasks in a SSIS package. The concept of object-oriented and reusability in database design/modeling is well known, but when it comes to writing SQL code and designing ETL process, it requires disciplines and practices from us.
6) How much confidence would you have on your ETL process, when you worked on your laptop for 6 months, and overnight, moved the process to the production server to run manually? During the 6 months, the business requirements have been changes, database schema has changed, hardware architecture has changed.
Adding to the job description. Note the note from Manager: MUST be able to figure out old, crummy data. The last requirement: the ability to work under time pressure are absolutely critical.
Job Title: Sr. SQL Server Database Analyst
Location:AZ (Phoenix), Iowa (Des Moines), NC (Charlotte) or Oregon (Beaverton)
Will hire off a phone interview – there will be 2 phone interviews
Must have excellent communication skills and be able to communicate with end users.
Candidates with 8+ years experience are strongly preferred.
Note from Manager:They MUST be able to figure out old, crummy data. There is not a lot of documentation (in some instances- none) and they need to dig in to the data, understand it and run with it. They also have to have enough business knowledge – specifically around Finance and Risk Mgmt to understand what is going on – and to make corrections as necessary. . They have to figure out what is going on by themselves. Many times the requestor will not know how to do it or where to go to get it – this person will be the one to figure it out and be able to spell it out to other people. The person needs to be strong on taking ownership and taking initiative.
The position will be responsible for developing solutions to meet the reporting and information needs of the business unit. These solutions will vary in type and scope from quick-hit ad-hoc reports to new productionalized reports with a moderate to large breadth requiring new infrastructure processes.
Daily activities could include any/all of the following:
· Monitoring reporting system processes
Working with Business Intelligence management co workers, and collections management, to identify, refine, or review data or report needs.
· Report and data process development working with appropriate business personnel on UAT of reports. Issue research and resolution related to reports or processes.
· Other activities as assigned and are related to other Home Equity reporting and/or business reporting needs.
Must have strong emphases on SSRS, SSIS, SSAS and SQL Server Stored Procedures.
Extensive knowledge of SQL Server back-end development (ETL, transact SQL, DDL/DML), strong reporting services development skills, reliability, attention to details, the ability to quickly learn, retain, and apply business rules and terminology, good interpersonal skills, and the ability to work under time pressure are absolutely critical for this position.
Namespace for variables can be a helpful tool to enforce naming conventions.
Here is an example. If I am using a variable to store the path for raw files in SSIS, I would use the word FilePath (or RAW) as the namespace for the variable.
In the variable property list, change the default namespace User to FilePath.