Posts Tagged ‘data profiling’

Fun as a Contractor #52 – Job description for data analysis and data profiling

June 14, 2011 Leave a comment

Adding to the job description collection for two reasons.

  • This is one of the good job descriptions what actually spelled out job duties and required skills.
  • One of the rare job descriptions for data analysis, data profiling and data sourcing.

Data System Analyst and Designer – FINANCE


-Lead with data analysis, data profiling and data sourcing.

-Work with the end-users, data architect to create/validate and enhance source to Client mapping documents.
-Support business analysts and developers with subject expertise in source-to-Client data flows.

-Work closely with the data architect to define, design and implement Center of Excellence of Risk management tools and large data warehouse.

-Work closely with Risk management tool vendors to ensure timely detailed knowledge transfer on the topics of risk modeling, data management of risk models and governance/control of risk management environment.

-Develop thorough understanding of the data definitions, domain values, data relationships, business rules, sources and data integration for each subject area and/or input source file and develop functional and non-functional specifications
with the data architect, translate business requirements into design specifications.
-Assist in testing, creating test scenarios and troubleshooting data issues including source system gap analysis and remediation
-Work with the project manager, data architect and other project teams including test, QA and production to coordinate and conduct impact analysis.

  • Data analysis background / Data de-duplication / Denormalization of  data structures and flat file linking
  • Interface with BAs and end-users to understand structures, quality and processing rules
  • Strong knowledge of relational databases (preferably Oracle, AbInitio, Datastage-Infosphere)
  • SQL query writing skills
  • Solid knowledge of principles of data warehousing projects, data landscape surveying, data analysis, mapping and evaluation
  • Strong data analytical, data profiling and data quality troubleshooting skills
  • Experience in data integration, reference data and data cross reference techniques
  • Solid expertise interacting and working with data architects, solution architects
  • Self Starter / Team Player / good problem solver


ETL Toolkit– Error Event Table and Audit Dimension

March 28, 2011 2 comments

Chapter 4 “Cleaning and Conforming” from Kimball’s The Data Warehouse ETL Toolkit takes about 50 pages, not an extremely long chapter. Here is a warning from the beginning of the chapter.

Please stay with us in this chapter. It is enormously important. This chapter
makes a serious effort to provide specific techniques and structure for an
often amorphous topic. The chapter is long, and you should probably read
it twice, but we think it will reward you with useful guidance for building
the data cleaning and conforming steps of your ETL system.

I have to confess that I read it twice, but never finished reading the entire chapter each time. That doesn’t mean that I totally ignored the data quality issue in my ETL design. Although I didn’t finish reading the entire chapter, I focused on these three sections:


Without the effort and hard work during the data profiling phase (or I called it data discovery phase), cleansing and auditing are not going anywhere. Here is an example from the IT Asset Management system from my previous project.

Network auto-discovery software are installed on servers. Multiple such software can be installed on the same server, collecting duplicate data or conflicting data. Only after lots of hard work of data profiling, we were able to create a strategy to resolve duplicate or conflicting data.

Looking at the Error event table schema suggested from the ETL Toolkit, it doesn’t not take long to come with a Screen Type of “Duplicate Composite Key” as a Screen Type for my data exception table. I called it data exception, instead of error event table.

By the way, the suggested schema here is a generic schema. I’d think that you can design your table any way you want to fit into your specific project.


As for the #2 Cleaning Deliverable “Audit Dimension”, my understanding is that it is just a summary from your data exception screens, with a score for each exception category. I’ve designed an audit table to summarize all the important measures for data exception and also for normal row counts, but I’ve never created scores for them. I guess I really didn’t know how to score them.


%d bloggers like this: