18 May 2014

Data Flow -- SSIS

A Data Flow Task (DFT) is, literally, a task where data flows. That is exactly what ETL does: Extract, Transform & Load data. Most of the ETL work in a package is done in the Data Flow Task.

A Data Flow Task has to be part of a Control Flow. One Control Flow can contain any number of DFTs, and the Control Flow is where we decide how those multiple DFTs are arranged and sequenced in our solution.

Each DFT must have a Source; it may or may not have Transforms, and may or may not have a Destination. However, the Source must be followed by at least one Transform or a Destination. A Source cannot just extract data and do nothing with it; if that is the case, the DFT will fail.

Ideally, a DFT has a Source, optional Transforms, and a Destination. The Source extracts the data, the Transforms reshape it, and the Destination loads it into a target. That is how ETL is implemented in a DFT.
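As a rough sketch of this Source → Transform → Destination flow, here is a plain-Python analogue (not SSIS itself; the column names, sample data, and tax rate are invented for illustration):

```python
import csv
import sqlite3

# Source: extract rows from a flat file (an in-memory CSV string for this sketch)
raw = "id,amount\n1,10\n2,25\n3,7\n"
rows = list(csv.DictReader(raw.splitlines()))

# Transform: derive a new column from existing data, like a Derived Column transform
for row in rows:
    row["amount_with_tax"] = round(float(row["amount"]) * 1.1, 2)

# Destination: load the transformed rows into a database table, like an OLE DB Destination
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, amount REAL, amount_with_tax REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(r["id"], r["amount"], r["amount_with_tax"]) for r in rows],
)
print(conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0])  # 3 rows loaded
```

In a real package the same three stages are configured visually on the Data Flow design surface rather than written as code.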


The Data Flow tab of the Package Designer is where you specify the details of any Data Flow tasks that you've added on the Control Flow tab. Data Flows are made up of various objects that you drag and drop from the Toolbox:
  • Data Flow Sources are ways that data gets into the system. Table 16-5 lists the available data flow sources.
  • Data Flow Transformations let you alter and manipulate the data in various ways. Table 16-6 lists the available data flow transformations.
  • Data Flow Destinations are the places that you can send the transformed data. Table 16-7 lists the available data flow destinations.
Just as the Control Flow contains tasks, a DFT contains Sources, Transforms, and Destinations; note that these are not called "tasks", although many people refer to them that way. Let's list a few examples of each:

Data Flow Sources:

ADO NET: Extracts data from a database using a .NET data provider
Excel: Extracts data from an Excel workbook
Flat File: Extracts data from a flat file
OLE DB: Extracts data from a database using an OLE DB provider
Raw File: Extracts data from a raw file (a proprietary Microsoft format)
XML: Extracts data from an XML file
Transformations:

Aggregate: Aggregates and groups values in a dataset
Audit: Adds audit information to a dataset
Cache Transform: Populates a Cache connection manager
Character Map: Applies string operations to character data
Conditional Split: Evaluates and splits up rows in a dataset
Copy Column: Copies a column of data
Data Conversion: Converts data to a different data type
Data Mining Query: Runs a data mining query
Derived Column: Calculates a new column from existing data
Export Column: Exports data from a column to a file
Fuzzy Grouping: Groups rows that contain similar values
Fuzzy Lookup: Looks up values using fuzzy matching
Import Column: Imports data from a file to a column
Lookup: Looks up values in a reference dataset
Merge: Merges two sorted datasets
Merge Join: Merges data from two datasets by using a join
Multicast: Creates copies of a dataset
OLE DB Command: Executes a SQL command on each row in a dataset
Percentage Sampling: Extracts a subset of rows from a dataset
Pivot: Builds a pivot table from a dataset
Row Count: Counts the rows of a dataset
Row Sampling: Extracts a sample of rows from a dataset
Script Component: Executes a custom script
Slowly Changing Dimension: Updates a slowly changing dimension table
Sort: Sorts data
Term Extraction: Extracts terms from a column of text
Term Lookup: Looks up the frequency of a term in a column
Union All: Merges multiple datasets
Unpivot: Normalizes a pivot table
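To make two of the most common transforms concrete, here is a plain-Python analogue of Conditional Split and Derived Column (not SSIS expression syntax; the sample rows, the 100 threshold, and the 0.92 conversion rate are invented for illustration):

```python
# Conditional Split: route each row down one of several outputs based on a condition.
rows = [
    {"product": "A", "amount": 120.0},
    {"product": "B", "amount": 40.0},
    {"product": "C", "amount": 75.0},
]

large_orders, small_orders = [], []          # the split's two outputs
for row in rows:
    (large_orders if row["amount"] >= 100 else small_orders).append(row)

# Derived Column: compute a new column from existing columns on every row.
for row in rows:
    row["amount_eur"] = round(row["amount"] * 0.92, 2)

print([r["product"] for r in large_orders])   # ['A']
print([r["product"] for r in small_orders])   # ['B', 'C']
```

In SSIS the same logic would be written as expressions inside the Conditional Split and Derived Column editors rather than as code.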

Destinations:

ADO NET: Sends data to a .NET data provider
Data Mining Model Training: Sends data to an Analysis Services data mining model
DataReader: Sends data to an in-memory ADO.NET DataReader
Dimension Processing: Processes a cube dimension
Excel: Sends data to an Excel worksheet
Flat File: Sends data to a flat file
OLE DB: Sends data to an OLE DB database
Partition Processing: Processes an Analysis Services partition
Raw File: Sends data to a raw file
Recordset: Sends data to an in-memory ADO Recordset
SQL Server Compact: Sends data to a SQL Server CE database
SQL Server: Sends data to a SQL Server database
In the figure below, you can see how a DFT looks once various Sources, Transforms, and Destinations are attached. Notice the red circle on each component: it appears because they are not yet configured, and SSIS is indicating that the package will error out if executed now.

[Figure: a Data Flow Task with multiple unconfigured Sources, Transforms, and Destinations]


Recall how we connect tasks in the Control Flow using precedence constraints. We use arrows in a DFT as well, but they are not termed precedence constraints; they are called data flow paths. Here they come in just two colours: green for valid rows, and red for rows that are incorrect as per our setup (the error output).

Unlike the Control Flow, in a Data Flow the various components cannot have an arbitrary number of inputs and outputs; these are predefined, with exceptions such as the Multicast transform, which can have any number of outputs configured. For example, an OLE DB Source takes no input and provides one success output (green) and one error output (red); a similar fixed rule applies to each source, transform, and destination.
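These fixed input/output rules can be sketched as a plain-Python analogue (again, not SSIS itself; the parse-to-integer check standing in for a failing row is invented for illustration): a source with one success and one error output, and a Multicast fanning one input out to several identical copies.

```python
def source(lines):
    """A source with one success output and one error output, like an OLE DB Source."""
    good, errors = [], []
    for line in lines:
        try:
            good.append(int(line))   # rows that parse go down the green path
        except ValueError:
            errors.append(line)      # bad rows go down the red (error) path
    return good, errors

def multicast(dataset, n_outputs):
    """Multicast: the one transform allowed any number of identical outputs."""
    return [list(dataset) for _ in range(n_outputs)]

good, errors = source(["10", "20", "oops", "30"])
outputs = multicast(good, 3)
print(good)          # [10, 20, 30]
print(errors)        # ['oops']
print(len(outputs))  # 3
```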

Steps for adding Data Flow 
