Parsing Non-Standard Text File Formats with the Script Component
Applies to: SQL Server SSIS Integration Runtime in Azure Data Factory
When your source data is arranged in a non-standard format, you may find it more convenient to consolidate all your parsing logic in a single script than to chain together multiple Integration Services transformations to achieve the same result.
Example 1: Parsing Row-Delimited Records
Example 2: Splitting Parent and Child Records
Note
If you want to create a component that you can more easily reuse across multiple Data Flow tasks and multiple packages, consider using the code in this Script component sample as the starting point for a custom data flow component. For more information, see Developing a Custom Data Flow Component.
Example 1: Parsing Row-Delimited Records
This example shows how to take a text file in which each column of data appears on a separate line and parse it into a destination table by using the Script component.
For more information about how to configure the Script component for use as a transformation in the data flow, see Creating a Synchronous Transformation with the Script Component and Creating an Asynchronous Transformation with the Script Component.
To configure this Script Component example
Create and save a text file named rowdelimiteddata.txt that contains the following source data:
FirstName: Nancy LastName: Davolio Title: Sales Representative City: Seattle StateProvince: WA FirstName: Andrew LastName: Fuller Title: Vice President, Sales City: Tacoma StateProvince: WA FirstName: Steven LastName: Buchanan Title: Sales Manager City: London StateProvince:
Open Management Studio and connect to an instance of SQL Server.
Select a destination database, and open a new query window. In the query window, execute the following script to create the destination table:
create table RowDelimitedData ( FirstName varchar(32), LastName varchar(32), Title varchar(32), City varchar(32), StateProvince varchar(32) )
Open SQL Server Data Tools and create a new Integration Services package named ParseRowDelim.dtsx.
Add a Flat File connection manager to the package, name it RowDelimitedData, and configure it to connect to the rowdelimiteddata.txt file that you created in a previous step.
Add an OLE DB connection manager to the package and configure it to connect to the instance of SQL Server and the database in which you created the destination table.
Add a Data Flow task to the package and click the Data Flow tab of SSIS Designer.
Add a Flat File Source to the data flow and configure it to use the RowDelimitedData connection manager. On the Columns page of the Flat File Source Editor, select the single available external column.
Add a Script Component to the data flow and configure it as a transformation. Connect the output of the Flat File Source to the Script Component.
Double-click the Script component to display the Script Transformation Editor.
On the Input Columns page of the Script Transformation Editor, select the single available input column.
On the Inputs and Outputs page of the Script Transformation Editor, select Output 0 and set its SynchronousInputID to None. Create 5 output columns, all of type string [DT_STR] with a length of 32:
FirstName
LastName
Title
City
StateProvince
On the Script page of the Script Transformation Editor, click Edit Script and enter the code shown in the ScriptMain class of the example. Close the script development environment and the Script Transformation Editor.
Add a SQL Server Destination to the data flow. Configure it to use the OLE DB connection manager and the RowDelimitedData table. Connect the output of the Script Component to this destination.
Run the package. After the package has finished, examine the records in the SQL Server destination table.
Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
Dim columnName As String
Dim columnValue As String
' Check for an empty row.
If Row.Column0.Trim.Length > 0 Then
columnName = Row.Column0.Substring(0, Row.Column0.IndexOf(":"))
' Check for an empty value after the colon.
If Row.Column0.Substring(Row.Column0.IndexOf(":")).TrimEnd.Length > 1 Then
' Extract the column value from after the colon and space.
columnValue = Row.Column0.Substring(Row.Column0.IndexOf(":") + 2)
Select Case columnName
Case "FirstName"
' The FirstName value indicates a new record.
Me.Output0Buffer.AddRow()
Me.Output0Buffer.FirstName = columnValue
Case "LastName"
Me.Output0Buffer.LastName = columnValue
Case "Title"
Me.Output0Buffer.Title = columnValue
Case "City"
Me.Output0Buffer.City = columnValue
Case "StateProvince"
Me.Output0Buffer.StateProvince = columnValue
End Select
End If
End If
End Sub
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
string columnName;
string columnValue;
// Check for an empty row.
if (Row.Column0.Trim().Length > 0)
{
columnName = Row.Column0.Substring(0, Row.Column0.IndexOf(":"));
// Check for an empty value after the colon.
if (Row.Column0.Substring(Row.Column0.IndexOf(":")).TrimEnd().Length > 1)
// Extract the column value from after the colon and space.
{
columnValue = Row.Column0.Substring(Row.Column0.IndexOf(":") + 2);
switch (columnName)
{
case "FirstName":
// The FirstName value indicates a new record.
this.Output0Buffer.AddRow();
this.Output0Buffer.FirstName = columnValue;
break;
case "LastName":
this.Output0Buffer.LastName = columnValue;
break;
case "Title":
this.Output0Buffer.Title = columnValue;
break;
case "City":
this.Output0Buffer.City = columnValue;
break;
case "StateProvince":
this.Output0Buffer.StateProvince = columnValue;
break;
}
}
}
}
Example 2: Splitting Parent and Child Records
This example shows how to take a text file, in which a separator row precedes a parent record row that is followed by an indefinite number of child record rows, and parse it into properly normalized parent and child destination tables by using the Script component. This simple example could easily be adapted for source files that use more than one row or column for each parent and child record, as long as there is some way to identify the beginning and end of each record.
Caution
This sample is intended for demonstration purposes only. If you run the sample more than once, it inserts duplicate key values into the destination table.
For more information about how to configure the Script component for use as a transformation in the data flow, see Creating a Synchronous Transformation with the Script Component and Creating an Asynchronous Transformation with the Script Component.
To configure this Script Component example
Create and save a text file named parentchilddata.txt that contains the following source data:
********** PARENT 1 DATA child 1 data child 2 data child 3 data child 4 data ********** PARENT 2 DATA child 5 data child 6 data child 7 data child 8 data **********
Open SQL Server Management Studio and connect to an instance of SQL Server.
Select a destination database, and open a new query window. In the query window, execute the following script to create the destination tables:
CREATE TABLE [dbo].[Parents]([ParentID] [int] NOT NULL, [ParentRecord] [varchar](32) NOT NULL, CONSTRAINT [PK_Parents] PRIMARY KEY CLUSTERED ([ParentID] ASC)) GO CREATE TABLE [dbo].[Children]([ChildID] [int] NOT NULL, [ParentID] [int] NOT NULL, [ChildRecord] [varchar](32) NOT NULL, CONSTRAINT [PK_Children] PRIMARY KEY CLUSTERED ([ChildID] ASC)) GO ALTER TABLE [dbo].[Children] ADD CONSTRAINT [FK_Children_Parents] FOREIGN KEY([ParentID]) REFERENCES [dbo].[Parents] ([ParentID])
Open SQL Server Data Tools (SSDT) and create a new Integration Services package named SplitParentChild.dtsx.
Add a Flat File connection manager to the package, name it ParentChildData, and configure it to connect to the parentchilddata.txt file that you created in a previous step.
Add an OLE DB connection manager to the package and configure it to connect to the instance of SQL Server and the database in which you created the destination tables.
Add a Data Flow task to the package and click the Data Flow tab of SSIS Designer.
Add a Flat File Source to the data flow and configure it to use the ParentChildData connection manager. On the Columns page of the Flat File Source Editor, select the single available external column.
Add a Script Component to the data flow and configure it as a transformation. Connect the output of the Flat File Source to the Script Component.
Double-click the Script component to display the Script Transformation Editor.
On the Input Columns page of the Script Transformation Editor, select the single available input column.
On the Inputs and Outputs page of the Script Transformation Editor, select Output 0, rename it to ParentRecords, and set its SynchronousInputID to None. Create 2 output columns:
ParentID (the primary key), of type four-byte signed integer [DT_I4]
ParentRecord, of type string [DT_STR] with a length of 32.
Create a second output and name it ChildRecords. The SynchronousInputID of the new output is already set to None. Create 3 output columns:
ChildID (the primary key), of type four-byte signed integer [DT_I4]
ParentID (the foreign key), also of type four-byte signed integer [DT_I4]
ChildRecord, of type string [DT_STR] with a length of 50
On the Script page of the Script Transformation Editor, click Edit Script. In the ScriptMain class, enter the code shown in the example. Close the script development environment and the Script Transformation Editor.
Add a SQL Server Destination to the data flow. Connect the ParentRecords output of the Script Component to this destination.Configure it to use the OLE DB connection manager and the Parents table.
Add another SQL Server Destination to the data flow. Connect the ChildRecords output of the Script Component to this destination. Configure it to use the OLE DB connection manager and the Children table.
Run the package. After the package has finished, examine the parent and child records in the two SQL Server destination tables.
Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
Static nextRowIsParent As Boolean = False
Static parentCounter As Integer = 0
Static childCounter As Integer = 0
' If current row starts with separator characters,
' then following row contains new parent record.
If Row.Column0.StartsWith("***") Then
nextRowIsParent = True
Else
If nextRowIsParent Then
' Current row contains parent record.
parentCounter += 1
Me.ParentRecordsBuffer.AddRow()
Me.ParentRecordsBuffer.ParentID = parentCounter
Me.ParentRecordsBuffer.ParentRecord = Row.Column0
nextRowIsParent = False
Else
' Current row contains child record.
childCounter += 1
Me.ChildRecordsBuffer.AddRow()
Me.ChildRecordsBuffer.ChildID = childCounter
Me.ChildRecordsBuffer.ParentID = parentCounter
Me.ChildRecordsBuffer.ChildRecord = Row.Column0
End If
End If
End Sub
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
int static_Input0_ProcessInputRow_childCounter = 0;
int static_Input0_ProcessInputRow_parentCounter = 0;
bool static_Input0_ProcessInputRow_nextRowIsParent = false;
// If current row starts with separator characters,
// then following row contains new parent record.
if (Row.Column0.StartsWith("***"))
{
static_Input0_ProcessInputRow_nextRowIsParent = true;
}
else
{
if (static_Input0_ProcessInputRow_nextRowIsParent)
{
// Current row contains parent record.
static_Input0_ProcessInputRow_parentCounter += 1;
this.ParentRecordsBuffer.AddRow();
this.ParentRecordsBuffer.ParentID = static_Input0_ProcessInputRow_parentCounter;
this.ParentRecordsBuffer.ParentRecord = Row.Column0;
static_Input0_ProcessInputRow_nextRowIsParent = false;
}
else
{
// Current row contains child record.
static_Input0_ProcessInputRow_childCounter += 1;
this.ChildRecordsBuffer.AddRow();
this.ChildRecordsBuffer.ChildID = static_Input0_ProcessInputRow_childCounter;
this.ChildRecordsBuffer.ParentID = static_Input0_ProcessInputRow_parentCounter;
this.ChildRecordsBuffer.ChildRecord = Row.Column0;
}
}
}
See Also
Creating a Synchronous Transformation with the Script Component
Creating an Asynchronous Transformation with the Script Component