
Migrating UTF8 data from Oracle to SQL Server using SSMA for Oracle

Before beginning the data migration, you need a clear understanding of what UTF-8 is: the 8-bit encoding of Unicode.

It is a variable-width encoding and a strict superset of ASCII, meaning every character in the ASCII character set is available in UTF-8 with the same code point values. A single Unicode character can occupy 1, 2, 3, or 4 bytes in UTF-8 encoding.
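As a concrete illustration (assuming an Oracle database whose character set is UTF-8; the sample string is invented), Oracle's LENGTH and LENGTHB functions contrast the character count with the byte count:

```sql
-- In a UTF-8 Oracle database: 'A' is 1 byte, 'é' is 2 bytes, '€' is 3 bytes.
SELECT LENGTH('Aé€')  AS char_count,   -- 3 characters
       LENGTHB('Aé€') AS byte_count    -- 1 + 2 + 3 = 6 bytes
FROM dual;
```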

Oracle’s support for Unicode

Oracle started supporting Unicode as a database character set in version 7. Unicode characters can be stored in an Oracle database in two ways.

 

Unicode database

The end user can create a Unicode database, which lets you store UTF-8 encoded characters in the SQL CHAR datatypes (CHAR, VARCHAR2, CLOB, and LONG). To do this, specify a UTF-8 character set as the database character set when creating the database.

SQL Server support for Unicode

SQL Server has supported Unicode since version 7.0. It uses the UCS-2 character encoding to store Unicode data, in the national character datatypes (nchar, nvarchar, and ntext).
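A quick sketch of what UCS-2 storage means in practice: LEN returns the number of characters, DATALENGTH returns the number of bytes, and for nvarchar data every character in the Basic Multilingual Plane occupies two bytes (the variable and sample string are invented for illustration):

```sql
-- nvarchar stores UCS-2: two bytes per character.
DECLARE @s nvarchar(6) = N'ердийн';
SELECT LEN(@s)        AS char_count,  -- 6
       DATALENGTH(@s) AS byte_count;  -- 12
```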

Now let’s move on to how to migrate an Oracle database with UTF8 data to SQL Server using SSMA.

Unicode database solution

 

If the database is a Unicode database, then Oracle is configured to store multi-byte strings in VARCHAR2/CHAR columns or variables. In this case, you can use customized mappings to map the character types to Unicode types in SQL Server:

Oracle          SQL Server 2005/2008
--------------  --------------------
VARCHAR2        nvarchar
CHAR            nchar
LONG, CLOB      nvarchar(max)

 

Otherwise, non-ASCII strings can be distorted during data migration.
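For example, with the customized mappings above, a hypothetical Oracle table such as the following would be converted to Unicode types on SQL Server (table and column names are invented; as the comments below note, SSMA may also enlarge the declared lengths to allow for multi-byte data, but the lengths are shown unchanged here for simplicity):

```sql
-- Oracle source (UTF-8 database):
--   CREATE TABLE customers (
--       name   VARCHAR2(100),
--       region CHAR(2),
--       notes  CLOB
--   );

-- SQL Server target after customizing the type mappings:
CREATE TABLE customers (
    name   nvarchar(100),
    region nchar(2),
    notes  nvarchar(max)
);
```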

 

Unicode data type solution

With the Unicode datatype solution, source strings declared as national types (NVARCHAR2 and NCHAR) are automatically mapped to nvarchar and nchar, and the large object type NCLOB is automatically mapped to nvarchar(max). So there is no need to change the default mappings for these datatypes.
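Sketching the default mappings with a hypothetical table that uses the national datatypes (names invented; declared lengths shown unchanged for simplicity):

```sql
-- Oracle source:
--   CREATE TABLE products (
--       title       NVARCHAR2(50),
--       code        NCHAR(4),
--       description NCLOB
--   );

-- SQL Server target produced by SSMA's default mappings:
CREATE TABLE products (
    title       nvarchar(50),
    code        nchar(4),
    description nvarchar(max)
);
```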

Customizing Data Type Mappings

You can customize type mappings at the project level, object category level (such as all tables), or object level. Settings are inherited from the higher level unless they are overridden at a lower level. For example, if you map smallmoney to money at the project level, all objects in the project will use this mapping unless you customize the mapping at the object or category level.

 

The following procedure shows how to map data types at the project, database, or object level:

To customize data type mapping for the whole project, open the Project Settings dialog box:

 

a. On the Tools menu, select Project Settings.

b. In the left pane, select Type Mapping. The type mapping chart and buttons appear in the right pane.

 

Or, to customize data type mapping at the database, table, view, or stored procedure level, select the database, object category, or object in Oracle Metadata Explorer:

 

c. In Oracle Metadata Explorer, select the folder or object to customize.

d. In the right pane, click the Type Mapping tab.

 

Author : Ajay(MSFT), SQL Developer Engineer, Microsoft

Reviewed by : Smat (MSFT), SQL Escalation Services, Microsoft

Comments

  • Anonymous
    September 04, 2012
    I work at Microsoft CSS and I've had quite bit of an issue with the Data Migration from Oracle to SQL Azure. I was hammering the ends to finish my Migration Demo Projects. Many Many thanks for creating this subtle yet precise informative article for the Data Migration Folks out here on the field. Thank you Mr A!

  • Anonymous
    February 28, 2013
    SSMA has converted my (Unicode) VARCHAR2(N) to NVARCHAR(4xN). What I want is the same as Oracle, i.e. if I have a field that allows up to 10 Unicode characters (VARCHAR2(10)). Isn't that NVARCHAR(10)? Or in SQL Server can I only specify the length in bytes? Which would mean no real control. My NVARCHAR(40) could have 40 single byte characters or only 10 4-byte characters.

  • Anonymous
    March 01, 2013
    Duncan, This is by design. Conversion from Oracle's NVARCHAR2 -> SQL NVARCHAR doubles the length because, in SQL Server, NVARCHAR is a count of bytes rather than a count of characters. In the case of UTF-16, we have to account for surrogate pairs (a second 2 bytes of character data whose presence is indicated by the first two bytes).

    • Anonymous
      March 01, 2016
      I have oracle 11G and 12c and I have used Sun solaris 10 to host the DB. DB has used US7ASCII and WE8ISO8859P1 character set. I want to migrate this Oracle DB to MS SQL 2012. What type of character set/local will be used?
  • Anonymous
    March 03, 2013
    I don't believe NVARCHAR length is the count of bytes. Try this: CREATE TABLE tmp_unicode (col1 NVARCHAR(6)); INSERT INTO tmp_unicode (col1) VALUES (N'ердийн'); SELECT col1, LEN(col1), DATALENGTH(col1) FROM tmp_unicode; DROP TABLE tmp_unicode;

  • Anonymous
    March 05, 2013
    Prior to Oracle 11, nvarchar2 column lengths are measured in characters. SQL Server nvarchar columns are (and always have been) measured in bytes. We must account for the possibility of surrogate pairs (see en.wikipedia.org/.../Surrogate_pair) of up to two bytes length in some of the various UTF encodings. The only way we could prevent this is to actually scan the data at schema conversion time to determine there are no surrogate pairs present.

  • Anonymous
    March 10, 2013
    I don't understand. You said "SQL Server nvarchar columns are (and always have been) measured in bytes". My SQL demonstrates that is not the case - it inserts 12 bytes in a NVARCHAR(6) field. Can you explain that please.

  • Anonymous
    January 31, 2014
    Debarchan, Can you explain what happens in case of surrogate pairs?
