Fuzzy Lookup and Fuzzy Grouping in SQL ServerĀ Integration Services 2005
Surajit Chaudhuri, Kris Ganjam, Venky Ganti, Vivek Narasayya, and Theodore Vassilakis
Microsoft Corporation
September 2005
Applies to
Microsoft SQL Server 2005
Summary: This paper describes the Fuzzy Lookup and Fuzzy Grouping transformations that are part of SQL Server 2005 Integration Services (SSIS). These SSIS transformations are useful for improving the data quality of existing data as well as new data that is being loaded into your database. Fuzzy Lookup matches input records that are "dirty" (because of misspellings, truncations, missing or inserted tokens, null fields, unexpected abbreviations, and other irregularities) with clean records in a reference table. Fuzzy Grouping detects similarities between input rows and determines which rows are fuzzy duplicates by using their string values. Fuzzy Lookup and Fuzzy Grouping are useful primitives that can simplify a variety of data cleaning and preparation tasks that are frequently encountered in data warehousing. By customizing them for your domain, you can leverage general search and clustering algorithms inside the SSIS Designer while avoiding complex custom code. (13 printed pages)
Contents
Introduction
Getting Started with Fuzzy Lookup
Getting the Most Out of Fuzzy Lookup
Getting Started with Fuzzy Grouping
Getting the Most Out of Fuzzy Grouping
Understanding Performance
Conclusion
For More Information
Note This paper refers to the latest beta release of SQL Server 2005. The content is subject to change prior to the final release.
Introduction
Real-world data is "dirty" because of misspellings, truncations, missing or inserted tokens, null fields, unexpected abbreviations, and other irregularities. As a consequence, a large percentage of time and money that is spent on data warehousing projects goes toward the extract, transform, and load (ETL) phase. During ETL, new data is cleaned, standardized, and made consistent with existing data. The Fuzzy Lookup and Fuzzy Grouping transformations, available in Microsoft SQL Server 2005, help make the ETL process more resilient to several common errors observed in real data. They address generic matching ** and grouping problems without requiring an expert collection of domain-specific rules and scripts. By customizing Fuzzy Lookup and Fuzzy Grouping for your domain, you can leverage generic data cleaning algorithms inside the Business Intelligence Development Studio and avoid building complex custom rules and code.
Fuzzy Lookup enables you to match input records with clean, standardized records in a reference table. The matching process is resilient to errors that are present in the input records. Fuzzy Lookup returns the closest match and indicates the quality of the match. For example, customer information (name and address) that is input during a new sales transaction may not match exactly with any record in the Customers reference table, which consists of all current customers, because of typographical or other errors in the input data. Fuzzy Lookup returns the best matching record from the Customers reference table even if no exact match exists, and provides measures to indicate the match quality.
Fuzzy Grouping enables you to identify groups of records in a table where each record in the group potentially corresponds to the same real-world entity. The grouping is resilient to commonly observed errors in real data, because records in each group may not be identical to each other but are very similar to each other. For example, Fuzzy Grouping is useful for grouping together all records in a Customers reference table that describe a single real customer.
Fuzzy Lookup and Fuzzy Grouping provide easy-to-use solutions to complex, commonly encountered data cleaning problems. While they bear some relation to existing approaches, such as soundex, rule-based systems, edit-distance-based systems, and full-text search, Fuzzy Lookup and Fuzzy Grouping have several advantages:
- Fuzzy Lookup and Fuzzy Grouping use a custom, domain-independent distance function that takes into account the edit distance (for example, "hits" is distance 2 from "bit"), the number of tokens, token order, and relative frequencies. As a result, Fuzzy Lookup and Fuzzy Grouping achieve much finer discrimination than full-text searches because they capture a more detailed structure of the data.
- Because they are purely token-driven, Fuzzy Lookup and Fuzzy Grouping do not have a language-dependent component like soundex.
- Because they use more than just edit distance, Fuzzy Lookup and Fuzzy Grouping are not as easily misled by transpositions and can detect higher level patterns than an approach that uses only edit distance.
- Fuzzy Lookup and Fuzzy Grouping are tightly integrated with SSIS, which makes them easily usable with little or no custom programming for ETL tasks with SQL Server 2005.
The following sections provide a step-by-step guide to using and understanding Fuzzy Lookup and Fuzzy Grouping and cover some implementation and performance aspects of these transformations that are useful for users. This paper is intended to supplement Books Online by explaining certain aspects of Fuzzy Lookup and Fuzzy Grouping in more detail. For more information about options and configuration parameters, see the Books Online entries.
Getting Started with Fuzzy Lookup
Fuzzy Lookup can help find data in large tables when your data has a corrupt or incomplete string key. For example, if you want to look up customer information by name and address, you can use Fuzzy Lookup to find the information, even if your input does not match exactly what is stored in your reference table. The simplest package for use with Fuzzy Lookup consists of a single SSIS data flow task that contains a source, a Fuzzy Lookup transform, and a destination (Figure 1).
Figure 1. The simplest Fuzzy Lookup package
To construct the simplest Fuzzy Lookup package:
- Open Business Intelligence Development Studio.
- Create a new Integration Services Project, add a new package, click the Data Flow tab, and then accept the add a data flow item option.
- From the Control Flow Items section in the Toolbox, drag a Data Flow Task onto the control Flow surface. Double-click the new Data Flow Task or select the Data Flow tab.
- On the Data Flow surface, drag the OLE DB Source adapter from the Data Flow Sources section of the Toolbox. Drag a Fuzzy Lookup Transformation from the Data Flow Transformations section of the Toolbox and an OLE DB Destination adapter from the Data Flow Destinations section. Select the Source and drag the green arrow to the Fuzzy Lookup to create a path between the two. Also create a path between Fuzzy Lookup and the Destination by selecting Fuzzy Lookup and dragging the green arrow to the Destination.
- Double-click the OLE DB Source transform and configure it to point at your new data by selecting a connection and the input table that contains reference data that incoming records will be matched to.
- Double-click Fuzzy Lookup to open the custom user interface (UI). From the Reference table name drop-down menu, select the connection and table to which you want the transform, to your already warehoused reference data, to point.
- On the Columns tab, drag items you want to compare from Available Input Columns (from the OLE DB source) to Available Lookup Columns (from the reference table). For example, you might want to compare StreetAddress in the input with Address in the reference table.
- Select the check boxes for all items in Available Lookup Columns, and then click OK.
- Point the OLE DB Destination to a connection for which you can write a new table, and then click New. Accept the default creation statement, and you are now ready to run Fuzzy Lookup.
- To run the package you just created, right-click its name in the Solution Explorer window, and then select Execute.
SSIS Designer runs the package and provides detailed visual feedback about the pipeline. Depending on the size of the reference data, you may notice a delay while the error-tolerant index (ETI) is built. The ETI is the main data structure that Fuzzy Lookup uses at run time.
After the ETI is built, all the input rows are processed and the results are written to the destination. SSIS Designer gives you feedback about the progress of the pipeline by displaying the number of rows processed by each component. You can also place a Data Viewer on the pipeline by right-clicking the path between Fuzzy Lookup and the OLE DB destination. This allows you to visualize in real-time which rows Fuzzy Lookup has matched to your input rows. In addition to matching tuples, Fuzzy Lookup outputs a record-level similarity, individual column-level similarities, and a confidence score for each match result. For more information about these scores, see Interpreting the Results later in this paper.
Getting the Most Out of Fuzzy Lookup
The main stages of running Fuzzy Lookup are building the ETI, performing the lookup, and examining the output. The following sections provide further detail about each of these stages.
Understanding the Error-Tolerant Index
Fuzzy Lookup uses the Error-Tolerant Index (ETI) to find matching rows in the reference table. Each record in the reference table is broken up into words (also known as tokens), and the ETI keeps track of all the places in the reference table where a particular token occurs. In the address example, if your reference data contains 13831 N.E. 8th St, the ETI will contain entries for 13831, N, E, 8th, and St.. In addition, Fuzzy Lookup indexes substrings, known as q-grams, so that it can better match records that contain errors. The more unique tokens and the more rows in the reference table, the more entries and longer the occurrence lists in the ETI. The ETI will be roughly as big as your reference table. For more information about how the ETI grows in size with the reference data, see Understanding Performance later in this paper. The tokenization process is controlled by the Fuzzy Lookup custom property Delimiters. For example, if you want to index N.E. instead of N and E, remove the period from the list of delimiters. The consequence is that N.E. appears as a single token in the ETI and will be looked up as a unit at run time. Because delimiters are applied globally, First.Avenue would also appears as a single token.
Because ETI construction for a very large reference table may take a non-trivial amount of time, Fuzzy Lookup offers the option of storing an ETI on the server and re-using it at a later date. This option takes a snapshot of the reference table and allows you to avoid re-building an ETI every time you run Fuzzy Lookup. If your ETI takes too long to re-build for every run, consider creating it once and re-using it in subsequent runs. To do this, select Store new index on the Reference Table tab, and then specify a table name.
If you would like to store your ETI, but your reference data changes from time to time, you can also enable Maintain stored index. This feature installs a trigger on the reference table that detects modifications and propagates them to the ETI, keeping it up-to-date. If you do not install table maintenance, Fuzzy Lookup will match against a snapshot of the reference table as it existed when the ETI was created.
What Happens at Run Time
At run time, takes an input row and tries to find the best match or matches in the reference table as efficiently as possible. By default, this is done by using the ETI to find candidate reference records that share tokens or q-grams in common with the input. The best candidates are retrieved from the reference table and a more careful comparison is made between the two records. Once there are no more candidates that could be better than any match found so far, Fuzzy Lookup stops and moves on to the next input row.
For Fuzzy Lookup to find a match in the reference table using the ETI, the input and target record must share at least one token or q-gram in common which is stored in the ETI. For reference table records consisting of only a single short word, it is possible that Fuzzy Lookup may be unable to match a dirty input record that contains a spelling mistake because there is no common token or q-gram stored in the ETI. Be aware, also, that Fuzzy Lookup indexes only a subset of the all the possible q-grams in a given record for efficiency reasons. Fuzzy Lookup may fail to find a match due to this sampling process, although matches will be found with a high degree of probability if the records contain many q-grams. For datasets which have attributes whose values are predominantly a single short token, one alternative, if Fuzzy Lookup is having trouble finding matches which you think it should find, is to set the Exhaustive component property to True. This will cause Fuzzy Lookup to ignore the ETI and instead directly compare the input record to each and every record in the reference table. This approach is prohibitively time-consuming for large reference tables, so only attempt exhaustive search on small reference tables.
Each match returned by Fuzzy Lookup has multiple scores associated with it which quantitatively describe how good a match the returned reference table record is to the input record. Each score is a number between 0.0 and 1.0. Perhaps the most important score, is the record-level similarity score. This score measures the overall similarity between the records across all fuzzy match columns that were defined. A score of 1.0 means that the records matched exactly on each of the match columns, while scores less than 1.0 indicate progressively more dissimilar matches. A record-level similarity score of 0.0 indicates that Fuzzy Lookup was unable to find a match in the reference table. As mentioned above, this could be because there were no common tokens or q-grams between the input and target reference table record that were indexed in the ETI. In addition to the record-level similarity, Fuzzy Lookup returns column-level similarity scores which measure how similar the values in a particular column for the input and match result. As described below, these column-level scores can be used to fine tune the match quality and for further downstream processing of match results.
Fuzzy Lookup additionally returns an estimate of the confidence for each match returned. This can be used to decide whether or not to automatically accept or reject a match. For instance, the reference table might contain values "E. Virginia" and "W. Virginia". If the dirty input was simply "Virginia", both reference records would have very high Similarity, but the difference between their similarity values would be very small. This indicates that Fuzzy Lookup could not find a clear winner and that you may need to manually review the match results for this input record. Such decisions can be automated by adding a Conditional Split Transformation after the Fuzzy Lookup which makes an accept/reject decision based upon the values of similarity and confidence.
In determining best matches, the most important parameter is the MinSimilarity threshold. You can set this custom property by using the Fuzzy Lookup UI. A reference tuple will be returned only if it has a Similarity that is greater than or equal to the MinSimilarity threshold. By setting a high similarity requirement, Fuzzy Lookup will consider fewer candidates and, as a result, may not return any matches. If you set MinSimilarity low, Fuzzy Lookup will consider more candidates and may be more likely to find a match, but the search could take longer.
Note that you can set MinSimilarity and both the record-level and also at the column-level for each individual column. Any match result return must meet the thresholds set at ALL levels and for ALL columns. For instance, you might set a record-level MinSimilarity of 0.5, but require that ZipCode has MinSimilarity 0.9 and Name has MinSimilarity 0.4. Fuzzy Lookup will only return results that meet all of those criteria.
The factors that determine similarity scores include:
- The number of token or character insertions, deletions, substitutions, and re-orderings that need to be made to the input tuple to match a given reference tuple. For example, the input 122 First Lane will likely be considered closer to the reference 122 First Ln than the input 22 N.E. 1st Ln & Leary Way.
- The token frequencies from the reference table. Highly frequent tokens are generally considered to give little information about the goodness of the match. Relatively rare tokens are considered characteristic of the row in which they occur.
Setting the right threshold depends on the nature of your application and of your data. If you require a close match between your inputs and your reference, you should consider setting a high value for MinSimilarity, such as 0.90. If you are doing an exploratory project, you may be interested in examining weak matches as well as close matches, so you should set MinSimilarity to a lower value, such as 0.1. There is no firm rule that you can use to determine this range, so it is recommended that you experiment with your data set. Looking at the output from several runs can suggest optimal values to set. For example, you perform a first run by using a threshold of 0.1. You observe that a certain input is matched with a certain output with similarity 0.2. If the tuples are too dissimilar for your application (see Interpreting the Results for details), you can set the MinSimilarity to 0.3 for your next run and exclude the match as too dissimilar. Repeating this process for a few iterations on a small test set can help you determine what is appropriate for your application.
If you want to view more than the single best match for each input, set the MaxOutputMatchesPerInput property to a value larger than one. Fuzzy Lookup will then return up to that many matches for each input row. Note that increasing the value of this property may increase the time it takes to process each input row. For more information about the impact on performance, see Understanding Performance later in this paper.
Interpreting the Results
Interpreting your results depends on your application goals. If you want to show that some similar matches exist, you should set a low similarity threshold requirement and filter for high-confidence matches. When the reference table has close matches for an input tuple, the similarity is high. If there is a single record among all reference tuples that closely matches the input tuple, the confidence score is also high. Therefore, you can use similarity and confidence values to decide how you want to further process the results of a Fuzzy Lookup.
You may also want to understand why certain rows were not retrieved. The main reason is the ETI and the Fuzzy Lookup retrieval strategy. When Fuzzy Lookup indexes a token like committee, it also indexes sub-token elements comm., ommi, mmit, mitt, itte, ttee. This scheme helps to speed up retrieval and to recover from input errors. For example, if committee occurs in the reference data but comittee (with a single m) is in the input, Fuzzy Lookup may be able to find the correct reference row by looking up the sub-token mitt, even though the full input token is not indexed. This is one reason that the Fuzzy Lookup index is termed "error-tolerant."
However, the retrieval process is not perfect. For example, if the reference token Pattel is misspelled in the input as Patel (with a single t), none of the input token fragments (Pate and atel) will match the indexed reference fragments (Patt, atte, and ttel). As a result, Fuzzy Lookup must rely on the other tokens present in the tuple to perform the correct retrieval. If no other tokens are present in the row, Fuzzy Lookup will be unable to recover the correct reference row. Nonetheless, for rows that contain more than a single token, Fuzzy Lookup is generally able to retrieve several candidate rows based on tokens and fragments in the input.
Getting Started with Fuzzy Grouping
You can use Fuzzy Grouping to detect "fuzzy" or approximate duplicates in a collection of rows that have string attributes. For example, you could use Fuzzy Grouping to consolidate customer tables from different departments. The simplest package for use with Fuzzy Grouping consists of a single SSIS data flow task that contains a source, a Fuzzy Grouping transform, and a destination (Figure 2).
Figure 2. The simplest Fuzzy Grouping package
To construct the simplest Fuzzy Grouping package:
Open the SSIS Designer.
Drag the OLE DB source and destination transforms onto a data flow and connect them by using an instance of Fuzzy Grouping.
Point the OLE DB source at the table that contains the data with potential duplicates by selecting a connection and table name.
Note This table must contain some string columns for Fuzzy Grouping to analyze.
Double-click Fuzzy Grouping to open the custom UI, select the check boxes for all items in Available Input Columns (from the OLE DB source), and then click OK.
Point the OLE DB destination to a connection for which you can write a new table, and then click New. Accept the default creation statement and you are now ready to run Fuzzy Grouping.
To run the package that you just created, right-click its name in the Solution Explorer window, and then select Execute.
Depending on the size of your input data, you may experience a delay while Fuzzy Grouping buffers the data. Before rows flow further down the pipeline, Fuzzy Grouping invokes Fuzzy Lookup to build an ETI for the input data and stores the ETI on the temporary connection. As a result, sizeable objects may be placed on that connection. After the ETI on the input is built, all the input rows are processed and results are written to the destination. For more information about how the grouping is performed, see Getting the Most Out of Fuzzy Grouping later in this paper.
By default, Fuzzy Grouping outputs some additional columns called _key_out and _key_in. As rows flow through the pipeline, Fuzzy Grouping assigns an ID to each of them, which is the _key_in. When Fuzzy Grouping groups a certain collection of rows, it determines which row should be the representative. It then assigns all the _key_out columns of the rows in the group to the _key_in value of the representative. As a result, if you want to write only the representative rows to your output, you can filter the Fuzzy Grouping output through a conditional split transform that selects only rows where _key_in equals _key_out.
Getting the Most Out of Fuzzy Grouping
Fuzzy Grouping has fewer external parameters to tune than Fuzzy Lookup; nonetheless, understanding some of its internals can help you achieve optimal performance. Fuzzy Grouping uses Fuzzy Lookup under the covers to perform the grouping. For instance, Fuzzy Grouping passes its tokenization string intact to Fuzzy Lookup. At run-time, Fuzzy Grouping uses to Fuzzy Lookup to build a temporary ETI against the input data and uses it to determine which input rows are close to each other. Depending on the number of results it gets back and the resulting similarities between records, it generates groups.
Just as with Fuzzy Lookup, Fuzzy Grouping requires you to set the MinSimilarity threshold. It is important to remember that tuples will be grouped only if their similarity to the group representative is above the threshold set in the UI. Thus, if you run Fuzzy Grouping with a low threshold value and find that tuples that are too dissimilar are being grouped, increase the similarity. Fuzzy Grouping reports the similarity of each tuple to the representative tuple for the group. It is not possible to influence the choice of representative tuple at this time, except by altering the MinSimilarity threshold. If you want to split a group, set the required similarity higher than that reported number. If too few tuples are being grouped, you can perform a binary search on similarity values to determine how low to set the similarity threshold for certain tuples to be grouped. For example, if 0.9 results in too few groups and 0.7 in too many, try 0.8.
Note There is no way to force Fuzzy Grouping to group two tuples. Even setting the similarity to zero will not necessarily return a single group with all the tuples. Because group quality depends heavily on the particular semantics of your data, experimentation is the only way to find the right settings for your application.
Alternatively, you could attempt to use Fuzzy Lookup to build an ETI on your input data and perform a lookup for the row in question by using a high value for the number of matches to return and a high threshold for required similarity. However, Fuzzy Grouping incorporates the additional intelligence of appropriately grouping tuples into sets of duplicates along with a suggestion for a canonical representative.
Understanding Performance
Despite the simple interfaces, Fuzzy Lookup and Fuzzy Grouping are complex processes, and understanding their performance requires some analysis. The following sections explain common performance issues and provide sample measurements. As you will see, the primary determinant of Fuzzy Lookup and Fuzzy Grouping performance is data size. For Fuzzy Lookup and Fuzzy Grouping, this means the number of input rows, tokens, and bytes. For Fuzzy Lookup, there are additional considerations regarding the size of the reference data. The secondary determinant is the computing resources available, which includes memory, space on the database server, network bandwidth, and the way these resources are set up.
Setup Considerations
To obtain the best performance from Fuzzy Lookup and Fuzzy Grouping, it is useful to follow some guidelines regarding memory use and architecture.
To minimize memory requirements on your computer:
- Use the more lightweight DTExec.exe rather than the full SSIS Designer to execute packages in production.
- Drop unused columns in your pipeline because they require memory.
If you elect to run the SSIS client and the server on the same computer, you will avoid potential network problems but may encounter memory contention. You can use sp_configure and set the set max server memory setting to 256 to mitigate memory contention between the two processes. You can also set the MaxMemoryUsage Fuzzy Lookup custom property using the Advanced Editor in the SSIS designer.
All experimental trend-lines reported in the following sections were obtained using the following setup:
- A server that has an Intel Pentium III 1-gigahertz (GHz) processor, 512 megabytes (MB) of RAM, and Microsoft Windows Server 2003 Enterprise Edition.
- SQL Server running on the same machine as the SSIS client.
For recurring Fuzzy Lookup tasks in which the reference table is considerably larger than the typical input table, you should consider pre-computing the index. In those cases, re-building the index could dominate the running time of the actual lookup and make it worthwhile to manage the additional table.
By default, Fuzzy Lookup will load the ETI and reference table into available memory before starting to process rows. If you only have a few rows to process in a particular run, you can reduce this time by setting the WarmCaches property to False.
For large Fuzzy Lookup input tasks, you may want to consider using multiple SSIS clients together with a central server. In this setup, the central server hosts the reference table and the pre-computed ETI. Each SSIS client performs Fuzzy Lookup on a piece of the large input table tasks by using the central reference table and ETI. After the Fuzzy Lookups are complete, you re-merge all the output tables produced by the individual SSIS clients. You can also do this in a single package by using the conditional split transform to partition your input into multiple parts and then send each part into a separate Fuzzy Lookup transform. Since SSIS uses multiple threads for such pipelines, the matching process will run in parallel if your system has multiple processors. Note that, as of this writing, the following workaround is necessary to achieve full performance in this scenario. After each conditional split output, you must send the data into a Union All transform before feeding into each Fuzzy Lookup. This forces the pipeline to treat each output asynchronously and thereby use a separate thread for each conditional split output.
Note This architecture does not apply to Fuzzy Grouping.
Measurements
The greatest impact on Fuzzy Lookup and Fuzzy Grouping performance comes from the size of the data used. In the case of Fuzzy Lookup, this means the size of the reference and input tables. In the case of Fuzzy Grouping, this means the size of the input table. The size of the input matters in two ways:
- The number of rows and columns has the greatest impact on performance. The more data you have, the more resources Fuzzy Lookup and Fuzzy Grouping require. The figures in the following sections show specific data for various scenarios.
- The average number of tokens per string column on which a fuzzy match is performed also has an impact on performance. Fuzzy transforms are not meant for document retrieval. For longer fields (greater than 20 tokens), it might be more efficient to use the SQL Server full-text indexing features.
The following sections present some detailed graphs for Fuzzy Lookup and Fuzzy Grouping performance. These graphs are intended to suggest trends for the transforms and not to give absolute numbers. Some graphs express the proportion of increase or decrease over a baseline measurement. In those cases, the graph can be used to infer roughly what input size increase would lead to a doubling in running time.
Fuzzy Lookup Performance
The two main variables that determine Fuzzy Lookup performance are the size of the reference data and the size of the input data. These variables correspond to the two phases of Fuzzy Lookup: building the ETI and actually performing the lookup. In general, these tasks are linear in the size of their respective inputs. Depending on the size of your input, the actual lookup time may be smaller or larger than the ETI building time.
Figure 3 illustrates the time required to build the ETI versus the size of the reference table. As explained earlier, building the ETI can be performed once and the results saved, so as to avoid paying the cost every time. The size of the resulting ETI is at most twice the size of the indexed reference columns. You can track progress by observing the progress messages reported in the Execution Results tab of the SQL Server Business Intelligence Development Studio.
Figure 3. ETI creation time versus reference table size
The size of the ETI also affects the run time. The more data the ETI contains, the more expensive it is to use it because it represents a larger search space. Figure 4 illustrates the run time performance for Fuzzy Lookup versus the size of the reference table for a fixed input size. The curve in Figure 4 also depends on the distribution of tokens to various rows which can play a large role in lookup times. Having more memory can mitigate the effects of a large reference table and ETI. At run time, Fuzzy Lookup caches portions of the ETI in memory up to the constraints specified in the MaxMemoryUsage custom property. You must use the Advanced Editor to set this limit.
Figure 4. Fuzzy lookup runtime versus reference table size
After data size, the greatest impact on performance comes from the remaining transform parameters, such as number of matches to return, similarity threshold required, and number of columns on which to match.
The more matches Fuzzy Lookup is required to return, the slower the search is. This is because the search continues into the list of candidate tuples until enough matches are found.
The greater the match similarity that is required, the faster the Fuzzy Lookup search will be. This is because Fuzzy Lookup can discard candidate matches more aggressively during searches that require a high similarity.
Fuzzy Grouping Performance
Fuzzy Grouping depends on fewer variables than Fuzzy Lookup. The main variable that affects performance for Fuzzy Grouping is the size of the input data. Figure 5 shows the cost as the input size becomes large. The non-linearity in the early part of the graph is due to the ETI build-time. As the match threshold is increased, execution time eventually drops because no close tuples are found to create groups. This is illustrated in Figure 6. Figure 6 also shows that the distribution of the data plays a large role and may result in some non-linearities in running-time. Be aware that running a large Fuzzy Grouping task will result in large temporary objects on the server that you specify as your temporary connection. During the Fuzzy Grouping pre-processing step, you may track progress by viewing the progress messages reported in the Execution Results tab of the SQL Server Business Intelligence Development Studio.
Figure 5. Fuzzy Grouping time versus input size for MinSimilarity=0.8
Figure 6. Fuzzy Group time versus MinSimilarity threshold for 20K rows
Conclusion
The Fuzzy Lookup and Fuzzy Grouping transforms are domain-independent primitives that can be useful in data cleaning and preparation. They employ a token-based notion of distance that you can use to perform fine-grained lookup and grouping operations on your data. Further details on the distance function and the overall architecture of the component can be found in the research papers cited below. Because Fuzzy Lookup and Fuzzy Grouping are implemented as SSIS components, you can use them transparently in your larger ETL processes and take advantage of SSIS infrastructure such as logging, events, and error tracking, without having to use custom programming and its associated expense. While some resources are required to compute the transforms, Fuzzy Lookup and Fuzzy Grouping automate operations that were once difficult or impossible without significant manual intervention and customization of tools.
For More Information
Research Papers
- Eliminating Fuzzy Duplicates in Data Warehouses. Proceedings of the 28th International Conference on Very Large Databases (VLDB 2002), Hong Kong. Rohit Ananthakrishna, Surajit Chaudhuri, Venkatesh Ganti.
- Robust and Efficient Fuzzy Match for Online Data Cleaning. ACM SIGMOD 2003, San Diego. Surajit Chaudhuri, Kris Ganjam, Venkatesh Ganti, Rajeev Motwani.