SQL Server Data Mining Programmability
Bogdan Crivat
Microsoft Corporation
March 2005
Applies to:
Microsoft SQL Server 2005
SQL Server Data Mining
Summary: Learn more about the new APIs for data mining in SQL Server 2005, and get an introduction to several common development scenarios. (50 printed pages)
Contents
Introduction
Communicating with Microsoft Analysis Services 2005
General Purpose APIs
Adomd.NET as a General Purpose API
OLE DB as a General Purpose API
Analysis Management Objects – AMO
Progress Notifications: Working with Trace Objects
Data Mining Without a Server: Local Mining Models
Extending the Functionality of SQL Server Data Mining
Suggestions for Application Scenarios
Conclusion
Appendix 1: Common Operations with Microsoft Analysis Services 2005 Server and the Protocol Format of the Requests
Appendix 2: Using OLE DB from ADO.NET and C++
Introduction
With Microsoft SQL Server 2005, the collection of statistical techniques and machine learning algorithms generically known as data mining is brought to a new level. The most important change is that, in SQL Server 2005, data mining changes its target audience. No longer just a scientific lab instrument addressing a limited number of highly skilled individuals, SQL Server Data Mining now gains ubiquity as a handy tool for developers, ready to be used in a wide range of applications. Most applications, from spreadsheets to Internet games, from peer-to-peer communication systems to application servers, share one common characteristic: they have to process data. In doing this, they use certain standard APIs for accessing data. With SQL Server 2005 Data Mining, the same kind of APIs can be used to embed machine intelligence in data processing systems.
This article walks you through the opportunities for embedding SQL Server Data Mining in your application. The paper focuses on the programmability of the suite; that is, on ways of writing code to make use of data mining or to enhance the set of features provided by the server. We will present some architectural aspects of the Microsoft SQL Server data mining solution, followed by a list of common tasks that can be executed by the data mining engine. We'll continue by enumerating the client-side programmability APIs offered with the product (Adomd.NET and OLE DB). A full section is dedicated to the Analysis Management Objects administrative API. After that, we'll present the ways of expanding the functionality of the server by creating new stored procedures and adding new algorithm plug-ins and viewers.
Data mining applications can also be built by integrating with OLAP, Reporting Services, or Integration Services; however, this article strictly deals with accessing or enhancing data mining functionality through code.
Data mining and OLAP are both components of the same server, Microsoft Analysis Services. The client-side APIs described in this article can be used for both of these components. However, this paper focuses on data mining. The scenarios, code examples, and metadata objects described are related to data mining.
Communicating with Microsoft Analysis Services 2005
Let's start by looking at the communication protocol that is understood by Microsoft Analysis Services 2005 (and therefore by the data mining server). This protocol is implemented by the clients that ship with Analysis Services 2005 (such as the Microsoft OLE DB Provider for Analysis Services 9.0 or Adomd.NET) and must be implemented by any other client.
Microsoft Analysis Services 2005 uses SOAP as the outermost communication layer. (For more information on SOAP, see the SOAP 1.1 W3C Note). SOAP defines a way for applications to expose a set of methods that can be invoked through XML messages. The methods exposed by Microsoft Analysis Services are defined in XML for Analysis, or XMLA.
The XMLA specification was proposed by a group of over 20 leading Business Intelligence vendors (including Microsoft Corporation, Hyperion, and SAS Institute) as a standard interface for OLAP and data mining technologies. For more details on XMLA, please check here.
XMLA defines two types of requests that can be sent to a server, together with the format that the server should use in returning the responses. These request types are 'Discover' and 'Execute.' 'Discover' can be used to obtain information and metadata from the server. As an example, 'Discover' can be used to obtain a list of mining models on the server, together with their characteristics (column descriptions, algorithms, and so on). 'Execute' is used to execute commands against a server, such as creating a new catalog or mining model, training a model, or running a query.
DDL and DMX
The 'Execute' command of XMLA is used for a variety of tasks. Here are a few examples:
- Create commands: such commands create new metadata objects on the server, and they consist of a full or partial description of the properties of the objects to be created, such as name, data bindings, columns for mining models and structures, the data mining algorithm for mining models, and so on.
- Alter commands: these commands change the properties of existing server metadata objects.
- Drop commands: these commands remove metadata objects from the server.
- Process commands: these commands initiate the training sequence of a metadata object based on previously defined bindings to a training data set.
- Statements (queries).
When an XMLA 'Execute' request describes an object (such as in Create or Alter statements) or refers to an object as the target of an action (such as in Process or Drop), the content of the command consists of Microsoft Analysis Services Data Definition Language (DDL). DDL is the internal representation of metadata and metadata operations in Analysis Services 2005. Objects on the server are stored in DDL, and Business Intelligence projects consist of DDL fragments. For more details on DDL, search the SQL Server 2005 Books Online for "Analysis Services scripting language."
On the other hand, when an XMLA request is a statement, the content of the XMLA 'Execute' request uses a query language that, for data mining tasks, is DMX (Data Mining eXtensions), a language defined in the OLE DB for Data Mining specification, available here.
Microsoft Analysis Services 2005 can also execute statements using a different query language, MDX, designed for the OLAP components. We will focus here on DMX.
The distinction is very important: DDL is an XML-like language used by Analysis Services 2005 for describing, managing, and referring to metadata objects. DDL naturally extends XMLA.
DMX is a SQL-like language designed specifically for data mining. Very much like SQL, DMX contains some constructs that allow creating and manipulating the metadata objects (think of CREATE TABLE or INSERT INTO statements in SQL; they have their equivalents, CREATE MINING MODEL and INSERT INTO <Mining Model>, in DMX). However, for the task of managing metadata objects, DMX is less flexible than DDL. DMX statements do not extend XMLA; they can be included into certain XMLA constructs.
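For illustration, here is a minimal sketch of these DMX equivalents; the model, data source, and column names are hypothetical, and the statements would be sent through any of the APIs described later in this article:

CREATE MINING MODEL GenderModel
(
    CustomerID LONG KEY,
    Age LONG CONTINUOUS,
    Gender TEXT DISCRETE PREDICT
) USING Microsoft_Decision_Trees

INSERT INTO GenderModel (CustomerID, Age, Gender)
OPENQUERY(MyDataSource, 'SELECT CustomerID, Age, Gender FROM Customers')

The equivalent DDL Create and Process commands express the same operations as XML elements and expose additional options (such as detailed data bindings) that the DMX forms do not.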
Appendix 1 at the end of this article contains more details on the layout of the protocol understood by Microsoft Analysis Services 2005.
The SOAP part is relatively easy to handle; most development tools provide help for authoring, sending, and receiving SOAP envelopes. But then, whether a request is DDL or DMX, formatting it properly in an XMLA statement is not particularly handy, so developers need something better than just an XML stream for communication. This is where the programming APIs save the day. These APIs wrap the jobs of authoring XMLA requests and parsing XMLA responses, and expose to the developer a more logical view of the communication, handling the internal details. There are a few different APIs that can be used with Microsoft SQL Server Data Mining. Choosing one over another depends on the type of requests to be performed by the client and on the environment of the client.
The next section will analyze the typical client requests in a data mining client/server communication. The following sections will then present the data mining programming APIs and the differences between them, based on how they can handle these typical requests and on the client environment they support.
Data Mining Tasks
The SQL Server Data Mining server supports multiple kinds of requests. For each request type we will present the DDL command that describes it as well as the DMX equivalent, where available:
- Metadata discovery: these requests allow a client application to iterate over the metadata objects already existing on the server and select one or more useful ones, based on such properties as columns of the mining models, algorithm used, and dataset used in training. The XMLA specification defines the Discover command for this kind of request, and there is not a direct DMX or DDL equivalent.
- Metadata definition: these requests consist of defining data structures on the server that logically represent the problem space. These data structures are mining structures and mining models, and they are organized in catalogs (databases). The XMLA command for such requests is Execute, containing DDL Create or Alter statements with the definition of the object to be created or altered. The DMX CREATE statement may be used instead, although it is less flexible than the DDL statements.
- Model training: these requests launch the training process for the data mining models on the server. The training may use predefined bindings (included in the metadata definition), or may describe the training dataset only for the current processing operation. The XMLA command is again Execute, with a Process DDL statement inside. In DMX, training can be achieved with the INSERT INTO command.
- Querying the models: these requests include (but are not limited to) what is often called scoring or predicting in the data mining literature. We will define querying as making use of the patterns found by mining models. These requests consist of sending DMX statements to the server.
- Subscribing to progress notifications: these requests are useful to display a progress indicator during a lengthy processing operation or to inspect the state of the server at a given moment. Such requests cannot be created in DMX. They involve subscribing to a trace, a request that is transmitted as an XMLA Execute command with a DDL Subscribe statement inside.
With these tasks in mind, we can now enumerate the programming APIs and their characteristics. The last task in the list, working with progress notifications, is the subject of a separate section of this article, so it will not be described for each API.
General Purpose APIs
We will explore further two general purpose APIs: Adomd.NET and OLE DB. They differ mostly by the environment they target: OLE DB (OLE for Databases) addresses mostly unmanaged applications (although it can be used inside managed applications as well), while Adomd.NET explicitly addresses managed applications. As we will see below, these APIs are very similar in structure. They are also based on well-known database access paradigms.
Adomd.NET as a General Purpose API
Adomd.NET is the managed version of the older ADOMD library provided with Microsoft Analysis Services 2000. This library was specifically designed as a client for Microsoft Analysis Services 2005. It implements the ADO.NET data access paradigm (including the standard interfaces, such as IDbConnection, IDbCommand, IDataReader, and so on) and extends that paradigm with many features specific to Microsoft Analysis Services 2005.
In the .NET framework, the System.Data namespace enables you to build components that efficiently manage data from multiple data sources. For more information, please consult the MSDN documentation for the System.Data namespace. Among many features, System.Data defines a set of interfaces to be implemented by .NET data providers that access relational databases. Now, Microsoft Analysis Services is a multi-dimensional database, functionally a superset of a relational database. Adomd.NET is a .NET data provider, implementing the standard System.Data interfaces for .NET data providers, enhanced for Analysis Services 2005, adding new features that are not used in the relational scenario.
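Because of this, generic data access code that relies only on the standard System.Data interfaces (IDbConnection, IDbCommand, IDataReader) can work against Analysis Services. A minimal sketch, assuming the Microsoft.AnalysisServices.AdomdClient and System.Data namespaces are referenced and a mining model named DecisionTree1 exists in the MyCatalog database:

IDbConnection connection = new AdomdConnection(
    "Data Source=localhost; Initial Catalog=MyCatalog");
connection.Open();
IDbCommand command = connection.CreateCommand();
command.CommandText = "SELECT NODE_CAPTION FROM DecisionTree1.CONTENT";
// The reader returned here is an AdomdDataReader, used through IDataReader
IDataReader reader = command.ExecuteReader();
while (reader.Read())
{
    Console.WriteLine(reader.GetString(0));
}
reader.Close();
connection.Close();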
Adomd.NET is installed as part of the SQL Server 2005 Connectivity Components and will also be available as a stand-alone download. The Adomd.NET data provider must be installed on the client machine before it can be used in applications.
The code examples in this section use C# 2.0, which comes with Visual Studio 2005. After a simple "translation" process, they should work with any managed language in the Visual Studio suite (such as Visual Basic .NET or Visual C++ .NET).
To use Adomd.NET in an application, one must start by adding a reference to the Microsoft.AnalysisServices.AdomdClient.dll assembly and then use the reference with code such as this:
using Microsoft.AnalysisServices.AdomdClient;
The next step is to connect to a Microsoft Analysis Services 2005 server:
AdomdConnection conn = new AdomdConnection();
conn.ConnectionString = "Data Source=localhost; " +
"Initial Catalog=MyCatalog";
conn.Open();
In the list of data mining tasks, the first one was discovering metadata objects. As we mentioned at the time, discovering metadata is implemented through the XMLA Discover command. Adomd.NET provides two easy ways to discover metadata. The first way is very similar to AMO:
foreach (MiningModel model in conn.MiningModels)
{
Console.WriteLine(model.Name);
}
The Microsoft Analysis Services team put in a lot of effort to ensure that the commonly used schemas are now exposed as collections of objects in the Adomd.NET Connection class. With code like the preceding, you can get access to the collection of mining structures in a database, the list of mining models in a structure or in the whole database, and the columns for both models and structures. Going beyond the capabilities of AMO, Adomd.NET even allows browsing the hierarchical model content with code like this:
foreach (MiningContentNode node in model.Content)
{
foreach( MiningContentNode childNode in node.Children )
{
// Use the node properties
}
}
Beyond data mining, the collections exposed by the Adomd.NET Connection object cover the OLAP schemas in Microsoft Analysis Services 2005. You can, therefore, explore Cubes and Dimensions in the same way.
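For instance, a short sketch of exploring the OLAP metadata through the same connection (assuming the current catalog contains cubes) could look like the following:

foreach (CubeDef cube in conn.Cubes)
{
    Console.WriteLine(cube.Name);
    foreach (Dimension dim in cube.Dimensions)
    {
        Console.WriteLine("  " + dim.Name);
    }
}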
The second way of discovering server metadata is more similar to the OLE DB approach; that is, execute a Discover request, sending a schema GUID and a set of restrictions, and get back a tabular result that can be traversed. Some of the schemas return hierarchical tabular results (nested tables). This is why the result of such a discovery operation is, in Adomd.NET, a DataSet object (as opposed to ADO.NET, which returns a data table). A DataSet allows expressing relations between tables, and therefore a DataSet can contain the nested tabular result from some discovery operations. The following code snippet discovers the content of a mining model and uses the DataSet relations to get the nested rows corresponding to the NODE_DISTRIBUTION nested table of the schema:
Guid modelContent = new Guid("{3ADD8A76-D8B9-11D2-8D2A-00E029154FDE}");
object[] arRestrictions = new object[3];
// Restrict the discovery to the content of model DecisionTree1 in
// catalog MyCatalog. The restrictions are specified by ordinal.
// The second restriction (MODEL_SCHEMA) is ignored here
arRestrictions[0] = "MyCatalog";
arRestrictions[1] = null;
arRestrictions[2] = "DecisionTree1";
DataSet dsContent = conn.GetSchemaDataSet(
modelContent, arRestrictions);
// Consistency check: ensure that there is exactly 1 relation
Debug.Assert( dsContent.Relations.Count == 1);
DataRelation relation = dsContent.Relations[0];
DataTable topTable = relation.ParentTable;
foreach (DataRow row in topTable.Rows)
{
// Use columns of the top level table
Console.WriteLine("NODE_CAPTION=" + row["NODE_CAPTION"]);
Console.WriteLine("NODE_UNIQUE_NAME" + row["NODE_UNIQUE_NAME"]);
// Fetch the nested rows, as described by the relation
DataRow[] distRows = row.GetChildRows(relation);
foreach (DataRow distRow in distRows)
{
// Use columns of the nested table
Console.WriteLine(distRow["ATTRIBUTE_VALUE"]);
}
}
Very much like OLE DB, Adomd.NET provides command objects that can be executed to send requests to the server. These requests can be DDL or DMX; therefore they can be used to create or alter existing metadata objects (DDL and DMX), for training mining structures and models, as well as for querying existing models (DMX only). One can differentiate here between metadata creation and processing, on one side, and DMX queries on the other side. The difference is that, while DMX queries are expected to return a tabular result (hierarchical or not), the metadata creation and processing statements only return a success or failure code. Adomd.NET exposes a few versions of the Execute method, two of them being particularly useful for success/failure requests and for requests that return tabular data.
We'll exemplify with a DDL process statement, but the same code can be used with any DDL statement (such as Alter, Create, or Drop).
AdomdCommand cmd = new AdomdCommand();
cmd.Connection = conn;
cmd.CommandText = " <Process " +
"xmlns=\"https://schemas.microsoft.com/analysisservices/2003/engine\">"+" <Type>ProcessDefault</Type>" +
" <Object>" +
" <DatabaseID>MyCatalog</DatabaseID>" +
" <MiningStructureID>Structure1</MiningStructureID>" +
" </Object>" +
" </Process>";
cmd.ExecuteNonQuery();
Similarly, ExecuteNonQuery could execute a DMX statement to generate the same outcome:
"INSERT INTO MINING STRUCTURE Structure1"
When ExecuteNonQuery is invoked, the statement will return if the request completes successfully, and throw an exception if the request fails. In the latter case, the exception contains details such as the error message returned by the server.
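As a sketch, the same ExecuteNonQuery pattern wraps the DMX statement; the try/catch below simply mirrors the success/failure behavior just described:

AdomdCommand trainCmd = new AdomdCommand();
trainCmd.Connection = conn;
// DMX equivalent of the Process DDL statement shown above
trainCmd.CommandText = "INSERT INTO MINING STRUCTURE Structure1";
try
{
    // Returns only if the request completes successfully
    trainCmd.ExecuteNonQuery();
}
catch (System.Exception ex)
{
    // The exception carries the error message returned by the server
    Console.WriteLine(ex.Message);
}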
The second version of the Execute method is ExecuteReader. This method should be invoked when the result of the statement is known to be tabular. As we mentioned before, there are quite a few situations when the tabular result returned by Microsoft Analysis Services 2005 is hierarchically organized. For example, let's consider a mining model, Model2, which predicts the gender of a customer and the list of products used by that customer based on the customer's age and the color of his or her car. Of course, such a model might not be of much use in real life. However, it is a simple example of a DMX statement returning hierarchically organized tabular content.
The following code uses an AdomdCommand object to send a prediction query to the server and then reads both the top-level result (the predicted gender) and the nested results (the predicted list of products):
cmd.CommandText = "SELECT Gender, Products FROM Model2 " +
"NATURAL PREDICTION JOIN " +
"( SELECT 30 AS Age, 'Black' as CarColor) AS T";
AdomdDataReader rdr = cmd.ExecuteReader();
while (rdr.Read())
{
// Make use of the top level result, the predicted Gender
Console.WriteLine( rdr.GetString(0) );
// Get a nested reader for the nested content in column 1
AdomdDataReader nestedReader = rdr.GetDataReader(1);
// Read the nested content and use the nested data
while (nestedReader.Read())
{
Console.WriteLine(nestedReader.GetString(0));
}
nestedReader.Close();
}
The call to GetDataReader will return a new instance of the AdomdDataReader class, which is initialized to provide access to the nested table included in the column with the ordinal specified as argument. If the specified column is not in fact a nested table, then the method will throw an exception. One can use code like the following to determine whether a column in a data reader represents a nested table or not:
if (rdr.GetFieldType(1) == typeof(AdomdDataReader) )
{
// If here, then column 1 in rdr is a nested table
}
In a real life scenario, it is most likely that predictions such as these will be performed over some user input. A common scenario is a Web application where the user specifies the age and the car color and the application code is supposed to generate a DMX query for predicting the list of products. When the developer has a good understanding of DMX and of the data types involved, the query can be created by concatenating the string representation of the values entered by the user. However, such an approach is potentially dangerous, as it leaves room for DMX injection or for development errors. A much better approach is to parameterize the query and pass the user input as parameter values. Adomd.NET provides great support for parameterized DMX queries, such as the following. The code for reading the result is not included, as it is the same as in the previous code snippet.
cmd.CommandText = "SELECT Gender, Products FROM Model2 " +
"NATURAL PREDICTION JOIN " +
"( SELECT @Age AS Age, @Color as CarColor) AS T";
AdomdParameter paramAge, paramColor;
paramAge = cmd.CreateParameter();
paramAge.Direction = ParameterDirection.Input;
paramAge.ParameterName = "Age";
cmd.Parameters.Add(paramAge);
paramColor = cmd.CreateParameter();
paramColor.Direction = ParameterDirection.Input;
paramColor.ParameterName = "Color";
cmd.Parameters.Add(paramColor);
cmd.Parameters["Age"].Value = 30; // User Input here
cmd.Parameters["Color"].Value = "Black"; // User Input here
AdomdDataReader rdr = cmd.ExecuteReader();
Note that the parameters do not have to be scalar values. Consider such DMX statements as
INSERT INTO MyModel(Col1, Col2)
OPENQUERY(DataSource, "SELECT ColA, ColB FROM Table") AS T
Or
SELECT PREDICT(Products, 5) FROM MyModel NATURAL PREDICTION JOIN
OPENQUERY(DataSource, "SELECT ColA, ColB FROM Table") AS T
Both these statements use an OPENQUERY function to describe some tabular content that is to be used by the server during execution (for training a mining model and for prediction, respectively). Using Adomd.NET, one can replace the OPENQUERY with a parameter and pass some tabular content to the server. The queries would look like:
INSERT INTO MyModel(Col1, Col2)
@MyTabularContent AS T
Or
SELECT PREDICT(Products, 5) FROM MyModel NATURAL PREDICTION JOIN
@MyTabularContent AS T
For such cases, the MyTabularContent parameter can be a DataTable object or an implementation of the IDataReader .NET interface. While DataTable may be easier to use, the IDataReader interface has the advantage of not needing to keep all the data in memory. An example of an IDataReader implementation is a SqlDataReader object returned by the execution of a SQL query in the client application. Of course, a SQL query could also be requested inside an OPENQUERY function, but this requires that the server has access to the SQL database. Tabular parameters were designed specifically for two scenarios (a short sketch follows the list below):
- The data is in a database visible to the client, but not visible to the server (IDataReader should be used for these scenarios).
- The data is in the memory of the client application (DataTable can be used here).
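The following is a minimal sketch of the second scenario, using an in-memory DataTable (from System.Data) as the tabular parameter; the model name, column names, and data are purely illustrative. For the first scenario, the parameter value would instead be an IDataReader, such as a SqlDataReader returned by a query against the local database.

// Build an in-memory DataTable to be shipped to the server as a parameter
DataTable myTabularContent = new DataTable();
myTabularContent.Columns.Add("ColA", typeof(int));
myTabularContent.Columns.Add("ColB", typeof(string));
myTabularContent.Rows.Add(35, "Water Bottle");

AdomdCommand insertCmd = new AdomdCommand();
insertCmd.Connection = conn;   // an open AdomdConnection
insertCmd.CommandText = "INSERT INTO MyModel(Col1, Col2) @MyTabularContent AS T";
insertCmd.Parameters.Add(new AdomdParameter("MyTabularContent", myTabularContent));
insertCmd.ExecuteNonQuery();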
A unique feature of Adomd.NET is its ability to return certain columns with XML content. Microsoft Analysis Services 2005 returns such columns for particular queries. An example of such a query is the following:
SELECT * FROM MyModel.PMML
This statement will return the content of the MyModel model in the PMML 2.1 format (assuming that the data mining algorithm used by MyModel supports PMML). Besides meta-information such as the model name and the size of the PMML buffer, the result of this statement contains a specific column, MODEL_PMML, which contains PMML (an XML format designed by the Data Mining Group specifically for representing mining model content).
This XML content can be accessed as a string, but Adomd.NET has the ability to detect such XML columns (based on specific column type information sent by the server) and expose them as a System.Xml.XmlReader object. Such columns can also be read as strings. The code below demonstrates using this feature and accessing a column with XML content both as a string and as an XmlReader.
cmd.CommandText = "SELECT * from [MyModel].PMML";
AdomdDataReader rdr = cmd.ExecuteReader();
// Traverse the response and read column 5, MODEL_PMML
while (rdr.Read())
{
// Get the column value
object objXMLValue = rdr.GetValue(5);
if (objXMLValue is System.Xml.XmlReader)
{
//cast to an XML reader
System.Xml.XmlReader pmmlReader =
(System.Xml.XmlReader)objXMLValue;
//read the PMML here
}
// Get the same column as a string
string strPMMLAsString = rdr.GetString(5);
}
Another feature of Adomd.NET, Sequential Access to the server response, is discussed and exemplified in the Progress Notifications: Working with Trace Objects section.
Adomd.NET is the most flexible and the easiest to use client API for Microsoft Analysis Services 2005. It is also the recommended solution for writing .NET applications that target Microsoft Analysis Services. It provides a wide range of features: AMO-like access to server metadata objects, execution of both DDL and DMX statements, and full support for parameters in DMX statements.
OLE DB as a General Purpose API
On the Microsoft Windows operating systems, OLE DB is one of the most frequently used APIs for data access. OLE DB is a specification for OLE objects. These objects implement a set of standard interfaces. The standardization of these interfaces allows one single programming model to access practically any data source. Microsoft SQL Server Data Mining comes with a provider for OLE DB. This provider is what translates the OLE DB programming model into the data mining world.
For more details on the OLE DB, please visit the MSDN OLE DB page.
OLE DB can be used directly from various programming languages. In native (unmanaged) C++, the OLE DB objects can be created and methods can be invoked directly. Also, the ATL Consumer Templates provide a set of helpful classes for OLE DB consumer applications. From Visual Basic or VBA, OLE DB providers can be used through ActiveX Data Objects (ADO), a simplified wrapper. From the managed languages, one can use the ADO.NET library to call into OLE DB providers.
The OLE DB programming model revolves around a few objects. First, a data source object must be initialized and directed to the server. Then, a communication session must be initiated on that data source. The third important part of the OLE DB architecture is the command object, which wraps a server request. This server request is sent as part of the session. The server response is the fourth major object involved in the OLE DB architecture. Most of the data mining tasks can be implemented through OLE DB command objects. For example, a DMX statement can be sent through a command object, and so can a DDL statement for creating a new metadata object. The metadata discovery statements cannot be sent through a command; OLE DB defines an interface, IDBSchemaRowset, implemented by the session object, for this purpose. We should note here that ADO and ADO.NET expose a single Connection object that encapsulates both the Data Source and the Session OLE DB objects. The functionality exposed by both these standard OLE DB objects is available in the Connection object in ADO or ADO.NET.
All the OLE DB objects (data source, session, command, and response) are usually implemented in a COM component called an OLE DB provider.
Figure 1 below describes the architecture of an OLE DB solution:
Figure 1. Architecture of an OLE DB solution
As one can see from the diagram, a prerequisite to using OLE DB for connecting to a server is to have the respective OLE DB provider installed on the client machine. The OLE DB provider for Microsoft Analysis Services 2005 is installed as part of the Connectivity Components during SQL Server 2005 setup.
An OLE DB connection is always initiated by means of a connection string. The connection string contains a set of semicolon-separated name–value pairs, called properties, which describe the specific parameters of the OLE DB connection to be initiated. All OLE DB wrappers (such as the ATL consumer templates, ADO, or ADO.NET) initiate a connection based on a connection string. Here is an example of an OLE DB connection string that can be used for connecting to Microsoft Analysis Services 2005:
"Provider=MSOLAP; Data Source=localhost; "
The first property, Provider, describes the specific provider to be instantiated.
"MSOLAP" is the name of the OLE DB provider for Microsoft Analysis Services. Note that this is a version-independent name. If multiple Analysis Services OLE DB providers are installed on the same machine (such as the one coming with Analysis Services 2005 and the one coming with Analysis Services 2000), then a more explicit name may be used to specify the version: "MSOLAP.2" for the Analysis Services 2000 version, "MSOLAP.3" for the Analysis Services 2005 version. The second property, "Data Source," indicates the data source to connect to. It is usually a machine name but it may have different semantics. For example, it can be a file name or an Internet URL, as we'll see later in this article, in the "HTTP Pump" section. You can find out more on the properties supported by the OLE DB provider for Analysis Services by searching this topic in the SQL Server Books Online.
The following code snippet presents usage of OLE DB from a VBA application, using the ADO wrapper. Some functionally similar code in C# (through ADO.NET) and in unmanaged C++ is available in Appendix 2 of this article, Using OLE DB from ADO.NET and C++.
To use the ADO wrapper, add a reference in your Visual Basic (or VBA) project to the Microsoft ActiveX Data Objects library (the most recent library version on your computer). The following code snippets were tested using version 2.8.
1 Dim Conn As ADODB.Connection
2
3 Set Conn = New ADODB.Connection
4 Conn.Open ("Provider=MSOLAP.3; Data Source=localhost;" _
5 & "Initial Catalog=MyCatalog")
A new ADODB Connection object is declared and instantiated (lines 1 and 3), then opened (lines 4 and 5). As we mentioned, the ADO Connection object encapsulates two OLE DB objects: the data source and the session. In the connection string, you will notice one property that was not discussed before: "Initial Catalog." It identifies the database on the server to be used by this connection. Let's use this connection object for accomplishing the data mining tasks we enumerated before.
For metadata discovery, the ADO Connection exposes (by wrapping the IDBSchemaRowset interface of the OLE DB session object) the OpenSchema function.
OpenSchema takes three parameters:
- An enumeration describing the schema to be discovered.
- A set of restrictions to be applied to the discovery operation.
- Optionally, a GUID describing the provider-specific schema.
We want to discover the mining models that reside on the server, in the MyCatalog database. There is no predefined schema enumeration for the Mining Models schema (it is a provider-specific schema); therefore the first parameter will be adSchemaProviderSpecific. The second parameter, the set of restrictions, contains the name of the catalog to search for models ("MyCatalog"). The third parameter contains the string form of the GUID that identifies the Mining Models schema for the Analysis Services OLE DB provider.
6 Const DMSCHEMA_MINING_MODELS="{3add8a77-d8b9-11d2-8d2a-00e029154fde}"
7 Dim Restrictions()
8 Restrictions = Array("MyCatalog")
9 Dim rsSchema As ADODB.Recordset
10 Set rsSchema = Conn.OpenSchema( _
adSchemaProviderSpecific, _
Restrictions, _
DMSCHEMA_MINING_MODELS)
The "OLE DB for Data mining" specification contains complete definitions for the specific schemas supported by the OLE DB provider for Analysis Services, including the columns that are returned by the Discovery requests and the restrictions that can be sent.
OpenSchema returns an ADODB.Recordset object. The Recordset object encapsulates a tabular server response. We'll see how to traverse such an object and how to extract information from it. For this particular code snippet, the purpose is to enumerate the names of the mining models. From the OLE DB for data mining specification one knows that, in the Mining Models schema, each row corresponds to a mining model and also contains a column named "MODEL_NAME" with the name of the mining model.
The following code shows how to look for a specific column name in a Recordset object and how to later extract the information from that column.
11 ' identify the column of interest (MODEL_NAME)
12 Dim iModelNameColumn As Integer
13 For iModelNameColumn = 0 To rsSchema.Fields.Count - 1
14 If rsSchema.Fields(iModelNameColumn).Name = "MODEL_NAME" Then
15 GoTo Found
16 End If
17 Next
18 Error (1)
19 Found:
20 ' read from the return Recordset
21 rsSchema.MoveFirst
22 While Not rsSchema.EOF
23 Debug.Print rsSchema.Fields(iModelNameColumn).Value
24 rsSchema.MoveNext
25 Wend
As you can see, the code first traverses all the fields of the Recordset object. Each field represents a column in the response. Among these fields, we look for the index of the one named MODEL_NAME. If such a column is found, the code starts traversing the rows; otherwise, it raises an error.
For traversing the rows, the Recordset pointer is first moved to the beginning of the data. Then, rows are read one by one. For each, the value of the field MODEL_NAME is extracted and used.
Some of the data mining tasks (such as creating new metadata objects or training existing objects) can be executed by sending DDL statements to the server. Let's see how a DDL statement can be sent using OLE DB through ADO. We'll use the same connection object declared above and introduce a new ADO wrapper of an OLE DB object, the ADODB.Command object:
26 Dim Cmd As ADODB.Command
27 Set Cmd = New ADODB.Command
28 Cmd.ActiveConnection = Conn
29
30 Dim strProcessDDLStmt As String
31 strProcessDDLStmt = "" _
32 & " <Process
https://schemas.microsoft.com/analysisservices/2003/engine"">" _
33 & " <Type>ProcessStructure</Type>" _
34 & " <Object>" _
35 & " <DatabaseID>CheckInTestDB</DatabaseID>" _
36 & " <MiningStructureID>Structure1</MiningStructureID>" _
37 & " </Object>" _
38 & " </Process>"
39
40 Cmd.CommandText = strProcessDDLStmt
41 Cmd.Execute
42
The Command object executes in the context of an active connection. The active connection is specified at line 28 above. Generally, the Command contains a statement to be executed. This is set as the CommandText property (line 40). When ADO wraps the OLE DB provider for Analysis Services 2005, the CommandText property supports both DMX statements and DDL statements (as previously shown). The Command execution is initiated with the Execute method (line 41). Execution usually returns an ADODB.Recordset object (the same object used for the tabular server response in the Discovery code snippet above). However, for a processing operation there is no server response, other than success or failure. If an error occurs, ADO throws an exception, stopping the execution of the code. Therefore, if line 42 is reached, the command executed successfully.
With the previous code, by changing the CommandText property to a different DDL statement, such as Alter, Create, or Drop, or to a DMX statement, such as CREATE MINING MODEL or INSERT INTO, one can accomplish most of the data mining tasks. The only one that needs additional details is querying a mining model. DMX querying is different from statements without a response because:
- It returns tabular results.
- The tabular results may contain multiple levels (nested tables).
- DMX supports parameters.
We'll start with a DMX query that returns a single-level tabular result, and use the objects we have declared so far (the command and the connection):
43 Cmd.ActiveConnection = Conn
44 Cmd.CommandText = "SELECT NODE_CAPTION FROM DecisionTree1.CONTENT " _
45 & "where NODE_TYPE=2"
46
47 Dim rs As ADODB.Recordset
48 Set rs = Cmd.Execute()
49 rs.MoveFirst
50 While Not rs.EOF
51 Debug.Print rs.Fields(0).Value
52 rs.MoveNext
53 Wend
54 rs.Close
As mentioned before, the DMX statement is transmitted through the CommandText property of the ADODB.Command object. The Recordset traversal is done in the same way as in the Discovery code snippet. Given that the query used here (line 44) only returns one column, there is no actual need to find the column among the recordset fields; rather, it can be identified by its index, 0 (line 51). Please also note the disposal of the Recordset object at line 54. It is very important, as the Command object cannot be used for subsequent statements while a Recordset is active.
The OLE DB specification also allows returning more complicated result sets through the use of table-valued columns, or "nested tables." To demonstrate the support for nested tables in a server response, we shall make the DMX statement more complex by adding a second column to the query, NODE_DISTRIBUTION. Therefore, the new query will look like:
"SELECT NODE_CAPTION, NODE_DISTRIBUTION FROM DecisionTree1.CONTENT WHERE NODE_TYPE=2"
NODE_DISTRIBUTION is a typical case of a nested table. Per the OLE DB for data mining specification, the NODE_DISTRIBUTION column of the model content contains the distributions of certain attribute values in the node represented by the current line. For example, in a decision tree predicting the hair color, for each tree node, this column is describing how many cases have black hair, how many have blonde hair, and how many have brown hair.
With the new column, the command execution does not change. Actually, the only code that requires changing is the Recordset traversal code. The change has to deal with the new table column. It is easy to notice in the code above that a column value is returned by the Recordset as a Field Value property. If the column in question is a nested table, the column value will be a new Recordset that can be used to traverse the nested table.
Therefore, the following lines of code should be included right after line 51.
52 Debug.Assert (rs.Fields(1).Type = adChapter)
53 Dim nestedRS As ADODB.Recordset
54 Set nestedRS = rs.Fields(1).Value
55 nestedRS.MoveFirst
56 While Not nestedRS.EOF
57 Debug.Print nestedRS.Fields(0).Value
58 nestedRS.MoveNext
59 Wend
60 nestedRS.Close
The assertion at line 52 simply ensures that the column is of the correct type before attempting to cast its value to a nested Recordset. The condition of the assertion can also be used to determine whether a specific column is a nested table or not.
With the ability to traverse nested tables, any result from a data mining query can now be used in your application.
DMX also supports parameters in the queries. A parameter can replace any value in a DMX query. For instance, a parameter can replace the value 2 of the NODE_TYPE property in the WHERE clause of the query above. In data mining applications there are a few scenarios where parameters are very useful, such as generating singleton queries (see the "Adomd.NET" section for more details on singleton queries).
To change a value to a parameter in a DMX query, one starts by replacing the value's occurrence with the parameter indicator: @ followed by a unique parameter name. Our query (line 45 in the above VBA code snippet) becomes:
45 &"where NODE_TYPE=@typeParam"
Then, before executing the command, the following code should be inserted to ensure proper handling of the new parameter. Note that using a parameter will not change the format of the server response, although the actual data may differ a lot with different values of the parameter. Therefore all the code presented above for traversing a Recordset can be left untouched.
46 Cmd.NamedParameters = True
47
48 Dim typeParameter As ADODB.Parameter
49 Set typeParameter = Cmd.CreateParameter()
50 typeParameter.Direction = adParamInput
51 typeParameter.Name = "typeParam"
52 typeParameter.Value = 2
53 typeParameter.Type = adInteger
54 typeParameter.Size = 4
55
56 Cmd.Parameters.Append typeParameter
The following steps are important when using parameters with the OLE DB provider for Analysis Services:
- The Command must be notified to use named parameters (line 46). The OLE DB provider for Analysis Services only supports named parameters.
- The parameters must have a name that matches the indicator used in the query, less the @ prefix (line 51).
- Only input parameters are supported (line 50).
- Type and Size of a parameter should be declared (lines 53, 54).
A good understanding of the OLE DB for data mining specification is required for fully using the data mining features in Microsoft Analysis Services 2005. Once the data mining queries are designed, OLE DB is a complete and versatile API for implementing them as well as for managing the data mining server. Please review Appendix 2 of this article for code snippets of using OLE DB from unmanaged C++ or C# through ADO.NET.
Analysis Management Objects – AMO
As the name suggests, AMO is an API intended for management tasks. It is perfectly suited for describing in detail the properties of a metadata object. AMO is an object model that closely matches the Data Definition Language (DDL) used by Microsoft Analysis Services to internally describe its metadata objects. AMO allows iteration over existing metadata objects, creation of new objects, as well as the altering of existing ones. In the AMO object model, every property that can be expressed in DDL can be inspected and changed. AMO can also be used to inspect or change the server properties, including registering or un-registering plug-in algorithms, or enabling/disabling registered data mining algorithms. AMO is useful for metadata definition and discovery as well as for the training of server objects. However, it does not provide a way to execute query statements.
Before moving to details, we should mention that AMO is a managed library. Therefore, it is intended to be used in applications developed in a Common Language Runtime (CLR) compatible programming language, such as C#, Managed Extensions for C++, or Visual Basic .NET. To use AMO, one should install the SQL Server 2005 Connectivity Components on the client machine.
To use AMO, you should start by adding a reference to the Microsoft.AnalysisServices.dll library in your applications. With this, the AMO object model is now available. Start by connecting to a server:
Microsoft.AnalysisServices.Server server = new Server();
server.Connect("localhost");
Once the connection is established, the metadata objects on the server can be iterated, in a hierarchical manner:
DatabaseCollection dbCollection = server.Databases;
foreach( Database db in dbCollection )
{
MiningStructureCollection structCollection = db.MiningStructures;
foreach( MiningStructure structure in structCollection)
{
Console.WriteLine( structure.Name );
}
}
To change an existing metadata object, just modify its properties, as in the following:
model.Algorithm="Microsoft_Decision_Trees".
then call "Update".
model.Update();
The call to Update will propagate your changes on the server and refresh the local collections.
In much the same way, to add a new metadata object on the server, create a new member in the respective collection:
MiningStructure myStructure = db.MiningStructures[0]; // an existing mining structure
MiningModel myModel = myStructure.MiningModels.Add();
Then populate this object:
myModel.Name = "New Model"
myModel.Algorithm = "Microsoft_Clustering"
then propagate the changes to the server with an Update() call:
myModel.Update();
An Update call will throw an exception whenever the operation cannot be completed successfully. Common situations are:
- The set of properties for the object being updated is incomplete or inconsistent.
- The server cannot update the object because the current user does not have enough permissions.
By catching the exception, one can figure out the source of the problem and fix it.
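A minimal sketch of this error handling, assuming the myModel object from the previous snippet (AmoException is the base class for the exceptions raised by AMO):

try
{
    myModel.Update();
}
catch (AmoException ex)
{
    // Inspect the error raised by AMO or returned by the server
    Console.WriteLine("Update failed: " + ex.Message);
}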
AMO is great for dealing with the Analysis Services 2005 server metadata, but it does not provide support for some of the data mining tasks previously described. It does not support subscribing to progress notifications or querying mining models (in general, it does not support DMX statement execution). If your application is intended to browse and explore an Analysis Services 2005 server, then AMO is the best way to go. However, if besides this you need to execute statements or display progress notifications, then AMO is not enough, and you should use the general purpose APIs presented earlier (Adomd.NET and OLE DB), which completely cover the features exposed by the Analysis Services server.
Progress Notifications: Working with Trace Objects
Notifications are sent by Microsoft Analysis Services when various events occur on the server. Such events include user login or logout, start of a query execution, end of a query execution, or progress notifications issued by various server objects during processing. By receiving these notifications, an administrator can inspect the state of the server at any given moment. Also, a user that is processing a server object may get an idea of what is happening and an estimation of the progress so far.
In this section we'll discuss how an application can receive such notifications from the Microsoft Analysis Services 2005 server. We'll start by describing how to discover the notifications that can be emitted by a server. Then, we'll briefly present how an application can subscribe to receive those server notifications that are deemed of interest. Eventually, we'll propose a general programming model for applications that need to deal with progress notifications.
Before moving further, we must mention that SQL Server 2005 comes with a built-in client for these notifications, the Profiler. Using the Profiler, a user can visually choose the notifications of interest and then inspect these notifications. Also, the Profiler application allows recording the server notifications and saving them for later inspection, and it integrates with performance counters to allow an administrator to correlate the server notifications with the evolution of various performance indicators. Most applications will not need to handle server notifications in code. This section is intended for those developers who need to write advanced user-interface clients for data mining.
Each notification emitted by the server looks like a tabular row. This row contains some general information, such as the class of the notification, a timestamp, the process ID of the server, and the name of the server. Other columns are specific to certain events, such as Progress Total for a progress notification, or start time, end time, and duration for a job that is executed on the server. All the events that are emitted on the server are collected into a virtual table, which contains one row for each event emitted and the union of all the columns supported by any server event. That means that if a column is only supported for a specific event, it will be present in all the event rows of this virtual table, but it will be empty for the rows of other events. This table is described as virtual because it is not kept in memory and cannot be directly interrogated. Periodically, this table is saved into a "flight-recorder" file that is only accessible to administrators and can provide valuable information about the cause of a problem that occurred at some moment in the past, when no one was watching the server events.
The set of all the columns in this virtual table can be discovered with regular XMLA discovery. The XMLA name of this schema is DISCOVER_TRACE_COLUMNS and its GUID is {a07ccd18-8148-11d0-87bb-00c04fc33942}. Each row in this schema describes a column of the event table we mentioned and it has exactly one column, containing the XML description of that event column. Such an XML description looks like:
<COLUMN>
<ID>0</ID>
<TYPE>1</TYPE>
<NAME>EventClass</NAME>
<DESCRIPTION>Event Class is used to categorize events.</DESCRIPTION>
<FILTERABLE>false</FILTERABLE>
<REPEATABLE>false</REPEATABLE>
<REPEATEDBASE>false</REPEATEDBASE>
</COLUMN>
For the purposes of this whitepaper, ID, Name, and Description are the properties of interest.
ID is a numeric identifier of the column. As we'll see in discussing events, an event definition refers to a column by ID. The name of the column and the description are important for the client. Based on these, the client application can determine which columns are of interest and which can be ignored. The column with the ID 0 is the most important column in the virtual table of server notifications, because it defines the event type. All the notifications emitted by Microsoft Analysis Services 2005 contain this column.
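As a sketch, this discovery can be issued from Adomd.NET with the same GetSchemaDataSet mechanism shown earlier (no restrictions are applied here):

Guid traceColumns = new Guid("{a07ccd18-8148-11d0-87bb-00c04fc33942}");
DataSet dsTraceColumns = conn.GetSchemaDataSet(traceColumns, null);
// Each row contains a single column holding the XML description of one trace column
foreach (DataRow row in dsTraceColumns.Tables[0].Rows)
{
    Console.WriteLine(row[0].ToString());
}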
Discovering the events that can be emitted by the server is a task similar to the column discovery. The XMLA schema to be used is DISCOVER_TRACE_EVENT_CATEGORIES and its GUID is {a07ccd19-8148-11d0-87bb-00c04fc33942}. Each row in this schema describes an event category that can be emitted by the server, together with all the events in that category, while the single column contains the XML definition of the event category, which looks like the following:
<EVENTCATEGORY>
<NAME>Queries Events</NAME>
<TYPE>0</TYPE>
<DESCRIPTION>Collection of events for queries.</DESCRIPTION>
<EVENTLIST>
<EVENT>
<ID>9</ID>
<NAME>Query Begin</NAME>
<DESCRIPTION>Query begin.</DESCRIPTION>
<EVENTCOLUMNLIST>
<EVENTCOLUMN>
<ID>0</ID>
</EVENTCOLUMN>
<EVENTCOLUMN>
<ID>2</ID>
</EVENTCOLUMN>
...
</EVENTCOLUMNLIST>
</EVENT>
<EVENT>
<ID>10</ID>
<NAME>Query End</NAME>
<DESCRIPTION>Query end.</DESCRIPTION>
<EVENTCOLUMNLIST>
<EVENTCOLUMN>
<ID>0</ID>
</EVENTCOLUMN>
<EVENTCOLUMN>
<ID>2</ID>
</EVENTCOLUMN>
...
</EVENTCOLUMNLIST>
</EVENT>
....
</EVENTLIST>
</EVENTCATEGORY>
An event category, therefore, contains multiple events with various IDs. Each of these events has its own list of columns and, as mentioned before, all of them contain the column 0, which provides the event class.
Another special column is column 1, event subclass, which is shared by some of the events. The event subclass column allows the trace user, for example, to distinguish between a ProgressStart event issued by data mining and the same event issued by processing of an OLAP dimension. When column 1 is present in the definition of an event, its definition is more complex than that of the other columns, as shown below:
<EVENTCOLUMN>
<ID>1</ID>
<EVENTCOLUMNSUBCLASSLIST>
<EVENTCOLUMNSUBCLASS>
<ID>1</ID>
<NAME>Process</NAME>
</EVENTCOLUMNSUBCLASS>
<EVENTCOLUMNSUBCLASS>
<ID>2</ID>
<NAME>Merge</NAME>
</EVENTCOLUMNSUBCLASS>
...
</EVENTCOLUMNSUBCLASSLIST>
</EVENTCOLUMN>
This more complex definition describes the meaning (event subclass name) for any value that can appear in column 1 in the context of the current event.
Once an application has decided which events and which event columns are of interest, it can subscribe to receive this information as notifications. The subscription is performed by creating a trace object on the server. A trace object is functionally similar to a view over the virtual table containing all the server notifications. In addition, a trace object has all the attributes of a server object. It can be created, altered, or destroyed, and access to it can be restricted by a set of permissions. A trace declaration specifies the events that are of interest (similar to a WHERE clause in a relational view) and the columns to be returned for those events. The Trace definition DDL may contain more advanced filtering options, of the form "return only those rows where column C contains a specific value," but these filtering options are not the subject of this article. These advanced features are documented in the SQL Server 2005 Books Online, under the "Trace Element (Analysis Services Scripting Language)" section.
Let's consider an application that trains a mining model and chooses to subscribe only to the progress report events. These are:
- Progress Report Begin (having 5 as event ID)
- Progress Report Current (having 7 as event ID)
- Progress Report End (having 6 as event ID)
- Progress Report Error (having 8 as event ID)
For each of these events, the columns of interest are:
- Column 0, EventClass.
- Column 9, ProgressTotal, representing an estimated total number of steps for completion of the current task (only available for Progress Report Current).
- Column 10, IntegerData, which in data mining progress notifications contains the current step in the completion of the current task (only available for Progress Report Current).
- Column 42, TextData, which contains a verbose description of the current step.
The following DDL statement creates this trace:
<Create
xmlns="https://schemas.microsoft.com/analysisservices/2003/engine">
<ObjectDefinition>
<Trace>
<ID>DemoTrace</ID>
<Name>DemoTrace</Name>
<Events>
<Event>
<EventID>5</EventID>
<Columns>
<ColumnID>0</ColumnID>
<ColumnID>42</ColumnID>
</Columns>
</Event>
<Event>
<EventID>6</EventID>
<Columns>
<ColumnID>0</ColumnID>
<ColumnID>42</ColumnID>
</Columns>
</Event>
<Event>
<EventID>7</EventID>
<Columns>
<ColumnID>0</ColumnID>
<ColumnID>9</ColumnID>
<ColumnID>10</ColumnID>
<ColumnID>42</ColumnID>
</Columns>
</Event>
<Event>
<EventID>8</EventID>
<Columns>
<ColumnID>0</ColumnID>
<ColumnID>42</ColumnID>
</Columns>
</Event>
</Events>
</Trace>
</ObjectDefinition>
</Create>
Once the trace is created, the client application may subscribe to this trace. The trace subscription is functionally similar to executing a server query that returns a tabular result. It will return a table containing the selected columns for the selected events. As opposed to regular server queries, a Subscribe statement will keep returning events, row by row, until an Unsubscribe command is sent to the server. Only then will the Subscribe command terminate. Progress notifications are not of much use after the operation is completed; they are needed while the operation is in progress. This is why a possible model for dealing with progress notifications is to use two different threads:
- The first thread creates the trace and then signals the second thread to subscribe.
- The second thread subscribes and makes sure that it sets the necessary execution properties to access the response table in a forward-only manner, row by row, as the rows come from the server, without having to wait for the response to be completely collected.
- After this, the first thread issues the processing command and waits for its completion.
- Meanwhile, the second thread receives progress notifications and displays them in the user interface.
- When the processing operation is completed, with either success or failure, the first thread must also unsubscribe from the trace and delete it.
- Upon deletion of the trace, the second thread will finally receive the end-of-table notification and will terminate.
We saw in the previous sections how a statement can be issued both in OLE DB and in Adomd.NET. To get the response in a forward-only manner, so that rows are available before the transmission completes, the command execution has to receive certain API-specific flags. In OLE DB, for instance, the DBPROP_CANFETCHBACKWARDS and DBPROP_CANSCROLLBACKWARDS properties must be explicitly set to FALSE. In Adomd.NET, the ExecuteReader function takes an optional parameter, CommandBehavior, which should be set to SequentialAccess for a forward-only execution mode.
The following code is the skeleton for the main thread, followed by a function containing the code for the second thread.
AdomdConnection conn = new AdomdConnection();
conn.ConnectionString = "Data Source=localhost; " +
"Initial Catalog=MyCatalog";
conn.Open();
AdomdCommand cmd = new AdomdCommand();
cmd.Connection = conn;
cmd.CommandText = "..."; // Copy here the DDL fragment for creating the trace
// Execute the statement that creates the trace
cmd.ExecuteNonQuery();
// launch a new thread that will read the trace
Thread traceThread = new Thread(new ThreadStart(TraceProc));
traceThread.Start();
// Continue with the processing operation
cmd.CommandText = "..."; // Processing statement, DDL or DMX, here
// Execute the processing command
try{
cmd.ExecuteNonQuery();
}
catch (System.Exception ex){
// Report the error
Console.WriteLine(ex.Message);
}
finally{
// delete the trace, regardless of the outcome of processing
cmd.CommandText =
"<Delete " +
"xmlns=\"https://schemas.microsoft.com/analysisservices/2003/engine\">"+
" <Object>" +
" <TraceID>DemoTrace</TraceID>" +
" </Object>" +
"</Delete>";
cmd.ExecuteNonQuery();
}
// The trace reader thread procedure
public static void TraceProc()
{
// open a new AdomdConnection to the same server
AdomdConnection conn = new AdomdConnection ();
conn.ConnectionString = "Data Source=localhost; " +
"Initial Catalog=MyCatalog";
conn.Open();
AdomdCommand traceCmd = new AdomdCommand();
traceCmd.Connection = conn;
// The command text is the DDL Subscribe command
traceCmd.CommandText = "<Subscribe " +
"xmlns=\"https://schemas.microsoft.com/analysisservices/2003/engine\">"+
" <Object>" +
" <TraceID>DemoTrace</TraceID>" +
" </Object>" +
"</Subscribe>";
// Get a forward-only data reader
AdomdDataReader traceRdr =
traceCmd.ExecuteReader(CommandBehavior.SequentialAccess);
try{
// read from trace and display notifications
while (traceRdr.Read()){
string textData = traceRdr.GetString(3);
if (textData != null)
Console.WriteLine(textData);
}
}
catch(System.Exception){
// exception reading from trace: might be "The trace was deleted
// on the server" signaling termination of the trace
}
finally{
// Close the UI element
Console.WriteLine("Trace Completed");
}
}
As we mentioned before, the Profiler tool that comes with SQL Server is very easy to use: it creates traces quickly and exposes even the advanced features of trace objects, such as row filtering.
Data Mining Without a Server: Local Mining Models
Microsoft Analysis Services 2005 provides a limited set of functionality inside an "in-process" server. This is a way of loading server code into the memory space of your own application, without actually opening a connection to a server. This functionality is referred to as "local mining models" in data mining-oriented documentation, "local cubes" in OLAP-oriented documentation, or simply as "local server." We will use "local server" inside this article.
The purpose of this server, as far as data mining is concerned, is to allow simple operations to be performed in an off-line mode. Useful scenarios for this feature include (but are not limited to) embedded applications.
For example, using the real server, one can design and train a very useful mining model that partitions Web queries into a set of clusters, such as "Fast queries with small payload," "Slow queries with small payload," and "Slow queries with large payload." Such a mining model could be used in a load-balancing solution that forwards new queries to one Web server or another, based on the cluster they are determined to belong to. However, the load-balancing solution would have to execute queries on the Analysis Services server to determine the cluster that contains an incoming query, which is potentially time-consuming and also raises some security issues (the Analysis Services machine would have to be in the same network zone as the load-balancing solution, which typically resides outside a firewall). The in-process server is the solution here. Once the trained mining model is transported from the actual server to the machine that hosts the load-balancing solution, the latter can use the in-process server to execute queries. The queries are faster (they run in the same process, rather than being sent over the network) and the Microsoft Analysis Services server remains secure, protected by a firewall.
The Microsoft Analysis Services server keeps all its metadata (databases with mining structures and mining models) in a so-called Data folder, which is set up by the installation package. The local server keeps all its metadata inside a single file, which usually has the .cub extension. Once connected to a local server, an application can create databases, mining structures, and models and use them very much like on the real server. All this metadata will be created and stored in the .cub file. If the application is shut down, when the local server is re-initialized with the old .cub file, all existing metadata is preserved and loaded.
All data access APIs described previously can connect to a local server. All developers have to do is replace the server name in the connection string ("localhost" in the code samples) with the name of the .cub file to be used. An Adomd.NET example is presented below:
AdomdConnection conn = new AdomdConnection();
conn.ConnectionString = "Data Source=c:\\test.cub; ";
conn.Open();
Similar code for AMO looks like:
Microsoft.AnalysisServices.Server server = new Server();
server.Connect("c:\\test.cub");
One should remember that, upon creation, there is no database in a .cub file.
Therefore, one of the first things to do is to create a new database. An exception to this rule is the DMX CREATE statement. When CREATE is used to create a new mining model or a new mining structure and no database is available, a new catalog is created automatically in the .cub file.
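As an illustration, the following sketch uses Adomd.NET against a .cub file to issue such a CREATE statement. The model name and column definitions are hypothetical; Microsoft_Decision_Trees is one of the algorithms supported by the local server:
AdomdConnection conn = new AdomdConnection();
conn.ConnectionString = "Data Source=c:\\test.cub; ";
conn.Open();
AdomdCommand cmd = new AdomdCommand();
cmd.Connection = conn;
// Because no database exists yet in a freshly created .cub file,
// this CREATE statement also creates a catalog automatically.
cmd.CommandText =
    "CREATE MINING MODEL StudentPlans (" +
    " StudentID LONG KEY," +
    " IQ LONG CONTINUOUS," +
    " ParentEncouragement TEXT DISCRETE," +
    " CollegePlans TEXT DISCRETE PREDICT" +
    ") USING Microsoft_Decision_Trees";
cmd.ExecuteNonQuery();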
After connection, the local server can be accessed in the same way as the real server. There are, however, some limitations in what the local server can provide:
- Local servers do not support multiple simultaneous connections (therefore, as discussed in the previous section, progress notifications cannot be received).
- Local servers do not support database backup and restore.
- Local servers do not support the DMX IMPORT statement. (They do, however, support the DMX EXPORT statement.)
- Local servers only support a limited set of data mining algorithms (Microsoft_Decision_Trees and Microsoft_Clustering).
Extending the Functionality of SQL Server Data Mining
We discussed so far how various data access APIs can be used to execute statements on the server. The statements were either in Data Definition Language (DDL) or in DMX, the SQL-like language for data mining. In this section, we'll discuss ways to extend the server functionality by:
- Adding new functions and stored procedures.
- Adding new algorithms on the server.
- Adding new model viewers (client side).
Extending DMX with the Adomd.NET Server
Adomd.NET Server is a managed library that is installed together with the Microsoft Analysis Services 2005 server product. It can, therefore, be used from any managed language (such as C# or Visual Basic .NET). Adomd.NET server can be used to create two kinds of server extensions: functions and stored procedures.
A stored procedure is a stand-alone execution unit. It can be invoked by itself, with a DMX statement like this:
CALL TestAssembly.MyNamespace.MyClass.MyStoredProcedure()
A stored procedure may return a tabular result, like any DMX query, or may just perform server side operations, returning success or failure.
A server function is a helper execution unit that takes part in a regular DMX query. Such a query would look like this:
SELECT Function1() FROM MyModel WHERE Function2()
In Microsoft Analysis Services 2005, in order to create a function or a stored procedure, one has to create a managed class library (an assembly), then deploy it on the server. The assembly can be made visible to the whole server (server administrative permissions are required) or only inside a database (database administrative permissions are required). Assembly objects have some security features, including a permission set (describing the permissions to be granted to the .NET code) and impersonation information (which account should be impersonated when code from that assembly is executed). SQL Server 2005 Books Online contains details on the permission sets for assemblies and on the impersonation modes. For the purpose of this article, we will use the Safe permission set and impersonate the current user. Each public method of a public class in the assembly can be used as a server function or as a stored procedure. The difference between them is that a server function should return something (either scalar or tabular content), while a stored procedure can return either an empty result (signifying success of execution) or tabular content (DataTable objects or objects that implement the IDataReader interface). Stored procedures cannot return scalar values (actually, they can, but the scalar values will not be returned to the caller as the result of a DMX CALL statement).
A fully qualified name must be used when referring to a managed stored procedure or server function. The BNF form of a fully qualified name looks like the following:
<AssemblyName>[.<Namespace>]+[.<ClassName>].<MethodName>
<AssemblyName> is the name of the assembly, as defined when the assembly is deployed on the server; it is mandatory. <Namespace> is the simple (or nested) namespace that contains the target class and method. <ClassName> is the class containing the method, while <MethodName> is the actual method; <MethodName> is also mandatory. <Namespace> and <ClassName>, on the other hand, are optional: they must be used when there is an ambiguity between multiple functions or stored procedures in the same assembly. A fully qualified name (including namespace and class name) is used for all the statements in this section.
When creating a server function or a stored procedure, it is important to keep in mind that these objects may be invoked in parallel, for multiple users. The class instances are not shared between multiple calls, so no assumptions should be made about the state of a class when a stored procedure or function is invoked. Rather, the state should be re-initialized according to the parameters passed to the method. And speaking of parameters: these can only be scalar values. A server function or stored procedure cannot take a table as parameter.
We'll show how a simple server function can enrich the user experience in the case of a DMX query. It is used in the context of a mining model, Model1, which has a predictable Gender column with values F and M. The queries we are trying to improve look like the following:
SELECT Gender FROM Model1 [PREDICTION JOIN ... AS T]
These queries would return a column containing M or F.
The following server function contains code that converts a gender indicator to a more readable string. It will return Male for M and Female for F.
namespace TestSP
{
public class StoredProc
{
public string RecodeGender(string strGender)
{
if (strGender == "M") return "Male";
if (strGender == "F") return "Female";
return strGender;
}
}
}
Once the class library containing this class definition is built and deployed to the server, the server function can be invoked like the following:
SELECT TestSP.TestSP.StoredProc.RecodeGender(Gender) FROM Model1 PREDICTION JOIN ... AS T
The new query would return a column containing Male or Female.
Let's focus now on the parameters of the server function: the argument is a column name. The DMX parsing engine determines that Gender is the name of a predictable column, and the column evaluation happens before calling into the server function. Instead of the predictable column, one could pass a column from the input data of the PREDICTION JOIN operation (something like T.Gender). The function can be modified to take more than one parameter and perform some operation over the arguments (such as concatenating strings, arithmetic operations with numeric parameters, and so on).
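As a sketch of such a variation (the function name and the extra label parameter are hypothetical), a two-parameter version could look like this:
public string RecodeGenderWithLabel(string strLabel, string strGender)
{
    // Combines a caller-supplied label (for example, a string literal
    // passed in the DMX query) with the recoded gender value.
    string strRecoded = strGender;
    if (strGender == "M") strRecoded = "Male";
    if (strGender == "F") strRecoded = "Female";
    return strLabel + ": " + strRecoded;
}
It could be invoked with a statement such as SELECT TestSP.TestSP.StoredProc.RecodeGenderWithLabel('Predicted gender', Gender) FROM Model1 PREDICTION JOIN ... AS T.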
A simple stored procedure can be defined pretty much the same way. In the same class, we will add a new function that returns tabular content (a DataTable, in this case).
public DataTable GetWeekDays()
{
DataTable tbl = new DataTable();
tbl.Columns.Add("Day", typeof(int));
tbl.Columns.Add("Name", typeof(string));
object[] row = new object[2];
row[0] = 0;
row[1] = "Sunday";
tbl.Rows.Add(row);
row[0] = 1;
row[1] = "Monday";
tbl.Rows.Add(row);
...
row[0] = 6;
row[1] = "Saturday";
tbl.Rows.Add(row);
return tbl;
}
The stored procedure can be invoked with this DMX statement:
CALL TestSP.TestSP.StoredProc.GetWeekDays()
Let's now investigate a more complex scenario. Using the [College Plans] mining model, which predicts the college plans of a student based on parent encouragement and IQ, we need to associate each prediction with an explanation. The prediction can be obtained with the previous query. The prediction reason can be found in the mining model content, as the NODE_DESCRIPTION column associated with each node. The node that determined a prediction can be identified using the general DMX function PredictNodeId. Therefore, the query can be modified to look like this:
SELECT CollegePlans, PredictNodeId(CollegePlans) FROM [College Plans]
[PREDICTION JOIN ...]
For the purposes of this section, we can ignore the PREDICTION JOIN part of the query, as it is not involved in the design of server functions and stored procedures.
PredictNodeId returns the NODE_UNIQUE_NAME for the node that determined the prediction. The NODE_DESCRIPTION associated with a NODE_UNIQUE_NAME can be determined by executing a content query such as the following:
SELECT NODE_UNIQUE_NAME, NODE_DESCRIPTION FROM [College Plans].Content
Now, to get a prediction together with the prediction explanation, one needs to join the result of the first query and the result of the second query using the NODE_UNIQUE_NAME in both results as the key of the join. DMX, at least in the current version, does not support JOIN clauses.
However, one can write a server function that returns the node description taking a node unique name as argument. Assuming that such a function is already defined and that it is named GetNodeDescription, the query becomes something like this:
SELECT
CollegePlans,
TestSP.TestSP.StoredProc.GetNodeDescription(
PredictNodeId(CollegePlans) )
FROM [College Plans] PREDICTION JOIN ...
As we mentioned before, the DMX execution engine will evaluate PredictNodeId before calling into the server function. Hence, the server function will receive the node unique name as argument and all it has to do is to somehow access the server's current context, browse the content of the mining model, and determine the node description for the respective node.
Here is where Adomd.NET Server becomes useful. To use it, the class library must contain a reference to a specific assembly that exposes the server internals to custom stored procedures and functions. This assembly is named msmgdsrv.dll and it is located in the folder where the Microsoft Analysis Services 2005 server is installed.
Once the reference is added, it should be used with code like this:
using Microsoft.AnalysisServices.AdomdServer;
By using this assembly, the server context is already available as the static Context class.
Each method in the assembly can now use Context to access the server's current context.
Having a Context class around is very much like having an active AdomdConnection object around in an application that uses the Adomd.NET client library. In the section about Adomd.NET we discussed the AMO-like way Adomd.NET exposes server metadata objects. In the client API, the metadata objects are retrieved by means of discovery, while in the server Adomd.NET library, they are accessed directly into the server memory space.
Returning to the function we are trying to write: the Context object provides everything we need, and the function should look like this:
public string GetNodeDescription(string nodeUniqueName)
{
MiningContentNode node =
Context.CurrentMiningModel.GetNodeFromUniqueName(nodeUniqueName);
return node.Description;
}
What happens here:
- The Context "knows" which is the CurrentMiningModel, because the function invocation happens during the execution of a DMX statement and that statement has a current mining model (the [College Plans] model, in our example).
- The MiningModel object returned as CurrentMiningModel exposes a method, GetNodeFromUniqueName, that returns a MiningContentNode based on the node unique name.
- The MiningContentNode has a Description property that contains the NODE_DESCRIPTION value for that node.
Once the assembly is deployed, for a student encouraged by parents and with an IQ of 110, the result of the query that uses the server function is the following:
College Plans = Does not plan to attend
Node Description = "ParentEncouragement='Encouraged' and IQ >= 101"
And our goal is achieved; in a single query we get the information that would otherwise require a JOIN, which Analysis Services 2005 does not support.
The Context class exposed by the Adomd.NET server library contains all the collections exposed for discovery purposes by the client-side AdomdConnection object. Therefore, in a server function or stored procedure, the developer can traverse the content of one or more mining models, or even of OLAP objects such as Cubes and Dimensions. An important internal difference between server functions and stored procedures is that CurrentMiningModel is only available in a server function. This is because the DMX parsing engine cannot determine the current mining model for a statement like this:
CALL MyStoredProcedure()
Even if the stored procedure takes the mining model name as an argument (a usual approach), there is no simple way to describe which of the parameters should be considered the target mining model. Besides, a stored procedure might simply not support the concept of a current mining model, because it might have to deal with multiple models.
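The following sketch shows this usual approach: a stored procedure that receives the model name as a parameter, looks the model up in the MiningModels collection of the Context class, and returns a table of its top-level content nodes. The procedure name and the returned columns are hypothetical, and the sketch assumes it is added to the same StoredProc class used above:
public DataTable GetTopLevelNodes(string strModelName)
{
    // The model name is passed explicitly because CurrentMiningModel
    // is not available inside a stored procedure.
    MiningModel model = Context.MiningModels[strModelName];
    DataTable tbl = new DataTable();
    tbl.Columns.Add("NodeUniqueName", typeof(string));
    tbl.Columns.Add("NodeDescription", typeof(string));
    foreach (MiningContentNode node in model.Content)
    {
        tbl.Rows.Add(new object[] { node.UniqueName, node.Description });
    }
    return tbl;
}
Such a procedure could be invoked with CALL TestSP.TestSP.StoredProc.GetTopLevelNodes("College Plans").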
The data mining tools coming with Microsoft Analysis Services 2005 use stored procedures to reduce the amount of traffic between server and client. Most of the mining models viewers require a full content traversal to identify the patterns that are to be exposed. Without stored procedures, the whole mining model content would have to be downloaded to the client, kept in memory, and traversed to find those visually interesting patterns. Using stored procedures, the content traversal happens on the server (where the model content is already loaded in memory) and the stored procedure results contain only the information required by the viewers (such as the differences between two clusters, in the Microsoft Clustering Viewer, or those rules containing a specific term, in the Microsoft Association Rules viewer).
Preserving the similarity with the client Adomd.NET library, the server library also defines an AdomdCommand object. This object can be used to execute various DMX queries in the current database. One difference is that the server-side command object can only execute DMX statements (it does not support DDL or OLAP MDX queries). The server AdomdCommand is otherwise similar to its client-side counterpart: it exposes ExecuteReader and ExecuteNonQuery methods, and it supports parameters, even tabular parameters.
The following code snippet shows how to instantiate a server side command object and how to execute a query:
public IDataReader ExecuteContentQuery(string strModel)
{
AdomdCommand cmd = new AdomdCommand();
cmd.CommandText = "SELECT NODE_UNIQUE_NAME, NODE_RULE " +
"FROM [" + strModel + "].CONTENT";
return cmd.ExecuteReader();
}
This mechanism can be used to encapsulate the DMX queries issued by an application and replace them with statements like this:
CALL TestSP.TestSP.StoredProc.ExecuteContentQuery( "College Plans" )
Please note that the server currently does not support stored procedures that return nested tabular content. Stored procedures can, however, return flat tabular content (such as DataTable or IDataReader objects), which is serialized by the server as a regular tabular query response. This rule has one exception: stored procedures can also return DataSet objects. When this happens, the DataSet object is completely serialized as a string. Regardless of the structure of the DataSet, the response from the server will contain exactly one row and one column, containing the string serialization of the DataSet. The client side is responsible for restoring the serialized string into a new DataSet object; the client Adomd.NET API does this automatically.
The DMX statements that are executed through the server side AdomdCommand object do not have to be queries. Statements such as CREATE MINING MODEL or INSERT INTO can also be used. If a mining model is created or dropped during the execution of a stored procedure, the data mining-related collections exposed by the Context class need to be refreshed (this is done with a call to the Refresh method exposed by these collections).
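A minimal sketch of this pattern is shown below; the model definition is hypothetical, and the Refresh call is what makes the newly created model visible to the rest of the procedure:
public void CreateScratchModel()
{
    AdomdCommand cmd = new AdomdCommand();
    // A hypothetical model definition, executed through the server-side command.
    cmd.CommandText =
        "CREATE MINING MODEL ScratchModel (" +
        " CustomerKey LONG KEY," +
        " Gender TEXT DISCRETE," +
        " CollegePlans TEXT DISCRETE PREDICT" +
        ") USING Microsoft_Decision_Trees";
    cmd.ExecuteNonQuery();
    // Refresh the data mining collections exposed by Context so that the
    // new model is visible to subsequent code in this stored procedure.
    Context.MiningModels.Refresh();
}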
Stored procedures and server functions are a powerful extension to the DMX framework. They can improve the performance of content viewers and present the set of rules inside a trained mining model in completely new ways. Also, they can enhance the DMX queries and the flexibility of the language.
Plug-in Algorithms and Content Viewers
Microsoft Analysis Services 2005 provides a powerful set of algorithms that can be used to solve a wide range of business problems. However, specific problems often require data mining algorithms that are not included in a particular suite. This is why the server provides a plug-in framework for extending the set of algorithms. A new algorithm must be a COM server that implements a specific set of interfaces and makes use of another set of interfaces exposed by the server. Such a plug-in algorithm will automatically take advantage of many features of the new data mining platform:
- metadata definition and management
- granular access permissions model
- highly scalable query infrastructure
- training and prediction case set tokenization
- DMX query language and parsing
- Support for stored procedures and server functions
Basically, a new algorithm need not worry about much else besides finding patterns in a training set or scoring an input case. The SQL Server Data Mining: Plug-In Algorithms whitepaper provides details about authoring such plug-in algorithms.
On the client side, the data mining team tried to provide generic algorithm viewers. These viewers should display visually useful information for each algorithm, regardless of the business problem it solves. Data mining applications that solve a specific problem may need specialized viewers, which perform better for that particular problem. Also, plug-in algorithms may not expose content that can be handled by the built-in viewers. This is why the data mining tools expose a way of plugging in a custom model viewer. For details on this, please read A Tutorial for Constructing a Plug-in Viewer.
Suggestions for Application Scenarios
The intent of this section is to provide a few quick suggestions for developers dealing with some common scenarios encountered in data mining applications. The hint list is not exhaustive, and some of the hints may have limited value. However, these hints summarize the knowledge accumulated by the data mining development team in writing various applications that make use of Microsoft Analysis Services 2005.
One general suggestion: the OLE DB for Data Mining specification contains all the details on DMX, and understanding DMX is the fastest way to get productive with data mining in Microsoft Analysis Services.
Application Scenario 1: Simple Data Mining Prediction
The purpose of such applications is to perform simple predictions.
- The simplest way to write code that executes data mining queries (predictions included) is to use Adomd.NET. The section on Adomd.NET contains code that can be copied into your applications and modified.
- A very common scenario is a prediction based on a single case (singleton prediction). If the data comes from user input, it is strongly recommended to use parameterized statements to avoid malformed DMX, DMX injection from the user, and so on (see the sketch after this list).
- If the application is doing more than prediction (allowing the user to select a model, columns, and the like), the collections exposed by Adomd.NET are the easiest way to discover server metadata.
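As a starting point, here is a minimal Adomd.NET sketch of a parameterized singleton prediction. It reuses the [College Plans] model described earlier in this article; the parameter names are hypothetical:
AdomdConnection conn = new AdomdConnection();
conn.ConnectionString = "Data Source=localhost; Initial Catalog=MyCatalog";
conn.Open();
AdomdCommand cmd = new AdomdCommand();
cmd.Connection = conn;
// User-supplied values travel as named parameters; they are never
// concatenated into the DMX text, which prevents malformed statements
// and DMX injection.
cmd.CommandText =
    "SELECT Predict(CollegePlans) FROM [College Plans] " +
    "NATURAL PREDICTION JOIN " +
    "(SELECT @IQ AS IQ, @ParentEncouragement AS ParentEncouragement) AS T";
cmd.Parameters.Add(new AdomdParameter("IQ", 110));
cmd.Parameters.Add(new AdomdParameter("ParentEncouragement", "Encouraged"));
AdomdDataReader rdr = cmd.ExecuteReader();
if (rdr.Read())
    Console.WriteLine(rdr.GetValue(0));
rdr.Close();
conn.Close();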
Application Scenario 2: Web Application—Using Rules in Existing Models on a Server
Even more than in the previous scenario, a Web application should send all the user input as query parameters. Besides possible errors in generating DMX, a Web application is exposed to malicious users. Somebody will most likely try to inject DMX into your statements. Although DMX is carefully designed so as to minimize the damage of injection (batches of statements are not permitted and a SELECT statement cannot be chained in any way with a DELETE or DROP statement), a malicious user might still get access to the drill through store, or at least generate a malformed query that has the potential to slow down or even crash your application.
Web applications are running under clearly defined user accounts. A public Web application cannot impersonate the remote user. Therefore, the user account under which such an application is running must be carefully chosen and the permission for that account must be minimal.
Performance of a Web application will increase if a pool of connections to the Microsoft Analysis Services server is maintained, rather than creating a new connection for each request.
On the other hand, connections in a pool may time out. One should make sure that a connection is still active before returning it from the pool.
Microsoft Analysis Services 2005 comes with a set of Web controls (samples optionally installed during setup) that can display the content of mining models in a rich graphical form similar to the one used in the data mining viewers. These controls can be viewed in action here.
For associative predictions, a common type of Web data mining application, new syntactic elements in DMX (available only with Microsoft Analysis Services 2005) dramatically increase the performance. Typically, the DMX statements for performing associative predictions in Microsoft Analysis Services 2000 used the TopCount function to sort the prediction results (the list of products likely to be bought by the current user) based on prediction probability. In Microsoft Analysis Services 2005, TopCount is no longer needed. A query like the following will provide the same results, only faster:
SELECT Predict([Products], 5) FROM Model PREDICTION JOIN <Current products>
Application Scenario 3: Create and Train a New Model on a Server for Current Data
Deploying a mining model on a server actually requires database administrative permissions. Session mining models (objects with the life span limited to the current session) do not require administrative permissions (if this feature is enabled by the server administrator). Session objects do not overload the server with useless metadata, so they should be used when possible.
If the training set is available to the client applications, there are two ways of training a mining model:
- Data can be staged in a relational database that is visible both to the server (for reading) and the client application (for writing). Once the data is in the database, the model may be trained by generating bindings that point to the staged data or by issuing an INSERT INTO statement with an OPENQUERY pointing to the staged data. INSERT INTO statements are easier to generate, but their performance is lower than processing with DDL bindings. This is because DDL bindings allow parallel processing of model columns by issuing queries in parallel, while OPENQUERY is bound to execute the query only once and store the data into a temporary store, then process the columns one by one. If the data set is not very large, the performance difference is hardly visible.
- Data can be pushed as a tabular parameter to an INSERT INTO statement. Adomd.NET has great support for this, and data can be pushed from an in-memory table or from an implementation of the IDataReader interface. The performance advantage of processing with DDL bindings is lost, but the statement is much easier to generate, and sometimes staging is simply not possible because the client and the server do not share a network segment where a relational database would be visible to both (see the sketch below).
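The following is a minimal sketch of the second approach, pushing an in-memory DataTable as a tabular parameter. Model1 and its columns are hypothetical, and the sketch assumes the rowset parameter can be referenced by name (@TrainingData) in place of the source data query:
// Build (or otherwise obtain) the in-memory training data.
DataTable training = new DataTable();
training.Columns.Add("CustomerKey", typeof(long));
training.Columns.Add("Gender", typeof(string));
training.Rows.Add(new object[] { 1L, "M" });
training.Rows.Add(new object[] { 2L, "F" });
// With the first approach (staged data), the source of INSERT INTO would be
// an OPENQUERY over a data source visible to the server; here the source
// is the rowset parameter instead.
AdomdConnection conn = new AdomdConnection();
conn.ConnectionString = "Data Source=localhost; Initial Catalog=MyCatalog";
conn.Open();
AdomdCommand cmd = new AdomdCommand();
cmd.Connection = conn;
cmd.CommandText = "INSERT INTO Model1 (CustomerKey, Gender) @TrainingData";
cmd.Parameters.Add(new AdomdParameter("TrainingData", training));
cmd.ExecuteNonQuery();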
Conclusion
We hope we covered the most important aspects of data mining programmability in Microsoft Analysis Services 2005. The product is not yet complete, and some features may still change by the time it is released. The MSDN SQL site contains information about SQL Server 2005 in general, and about the data mining component in particular.
The data mining team site is always up to date with the latest information, tips and tricks, whitepapers, and downloadable code samples.
The data mining newsgroups, microsoft.public.sqlserver.datamining and microsoft.private.sqlserver2005.analysisservices.datamining, are monitored carefully by the data mining team, so they are also a great source of information.
Appendix 1: Common Operations with Microsoft Analysis Services 2005 Server and the Protocol Format of the Requests
The structure of messages sent to Microsoft Analysis Services 2005 is described in Figure 2 below:
Fig. 2. Structure of MSAS2005 messages
The following examples contain some common Analysis Services request payloads, highlighting the SOAP, XMLA, and the DDL or DMX payloads.
Example 1. Discovers databases on the server, using pure XMLA
SOAP <Envelope xmlns="https://schemas.xmlsoap.org/soap/envelope/">
SOAP <Body>
XMLA <Discover xmlns="urn:schemas-microsoft-com:xml-analysis">
XMLA <RequestType>DBSCHEMA_CATALOGS</RequestType>
XMLA <Restrictions>
XMLA <RestrictionList/>
XMLA </Restrictions>
XMLA <Properties />
XMLA </Discover>
SOAP </Body>
SOAP </Envelope>
Example 2. Creates a new mining structure on the server, using the XMLA Execute command with DDL content (the Alter element and below)
SOAP <Envelope xmlns="https://schemas.xmlsoap.org/soap/envelope/">
SOAP <Body>
XMLA <Execute xmlns="urn:schemas-microsoft-com:xml-analysis">
XMLA <Command>
DDL <Alter
DDL AllowCreate="true"
DDL ObjectExpansion="ExpandFull"
DDL xmlns="https://schemas.microsoft.com/analysisservices/2003/engine"
DDL >
DDL <Object>
DDL <DatabaseID>TT</DatabaseID>
DDL <MiningStructureID>TestModel</MiningStructureID>
DDL </Object>
DDL <ObjectDefinition>
DDL <MiningStructure
DDL xmlns:xsd="http://www.w3.org/2001/XMLSchema"
DDL xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
DDL <ID>TestModel</ID>
DDL <Name>TestModel</Name>
DDL ...
DDL </MiningStructure>
DDL </ObjectDefinition>
DDL </Alter>
XMLA </Command>
XMLA <Properties />
XMLA </Execute>
SOAP </Body>
SOAP </Envelope>
Example 3. Executes a DMX query on a Mining Model
SOAP <Envelope xmlns="https://schemas.xmlsoap.org/soap/envelope/">
SOAP <Body>
XMLA <Execute xmlns="urn:schemas-microsoft-com:xml-analysis">
XMLA <Command>
XMLA <Statement>
DMX SELECT Cluster() FROM CL
XMLA </Statement>
XMLA </Command>
XMLA <Properties />
XMLA </Execute>
SOAP </Body>
SOAP </Envelope>
Appendix 2: Using OLE DB from Ado.NET and C++
Unmanaged C++
Unmanaged C++ is the native environment for working with OLE DB providers. While this requires some COM knowledge and a good understanding of the OLE DB model, the ATL Consumer Templates library hides the complexity of COM and provides an easy way to write applications. Using ATL Consumer Templates is not as easy as using an Analysis Services specific API such as ADOMD.NET, but it is the way to go for unmanaged applications. The unmanaged C++ code in this section assumes that the user has a working knowledge of the ATL library and of the OLE DB interfaces, and a good understanding of the OLE DB for data mining specification.
For the code below, we will assume that we have a predefined macro, CHECK_ERROR, which handles COM errors.
A very simple (albeit not very useful) implementation of this macro is below:
#define CHECK_ERROR(x) hr=(x);if(FAILED(hr)) return 1;
We'll present some C++ code that executes the same tasks that were performed through ADO in the section on OLE DB, less the section on parameters (for reasons of space).
A Data Source object and a session are initialized like this:
CoInitialize(NULL);
{
HRESULT hr;
CDataSource dataSrc;
CSession session;
LPOLESTR szConnStr =
L"Provider=MSOLAP.3; Data Source=localhost;"
L"Initial Catalog=MyCatalog";
CHECK_ERROR(dataSrc.OpenFromInitializationString(szConnStr));
CHECK_ERROR(session.Open(dataSrc));
Note the mandatory call to CoInitialize(NULL) at the beginning of the program. CoInitialize initializes the COM library and must be invoked in the current thread before attempting to load an OLE DB provider.
ATL Consumer Templates provide a generic way of accessing discovery schemas:
CRestrictions<
CDynamicAccessor, 1,
&DMSCHEMA_MINING_MODELS> schemaModels;
schemaModels.SetBlobHandling(DBBLOBHANDLING_NOSTREAMS);
CHECK_ERROR( schemaModels.Open( session, _T("MyCatalog") ));
CHECK_ERROR( schemaModels.MoveFirst() );
do
{
if( hr == S_OK )
{
wchar_t* szModelName = NULL;
if( schemaModels.GetValue((DBORDINAL)3, &szModelName) )
{
printf("%S\n", szModelName);
}
}
hr = schemaModels.MoveNext();
}while(hr == S_OK);
DMSCHEMA_MINING_MODELS is the GUID of the Mining Models provider specific schema used by Analysis Services. It is defined in the oledbdm.h file, which is part of the OLE DB for Data Mining specification. The CRestrictions class instantiated at the beginning of the block uses a generic accessor (CDynamicAccessor) to bind automatically to all the columns returned by the discovery operations.
The MoveFirst/MoveNext/GetValue operations are rather similar in functionality to the ADO counterparts described above in the OLE DB section.
If a specific schema is to be used multiple times in an application (or in multiple applications) then a dedicated accessor should be defined, to simplify and optimize the data access. The following code shows how such an accessor should look, and defines a specialized version of the generic CRestrictions used above, a version dedicated to discovering mining models. The CMiningModelInfo, the accessor for the Mining model schema, should be completely populated with all the columns of interest from the Mining models schema.
class CMiningModelInfo
{
public:
// Constructors
CMiningModelInfo ()
{
memset(this, 0, sizeof(*this));
}
// Attributes
TCHAR m_szCatalog[129];
TCHAR m_szSchema[129];
TCHAR m_szName[129];
TCHAR m_szType[129];
GUID m_guidModel;
TCHAR m_szDescription[129];
...
TCHAR m_szService[129];
...
// Binding Map
BEGIN_COLUMN_MAP(CMiningModelInfo)
COLUMN_ENTRY(1, m_szCatalog)
COLUMN_ENTRY(2, m_szSchema)
COLUMN_ENTRY(3, m_szName)
COLUMN_ENTRY(4, m_szType)
COLUMN_ENTRY(5, m_guidModel)
COLUMN_ENTRY(6, m_szDescription)
...
COLUMN_ENTRY(10, m_szService)
...
END_COLUMN_MAP()
};
typedef CRestrictions<CAccessor<CMiningModelInfo>, 7,
&DMSCHEMA_MINING_MODELS > CMiningModels;
7 is the number of restrictions supported by the Mining models schema, according to the OLE DB for Data Mining specification.
Using the classes above and CMiningModels instead of the generic CRestrictions<CDynamicAccessor>, the code for fetching the model names changes from this:
wchar_t* szModelName = NULL;
if( schemaModels.GetValue((DBORDINAL)3, &szModelName) )
printf("%S\n", szModelName);
to this:
printf("%S\n", schemaModels.m_szName);
Using the same dynamic accessor mechanism (or by designing a specific accessor for common tasks), one can execute commands (DDL or DMX statements).
A Command should be initialized like this:
CStringW strDMXStmt;
strDMXStmt = L"SELECT NODE_CAPTION FROM DecisionTree1.CONTENT";
CCommand<CDynamicAccessor> cmdDMX;
cmdDMX.SetBlobHandling( DBBLOBHANDLING_NOSTREAMS );
CHECK_ERROR( cmdDMX.Open( session, strDMXStmt.GetBuffer(), NULL) )
The Open method of the CCommand class takes a string as the second argument. When connecting to Microsoft Analysis Services 2005, the string can be a DMX statement or a DDL statement. Once the command has been executed (via the Open method), the response format can be inspected by invoking GetColumnInfo. This makes the pCol parameter below point to a vector containing information for every column (such as name, size, or data type). This information can be used to identify the ordinals of the columns to be fetched.
DBORDINAL ulColumns = 0;
DBCOLUMNINFO* pCol=NULL;
LPOLESTR ppStrings = NULL;
CHECK_ERROR( cmdDMX.GetColumnInfo(&ulColumns, &pCol, &ppStrings) );
With the target column identified (the first column, of type string, in the preceding case), one can traverse the server response and fetch the useful data:
wchar_t* pszBuffer = NULL;
CHECK_ERROR( cmdDMX.MoveFirst() );
do
{
wchar_t* szNodeCaption = NULL;
if( cmdDMX.GetValue((DBORDINAL)1, &szNodeCaption) )
{
printf("%S\n", szNodeCaption);
}
hr = cmdDMX.MoveNext();
}while(hr == S_OK);
If the query returns a nested table, this will be exposed for each row as a new IRowset object. The child IRowset object can be obtained from the top level rowset through the IParentRowset interface, which is mandatorily exposed by hierarchical rowsets.
Let's assume that the query DMX statement looks like this:
"SELECT NODE_CAPTION, NODE_DISTRIBUTION FROM DecisionTree1.CONTENT"
The loop for reading the data in this rowset should look like the following. As an example, we will read the value in column 2 of the nested NODE_DISTRIBUTION rowset that, according to the OLE DB for Data Mining specification, contains a mining attribute value as a VARIANT.
// external loop, reading the top level rowset
do
{
// Acquire the IParentRowset interface of the top level rowset
CComPtr<IParentRowset> spParentRowset;
CHECK_ERROR(cmdDMX.m_spRowset.QueryInterface( &spParentRowset));
// Use a dynamic Accessor for reading the nested rowset
CAccessorRowset< CDynamicAccessor> nestedRowset;
nestedRowset.SetBlobHandling( DBBLOBHANDLING_NOSTREAMS );
// Get the pointer to the child rowset into the nestedRowset
// variable
CHECK_ERROR( spParentRowset->GetChildRowset(
NULL,
(DBORDINAL)2,
nestedRowset.GetIID(), (IUnknown**)nestedRowset.GetInterfacePtr() ) );
// Bind the dynamic accessor to the child rowset
CHECK_ERROR( nestedRowset.Bind() );
CHECK_ERROR( nestedRowset.MoveFirst() );
// Internal loop, for the nested (child) rowset
while (hr == S_OK )
{
VARIANT varTmp;
::VariantInit( &varTmp );
if( nestedRowset.GetValue((DBORDINAL)2, &varTmp) )
{
// Use the VARIANT here
}
// Move forward in the nested loop
CHECK_ERROR( nestedRowset.MoveNext() );
}
// hr is not OK, but it is not a failure either:
// assume it is DB_S_ENDOFROWSET or add extra validation code
// Move forward in the top level rowset
hr = cmdDMX.MoveNext ();
}while(hr == S_OK );
The nestedRowset variable, of type CAccessorRowset< CDynamicAccessor >, binds automatically to all the columns and allows easy access to the values of all the fields of the nested rowset. It exposes the methods MoveFirst, MoveNext, and GetValue very much like the cmdDMX top command object. The reason is that the CCommand class derives from the CAccessorRowset class. They just wrap different rowsets: the cmdDMX object wraps the top level rowset resulting from the command execution, while the nestedRowset object wraps a child rowset.
For details on setting OLE DB command parameters in consumer templates, many samples and documentation are available on the MSDN site.
ADO.NET
ADO.NET was designed as a replacement for ADO in managed applications. While it is convenient and easy to use in managed code, the preferred solution for connecting to Microsoft Analysis Services 2005 from a managed application is ADOMD.NET, which is specifically designed for this purpose. An important limitation of ADO.NET is that it does not support named parameters; it only supports parameters identified by their ordinals. Therefore, ADO.NET cannot be used to issue DMX statements with parameters.
ADO.NET can be used by adding a reference to System.Data.dll to your managed project and using that reference with code like the following:
using System.Data.OleDb;
As we mentioned before, ADO.NET is similar to ADO in that it uses a single connection object to wrap both the Data Source and Session OLE DB objects. Using ADO.NET starts by initializing an ADO.NET connection and configuring it to use an OLE DB connection string:
OleDbConnection conn = new OleDbConnection();
conn.ConnectionString = "Provider=MSOLAP.3; Data Source=localhost; " +
"Initial Catalog=MyCatalog";
conn.Open();
Once the connection is initialized, it can be used for discovering objects on the server. The code below fetches the list of mining models from the Analysis Services server:
Guid miningModels = new Guid("{3ADD8A77-D8B9-11D2-8D2A-00E029154FDE}");
object[] arRestrictions = new object[1];
arRestrictions[0] = "CheckInTestDB";
DataTable tblModels =
conn.GetOleDbSchemaTable(miningModels, arRestrictions);
foreach( DataRow row in tblModels.Rows )
{
string modelName = (string)row["MODEL_NAME"];
// Use the modelName here
}
As the previous code shows, we use a single restriction, the catalog name, to limit the discovery to the mining models in the MyCatalog catalog.
DMX and DDL statements can be executed very much as in ADO, through a command object. However, the ADO.NET Command object exposes more versions of the Execute method.
One of these versions is ExecuteReader, which returns a data reader object (implementing the .NET IDataReader interface). A data reader object can be used to traverse a tabular result from the server. If the tabular result contains a nested table, the nested table can be obtained, for each row, as a separate data reader instance. The code below executes a DMX query that returns both a top-level column (NODE_CAPTION) and a nested table column (NODE_DISTRIBUTION). Then it traverses both the top-level and the nested results, showing how to fetch information from the server response:
OleDbCommand cmd = new OleDbCommand();
cmd.Connection = conn;
cmd.CommandText = "SELECT NODE_CAPTION, NODE_DISTRIBUTION FROM " +
" DecisionTree1.CONTENT WHERE NODE_TYPE=2";
OleDbDataReader rdr = cmd.ExecuteReader();
while (rdr.Read())
{
// Use the top level fields
Console.WriteLine(rdr.GetString(0));
// handle the nested reader
// First, make sure that the object in column 1 is a
// nested reader
object objNestedRowset = rdr.GetValue(1);
Debug.Assert(objNestedRowset is IDataReader);
OleDbDataReader nestedReader = (OleDbDataReader)objNestedRowset;
// Now, traverse the nested reader
while (nestedReader.Read())
{
Console.WriteLine(nestedReader.GetValue(1).ToString() );
}
}
Very much like the ADO Command object, the ADO.NET OleDbCommand object must use an active Connection object. It has a CommandText property that can be assigned a DDL or a DMX statement. In the ADO.NET reader object (OleDbDataReader), a column is accessed by its ordinal. The column value can be retrieved as a typed object, when the type is known, as in the code line that makes use of the top level fields. A column value may also be retrieved as a generic object. In this case, the type of the object can be evaluated after retrieval, as in the code lines that handle the nested reader.
Another version of the Execute method is ExecuteNonQuery. This method is particularly useful when executing a DDL statement or a DMX statement that is not expected to return results other than success or failure status, such as CREATE MINING MODEL or INSERT INTO. The code below uses ExecuteNonQuery to launch a DDL Process.
cmd.CommandText = " <Process " +
"xmlns=\"https://schemas.microsoft.com/analysisservices/2003/engine\">"+
" <Type>ProcessStructure</Type>"+
" <Object>"+
" <DatabaseID>CheckInTestDB</DatabaseID>"+
" <MiningStructureID>Structure1</MiningStructureID>"+
" </Object>"+
" </Process>";
cmd.ExecuteNonQuery();
Very much like in ADO, if the execution of the statement fails then an exception is thrown. The C# exception handling allows the developer to catch the exception and investigate the reason for the failure.
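For instance, a sketch of such handling around the preceding ExecuteNonQuery call could look like this:
try
{
    cmd.ExecuteNonQuery();
}
catch (OleDbException ex)
{
    // The provider surfaces the server error text in the exception message.
    Console.WriteLine("Processing failed: " + ex.Message);
}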