PolyBase against a large MongoDB collection is very slow
I have SQL Server 2022 installed with PolyBase enabled, and I have created a few external tables against one big MongoDB collection for testing.
CREATE DATABASE SCOPED CREDENTIAL stgmdbUser1 WITH IDENTITY = 'polybaseUse1', SECRET = 'xxxxxxx';
go
CREATE EXTERNAL DATA SOURCE mongo_stgmdb01
WITH (LOCATION = 'mongodb://stgmdb01:27017', CONNECTION_OPTIONS = 'ssl=false', CREDENTIAL = stgmdbUser1, PUSHDOWN = ON)
go
The MongoDB instance is the Community edition, version 5.0.13. The documents in the collection are large, with embedded arrays and objects. I created one external table:
simple_tbl: it maps [_id] NVARCHAR(256), [app_id] INT, and [time_stamp] DATETIME2. (I had to widen [_id] from NVARCHAR(24) to NVARCHAR(256) because I ran into an error complaining that the mapped field was too narrow.)
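For reference, the external table definition looks roughly like this (the LOCATION value `mydb.mycollection` is a placeholder for the real database and collection names):

```sql
-- Sketch of the external table definition; LOCATION is a placeholder
-- in the form '<database>.<collection>' on the MongoDB side.
CREATE EXTERNAL TABLE simple_tbl (
    [_id]        NVARCHAR(256),
    [app_id]     INT,
    [time_stamp] DATETIME2
)
WITH (
    LOCATION = 'mydb.mycollection',  -- placeholder database.collection
    DATA_SOURCE = mongo_stgmdb01
);
```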
"select top 10 * from simple_tbl" returns in 3 seconds, which is OK, not too bad.
But if I filter on a column that is properly indexed in MongoDB, even [_id], the query takes between 1 minute 40 seconds and 2 minutes:
select * from simple_tbl where [_id] = N'c0f70418-50f7-4d8f-b9b8-09aea3e987af';
I followed this link [polybase-how-to-tell-pushdown-computation][1]: I enabled trace flag 6408 and used the query hint OPTION (FORCE EXTERNALPUSHDOWN) to make sure the filtering is done on the external database. I also checked the query plan and the dynamic management views, and both show that pushdown is happening.

To rule out a connectivity problem between SQL Server and MongoDB, I filtered the collection by the same _id using the "MongoDB Compass" tool on the SQL Server machine, and it returned the result immediately, as expected. I repeated the same test filtering on another column, "time_stamp", with the same behavior; "time_stamp" is indexed in MongoDB as well.
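This is roughly how I checked: force pushdown on the query, then inspect the PolyBase DMVs (column list below is a sketch; exact columns may differ by build):

```sql
-- Run the filtered query with pushdown forced.
SELECT * FROM simple_tbl
WHERE [_id] = N'c0f70418-50f7-4d8f-b9b8-09aea3e987af'
OPTION (FORCE EXTERNALPUSHDOWN);

-- Most recent distributed (PolyBase) requests.
SELECT TOP 10 r.execution_id, r.status, r.start_time, r.end_time, r.total_elapsed_time
FROM sys.dm_exec_distributed_requests AS r
ORDER BY r.end_time DESC;

-- Per-step external work for those requests (shows what was read
-- from the external source and how long each step took).
SELECT w.execution_id, w.step_index, w.status, w.input_name, w.bytes_processed, w.total_elapsed_time
FROM sys.dm_exec_external_work AS w;
```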