PolyBase access to large MongoDB tables is very slow

Luca V
2021-09-01T14:58:24.493+00:00

Hi all,

we have installed the PolyBase feature on SQL Server 2019 Developer Edition.
After that, we created an external data source pointing to our MongoDB server, which stores many large tables containing application logs.

CREATE DATABASE SCOPED CREDENTIAL mongo_creds WITH IDENTITY = 'XXXXXXXXXXXX', SECRET = 'XXXXX!';
GO
CREATE EXTERNAL DATA SOURCE MongoDBSource
WITH (
    LOCATION = 'mongodb://nnn.nnn.nnn.nn:27017',
    CREDENTIAL = mongo_creds,
    CONNECTION_OPTIONS = 'ssl=false;'
);
GO
CREATE EXTERNAL TABLE...
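
For context, the external tables follow the usual PolyBase pattern; the sketch below uses a hypothetical database/collection (LogDb.AppLogs) and illustrative columns, not our real schema:

-- Hypothetical example only: LogDb.AppLogs and the column list are placeholders
CREATE EXTERNAL TABLE dbo.AppLogs
(
    [_id]     NVARCHAR(24) NOT NULL,   -- MongoDB ObjectId exposed as a string
    [field]   INT,
    [message] NVARCHAR(4000)
)
WITH (
    LOCATION = 'LogDb.AppLogs',        -- <database>.<collection> on the MongoDB server
    DATA_SOURCE = MongoDBSource
);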

It works, but we are experiencing serious performance problems when querying the large tables (about 50-100 GB) on MongoDB as soon as we add filter conditions.
A "SELECT TOP 100 *" is lightning fast, but a query with "...WHERE field = 15" can take more than twenty minutes, on any table, even though the same query takes just a second when run directly on MongoDB.
In addition, we have noticed that when we run the query from SQL Server, the system tries to "download" the entire table into temporary files on the Windows machine (C:\Windows\ServiceProfiles\NetworkService\AppData\Local\Temp)...
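
For concreteness, the two patterns look roughly like this (dbo.AppLogs is a hypothetical external table name):

-- Returns almost instantly
SELECT TOP 100 * FROM dbo.AppLogs;

-- Takes twenty minutes or more from SQL Server, while the equivalent filter on MongoDB returns in about a second
SELECT * FROM dbo.AppLogs WHERE field = 15;
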
Does anyone have an idea of what we could try to speed up these queries and avoid this kind of spilling to the temp directory?

Thanks in advance.

Luca


Accepted answer
  1. Luca V
    2021-09-02T07:26:04.157+00:00

    Hello,
    yes, we created indexes and they are very useful when you run the query directly on MongoDB... Queries run from SQL Server get no benefit at all.
    I think the point is what SQL Server seems to do every time we run a query: it uses the temp directory to download the whole dataset to be queried...
    If I check the sys.external_data_sources catalog view, we defined the pushdown parameter correctly (= ON), so the query work should be carried out by the MongoDB server...
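
    For reference, this is roughly how the setting can be checked and how pushdown can be forced or disabled per query for testing; dbo.AppLogs is a hypothetical external table name, and the behaviour of these hints against a MongoDB source may vary:

    -- Check the pushdown setting of the external data source
    SELECT name, location, pushdown
    FROM sys.external_data_sources
    WHERE name = 'MongoDBSource';

    -- Try forcing predicate pushdown for a single query
    SELECT COUNT(*) FROM dbo.AppLogs WHERE field = 15
    OPTION (FORCE EXTERNALPUSHDOWN);

    -- Disable pushdown to compare timings
    SELECT COUNT(*) FROM dbo.AppLogs WHERE field = 15
    OPTION (DISABLE EXTERNALPUSHDOWN);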

    ...I have just found in the Microsoft documentation that pushdown against a MongoDB community data server is not applied in many situations:

    If the join can be done at the external data source, this reduces the amount of data movement and improves the query's performance. Without join pushdown, the data from the tables to be joined must be brought locally into tempdb, then joined.

    Maybe this could be the answer...

    Luca


1 additional answer

  1. Seeya Xi-MSFT
    2021-09-02T02:46:40.493+00:00

    Hi @Luca V ,

    Have you built indexes on the large tables? For tables with this much data, full table scans should be avoided as much as possible.
    Here is an article about improving performance: https://docs.mongodb.com/manual/tutorial/optimize-query-performance-with-indexes-and-projections/

    Best regards,
    Seeya


