SCSM DW ETL jobs - Modules 'Not Started'

Don Tedesco 1 Reputation point
2021-01-05T16:43:28.083+00:00

I have studied numerous sites about how to deal with Batch Jobs that fail or are stuck.
Process is basically:
* Stop all Batch Jobs
* Disable All Schedules
* Re-create Batch Jobs
* Run each Batch job
* Re-enable job schedules

I've also studied how to identify any Job Modules that have errors.

However, in 'my' case, none of the Batch Jobs moves past 'Running', whilst all modules continue to display 'Not Started', even after many hours.
It doesn't matter which Batch Job I run. I understand the correct sequence 'should' be DWMaintenance, MPSyncJob, Extract, Transform, Load, but whether I run the jobs in this order or in any other order, the result is the same: the Batch Job is stuck at 'Running' and the underlying modules all show as 'Not Started'.
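
For reference, the run order described above can be written down as a small helper. This is only a sketch: the prefixes come from the sequence named in this post, while `run_order` and the example job names (such as `Extract_SM`) are hypothetical and are not taken from any SCSM API.

```python
# Order job names by the sequence described in the post.
# The example job names used below are hypothetical.
SEQUENCE = ["DWMaintenance", "MPSyncJob", "Extract", "Transform", "Load"]


def run_order(job_names):
    """Return job_names sorted by their position in SEQUENCE.

    Jobs whose name matches no known prefix are placed last.
    """
    def rank(name):
        for position, prefix in enumerate(SEQUENCE):
            if name.startswith(prefix):
                return position
        return len(SEQUENCE)

    # sorted() is stable, so jobs with the same rank keep their input order.
    return sorted(job_names, key=rank)


if __name__ == "__main__":
    jobs = ["Load.Common", "Extract_SM", "MPSyncJob",
            "Transform.Common", "DWMaintenance"]
    for name in run_order(jobs):
        print(name)
```

The helper only fixes the order in which the jobs would be started; it does not say anything about why a started job makes no progress.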

How can I identify 'what' is stuck, or failing? (There are no errors!)
How long should I leave the job running? I understand the ETL jobs stopped several weeks ago, so I assume there is a lot of catching up to do. But if I leave a job running, how can I tell whether it is still doing 'something' and just taking a long time, or whether it is stuck?
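
One rough way to tell 'slow' from 'stuck' is to record the module statuses twice, some time apart, and compare them. The sketch below assumes the statuses have already been collected by hand (for example from the output of Get-SCDWJobModule) into plain dictionaries; `compare_snapshots`, `all_not_started`, `Verdict`, and the module names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    progressing: bool  # True if at least one module changed status
    changed: dict      # module name -> (earlier status, later status)


def compare_snapshots(earlier, later):
    """Compare two {module name: status} snapshots of one batch job."""
    changed = {
        name: (earlier.get(name), status)
        for name, status in later.items()
        if earlier.get(name) != status
    }
    return Verdict(progressing=bool(changed), changed=changed)


def all_not_started(snapshot):
    """True if the snapshot is non-empty and every module is 'Not Started'."""
    return bool(snapshot) and all(
        status == "Not Started" for status in snapshot.values()
    )


if __name__ == "__main__":
    # Hypothetical module names; statuses as reported in this thread.
    before = {"ModuleA": "Not Started", "ModuleB": "Not Started"}
    after = {"ModuleA": "Not Started", "ModuleB": "Not Started"}
    verdict = compare_snapshots(before, after)
    print("progressing:", verdict.progressing)
    print("nothing picked up yet:", all_not_started(after))
```

An unchanged snapshot does not prove a hang on its own, since a single long-running module would also look unchanged. If every module is still 'Not Started' in both snapshots, though, no module has been picked up at all, which matches the situation described here.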

Windows Event Viewer shows no errors.
Database shows no errors.

I don't know where else to look, or what else to look for.
As I say, anything I can find online simply explains how to stop/re-start batch jobs and/or find batch jobs/modules that have failed (error 7).

Anyone have any advice?

Cheers,
Don

Service Manager

5 answers

  1. Andreas Baumgarten 96,266 Reputation points MVP
    2021-01-05T18:28:10.227+00:00

    Maybe this is helpful:

    https://learn.microsoft.com/en-us/archive/blogs/mihai/service-manager-data-warehouse-troubleshooting

    ----------

    (If the reply was helpful please don't forget to upvote and/or accept as answer, thank you)

    Regards
    Andreas Baumgarten


  2. Don Tedesco 1 Reputation point
    2021-01-06T09:23:03.487+00:00

    Thanks Andreas. This looks like an interesting article which goes a step beyond all the other articles I have read before.

    It gives more information about the modules behind each batch job :-)
    I need to continue to study it further to see if I can extract any 'killer information' which will resolve my issue - but not there yet :-(

    I have turned off all schedules.
    I have created new batches for all Processes (all currently sitting in 'Not Started' Status, as expected).

    I then started the first batch with 'Start-SCDWJob -jobname DWMaintenance' at 10pm last night. It immediately changed to 'running' status.
    However, this morning at 9am, having been 'running' for 11 hours overnight, the batch still shows 'running' status, AND the modules are all still showing 'Not Started'!
    So...on the face of it, nothing has happened!
    But I don't know why, nor where to look....
    IF 'something' had run, and/or an error was displayed, I might have an idea of where to look. But no error and no real evidence of 'anything' happening - so not even sure where else to look......
    No errors anywhere on server :-(

    (I 'have' just spotted an error in the SQL Server errorlog:
    "AppDomain 995 (DWStagingAndConfig.dbo[runtime].1034) is marked for unload due to memory pressure."
    I wonder if this is related in some way...? Am looking to see if I can increase memory on database server......)


  3. Don Tedesco 1 Reputation point
    2021-01-07T16:38:34.947+00:00

    OK, I added 8 GB of memory to the instance.
    So no further 'memory pressure' alerts.

    But still.....no progress.

    It doesn't matter which Batch I run. It just displays 'Running' and the associated WorkItems just display 'Not Started'.
    Nothing proceeds. No errors. No movement.

    I assume that 'something' is preventing any of it from running, but have NO IDEA what it could be.
    And I can't find anywhere online that can point me any direction.

    I HAVE contacted MS in the past about a similar issue (stuck jobs) - I had to watch an MS consultant for 4 hours, on 4 occasions (that's at least 16 hours!) whilst he went through the same process of disabling schedules, re-creating batches, running them, then re-enabling schedules. It didn't work for him either!
    I only managed to get it to work (by accident!) by repeating the process over and over until 'suddenly' it kicked into action - no idea how or why :-(

    Am getting desperate now.

    Don't really want to go back to MS (though may have to) as I am sure he will just repeat the process AGAIN.......


  4. Andreas Baumgarten 96,266 Reputation points MVP
    2021-01-07T16:46:48.337+00:00

    The SCSM Data Warehouse was a "tricky" thing since the beginning.

    Just as an option to think about:
    What data retention time is configured in your SCSM environment?
    Does it make sense to unregister/uninstall the DWH/reinstall the DWH and register again?
    You will lose from the reports all SCSM DWH data that is older than the configured retention time and has already been deleted from the SCSM database.

    Another option might be to contact MS support again.


    (If the reply was helpful please don't forget to upvote and/or accept as answer, thank you)

    Regards
    Andreas Baumgarten


  5. Donato Tedesco 1 Reputation point
    2021-03-15T14:15:28.057+00:00

    OK, 2 months later, and I am STILL in pretty much the same position.....

    i.e. Job marked as 'Started', but all modules showing 'Not Started'

    In January, I continued my usual activity - stopping batches, re-creating them, running them. Over and over. Until suddenly, for no obvious reason, one of them actually started running.
    However, for the first time ever, it actually displayed an ERROR!
    Upon investigation, this was related to a known date issue whereby up until SCSM Update Rollup 5, dates beyond 31/12/2020 were not accepted, and errors arose. Well....finally, my request to update SCSM has been completed, so 'that' should no longer be an issue.

    BUT....I am back to my original issue! :-(

    I can START any job and it will show as 'Running' but all its modules will show as 'Not Started'..........
    I followed all the standard instructions for troubleshooting stuck data warehouse jobs at:

    https://learn.microsoft.com/en-us/troubleshoot/system-center/scsm/troubleshoot-stuck-data-warehouse-jobs  
    

    Stopping batches. Re-creating them, Starting them again....

    However, at a later stage it just says "If the job completes successfully, go to the next step."
    It doesn't give any advice (or even a 'hint') on what to do if the job does NOT complete successfully......

    I started the DWMaintenance job last Friday afternoon, and this morning (Monday) it is still sitting there showing 'Running' Status, with all its modules showing 'Not Started' - doesn't look like 'anything' has happened. Over 48 hours later....

    As I mentioned earlier, I am reluctant to call Microsoft support, since I have spent 16+ hours in the past with a support person who simply read off the same web page instructions and followed the same steps that 'I' have followed many times.
    They don't appear to be 'expert' (or even 'knowledgeable') on the system, as I have to sit with them whilst they read all the same documentation that I have gone through myself.

    I can't believe that I am the ONLY person that is facing the same issue (though I suspect I may be the only one who is reading 'this' message... :-( )

    Would be grateful for any new information, or suggestions on where to look next....