DFS Referral Override and Replication not working on one of my folders in a namespace

John Hummer 11 Reputation points
2022-01-06T22:49:55.883+00:00

Hi folks! I am pulling my hair out on this one. I have two folders in a namespace, "\\company.local\DFS". Each of the two folders has two targets: one is a fully updated Server 2016 machine and the other is a fully updated Server 2022 machine. One of the folders in the "\\company.local\DFS" namespace is called "Users". That folder also has DFS Replication turned on. Everything on that "Users" folder is working perfectly.

The other folder under the "\\company.local\DFS" namespace, called "Data", is what is keeping me up at night. The biggest issue is that users randomly connect to one of the two targets for the namespace folder, even though I have referral override settings turned on to give priority to the Server 2022 machine. After 2 days of random users still connecting to the Server 2016 machine, I tried disabling the Server 2016 target. The next day, people still randomly connected to the Server 2016 machine, despite that target being disabled in that namespace folder. I deleted the namespace folder and recreated it. The issue persists.

The other issue with the "Data" folder under that namespace is that DFS Replication will not work. I can only ever get one of the servers to replicate with the other; it never works both ways. I have searched a lot of forums and tried the following: deleting the replication and recreating it, deleting the DFSRPrivate folders, ensuring that there is a primary server for the replicated folders, raising the staging folder limit to as much as 100 GB (which should be way more than is needed, as we don't have many large files), and re-seeding the folders to ensure they have the same files and folders before starting replication.

Those are the things I have tried that I can think of at the moment. Each time I set up the replication, one of the two servers (more often the Server 2016 machine) reports, when a DFSR health report is run, that it is not participating in replication. I let it sit for a week at one point, and it still reported that error. The target folder has about 2 TB of data, so not a ton. Just looking for help brainstorming what I am missing. It's very frustrating that the "Users" folder under the same namespace, on the same servers, set up exactly the same way, is working perfectly. Thanks to anyone who is willing to throw some ideas out!

Windows Server

7 answers

  1. John Hummer 11 Reputation points
    2022-01-10T15:43:06.613+00:00

    Please tell me someone out there knows how DFS works better than me (which is not much)?


  2. John Hummer 11 Reputation points
    2022-01-25T17:12:52.757+00:00

    Seriously... nobody has any idea how DFS works?


  3. John Hummer 11 Reputation points
    2022-01-31T14:35:15.79+00:00

    It's getting very lonely here.


  4. cthivierge 4,056 Reputation points
    2022-02-01T06:05:40.457+00:00

    There are two things in your environment: DFSN and DFSR, which are not the same thing. You can have DFSN without DFSR, or DFSR without DFSN.

    Let's start with DFS Namespaces.
    Here is how DFSN works:

    1. A user attempts to access a link in a domain-based namespace, such as by typing \\mydomain.com\DFS\Data in the Run dialog box.
    2. The client computer sends a query to the active domain controller to discover a list of root targets for the domain-based namespace.
    3. The domain controller returns a list of root targets defined for the requested namespace (using AD site connectivity and costing to identify the best namespace server to use).
    4. The client selects the first root target in the referral and sends a query to the root server for the requested link.
    5. The root server constructs a list of link targets in the referral (using AD site connectivity and costing). The order in which the link targets are sorted depends on the target selection method.
    6. The root server sends the referral to the client.
    7. The client establishes a connection to the first link target in the list.

    The client caches the namespace server as well as the target, but only for a maximum of 30 minutes by default, and the cache is also cleared if the client restarts.

    Which servers are your namespace servers?

    If you configure Override referral ordering with this setting: "First among all targets", this should connect all clients to this target if the target is available.
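
As a rough illustration of how an override interacts with the normal ordering, here is a minimal Python sketch. It is a simplified model, not the actual DFSN implementation: it assumes targets are sorted by override priority first and by site cost second, that ties come back in random order, and that disabled targets are left out of the referral. The names `Target` and `order_referral`, the priority labels, the server names, and the cost numbers are all made up for the example.

```python
import random
from dataclasses import dataclass

# Hypothetical labels for this sketch; they only mirror the wording of the
# "Override referral ordering" options, not any real API.
FIRST_AMONG_ALL = 0
NORMAL = 1
LAST_AMONG_ALL = 2


@dataclass
class Target:
    path: str               # UNC path of the folder target
    site_cost: int          # lower means "closer" to the client (made-up numbers)
    priority: int = NORMAL  # override setting, if any
    enabled: bool = True    # disabled targets are left out of referrals


def order_referral(targets, rng=random):
    """Return folder targets in the order a client would try them (simplified)."""
    candidates = [t for t in targets if t.enabled]
    # Shuffle first so that targets with equal priority and cost end up in
    # random order; sorted() is stable and keeps that randomness within ties.
    candidates = rng.sample(candidates, k=len(candidates))
    return sorted(candidates, key=lambda t: (t.priority, t.site_cost))


if __name__ == "__main__":
    targets = [
        Target(r"\\SRV2016\Data", site_cost=0),
        Target(r"\\SRV2022\Data", site_cost=0, priority=FIRST_AMONG_ALL),
    ]
    print([t.path for t in order_referral(targets)])
```

In this model the Server 2022 target is always listed first while it is enabled. Without the override, both targets have the same cost and the order is random, which would match users landing on either server.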

    What is the namespace configuration? Is it "Optimize for consistency" or "Optimize for scalability"? (See the Advanced tab in the properties of the namespace.)

    Does this issue affect all clients, or do some clients have no issues?

    Why not just remove the Windows Server 2016 machine from the DFS target list? Remove it from the namespace only; do not delete it from the replication group.

    Now, for DFSR: did you get event 4104 in the DFS Replication event log, which states that the initial replication has finished?

    What is the type of replication? Is it two-way (like a full mesh) or only one-way (like hub and spoke)?

    Have you run the dfsrdiag backlog command?

    From any server that has the RSAT for DFS, type the following command:
    dfsrdiag backlog /rgname:[ReplicationGroupName] /rfname:[ReplicationFolderName] /smem:[SourceComputer] /rmem:[DestinationComputer]
    Here is an example: dfsrdiag backlog /rgname:contoso.com\dfs\data /rfname:data /smem:WIN2016SRV01 /rmem:WIN2022SRV01
    You can swap the source computer and destination computer to validate replication in both directions.

    If everything is OK, you should see "No backlog - member <SourceComputer> is in sync with partner <DestinationComputer>".
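
To check both directions without retyping the command, a small Python sketch like the one below could build and run it for each direction. It is only a sketch: it uses just the dfsrdiag switches shown above, the function names `backlog_commands` and `run_backlog_checks` are made up, and actually running the commands assumes a Windows machine with the DFS RSAT tools installed.

```python
import subprocess


def backlog_commands(rg_name, rf_name, member_a, member_b):
    """Build the dfsrdiag backlog command for each replication direction."""
    def cmd(source, destination):
        return [
            "dfsrdiag", "backlog",
            f"/rgname:{rg_name}",
            f"/rfname:{rf_name}",
            f"/smem:{source}",
            f"/rmem:{destination}",
        ]
    return [cmd(member_a, member_b), cmd(member_b, member_a)]


def run_backlog_checks(rg_name, rf_name, member_a, member_b):
    """Run both checks and print the raw output (needs dfsrdiag on the PATH)."""
    for command in backlog_commands(rg_name, rf_name, member_a, member_b):
        print(" ".join(command))
        result = subprocess.run(command, capture_output=True, text=True)
        print(result.stdout or result.stderr)


if __name__ == "__main__":
    # Values taken from the example above; replace them with your own.
    for command in backlog_commands(
        r"contoso.com\dfs\data", "data", "WIN2016SRV01", "WIN2022SRV01"
    ):
        print(" ".join(command))
```

Run as-is, the script only prints the two command lines; call `run_backlog_checks` with the same arguments to execute them.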


  5. John Hummer 11 Reputation points
    2022-02-01T15:03:08+00:00

    THANK YOU! I really appreciate your help. I have disabled the target that I do not want clients to generally be referred to, and that did indeed work, although I still don't know why the referral override settings don't work. I do know that I waited far longer than half an hour, even though I had missed the setting that dictates how long a referral sits in the cache. It's nice to know it's half an hour, though.

    For the replication, when I run that command it says there are 14328 backlogged files. I have been occasionally running the "Create Diagnostic Report" action in the DFS Management console for the replication group that isn't working. It has not finished initial replication, though it has been working on it for over 2 weeks. The backlogged file count fluctuates from a couple hundred to tens of thousands of files. The replication is set up as a full mesh, and there is no throttling set up on the replication traffic. Both servers are on site with a 1-gigabit link between them.

    When I look at the list of files after running the command you shared, I do not see any common link between them that would give me any clues as to why they are backlogged. So I admit ignorance about how to figure out why they are keeping the initial replication process from completing. What makes it even stranger is that several of the files I see are ones that I know nobody has opened or modified for at least several months. And I did preseed all files using the robocopy command.

    Do you have any advice on how to find out why so many files are preventing the initial replication from completing? Especially since they were preseeded and have not been touched for a long time. I have looked through the DFSR event logs, but that hasn't helped me find the cause.

    Again, I really appreciate your help!

