Parse specific information based on other column value using t-sql

SQL9 246 Reputation points
2021-07-25T15:03:42.963+00:00

Hi All,

I have a table with 3 columns ID(int), PhysicianName, and Text(varchar(max)). I am trying to parse a text column with multiple physicians but need to extract only one physician info based on PhysicianName column.

Here is the sample source data.

ID PhysicianName Text
1 Roddy Patty This is a combined notes of the Physician appointments on a given day. Roddy Patty Profession: General Physician ; PhysicianID: 123ABC ; Appointment Date: January 7, 2001 VisitDescription: Patient joined due to bla bla blah. Summary: Patient had bla bla blah. Michael Duke Profession: Surgeon ; PhysicianID: 567XYZ ; Appointment Date: March 10, 2011 visitDescription: Patient bla bla blah. Summary: Patient suffering from bla bla blah need immediate surgeory & treatment.

2 John Smith This is a combined notes of the Physician appointments on a given day. Jake Dawn Profession: Physician ; PhysicianID: 453AAA Appointment Date: October 09, 2009 VisitDescription: Patient joined due to bla bla blah. Summary: Patient had bla bla blah. John Smith Profession: Civil Surgeon ; PhysicianID: 123BBB ; Appointment Date: June 15, 2005 VisitDescription: Patient had bla bla blah. Summary: Patient suffering from bla bla blah need to start treatment.

Text column contains series of text with all physicians info(on a given day text column can have 1 or more physicians info). I want a query which should substring single physician info per ID based on PhysicianName column.

For example: ID = 1 , It has two Physicians in the text - Roddy Patty & Michael Duke, but I want only Roddy Patty Physician information because for ID=1 the Physician is Roddy Patty.

For PhysicianName = "Roddy Patty", the query should substring all information from text column from "Roddy Patty Profession: General Physician ; PhysicianID: 123ABC ; Appointment Date: January 7, 2001 VisitDescription: Patient joined due to bla bla blah. Summary: Patient had bla bla blah" and exclude Michael Duke info.

Need output for ID =1:

ID PhysicianName Profession PhysicianID AppointmentDate VisitDescription Summary
1 Roddy Patty General Physician 123ABC January 7, 2001 Patient joined due to bla bla blah. Patient had bla bla blah

Same thing with ID = 2, It has two Physicians in the text - Jake Dawn & John Smith, but I want only John Smith Physician information because for ID=2 the Physician is John Smith, and exclude Jake Dawn info.

For PhysicianName = "John Smith", the query should substring all information from text column from "John Smith Profession: Civil Surgeon ; PhysicianID: 123BBB ; Appointment Date: June 15, 2005 VisitDescription: Patient had bla bla blah. Summary: Patient suffering from bla bla blah need to start treatment."

Need output for ID =2:

ID PhysicianName Profession PhysicianID AppointmentDate VisitDescription Summary
2 John Smith Civil Surgeon 123BBB June 15, 2005 Patient had bla bla blah. Patient suffering from bla bla blah need to start treatment.

I know that we have to use Substring , charindex or Patindex to extract the data, I have tried it but no luck, getting Invalid length parameter passed to the LEFT or SUBSTRING function error.

SELECT SUBSTRING(Text, CHARINDEX(PhysicianName, Text), CHARINDEX('Profession:',Text) - CHARINDEX(PhysicianName, Text) + 500)

It would be helpful if someone can post the code to parse this kind of data.

Create Statement:

CREATE Table #Physician (ID INT , PhysicianName Varchar(30) , Text Varchar(max))

Insert Statement:
Insert into #Physician (ID , PhysicianName , Text )
values ( 1 , 'Roddy Patty' , 'This is a combined notes of the Physician appointments on a given day. Roddy Patty Profession: General Physician ; PhysicianID: 123ABC ; Appointment Date: January 7, 2001 VisitDescription: Patient joined due to bla bla blah. Summary: Patient had bla bla blah. Michael Duke Profession: Surgeon ; PhysicianID: 567XYZ ; Appointment Date: March 10, 2011 visitDescription: Patient bla bla blah. Summary: Patient suffering from bla bla blah need immediate surgeory & treatment.' ),
(2 , 'John Smith' , 'This is a combined notes of the Physician appointments on a given day. Jake Dawn Profession: Physician ; PhysicianID: 453AAA Appointment Date: October 09, 2009 VisitDescription: Patient joined due to bla bla blah. Summary: Patient had bla bla blah. John Smith Profession: Civil Surgeon ; PhysicianID: 123BBB ; Appointment Date: June 15, 2005 VisitDescription: Patient had bla bla blah. Summary: Patient suffering from bla bla blah need to start treatment.')

drop table #Physician

Thanks in advance,
RH

Transact-SQL
Transact-SQL
A Microsoft extension to the ANSI SQL language that includes procedural programming, local variables, and various support functions.
4,637 questions
{count} votes

Accepted answer
  1. Yitzhak Khabinsky 25,856 Reputation points
    2021-07-25T20:16:09.333+00:00

    Hi @SQL9 ,

    Please try the following solution. SQL Server 2016 onwards.

    It is not a picnic to deal with such unstructured data.

    SQL

    -- DDL and data population, start  
    DECLARE @Physician Table  (ID INT IDENTITY PRIMARY KEY, PhysicianName Varchar(30), Tokens VARCHAR(MAX))  
    INSERT INTO @Physician (PhysicianName, tokens ) VALUES  
    ('Roddy Patty' , 'This is a combined notes of the Physician appointments on a given day. Roddy Patty Profession: General Physician ; PhysicianID: 123ABC ; Appointment Date: January 7, 2001 VisitDescription: Patient joined due to bla bla blah. Summary: Patient had bla bla blah. Michael Duke Profession: Surgeon ; PhysicianID: 567XYZ ; Appointment Date: March 10, 2011 visitDescription: Patient bla bla blah. Summary: Patient suffering from bla bla blah need immediate surgeory & treatment.' ),  
    ('John Smith' , 'This is a combined notes of the Physician appointments on a given day. Jake Dawn Profession: Physician ; PhysicianID: 453AAA Appointment Date: October 09, 2009 VisitDescription: Patient joined due to bla bla blah. Summary: Patient had bla bla blah. John Smith Profession: Civil Surgeon ; PhysicianID: 123BBB ; Appointment Date: June 15, 2005 VisitDescription: Patient had bla bla blah. Summary: Patient suffering from bla bla blah need to start treatment.');  
    -- DDL and data population, end  
      
    SELECT ID, PhysicianName  
    	, JSON_VALUE(j, '$.Profession') AS Profession  
    	, JSON_VALUE(j, '$.PhysicianID') AS PhysicianID  
    	, JSON_VALUE(j, '$."Appointment Date"') AS [Appointment Date]  
    	, JSON_VALUE(j, '$.VisitDescription') AS VisitDescription  
    	, JSON_VALUE(j, '$.Summary') AS Summary  
    FROM @Physician  
    CROSS APPLY (SELECT STUFF(Tokens, 1, CHARINDEX(PhysicianName, tokens) - 1,'')) AS t1(c)  
    CROSS APPLY (SELECT N'{"' + REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(LEFT(c, CHARINDEX('.', c, CHARINDEX('.', c, 1) + 1)-1)  
    	, PhysicianName+SPACE(1),'')  
    	, ': ','":"')  
    	, ' ; ','","')  
    	, ' VisitDescription','","VisitDescription')  
    	, '. Summary','","Summary') + '"}') AS t2(j);  
    

2 additional answers

Sort by: Most helpful
  1. Ryan Abbey 1,181 Reputation points
    2021-07-25T20:15:38.863+00:00

    Take a step back, what does

    SELECT CHARINDEX('Profession:',Text) , CHARINDEX('Profession:',Text) - CHARINDEX(PhysicianName, Text) + 500

    return?

    That may give you a step as to why it's giving you the error

    Your search for "Profession" will always hit the first entry which is around 50 characters in... your search for the Physician name I suspect is sometimes more than 550 characters in giving you a negative length

    From what I can see, you first need to search for the physician name and return that substring (outer apply helps with this kind of reduction), then look for "Profession" in this substring to get where this physician's entry ends. From there, need to take a couple of steps back to remove the subsequent physician's name from the substring

    (I'm going to assume for simplicity here that they will exist... and for double simplicity, there is a subsequent physician, you'll need to account for when those aren't true)

    select substring(Step3, 1, len(Step3)-Step4)
    from ...
    outer apply (select charindex(PhysicianName, Text) ) w(Step1)
    outer apply (select charindex("Profession: ", Text, Step1)) x(Step2)
    outer apply (select substring(Text, Step1, Step2-Step1) ) y(Step3)
    outer apply (select charindex('.', reverse(Step3)) z(Step4)

    I haven't tested any of the above so may need a little work - some can be combined but this allows you to see the breakdown

    0 comments No comments

  2. EchoLiu-MSFT 14,581 Reputation points
    2021-07-26T06:40:30.863+00:00

    Please also check:

     ;WITH cte  
     as(SELECT *,RIGHT(Text,LEN(Text)-CHARINDEX(PhysicianName,Text)-LEN(PhysicianName)+1) string  
     FROM #Physician)  
     ,cte2 as(SELECT ID,PhysicianName,SUBSTRING(string,1,  
     CHARINDEX('.',string,CHARINDEX('Summary',string))-1) [Text]  
     FROM cte)  
     ,cte3 as(SELECT *,STUFF(REPLACE([Text],'.',';'),  
     CHARINDEX('VisitDescription',[Text])-1,1,';') [Text2]  
     FROM cte2)  
     ,cte4 as(SELECT ID,PhysicianName,LTRIM(LEFT(VALUE,CHARINDEX(':',VALUE)-1)) Title,  
     LTRIM(RIGHT(VALUE,LEN(VALUE)-CHARINDEX(':',VALUE))) Con  
     FROM cte3  
     CROSS APPLY STRING_SPLIT([Text2],';'))  
          
     SELECT * FROM cte4  
     PIVOT (MAX(Con) FOR Title in   
     (Profession,PhysicianID,[Appointment Date],VisitDescription,Summary)) p  
     ORDER BY ID  
    

    Output:

    117835-image.png

    If you have any question, please feel free to let me know.

    Regards
    Echo


    If the answer is helpful, please click "Accept Answer" and upvote it.


Your answer

Answers can be marked as Accepted Answers by the question author, which helps users to know the answer solved the author's problem.