Prep Data formula for a Bell Curve

vsslasd 556 Reputation points
2021-02-10T22:56:00.813+00:00

I have some data such as this:

Declare @D table (ID varchar(7), ProjDate Date, ProjAmt numeric(19,5))  
  
Insert Into @D(ID, ProjDate, ProjAmt)  
  
VALUES  
  
('C2147',  '2021-01-01', 6.60869565217391),  
('C2147',  '2021-01-04', 6.60869565217391),  
('C2147',  '2021-01-11', 6.60869565217391),  
('C2147',  '2021-01-18', 6.60869565217391),  
('C2147',  '2021-01-25', 6.60869565217391),  
('C2147',  '2021-02-01', 6.60869565217391),  
('C2147',  '2021-02-08', 6.60869565217391),  
('C2147',  '2021-02-15', 6.60869565217391),  
('C2147',  '2021-02-22', 6.60869565217391),  
('C2147',  '2021-03-01', 6.60869565217391),  
('C2147',  '2021-03-08', 6.60869565217391),  
('C2147',  '2021-03-15', 6.60869565217391),  
('C2147',  '2021-03-22', 6.60869565217391),  
('C2147',  '2021-03-29', 6.60869565217391),  
('C2147',  '2021-04-05', 6.60869565217391),  
('C2147',  '2021-04-12', 6.60869565217391),  
('C2147',  '2021-04-19', 6.60869565217391),  
('C2147',  '2021-04-26', 6.60869565217391),  
('C2147',  '2021-05-03', 6.60869565217391),  
('C2147',  '2021-05-10', 6.60869565217391),  
('C2147',  '2021-05-17', 6.60869565217391),  
('C2147',  '2021-05-24', 6.60869565217391),  
('C2147',  '2021-05-31', 6.60869565217391)  
  
Select * from @D  

I'd like to prep this data for a bell curve, and not certain about the formula, the Bell Curve weight should peak at 75% of the way during the timeline, not the exact center.

This needs to be calculated based across the Id field, and the ProjAmt Field should start and end at zero, and have the predominate "bell" project amount around the 4/5/21 timeframe (the approximate 75% timeline factor).

What would be the formula in TSQL to convert this data appropriately ?

I'm trying to convert the flat line data noted above so that the bulk of the data appears 75% of the way into the timeline and the starting and ending points are zero, similar to the example here in this spreadsheet. I believe the formula would pertain to something similar to a Bell Curve formula the exact results aren't in green - it is the concept: weight most of the period amounts into the 75% portion of the timeline. How would I do this ?

67082-bell.png

Developer technologies | Transact-SQL
{count} votes

Accepted answer
  1. Viorel 122.6K Reputation points
    2021-02-12T10:16:22.743+00:00

    In case of multiple groups, try this script:

    declare @D table (ID varchar(7), ProjDate Date, ProjAmt numeric(19,5))  
          
    insert Into @D(ID, ProjDate, ProjAmt) values  
    ('C2147',  '2021-01-01', 6.60869565217391),  
    ('C2147',  '2021-01-04', 6.60869565217391),  
    ('C2147',  '2021-01-11', 6.60869565217391),  
    ('C2147',  '2021-01-18', 6.60869565217391),  
    ('C2147',  '2021-01-25', 6.60869565217391),  
    ('C2147',  '2021-02-01', 6.60869565217391),  
    ('C2147',  '2021-02-08', 6.60869565217391),  
    ('C2147',  '2021-02-15', 6.60869565217391),  
    ('C2147',  '2021-02-22', 6.60869565217391),  
    ('C2147',  '2021-03-01', 6.60869565217391),  
    ('C2147',  '2021-03-08', 6.60869565217391),  
    ('C2147',  '2021-03-15', 6.60869565217391),  
    ('C2147',  '2021-03-22', 6.60869565217391),  
    ('C2147',  '2021-03-29', 6.60869565217391),  
    ('C2147',  '2021-04-05', 6.60869565217391),  
    ('C2147',  '2021-04-12', 6.60869565217391),  
    ('C2147',  '2021-04-19', 6.60869565217391),  
    ('C2147',  '2021-04-26', 6.60869565217391),  
    ('C2147',  '2021-05-03', 6.60869565217391),  
    ('C2147',  '2021-05-10', 6.60869565217391),  
    ('C2147',  '2021-05-17', 6.60869565217391),  
    ('C2147',  '2021-05-24', 6.60869565217391),  
    ('C2147',  '2021-05-31', 6.60869565217391),  
      
    ('Z0000',  '2020-01-01', 9.8),  
    ('Z0000',  '2020-01-04', 9.8),  
    ('Z0000',  '2020-01-11', 9.8),  
    ('Z0000',  '2020-01-18', 9.8),  
    ('Z0000',  '2020-01-25', 9.8),  
    ('Z0000',  '2020-02-01', 9.8),  
    ('Z0000',  '2020-02-08', 9.8),  
    ('Z0000',  '2020-02-15', 9.8),  
    ('Z0000',  '2020-02-22', 9.8),  
    ('Z0000',  '2020-03-01', 9.8),  
    ('Z0000',  '2020-03-08', 9.8),  
    ('Z0000',  '2020-03-15', 9.8),  
    ('Z0000',  '2020-03-22', 9.8),  
    ('Z0000',  '2020-03-29', 9.8),  
    ('Z0000',  '2020-04-05', 9.8),  
    ('Z0000',  '2020-04-12', 9.8),  
    ('Z0000',  '2020-04-19', 9.8),  
    ('Z0000',  '2020-04-26', 9.8),  
    ('Z0000',  '2020-05-03', 9.8),  
    ('Z0000',  '2020-05-10', 9.8),  
    ('Z0000',  '2020-05-17', 9.8),  
    ('Z0000',  '2020-05-24', 9.8),  
    ('Z0000',  '2020-05-31', 9.8)  
      
    ---      
      
    declare @mindate as date = (select min(ProjDate) from @d)  
      
    ;  
    with P as  
    (  
        select ID, min(ProjDate) as mindate, sum(ProjAmt) as sumamt  
        from @D  
        group by ID  
    ),  
    X as  
    (  
        select d.*, datediff(day, mindate, ProjDate) x  
        from @D d   
        inner join P on P.ID = d.ID  
    ),  
    AC as  
    (  
        select ID, cast(min(x) as float) as a, cast(max(x) as float) as c  
        from X  
        group by ID  
    ),  
    B as  
    (  
        select distinct ID, cast(percentile_cont( 0.75 ) within group (order by x) over (partition by ID) as float) as b  
        from X  
    ),  
    R as  
    (  
        select X.ID, X.ProjDate, X.ProjAmt,  
            power(x - a, alpha - 1 ) * power(c - x, beta - 1) /   
                    (dbo.BetaApprox(alpha, beta) * power(c - a, alpha + beta - 1)) as val  
        from X  
        inner join AC on AC.ID = X.ID  
        inner join B on B.ID = X.ID  
        cross apply (values (1 + 4 * (b - a) / (c - a), 1 + 4 * (c - b) / (c - a))) t1(alpha, beta)  
    ),  
    S as  
    (  
        select ID, sum(val) as sumval  
        from R  
        group by ID  
    ),  
    F as  
    (  
        select R.ID, R.ProjDate, R.ProjAmt, val * sumamt / sumval as NewProjAmt, sumamt  
        from R  
        inner join P on P.ID = R.ID  
        inner join S on S.ID = R.ID  
    )  
    select F.ID, F.ProjDate, F.ProjAmt as OldProjAmt, NewProjAmt, replicate( '*', NewProjAmt), sumamt as OldSum, sum(NewProjAmt) over (partition by ID) as NewSum  
    from F  
    order by ID, ProjDate  
    

    This experiment uses PERCENTILE_CONT. Maybe you can also use ‘b = a + (c - a) * 0.75’ instead.

    0 comments No comments

3 additional answers

Sort by: Most helpful
  1. HowardPaggerton 86 Reputation points
    2021-10-12T10:31:40.483+00:00

    I just wanted to add a minor amendment to this part of code (works well - thanks!):

     declare @mindate as date = (select min(ProjDate) from @d)
    
     ;
     with E1 as
     (
         select *
         from @D
         cross apply (values (datediff(day, @mindate, ProjDate))) t1(x)
     ),
     E2 as
     (
         select cast(min(x) as float) as a, cast(max(x) as float) as c
         from E1
     ),
     E3 as
     (
         select distinct E2.*, cast(percentile_disc( 0.75 ) within group (order by x) over () as float) as b
         from E1, E2
     ),
     E4 as
     (
         select *,
             power(x - a, alpha - 1 ) * power(c - x, beta - 1) / 
                 (dbo.BetaApprox(alpha, beta) * power(c - a, alpha + beta - 1)) as val
         from E1, E3
         cross apply (values (1 + 4 * (b - a) / (c - a), 1 + 4 * (c - b) / (c - a))) t1(alpha, beta)
     )
     select ID, ProjDate, val * 100 / max(val) over (partition by null) as ProjAmt, replicate('*', val * 1000)
     from E4
     order by ProjDate
    

    BTW it's a bit strange how it coincides with some numpy sqrt errors in Python...

    2 people found this answer helpful.
    0 comments No comments

  2. Viorel 122.6K Reputation points
    2021-02-11T22:08:30.663+00:00

    Check some experimental scripts:

    declare @D table (ID varchar(7), ProjDate Date, ProjAmt numeric(19,5))
    
    insert Into @D(ID, ProjDate, ProjAmt) values
    ('C2147',  '2021-01-01', 6.60869565217391),
    ('C2147',  '2021-01-04', 6.60869565217391),
    ('C2147',  '2021-01-11', 6.60869565217391),
    ('C2147',  '2021-01-18', 6.60869565217391),
    ('C2147',  '2021-01-25', 6.60869565217391),
    ('C2147',  '2021-02-01', 6.60869565217391),
    ('C2147',  '2021-02-08', 6.60869565217391),
    ('C2147',  '2021-02-15', 6.60869565217391),
    ('C2147',  '2021-02-22', 6.60869565217391),
    ('C2147',  '2021-03-01', 6.60869565217391),
    ('C2147',  '2021-03-08', 6.60869565217391),
    ('C2147',  '2021-03-15', 6.60869565217391),
    ('C2147',  '2021-03-22', 6.60869565217391),
    ('C2147',  '2021-03-29', 6.60869565217391),
    ('C2147',  '2021-04-05', 6.60869565217391),
    ('C2147',  '2021-04-12', 6.60869565217391),
    ('C2147',  '2021-04-19', 6.60869565217391),
    ('C2147',  '2021-04-26', 6.60869565217391),
    ('C2147',  '2021-05-03', 6.60869565217391),
    ('C2147',  '2021-05-10', 6.60869565217391),
    ('C2147',  '2021-05-17', 6.60869565217391),
    ('C2147',  '2021-05-24', 6.60869565217391),
    ('C2147',  '2021-05-31', 6.60869565217391)
    
    ---    
    
    declare @mindate as date = (select min(ProjDate) from @d)
    
    ;
    with E1 as
    (
        select *
        from @D
        cross apply (values (datediff(day, @mindate, ProjDate))) t1(x)
    ),
    E2 as
    (
        select cast(min(x) as float) as a, cast(max(x) as float) as c
        from E1
    ),
    E3 as
    (
        select distinct E2.*, cast(percentile_disc( 0.75 ) within group (order by x) over () as float) as b
        from E1, E2
    ),
    E4 as
    (
        select *,
            power(x - a, alpha - 1 ) * power(c - x, beta - 1) / 
                (dbo.BetaApprox(alpha, beta) * power(c - a, alpha + beta - 1)) as val
        from E1, E3
        cross apply (values (1 + 4 * (b - a) / (c - a), 1 + 4 * (c - b) / (c - a))) t1(alpha, beta)
    )
    select ID, ProjDate, val * 100 / max(val) over (partition by null) as ProjAmt, replicate('*', val * 1000)
    from E4
    order by ProjDate
    

    Also add a helper function:

    create or alter function dbo.BetaApprox( @x float, @y float ) 
    returns float
    as begin
        return sqrt(2 * pi()) * power(@x, @x - 0.5) * power(@y, @y - 0.5) / 
            power(@x + @y, @x + @y - 0.5)
    end
    

    It generates a distribution using values between 0 and 100 that looks like a Bell Curve. The last column illustrates it.

    /*
    ID      ProjDate   ProjAmt                
    -----------------------------------------------------------------------------
    C2147   2021-01-01 0                      
    C2147   2021-01-04 0,00531370950663901    
    C2147   2021-01-11 0,203758600080973      
    C2147   2021-01-18 0,988632019136461      
    C2147   2021-01-25 2,70638824348182       
    C2147   2021-02-01 5,62448556581808       
    C2147   2021-02-08 9,92360872580418       *
    C2147   2021-02-15 15,6924770761584       **
    C2147   2021-02-22 22,9236818630141       ***
    C2147   2021-03-01 31,5099647059236       ****
    C2147   2021-03-08 41,2406198529949       ******
    C2147   2021-03-15 51,7977898587582       *******
    C2147   2021-03-22 62,7524344869001       *********
    C2147   2021-03-29 73,5597089766708       ***********
    C2147   2021-04-05 83,5533802415433       ************
    C2147   2021-04-12 91,9386951773832       *************
    C2147   2021-04-19 97,7826874920967       **************
    C2147   2021-04-26 100                    ***************
    C2147   2021-05-03 97,330157206678        **************
    C2147   2021-05-10 88,2963805916995       *************
    C2147   2021-05-17 71,1162094744645       **********
    C2147   2021-05-24 43,4361976275036       ******
    C2147   2021-05-31 0                      
    */
    

  3. vsslasd 556 Reputation points
    2021-02-12T20:26:51.457+00:00

    Thank you - the totals all add up which is great...

    I will open another question on this topic , which pertains to the density or steepness of the Bell curve - so that can be adjusted too.

    Thank you so very much !

    0 comments No comments

Your answer

Answers can be marked as Accepted Answers by the question author, which helps users to know the answer solved the author's problem.