Prep Data formula for a Bell Curve

Question

Prep Data formula for a Bell Curve

vsslasd 556

I have some data such as this:

Declare @D table (ID varchar(7), ProjDate Date, ProjAmt numeric(19,5))  
  
Insert Into @D(ID, ProjDate, ProjAmt)  
  
VALUES  
  
('C2147',  '2021-01-01', 6.60869565217391),  
('C2147',  '2021-01-04', 6.60869565217391),  
('C2147',  '2021-01-11', 6.60869565217391),  
('C2147',  '2021-01-18', 6.60869565217391),  
('C2147',  '2021-01-25', 6.60869565217391),  
('C2147',  '2021-02-01', 6.60869565217391),  
('C2147',  '2021-02-08', 6.60869565217391),  
('C2147',  '2021-02-15', 6.60869565217391),  
('C2147',  '2021-02-22', 6.60869565217391),  
('C2147',  '2021-03-01', 6.60869565217391),  
('C2147',  '2021-03-08', 6.60869565217391),  
('C2147',  '2021-03-15', 6.60869565217391),  
('C2147',  '2021-03-22', 6.60869565217391),  
('C2147',  '2021-03-29', 6.60869565217391),  
('C2147',  '2021-04-05', 6.60869565217391),  
('C2147',  '2021-04-12', 6.60869565217391),  
('C2147',  '2021-04-19', 6.60869565217391),  
('C2147',  '2021-04-26', 6.60869565217391),  
('C2147',  '2021-05-03', 6.60869565217391),  
('C2147',  '2021-05-10', 6.60869565217391),  
('C2147',  '2021-05-17', 6.60869565217391),  
('C2147',  '2021-05-24', 6.60869565217391),  
('C2147',  '2021-05-31', 6.60869565217391)  
  
Select * from @D

I'd like to prep this data for a bell curve, and not certain about the formula, the Bell Curve weight should peak at 75% of the way during the timeline, not the exact center.

This needs to be calculated based across the Id field, and the ProjAmt Field should start and end at zero, and have the predominate "bell" project amount around the 4/5/21 timeframe (the approximate 75% timeline factor).

What would be the formula in TSQL to convert this data appropriately ?

I'm trying to convert the flat line data noted above so that the bulk of the data appears 75% of the way into the timeline and the starting and ending points are zero, similar to the example here in this spreadsheet. I believe the formula would pertain to something similar to a Bell Curve formula the exact results aren't in green - it is the concept: weight most of the period amounts into the 75% portion of the timeline. How would I do this ?

EchoLiu-MSFT 14,621 Reputation points

2021-02-11T07:45:02.4+00:00

Hi @vsslasd ,

Sorry, I don’t understand your needs very well. Could you please provide a visualized expected output?

Regards
Echo
vsslasd 556 Reputation points

2021-02-11T17:14:37.267+00:00

Hello EchoLiu

I'm trying to convert the flat line data noted above so that the bulk of the data appears 75% of the way into the timeline and the starting and ending points are zero, similar to the example here in this spreadsheet. I believe the formula would pertain to something similar to a Bell Curve formula the exact results aren't in green - it is the concept: weight most of the period amounts into the 75% portion of the timeline. How would I do this ?
vsslasd 556 Reputation points

2021-02-11T17:15:42.437+00:00

Hello EchoLiu

I'm trying to convert the flat line data noted above so that the bulk of the data appears 75% of the way into the timeline and the starting and ending points are zero, similar to the example here in this spreadsheet. I believe the formula would pertain to something similar to a Bell Curve formula the exact results aren't in green - it is the concept: weight most of the period amounts into the 75% portion of the timeline. How would I do this ?
Stefan Hoffmann 621 Reputation points

2021-02-11T17:16:59.303+00:00

To fit the amount of something into a bell curve, you need a probability distribution of it over a further variable. I only see timeline data with a constant.

So please rephrase your question.
vsslasd 556 Reputation points

2021-02-11T17:32:03.49+00:00

Thank you, Stefan.
I appreciate your help. To answer your question, we want to peak the data Amount at 80% when the data timeline is 75% of the way through start to end.

To put this more in layperson's terms, We start a project on 2/1 but work doesn't actually start on this particular phase until we get to week period which is 75% into the project. We need to ramp up work gradually until that week and then we want 80% of the work to be performed within those few weeks which would constitute 30-35% of the project timeline.

Accepted answer

3 additional answers

Your answer

EchoLiu-MSFT 14,621 Reputation points

2021-02-11T07:45:02.4+00:00

Hi @vsslasd ,

Sorry, I don’t understand your needs very well. Could you please provide a visualized expected output?

Regards
Echo
vsslasd 556 Reputation points

2021-02-11T17:14:37.267+00:00

Hello EchoLiu

I'm trying to convert the flat line data noted above so that the bulk of the data appears 75% of the way into the timeline and the starting and ending points are zero, similar to the example here in this spreadsheet. I believe the formula would pertain to something similar to a Bell Curve formula the exact results aren't in green - it is the concept: weight most of the period amounts into the 75% portion of the timeline. How would I do this ?
vsslasd 556 Reputation points

2021-02-11T17:15:42.437+00:00

Hello EchoLiu

I'm trying to convert the flat line data noted above so that the bulk of the data appears 75% of the way into the timeline and the starting and ending points are zero, similar to the example here in this spreadsheet. I believe the formula would pertain to something similar to a Bell Curve formula the exact results aren't in green - it is the concept: weight most of the period amounts into the 75% portion of the timeline. How would I do this ?
Stefan Hoffmann 621 Reputation points

2021-02-11T17:16:59.303+00:00

To fit the amount of something into a bell curve, you need a probability distribution of it over a further variable. I only see timeline data with a constant.

So please rephrase your question.
vsslasd 556 Reputation points

2021-02-11T17:32:03.49+00:00

Thank you, Stefan.
I appreciate your help. To answer your question, we want to peak the data Amount at 80% when the data timeline is 75% of the way through start to end.

To put this more in layperson's terms, We start a project on 2/1 but work doesn't actually start on this particular phase until we get to week period which is 75% into the project. We need to ramp up work gradually until that week and then we want 80% of the work to be performed within those few weeks which would constitute 30-35% of the project timeline.

Answer 1

In case of multiple groups, try this script:

declare @D table (ID varchar(7), ProjDate Date, ProjAmt numeric(19,5))  
      
insert Into @D(ID, ProjDate, ProjAmt) values  
('C2147',  '2021-01-01', 6.60869565217391),  
('C2147',  '2021-01-04', 6.60869565217391),  
('C2147',  '2021-01-11', 6.60869565217391),  
('C2147',  '2021-01-18', 6.60869565217391),  
('C2147',  '2021-01-25', 6.60869565217391),  
('C2147',  '2021-02-01', 6.60869565217391),  
('C2147',  '2021-02-08', 6.60869565217391),  
('C2147',  '2021-02-15', 6.60869565217391),  
('C2147',  '2021-02-22', 6.60869565217391),  
('C2147',  '2021-03-01', 6.60869565217391),  
('C2147',  '2021-03-08', 6.60869565217391),  
('C2147',  '2021-03-15', 6.60869565217391),  
('C2147',  '2021-03-22', 6.60869565217391),  
('C2147',  '2021-03-29', 6.60869565217391),  
('C2147',  '2021-04-05', 6.60869565217391),  
('C2147',  '2021-04-12', 6.60869565217391),  
('C2147',  '2021-04-19', 6.60869565217391),  
('C2147',  '2021-04-26', 6.60869565217391),  
('C2147',  '2021-05-03', 6.60869565217391),  
('C2147',  '2021-05-10', 6.60869565217391),  
('C2147',  '2021-05-17', 6.60869565217391),  
('C2147',  '2021-05-24', 6.60869565217391),  
('C2147',  '2021-05-31', 6.60869565217391),  
  
('Z0000',  '2020-01-01', 9.8),  
('Z0000',  '2020-01-04', 9.8),  
('Z0000',  '2020-01-11', 9.8),  
('Z0000',  '2020-01-18', 9.8),  
('Z0000',  '2020-01-25', 9.8),  
('Z0000',  '2020-02-01', 9.8),  
('Z0000',  '2020-02-08', 9.8),  
('Z0000',  '2020-02-15', 9.8),  
('Z0000',  '2020-02-22', 9.8),  
('Z0000',  '2020-03-01', 9.8),  
('Z0000',  '2020-03-08', 9.8),  
('Z0000',  '2020-03-15', 9.8),  
('Z0000',  '2020-03-22', 9.8),  
('Z0000',  '2020-03-29', 9.8),  
('Z0000',  '2020-04-05', 9.8),  
('Z0000',  '2020-04-12', 9.8),  
('Z0000',  '2020-04-19', 9.8),  
('Z0000',  '2020-04-26', 9.8),  
('Z0000',  '2020-05-03', 9.8),  
('Z0000',  '2020-05-10', 9.8),  
('Z0000',  '2020-05-17', 9.8),  
('Z0000',  '2020-05-24', 9.8),  
('Z0000',  '2020-05-31', 9.8)  
  
---      
  
declare @mindate as date = (select min(ProjDate) from @d)  
  
;  
with P as  
(  
    select ID, min(ProjDate) as mindate, sum(ProjAmt) as sumamt  
    from @D  
    group by ID  
),  
X as  
(  
    select d.*, datediff(day, mindate, ProjDate) x  
    from @D d   
    inner join P on P.ID = d.ID  
),  
AC as  
(  
    select ID, cast(min(x) as float) as a, cast(max(x) as float) as c  
    from X  
    group by ID  
),  
B as  
(  
    select distinct ID, cast(percentile_cont( 0.75 ) within group (order by x) over (partition by ID) as float) as b  
    from X  
),  
R as  
(  
    select X.ID, X.ProjDate, X.ProjAmt,  
        power(x - a, alpha - 1 ) * power(c - x, beta - 1) /   
                (dbo.BetaApprox(alpha, beta) * power(c - a, alpha + beta - 1)) as val  
    from X  
    inner join AC on AC.ID = X.ID  
    inner join B on B.ID = X.ID  
    cross apply (values (1 + 4 * (b - a) / (c - a), 1 + 4 * (c - b) / (c - a))) t1(alpha, beta)  
),  
S as  
(  
    select ID, sum(val) as sumval  
    from R  
    group by ID  
),  
F as  
(  
    select R.ID, R.ProjDate, R.ProjAmt, val * sumamt / sumval as NewProjAmt, sumamt  
    from R  
    inner join P on P.ID = R.ID  
    inner join S on S.ID = R.ID  
)  
select F.ID, F.ProjDate, F.ProjAmt as OldProjAmt, NewProjAmt, replicate( '*', NewProjAmt), sumamt as OldSum, sum(NewProjAmt) over (partition by ID) as NewSum  
from F  
order by ID, ProjDate

This experiment uses PERCENTILE_CONT. Maybe you can also use ‘b = a + (c - a) * 0.75’ instead.

Answer 2

I just wanted to add a minor amendment to this part of code (works well - thanks!):

 declare @mindate as date = (select min(ProjDate) from @d)

 ;
 with E1 as
 (
     select *
     from @D
     cross apply (values (datediff(day, @mindate, ProjDate))) t1(x)
 ),
 E2 as
 (
     select cast(min(x) as float) as a, cast(max(x) as float) as c
     from E1
 ),
 E3 as
 (
     select distinct E2.*, cast(percentile_disc( 0.75 ) within group (order by x) over () as float) as b
     from E1, E2
 ),
 E4 as
 (
     select *,
         power(x - a, alpha - 1 ) * power(c - x, beta - 1) / 
             (dbo.BetaApprox(alpha, beta) * power(c - a, alpha + beta - 1)) as val
     from E1, E3
     cross apply (values (1 + 4 * (b - a) / (c - a), 1 + 4 * (c - b) / (c - a))) t1(alpha, beta)
 )
 select ID, ProjDate, val * 100 / max(val) over (partition by null) as ProjAmt, replicate('*', val * 1000)
 from E4
 order by ProjDate

BTW it's a bit strange how it coincides with some numpy sqrt errors in Python...

Answer 3

Check some experimental scripts:

declare @D table (ID varchar(7), ProjDate Date, ProjAmt numeric(19,5))

insert Into @D(ID, ProjDate, ProjAmt) values
('C2147',  '2021-01-01', 6.60869565217391),
('C2147',  '2021-01-04', 6.60869565217391),
('C2147',  '2021-01-11', 6.60869565217391),
('C2147',  '2021-01-18', 6.60869565217391),
('C2147',  '2021-01-25', 6.60869565217391),
('C2147',  '2021-02-01', 6.60869565217391),
('C2147',  '2021-02-08', 6.60869565217391),
('C2147',  '2021-02-15', 6.60869565217391),
('C2147',  '2021-02-22', 6.60869565217391),
('C2147',  '2021-03-01', 6.60869565217391),
('C2147',  '2021-03-08', 6.60869565217391),
('C2147',  '2021-03-15', 6.60869565217391),
('C2147',  '2021-03-22', 6.60869565217391),
('C2147',  '2021-03-29', 6.60869565217391),
('C2147',  '2021-04-05', 6.60869565217391),
('C2147',  '2021-04-12', 6.60869565217391),
('C2147',  '2021-04-19', 6.60869565217391),
('C2147',  '2021-04-26', 6.60869565217391),
('C2147',  '2021-05-03', 6.60869565217391),
('C2147',  '2021-05-10', 6.60869565217391),
('C2147',  '2021-05-17', 6.60869565217391),
('C2147',  '2021-05-24', 6.60869565217391),
('C2147',  '2021-05-31', 6.60869565217391)

---    

declare @mindate as date = (select min(ProjDate) from @d)

;
with E1 as
(
    select *
    from @D
    cross apply (values (datediff(day, @mindate, ProjDate))) t1(x)
),
E2 as
(
    select cast(min(x) as float) as a, cast(max(x) as float) as c
    from E1
),
E3 as
(
    select distinct E2.*, cast(percentile_disc( 0.75 ) within group (order by x) over () as float) as b
    from E1, E2
),
E4 as
(
    select *,
        power(x - a, alpha - 1 ) * power(c - x, beta - 1) / 
            (dbo.BetaApprox(alpha, beta) * power(c - a, alpha + beta - 1)) as val
    from E1, E3
    cross apply (values (1 + 4 * (b - a) / (c - a), 1 + 4 * (c - b) / (c - a))) t1(alpha, beta)
)
select ID, ProjDate, val * 100 / max(val) over (partition by null) as ProjAmt, replicate('*', val * 1000)
from E4
order by ProjDate

Also add a helper function:

create or alter function dbo.BetaApprox( @x float, @y float ) 
returns float
as begin
    return sqrt(2 * pi()) * power(@x, @x - 0.5) * power(@y, @y - 0.5) / 
        power(@x + @y, @x + @y - 0.5)
end

It generates a distribution using values between 0 and 100 that looks like a Bell Curve. The last column illustrates it.

/*
ID      ProjDate   ProjAmt                
-----------------------------------------------------------------------------
C2147   2021-01-01 0                      
C2147   2021-01-04 0,00531370950663901    
C2147   2021-01-11 0,203758600080973      
C2147   2021-01-18 0,988632019136461      
C2147   2021-01-25 2,70638824348182       
C2147   2021-02-01 5,62448556581808       
C2147   2021-02-08 9,92360872580418       *
C2147   2021-02-15 15,6924770761584       **
C2147   2021-02-22 22,9236818630141       ***
C2147   2021-03-01 31,5099647059236       ****
C2147   2021-03-08 41,2406198529949       ******
C2147   2021-03-15 51,7977898587582       *******
C2147   2021-03-22 62,7524344869001       *********
C2147   2021-03-29 73,5597089766708       ***********
C2147   2021-04-05 83,5533802415433       ************
C2147   2021-04-12 91,9386951773832       *************
C2147   2021-04-19 97,7826874920967       **************
C2147   2021-04-26 100                    ***************
C2147   2021-05-03 97,330157206678        **************
C2147   2021-05-10 88,2963805916995       *************
C2147   2021-05-17 71,1162094744645       **********
C2147   2021-05-24 43,4361976275036       ******
C2147   2021-05-31 0                      
*/

vsslasd 556 Reputation points

2021-02-11T23:51:11.113+00:00

Thank you, I think this is close....

A couple of questions first please:
I see where the 75% marker for the timeline can be dialed up or down in your code. But what part of your code pertains to the density of the bell curve (how steep or gradual) it is ?

Also: this works great for one Id where Id=C2147, but I have many different similar Ids, and need this curve to execute over each Id in the same fashion based upon it's own start/end date and amount. Where in the code would we account for this ? Partition ?

There is an issue with the total: The Sum of the Amount column must be equal to that which is in the table: 152
Which is 22* 6.60869565217391=152

Can you please advise ?
Thank you so very much !
Viorel 122.6K Reputation points

2021-02-12T10:19:22.063+00:00

The script that processes multiple groups of data will be sent separately.

Regarding the density, I am not sure I can answer the question. Probably you should consider more approaches. Maybe R language, available in SQL, can be used too.

Answer 4

vsslasd 556

Thank you - the totals all add up which is great...

I will open another question on this topic , which pertains to the density or steepness of the Bell curve - so that can be adjusted too.

Thank you so very much !

Share via

Prep Data formula for a Bell Curve

3 additional answers

Your answer