Splitting into groups and finding averages of an array of numbers

Anonymous

I have an array of numbers have values that range from 0-50000 and I would like to find a function or way to split them up into 10 groups where their values are most similar. The number of items that can be in any group must be equal to or greater than 1. After that, I would like to find the averages for these groups.

Any suggestions?

Locked Question. This question was migrated from the Microsoft Support Community. You can vote on whether it's helpful, but you can't add comments or replies or follow the question.

0 comments

Answer accepted by question author

Anonymous

Errata.... I wrote:

Maverik wrote:

I have an array of numbers have values that range from 0-50000 and I would like to find a function or way to split them up into 10 groups where their values are most similar.

[....]

Suppose your numbers are in A1:A1000. Put the following formula into B1 and copy down through B10:

=PERCENTILE(A:A,ROWS($A$1:A1)/10)

That establishes the values for the 10%ile, 20%ile, etc.

Although that does ensure at least 1 number in each percentile, it does not reliably produce reasonable results if there is a "significant number" of repeated numbers. What is "significant"? It's a lot like "most similar": it is difficult to define rigorously.

Case in point.... I have a list of more than 1800 numbers ranging from 1 to 1554; measurements of gaps in a series of numbers. I used the above paradigm to look at the distribution. The results are correct; but they surprised me. The 10th through 70th percentile are the same number (1); and the 80th and 90th percentile are another number (2).

Clearly that kind of "grouping" would not satisfy Maverik's requirements, no matter how he defines "most similar".

As it turns out, there are only 9 distinct numbers in my list. 79% are 1; about 15.6% are 2; less than 0.05% are 3 and 4; and there is one each of 5, 43, 63, 288 and 1554.

Once I realized that lopsided distribution, obviously the 9-cell array formula =FREQUENCY(A:A,{1,2,3,4,5,43,63,288,1554}) produced the more interesting "groupings".

I think that is an unusual example. Nevertheless, as always, it is best to understand the distribution before applying any method of partitioning.

0 comments

16 additional answers

Anonymous

2011-10-14T23:51:01+00:00

PS.... I wrote:

I think there are two gotchas in your requirements: (a) what you consider "most similar" values; and (b) there must be at least 1 in each group.

One approach which satisfies #b, but not necessarily #a (because you have not defined the term) is as follows....

Suppose your numbers are in A1:A1000. Put the following formula into B1 and copy down through B10:

=PERCENTILE(A:A,ROWS($A$1:A1)/10)

That establishes the values for the 10%ile, 20%ile, etc.

While that ensures nearly the same number in each group, it does not ensure "similarity" as I said. But the percentile groupings might be a good starting point. IMHO, you would need a VBA procedure to group based on your definition of "similarity", whatever that is.

When I try to visualize the process, I would shuffle numbers between adjacent originally-percentile groups (they are no most so after we start shuffling) until the relative standard deviation of each group is ideally small, but if not small, then as close as possible.

This does not sound like an easy thing to do programatically.
Was this answer helpful?

0 comments No comments
Anonymous

2011-10-14T22:49:20+00:00

Maverik wrote:

I have an array of numbers have values that range from 0-50000 and I would like to find a function or way to split them up into 10 groups where their values are most similar. The number of items that can be in any group must be equal to or greater than 1. After that, I would like to find the averages for these groups. Any suggestions?

I think there are two gotchas in your requirements: (a) what you consider "most similar" values; and (b) there must be at least 1 in each group.

One approach which satisfies #b, but not necessarily #a (because you have not defined the term) is as follows....

Suppose your numbers are in A1:A1000. Put the following formula into B1 and copy down through B10:

=PERCENTILE(A:A,ROWS($A$1:A1)/10)

That establishes the values for the 10%ile, 20%ile, etc.

[EDIT] Caveat: This assumes that there are at least 10 numbers in column A. If there are not, you cannot achieve both of your objectives.

Then put the following formula into C1 and C2, and copy C2 down through C10:

C1: AVERAGEIF(A:A,"<="&B1)

C2: AVERAGEIFS(A:A,A:A,">"&B1,A:A,"<="&B2)

[Errata thanks to Ron Rosenfeld.]

Note: I do not have XL2010, so I cannot test those formulas. If A:A does not work, substitute A1:A1000. If the formulas still do not work or if you want formulas compatible with XL2003, use the following array formulas [*]:

C1: =AVERAGE(IF(A1:A1000<=B1,A1:A1000))

C2: =AVERAGE(IF((A1:A1000>B1)*(A1:A1000<=B2),A1:A1000))

[*] Enter an array formula by pressing ctrl+shift+Enter instead of Enter. Excel will display an array formula surrounded by curly braces in the Formula Bar, i.e. {=formula}. You cannot type the curly braces yourself. If you make a mistake, select the cell, press F2 and edit, then press ctrl+shift+Enter.

You can use the following array formulas (press ctrl+shift+Enter) to compute the relative standard deviation, which might be a measure of "similarity":

D1: =STDEVP(IF(A1:A1000<=B1,A1:A1000))/C1

D2: =STDEVP(IF((A1:A1000>B1)*(A1:A1000<=B2),A1:A1000))/C2

Copy D2 down through D10. Format D1:D10 as Percentage.

Another measure of "similarity" might be the min and max of each group. They can be computed with the following array formulas (press ctrl+shift+Enter):

E1: =MIN(IF(A1:A1000<=B1,A1:A1000))

E2: =MIN(IF((A1:A1000>B1)*(A1:A1000<=B2),A1:A1000))

F1: =MAX(IF(A1:A1000<=B1,A1:A1000))

F2: =MAX(IF((A1:A1000>B1)*(A1:A1000<=B2),A1:A1000))

Copy E2:F2 down through E10:F10.
Was this answer helpful?

0 comments No comments
Anonymous

2011-10-14T19:06:54+00:00

I have about 400 numbers that are between the values of 0 and 50000, but not evenly or even in a particular order. i.e. (135, 137, 152, 3290, 3331, 12500, 12356, etc....)

By most similar I mean to say that since 135-152 are very close numbers, I'd like them to go into 1 group. Since 3290 and 3331 are similar numbers, they go into another group, etc... I don't know how many will make it into each group, but I'd like to end up with 10 groups total.

I'm not sure how this frequency function works. I don't understand what the bins are.

Thanks for the help.
Was this answer helpful?

0 comments No comments
Anonymous

2011-10-14T13:53:10+00:00

How do you define "most similar"?

You could use the FREQUENCY function and define bins where each bin represents a certain range of values, say 10% points.

e.g. =FREQUENCY($a$1:$a$50000,{5000,10000,15000,20000,25000,30000,35000,40000,45000,50000})

This gets entered as an array into ten cells in a column. It will return the number of values in each bin.
Was this answer helpful?

0 comments No comments

Share via

Splitting into groups and finding averages of an array of numbers

16 additional answers