Categorizing a power law distribution of user participation in an online community (non spss question)


whatsinaname
I have data on participation in a large-scale online community.  Participation follows a power law distribution: a small fraction of the users is responsible for the majority of posts, i.e. a few people post a LOT while a lot of people don’t post much at all.

My question is how to segment this population to tease apart differences between high, low, and 'in between' usage?  Splitting users into groups at equal percentiles does not seem appropriate.  I have not come across an established method for this kind of segmentation.

Thoughts?

Thanks in advance!

Re: Categorizing a power law distribution of user participation in an online community (non spss question)

David Marso
Administrator
The first thing I would do is create a histogram and see if there are obvious clumps.
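
A minimal sketch of the histogram idea in Python rather than SPSS (not from the thread). With power-law data an ordinary histogram puts nearly everyone in the first bar, so this version assumes doubling bins (1, 2-3, 4-7, ...) to make any clumps visible. The `posts` values and the function name `log2_histogram` are hypothetical.

```python
from collections import Counter

def log2_histogram(post_counts):
    """Count users per doubling bin: 1, 2-3, 4-7, 8-15, ..."""
    bins = Counter()
    for n in post_counts:
        if n < 1:
            continue  # users with zero posts need a category of their own
        bins[n.bit_length() - 1] += 1  # floor(log2(n)) for a positive int
    return dict(sorted(bins.items()))

# Hypothetical posts-per-user values, for illustration only.
posts = [1, 1, 1, 2, 2, 3, 5, 8, 13, 40, 250]

for k, users in log2_histogram(posts).items():
    low, high = 2 ** k, 2 ** (k + 1) - 1
    print(f"{low:>4}-{high:<4} {'#' * users}")
```

Empty bins between populated ones (here 16-31 and 64-127) are the kind of gap that might suggest a natural cut point.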

Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"

Re: Categorizing a power law distribution of user participation in an online community (non spss question)

John F Hall
Alternatively, make an arbitrary decision as to what counts as High and what counts as Low, then use something like:

RECODE <varname> (LO THRU <1st cut point> = 1) (<1st cut point> THRU <2nd cut point> = 2) (<2nd cut point> THRU HI = 3) INTO TESTVAR.

FREQ TESTVAR.

If you have missing values, you'll need to replace LO with the lowest valid value and HI with the highest valid value, and add (ELSE = SYSMIS) to the RECODE command.
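
For readers working outside SPSS, the same three-way split can be sketched in Python. This is only an illustration: the cut points 5 and 50 are arbitrary assumptions, and `categorize` is a hypothetical helper. It uses `bisect_left` so that a value exactly equal to a cut point falls in the lower group, which is how I understand RECODE to treat a value that matches two specifications (the first one wins).

```python
from bisect import bisect_left

def categorize(n_posts, cut_points=(5, 50)):
    """Return 1 (low), 2 (middle) or 3 (high); None stays missing."""
    if n_posts is None:
        return None
    # bisect_left counts how many cut points lie strictly below n_posts.
    return bisect_left(cut_points, n_posts) + 1

# Hypothetical posts-per-user values, including one missing value.
posts = [1, 5, 6, 50, 51, 250, None]
print([categorize(n) for n in posts])
```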

Forget statistics, make sociological sense first.

Advice from a dyed-in-the-wool Old Dog survey researcher.

John Hall

John F Hall (Mr)

Email:    [hidden email]
Website: www.surveyresearch.weebly.com

PS  Have a look at "Cyberchiefs", a book by Mathieu O'Neil (Pluto Press, 2009) about democracy and the formation of hierarchies in social networks.

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of David Marso
Sent: 19 May 2012 15:55
To: [hidden email]
Subject: Re: Categorizing a power law distribution of user participation in an online community (non spss question)

The first thing I would do is create a histogram and see if there are obvious clumps.



--
View this message in context: http://spssx-discussion.1045642.n5.nabble.com/Categorizing-a-power-law-distribution-of-user-participation-in-an-online-community-non-spss-question-tp5712286p5712325.html
Sent from the SPSSX Discussion mailing list archive at Nabble.com.

=====================
To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD


Re: Categorizing a power law distribution of user participation in an online community (non spss question)

Bruce Weaver
Administrator
In reply to this post by David Marso
You can also use the Chart Editor to display different types of distribution curves on a histogram.

http://publib.boulder.ibm.com/infocenter/spssstat/v20r0m0/index.jsp?topic=%2Fcom.ibm.spss.statistics.help%2Fidh_webhelp_distribution_palette.htm

HTH.


--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/

"When all else fails, RTFM."

PLEASE NOTE THE FOLLOWING: 
1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above.
2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/).

Re: Categorizing a power law distribution of user participation in an online community (non spss question)

Jon K Peck
In reply to this post by John F Hall
Do you have predictors?  What are you going to do with this distribution?

If you want to just fit a power law sort of distribution, try a q-q plot with, say, Pareto, as the distribution.  Analyze > Descriptive Statistics > Q-Q plots or PPLOT ... /TYPE = Q-Q.
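
Outside SPSS, the numbers behind a Pareto Q-Q plot can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not what PPLOT does internally: it treats post counts as continuous, fixes the minimum at `x_min = 1`, estimates the shape parameter by maximum likelihood, and uses (i - 0.5)/n plotting positions. The function name `pareto_qq` and the data are hypothetical.

```python
import math

def pareto_qq(values, x_min=1.0):
    """Return (alpha, pairs) where pairs are (theoretical, observed) quantiles."""
    xs = sorted(v for v in values if v >= x_min)
    n = len(xs)
    # Maximum-likelihood estimate of the Pareto shape parameter.
    alpha = n / sum(math.log(x / x_min) for x in xs)
    pairs = []
    for i, x in enumerate(xs, start=1):
        p = (i - 0.5) / n                        # plotting position
        q = x_min * (1.0 - p) ** (-1.0 / alpha)  # Pareto quantile function
        pairs.append((q, x))
    return alpha, pairs

# Hypothetical posts-per-user values, for illustration only.
posts = [1, 1, 1, 2, 2, 3, 5, 8, 13, 40, 250]
alpha, pairs = pareto_qq(posts)
print(f"estimated shape: {alpha:.2f}")
for q, x in pairs:
    print(f"{q:10.2f} {x:6d}")
```

If the data are roughly Pareto, the pairs should fall near a straight line through the origin when plotted, ideally on log-log axes.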

If you want to segment interactively, try Transform > Visual Binning, or, especially if you have a target, Transform > Optimal Binning.

HTH

Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
new phone: 720-342-5621






Re: Categorizing a power law distribution of user participation in an online community (non spss question)

Art Kendall
In reply to this post by Bruce Weaver
You can also work with Visual Binning.

You can recode your variable into a new variable. Try to avoid using (... = SYSMIS) whenever possible: if you, the user, decide that a value should be missing, give it a user-missing value instead.
Art Kendall
Social Research Consultants

On 5/19/2012 10:24 AM, Bruce Weaver wrote:
You can also use the Chart Editor to display different types of distribution
curves on a histogram.

http://publib.boulder.ibm.com/infocenter/spssstat/v20r0m0/index.jsp?topic=%2Fcom.ibm.spss.statistics.help%2Fidh_webhelp_distribution_palette.htm

HTH.




Re: Categorizing a power law distribution of user participation in an online community (non spss question)

Rich Ulrich
In reply to this post by whatsinaname
As Jon asks, What is your purpose?  ... "teasing apart
differences..." only says a little bit. 

When I was watching over various aspects of computer
usage on mainframes in the 1980s, it was useful to me to list
some raw data -- ID and relevant related information -- for
the top 10 or 20 users in a category, in descending order.
My purpose was related to "managing a scarce resource."

The Pareto-curve descriptions were useful, saying (for instance)
90% of the consumption was by 10% of the people.  Or "by 3 people". 
It is popular to use reciprocal fractions, like 90/10 or 80/20, and
it is also popular to use rounded-off cut-offs, like "the top 1%" and
"the top 10%", when those fractions account for large amounts
of the resource in question. 
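
A Pareto-style summary like "the top 10% account for X% of the posts" is easy to compute directly. The sketch below is illustrative Python with made-up numbers; `top_share` is a hypothetical helper, and rounding the group size to at least one user is my own assumption.

```python
def top_share(post_counts, top_fraction):
    """Share of all posts written by the top `top_fraction` of users."""
    ranked = sorted(post_counts, reverse=True)
    k = max(1, round(len(ranked) * top_fraction))  # at least one user
    return sum(ranked[:k]) / sum(ranked)

# Hypothetical posts-per-user values, for illustration only.
posts = [250, 40, 13, 8, 5, 3, 2, 2, 1, 1]

for fraction in (0.1, 0.2, 0.5):
    share = top_share(posts, fraction)
    print(f"top {fraction:.0%} of users wrote {share:.1%} of the posts")
```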

However, you are referring to e-mails/posts, so that is not a
limited resource.

It does make sense to lump together the top 2 users if they are
similar in profile, or the same user under two names, but that
is while still thinking of using up a resource.  Is that sort of
reduction useful for your data summary?

Whatever the subject, the more the aggregated folks differ, the
less sense it makes to "lump" some fraction of them together.  But
what "differences" or what characteristics are going to be relevant
to you?  - That takes us back to the question, What is your purpose?

--
Rich Ulrich

> Date: Sat, 19 May 2012 04:41:58 -0700

> From: [hidden email]
> Subject: Categorizing a power law distribution of user participation in an online community (non spss question)
> To: [hidden email]

Re: Categorizing a power law distribution of user participation in an online community (non spss question)

whatsinaname
Thanks for all the helpful replies.... much appreciated!

I am trying to create a model where 'degree of participation in online community' is one of many predictors of an outcome such as performance.  I have a measure of performance; I just need a better understanding of how to model participation.

Re: Categorizing a power law distribution of user participation in an online community (non spss question)

Rich Ulrich
Taking what is "meaningful" - I would say that participating
in an average of one "thread" per month is a fairly high
level of participation.  The number of threads can be more
salient than the number of posts, especially if folks do create
new Subject: lines as needed, stick to the topic, and don't
often break one topic into multiple threads.

Creating a new thread can be different from Replying.

This assumes that your data and software can readily
define a thread.
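
If the data do identify threads, counting distinct threads per user instead of posts is straightforward. The Python sketch below assumes each post is available as a (user_id, thread_id) pair and that the observation window length in months is known; both the record layout and the name `threads_per_month` are hypothetical.

```python
from collections import defaultdict

def threads_per_month(posts, n_months):
    """Average number of distinct threads per month for each user."""
    threads = defaultdict(set)
    for user_id, thread_id in posts:
        threads[user_id].add(thread_id)  # repeat posts in a thread count once
    return {user: len(ids) / n_months for user, ids in threads.items()}

# Hypothetical post records over a two-month window.
posts = [("ann", "t1"), ("ann", "t1"), ("ann", "t2"), ("bob", "t1")]
print(threads_per_month(posts, n_months=2))
```

Here "ann" wrote three posts but joined only two threads, so her rate is one thread per month.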

For people with the same average, regular participation
is a different commitment from sporadic.  But that might not
be easy to disentangle.

--
Rich Ulrich

> Date: Sat, 19 May 2012 12:37:31 -0700

> From: [hidden email]
> Subject: Re: Categorizing a power law distribution of user participation in an online community (non spss question)
> To: [hidden email]

Re: Categorizing a power law distribution of user participation in an online community (non spss question)

Maguin, Eugene
In reply to this post by whatsinaname
I know you asked about categorizing but I wonder if an alternative, possibly useful, variable would be the log of the posting frequency.
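
As a small illustration of the log idea in Python: the sketch below uses log10(posts + 1). Adding 1 is my own assumption so that users with zero posts get a defined value (0); the post above only says "log of the posting frequency". The name `log_posts` is hypothetical.

```python
import math

def log_posts(n_posts):
    """log10 of (posts + 1); the +1 keeps zero-post users defined."""
    return math.log10(n_posts + 1)

# Hypothetical posts-per-user values spanning several orders of magnitude.
for n in (0, 9, 99, 999):
    print(f"{n:>4} posts -> {log_posts(n):.1f}")
```

Each tenfold increase in posting adds about one unit, so the few very heavy posters no longer dominate the scale.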

Gene Maguin

Re: Categorizing a power law distribution of user participation in an online community (non spss question)

whatsinaname
Gene, can you tell me more about why you think log of posting frequency would be a good variable?

Re: Categorizing a power law distribution of user participation in an online community (non spss question)

Bruce Weaver
Administrator
What Gene is getting at, I suspect, is that it's usually better to analyze continuous variables as continuous rather than carving them into categories.  There are lots of articles that address this issue, including this very readable one by Dave Streiner:

   http://ww1.cpa-apc.org/publications/archives/cjp/2002/april/researchMethodsDichotomizingData.asp

HTH.



Re: Categorizing a power law distribution of user participation in an online community (non spss question)

Maguin, Eugene
In reply to this post by whatsinaname
As Bruce said, that is the reason. As part of that reason, with a DV like log frequency, you can look at the linearity of the relationship between the predictors and changes in the DV.

Gene Maguin

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of whatsinaname
Sent: Sunday, May 20, 2012 3:23 PM
To: [hidden email]
Subject: Re: Categorizing a power law distribution of user participation in an online community (non spss question)
