I have data on participation in a large-scale online community. Participation follows a power law distribution. This means that a small fraction of the users are responsible for the majority of posts, i.e. a few people post a LOT while a lot of people don’t post much at all.
My question is how to segment this population to tease apart differences between high, low, and 'in between' usage? Splitting users into groups at equal percentiles does not seem appropriate. I have not come across an established method for this kind of segmentation. Thoughts? Thanks in advance! |
Administrator
|
The first thing I would do is create a histogram and see if there are obvious clumps.
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me. --- "Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis." Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?" |
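With counts this skewed, an ordinary histogram tends to put almost everyone in the first bar, so one option is to bin on a log scale before looking for clumps. The following is only a sketch in Python, assuming integer post counts; the function name `log2_histogram` and the sample data are made up for illustration and are not from the thread.

```python
from collections import Counter

def log2_histogram(posts_per_user):
    """Count users in power-of-two bins: 0, 1, 2-3, 4-7, 8-15, ..."""
    tally = Counter()
    for n in posts_per_user:
        # bit_length() - 1 equals floor(log2(n)) for positive integers;
        # -1 is used as the key for users with zero posts.
        tally[n.bit_length() - 1 if n > 0 else -1] += 1
    rows = []
    for k in sorted(tally):
        if k < 0:
            label = "0"
        elif k == 0:
            label = "1"
        else:
            label = f"{2**k}-{2**(k + 1) - 1}"
        rows.append((label, tally[k]))
    return rows

# Hypothetical post counts, invented for this example only
posts_per_user = [0, 1, 1, 1, 2, 2, 3, 5, 9, 40, 250]
for label, users in log2_histogram(posts_per_user):
    print(f"{label:>9} {'#' * users} ({users})")
```

Empty bins are simply skipped in the output, so a gap between consecutive labels is itself a hint of a clump boundary.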
Alternatively, make an arbitrary decision as to what counts as High and what counts as Low, then use something like:
RECODE <your variable> (LO THRU <1st cut point> = 1) (<1st cut point> THRU <2nd cut point> = 2) (<2nd cut point> THRU HI = 3) INTO TESTVAR.
FREQ TESTVAR.
If you have missing values, you'll need to replace LO with the lowest valid value and HI with the highest valid value, and add (ELSE = SYSMIS) to the RECODE command.
Forget statistics, make sociological sense first. Advice from a dyed-in-the-wool Old Dog survey researcher.
John F Hall (Mr)
Website: www.surveyresearch.weebly.com
PS Have a look at "Cyberchiefs", a book by Matthieu O'Neil (Pluto Press, 2009) about democracy and the formation of hierarchies in social networks. |
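For readers working outside SPSS, the same three-way split can be sketched in Python. The cut points 5 and 50 and the labels are placeholders chosen for illustration, not recommendations. RECODE applies the first specification that matches, so a value exactly equal to a cut point lands in the lower group; `bisect_left` is used here to mirror that.

```python
from bisect import bisect_left
from collections import Counter

CUT_POINTS = [5, 50]                      # hypothetical cut points
LABELS = ["low", "in between", "high"]    # one more label than cut points

def segment(posts, cut_points=CUT_POINTS, labels=LABELS):
    """Return the band a post count falls into, or None if it is missing.

    A value equal to a cut point goes to the lower band, like the first
    matching range in the RECODE command.
    """
    if posts is None:
        return None
    return labels[bisect_left(cut_points, posts)]

# Hypothetical data; None stands for a missing value
posts_per_user = [0, 1, 5, 6, 50, 51, 250, None]
print(Counter(segment(n) for n in posts_per_user))
```

Keeping the cut points in one named list makes it easy to try several substantively motivated splits and compare the resulting group sizes.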
Administrator
|
In reply to this post by David Marso
You can also use the Chart Editor to display different types of distribution curves on a histogram.
http://publib.boulder.ibm.com/infocenter/spssstat/v20r0m0/index.jsp?topic=%2Fcom.ibm.spss.statistics.help%2Fidh_webhelp_distribution_palette.htm HTH.
--
Bruce Weaver bweaver@lakeheadu.ca http://sites.google.com/a/lakeheadu.ca/bweaver/ "When all else fails, RTFM." PLEASE NOTE THE FOLLOWING: 1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above. 2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/). |
In reply to this post by John F Hall
Do you have predictors? What are you going to do with this distribution?
If you just want to fit a power-law sort of distribution, try a Q-Q plot with, say, Pareto as the distribution: Analyze > Descriptive Statistics > Q-Q Plots, or PPLOT ... /TYPE = Q-Q.
If you want to segment interactively, try Transform > Visual Binning or, especially if you have a target, Transform > Optimal Binning.
HTH
Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM |
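The idea behind a Pareto Q-Q plot can also be sketched by hand. For a Pareto distribution with minimum `x_min` and shape `alpha`, the quantile function is x_min * (1 - p)^(-1/alpha), so plotting log(x / x_min) against -log(1 - p) should give a straight line through the origin with slope 1/alpha. The Python below is a rough illustration under that assumption; the function names and the choice of `x_min` are hypothetical, and real post counts are discrete, so the fit can only be approximate.

```python
import math

def pareto_qq(values, x_min=1.0):
    """Return (theoretical, observed) pairs for a log-scale Pareto Q-Q plot."""
    tail = sorted(v for v in values if v >= x_min)
    n = len(tail)
    points = []
    for i, v in enumerate(tail, start=1):
        p = (i - 0.5) / n                      # plotting position
        points.append((-math.log(1 - p), math.log(v / x_min)))
    return points

def slope_through_origin(points):
    """Least-squares slope of a line forced through the origin."""
    sxy = sum(t * o for t, o in points)
    sxx = sum(t * t for t, _ in points)
    return sxy / sxx

# Synthetic check: exact Pareto quantiles with alpha = 2 and x_min = 1
n = 200
sample = [(1 - (i - 0.5) / n) ** (-1 / 2) for i in range(1, n + 1)]
slope = slope_through_origin(pareto_qq(sample))
print("estimated alpha:", 1 / slope)
```

If the plotted points bend away from a straight line, that is a sign the Pareto form does not describe the data well over that range.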
In reply to this post by Bruce Weaver
You can also work with Visual Binning.
You can recode your variable into a new variable. Try to avoid using (... = SYSMIS) whenever possible. You, the user, decided that the value should be missing, so declare it as a user-missing value rather than letting it become system-missing.
Art Kendall
Social Research Consultants |
In reply to this post by whatsinaname
As Jon asks, what is your purpose? "Teasing apart differences..." only says a little bit.
When I was watching over various aspects of computer usage on mainframes in the 1980s, it was useful to me to list some raw data (ID and relevant related information) for the top 10 or 20 users in a category, in descending order. My purpose was related to "managing a scarce resource." The Pareto-curve descriptions were useful, saying (for instance) that 90% of the consumption was by 10% of the people, or "by 3 people". It is popular to use complementary fractions, like 90/10 or 80/20, and it is also popular to use rounded-off cut-offs, like "the top 1%" and "the top 10%", when those fractions account for large amounts of the resource in question.
However, you are referring to e-mails/posts, so that is not a limited resource. It does make sense to lump together the top 2 users if they are similar in profile, or are the same user under two names, but that is while still thinking of using up a resource. Is that sort of reduction useful for your data summary?
Whatever the subject, the more the aggregated folks differ, the less sense it makes to "lump" some fraction of them together. But what "differences" or what characteristics are going to be relevant to you? That takes us back to the question: what is your purpose?
--
Rich Ulrich |
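A Pareto-style summary such as "90% of the posts come from 10% of the users" can be computed directly from the per-user counts. This is a small sketch in Python; `top_share` is a hypothetical helper name and the counts are invented so that the arithmetic is easy to follow.

```python
def top_share(posts_per_user, top_fraction):
    """Share of all posts written by the most active top_fraction of users."""
    ranked = sorted(posts_per_user, reverse=True)
    k = max(1, round(len(ranked) * top_fraction))   # at least one user
    total = sum(ranked)
    return sum(ranked[:k]) / total if total else 0.0

# Hypothetical counts for ten users, 100 posts in total
posts_per_user = [90, 2, 2, 1, 1, 1, 1, 1, 1, 0]
for fraction in (0.1, 0.2, 0.5):
    share = top_share(posts_per_user, fraction)
    print(f"top {fraction:.0%} of users wrote {share:.0%} of posts")
```

Printing the ranked list itself (the top 10 or 20 users) alongside these shares shows whether the "top" group is a handful of similar users or a mixed bag.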
Thanks for all the helpful replies.... much appreciated!
I am trying to create a model where 'degree of participation in online community' is one of many predictors of an outcome such as performance. I have a measure of performance, just need a better understanding of how to model participation. |
Taking what is "meaningful": I would say that participating in an average of one "thread" per month is a fairly high level of participation. The number of threads can be more salient than the number of posts, especially if folks create new Subject: lines as needed, stick to the topic, and don't often break one topic into multiple threads. Creating a new thread can be different from replying. This assumes that your data and software can readily define a thread.
For people with the same average, regular participation is a different commitment from sporadic participation. But that might not be easy to disentangle.
--
Rich Ulrich |
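If the data do identify threads, both measures mentioned above (threads per month, and regular versus sporadic activity) can be derived from a post log. The sketch below is in Python and assumes a hypothetical record layout of one (user, thread_id, "YYYY-MM") tuple per post; the field names, the function name and the sample records are all illustrative.

```python
from collections import defaultdict

def thread_activity(records, months_observed):
    """Summarise per-user thread participation from (user, thread, month) posts."""
    threads = defaultdict(set)   # user -> distinct threads posted in
    months = defaultdict(set)    # user -> months with at least one post
    for user, thread_id, month in records:
        threads[user].add(thread_id)
        months[user].add(month)
    return {
        user: {
            "threads_per_month": len(threads[user]) / months_observed,
            "active_month_share": len(months[user]) / months_observed,
        }
        for user in threads
    }

# Hypothetical log: both users join four threads over four months,
# but "a" posts every month while "b" posts only in January.
records = [
    ("a", "t1", "2012-01"), ("a", "t1", "2012-01"), ("a", "t2", "2012-02"),
    ("a", "t3", "2012-03"), ("a", "t4", "2012-04"),
    ("b", "t1", "2012-01"), ("b", "t2", "2012-01"),
    ("b", "t3", "2012-01"), ("b", "t4", "2012-01"),
]
for user, stats in sorted(thread_activity(records, months_observed=4).items()):
    print(user, stats)
```

The two users end up with the same threads-per-month average but very different shares of active months, which is one simple way to separate regular from sporadic participation.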
In reply to this post by whatsinaname
I know you asked about categorizing but I wonder if an alternative, possibly useful, variable would be the log of the posting frequency.
Gene Maguin |
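To see what a log transform does to counts like these, here is a minimal Python sketch. Using log(1 + n) instead of log(n) is an assumption added here so that users with zero posts stay defined; the post above only says "the log of the posting frequency". The helper name `log_posts` is hypothetical.

```python
import math

def log_posts(n):
    """log(1 + n): zero posts maps to 0 and the long upper tail is compressed."""
    return math.log1p(n)

# Hypothetical counts spanning several orders of magnitude
for n in (0, 1, 10, 100, 1000):
    print(f"{n:>5} posts -> {log_posts(n):.2f}")
```

On this scale each tenfold increase in posting adds roughly the same amount, so a single very heavy poster no longer dominates a linear model the way the raw count would.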
Gene, can you tell me more about why you think log of posting frequency would be a good variable?
|
Administrator
|
What Gene is getting at, I suspect, is that it's usually better to analyze continuous variables as continuous rather than carving them into categories. There are lots of articles that address this issue, including this very readable one by Dave Streiner:
http://ww1.cpa-apc.org/publications/archives/cjp/2002/april/researchMethodsDichotomizingData.asp HTH.
--
Bruce Weaver |
In reply to this post by whatsinaname
As Bruce said, that is the reason. As part of that reason, with a DV like log frequency, you can look at the linearity of the relationship between the predictors and changes in the DV.
Gene Maguin |