SPSSX Discussion

Count distinct using aggregate

Classic

List

Threaded

21 messages Options

nessie

Re: Count distinct using aggregate

Dear David

I checked your last syntax today and it was just amazing. It was so fast that I was sure it didn’t work, but it worked like a charm and the end results contains just a fraction of the variables I got with the “original” method (as you said).

Thanks a lot. This makes it possible to interpret a large quantity of public health data which hopefully will lead to better understanding of patient trajectories.

Best regards

Lars N.

7. jun. 2015 kl. 18.49 skrev David Marso [via SPSSX Discussion] <[hidden email]>:

DATA LIST LIST /
id * di_m (A1) di_b1 (A1) di_b2 (A1) di_b3 (A1).
BEGIN DATA.
1, S, G
1, D, S
1, B, X
2, A, B
2, A
3, C
4, A
4, A
4, H, J
4, A, Y, U, J
END DATA.
VARSTOCASES /MAKE Di FROM di_m di_b1 di_b2 di_b3.
AGGREGATE OUTFILE * / BREAK id Di/ Count=N.
AGGREGATE OUTFILE * MODE ADDVARIABLES/ BREAK id/ Distinct=N.
/* First version (probably unweildy with 20K possible D(i)
/*CASESTOVARS ID=id /INDEX=Di.
CASESTOVARS ID=id .

David Marso wrote
Definitely do the restructure using VARSTOCASES. Then the identical logic applies to get your desired result.
I suspect that will be much easier than doing the vector/loop with an indefinite number of possible targets.

Jignesh Sutar wrote
You can get a distinct number of values across variables with each case
using VECTOR/LOOP, I have some code to do this but can't remember on the
fly at the moment, neither on my laptop.

You cold always restructure the dataset so you have a long format
PATIENT-DIAGNOSIS level dataset (probably preferable in most cases) and
then apply same logic as previously to get distinct diagnosis codes within
each patient, which is what I think you are trying to do?
On Sat, 6 Jun 2015 at 22:53, nessie <<a href="x-msg://2/user/SendEmail.jtp?type=node&node=5729778&i=0" target="_top" rel="nofollow" link="external" class="">[hidden email]> wrote:

> Dear David and Jignesh
>
> Thank’s a lot - I would go as far as to say you guys are geniuses.
> I still agree with Jignesh that this should be implemented as a “distinct
> count” (and maybe also a “list distinct”) under the “aggregate” function in
> SPSS, but thank’s to you I can do the same without it. I bet this could
> also be useful for a lot of other SPSS users.
>
> *One more question:*
> Is it possible to do this across multiple variables for each case?
>
> E.g.:
> A patient could have several variables describing diagnosis codes for each
> case. One for main diagnose and quite a few to describe possible bi
> diagnosis. Each code could only be present one time for each case, but the
> same code could be found in different places (main and different bi
> diagnosis variables) across cases with the same id.
>
> I want to count the number of distinct diagnosis codes for each id across
> cases and code-variables.
>
> *Example data:*
>
> DATA LIST LIST /
> id * di_m (A1) di_b1 (A1) di_b2 (A1) di_b3 (A1).
> BEGIN DATA.
> 1, S, G
> 1, D, S
> 1, B, X
> 2, A, B
> 2, A
> 3, C
> 4, A
> 4, A
> 4, H, J
> 4, A, Y, U, J
> END DATA.
>
> LIST.
>
> *The output would look like this:*
>
> id di_m di_b1 di_b2 di_b3
> 1.00 S G
> 1.00 D S
> 1.00 B X
> 2.00 A B
> 2.00 A
> 3.00 C
> 4.00 A
> 4.00 A
> 4.00 H J
> 4.00 A Y U J
>
> *Desired outcome *(the n_distinct_di is most important, the rest is a
> bonus)*:*
>
> id A B C D G H J S U X Y n_di *n_distinct_di*
> 1 1 1 1 2 1 6 *5*
> 2 2 1 3 *2*
> 3 1 1 *1*
> 4 3 1 2 1 1 8 *5*
>
> *Even more desired outcome (there is almost 20 000 different diagnosis
> codes, so this would make it easier to interpret and analyse):*
>
> id di_1 di_2 di_3 di_4 di_5 n_di *n_distinct_di*
> 1 S G D B X 6 *5*
> 2 A B 3 *2*
> 3 C 1 *1*
> 4 A H J Y U 8 *5*
>
> Thanks a lot for all you help and effort to find smart and quick
> solutions. Regardless of if you can help me with this last question or not
> your help has been priceless.
>
> Best regards
> Lars N.
>
> 5. jun. 2015 kl. 17.39 skrev David Marso [via SPSSX Discussion] <[hidden
> email] <http:///user/SendEmail.jtp?type=node&node=5729773&i=0>>:
>
> Actually the SORT and DATASET were leftover from previous solution.
> It could be as simple as follows: Note the raw data are replaced by the
> AGGREGATE.
> --
> DATA LIST FREE / id department.
> BEGIN DATA.
> 1 2 1 3 1 2 2 1 2 1 3 3 4 1 4 2 4 2 4 5
> END DATA.
>
> FORMATS department (F1).
> AGGREGATE OUTFILE * /BREAK ID department /Count=N.
> CASESTOVARS /ID=id /INDEX=department.
> COMPUTE @=$SYSMIS.
> COMPUTE n_dept=NVALID(id TO @)-1.
> MATCH FILES / FILE * / DROP @.
> LIST.
>
> Jignesh Sutar wrote
> That's a neat way of doing it also. Avoids a multi variable SORT so
> perhaps preferable in terms of speed performance with large datasets (but
> you do then have to juggle/match with multiple datasets).
>
> I do feel there still should be a more direct way of doing this in SPSS. A
> simple DISTINCT function in AGGREGATE would be very much apt for this.
>
> Please reply to the list and not to my personal email.
>
> Those desiring my consulting or training services please feel free to
> email me.
> ---
> "Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos
> ne forte conculcent eas pedibus suis."
> Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in
> abyssum?"
>
>
>
> ------------------------------
> If you reply to this email, your message will be added to the discussion
> below:
>
> http://spssx-discussion.1045642.n5.nabble.com/Count-distinct-using-aggregate-tp5729756p5729769.html
> To unsubscribe from Count distinct using aggregate, click here.
> NAML
> <http://spssx-discussion.1045642.n5.nabble.com/template/NamlServlet.jtp?macro=macro_viewer&id=instant_html%21nabble%3Aemail.naml&base=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespace&breadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml>
>
>
>
> ------------------------------
> View this message in context: Re: Count distinct using aggregate
> <http://spssx-discussion.1045642.n5.nabble.com/Count-distinct-using-aggregate-tp5729756p5729773.html>
> Sent from the SPSSX Discussion mailing list archive
> <http://spssx-discussion.1045642.n5.nabble.com/> at Nabble.com.
> ===================== To manage your subscription to SPSSX-L, send a
> message to <a href="x-msg://2/user/SendEmail.jtp?type=node&node=5729778&i=1" target="_top" rel="nofollow" link="external" class="">[hidden email] (not to SPSSX-L), with no body text
> except the command. To leave the list, send the command SIGNOFF SPSSX-L For
> a list of commands to manage subscriptions, send the command INFO REFCARD

=====================
To manage your subscription to SPSSX-L, send a message to
<a href="x-msg://2/user/SendEmail.jtp?type=node&node=5729778&i=2" target="_top" rel="nofollow" link="external" class="">[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"

If you reply to this email, your message will be added to the discussion below:
http://spssx-discussion.1045642.n5.nabble.com/Count-distinct-using-aggregate-tp5729756p5729778.html
To unsubscribe from Count distinct using aggregate, click here.
NAML