Re: loop and do repeat problem with thousands of unique values to insert

Posted by David Marso on
URL: http://spssx-discussion.165.s1.nabble.com/loop-and-do-repeat-problem-with-thousands-of-unique-values-to-insert-tp4268902p4269412.html

"regarding vartstocase option, I'm not sure whether spss allows so many
columns. "
Note that you are only bringing back the 14 mapped fields, not thousands of variables, so it shouldn't be a problem. It is hard to say whether the DO REPEAT would work with thousands of values. Even if it does, it potentially runs an enormous number of IF statements to place a given value, and after it has found a match it still continues to the end of the list. So if you have 1000 values it will take 14000 comparisons to fill one case (even if all 14 values are at the beginning of the list).
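
To make that arithmetic concrete, here is a rough Python sketch (not SPSS, and not a claim about how SPSS implements DO REPEAT internally). The function name and the made-up values v0001 to v1000 are purely illustrative assumptions. It scans the whole list for every field without stopping at the first match, and counts the comparisons.

```python
def recode_by_scanning(names, values, codes):
    """Recode each name by comparing it against every listed value.

    Mimics a list of IF statements with no early exit: the scan keeps
    going to the end of the list even after a match has been found.
    """
    comparisons = 0
    recoded = []
    for name in names:
        code = None
        for value, candidate in zip(values, codes):
            comparisons += 1
            if name == value:
                code = candidate  # deliberately no break here
        recoded.append(code)
    return recoded, comparisons


# Hypothetical lookup list: 1000 unique strings with codes 1..1000.
values = [f"v{i:04d}" for i in range(1, 1001)]
codes = list(range(1, 1001))

# One case with 14 string fields, all found at the start of the list.
names = values[:14]

recoded, comparisons = recode_by_scanning(names, values, codes)
print(comparisons)  # 14 fields x 1000 values = 14000

# A keyed lookup needs only one lookup per field instead.
lookup = dict(zip(values, codes))
keyed = [lookup.get(name) for name in names]
print(keyed == recoded)
```

The keyed lookup at the end is the idea behind matching against a sorted table instead of listing every value in the syntax.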

Here is a mock up of what I had in mind.
Omit any unnecessary sorts/saves depending upon your data files. For example, if you already have a sequential unique ID variable in your master file, omit the COMPUTE ID=$CASENUM and SAVE.
HTH, David
* Lookup table: each unique string with its numeric code.
data list free / strvar (a3) nummap (f4).
begin data
abc 1
def 2
ghi 3
jkl 4
bcd 5
efg 6
rst 7
ijk 8
kml 9
uvw 10
end data.
* Sort the lookup table by the key so it can be used as a TABLE in MATCH FILES.
sort cases by strvar.
save outfile 'c:\temp\temp.sav'.

* Mock master file: four string variables plus twenty other variables.
data list list / strvar1 to strvar4 (4A3) stuff01 to stuff20 (20f4).
begin data
abc def ghi jkl 3 2 3 4 3 4 2 4 3 2 3 4 2 2 4 3 2 6 7 5
bcd efg rst ijk 6 7 2 3 4 5 6 7 5 6 7 5 3 4 2 6 5 6 7 5
ghi kml rst uvw 3 4 2 6 7 5 7 6 2 3 4 6 7 5 6 7 5 7 2 5
end data.
* Sequential case ID, used later to rebuild one row per case.
compute ID=$CASENUM.
SAVE OUTFILE 'c:\temp\raw.sav'.

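* Keep only the ID and the string variables, then go from wide to long.
match files / file * / keep ID strvar1 to strvar4.
VARSTOCASES MAKE strvar FROM strvar1 TO strvar4
                   / INDEX=Index1(4)
                   / KEEP ID.
SORT CASES BY strvar.

* Look up the numeric code for each string in the sorted lookup table.
MATCH FILES / FILE * / TABLE 'c:\temp\temp.sav' / BY strvar.
* Put each code in the slot given by Index1, then go from long back to wide.
VECTOR numvars (4).
COMPUTE numvars(index1)=nummap.
AGGREGATE outfile *
    / BREAK id
    / numvars1 to numvars4=MAX(numvars1 to numvars4).
* Merge the numeric codes back onto the master file.
MATCH FILES FILE 'c:\temp\raw.sav' / file * / BY id.
EXECUTE.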
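
For anyone who wants to sanity-check what the mock-up should produce without running SPSS, here is a rough Python sketch of the same sequence (wide to long, table lookup on the string, long back to wide) applied to the mock data above. It is only an illustration of the logic under that assumption. The variable names are borrowed from the syntax for readability, and none of this is SPSS functionality.

```python
# Lookup table from the first DATA LIST: string -> numeric code.
nummap = {"abc": 1, "def": 2, "ghi": 3, "jkl": 4, "bcd": 5,
          "efg": 6, "rst": 7, "ijk": 8, "kml": 9, "uvw": 10}

# strvar1..strvar4 for the three mock master cases, keyed by case ID.
master = {1: ["abc", "def", "ghi", "jkl"],
          2: ["bcd", "efg", "rst", "ijk"],
          3: ["ghi", "kml", "rst", "uvw"]}

# Wide to long: one (id, index1, strvar) row per string value.
long_rows = [(case_id, index1, strvar)
             for case_id, strvars in master.items()
             for index1, strvar in enumerate(strvars, start=1)]

# Keyed lookup on strvar, then long back to wide: one list of codes per ID.
numvars = {case_id: [None] * 4 for case_id in master}
for case_id, index1, strvar in long_rows:
    numvars[case_id][index1 - 1] = nummap.get(strvar)

for case_id in sorted(numvars):
    print(case_id, numvars[case_id])
```

Each case costs four keyed lookups here, however many entries the lookup table holds.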




Maurice Vergeer wrote
dear all,

thanks for your suggestions.

Regarding AUTORECODE (David's and Art's suggestion): I tried this, but
it took an enormously long time, so I interrupted it. The point is, there
are thousands of unique values, but approx. 4.5 million records (file size
over 3 gigabytes). So, it's large.

Regarding the VARSTOCASES option, I'm not sure whether SPSS allows so many
columns. The values as such are not necessarily meaningful but need to
stay unique.

It appears there is no easy or obvious solution.
One option not explored yet is simply listing the string values and
numerical values in the DO REPEAT.
This would result in a very large syntax file. It is a dirty
solution, and I'm not sure whether it's quick either.

Tonight I'll try to run one of the options above and see whether it has
finished when I return from work tomorrow afternoon.

I'll let you know whether it worked.

thanks again
Maurice




On Tue, Mar 29, 2011 at 20:37, David Marso <david.marso@gmail.com> wrote:
> Hi Maurice,
> If AUTORECODE ... /GROUP is not what you want (i.e., your numeric codes have
> some specific meaning):
> SORT your external system file by the string variable and save it.
> Transform your master file from wide to long using VARSTOCASES, retaining
> the case identifier, the string, and the index.
> SORT by string.
> MATCH FILES using the external file as a table with the string as a key.
> transform the file from long to wide.
> Done.
> HTH, David
> --
>
> Maurice Vergeer wrote:
>>
>> dear fellow list visitors,
>>
>> please help me with this problem.
>> I have the following syntax which works perfectly.
>>
>> It 'replaces' strings in old variables (name1 to name14) with
>> numerical ones in new variables (newname1 to newname14).
>>
>> example:
>> vector name=name1 to name14.
>> vector newname(14).
>> loop i=1 to 14.
>> do repeat a="alpha" "beta" "gamma" / b=1 2 3.
>> - if name(i) = a newname(i)=b.
>> end repeat print.
>> end loop.
>>
>>
>> However, instead of three values (alpha beta and gamma) I have
>> thousands of unique string values stored in a separate system file,
>> each identified with a unique numerical code.
>> How can I insert these values in the do repeat function (after 'a='
>> and after 'b=')?
>>
>> The reason why I want to change these from string to numeric ones is
>> that I know the system file will be smaller and hopefully also faster
>> to read.
>>
>> Your help is much appreciated.
>>
>> sincerely
>> Maurice
>>
>>
>>
>>
>> --
>> ___________________________________________________________________
>> Maurice Vergeer
>> Department of communication, Radboud University (www.ru.nl)
>> PO Box 9104, NL-6500 HE Nijmegen, The Netherlands
>>
>> Visiting Professor Yeungnam University, Gyeongsan, South Korea
>>
>> Recent publications:
>> -Vergeer, M., Hermans, L., & Sams, S. (accepted for publication).
>> Online social networks and micro-blogging in political campaigning:
>> The exploration of a new campaign tool and a new campaign style. Party
>> Politics.
>> -Eisinga, R., Franses, Ph.H., & Vergeer, M. (2010). Weather conditions
>> and daily television use in the Netherlands, 1996–2005. International
>> Journal of Meteorology.
>>
>> Webspace
>> www.mauricevergeer.nl
>> http://blog.mauricevergeer.nl/
>> www.journalisteninhetdigitaletijdperk.nl
>> maurice.vergeer (skype)
>> ___________________________________________________________________
>>
>> =====================
>> To manage your subscription to SPSSX-L, send a message to
>> LISTSERV@LISTSERV.UGA.EDU (not to SPSSX-L), with no body text except the
>> command. To leave the list, send the command
>> SIGNOFF SPSSX-L
>> For a list of commands to manage subscriptions, send the command
>> INFO REFCARD
>>
>
>
> --
> View this message in context: http://spssx-discussion.1045642.n5.nabble.com/loop-and-do-repeat-problem-with-thousands-of-unique-values-to-insert-tp4268902p4269231.html
> Sent from the SPSSX Discussion mailing list archive at Nabble.com.
>



Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
"Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"