string match

string match

Jon Oh
Hi,

Is there syntax that can be used to match two or more strings and identify
what they have in common?
For example:

Var1    Var2    Comm
12345   2457    245
4       134     4
2567    37      7
234     56

Given Var1 and Var2, can Comm be computed in SPSS? If so, can a Var3
be added to the match?

Thank you very much!
Jon

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Re: string match

Albert-Jan Roskam
Hi Jon,

There are quite a few string similarity measures. The
first thing that came to my mind was n-gram indexing.
There's a Python module for that:
http://pypi.python.org/pypi/ngram/2.0.0b2
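
For readers unfamiliar with the idea, here is a minimal sketch of character
n-gram similarity in plain Python. It does not use the ngram module linked
above; the function names char_ngrams and ngram_similarity are made up for
illustration, and Jaccard overlap is just one assumed way to score two n-gram
sets. Note that n-grams capture contiguous runs of characters, which is a
different notion from the "characters in common" asked about in the question.

```python
def char_ngrams(text, n=2):
    """Return the set of overlapping character n-grams in text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def ngram_similarity(a, b, n=2):
    """Jaccard similarity of the two n-gram sets: shared / total distinct."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 0.0  # neither string is long enough to form an n-gram
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# "12345" has bigrams 12, 23, 34, 45; "2345" has 23, 34, 45.
print(ngram_similarity("12345", "2345"))  # 3 shared out of 4 distinct -> 0.75
```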

Below is SPSS syntax that creates a variable containing
all the common units (in this case, digits).

data list / var1 1-5 (a) var2 7-11 (a) comm 12-18 (a).
begin data
12345 2457 245
4     134  4
2567  37   7
234   56
1234  1234 1234
end data.

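* comm2 collects every character of var1 that also occurs in var2.
string comm2 (a8).
loop #i = 1 to 5.
loop #j = 1 to 5.
* Append the matching character after trimming the trailing blanks of comm2.
if (substr(var1,#i,1) = substr(var2,#j,1)) comm2 =
   concat(rtrim(comm2), substr(var1,#i,1)).
end loop.
end loop.
execute.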
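
To make the behavior of the nested loop easy to check outside SPSS, here is a
rough Python mirror of it. This is a sketch under the assumption that both
inputs behave like blank-padded, fixed-width strings of 5 characters; the
function name common_chars_loop and the width parameter are illustrative only.

```python
def common_chars_loop(var1, var2, width=5):
    """Compare every position of var1 with every position of var2."""
    a = var1.ljust(width)  # emulate blank-padded fixed-width strings
    b = var2.ljust(width)
    comm2 = ""
    for i in range(width):
        for j in range(width):
            if a[i] == b[j]:
                # Trim first, as rtrim does, so matched padding blanks
                # never accumulate in the result.
                comm2 = comm2.rstrip() + a[i]
    return comm2.rstrip()


print(common_chars_loop("12345", "2457"))  # 245
print(common_chars_loop("1223", "2"))      # 22
```

The second call shows that a character repeated within one input is appended
once per matching pair, so it is repeated in the result.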


Cheers!!
Albert-Jan



Re: string match

Peck, Jon
This syntax will work as long as no character is duplicated within a string (assuming that duplicates should not appear in the result).

A nice way to do this with Python would be to use its set methods.  The result would just be the intersection of the sets formed by the characters of the two input strings.  Duplicates would automatically be removed.

Here's a fragment.
If var1 = "12345" and var2 = "2457", the characters in common would just be
"".join(set(var1).intersection(set(var2)))

set(var1) creates a set with members "1", "2", etc.
The intersection operator does the calculation, and
"".join(...) recombines the members of the resulting set into a string.
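
One caveat with the one-liner: Python sets are unordered, so the joined
characters can come back in any order. The sketch below is one possible way to
keep the order of the first string and to accept a third variable as well. The
function name common_chars is hypothetical, treating blanks as padding is an
assumption, and connecting this to an SPSS dataset is not shown.

```python
def common_chars(*values):
    """Characters found in every input, in first-appearance order of values[0]."""
    if not values:
        return ""
    # Intersect the character sets of all inputs; blanks are treated as padding.
    shared = set(values[0]).intersection(*map(set, values[1:])) - {" "}
    seen = set()
    ordered = []
    for ch in values[0]:
        if ch in shared and ch not in seen:
            seen.add(ch)  # emit each shared character only once
            ordered.append(ch)
    return "".join(ordered)


print(common_chars("12345", "2457"))          # 245
print(common_chars("12345", "2457", "4589"))  # 45
```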

HTH,
Jon Peck


-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Albert-jan Roskam
Sent: Sunday, May 04, 2008 4:38 AM
To: [hidden email]
Subject: Re: [SPSSX-L] string match
