DATASET vs MATCH FILES

DATASET vs MATCH FILES

Ron0z
Here’s a weird thing; if anyone can offer an explanation, that would be
good. It has to do with the use of DATASET.

Where I don’t wish to save a system file but have the need of a data source
as part of my process I’ve found DATASETs quite useful. Well, at least until
now.

The files used in my code snippet below hold records that do in fact match
(when sorted as noted) with the following number of records in each:
VRL_hepcat: 15 records
VSR_hepcat: 24 records

This is the first time DATASETs have let me down, and I don't think I've
done anything out of the ordinary.

The following works perfectly. By that I mean the match files command
produces the expected output.

DATA LIST FIXED
    FILE = VRL_hepcat
    /e313 1-10 (A) e307 11-20 (A) e534 21-26 (A) e333 27-29 (A) e339A 30-39 (A)
     e354 40-49 (A) e489 50-57 (A) e464 58-63 (A) e329 64-64 (A).
sort cases by e313 e307 e354 e489.
save outfile 'C:\hepcat\Export\VRLtmp.sys'.

(snip)

DATA LIST FIXED
    FILE = VSR_hepcat
    /e313 1-10 (A) e307 11-20 (A) e354 21-30 (A) e488 31-40 (A) e489 41-48 (A)
     e446 49-49 (A) e543 50-89 (A) e333 90-92 (A) filler 101-150 (A).
sort cases by e313 e307 e354 e489.
save outfile 'C:\hepcat\Export\VSRtmp.sys'.

(snip)


match files
    /file = 'C:\hepcat\Export\VSRtmp.sys'
    /in = inVSR
    /file = 'C:\hepcat\Export\VRLtmp.sys'
    /in =inVRL
    /by e313 e307 e354 e489.
execute.

The FREQ output for inVRL shows
0 – 9
1 – 15
Total - 24
(With 15 records in common to both files, 15 is as good as it gets)



However, the following does not work. All but one record is matched.

DATA LIST FIXED
    FILE = VRL_hepcat
    /e313 1-10 (A) e307 11-20 (A) e534 21-26 (A) e333 27-29 (A) e339A 30-39 (A)
     e354 40-49 (A) e489 50-57 (A) e464 58-63 (A) e329 64-64 (A).

sort cases by e313 e307 e354 e489.
dataset name VRLds.

(snip)

DATA LIST FIXED
    FILE = VSR_hepcat
    /e313 1-10 (A) e307 11-20 (A) e354 21-30 (A) e488 31-40 (A) e489 41-48 (A)
     e446 49-49 (A) e543 50-89 (A) e333 90-92 (A) filler 101-150 (A).
sort cases by e313 e307 e354 e489.
dataset name VSRds.

(snip)


match files
    /file = VSRds
    /in = inVSR
    /file = VRLds
    /in =inVRL
    /by e313 e307 e354 e489.
execute.

The FREQ output for inVRL shows
0 – 10
1 – 14
Total - 24
(So, this is the weird part. Only 14 records matched. The unmatched record
was the first record in the files.)




--
Sent from: http://spssx-discussion.1045642.n5.nabble.com/

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Re: DATASET vs MATCH FILES

Jon Peck
This is weird.  A dataset is actually an anonymous, temporary file, so there should never be a difference between results using it and using the referenced file or data definition directly.  I would like to see the actual files to see if I can reproduce this difference.

On Sun, Jul 14, 2019 at 11:39 PM Ron0z <[hidden email]> wrote:
> --- snip ---

--
Jon K Peck
[hidden email]


Re: DATASET vs MATCH FILES

Rick Oliver
While I doubt that this is part of the issue, the behavior of MATCH FILES is different when you refer to the active dataset by name rather than with an asterisk ("*"). IIRC, it creates a new unnamed dataset rather than replacing the active dataset, which can create some confusion as to which dataset subsequent commands are actually acting on.

On Mon, Jul 15, 2019 at 1:17 PM Jon Peck <[hidden email]> wrote:
> --- snip ---

Re: DATASET vs MATCH FILES

Ron0z
Well Rick, that would be totally unexpected. I don’t doubt your experience,
but if a file is referenced by name, then that should be the file that’s
accessed. To do otherwise seems like a bug.

My most recent experience would seem to demonstrate that too, to some
extent.

I have attached some code and data that demonstrate the weirdness. If you
modify the code to suit where you place the data files you will see what I’m
on about.

First, note the select statement mid-way through the code (select if
(char.substr(e490,1,1) eq '4').). Run the code unaltered and note the
frequencies at the end of the code: it reads 14 matched records as evidenced
by the freq result of inVRL.

Next, comment out the same select statement and run it again. This run will
result in 15 matched records. Weird!

The interesting thing about this is that nothing follows the select
statement. It has no purpose. You will see that VRL_hepcat data is read,
sorted, a dataset is created, and records listed. Had some other command
followed the select statement something may have been expected, but with
nothing following it, it’s pointless because the next statement is where
VSR_hepcat is read. At that point, the active data should be dismissed,
except that the dataset VRLds is there to retain the data that was read.
That’s its purpose. The process seems to behave as though the select
statement were placed before the dataset command.

I’ve left some code in there which saves data to system files that are used
by match files. If you comment out the dataset commands and reinstate the
save file commands and use the appropriate match files command, you’ll see
that the freq result of inVRL shows 15 irrespective of whether the select
command has been included in the code or not. In other words, the code and
data behave as expected.

(The select statement did have a purpose, but I've deleted it for clarity.)

Jon asked for a sample. Hopefully it is attached okay.

DATASET_test_data.zip
<http://spssx-discussion.1045642.n5.nabble.com/file/t340795/DATASET_test_data.zip>  


 




Re: DATASET vs MATCH FILES

Bruce Weaver
Administrator
Those who do not use Nabble can get the uploaded file here:

http://spssx-discussion.1045642.n5.nabble.com/DATASET-vs-MATCH-FILES-tp5738150p5738156.html



Ron0z wrote
> --- snip ---
>
> I have attached some code and data that demonstrate the weirdness. If you
> modify the code to suit where you place the data files you will see what
> I’m
> on about.
>
> --- snip ---





-----
--
Bruce Weaver
[hidden email]
http://sites.google.com/a/lakeheadu.ca/bweaver/

"When all else fails, RTFM."

NOTE: My Hotmail account is not monitored regularly.
To send me an e-mail, please use the address shown above.


Re: DATASET vs MATCH FILES

Jon Peck
In reply to this post by Ron0z
The syntax stream includes this command:
select if (char.substr(e490,1,1) eq '4').
That knocks out the first case in the active file.

When you do the match files with the dataset references, which are already open, that necessarily excludes the first case.  Specifying a dataset in a command refers to the open file.  It does not go back to the disk version of the file.
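
A minimal sketch of that behavior, using made-up inline data rather than the hepcat files (the variable names id and flag are hypothetical). Naming a dataset does not freeze it, so a later SELECT IF still changes the cases that the name refers to.

```spss
* Toy data for illustration only, not the hepcat files.
DATA LIST LIST /id (F2.0) flag (A1).
BEGIN DATA
1 3
2 4
3 4
END DATA.
DATASET NAME A.
* A is still the active dataset, so this SELECT IF changes A itself.
SELECT IF (char.substr(flag,1,1) EQ '4').
EXECUTE.
* LIST should now show only the cases with id 2 and 3.
LIST.
```

Any later command that refers to A by name, MATCH FILES included, would then see the reduced set of cases.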


Re: DATASET vs MATCH FILES

Ron0z
I had been using DATASETs under the assumption that the following code

DATA LIST
DATASET NAME A
SELECT IF something
DATASET NAME B

would create two datasets, A and B, with B having fewer records than A.

I ran a few tests to check this. The most relevant finding is that when B is
created, A ceases to exist. I’m so disappointed. That seems such a pity, Jon.
It also seems so non-intuitive. There are so many times when having A and B
available at the same time would be useful.


However, the following code

DATA LIST
DATASET NAME A
DATA LIST
SELECT IF something
DATASET NAME B

where the second DATA LIST has the same spec as the first, results in both A
and B being created and available for use at the same time.

I hope SPSS developers are tuned into this one and might consider reviewing
the way DATASET works.







Re: DATASET vs MATCH FILES

Jon Peck
That is how it works.  This code results in two named datasets.  (If a dataset has no name, it survives if/while it is active, but disappears when another dataset is activated.)
DATA LIST list/x.
begin data
1
2
3
end data.
DATASET NAME A.
DATA LIST list/x.
begin data
4
5
6
end data.
SELECT IF x > 4.
DATASET NAME B.

dataset display.

Datasets
Datasets  Creation Timestamps      Associated Files
A         16-JUL-2019 18:09:22.00
B         16-JUL-2019 18:09:23.00

B is the active dataset, but A survives because it is named.
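
For the single-source case raised earlier in the thread (one DATA LIST, then a full dataset and a filtered one), DATASET COPY may be worth trying. A minimal sketch, assuming a release that supports DATASET COPY and reusing the toy variable x from above; the filter value is arbitrary.

```spss
DATA LIST LIST /x.
BEGIN DATA
1
2
3
END DATA.
DATASET NAME A.
* Copy the current state of the active dataset to a new dataset named B.
DATASET COPY B.
* The copy is not activated automatically, so switch to it before filtering.
DATASET ACTIVATE B.
SELECT IF x > 1.
EXECUTE.
DATASET DISPLAY.
```

If this behaves as documented, A keeps all three cases while B holds only the filtered ones, and both remain available by name.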





On Tue, Jul 16, 2019 at 5:42 PM Ron0z <[hidden email]> wrote:
> --- snip ---

Re: DATASET vs MATCH FILES

Ron0z
The examples in the manual show that each dataset name has a different
source: mydata.sav is the source of dataset file1, and moredata.sav is the
source of dataset file2. That’s something I missed the first time round.


