The text wizard would automatically make the names into valid SPSS names.
As for the odd characters, the usual reason for this is that the character encoding specified did not match the data. Does your raw data contain non-ASCII characters? It might not be UTF-8 but rather some other code page. CR and LF would not show up in the way you describe. I'm assuming that these are occurring in character data. That first line might also help to figure this out.
Can you send me a screen shot of what you are seeing in the DE and, if possible, the corresponding chunk of raw data or at least a picture?
It is possible to replace nonprinting characters, but you have to know what the codes are. And if what you see looks like a black diamond with a white question mark in it, that indicates that something was read that is not a valid UTF-8 character.
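For instance, here is a small Python sketch of that idea (the helper names and the space replacement character are my own, for illustration): it scans raw bytes for anything outside printable ASCII and reports the hex codes, so you know what to search for and replace.

```python
# Scan raw data for bytes outside printable ASCII (0x20-0x7E) and report
# their hex codes; tab (0x09) and CR (0x0D) are common invisible culprits.

def report_odd_bytes(data: bytes, context: int = 10):
    """Return (offset, hex code, surrounding bytes) for each suspect byte."""
    hits = []
    for i, b in enumerate(data):
        if not (0x20 <= b <= 0x7E):
            snippet = data[max(0, i - context): i + context]
            hits.append((i, f"0x{b:02X}", snippet))
    return hits

def replace_byte(data: bytes, code: int, replacement: bytes = b" ") -> bytes:
    """Once you know a byte's code, replace every occurrence of it."""
    return data.replace(bytes([code]), replacement)
```

You would read the file in binary mode (`open(path, "rb").read()`) before passing it in, so nothing is decoded or mangled along the way.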
Thank you.
I wound up treating line 1 as a line of data rather than as variable names, because that line contained characters that are illegal in variable names. The question I want to ask now concerns what I suppose are either true non-printing characters (CR, tab) or valid UTF-8 characters, such as a curly apostrophe, which shows as a little square in the Data Editor and as another symbol in Word. Is there a way to search for and replace them, or to edit them by reading their hex or PIB code and changing it to a printable symbol? I remember questions have come up over the years about finding and removing or changing true non-printing characters.
Gene Maguin
From: Jon Peck <[hidden email]>
Sent: Wednesday, June 19, 2019 10:11 PM
To: ,
Subject: Re: [SPSSX-L] reading long record csv files
You can do the file handle like that, but there is no need for the mode=character.
The only thing the file handle does is to convey the large lrecl. The default lrecl is only 8k. So then you
would specify /FILE="baseline" in the GET DATA command.
You do need the VARIABLES subcommand. It's strange that the backend has no way to pick up the variable names from the csv file, while the Text Wizard does that for you; it's probably due to the need to specify a format for each variable. The alternative would be a small bit of Python code that reads the first line and generates the VARIABLES subcommand, but the Text Wizard does that anyway. If you need it, though, I can easily write it for you.
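A minimal Python sketch of that idea, assuming a comma-delimited header line and AUTO formats (so DATATYPEMIN applies); the name legalization here is deliberately crude compared to SPSS's actual rules:

```python
# Generate a GET DATA /VARIABLES subcommand from a csv header line.
import csv
import re

def make_variables_subcommand(header_line: str, delimiter: str = ",") -> str:
    names = next(csv.reader([header_line], delimiter=delimiter))
    clean = []
    for n in names:
        # Crude legalization: drop non-word characters, ensure a letter start.
        n = re.sub(r"\W", "", n) or "v"
        if not n[0].isalpha():
            n = "v" + n
        clean.append(n)
    return "/VARIABLES=\n" + "\n".join(f"  {n} AUTO" for n in clean)
```

You would read the first line of the data file, pass it to this function, and paste the result onto the end of the GET DATA command.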
It went through. And thank you.
I was going to write the file handle like this, leaving out the RECFM element:
file handle baseline /name='U:\string1\string2\string3\filename.csv'
  /mode=character /lrecl=132000.
It sounds like File > Import Data will read, or maybe 'read' is the wrong word, the file, and I can specify that variable names are on line 1, start reading at line 2, etc., and paste to a syntax file. Having done this before, the file name would be the one named above. If this is true, what does a file handle add or do?
Is it true that in the example below, removing the VARIABLES subcommand would yield an error because GET DATA will not read, was not designed to read, variable names from line 1? Thus a VARIABLES subcommand is required. I think you are telling me this indirectly, but I'm not sure. I'm curious here; my job is to read the file.
Thanks, Gene Maguin
From: Jon Peck <[hidden email]>
Sent: Wednesday, June 19, 2019 5:42 PM
To: ,
Subject: Re: [SPSSX-L] reading long record csv files
Strange email address. I hope this goes through.
You can just overestimate the necessary LRECL at the cost of a small extra memory allocation (but keep it less than 2GB :-)). If all the characters are regular ASCII, the number of bytes will be the number of characters.
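To make the byte-versus-character point concrete, here is a short Python sketch (the path is a placeholder) that finds the longest record in bytes, which is what LRECL measures. Reading in binary mode means multi-byte UTF-8 characters are counted as bytes, not characters.

```python
# Find the longest record in a file, measured in bytes (what LRECL wants).

def max_record_bytes(path: str) -> int:
    longest = 0
    with open(path, "rb") as f:          # binary mode: count bytes, not chars
        for line in f:
            longest = max(longest, len(line.rstrip(b"\r\n")))
    return longest
```

You would then round the result up comfortably when writing the FILE HANDLE command, since overestimating costs only a little memory.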
In the file handle statement, leave out the recfm parameter. The name parameter would be the filespec or at least the start of it. You can continue it by referencing the handle as the first part of the filespec in the GET DATA command. The
rule is that if the first part of the filespec matches a file handle, the handle is used.
GET DATA /TYPE=TXT is what you want. You can build the command with the Text Wizard (File > Import Data), but with a huge number of variables that may be too clumsy, and the preview might be painful; still, it will generate all the field names and formats for you. Paste the syntax so you can double-check. It would be something like this (I didn't use a file handle here):
GET DATA /TYPE=TXT
/FILE="C:\data\1991GSS.csv"
/ENCODING='UTF8'
/DELIMITERS=","
/QUALIFIER='"'
/ARRANGEMENT=DELIMITED
/FIRSTCASE=2
/DATATYPEMIN PERCENTAGE=95.0
/VARIABLES=
sex AUTO
race AUTO
region AUTO
happy AUTO
life AUTO
sibs AUTO
childs AUTO
If the file is actually tab delimited, you could use GET TRANSLATE to avoid the need to list the variable names.
...
Further questions.
"Length in bytes." Practical question: you're given a delimited file; how do you determine the record length (or get close enough)? Specific question: SAS says UTF-8 and 132K in round numbers. Can I conclude that this is the byte length and not the character length?
Next. I have a file handle setup written. Now I need, I believe, to write the command to read the file. It looks like the only real (and best) choice is GET DATA. I'm assuming I can't use the menu system for this. True? In this read operation, I'd like to read the first line as the variable names and the second and following lines as data. But as I read the GET DATA documentation for TYPE=TXT, there appears to be no way to specify line 1 as variable names. True? If that's true, it appears I have to name the variables. However, if the source, Qualtrics, provides that information, I don't yet know how to find it, and the number of variables is fairly large, 1000 plus.
If GET DATA /TYPE=TXT won't read line 1 as variable names, is the only solution to read the csv file into Excel and define line 1 as variable names there? If I do that, do I need the FILE HANDLE command and a GET DATA command, or can I use the menu?
Thanks, Gene Maguin
From: Jon Peck <[hidden email]>
Sent: Tuesday, June 18, 2019 9:09 AM
To: Maguin, Eugene
Cc: SPSS List
Subject: Re: [SPSSX-L] reading long record csv files
The file handle should be sufficient. It does not need to be the exact length - just long enough to include the maximum record length. Remember that the LRECL specifies the length in bytes, not characters.
On Tue, Jun 18, 2019 at 6:57 AM Maguin, Eugene <[hidden email]> wrote:
How would I read a csv file with a very long record? The file is an export from Qualtrics, which puts the item text as row 2 in the export file (row 1 is variable names, of course). That row is about 132K in length; the rest of the rows are 3K to 5K. Is it enough to specify a file handle, like
file handle widefile /name='path' /mode=character /recfm=v /lrecl=132000.
or is more needed?
Does the LRECL value need to be the exact value or just longer than the longest record? Related: how exactly does a person determine the LRECL value? (I know it, I hope, from a SAS attempt to read the file.)
I could read row 1 to get variable names, skip row 2, and start reading at row 3. Is a file handle required if I do this?
Thanks, Gene Maguin
===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD