reading long record csv files


reading long record csv files

Maguin, Eugene

How would I read a CSV file with a very long record? The file is an export from Qualtrics, which puts the item text in row 2 of the export file (row 1 is variable names, of course). That row is about 132K long; the rest of the rows are 3K to 5K. Is it enough to specify a file handle, like


file handle widefile /name='path' /mode=character /recform=variable /lrecl=132000.

or is more needed?


Does the lrecl value need to be the exact value, or just longer than the longest record? Related: how exactly does a person determine the lrecl value? (I know it, I hope, in this case from a SAS attempt to read the file.)


I could read row 1 to get variable names, skip row 2, and start reading at row 3. Is a file handle required if I do this?


Thanks, Gene Maguin


Re: reading long record csv files

Jon Peck
The file handle should be sufficient. It does not need to be the exact length, just long enough to cover the maximum record length. Remember that LRECL specifies the length in bytes, not characters.
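If it helps, one way to determine a safe LRECL is to scan the file in binary mode and take the longest line in bytes. A minimal Python sketch (the path is a placeholder):

# Minimal sketch: find the longest record, in bytes, in a delimited text file.
# Binary mode means len() counts bytes, not characters.
max_len = 0
with open(r"U:\path\to\export.csv", "rb") as f:
    for line in f:
        max_len = max(max_len, len(line.rstrip(b"\r\n")))
print("longest record:", max_len, "bytes")

Round the result up and use that as the LRECL.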



--
Jon K Peck
[hidden email]


Re: reading long record csv files

Maguin, Eugene

Thank you. 

I wound up treating line 1 as a line of data rather than as variable names, because that line had characters that are illegal in variable names among the values to be treated as names. The question I want to ask now is about what I suppose are either true non-printing characters (CR, tab) or valid UTF-8 characters, like a curly apostrophe, which show as a little square in the Data Editor and as another symbol in Word. Is there a way to search for and replace them, or to edit them by reading their hex or PIB code and changing it to a printable symbol? I remember questions have come up over the years about finding and removing or changing true non-printing characters.


Gene Maguin



From: Jon Peck <[hidden email]>
Sent: Wednesday, June 19, 2019 10:11 PM
Subject: Re: [SPSSX-L] reading long record csv files
 
You can do the file handle like that, but there is no need for the mode=character.
The only thing the file handle does is to convey the large lrecl.  The default lrecl is only 8k. So then you would specify /FILE="baseline" in the GET DATA command.

You do need the VARIABLES subcommand. It's strange that there is no way for the backend to pick up the variable names from the csv file when the text wizard does exactly that for you; it's probably due to the need to specify the format for each variable. The alternative would be a small bit of Python code that reads the first line and generates the VARIABLES subcommand. If you need that, I can easily write it for you.
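Along those lines, a rough sketch of such a helper (the path is a placeholder, and the name cleanup is deliberately crude; adjust to taste):

import csv
import re

# Rough sketch: emit a GET DATA /VARIABLES block from the CSV header row.
# AUTO leaves the choice of each variable's format to GET DATA.
with open(r"U:\path\to\export.csv", newline="", encoding="utf-8") as f:
    header = next(csv.reader(f))

seen = set()
print("  /VARIABLES=")
for raw in header:
    name = re.sub(r"\W", "_", raw.strip()) or "var"  # replace illegal characters
    if name[0].isdigit():                            # names cannot start with a digit
        name = "v" + name
    while name.lower() in seen:                      # SPSS names are case-insensitive
        name += "_"
    seen.add(name.lower())
    print("  " + name + " AUTO")

Paste the printed block into the GET DATA command in place of a hand-typed VARIABLES list.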


On Wed, Jun 19, 2019 at 7:57 PM <[hidden email]> wrote:

It went through. And thank you.


I was going to write the file handle like this, leaving out the RECFORM subcommand:

file handle baseline /name='U:\string1\string2\string3\filename.csv'
  /mode=character /lrecl=132000.


It sounds like File > Import Data will read, or maybe 'read' is the wrong word, the file, and that I can specify that variable names are on line 1, start reading at line 2, etc., and paste the result to a syntax file. Having done this before, the file name would be the one named above. If this is true, what does a file handle add or do?


Is it true that in the example below, removing the VARIABLES subcommand would yield an error, because GET DATA will not read, was not designed to read, variable names from line 1, and thus a VARIABLES subcommand is required? I think you are telling me this indirectly, but I'm not sure. I'm curious here; my job is to read the file.


Thanks, Gene Maguin




From: Jon Peck <[hidden email]>
Sent: Wednesday, June 19, 2019 5:42 PM
Subject: Re: [SPSSX-L] reading long record csv files
 
Strange email address.  I hope this goes through.

You can just overestimate the necessary LRECL at the cost of a small extra memory allocation (but keep it less than 2GB :-)).  If all the characters are regular ASCII, the number of bytes will be the number of characters.

In the FILE HANDLE statement, leave out the RECFORM subcommand. The NAME parameter would be the filespec, or at least the start of it. You can continue it by referencing the handle as the first part of the filespec in the GET DATA command. The rule is that if the first part of the filespec matches a file handle, the handle is used.

GET DATA /TYPE=TXT is what you want. You can build the command using the text wizard (File > Import Data); with a huge number of variables the preview may be painfully clumsy, but the wizard generates all the field names and formats for you. Paste the syntax so you can double-check it. The syntax would be something like this (I didn't use a file handle here):
GET DATA  /TYPE=TXT
  /FILE="C:\data\1991GSS.csv"
  /ENCODING='UTF8'
  /DELIMITERS=","
  /QUALIFIER='"'
  /ARRANGEMENT=DELIMITED
  /FIRSTCASE=2
  /DATATYPEMIN PERCENTAGE=95.0
  /VARIABLES=
  sex AUTO
  race AUTO
  region AUTO
  happy AUTO
  life AUTO
  sibs AUTO
  childs AUTO

If the file is actually tab delimited, you could use GET TRANSLATE to avoid the need to list the variable names.
...
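If you are not sure which delimiter the export actually uses, Python's csv.Sniffer can guess it from a sample; a small sketch with a placeholder path:

import csv

# Small sketch: guess the delimiter from the first 64KB of the file,
# restricting the candidates to comma and tab.
with open(r"U:\path\to\export.csv", newline="", encoding="utf-8") as f:
    sample = f.read(64 * 1024)
dialect = csv.Sniffer().sniff(sample, delimiters=",\t")
print("delimiter:", repr(dialect.delimiter))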



On Wed, Jun 19, 2019 at 3:10 PM <[hidden email]> wrote:

Further questions. 

"length in bytes". Practical question. you're given a delimited file. How do you determine the record length (or get close enough)?  Specific question. So, sas says UTF8 and 132K in round numbers, can i conclude that this is byte length and not the character length?


Next. I have a FILE HANDLE setup written. Now I need, I believe, to write the command to read the file. It looks like the only real (and best) choice is GET DATA. I'm assuming I can't use the menu system for this. True? In this read operation, I'd like to read the first line as the variable names and the second and following lines as data. But as I read the GET DATA documentation for TYPE=TXT, there appears to be no way to specify line 1 as variable names. True? If that's true, it appears I have to name the variables myself. However, if the source, Qualtrics, provides that information, I don't yet know how to find it, and the number of variables is fairly large, 1000 plus.


If GET DATA /TYPE=TXT won't read line 1 as variable names, is the only solution to read the CSV file into Excel and define line 1 as variable names there? If I do that, do I need the FILE HANDLE command and a GET DATA command, or can I use the menu?


Thanks, Gene Maguin




Re: reading long record csv files

Jon Peck
The text wizard would automatically make the names into valid SPSS names.

As for the odd characters, the usual reason for this is that the character encoding specified did not match the data. Does your raw data contain non-ASCII characters? It might not be UTF-8 but rather some other code page. CR and LF would not show up in the way you describe; I'm assuming these are occurring in character data. That first line might also help to figure this out.

Can you send me a screen shot of what you are seeing in the Data Editor and, if possible, the corresponding chunk of raw data or at least a picture?

It is possible to replace nonprinting characters, but you have to know what the codes are. And if what you see looks like a black diamond with a white question mark in it, that indicates that something was read that is not a valid UTF-8 character.
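One way to learn which codes are involved is to scan the raw file and tally everything outside printable ASCII; a minimal sketch (the path is a placeholder):

from collections import Counter

# Minimal sketch: tally characters outside printable ASCII with their code points.
# errors="replace" turns invalid UTF-8 bytes into U+FFFD, which also gets counted.
counts = Counter()
with open(r"U:\path\to\export.csv", encoding="utf-8", errors="replace") as f:
    for line in f:
        for ch in line:
            if ch not in "\t\r\n" and not 32 <= ord(ch) <= 126:
                counts[ch] += 1
for ch, n in counts.most_common():
    print("U+{:04X} {!r}: {} occurrence(s)".format(ord(ch), ch, n))

Seeing U+FFFD in the output would confirm that the data is not valid UTF-8. Once you know the characters, the REPLACE string function in a COMPUTE can swap each one for a printable equivalent.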


--
Jon K Peck
[hidden email]
