I have become slightly involved in a project whose basic aim is to study the development of allergy and asthma among children. All children born at the local hospital will be followed for some time, and during that time there will be several questionnaires to parents, as well as measurements of different variables in blood, faeces, placenta and so on. All files will of course contain a key/connector for each child. The question from the researchers, now that the data is beginning to pile up, is how the material should be organized. (Somewhat late, in my own view…)

One suggestion has been to put everything into some database management system, such as Access or an SQL database. Another is that files should be kept “local”, saved in a suitable format (SPSS, text,…) and stored in a well-organized folder structure. Since I have limited experience of my own with Access or similar systems, especially with large and complex data sets, and since I know several of the researchers involved are fairly good SPSS users, I tend to lean towards the latter structure. I simply suspect that storage as well as maintenance and analysis will be easier with such a structure. But this might be entirely wrong, given my clearly far too limited experience. Any suggestions as to how a good structure for the management of a complex set of data sets might be built, the SPSS way or some other? Robert
Robert Lundqvist
In general I would organize as follows:
Question set A:

  ID  Time  Var1 ... VarP
   1    1   ....
   1    2   ....
   1    k   ....
   2    1
   2    2
   2    k
   3    1
   3    2
   3    k
  ......

Question set N: same layout (ID, Time, Var1 ... VarP), one record per child per time point.

You can then MATCH files as needed BY ID and Time, retaining the variables needed for specific analyses. I would AVOID at all costs a structure of the following sort, although people do build it and REGRET IT after great suffering and much gnashing of teeth:

  ID  Q1T1 Q2T1 ... QpT1  Q1T2 Q2T2 ... QpT2  ...  Q1Tk ... QqTk

ARGH!!! If your grant has funds for an external offsite consultant, I might be interested in lending a hand. PM me if that is the case.
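A minimal sketch of that merge step, assuming two hypothetical question-set files that share the ID and Time key; the file names, paths and variable names are placeholders, not from the project:

* Sketch: merge two question sets by child ID and time point.
* File names and paths below are placeholders.
GET FILE='C:/project/qsetA.sav'.
SORT CASES BY ID Time.
DATASET NAME setA.
GET FILE='C:/project/qsetN.sav'.
SORT CASES BY ID Time.
DATASET NAME setN.
* One-to-one merge; assumes the substantive variables have distinct names across sets.
MATCH FILES /FILE=setA /FILE=setN /BY ID Time.
EXECUTE.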
In reply to this post by Robert L
Either scheme could work. I would go with whichever technology the team is most comfortable with. But do be careful to plan out a backup strategy for the materials, and plan for auditability/reproducibility of the work through documentation, syntax files, and enhanced metadata via variable and data file attributes. As you are doubtless aware, Statistics has many built-in features to facilitate this.

However, you might not be aware of the GATHERMD extension command, which builds a dataset of the variables across a tree of SPSS and other format files. That can be really handy when you need to figure out which files contain particular variables of interest.

Whatever the approach, it would be important to have written procedures to guide the team, especially since the project will apparently run over an extended period of time. Good luck.
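To illustrate the attributes idea, here is a minimal sketch of recording provenance metadata directly in a data file; the attribute names, variable name and values are invented for the example, not part of any standard:

* Sketch: attach provenance metadata to the file and to a variable.
* Attribute names, the variable ige_total, and all values are examples only.
DATAFILE ATTRIBUTE ATTRIBUTE=Source('Questionnaire A, 12-month follow-up')
  CollectedBy('Study nurse team').
VARIABLE ATTRIBUTE VARIABLES=ige_total
  ATTRIBUTE=Unit('kU/L') LabMethod('ImmunoCAP').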
In reply to this post by Robert L
It might seem obvious, but be sure to store/archive the original, unmodified data in some file and location, and do not overwrite the original data files.

-- Tony Babinec
-- ASA Council of Chapters Chair
-- Joint Statistical Meetings 2017 Program Committee
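One way to make that harder to get wrong, sketched here with a hypothetical archive path, is to save the pristine copy read-only at the point of first import:

* Sketch: archive the untouched import as a read-only copy.
* The path and file name are placeholders.
SAVE OUTFILE='C:/project/archive/wave1_raw.sav'
  /PERMISSIONS=READONLY.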
In reply to this post by Jon Peck
Thanks for all the positive feedback. I’ve forwarded your responses to Robert Lundqvist, but got a bollocking for circulating the request to the QM teaching list (attached). John

“The QM teachers list is about QM teaching. It is NOT a list for people to go touting for stats advice. I've had to write you before about the complaints I get from other list members about your irrelevant posts. Please take this as a final warning, or I'll reluctantly have to block you from the list. John”
In reply to this post by Robert L
It's not clear to me how many people are going to be "touching" this data, but I think that complex data structures are best stored in a relational database which is queried to produce flat tables for analysis in SPSS. Multiple tables with cross-table keys make ad hoc analyses to further explore interesting relationships a lot easier. At a minimum you have children, multiple tests per child, families, siblings?, parents, health professionals, etc. All of these define keys that could be used to extract tables.

Databases are also way more secure than files, and relationships between tables are more explicit than files organized into folders.

Presumably children, tests, etc. are being added all the time. How are you going to handle additions? If more than two people are involved in this enterprise, one of them should be responsible for maintaining the data. YMMV
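For what the query-to-flat-table step might look like from the SPSS side, here is a minimal sketch using GET DATA over an ODBC source; the DSN, table names and column names are all invented for the example:

* Sketch: pull a flat analysis table from a relational database via ODBC.
* The DSN, tables, and columns are placeholders.
GET DATA
  /TYPE=ODBC
  /CONNECT='DSN=AllergyStudy'
  /SQL='SELECT c.child_id, t.sample_date, t.ige_total'
       ' FROM children c JOIN blood_tests t'
       ' ON c.child_id = t.child_id'.
CACHE.
EXECUTE.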
-- ViAnn Beadle
My first thought was that "ONE of them should be responsible" should be emphasized. On further thought, I came up with some other considerations.

What is the scope of the project? It sounds like it involves all the doctors at the hospital who deal with infants -- or, maybe, who deal with their families. This is a single hospital. Surely it either has, or plans to have, a computerized database of lab reports and so on, outside of the control of the researchers. Surely the researchers should have access to the general records. How do they tap into that? How do they feed into that?

Maybe I am over-estimating the project, and what follows could be irrelevant. SPSS has always suited my own needs for a research database -- a single programmer (me) preparing files for analysis. But I believe that a hospital database of records needs to be relational in order to allow (and control) access by individuals. And if this includes a lot of general knowledge that the physicians should know, then the physicians should be able to look up a single patient through a general relational database, either the one run by the hospital or one that operates very similarly. That is (a) good for the level of care, and (b) possibly vital for sustaining good cooperation in providing good data.

-- Rich Ulrich
In reply to this post by Robert L
To all who have responded to my previous question on data management: many thanks! There have been several responses and contributions, both on and off the list, and I thought I might try to sum up some of them.
• Most important is perhaps the emphasis on planning. Some of the responders have said that there is no single technical solution, but that there should be a (single) person responsible for the documentation and maintenance of the data.
• As for the choice between a clean SPSS solution, with files connected by an ID for each child/family, and some relational database, more responders lean towards the database approach. However, one drawback would be that the direct labelling of variables and values/codes would then be lost, so there are arguments for an SPSS solution. (The labelling could of course be achieved by having syntax which is run each time a call to the database is made; see the sketch at the end of this post.) Another solution would be to set up the database so that labels were also imported. Any other suggestions for this critical aspect?
• Via John Hall, who re-sent the posting to mailing lists for data management, some responders have pointed at REDCap (https://projectredcap.org/).
• Yet another suggestion was to have a look at Colectica, which is a standalone commercial system.

As of writing, we will test the Colectica solution, mainly because it seems to provide tools for importing data from SPSS, SAS and Stata together with labels, and also for importing Excel files where labels and metadata can be added. This could turn out to be handy in a project such as this one, since there will be a handful of research groups involved, working on different statistics platforms.

Again, thanks for all contributions. It seems as if this topic, systems or structures for the organization and maintenance of complex research data, deserves much more attention. Robert
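A minimal sketch of that relabelling syntax, run after each pull from the database; the variable names, labels and codes are invented placeholders, not the project's actual coding scheme:

* Sketch: reapply labels after a database query strips them.
* Variable names, labels, and codes below are placeholders.
VARIABLE LABELS
  ige_total 'Total IgE at follow-up (kU/L)'
  asthma_dx 'Physician-diagnosed asthma'.
VALUE LABELS asthma_dx 0 'No' 1 'Yes' 9 'Missing/unknown'.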
Robert Lundqvist
One comment on the metadata issue: some users keep the variable and value labels (and other metadata) as fields in the database and extract and apply these when queries are made. Or use APPLY DICTIONARY after creating a comprehensive variable dictionary to apply after each query from the database.
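A sketch of the APPLY DICTIONARY route, assuming a hypothetical master dictionary file that holds the project's full labelling; the path is a placeholder:

* Sketch: copy labels and formats from a master dictionary file
* onto the active dataset after each database query.
* The file path is a placeholder.
APPLY DICTIONARY FROM='C:/project/master_dictionary.sav'.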