Friday, April 18, 2014

Data analysis: ten tips I wish I'd known sooner





I enjoy analysing data and I've been doing it for years, but I still do things inefficiently. All the same, I have learned some things along the way. As I work through another complex dataset, I thought it worth sharing some of the most useful tips I've picked up to make life simpler and smoother.  Some are very elementary and they will mostly be of relevance to psychologists who use Excel and SPSS to do fairly straightforward analyses, though some points are more generic.



A lot of these tips have to do with deploying those frontal lobes and thinking ahead: you need to be aware of three things:


  • How you set up a dataset can make a big difference to how easy it is to analyse later on

  • You will not remember anything about your data in a few years (or even months) time

  • There are increasing moves towards data-sharing - i.e. making your data available on a public repository and you need to plan for that too.



1. Label your subjects consistently and anonymously


You will almost certainly be required to anonymise data collected from human subjects. It is amazing how often people sign up to this in their ethics application but then forget all about it. I've seen supposedly anonymised data identified by people's names, initials and/or dates of birth. Don't do it!



I find simple consecutive number codes work fine, with a prefix denoting which study the subjects  come from. There are two things to think about, in addition to anonymisation. First, will it be useful to be able to sort subjects by code number into specific groups? For instance, if you have three groups identified as young, middle-aged and old, you might think of labelling them with Y1, Y2, .... M1, M2 etc.  But in certain computing systems, this will mean that when you generate output, e.g. means for each group, they will occur in alphabetic order, so M then O then Y.  So it may be  better to use codes that will follow a natural sequence - this means when you want to paste output into a table you don't need to fiddle about with it.



Another thing to note is that if you have a program that treats codes as alphabetic, then if you have, say, 20 subjects, and you sort them, they will come out in the order: S1, S10, S11, S12, S13, S14, S15, S16, S17, S18, S19, S2, S20, S3, S4, S5, S6, S7, S8, S9. This mildly irritating feature can be avoided if you ensure all codes are the same length, e.g. S01, S02, and so on.



On the other hand, if you are generating raw data files that you want to process automatically through some analysis package, make sure you have codes that can be easily read in a loop. It's much easier to tell a program to sequentially analyse files beginning with S and ending in numbers 1 to 20, than it is to type in each subject code separately.



And if you are going to analyse a whole set of files using an automated procedure, use a logical folder structure to organise your files. I've been stymied by finding that someone has carefully organised data files so that all the data for one child on various tasks are in one folder, and these folders are then within other folders that group them by age. I prefer it if all the files that are to be analysed together are kept together: provided the file-naming system is well-constructed, there should not be any chance of confusing who is who, and you can then point the analysis program just to one relevant folder, without a lot of if statements.




2. Label your variables in a consistent and intuitive fashion that will work across platforms


If you are going to end up analysing your results in SPSS, start out with variable names that SPSS will accept - i.e. no blanks, leading numbers or prohibited characters should be included. Think ahead to the paper you plan to write about the results and consider the variable names you will use there. I seldom obey my own advice here, but it's a common source of irritation to reviewers if you are inconsistent in how you refer to a variable. Much better to start as you mean to go on as far as is possible.




3. Use 'freeze panes' in Excel


This is about as basic as it gets, but I'm surprised at how many people don't know about it. In Excel, when you scroll down or across your file, the variable names or subject IDs scroll off the screen. If you place your cursor in the cell just below the variable names and just to the right of the subject IDs (i.e. the first cell of data), and select View|Freeze panes, the rows and columns above the current cell will stay put when you scroll. If you get it wrong, you can always unfreeze.



There's a rather clunky method that lets you freeze panes in SPSS, which can be useful if you want to see subject IDs while scrolled over to the right: see here for instructions


4. Keep an explanatory list of your variables


In SPSS, the 'labels' field can be useful for keeping a record of what the variable is, but it is limited in length and sometimes more detail is needed. It's worth keeping a data coding file which lists all variables and gives a brief description of what they are, what missing value codes are, and so on. This is absolutely critical if you plan to deposit data in an archive for sharing. It's good practice to work as if that is going to be the case.




5. Use one big file, rather than lots of little files, and hide variables that you aren't currently using




If you have a large dataset, it's tempting to break it up to make it more manageable. You may have hundreds of variables to contend with. You may therefore be tempted to pull out variables of current interest and store in a separate file. The problem is that if you do that, it's easy to lose track of where you are.



Suppose you have a giant master file, and you select a few variables to make a new file, and then in the new file you find an error. You then need to correct it in both files, which is tedious - so tedious that you may not bother and will then end up with different versions and be unsure which is correct.



Also, you may want to look at relationships between the variables you have extracted and other variables in the master file. This involves more fiddling with files, which is a good way of generating errors, especially if you use cut and paste.  So my advice is stick with one master file, which is scrupulously labelled with a version number when you update it. You can avoid the 'too much data' problem by just hiding bits of file that aren't currently in use.



In Excel, it's simple to just hide rows or columns that you aren't using. You just select the columns you want to hide and select View|Hide. If you want to see them again, you select the columns adjacent to the hidden columns and select View|Unhide.



You can do a similar thing in SPSS by defining Variable Sets. It's a bit more fiddly than the Excel operation, but once you have defined a variable set, it is saved with the file and you can reselect it very easily. Instructions for doing this are here.



This website, incidentally, is a treasure trove of useful advice, including instructions on  how to produce APA formatted tables from SPSS output.




6. Never name a file with the suffix 'final' and always back up key data


It's just tempting fate to call a file 'final'. You will end up with files called 'final final' or 'really final' or 'final v.10'. Better to use dates to identify the most recent version.



The back-up advice is blindingly obvious but even the most seasoned of us still forgets to do it. If you've put in several days' work on something, you need to have it stored safely in case your computer dies or gets stolen.


7. Look at your data


Before rushing into an analysis, it is important to plot your data. This can be a good way of picking up gremlins in the dataset, such as values which are out of range. For instance, if you have raw scores and scaled scores on a variable, plot one vs the other in a scatterplot - depending on whether there is age variation in the sample, scores should either fall on a straight line, or at least cluster around it.

You should also check whether variables are normally distributed, and if not consider how to deal with this if your analysis assumes normality. Andy Field's textbook Discovering Statistics Using SPSS gives helpful advice on this general topic (and many more!).


8. Keep a log of your analysis


I find it helpful to think through what I want to do, then make a list of the analyses and outputs that I need and work through them systematically. In practice, things seldom go smoothly. You may find that a subject's file is missing, or corrupted, or that data need transforming in some way. It's traditional to work with a lab book in which such things are recorded, though I prefer to do this electronically when at the analysis stage, by just keeping a Word document open, and making a note of everything as I go along, under the relevant date.



In this log I note the names of the files I've created and their location and what they do. This really is necessary if you are to avoid a future time when you have to wade through numerous versions of numerous files trying to find the one you can only dimly remember creating.



I also record problems in the log. For instance, suppose you have a task that was wrongly administered in a couple of cases so the data will have to be excluded. The log is where you record the date at which this was noted, which data were affected and what action was taken. If you don't do this, you may well find that you, or someone else, comes back to the dataset at a later time and cannot make sense of it. Or, as I have done all too often, spends a morning discovering the same problem that was already found months previously.


9. Use scripting in SPSS


The best advice of all is to keep a script of your analysis if you are working in SPSS. Even if you know nothing about scripting, this is very easy to do. Set up your analysis using the menu, and instead of hitting OK, you hit Paste. This opens a script window showing the script-version of the commands you have selected. You can then select all or part of the script and run it by hitting the big green arrow at the top of the script window. If you then run another analysis, and again hit Paste, the new syntax will be appended at the bottom of the script.



There are four advantages to doing things this way:


  • You can save the script and thus have a permanent record of the analysis you have carried out. This complements the log you are keeping (and you can save its name in the log).

  • You can annotate the script. Any text you type in that begins with an asterisk and ends with a full stop is treated as a comment. You can use this to remind yourself of what the analysis does.

  • You can come back and re-run the analysis. Suppose you find that you had an error in your dataset and you had to correct one person's data. Instead of having to laboriously reconstruct the whole analysis, you just  re-run the script.

  • This is a good way to learn about scripting. A lot can be picked up by just taking the auto-generated script as a model and tweaking things like the variables that are analysed. If you want to do a parallel analysis on several variables, it is much easier to copy and paste the relevant section of a script and change one or two variable names than to go through a menu.



10. Check all analyses before publishing


It is really important to check all your analyses before you submit a paper. Always go through a paper and make sure that all reported tables and analysis are reproducible. It is amazing how often you re-run an analysis and things come out differently. It can seem sometimes that a malign spirit inhabits your computer and is just teasing you. Even if the  differences are not serious, they can still be a source of worry. In my experience, they often have to do with things like forgetting to select the correct subset of participants for a specific analysis - if, - for instance, there are exclusionary criteria. Or a missing data code may have been  inadvertently been treated as data. In the worst case you may find you cut and pasted something into a file forgetting that the file had been sorted in a different order. And sometimes, just sometimes, both Excel and SPSS files can get corrupted. It can be useful to have data in both formats for an overall consistency check on things like mean values.



This is where SPSS scripting comes into its own. The whole business of re-running the analyses becomes much more tractable with a script. Furthermore, in a final check, you can further annotate the script, noting which table in the paper corresponds to a particular output and so have a nice, clear account of what you did to refer back to.




No comments:

Post a Comment