How to Remove Personally Identifiable Information from a Data Set
It is generally best practice not to store Personally Identifiable Information, such as names and email addresses, with survey data, particularly when you need to share the survey data with third parties. Even so, this kind of information can end up in your data set for a variety of reasons, and you may need to remove it. It is generally best to ask your data provider to remove this information before sending you a file for analysis (or to send you a new file if you discover personal information after you have started working on the data), but you can also remove it yourself when that option is not available.
This page describes the steps that you can take to remove personal information from a data set that you are working on in Q.
SPSS Files
When using an SPSS file, you need to save a new copy of the file without the personal information and then update your Q project with the new file. The steps are:
- Select File > New Project.
- Select File > Data Sets > Add to Project > From File.
- Choose your data file and click Open.
- In the Data Import window:
  - Choose Use original data file structure.
  - Untick Advanced > Tidy Up Variable Labels.
  - Untick Strip HTML from Labels.
  - Click OK.
- Go to the Variables and Questions tab.
- For any variable that contains personal information, click the yellow H in the Tags column.
- Select Tools > Save Data as SPSS/CSV File and click OK.
- Choose a name and location for your new file and click Save.
- Select File > Open and open your Q Project.
- Select File > Data Sets > Update and choose your data set.
- Select the new version of your file and click OK.
Excel/CSV Files
To remove personal information when using an Excel-style file, you can:
- Close Q.
- Open the file in Excel.
- Delete any columns containing personal information and save the file.
- Open your Q project again.
- Select File > Data Sets > Update and choose your data set.
- Select the new version of your file and click OK.
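If you would rather script the column removal for a CSV file than delete the columns by hand in Excel, the following is a minimal sketch using Python's standard csv module. It is not part of Q. The column names Name and Email and the function names are hypothetical, and the sketch assumes the first row of the file holds the column names.

```python
import csv

# Hypothetical column names; replace them with the columns that
# hold personal information in your own file.
PII_COLUMNS = {"Name", "Email"}


def drop_columns(reader, writer, columns_to_drop):
    """Copy rows from a csv reader to a csv writer, leaving out the named columns."""
    header = next(reader)  # assumes the first row contains the column names
    keep = [i for i, name in enumerate(header) if name not in columns_to_drop]
    writer.writerow([header[i] for i in keep])
    for row in reader:
        # Skip positions that are missing from short (ragged) rows.
        writer.writerow([row[i] for i in keep if i < len(row)])


def strip_pii(source_path, target_path, columns_to_drop=PII_COLUMNS):
    """Write a copy of the CSV at source_path without the given columns."""
    with open(source_path, newline="", encoding="utf-8") as src, \
         open(target_path, "w", newline="", encoding="utf-8") as dst:
        drop_columns(csv.reader(src), csv.writer(dst), columns_to_drop)
```

For example, strip_pii("survey.csv", "survey_clean.csv") would write a cleaned copy (both file names are placeholders), which you can then select when updating the data set in Q. Writing to a new file rather than overwriting the original lets you check the result before you discard the version that contains personal information.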
R Data Set
If you are using an R data set then you can remove any personal information by excluding the relevant columns from the data frame that is generated by your code. To do so, select File > Data Sets > Update and select your data set, then amend your code.
SQL
If you are using an SQL data set then you can remove any personally identifiable information by modifying the SQL query so that it no longer returns the relevant columns. To do so, select File > Data Sets > Update, and choose the SQL data set. Because queries vary so widely, we can't give specific advice. If someone else wrote the query, you should contact them and ask them to send you an updated version.
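As a general illustration only, the usual change is to list the columns you need rather than selecting everything. The sketch below uses Python's built-in sqlite3 module purely so that the example can be run. The table name responses and the columns id, name, email and q1 are hypothetical, and your own database and SQL dialect may differ.

```python
import sqlite3

# In-memory database standing in for a real survey database (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE responses (id INTEGER, name TEXT, email TEXT, q1 INTEGER)")
conn.execute("INSERT INTO responses VALUES (1, 'Ann', 'ann@example.com', 5)")

# A query such as "SELECT * FROM responses" would return name and email.
# Listing only the columns needed for analysis leaves them out.
query = "SELECT id, q1 FROM responses"

cursor = conn.execute(query)
columns = [description[0] for description in cursor.description]
rows = cursor.fetchall()
```

The same idea applies to the query you paste into Q: replace any SELECT * with an explicit column list that omits the personal information.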