Create New Variables - Impute Missing Data
This tool creates new Questions in your data set where any missing values are filled in using imputation. Imputation is a process of creating estimates for missing values using the distribution of existing values in one or more variables. When you use this tool, a new Question is added for each Question that was selected. These new Questions are imputed together, meaning that all of the variables are used to derive the imputed values. In addition, you can later add Auxiliary variables, which are additional variables whose data is used to inform the imputation, but which are not themselves added to the data set.
Example
The following table shows raw data for responses from two survey questions which asked respondents how many text messages they send in a typical week, and how much they spend per month on their phone bill:
The first two columns show the original data, and respondents 12, 15, 21, and 22 have some missing values. The second two columns show the imputed versions of the same two variables, and those respondents have now been assigned values based on the imputation.
The settings for these new imputed variables are as follows:
This tells us that the new imputed data is derived from four variables in total:
- Text messages per week
- Average monthly bill
- Age
- Gender
The final two variables, Age and Gender, are Auxiliary variables: they inform the imputation but have not themselves been added to the data set.
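The role of Auxiliary variables can be illustrated with a minimal sketch. This is not the algorithm used by the tool: it is a deliberately simplified hot deck in which a missing value is copied from a randomly chosen donor row that matches on the auxiliary columns. The function names `seededRandom` and `simpleHotDeck`, the column names, and the data are all made up for illustration.

```javascript
// Small seeded pseudo-random generator (mulberry32) so that a given seed
// always produces the same sequence of draws.
function seededRandom(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6D2B79F5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Hypothetical helper: fill missing (null) values in the `targets` columns by
// copying from a random donor row that matches on every `auxiliary` column.
// Only the target columns are returned, so the auxiliary data informs the
// imputation without appearing in the output.
function simpleHotDeck(rows, targets, auxiliary, seed) {
  const random = seededRandom(seed);
  return rows.map(function (row) {
    const out = {};
    targets.forEach(function (name) {
      if (row[name] !== null) {
        out[name] = row[name];
        return;
      }
      // Donors have an observed value and identical auxiliary values.
      let donors = rows.filter(d => d[name] !== null &&
        auxiliary.every(aux => d[aux] === row[aux]));
      // If nothing matches, fall back to any row with an observed value.
      if (donors.length === 0)
        donors = rows.filter(d => d[name] !== null);
      out[name] = donors.length > 0
        ? donors[Math.floor(random() * donors.length)][name]
        : null;
    });
    return out;
  });
}

// Made-up data: two variables to impute plus one auxiliary variable.
const rows = [
  { texts: 40, bill: 55, gender: "F" },
  { texts: null, bill: 60, gender: "F" },
  { texts: 10, bill: null, gender: "M" },
  { texts: 12, bill: 30, gender: "M" }
];
const imputedRows = simpleHotDeck(rows, ["texts", "bill"], ["gender"], 12321);
console.log(imputedRows);
```

In this toy data each missing value has exactly one matching donor, so the second row receives 40 text messages and the third row receives a bill of 30. With more donors, the seed determines which one is drawn.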
Usage
- Select one or more variables or questions in the Variables and Questions tab.
- Select Automate > Browse Online Library > Create New Variables > Impute Missing Data.
- To change how the imputation is performed:
  - Select one of the new imputed Questions in the Variables and Questions tab.
  - Right-click and select Edit R Variable.
  - Choose the desired options in the Inputs section on the right. These options are explained below.
  - Click Update R Variable.
Settings
The following settings are available for this tool:
Variables: These are the variables which are imputed.
Auxiliary variables: You can add additional variables to this drop-box to use the data from those variables in the imputation.
Seed: This is the random number seed used in the imputation. Changing this number will result in a different solution.
Method: This option allows you to choose which imputation algorithm is used.
- Try mice: The imputation will initially try to use the mice algorithm, and if this is not successful it will attempt to use the hot deck algorithm.
- Hot Deck: Force the imputation to only use the hot deck algorithm.
- Mice: Force the imputation to only use the mice algorithm.
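The three Method options can be pictured as a simple dispatch with a fallback. The sketch below is only an illustration of that behavior, assuming a hypothetical `imputers` object that supplies one function per algorithm. In the actual tool the choice is passed to R and handled there, and how R implements the fallback is not shown on this page.

```javascript
// Hypothetical dispatcher: `imputers.mice` and `imputers.hotDeck` stand in for
// the two algorithms. The label is lower-cased before comparison, in the same
// way the script passes tolower(formMethod) to R.
function imputeWithMethod(data, method, imputers) {
  const key = method.toLowerCase();
  if (key === "mice")
    return { method: "mice", data: imputers.mice(data) };
  if (key === "hot deck")
    return { method: "hot deck", data: imputers.hotDeck(data) };
  if (key === "try mice") {
    try {
      return { method: "mice", data: imputers.mice(data) };
    } catch (error) {
      // mice failed, so fall back to hot deck.
      return { method: "hot deck", data: imputers.hotDeck(data) };
    }
  }
  throw new Error("Unknown method: " + method);
}

// Stand-in imputers for illustration only: "mice" always fails here, and
// "hot deck" just replaces missing values with 0.
const exampleImputers = {
  mice: function (data) { throw new Error("simulated technical error"); },
  hotDeck: function (data) { return data.map(v => (v === null ? 0 : v)); }
};

const fallbackResult = imputeWithMethod([1, null, 3], "Try mice", exampleImputers);
console.log(fallbackResult.method); // "hot deck"
```

With Mice selected, the same simulated failure would surface as an error instead of triggering the fallback.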
Technical details
By default (the Try mice method), data is imputed using the default settings from the mice R package, which employs Multivariate Imputation by Chained Equations (predictive mean matching).[1] Take care that each variable has the correct variable type, as this strongly affects the algorithm. If mice fails with a technical error, the imputation is instead performed using hot-decking, via the hot.deck package in R.[2]
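The idea behind predictive mean matching can be sketched for a single outcome and a single predictor. This is a heavily simplified illustration and not the mice implementation: it fits one ordinary least squares line, omits the random draw of regression parameters, and does not chain equations across several variables. The function name `pmmImpute`, the default of five donors, and the data are assumptions of the sketch.

```javascript
// Simplified predictive mean matching. `x` is a fully observed predictor and
// `y` is the outcome, with missing values represented by null. `random` must
// return a number in [0, 1).
function pmmImpute(x, y, k = 5, random = Math.random) {
  const observed = y.map((v, i) => i).filter(i => y[i] !== null);
  const n = observed.length;
  if (n === 0) return y.slice(); // nothing to learn from

  // Ordinary least squares fit of y on x using the observed cases only.
  const meanX = observed.reduce((s, i) => s + x[i], 0) / n;
  const meanY = observed.reduce((s, i) => s + y[i], 0) / n;
  const sxx = observed.reduce((s, i) => s + (x[i] - meanX) ** 2, 0);
  const sxy = observed.reduce((s, i) => s + (x[i] - meanX) * (y[i] - meanY), 0);
  const slope = sxx === 0 ? 0 : sxy / sxx;
  const intercept = meanY - slope * meanX;
  const predict = i => intercept + slope * x[i];

  return y.map(function (value, i) {
    if (value !== null) return value;
    // Rank observed cases by how close their predicted value is to the
    // predicted value of the case being imputed, and keep the k closest.
    const donors = observed
      .slice()
      .sort((a, b) => Math.abs(predict(a) - predict(i)) -
                      Math.abs(predict(b) - predict(i)))
      .slice(0, k);
    // Copy the observed outcome of a randomly chosen donor.
    return y[donors[Math.floor(random() * donors.length)]];
  });
}

// Made-up data: the third outcome is missing.
const predictor = [1, 2, 3, 4, 5];
const outcome = [10, 20, null, 40, 50];
console.log(pmmImpute(predictor, outcome, 2));
```

Because the imputed value is copied from a donor, it is always a value that was actually observed. In this example, with two donors, the missing outcome becomes either 20 or 40.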
When imputation is used as part of a regression, cases with missing values in the outcome variable are excluded from the analysis after the imputation has been performed.[3]
Note that although imputation can reduce the bias of parameter estimates, it can create misleading statistical inference (e.g., as the simulated sample size is assumed to be the actual sample size in calculations).
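The sample size issue can be seen in a small worked example. The sketch below uses made-up numbers and a hypothetical `standardError` helper to compare the standard error of a mean computed from the observed values alone with the one computed when imputed values are counted as if they were real observations.

```javascript
// Standard error of the mean: sample standard deviation divided by sqrt(n).
function standardError(values) {
  const n = values.length;
  const mean = values.reduce((s, v) => s + v, 0) / n;
  const variance = values.reduce((s, v) => s + (v - mean) ** 2, 0) / (n - 1);
  return Math.sqrt(variance / n);
}

const observedValues = [20, 35, 50, 65, 80]; // made-up observed responses
const imputedValues = [35, 50, 65];          // made-up imputed responses
const completedValues = observedValues.concat(imputedValues);

const seObserved = standardError(observedValues);   // uses n = 5
const seCompleted = standardError(completedValues); // uses n = 8
console.log(seObserved, seCompleted);
```

The completed data gives the smaller standard error even though no new information was collected, which is why tests run on imputed data can appear more significant than they should.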
The new Questions are imputed jointly. This means that if you make changes to one of them then the others will also change.
There are some technical limitations on how you can change the new variables:
- You cannot add or remove variables from the Variables drop-box.
- You cannot change the order of variables in the Variables drop-box.
- If you wish to delete any of the imputed variables you must delete them all together because they are linked.
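The first two limitations amount to a check that the Variables drop-box still holds the same variables in the same order as when the imputation was created. The R code generated by the script performs checks of this kind. The JavaScript below is only a sketch of the same logic, with a hypothetical function name and made-up variable names.

```javascript
// Hypothetical guard: compare the variable names recorded when the imputed
// variables were created with the names currently in the drop-box.
function checkImputedInputs(originalNames, currentNames) {
  if (currentNames.length !== originalNames.length)
    return "number of variables changed";
  if (currentNames.join(",") !== originalNames.join(","))
    return "names or order changed";
  return "ok";
}

console.log(checkImputedInputs(["texts", "bill"], ["texts", "bill"]));
console.log(checkImputedInputs(["texts", "bill"], ["bill", "texts"]));
console.log(checkImputedInputs(["texts", "bill"], ["texts"]));
```

To impute a different set of variables, rerun the tool with the new selection instead of editing the existing variables.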
How to apply this QScript
- Start typing the name of the QScript into the Search features and data box in the top right of the Q window.
- Click on the QScript when it appears in the QScripts and Rules section of the search results.
OR
- Select Automate > Browse Online Library.
- Select this QScript from the list.
Customizing the QScript
This QScript is written in JavaScript and can be customized by copying and modifying the JavaScript.
Customizing QScripts in Q4.11 and more recent versions
- Start typing the name of the QScript into the Search features and data box in the top right of the Q window.
- Hover your mouse over the QScript when it appears in the QScripts and Rules section of the search results.
- Press Edit a Copy (bottom-left corner of the preview).
- Modify the JavaScript (see QScripts for more detail on this).
- Either:
  - Run the QScript, by pressing the blue triangle button.
  - Save the QScript and run it at a later time, using Automate > Run QScript (Macro) from File.
Customizing QScripts in older versions
JavaScript
// For insertAtHoverButtonIfShown, preventDuplicateQuestionName, inDisplayr,
// correctTerminology, preventDuplicateVariableName
includeWeb("QScript Utility Functions");
includeWeb("QScript Selection Functions"); // getAllUserSelections
includeWeb("QScript R Variable Creation Functions") // robustNewRQuestion
function ImputeSelections(){
let user_selections = getAllUserSelections();
let selected_questions = user_selections.selected_questions;
let selected_variables = user_selections.selected_variables;
if (selected_variables.length === 0) {
let data_location = (inDisplayr() ? "Data Sets on the left" :
"the Variables and Questions tab");
log("Please select the variables to impute from " + data_location + " and rerun.");
return false;
}
let data_file = selected_questions[0].dataFile;
if (selected_questions.some(v => v.dataFile.name !== data_file.name)) {
log("Sorry, all selected variables must be from the same data set.");
return false;
}
let non_text_obj = selected_variables.filter(v => v.variableType != "Text").filter(validNonTextVariable)
.map( function (v, ind) {
return { variable: v,
question: v.question,
type: v.question.questionType,
index: ind,
is_text: v.variableType == "Text" };
});
let text_obj = selected_variables.filter(v => v.variableType == "Text").filter(validTextVariable)
.map( function (v, ind) {
return { variable: v,
question: v.question,
type: v.question.questionType,
index: ind,
is_text: v.variableType == "Text" };
});
if (text_obj.length === 0 && non_text_obj.length === 0)
{
log("The selected variables contain entirely missing data. Imputation not performed.")
return false;
}
let new_non_text = [];
let new_text = [];
if (non_text_obj.length > 0) {
new_non_text = createImputedVariableSet(non_text_obj.map(o => o.variable)).variables;
if (!new_non_text)
return false;
non_text_obj.forEach(function (o, ind) {
o.new_variable = new_non_text[ind];
});
}
if (text_obj.length > 0) {
new_text = createImputedVariableSet(text_obj.map(o => o.variable)).variables;
if (!new_text)
return false;
text_obj.forEach(function (o, ind) {
o.new_variable = new_text[ind];
});
}
let all_obj = text_obj.concat(non_text_obj);
let selected_variable_names = selected_variables.map(v => v.name);
// Restore the order in which the variables were originally selected.
all_obj.sort(function (a, b) {
return selected_variable_names.indexOf(a.variable.name) - selected_variable_names.indexOf(b.variable.name);
});
let new_questions = [];
selected_questions.forEach(function (q) {
let new_question;
let new_vars_for_question = all_obj.filter(function (obj) {
return obj.question.equals(q);
}).map(obj => obj.new_variable);
let df = q.dataFile;
new_question = df.setQuestion(preventDuplicateQuestionName(df, q.name + " - Imputed"),
q.questionType, new_vars_for_question);
df.moveAfter(new_question.variables, q.variables[q.variables.length-1]);
new_questions.push(new_question);
});
moveQuestionsToHoverButtonIfShown(new_questions)
return;
}
function validNonTextVariable(variable) {
let vattr = variable.valueAttributes;
let invalid = variable.rawValues.filter(x => !Number.isNaN(x) &&
!vattr.getIsMissingData(x)).length === 0;
if (invalid)
log("The variable '" + variable.label + "' contains entirely missing values and has been ignored.")
return !invalid;
}
// Check if text variable has entirely missing data.
// We allow variables with only one single-unique non-missing value, because
// sometimes hot deck is okay with this and we catch this error later
function validTextVariable(variable)
{
let unique_vals = variable.uniqueValues;
let invalid = unique_vals.length <= 1 && !unique_vals[0];
if (invalid)
log("The variable '" + variable.label + "' contains entirely missing values and has been ignored.")
return !invalid;
}
function createImputedVariableSet(imputed_vars) {
let data_file = imputed_vars[0].question.dataFile;
let selected_variables = project.report.selectedVariables();
let imputed_names = imputed_vars.map(v => v.name);
let n_original_variables = imputed_vars.length;
let n_aux_variables = selected_variables.length - n_original_variables;
let aux_guids;
if (n_aux_variables)
aux_guids = selected_variables.filter(v => !imputed_names.includes(v.name)).map(v => v.guid).join(';');
let new_question_name = preventDuplicateQuestionName(data_file,
"TEMP" + " - Imputed");
let temp_var_name = preventDuplicateVariableName(data_file,
"tempVarXFBG361_adf");
let structure_name = correctTerminology("variable set");
let new_r_question;
let inputs = {formData: imputed_vars.map(v => v.guid).join(";")};
if (n_aux_variables)
inputs.formAuxiliary = aux_guids;
try {
let v_name_string = imputed_vars.map(v => v.name).join(",");
new_r_question = robustNewRQuestion(data_file, rCodeString(n_original_variables, v_name_string),
new_question_name, temp_var_name,
imputed_vars[imputed_vars.length - 1], jsCodeString(), inputs);
new_r_question.variables.forEach((v,i) => {
v.label = imputed_vars[i].label;
v.name = preventDuplicateVariableName(data_file, imputed_vars[i].name + "_imputed");
});
insertAtHoverButtonIfShown(new_r_question);
project.report.setSelectedRaw([new_r_question.variables[0]]);
}catch (e) {
let data_location = inDisplayr() ? "Data Editor" : "Data tab";
if (e.message.indexOf("supply both 'x' and 'y'") > -1)
log("Your variables appear to contain entirely missing data. " +
"Please check the supplied variable values in the " + data_location + ".");
else if (e.message.indexOf("invalid first argument") > -1 ||
e.message.indexOf("default method not implemented for type") > -1)
log("Sorry, imputation failed. This may occur if some of your input variables are inappropriate. " +
"For example, if they contain only one unique non-missing value. " +
"Please check the supplied variables in the " + data_location + ".");
else if (e.message.indexOf("Can only convert tabular results") > -1)
log("Sorry, we were unable to create the imputed data set because it is too large. " +
"Please consider reducing the number of variables and contacting Support.");
else if (e.message.indexOf("cannot allocate vector of size") > -1)
log("Sorry, we are unable to perform imputation with such a large amount of input data. " +
"Please consider reducing the number of variables and contacting Support.");
else
log("Sorry, an error occurred while imputing the selected data. " + e);
return false;
}
return new_r_question;
}
function rCodeString(n_var, v_name_string) {
let structure_name = correctTerminology("variable set");
return `library(flipImputation)
N.VARIABLES <- ${ n_var }
if (N.VARIABLES != length(formData))
stop("Sorry, it is not possible to change the number of imputed variables in the existing ${structure_name}. ",
"Please rerun the feature with updated selections to add or remove variables." )
ORIGINAL.NAMES <- "${ v_name_string }"
dat <- QDataFrame(formData)
CURRENT.NAMES <- paste0(vapply(dat, FUN = function (xx) {return(attr(xx, "name"))} , FUN.VALUE = character(1)), collapse = ",")
if (CURRENT.NAMES != ORIGINAL.NAMES)
stop("Sorry, it is not possible to change the order of imputed variables. The best thing to do is to click the Undo button in the top left.")
if (length(formAuxiliary))
dat <- cbind(dat, QDataFrame(formAuxiliary))
imputed.data <- Imputation(dat, seed = formSeed, method = tolower(formMethod))[[1]][, 1:N.VARIABLES]
imputed.data`;
}
function jsCodeString() {
return `form.dropBox({name: 'formData', label: 'Variables', multi: true,
required: true, types: ['Variables'],
prompt: 'Supply variables to be imputed'});
form.dropBox({name: 'formAuxiliary', label: 'Auxiliary variables', multi: true,
required: false, types: ['Variables'],
prompt: 'Additional variables to use in modeling/prediction but not to be imputed'});
form.numericUpDown({name: 'formSeed', label: 'Seed', default_value: 12321,
minimum: 1, increment: 1, maximum: 10000000,
prompt: 'Seed to use for random number generation'});
form.comboBox({name: 'formMethod', label: 'Method',
alternatives: ['Try mice', 'Hot Deck', 'Mice'], default_value: 'Try mice'});`;
}
ImputeSelections();
See also
- QScript for more general information about QScripts.
- QScript Examples Library for other examples.
- Online JavaScript Libraries for the libraries of functions that can be used when writing QScripts.
- QScript Reference for information about how QScript can manipulate the different elements of a project.
- JavaScript for information about the JavaScript programming language.
- Table JavaScript and Plot JavaScript for tools for using JavaScript to modify the appearance of tables and charts.
References
[1] Stef van Buuren and Karin Groothuis-Oudshoorn (2011). "mice: Multivariate Imputation by Chained Equations in R". Journal of Statistical Software, 45(3), 1-67.
[2] Skyler J. Cranmer and Jeff Gill (2013). "We Have to Be Discrete About This: A Non-Parametric Imputation Technique for Missing Categorical Data". British Journal of Political Science, 43, 425-449.
[3] Paul T. von Hippel (2007). "Regression With Missing Y's: An Improved Strategy for Analyzing Multiply Imputed Data".