Predicting Social Outcomes

Predicting social outcomes is hard, and TypeThePipe is showing you why.


Introduction

Outcome prediction is one of the main uses of machine learning, powering predictive models for real-world events such as diagnosing illnesses or forecasting stock market trends. In this post, we develop predictive models to estimate six key life outcomes of children from the Fragile Families and Child Wellbeing Study, using regression models for the continuous outcomes and classification models for the binary outcomes. We applied several feature imputation and selection techniques to improve our models' predictive ability and runtime, and we used techniques such as cross-validation to tune the models' hyperparameters.

We present empirical results both from predicting outcomes on the test data set and from estimating outcomes on the training data set during cross-validation.


The problem

The Fragile Families and Child Wellbeing Study collected information for a mass inference and prediction study with the ultimate goal of gaining insights into improving these children's lives. The study gathered a large database of information from 4242 "fragile" families with unmarried parents over the first 15 years of each child's life, in six waves of collection: at birth, age 1, age 3, age 5, age 9, and age 15. The Fragile Families Challenge then brought together 160 teams who built models to predict six key outcomes measured in the final wave at age 15: GPA, grit, material hardship, eviction, layoff, and job training. As concluded in the paper Measuring the predictability of life outcomes with a scientific mass collaboration, and remarked upon since, predicting social outcomes from the collected features leads to very high prediction error, with little, if any, improvement over a four-variable model hand-picked by a domain expert.


The approach

In this work, we take on this challenge, training regression and classification models to predict these six outcomes using 2121 of the families' data as the training data set and the remaining half as the test data set. We use AdaBoost, random forest, Gaussian process, and support vector machine models for predicting all outcomes, as well as elastic net regression for the continuous outcomes and logistic regression for the binary outcomes. We performed feature selection using techniques such as principal component analysis after filtering and imputing the data set, and tuned each model's hyperparameters with five-fold cross-validation on the available training data. Unfortunately, this data set is not public; I had access to it as a PhD candidate at Princeton University. All the methods described here have corresponding Python code in our Colab Notebook that you can read and reuse for your own problems. The methods are quite generic and can be reused for other data sets.


Imputation and Feature Selection

In this work, we use the Fragile Families data set consisting of 12,942 variables for 4242 families. There are binary, categorical and continuous values among these variables. The training data labels consist of six key outcomes from the 15th year of the child’s life for 2121 of these families: GPA, grit, material hardship, eviction, layoff, and job training. Testing is performed by predicting the six key outcome values for the remaining 2121 unlabeled families.

Several values in the data set are missing, so to make full use of the data we first needed to fill them in. We reduced the variable set by filtering out all variables with identical values across all families, as well as all variables missing more than 50% of families' data. Additionally, we chose to treat the negative values that encode specific reasons for missingness the same as any other missing value. However, we hypothesize that these negative values could be exploited with clever techniques to further improve outcome prediction; due to time constraints, we leave this exploration for future work. To fill in the missing values in the remaining data set, we used two imputation methods: for variables with continuous values, we imputed the mean of the observed values, and for variables with categorical values, we imputed the mode.
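A minimal sketch of this filtering and imputation step follows. The categorical threshold value and the assumption that every column holds numeric survey codes (with negative codes marking reasons for missingness) are illustrative, not the exact code from our notebook:

```python
import pandas as pd

def filter_and_impute(df: pd.DataFrame, cat_threshold: int = 10) -> pd.DataFrame:
    """Drop uninformative columns and impute the rest.

    Assumes all columns are numeric survey codes, where negative codes
    record *why* a value is missing (a simplifying assumption here).
    """
    df = df.mask(df < 0)                              # negative codes -> NaN
    df = df.loc[:, df.nunique(dropna=True) > 1]       # drop constant columns
    df = df.loc[:, df.isna().mean() <= 0.5].copy()    # drop columns >50% missing
    for col in df.columns:
        if df[col].nunique(dropna=True) <= cat_threshold:
            df[col] = df[col].fillna(df[col].mode().iloc[0])  # categorical: mode
        else:
            df[col] = df[col].fillna(df[col].mean())          # continuous: mean
    return df
```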


Feature Selection

We additionally experimented with different ways of assigning these two data types. In particular, we tried different thresholds for declaring a variable categorical (e.g., at most 5 or 10 distinct values). We used one-hot encoding to represent the categorical variables and treated the remainder as continuous. For continuous variables, we used standard scaling to normalize the values, subtracting the mean and dividing by the standard deviation.
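A sketch of this encoding step with scikit-learn, under the same assumptions as above (the threshold and function name are illustrative):

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def build_preprocessor(df, cat_threshold=10):
    # Columns with few distinct values are treated as categorical; the rest
    # as continuous (the 10-value threshold is one of the settings we tried).
    cat_cols = [c for c in df.columns if df[c].nunique() <= cat_threshold]
    num_cols = [c for c in df.columns if c not in cat_cols]
    return ColumnTransformer([
        ("onehot", OneHotEncoder(handle_unknown="ignore"), cat_cols),
        ("scale", StandardScaler(), num_cols),  # (x - mean) / std per column
    ])
```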

Due to the large set of variables for a relatively small number of families, feature selection is particularly important for using machine learning techniques effectively and efficiently in this setting. Based on prior work on the Fragile Families data set, we chose to extract the most relevant features for each outcome using principal component analysis (PCA). We attempted to use recursive feature elimination (RFE) as well, but found its runtime intractable, so we instead experimented with selecting the k best features using the Python scikit-learn library's f_classif metric, which ranks features with an ANOVA F-test. Additionally, we performed prediction of each outcome with the full data set to compare results with and without feature selection. To test feature selection exhaustively, we support it on a per-outcome basis, allowing different schemes for different outcomes, e.g., PCA for eviction and the full feature set for grit.
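Both selection schemes are a few lines in scikit-learn. In this sketch, X_train and y_train stand in for the encoded feature matrix and one binary outcome; 1000 components is the setting we report later, but here it is just an example value:

```python
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

# Option 1: project the encoded features onto the first 1000 principal components.
pca = PCA(n_components=1000)
X_reduced = pca.fit_transform(X_train)

# Option 2: keep the k features with the highest ANOVA F-score against the outcome.
selector = SelectKBest(score_func=f_classif, k=1000)
X_selected = selector.fit_transform(X_train, y_train)
```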


Regression and Classification Methods

Of the six outcomes we predict from the Fragile Families study, three consist of continuous data values (GPA, grit, material hardship) while the other three are binary (eviction, layoff, job training). Thus, we use regression techniques for predicting the continuous outcomes and classification techniques for predicting the binary outcomes. In most cases, we use the regression and classification variants of the same technique. For all methods, we use the implementations available in the Python scikit-learn library.

AdaBoost: Prior work on the Fragile Families project, as well as work on predicting the mortality of ICU patients as a function of time, found AdaBoost to perform well for outcome prediction, motivating its use in our study. AdaBoost is an ensemble machine learning method built on shallow decision trees referred to as decision stumps. Each decision stump splits decisions for a single feature based on a threshold value. The best-fitting stump for each feature is chosen by sorting the samples according to that feature and trying thresholds between each pair of adjacent samples. The algorithm is useful for both regression and classification, as it can effectively capture non-linear relationships in real-world data.
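A sketch of the regression variant in scikit-learn; the hyperparameter values and the X_train / y_gpa / X_test names are illustrative, not our tuned settings:

```python
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor

# Depth-1 trees are the decision stumps: one feature and one threshold each.
ada = AdaBoostRegressor(
    estimator=DecisionTreeRegressor(max_depth=1),  # "base_estimator" before sklearn 1.2
    n_estimators=100,
    learning_rate=0.1,
)
ada.fit(X_train, y_gpa)
gpa_pred = ada.predict(X_test)
```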

Random Forest: Random forest is another ensemble learning method that was explored in prior Fragile Families studies. The algorithm is composed of n decision trees with internal nodes that split decisions and leaves that hold prediction values. Each decision tree is fitted to a subset of the training data and features, which mitigates the overfitting that standalone decision trees suffer from. In regression settings, each leaf averages the values of the training samples that land on it, and the forest then averages the estimates of its trees to predict a value. We use this method for both classification and regression, as both variants have been shown to be effective.
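Both variants share the same interface; this sketch uses illustrative settings and the hypothetical y_eviction label vector:

```python
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Each of the n trees sees a bootstrap sample of the families and a random
# subset of features at every split; the forest votes (or averages) over them.
rf_clf = RandomForestClassifier(n_estimators=500, max_features="sqrt")
rf_reg = RandomForestRegressor(n_estimators=500, max_features="sqrt")
rf_clf.fit(X_train, y_eviction)
eviction_pred = rf_clf.predict(X_test)
```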

Elastic Net: One of the best-performing prediction methods found by prior Fragile Families work was the elastic net algorithm. Elastic net is a regression algorithm that builds upon ridge and lasso regression. Ridge regression prevents overfitting of the training data by introducing a penalty that decreases the variance of predictions; this is effective when most features are useful. Lasso regression assigns a similar penalty but goes a step further, shrinking the coefficients of useless features exactly to zero and thus reducing variance further. Elastic net combines the ridge and lasso penalties, resulting in better handling of correlation between features.
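In scikit-learn the mix between the two penalties is a single parameter; the alpha and l1_ratio values below are illustrative, not our tuned settings:

```python
from sklearn.linear_model import ElasticNet

# scikit-learn minimizes:
#   ||y - Xw||^2 / (2n) + alpha * l1_ratio * ||w||_1
#                       + alpha * (1 - l1_ratio) / 2 * ||w||_2^2
# so l1_ratio=1.0 recovers lasso and l1_ratio=0.0 recovers ridge.
enet = ElasticNet(alpha=0.1, l1_ratio=0.5)
enet.fit(X_train, y_gpa)
gpa_pred = enet.predict(X_test)
```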

Gaussian Process: Caywood et al. found Gaussian process regression to be useful and easy to analyze for predicting mental workload from EEG data, suggesting that complex data set inputs can be effectively processed for prediction and motivating our experimental use of this method. Roberts et al. additionally found Gaussian processes useful in the domain of time-series analysis, suggesting the technique may work well for the Fragile Families data set, which was collected over time. Gaussian processes offer non-parametric regression and classification. They generalize the Gaussian probability distribution and are defined by a choice of covariance function, which states how target values may change as the inputs change. As few assumptions are made about the shape of the estimator function beyond this choice of covariance function, the method works well for high-dimensional feature inputs such as the Fragile Families data set. We experiment with both regression and classification using Gaussian processes.
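A sketch of the regression variant; the RBF-plus-noise kernel is a common default rather than our exact choice, and X_reduced / y_grit are hypothetical names for PCA-reduced features and one continuous outcome:

```python
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# The kernel encodes how target values co-vary as the inputs change; an RBF
# kernel plus a noise term is a common starting point.
kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
gp.fit(X_reduced, y_grit)  # GP training scales cubically, so reduced inputs help
grit_mean, grit_std = gp.predict(X_test_reduced, return_std=True)
```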

Support Vector Machine: Support vector machines (SVMs) expand on linear models by estimating a separating hyperplane in a high-dimensional feature space. While ideally the hyperplane would split the data perfectly, this is often not feasible with real-world data; instead, risk minimization is used to find an appropriate estimate of the hyperplane. SVMs natively perform binary classification, but they also work well for regression problems, such as predicting stock trends. We thus test SVMs for both continuous and binary outcome prediction.
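Both variants in scikit-learn, with illustrative hyperparameters:

```python
from sklearn.svm import SVC, SVR

# Soft-margin classification for the binary outcomes ...
svm_clf = SVC(kernel="rbf", C=1.0)
# ... and epsilon-insensitive regression for the continuous ones.
svm_reg = SVR(kernel="rbf", C=1.0, epsilon=0.1)
svm_reg.fit(X_train, y_gpa)
gpa_pred = svm_reg.predict(X_test)
```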

Logistic Regression: Logistic regression is a well-known binary classification model. This simple technique passes a linear combination of the features through a sigmoid function to produce a probability; depending on a chosen threshold, a sample is then labeled as one of the two classes. As this is such a common binary classification method, we expect it may perform well for predicting the three binary outcomes in the Fragile Families study.
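A sketch with the hypothetical y_layoff label vector; the 0.5 threshold is the usual default:

```python
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_layoff)
# predict_proba applies the sigmoid; the label follows from a chosen threshold.
probs = log_reg.predict_proba(X_test)[:, 1]
layoff_pred = (probs >= 0.5).astype(int)
```

Given the class imbalance discussed in the results below, setting class_weight="balanced" or lowering the threshold are natural knobs to try.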


Evaluation

We performed five-fold cross-validation with the training data set to determine the best performing hyperparameters for each of the regression and classification methods we use for prediction.
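A minimal sketch of this tuning loop, shown here for elastic net; the grid values are illustrative rather than the ranges we actually searched:

```python
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

param_grid = {"alpha": [0.01, 0.1, 1.0], "l1_ratio": [0.1, 0.5, 0.9]}
search = GridSearchCV(ElasticNet(), param_grid, cv=5,   # five folds
                      scoring="neg_mean_squared_error")
search.fit(X_train, y_gpa)
print(search.best_params_, -search.best_score_)          # best hyperparameters, MSE
```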

Table 1: Average cross-validation accuracy (1 − mean squared error) of each regression method for the three continuous outcomes.

| Regression Method | GPA | Grit | Mat. Hard. |
|---|---|---|---|
| AdaBoost | 0.5966 | 0.7631 | 0.9732 |
| Random Forest | 0.6035 | 0.7631 | 0.9775 |
| Elastic Net | 0.6208 | 0.7698 | 0.9798 |
| Gaussian Process | 0.5575 | 0.7647 | 0.9758 |
| Support Vector Machine | 0.6245 | 0.7736 | 0.9793 |
Table 2: Average cross-validation accuracy, precision, and recall (for class 1) of each classification method for the three binary outcomes.

| Classification Method | Eviction (Acc / Prec / Rec) | Layoff (Acc / Prec / Rec) | Job Training (Acc / Prec / Rec) |
|---|---|---|---|
| AdaBoost | 0.9363 / 0.1071 / 0.0490 | 0.7454 / 0.2774 / 0.1481 | 0.6646 / 0.1686 / 0.1093 |
| Random Forest | 0.9404 / 0 / 0 | 0.7909 / 0 / 0 | 0.7652 / 0 / 0 |
| Logistic Regression | 0.9404 / 0 / 0 | 0.7909 / 0 / 0 | 0.7652 / 0 / 0 |
| Gaussian Process | 0.9404 / 0 / 0 | 0.7901 / 0.5167 / 0.0147 | 0.7543 / 0.3278 / 0.0462 |
| Support Vector Machine | 0.9232 / 0.1863 / 0.0794 | 0.6899 / 0.2556 / 0.2598 | 0.6427 / 0.2753 / 0.3219 |
Table 3: Best leaderboard loss, selected model, and combined training + prediction runtime for each outcome.

| Outcome | Loss | Model | Train + Pred. Time (s) |
|---|---|---|---|
| GPA | 0.37504 | Elastic Net | 5.596 |
| Grit | 0.21421 | Elastic Net | 9.238 |
| Material Hardship | 0.02565 | Elastic Net | 6.555 |
| Eviction | 0.05660 | Logistic Reg. | 3.504 |
| Layoff | 0.22453 | Logistic Reg. | 4.181 |
| Job Training | 0.27736 | Logistic Reg. | 5.957 |


Results

Table 1 presents the average accuracy (computed as 1 − mean squared error) of the studied regressors across five-fold cross-validation, performed by splitting the training data into train and test subsets to estimate prediction quality. We used grid search over a range of hyperparameters to obtain the best-performing ones for each variation of the data set (across imputation parameters). Similarly, Table 2 presents the average accuracy of the studied classifiers, as well as the precision and recall for class 1. Note that, due to the imbalance between classes 0 and 1 in the binary outcomes, classifiers like logistic regression predict all samples as class 0, yielding zero recall for class 1. These results come from our final imputation scheme, in which we pruned 9503 features missing more than half of their data and 2433 features constant across all samples. Of the remaining 3523 features, 250 with fewer than 10 distinct values were considered categorical and one-hot encoded. This encoding increased the feature count to 11355, and we empirically found (as expected) that reducing to 1000 features with PCA performs better.

Table 3 presents the best loss values on the leaderboard, the model that achieved them, and the combined runtime of training and prediction. Rather than requiring every outcome to share a model, we allow each outcome its own model: during cross-validation on the training set, we selected the best-performing model per outcome and used it for prediction. All continuous outcomes happened to perform best with the same model. The binary outcomes performed identically with random forest and logistic regression; we selected logistic regression for its faster runtime.


Discussion and Conclusion

This work compared 5 classifiers and 5 regressors for predicting the six outcomes of interest in the study. The best-performing classifier and regressor achieved higher accuracy than the study's baseline model, and did so with competitive runtimes (<10 s for training and prediction). Nevertheless, most outcomes are predicted with high loss (GPA, grit, layoff, job training), and binary prediction achieves its higher accuracy by always predicting zero, much like the best results found in the study. We thus reach the same conclusion: life outcome prediction is difficult, and standard machine learning is insufficient for predicting the outcomes of people's lives, even given a large database of variables. I hope this post has shed some light on predicting social outcomes. Remember that you can find the code for all the methods in this post in our Colab notebook here.


Acknowledgements

This work was done as part of the graduate-level course COS524 at Princeton University, in collaboration with Naorin Hossain. We had access to the data set thanks to Princeton's Fragile Families project, but it is not publicly available. The conclusions presented are ours alone and do not necessarily represent Princeton's views.

Marcelo Orenes
CEO at UbiWork.co

Marcelo Orenes is a graduate student in Computer Science at Princeton University and CEO at UbiWork