time series: predicting an aberrant year

Question

Hi, I am new to trying to use time series analysis and would appreciate a pointer (or several).  Here is my hypothetical use case:

1.  200 students
2.  10 years of performance data for those students
3.  10 variables each year for each student

So, the questions.

1.  Is there a way to use an unsupervised classifier to "cluster" student-years to identify aberrant performance (i.e., the student has a really good or a really bad year)?
2.  Assuming that one of the classifications amounts to a "bad year," are there techniques that would allow us to look at the time series data to predict whether the coming year is going to be good or bad?

I appreciate any help or pointers you can give.  Right now I am trying to wrap my head around working with time series data; I am pretty comfortable modeling data and conditions that are static.  Thanks, Chris

wessel · Answer

I'm confused by your preference for unsupervised learning techniques.
You state that you wish to predict aberrant years.
So your data has labels, right?
Students 1 to N: aberrant
Students N+1 to M: non-aberrant

Note that, if you do not have labels, you are unlikely to need any advanced machine learning techniques.
Maybe you can get some information out of clustering or association rules, but you are probably better off spending time creating nice plots and using your own eyes for pattern recognition.

Furthermore, there is nothing "non-elegant" about having lots of columns in your data set (i.e. 7x10 = 70 columns per row to encode grades).
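
To make the "lots of columns" encoding concrete, here is a minimal sketch that pivots long student-year rows into one wide row per student. The function name `to_wide`, the tuple layout, and the toy data are assumptions made up for illustration; it also assumes every year carries exactly `n_grades` grades.

```python
from collections import defaultdict

def to_wide(rows, n_years, n_grades):
    """Pivot long student-year rows into one wide row per student.

    rows: iterable of (student_id, year, grades), with year in 1..n_years
    and grades a list of exactly n_grades numbers. Missing years stay None.
    """
    wide = defaultdict(lambda: [None] * (n_years * n_grades))
    for student_id, year, grades in rows:
        start = (year - 1) * n_grades
        wide[student_id][start:start + n_grades] = grades
    return dict(wide)

# Hypothetical toy data: 2 students, 2 years, 3 grades per year.
long_rows = [
    ("s1", 1, [7.0, 8.0, 6.5]),
    ("s1", 2, [7.5, 8.0, 6.0]),
    ("s2", 1, [5.0, 6.0, 5.5]),
]
wide_rows = to_wide(long_rows, n_years=2, n_grades=3)
```

With 10 years and 7 grades per year the same call would give the 70 grade columns per student mentioned above.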

My suggestion would be:

1.
Start off simple, encode all predictive variables simply as separate columns.
Run as many learners as you can; try at least nearest neighbors, trees, random forest, Bayesian, support vector machines, and logistic boosting.

2.
If 1. doesn't work, try stacking. In other words, group several columns that you think are related, and encode them in such a way as to make full use of the capabilities of a single learner.
Do this for several column / learner combinations (these are your level-0 predictors). Then feed the output of your level-0 learners to a level-1 learner to combine predictions.
You may also feed some columns directly to the level-1 learner; for example, some demographic variable. Combining several grade variables can be really simple (e.g. "trend in last 3 years", "average over last 5 years", "deviation in last 7 years", "distance to some cluster (outlier rank)", etc.). A more complex (almost black box) way to combine variables would be to use a neural network.

3.
If 1 and 2 both don't work, try graphical modelling so you have full control over the distribution at each node.
Use copula functions to create modularity in your network so you can test separate parts of your network independently.
You can use Bayesian Networks to do some limited form of graphical modelling in Rapid Miner, but you might want to use tools from the R plugin instead.
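
As a rough illustration of the simple combined variables from step 2, the sketch below collapses one student's yearly grades into three columns. The names `slope` and `summary_features`, the window sizes, and the example series are assumptions for illustration only; the series is assumed to be ordered oldest first and to hold at least two values.

```python
from statistics import mean, pstdev

def slope(values):
    """Least-squares slope of values against 0, 1, 2, ... (trend per year)."""
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = mean(values)
    num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    den = sum((x - x_mean) ** 2 for x in range(n))
    return num / den

def summary_features(grades):
    """Collapse one student's yearly grades (oldest first) into a few columns."""
    return {
        "trend_last_3": slope(grades[-3:]),
        "mean_last_5": mean(grades[-5:]),
        "std_last_7": pstdev(grades[-7:]),
    }

# Hypothetical series: steady for seven years, then a decline.
features = summary_features([7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 8.0, 6.0, 4.0])
```

Columns like these could be fed to a level-0 learner or straight to the level-1 learner, depending on what works.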

Best regards,

Wessel

cgkolar · Answer

Thanks for the follow-ups; they give me some things to look at.

wessel wrote:
> What are the variables you will be measuring?
> How many grades do students get in 1 year?
>
> Let's say you have 1 grade per year, then you'd have a data set consisting of 200 rows:
> student_id, grade1, grade2, grade3, grade4, ..., grade10
>
> In this setup, applying a standard clustering algorithm should be able to find students that have abnormal grades?

The variables are a combination of grades as well as some demographic variables such as placement and aptitude test scores.  The number of grades will likely be around 7.  I imagined also coding the year (1, 2, 3, or 4 for freshman, sophomore, junior, or senior, here in the US).  So, looking at several years of data, the number of rows would be closer to, say, 2000 if I were looking at 10 years.

student_id, gender, school_year(1-4), aptitude test, grade1, grade2 ....

That is what I inelegantly tried to describe as student-year data.  In the normal statistics world I think the closest method would be Latent Class Transition Analysis, where there is a latent class (cluster), individuals have measures at multiple time points, and you are looking for individuals that transition from one class to another during the time observed.

I think that what I could do is take my large dataset, perform the cluster analysis, store the clusters, find the student IDs that go from a nominal cluster to a "bad" cluster, isolate the "bad year minus one" cases, and then see if I can build a classifier that can tell me when student performance is characteristic of the years preceding a downturn.
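
A rough sketch of that labeling step, assuming the cluster analysis has already assigned a label to every student-year. The function name `label_pre_downturn`, the label strings, and the toy data are made up for illustration.

```python
def label_pre_downturn(cluster_by_year, bad="bad"):
    """Flag each student-year whose following year falls in the bad cluster.

    cluster_by_year: {student_id: [cluster label for year 1, year 2, ...]}
    Returns {(student_id, year): 0 or 1} for every year that has a successor;
    1 means the current year is not bad but the next one is.
    """
    labels = {}
    for student_id, clusters in cluster_by_year.items():
        for i in range(len(clusters) - 1):
            current, following = clusters[i], clusters[i + 1]
            is_pre_downturn = current != bad and following == bad
            labels[(student_id, i + 1)] = int(is_pre_downturn)
    return labels

# Hypothetical cluster assignments for two students over four years.
clusters = {
    "s1": ["nominal", "nominal", "bad", "nominal"],
    "s2": ["nominal", "good", "nominal", "nominal"],
}
targets = label_pre_downturn(clusters)
```

The resulting 0/1 values could serve as the target column for a classifier trained on each student-year's variables. The final year of each student has no successor, so it gets no label.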

However, as I am new to time series analysis, I feel like doing that would strip out some of the slope/acceleration data that is in there from having taken repeated measures.  Thanks for all of the guidance as I venture into new methods.  Chris

wessel · Answer

What are the variables you will be measuring?
How many grades do students get in 1 year?

Let's say you have 1 grade per year, then you'd have a data set consisting of 200 rows:
student_id, grade1, grade2, grade3, grade4, ..., grade10

In this setup, applying a standard clustering algorithm should be able to find students that have abnormal grades?
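
For example, a minimal sketch of that idea with a plain k-means, treating the smallest cluster as the candidate "abnormal" group. The function `kmeans`, its deterministic farthest-first seeding, the choice of k = 2, and the toy grades (4 years instead of 10, to keep it short) are all assumptions for illustration; a real run would use a proper clustering operator.

```python
import math
from collections import Counter

def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means, seeded with the first point, then farthest-first."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))

    def nearest(p):
        return min(range(k), key=lambda j: math.dist(p, centers[j]))

    for _ in range(iterations):
        assign = [nearest(p) for p in points]
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:  # keep the old center if a cluster empties out
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    # Final assignment against the last set of centers.
    return [nearest(p) for p in points], centers

# Hypothetical toy data: one row of yearly grades per student.
grades = [
    [7, 7, 8, 7],
    [6, 7, 7, 6],
    [8, 7, 7, 8],
    [7, 6, 7, 7],
    [3, 2, 3, 2],
]
assign, centers = kmeans(grades, k=2)
sizes = Counter(assign)
smallest = min(sizes, key=sizes.get)
abnormal = [i for i, a in enumerate(assign) if a == smallest]
```

Here `abnormal` holds the row indices of the students in the smallest cluster, which is only a starting point for a closer look.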

Best regards,

Wessel