Scanning plain text files for source code
Good afternoon everyone,
I have the RapidMiner community edition installed. I wish to use this software to scan a set of plain text documents and see which of them contain source code (identified by reserved words). I imagine this is something that I could do with a Support Vector Machine, but I am not sure how I would implement this in RapidMiner. Could anyone point me in the right direction? Thank you.
Answers
-
Sure, here's a conceptual approach to what you would need to do.
- First, since this is a supervised learning problem, you will need to create a "label" attribute (RapidMiner vocabulary for a dependent variable or outcome of interest). In your case this is probably a binominal categorical attribute for "Contains Source Code" (meaning it is either yes or no).
- You will then need to have many labeled cases to train your model, which means you may need to score a few dozen examples of each outcome (yes/no) by hand so the model has something to work with.
- Then you will need to import your texts and process them as documents, so you can return a word vector for each document. You might also want to look at things like tokenizing, stemming, filtering stopwords, and all the other typical text processing steps.
- Once you have your dataset built this way, you can run a machine learning model (like SVM) on the scored examples so it can learn patterns of text associated with source code.
- Assuming you build a successful model, you can then apply that model to your unscored data.
As I said, this is a fairly conceptual workflow but it should cover all the basics you need to tackle your problem.
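If it helps to see the steps above outside of RapidMiner, here is a minimal Python sketch of the same workflow: hand-labeled documents, a word vector per document, a linear SVM trained on them, and the model applied to an unscored document. It is only an illustration under stated assumptions. The function names, the six toy documents, and the hyperparameters are made up, and the trainer is a bare-bones linear SVM (hinge loss minimised by subgradient descent), not the SVM operator that RapidMiner provides.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep word-like tokens (letters and underscores)."""
    return re.findall(r"[a-z_]+", text.lower())

def build_vocabulary(texts):
    """Map every token seen in the training texts to a column index."""
    tokens = sorted({tok for text in texts for tok in tokenize(text)})
    return {tok: i for i, tok in enumerate(tokens)}

def vectorize(text, vocab):
    """Sparse word-count vector {column: count}; unknown tokens are dropped."""
    counts = Counter(tokenize(text))
    return {vocab[tok]: float(n) for tok, n in counts.items() if tok in vocab}

def score(x, w, b):
    """Signed distance-like score of a sparse vector under the linear model."""
    return sum(w[i] * v for i, v in x.items()) + b

def train_linear_svm(vectors, targets, n_features, epochs=20, lr=0.1, reg=0.01):
    """Hinge loss with L2 regularisation, minimised by subgradient descent."""
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        for x, y in zip(vectors, targets):  # y is +1 (code) or -1 (no code)
            violated = y * score(x, w, b) < 1.0
            w = [wi * (1.0 - lr * reg) for wi in w]  # regularisation shrink
            if violated:  # inside the margin or misclassified: push it out
                for i, v in x.items():
                    w[i] += lr * y * v
                b += lr * y
    return w, b

def predict(text, vocab, w, b):
    """Return the label value for one document."""
    return "yes" if score(vectorize(text, vocab), w, b) > 0 else "no"

# Hypothetical hand-labeled examples; the label is "Contains Source Code".
labeled_docs = [
    ("public static void main(String[] args) { int x = 0; return; }", "yes"),
    ("for (int i = 0; i < n; i++) { if (x) { return i; } else { continue; } }", "yes"),
    ("def parse(self): import os; return None if not self else self.name", "yes"),
    ("Dear team, please find the meeting notes from Tuesday attached.", "no"),
    ("The weather was lovely and we walked along the river for an hour.", "no"),
    ("Thank you for your order; it will ship within three business days.", "no"),
]

texts = [text for text, _ in labeled_docs]
vocab = build_vocabulary(texts)
vectors = [vectorize(text, vocab) for text in texts]
targets = [1.0 if label == "yes" else -1.0 for _, label in labeled_docs]
w, b = train_linear_svm(vectors, targets, len(vocab))

# Apply the trained model to a document that was not scored by hand.
unscored = "int total = 0; if (total) { return total; } else { return 0; }"
print(predict(unscored, vocab, w, b))
```

Six documents are far too few for a real model; as noted above, you would want at least a few dozen scored examples of each outcome. The point is only that nothing in the code lists reserved words: the weights for tokens such as `return` or `int` come entirely from the labeled examples.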
0 -
Hi Brian,
Thank you very much for your post. I just wanted to clarify a few things about your answer. The documents I have are a mixture of source code and normal plain text. What I want to be able to do is automatically categorise those files which contain at least some source code. Presumably this will look for things like reserved words and so forth. Do I need to add any document tags beyond just adding a binomial label of contains source code/doesn't contain source code? Thank you.
0 -
As long as you present it with properly labeled cases for training, the model should be able to figure out which words (tokens) are characteristic of source code and which ones are not. So you will need to do the text preprocessing I described, but you do not need to tell the model explicitly which tokens are associated with source code. If you did that, you would be using a deterministic approach (a series of rules) rather than a machine learning algorithm.
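For contrast, this is roughly what the deterministic, rule-based alternative would look like. It is a hypothetical Python sketch: the reserved-word list and the threshold of three distinct hits are arbitrary choices made for illustration, not anything taken from RapidMiner.

```python
import re

# Hypothetical, hand-picked list of reserved words (case-sensitive).
RESERVED_WORDS = {
    "int", "void", "return", "while", "for", "if", "else",
    "class", "public", "static", "import", "def",
}

def contains_source_code(text, min_hits=3):
    """Rule: flag the document if it uses at least min_hits distinct reserved words."""
    tokens = set(re.findall(r"[A-Za-z_]+", text))
    return len(tokens & RESERVED_WORDS) >= min_hits

print(contains_source_code("public static void main(String[] args) { return; }"))
print(contains_source_code("Thank you for your order; it will ship within three business days."))
```

With rules like these you have to maintain the word list and the threshold yourself, and ordinary prose can trip them: a sentence such as "If you return the form, we will import it for you." contains three words from the list and would be flagged. A trained model instead learns from the labeled cases which tokens matter and how much.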