multidimensional text mining

anahochmanova
anahochmanova New Altair Community Member
edited November 5 in Community Q&A
Hi! I'm looking for a way to find relationships in text files. I have 150 text files, but each of them is written using 3 different criteria, so I have 450 text files in total. For each letter in each word I have 3 different representations.
Each of the 150 text files is also labeled, meaning that each file has a category (A or B).
I need to find relationships involving the words, such as association rules, and some way to build a classification model, such as a decision tree.

Is there a way to take advantage of the 3 different representations? I could build a model for each of them, but I don't know how to represent all 3 simultaneously.
My texts are not in any language and do not contain real words. I just have sequences of letters, and each letter can take 3 different values.

Sorry if this is confusing. Please ask and I'll try to explain it better.

Best
Ana

Answers

  • MariusHelf
    MariusHelf New Altair Community Member
    Hi Ana,

    Indeed, this is confusing. Is it important to have three different representations? Do all representations contain the same information, or do you get additional information from each representation? If the information is the same, you should focus on a single representation.

    Additionally, can you post a small example of how a text file looks, and which type of information you want to extract?

    Best regards,
    Marius
  • anahochmanova
    anahochmanova New Altair Community Member
    Yes, I have 150 participants doing activities over a period of about 5 to 6 months. There are 25 different activities.
    I thought about representing each activity with a letter. The activities a participant does in one day form a short sequence of letters (a word), because they do up to 10 activities in a day. If more than 3 days pass without activities, I start a new sentence.
    So I thought about representing all of the activities as text.

    Each activity is defined by 3 features: the name, the result and the function. The name, as I've said, is a letter (there are 25 different names); the result can also be represented as a letter (there are 20 different results); and there are 15 different functions.
    So each feature generates one text per participant, and I can't figure out how to represent all of them simultaneously to extract association rules. The ultimate objective is to find relations involving the results, the functions and the activities, for example in a decision tree.
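
One possible way to keep the three features together, offered only as a sketch: instead of three parallel texts, build a single text in which every activity is one composite token (name, result, function). The record layout `(day, name, result, function)`, the sample values, the `|` separator and the helper names `to_sentences`, `to_transactions` and `pair_support` are all assumptions made for illustration; they do not come from the thread. The segmentation follows the rule described above: one word per day, and a new sentence after more than 3 days without activities.

```python
from collections import Counter
from itertools import combinations

# Hypothetical log for ONE participant: (day_number, name, result, function).
# All values are invented for illustration.
log = [
    (1, "A", "p", "x"),
    (1, "C", "q", "x"),
    (2, "A", "p", "y"),
    (7, "B", "q", "x"),  # days 3-6 are idle (4 days > 3), so a new sentence starts
    (7, "A", "p", "x"),
]


def to_sentences(records, max_idle_days=3):
    """Return sentences -> day-words -> composite tokens 'name|result|function'."""
    sentences = []
    last_day = None
    # sorted() is stable, so the order of activities within a day is preserved.
    for day, name, result, function in sorted(records, key=lambda r: r[0]):
        token = f"{name}|{result}|{function}"
        if last_day is None or day - last_day - 1 > max_idle_days:
            sentences.append([[token]])      # first word of a new sentence
        elif day != last_day:
            sentences[-1].append([token])    # new day: new word
        else:
            sentences[-1][-1].append(token)  # same day: same word
        last_day = day
    return sentences


def to_transactions(sentences):
    """Turn each day-word into an item set usable for association-rule mining."""
    transactions = []
    for sentence in sentences:
        for word in sentence:
            items = set()
            for token in word:
                name, result, function = token.split("|")
                items.update(
                    {f"name={name}", f"result={result}", f"function={function}"}
                )
            transactions.append(items)
    return transactions


def pair_support(transactions):
    """Count in how many transactions each pair of items occurs together."""
    counts = Counter()
    for items in transactions:
        counts.update(combinations(sorted(items), 2))
    return counts


sentences = to_sentences(log)
transactions = to_transactions(sentences)
support = pair_support(transactions)
print(support.most_common(3))
```

Prefixing each item with its feature (`name=`, `result=`, `function=`) keeps the three alphabets from colliding when the same letter is used in more than one of them. The item sets could then be passed to any association-rule tool, and per-participant counts of the same items could serve as columns for a decision tree on the A/B label. Whether one transaction per day or one per single activity is the better unit depends on the kind of relations being sought.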