text mining (classification

mksaad
edited November 5 in Community Q&A
Hello all,

I have read many tutorials on text mining (TM), including several that use RapidMiner (RM).

Most of these tutorials use support vector machines (SVM) and Naive Bayes (NB) as classification methods, so I assume they are the best algorithms for text classification.
Do you recommend these algorithms, or are there other suitable algorithms for text classification? (I am looking for algorithms that are implemented in RM.)
If SVM and NB are the best ones, any references on that would be appreciated.


I would also appreciate any recommendations for RM clustering algorithms for text.


Thanks in advance,
--
Motaz K. Saad

Answers

  • land
    Hi,
    I would suggest any clustering algorithm that supports cosine similarity. And, as always, KMeans is worth a try.

    Greetings,
      Sebastian
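To make the cosine similarity suggestion concrete, here is a minimal Python sketch (not RapidMiner code) of how the measure compares two documents represented as bags of words. The helper names `term_counts` and `cosine_similarity` and the three sample documents are hypothetical, chosen only for illustration; a real setup would normally use TF-IDF weights rather than raw counts.

```python
import math
from collections import Counter

def term_counts(text):
    """Turn a text into a sparse bag-of-words vector (lowercased, whitespace-split)."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    # Only terms present in both documents contribute to the dot product.
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical sample documents.
docs = [
    "java developer with spring experience",
    "senior java developer",
    "sql database administrator",
]
vectors = [term_counts(d) for d in docs]

sim_java = cosine_similarity(vectors[0], vectors[1])   # two shared terms
sim_mixed = cosine_similarity(vectors[0], vectors[2])  # no shared terms
print(sim_java, sim_mixed)
```

Because the measure depends only on the angle between vectors, a long and a short document on the same topic can still score as similar, which is one reason it is often preferred over Euclidean distance for text.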
  • gunjanamit
    Motaz,

    Have you done anything on Text Classification?

    I need help there...
  • mksaad
    Hello,

    You can take a look at http://sites.google.com/site/motazsite/publications

    There you can find conclusions on Arabic text classification and on text classification in general.


    Regards,
    Motaz
  • jforr
    Is there a good algorithm to use when my documents can have multiple categories assigned to them? An example might be resumes, where some belong to Java developers, some to SQL developers, and some to developers who know both Java and SQL.
  • MariusHelf
    Hi, you can use Polynominal by Binominal Classification for this. This operator trains a model with its inner process, which learns to discriminate each class from all other classes. During application, a confidence is calculated for each class, and the class with the highest confidence is predicted. Please have a look at the attached process.

    Best, Marius
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.2.006">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.2.006" expanded="true" name="Process">
        <process expanded="true" height="494" width="752">
          <operator activated="true" class="generate_data" compatibility="5.2.006" expanded="true" height="60" name="Generate Data" width="90" x="45" y="30">
            <parameter key="target_function" value="three ring clusters"/>
            <parameter key="number_of_attributes" value="2"/>
          </operator>
          <operator activated="true" class="polynomial_by_binomial_classification" compatibility="5.2.006" expanded="true" height="76" name="Polynominal by Binominal Classification" width="90" x="246" y="30">
            <process expanded="true" height="512" width="770">
              <operator activated="true" class="naive_bayes" compatibility="5.2.006" expanded="true" height="76" name="Naive Bayes" width="90" x="313" y="30"/>
              <connect from_port="training set" to_op="Naive Bayes" to_port="training set"/>
              <connect from_op="Naive Bayes" from_port="model" to_port="model"/>
              <portSpacing port="source_training set" spacing="0"/>
              <portSpacing port="sink_model" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="apply_model" compatibility="5.2.006" expanded="true" height="76" name="Apply Model" width="90" x="461" y="30">
            <list key="application_parameters"/>
          </operator>
          <connect from_op="Generate Data" from_port="output" to_op="Polynominal by Binominal Classification" to_port="training set"/>
          <connect from_op="Polynominal by Binominal Classification" from_port="model" to_op="Apply Model" to_port="model"/>
          <connect from_op="Polynominal by Binominal Classification" from_port="example set" to_op="Apply Model" to_port="unlabelled data"/>
          <connect from_op="Apply Model" from_port="labelled data" to_port="result 2"/>
          <connect from_op="Apply Model" from_port="model" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>
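The one-against-all idea described in this reply can be sketched in plain Python. This is an illustration under stated assumptions, not the RapidMiner operator: `BinaryNaiveBayes`, `train_one_vs_rest`, `predict_single`, `predict_multi`, the toy training data, and the 0.5 threshold are all hypothetical. `predict_single` mirrors the "highest confidence wins" behaviour described above. `predict_multi` is an assumed extension for the resume question: it keeps every class whose confidence passes the threshold, so a document can receive several labels.

```python
import math
from collections import Counter

class BinaryNaiveBayes:
    """Two-class multinomial Naive Bayes with Laplace smoothing (illustrative only)."""

    def fit(self, docs, is_positive):
        self.counts = {True: Counter(), False: Counter()}
        self.n_docs = {True: 0, False: 0}
        for tokens, flag in zip(docs, is_positive):
            self.counts[flag].update(tokens)
            self.n_docs[flag] += 1
        self.vocab = set(self.counts[True]) | set(self.counts[False])
        return self

    def confidence(self, tokens):
        """Posterior probability that the document belongs to the positive class."""
        total_docs = self.n_docs[True] + self.n_docs[False]
        log_score = {}
        for flag in (True, False):
            total_terms = sum(self.counts[flag].values())
            # Smoothed prior, so a side with no examples never produces log(0).
            score = math.log((self.n_docs[flag] + 1) / (total_docs + 2))
            for t in tokens:
                if t in self.vocab:  # terms never seen in training are ignored
                    score += math.log(
                        (self.counts[flag][t] + 1) / (total_terms + len(self.vocab))
                    )
            log_score[flag] = score
        # Normalize in log space to avoid underflow on long documents.
        top = max(log_score.values())
        pos = math.exp(log_score[True] - top)
        neg = math.exp(log_score[False] - top)
        return pos / (pos + neg)

def train_one_vs_rest(docs, label_sets):
    """Train one binary model per class: that class against all others."""
    classes = sorted(set().union(*label_sets))
    return {
        c: BinaryNaiveBayes().fit(docs, [c in labels for labels in label_sets])
        for c in classes
    }

def confidences(models, tokens):
    return {c: model.confidence(tokens) for c, model in models.items()}

def predict_single(models, tokens):
    """Single-label prediction: the class with the highest confidence."""
    conf = confidences(models, tokens)
    return max(conf, key=conf.get)

def predict_multi(models, tokens, threshold=0.5):
    """Assumed multi-label variant: every class at or above the threshold."""
    return {c for c, p in confidences(models, tokens).items() if p >= threshold}

# Hypothetical tokenized resumes; the last one carries both labels.
train_docs = [
    ["java", "spring", "maven"],
    ["java", "jvm", "spring"],
    ["sql", "postgres", "queries"],
    ["sql", "queries", "indexes"],
    ["java", "sql", "spring", "queries"],
]
train_labels = [{"java"}, {"java"}, {"sql"}, {"sql"}, {"java", "sql"}]

models = train_one_vs_rest(train_docs, train_labels)
print(predict_single(models, ["java", "spring"]))
print(predict_multi(models, ["java", "spring", "sql", "queries"]))
```

The threshold is a tuning choice: a lower value assigns more labels per document, a higher value fewer. Any binary learner that outputs a confidence could replace the Naive Bayes class here.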
  • jforr
    Thanks, I'll try that.