"Text Mining - Clustering Task - DISCOVER THE CONTENT OF EACH CLUSTER"
Marcello_Sandi
Hi,
My problem is unsupervised learning because, as I said, my BOW has exactly 2290 attributes and 1572 examples. It doesn't have any label, just descriptors extracted from the texts and one attribute holding the document name, which I set as the label.
I need to find the optimal number of clusters first, and I built that model to discover it. I didn't know that RapidMiner's KMeans already had an implementation that deals with the local minimum problem.
As an aside, what kind of algorithm or theory do you use in this case? I only need to cite a reference in my thesis and explain it.
So I let "ParameterIteration" run over an interval of desired cluster counts and exclude "RandomOptimizer" because it's not necessary. Do you have another suggestion?
Finally, I want to measure the quality of my clusters. Using "ParameterIteration" I can generate a scatter plot from "ClusterCentroidEvaluator" and see how the AVG and DB distances relate for each cluster. Do you have any other option?
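For readers following along, here is a minimal Python sketch of the two measures mentioned above: the average distance of each example to its cluster centroid, and the Davies-Bouldin (DB) index. It uses the textbook definitions with Euclidean distance; the function names are made up for illustration, and the numbers RapidMiner's "ClusterCentroidEvaluator" reports may be scaled or signed differently.

```python
import math
from collections import defaultdict


def group_by_cluster(points, labels):
    """Collect the points that belong to each cluster id."""
    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        clusters[label].append(point)
    return clusters


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(points) for column in zip(*points)]


def avg_within_centroid_distance(points, labels):
    """Mean Euclidean distance from each point to its own cluster centroid."""
    clusters = group_by_cluster(points, labels)
    centers = {c: centroid(members) for c, members in clusters.items()}
    total = sum(math.dist(p, centers[l]) for p, l in zip(points, labels))
    return total / len(points)


def davies_bouldin(points, labels):
    """Textbook Davies-Bouldin index; lower means tighter, better separated clusters.

    Assumes at least two clusters with distinct centroids.
    """
    clusters = group_by_cluster(points, labels)
    centers = {c: centroid(members) for c, members in clusters.items()}
    # Scatter of a cluster: mean distance of its members to its centroid.
    scatter = {
        c: sum(math.dist(p, centers[c]) for p in members) / len(members)
        for c, members in clusters.items()
    }
    # For each cluster, take the worst (largest) similarity ratio to any other cluster.
    worst_ratios = [
        max(
            (scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
            for j in clusters
            if j != i
        )
        for i in clusters
    ]
    return sum(worst_ratios) / len(worst_ratios)


if __name__ == "__main__":
    # Toy data: two clearly separated groups.
    pts = [(0, 0), (0, 2), (10, 0), (10, 2)]
    lbl = [0, 0, 1, 1]
    print(avg_within_centroid_distance(pts, lbl), davies_bouldin(pts, lbl))
```

Computing both values for every candidate number of clusters and plotting them against k gives the same kind of picture as the scatter plot described above.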
The problem, in this case, is that there are a lot of attributes, i.e., a lot of descriptors.
I want to label or characterize each cluster.
I would be very grateful and happy for any help.
Marcello Sandi
land
Hi Marcello,
your setup seems well suited to your case. For cluster characterization, an understandable classification model is usually used, for example the one-rule learner or a tree with a small depth.
KMeans is restarted as often as specified, and the solution with the minimal intra-cluster distance is chosen, if I remember correctly.
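The restart idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not RapidMiner's actual implementation: it assumes Euclidean distance, random initial centroids drawn from the data, and the average distance to the own centroid as the score; `kmeans_once` and `kmeans_restarts` are hypothetical names.

```python
import math
import random


def kmeans_once(points, k, rng, max_iter=100):
    """One k-means run from a random start; returns (cost, labels, centroids)."""
    centroids = rng.sample(points, k)  # pick k data points as initial centroids
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:  # assignments stopped changing: converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    # Score of this run: average distance of every point to its centroid.
    cost = sum(math.dist(p, centroids[l]) for p, l in zip(points, labels)) / len(points)
    return cost, labels, centroids


def kmeans_restarts(points, k, runs=10, seed=0):
    """Run k-means `runs` times and keep the run with the smallest cost."""
    rng = random.Random(seed)
    return min((kmeans_once(points, k, rng) for _ in range(runs)), key=lambda r: r[0])


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    cost, labels, centroids = kmeans_restarts(data, k=2, runs=5)
    print(cost, labels)
```

Each single run can end in a different local minimum depending on its starting centroids; keeping the best of several runs reduces the influence of an unlucky start but does not guarantee the global optimum.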
Greetings,
Sebastian
Marcello_Sandi
Hi Sebastian,
You are the man.....
If possible, please set up an example model for me. I'm still not able to do it on my own.
About KMeans, could you say a bit more about "the solution with the minimal intra cluster distance is chosen"?
Could you tell me what that solution is called? Just the name of the algorithm or the theory is enough for me.
Thanks for all,
Marcello
land
Hi Marcello,
you simply have to change the cluster attribute's role into label and then use the learner. I think you will be able to set up this process on your own.
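To make the idea concrete outside RapidMiner, here is a rough Python sketch of a one-rule learner applied to cluster ids used as the label. It assumes the bag-of-words weights have been reduced to discrete values (here 1 = term present, 0 = term absent), which is a simplification and not something the process in this thread does; `one_rule` and the toy terms are hypothetical.

```python
from collections import Counter, defaultdict


def one_rule(rows, labels, attribute_names):
    """Return (attribute, rule, errors) for the best single-attribute rule.

    For every attribute, each observed value is mapped to the most frequent
    label among the examples having that value; the attribute whose mapping
    misclassifies the fewest examples wins.
    """
    best = None
    for index, name in enumerate(attribute_names):
        # Count labels per attribute value.
        counts = defaultdict(Counter)
        for row, label in zip(rows, labels):
            counts[row[index]][label] += 1
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(sum(c.values()) - c.most_common(1)[0][1] for c in counts.values())
        if best is None or errors < best[2]:
            best = (name, rule, errors)
    return best


if __name__ == "__main__":
    terms = ["engine", "goal"]                   # toy descriptors
    rows = [(1, 0), (1, 1), (0, 1), (0, 0)]      # term present (1) or absent (0)
    clusters = ["c0", "c0", "c1", "c1"]          # cluster ids used as the label
    print(one_rule(rows, clusters, terms))
```

The resulting rule names one descriptor and says which cluster each of its values points to, which is the kind of short, readable description a cluster characterization is after.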
"Choosing the solution with the minimal average intra cluster distance" very well describes the algorithm. I don't think there's a special name for this three liner.
Greetings,
Sebastian
Marcello_Sandi
Sebastian,
I built this model. Is it good?
<operator name="Processo de Optimização do Centroid do KMeans" class="Process" expanded="yes">
    <description text="#ylt#p#ygt#This process shows how restarts can be performed in order to find the optimal clustering independent of the initialization. #ylt#/p#ygt#"/>
    <parameter key="logverbosity" value="warning"/>
    <operator name="Gerar Dados" class="OperatorChain" expanded="yes">
        <operator name="Light SN txRelev" class="ExampleSource">
            <parameter key="attributes" value="/home/msandi/workspace/modelos/light/sn_10_txRelev/light_sn_10_txRelev.aml"/>
        </operator>
        <operator name="Filtrando Cluster" class="AttributeFilter">
            <parameter key="condition_class" value="attribute_name_filter"/>
            <parameter key="parameter_string" value="label"/>
            <parameter key="invert_filter" value="true"/>
            <parameter key="apply_on_special" value="true"/>
        </operator>
    </operator>
    <operator name="KMeans Distância Euclidiana" class="KMeans">
        <parameter key="k" value="3"/>
    </operator>
    <operator name="Marcando Cluster id como Rótulo" class="ChangeAttributeRole">
        <parameter key="name" value="cluster"/>
        <parameter key="target_role" value="label"/>
    </operator>
    <operator name="XValidationParallel" class="XValidationParallel" expanded="yes">
        <parameter key="keep_example_set" value="true"/>
        <parameter key="create_complete_model" value="true"/>
        <parameter key="number_of_threads" value="4"/>
        <operator name="DecisionTreeParallel" class="DecisionTreeParallel">
            <parameter key="criterion" value="gini_index"/>
            <parameter key="number_of_threads" value="4"/>
        </operator>
        <operator name="Testando o Modelo" class="OperatorChain" expanded="yes">
            <operator name="Aplicando o Modelo" class="ModelApplier">
                <list key="application_parameters">
                </list>
            </operator>
            <operator name="Performance" class="Performance">
            </operator>
        </operator>
    </operator>
</operator>
With this model I want to describe the clusters. It generated a tree with just three levels and six attributes. I already set the confidence parameters to a high level and nothing changed.
I need more attributes to describe each cluster. I am not concerned with accuracy, in this case.
Please, could you give me another suggestion?
Thanks for all,
Marcello
land
Hi,
only to try different learning schemes. I'm sorry, but any further suggestions would require taking a look at the data, and that would definitely be beyond the scope of this forum.
Greetings,
Sebastian