Select column with non-zero value
ElenaVet
New Altair Community Member
Hi everybody!
I've calculated TF-IDF with "Process Documents from Data" and got a matrix with one word per column and one document body per row, where every cell contains the TF-IDF value. Now I filter by cluster (created with k-means), and I want to see only the columns with non-zero values. I first thought of summing each column's values (with Aggregate) and keeping only those with a sum greater than zero, but I think summing TF-IDF values is a mistake that would distort the whole analysis. Can you please tell me how to keep only the columns with at least one value different from zero?
Thank you so much!
Answers
-
Have you tried looking at the cluster centroid output? This is essentially giving you the average value for each cluster for each attribute. You should be able to filter that more easily.
If you don't want to use that approach, you would need to loop over each cluster, do an aggregation using the max function, and remove the attributes that have a max value of zero.
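For readers who want to see the logic outside the process designer, the loop-and-max idea can be sketched in plain Python. This is only an illustration, not a RapidMiner process: the function name `nonzero_columns_per_cluster`, the `cluster` key, and the sample rows are all made up for the example, and it assumes the TF-IDF table has been exported as one dictionary per document.

```python
from collections import defaultdict


def nonzero_columns_per_cluster(rows, cluster_key="cluster"):
    """Return {cluster: [columns whose max value in that cluster is > 0]}.

    `rows` is a list of dicts, one per document. Every key except
    `cluster_key` is assumed to hold a non-negative TF-IDF value.
    """
    col_max = defaultdict(dict)  # cluster -> {column name: max value seen}
    for row in rows:
        maxima = col_max[row[cluster_key]]
        for name, value in row.items():
            if name == cluster_key:
                continue
            maxima[name] = max(maxima.get(name, 0.0), value)
    # Keep a column only if at least one document in the cluster uses the word.
    return {
        cluster: [name for name, m in maxima.items() if m > 0]
        for cluster, maxima in col_max.items()
    }


# Hypothetical sample data: three documents, two clusters, three tokens.
rows = [
    {"cluster": "cluster_0", "aapl": 0.42, "bank": 0.0, "rate": 0.0},
    {"cluster": "cluster_0", "aapl": 0.10, "bank": 0.0, "rate": 0.31},
    {"cluster": "cluster_1", "aapl": 0.0, "bank": 0.55, "rate": 0.0},
]
kept = nonzero_columns_per_cluster(rows)
print(kept)  # {'cluster_0': ['aapl', 'rate'], 'cluster_1': ['bank']}
```

Because TF-IDF values are never negative, a column's maximum is greater than zero exactly when at least one cell is non-zero. For the same reason, a sum greater than zero would select the same columns when used purely as a filter.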
Hi @Telcontar120,
thank you for your answer! I found the cluster centroid output, as you suggested, but I don't really understand the value in each cell. Can you explain it to me, please? I've attached a screenshot of my results.
-
Cluster centroids are showing the average value of the word vector metric (using whatever parameter metric you selected such as TF-IDF) for each cluster for each attribute. You can see, for instance, the cluster that has the highest value for the token "aapl" is cluster 12. You can use this to understand what attributes are most dominant for any particular cluster by sorting and filtering. You can also compute differences between clusters if you like.
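To make the meaning of a centroid cell concrete, here is a rough Python sketch that computes per-cluster averages by hand and then ranks tokens within a cluster. It is an illustration under assumptions, not what the k-means operator runs internally: the helper names `centroids` and `top_tokens`, the `cluster` key, and the sample rows are hypothetical.

```python
from collections import defaultdict


def centroids(rows, cluster_key="cluster"):
    """Average every attribute over the documents of each cluster."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for row in rows:
        cluster = row[cluster_key]
        counts[cluster] += 1
        for name, value in row.items():
            if name != cluster_key:
                sums[cluster][name] += value
    return {
        cluster: {name: total / counts[cluster] for name, total in cols.items()}
        for cluster, cols in sums.items()
    }


def top_tokens(centroid, n=2):
    """Return up to n tokens with the highest non-zero centroid value."""
    ranked = sorted(centroid.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, value in ranked[:n] if value > 0]


# Hypothetical sample data: three documents, two clusters, three tokens.
rows = [
    {"cluster": "cluster_0", "aapl": 0.42, "bank": 0.0, "rate": 0.0},
    {"cluster": "cluster_0", "aapl": 0.10, "bank": 0.0, "rate": 0.31},
    {"cluster": "cluster_1", "aapl": 0.0, "bank": 0.55, "rate": 0.0},
]
cents = centroids(rows)

# Which cluster has the highest average value for the token "aapl"?
best_for_aapl = max(cents, key=lambda c: cents[c]["aapl"])
print(best_for_aapl)                   # cluster_0
print(top_tokens(cents["cluster_0"]))  # ['aapl', 'rate']
print(top_tokens(cents["cluster_1"]))  # ['bank']
```

A centroid value of zero for a token means no document in that cluster contains it, so filtering the centroid table on non-zero values also answers the original question.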
I noticed you have a lot of clusters. This can make interpretation difficult, so you should probably think about whether you really need this many distinct clusters. Or you could try another approach beyond k-means, such as LDA.
-
Hi,
to add one more thought: the operator Extract Cluster Centroid gives you that table as an example set to work with.
Cheers,
Martin