awchisholm wrote:
Hello,
If you input 1200 examples to the Data to Similarity operator you will get 1200*1199 pairs (about 1.4 million rows), so you're probably getting memory issues. My suggestion is to use the Similarity to Data operator to turn the similarity result back into an example set and see if this displays more efficiently. If not, I would write the result to the repository, a database or a file, and I would disconnect the result from the output so that it does not display at all.
You can then read the result later and use the Filter or Sample operators to extract the bits you're interested in.
Regards,
Andrew
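To make the arithmetic above concrete, here is a minimal sketch in Python. It assumes, as the post does, one output row per ordered pair of distinct examples; the function names are made up for illustration and are not part of RapidMiner.

```python
def ordered_pair_rows(n_examples):
    """Rows if every ordered pair (a, b) with a != b gets its own row."""
    return n_examples * (n_examples - 1)


def unordered_pair_rows(n_examples):
    """Rows if each pair is stored only once, i.e. (a, b) == (b, a)."""
    return n_examples * (n_examples - 1) // 2


n = 1200
print(f"{n} examples -> {ordered_pair_rows(n):,} ordered pairs")
print(f"{n} examples -> {unordered_pair_rows(n):,} unordered pairs")
```

Either way the row count grows with the square of the number of examples, so doubling the input roughly quadruples the output.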
Hi, I found this entry because I am facing the same issue. It takes forever to get the output of a cosine similarity analysis on 4100 documents. I followed some of the suggestions above and my flow is:
Read CSV --> Process Documents from Data --> Data to Similarity --> Similarity to Data --> Write Excel
After 24 hours it is still in the "Similarity to Data" process.
Does anyone have an idea how much time this will take? My PC specifications are as follows:
Windows 10 Enterprise, Version 1607, 64-bit
Processor: Intel Core i5-4310U
CPU: 2.60 GHz
RAM (8GB)
Thanks for any tip
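For a sense of scale, the same pair count quoted earlier in the thread can be applied to 4100 documents. The sketch below assumes one row per ordered pair of distinct documents, and compares the result with the 1,048,576-row limit of a single .xlsx worksheet. How the Write Excel operator behaves beyond that limit is not covered in this thread, so this is only a rough sizing check.

```python
EXCEL_MAX_ROWS = 1_048_576  # row limit of one worksheet in the .xlsx format


def similarity_rows(n_docs):
    """Assumed output size: one row per ordered pair of distinct documents."""
    return n_docs * (n_docs - 1)


rows = similarity_rows(4100)
# Ceiling division: how many full worksheets the rows would occupy.
sheets_needed = -(-rows // EXCEL_MAX_ROWS)

print(f"4100 documents -> {rows:,} rows")
print(f"That is about {sheets_needed} worksheets' worth of rows")
```

Under that assumption the result does not fit in one worksheet, which is one more reason to consider the repository, a database, or a plain file, as suggested above.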
Hello @roberto_r_herma - process time varies a lot depending on many factors, including your machine, the size and scope of the documents, etc. One thing that I can definitely tell you is that RapidMiner loves RAM and multi-core processors. FWIW, I just upgraded to 64GB of RAM with my 6-core Intel Xeon E5 to keep things humming along.
If I were you, I'd use the Sample operator and grab a small sample of your documents first. Benchmark the sample and then gradually increase its size so you can get a sense of whether the full 4100 docs are going to take 2 days or 2 years.
Scott
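One way to turn the sampling advice above into a number is to time a few sample sizes and extrapolate. The sketch below fits a power law (seconds = c * n^k) to the timings on a log-log scale and projects it to 4100 documents. The timings in it are placeholders, not real measurements, and the function names are hypothetical; replace the lists with your own benchmark results. It also assumes the runs stay in memory, since swapping to disk would break the trend.

```python
import math


def fit_power_law(sizes, seconds):
    """Least-squares fit of seconds = c * n**k on a log-log scale."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in seconds]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    k = sxy / sxx                       # slope = scaling exponent
    c = math.exp(mean_y - k * mean_x)   # intercept = constant factor
    return c, k


def extrapolate(sizes, seconds, target_n):
    """Predicted run time in seconds for target_n documents."""
    c, k = fit_power_law(sizes, seconds)
    return c * target_n ** k


# Placeholder values for illustration only; use your own measurements.
sample_sizes = [100, 200, 400]
sample_seconds = [1.0, 4.0, 16.0]

c, k = fit_power_law(sample_sizes, sample_seconds)
estimate = extrapolate(sample_sizes, sample_seconds, 4100)
print(f"scaling exponent ~ {k:.2f}")
print(f"estimated time for 4100 docs ~ {estimate / 3600:.2f} hours")
```

An exponent near 2 would match the pairwise growth described earlier in the thread. A much larger exponent on the bigger samples is a hint that the machine is running out of RAM.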
Thanks for the tip!