"Association Rules hangs machine"

mimesis · April 2011

Hi,
I am working on a text analysis project of news articles grabbed from the internet. I have been successful in extracting the content, preprocessing (tokenizing, filtering, transforming case, etc.) and generating a binary term occurrence exampleset. I followed Neil M's video and used Process documents from files, Numerical to Binomial and FP Growth. Using breakpoints, all these stages appear to be fine and run quickly. When I attempt to create Association rules, the machine runs and seems to hang, regardless of the memory allocated when I start Java. I use the command line and, for instance, assigned memory as follows:

java -Xmx12000m -jar rapidminer.jar

Right now, the program seems locked up (still calculating the rules at 13 minutes) and one CPU is at 100%. My dataset only has five documents in it, and the rest of the process ran in 3 seconds or less on an 8-core Mac Pro. I noticed that the memory usage would climb and max out, regardless of how much I allocated (I have tried various lower and upper limits, and read on a Weka site that one should not set both -Xms and -Xmx at the same time).

Any ideas?

William

haddock · April 2011

Hi William,

Sadly it looks like the code is, shall we say, 'bold'....

// generating rule by splitting set in every two parts for head and body of rule
if (set.getItems().size() > 1) {
    PowerSet<Item> powerSet = new PowerSet<Item>(set.getItems());

Which would mean quite a wait for longer frequent item sets, as there are 2^N subsets of an N-item frequent item set. So a 50-word itemset would need the space for just 2^50 = 1,125,899,906,842,624 sets. Bold indeed.
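
To make the growth concrete, here is a small sketch that prints the subset count for a few itemset sizes. The class and method names are made up for illustration and are not taken from the RapidMiner source; the rule count assumes that both the body and the head of a rule must be non-empty, which leaves 2^N - 2 ways to split an N-item set.

```java
// Sketch: how fast the power set of a frequent itemset grows.
// SubsetGrowth and its methods are illustrative names, not RapidMiner code.
class SubsetGrowth {

    // Number of subsets of an n-item set (2^n); a long can hold this for 0 <= n <= 62.
    static long subsetCount(int n) {
        if (n < 0 || n > 62) {
            throw new IllegalArgumentException("n must be between 0 and 62");
        }
        return 1L << n;
    }

    // Number of ways to split an n-item set into a non-empty body and a
    // non-empty head (assumption: neither side of a rule may be empty).
    static long candidateRuleCount(int n) {
        return n < 2 ? 0 : subsetCount(n) - 2;
    }

    public static void main(String[] args) {
        for (int n : new int[] {5, 10, 20, 30, 50}) {
            System.out.printf("%2d items -> %,d subsets, %,d candidate rules%n",
                    n, subsetCount(n), candidateRuleCount(n));
        }
    }
}
```

Every extra item in an itemset doubles the work, so going from a 20-item set to a 50-item set multiplies the subset count by 2^30, roughly a billion.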

IngoRM · April 2011

Hi there,

that's sad but true. However, even if we did not store the rules (before throwing most of them away because their confidence is too low), generating them one by one would still take much too long. So the recommendation here is: Don't create rules for long frequent itemsets. For many applications the frequent itemsets are as interesting as the rules anyway. But if you need the rules, you should restrict the number of items by increasing the minimum support (always start very high, especially if the base number of items is high). There is also an option in FP-Growth to restrict the maximum number of items per itemset, which should also be used in that case.
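
As a rough illustration of why capping the itemset length helps, the following sketch computes a worst-case upper bound on the number of itemsets that can exist when at most k items are allowed per itemset. It assumes every candidate itemset turns out to be frequent, which real data will rarely do, and the class and method names are hypothetical rather than part of any RapidMiner API.

```java
// Sketch: worst-case number of itemsets when the itemset length is capped.
// ItemsetBounds and its method names are illustrative, not RapidMiner API.
class ItemsetBounds {

    // Binomial coefficient C(n, k); throws ArithmeticException on long overflow.
    static long choose(int n, int k) {
        if (k < 0 || k > n) {
            return 0;
        }
        long result = 1;
        for (int i = 1; i <= k; i++) {
            // After this step result equals C(n - k + i, i), so the division is exact.
            result = Math.multiplyExact(result, (long) (n - k + i)) / i;
        }
        return result;
    }

    // Upper bound on the number of non-empty itemsets with at most maxItems
    // items, drawn from distinctItems items (assumes all of them are frequent).
    static long maxItemsets(int distinctItems, int maxItems) {
        long total = 0;
        for (int size = 1; size <= Math.min(distinctItems, maxItems); size++) {
            total = Math.addExact(total, choose(distinctItems, size));
        }
        return total;
    }

    public static void main(String[] args) {
        for (int cap = 2; cap <= 4; cap++) {
            System.out.printf("50 items, at most %d per itemset: %,d itemsets%n",
                    cap, maxItemsets(50, cap));
        }
    }
}
```

With 50 distinct items and a cap of 3, the bound is 50 + 1,225 + 19,600 = 20,875 itemsets, and each of those has at most 2^3 - 2 = 6 ways to be split into a rule, which is a very different order of magnitude from 2^50.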

Hope that helps,
Ingo

haddock · April 2011

Greets!

I've been messing around with this issue in my CUDA stuff, and used the 'Banker's sequence', which "generates a sequence of all subsets of a set of n elements in which the number of elements in each subset is monotonically increasing", which has some obvious advantages. Details for the deranged? Right here... http://applied-math.org/subset.pdf
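
For anyone who wants to play with the idea, here is a minimal sketch of enumerating subsets so that their size never decreases. It simply walks through all k-combinations for k = 0, 1, 2, ... in lexicographic order; that gives the monotonically increasing property quoted above, but it is not necessarily the exact ordering or algorithm from the linked paper, and the names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch: enumerate subsets of {0, ..., n-1} in order of non-decreasing size.
// SizeOrderedSubsets is an illustrative name, not code from the paper or RapidMiner.
class SizeOrderedSubsets {

    // Returns all subsets with at most maxSize elements as sorted index arrays,
    // smallest subsets first; within one size the order is lexicographic.
    static List<int[]> subsetsBySize(int n, int maxSize) {
        List<int[]> result = new ArrayList<>();
        for (int k = 0; k <= Math.min(n, maxSize); k++) {
            int[] combo = new int[k];
            for (int i = 0; i < k; i++) {
                combo[i] = i; // first combination of size k: 0, 1, ..., k-1
            }
            while (true) {
                result.add(combo.clone());
                // Find the rightmost position that can still be incremented.
                int pos = k - 1;
                while (pos >= 0 && combo[pos] == n - k + pos) {
                    pos--;
                }
                if (pos < 0) {
                    break; // that was the last combination of size k
                }
                combo[pos]++;
                for (int i = pos + 1; i < k; i++) {
                    combo[i] = combo[i - 1] + 1;
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (int[] subset : subsetsBySize(4, 2)) {
            System.out.println(Arrays.toString(subset));
        }
    }
}
```

The practical advantage of a size-ordered enumeration is that you can stop as soon as a size limit is reached, instead of materialising the whole power set first and filtering afterwards.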

IngoRM · April 2011

Hi again,

yip, that's definitely better, thanks for pointing it out. I wonder if this would still need too much time for item sets with 50 items, at least for non-CUDA implementations.

Cheers,
Ingo

mimesis · April 2011

Hi all,

Thanks for your help. Ingo, your comment was particularly useful: I went back and examined the settings for support and for max items per set, which produced far fewer itemsets, and consequently Association Rules worked. I'm really quite excited to have pushed past this first small hurdle.

William
