"Apriori Project based on a medical database"
IngoRM
Original messages posted on the SourceForge forum at http://sourceforge.net/forum/forum.php?thread_id=2032138&forum_id=390413
Hi!
We are a couple of Portuguese students working on a project based on association rules, and we have to implement it in RapidMiner using the Apriori algorithm.
We have some questions and are looking for answers.
Our work is about association rules between the procedure and diagnosis codes of surgical interventions. Our database comes from a local hospital.
The database consists of 3 columns: the first is the patient ID (NUM), the second is the procedure code (COD_P), and the third is the diagnosis code (COD_D).
We will have all the database records of surgical interventions performed during 2007. Each intervention has a diagnosis code and another code related to the procedure (example: for the patient with ID number 36699 we have 8050 as the procedure code and 72290 as the diagnosis code). These codes are assigned to all interventions by a coding doctor using an ICD-9 book.
As expected, some patients were operated on more than once in the past year. For that reason we have to “aggregate” all procedure and diagnosis codes for these patients in the following way:
NUM COD_P COD_D
40876 8605 82320
40876 8026 717
40876 4576 1533
In this example, the patient with id=40876 has the following itemset: {8605; 8026; 82320; 717; 4576; 1533}.
Then we have to create the frequent itemsets and, from them, the desired association rules.
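A minimal sketch of the aggregation described above, written in plain Python rather than with RapidMiner operators. The function name `aggregate_by_patient` and the `P:`/`D:` prefixes are assumptions made for illustration; the prefixes only keep a procedure code from being confused with a diagnosis code that happens to have the same digits.

```python
from collections import defaultdict

# Rows as (NUM, COD_P, COD_D), copied from the example table above.
rows = [
    (40876, "8605", "82320"),
    (40876, "8026", "717"),
    (40876, "4576", "1533"),
]

def aggregate_by_patient(rows):
    """Collect every procedure and diagnosis code of a patient into one itemset."""
    baskets = defaultdict(set)
    for num, cod_p, cod_d in rows:
        # Hypothetical prefixes: "P:" marks a procedure, "D:" a diagnosis.
        baskets[num].add("P:" + cod_p)
        baskets[num].add("D:" + cod_d)
    return dict(baskets)

baskets = aggregate_by_patient(rows)
```

Each patient then corresponds to exactly one basket, which is the unit a frequent-itemset algorithm expects.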
We intend to join the diagnoses and procedures in a market-basket format for the analysis, but we don’t know how to do that in RapidMiner. We need to find some operators to do the preprocessing on this database.
Can you help us find a way to do this work in RapidMiner?
Thanks for your attention.
Answer by Tomas:
Hi Ana,
I am not quite sure what you are trying to do, but I will try to illustrate the correct data format.
As a first step, you need to get the data into nominal form, which means that all your items come from a known set of items. For example, your nominal data should look like this:
T1: bread, milk, butter
T2: bread, milk, chocolate
T3: bread, wine
RapidMiner's algorithms require the data to be in binominal format. I think this is best explained with an example:
Transaction | bread milk butter chocolate wine
T1          | 1     1    1      0         0
T2          | 1     1    0      1         0
T3          | 1     0    0      0         1
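The conversion from the nominal lists to the binominal table can be sketched in a few lines of Python. This is only an illustration of the data format, not what any RapidMiner operator does internally; the function name `to_binominal` is made up for the example.

```python
# The three transactions from the nominal example above.
transactions = {
    "T1": ["bread", "milk", "butter"],
    "T2": ["bread", "milk", "chocolate"],
    "T3": ["bread", "wine"],
}

def to_binominal(transactions):
    """Return (column names, {transaction id: list of 0/1 flags})."""
    # Columns are ordered by first appearance of each item.
    items = []
    for basket in transactions.values():
        for item in basket:
            if item not in items:
                items.append(item)
    table = {
        tid: [1 if item in basket else 0 for item in items]
        for tid, basket in transactions.items()
    }
    return items, table

items, table = to_binominal(transactions)
```

Every item becomes its own column, and each transaction becomes one row of flags.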
For the preprocessing part: I do not know RapidMiner's preprocessing operators. It is quite possible that the right operator exists, but when faced with preprocessing tasks I usually use external tools.
When mining association rules, I get my data from a database using DatabaseExampleSource. I create a database view in which I perform the preprocessing and then select the preprocessed values. I advise you to do the same, especially when you are transferring a lot of data over the network.
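A rough sketch of the database-view approach, using Python's built-in `sqlite3` module with an in-memory database so it can be run anywhere. The table name `interventions`, the view name `patient_items`, and the `P`/`D` prefixes are all assumptions for illustration; a real hospital database will differ. The view only produces one (patient, item) pair per row, so pivoting into binominal columns would still have to follow.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE interventions (NUM INTEGER, COD_P TEXT, COD_D TEXT)")
conn.executemany(
    "INSERT INTO interventions VALUES (?, ?, ?)",
    [
        (40876, "8605", "82320"),
        (40876, "8026", "717"),
        (40876, "4576", "1533"),
        (36699, "8050", "72290"),
    ],
)

# The view stacks procedure and diagnosis codes into a single ITEM column.
# UNION (without ALL) also drops duplicate (NUM, ITEM) pairs.
conn.execute(
    """
    CREATE VIEW patient_items AS
    SELECT NUM, 'P' || COD_P AS ITEM FROM interventions
    UNION
    SELECT NUM, 'D' || COD_D AS ITEM FROM interventions
    """
)

rows = conn.execute(
    "SELECT NUM, ITEM FROM patient_items ORDER BY NUM, ITEM"
).fetchall()
```

Doing this step inside the database means only the preprocessed rows travel over the network.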
Regards,
Tomas
Answer by Ana:
Hello Tomas.
First of all thanks for the answer.
Unfortunately I cannot continue this work until I find out what is happening with the preprocessing part.
The Nominal2Binominal operator hasn't been enough. My data is in the right format now, but I need to find a way to “aggregate” all procedure and diagnosis codes for the patients with the same ID number.
Is there any operator that does this?
This part is not being done, and therefore the itemsets are not available.
Thanks again, Ana Coelho
Answer by Tobias:
Hello Ana,
Tomas was generally right in his first post explaining what to do to obtain association rules. As far as I can conclude from your second post, you have already figured out how to do this in RapidMiner.
Unfortunately, there is no operator in RapidMiner yet that aggregates multiple examples/transactions into exactly one and thus changes the unit of analysis from diagnosis/treatment combinations to a patient, to put it in the terminology of your application. We will probably implement such an operator in the future, as we need it ourselves, but I cannot say yet when that will be.
Regards,
Tobias
Answer by Tomas:
Hi Ana,
You need to create a process consisting of these operators:
1. Some example loader that will load your dataset
2. The implementation in RM requires your data to be in binominal format and, from what I read, your data is not, so you need to apply a Nominal2Binominal operator.
3. FpGrowth
4. AssociationRulesGenerator
I am too lazy to open RM, so I hope I got the operator names right.
One final warning: you won't get anything useful from the current version of RM. Because of a bug in AssociationRulesGenerator, you will get completely wrong rules. The frequent itemsets should be fine, so if you are happy with just frequent itemsets, try it, or try the latest beta (although beta1 produces even more "buggy" rules, but it's just a beta and this is to be expected).
The above process does not use the Apriori algorithm. If you really need Apriori, try W-Apriori, although it is a Weka operator and I never succeeded with it.
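For reference, the level-wise idea behind Apriori can be sketched in plain Python. This is a teaching sketch under simple assumptions (support given as a fraction, baskets small enough to rescan each level); it is not the W-Apriori or Weka implementation, and the function name `apriori` is just a label.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support} for all itemsets with support >= min_support."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(candidate):
        # Fraction of baskets that contain every item of the candidate.
        return sum(1 for b in baskets if candidate <= b) / n

    # Level 1: frequent single items.
    current = {}
    for item in {i for b in baskets for i in b}:
        single = frozenset([item])
        s = support(single)
        if s >= min_support:
            current[single] = s

    frequent = dict(current)
    k = 2
    while current:
        prev = set(current)
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in prev for sub in combinations(c, k - 1))
        }
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent

frequent = apriori(
    [["bread", "milk", "butter"], ["bread", "milk", "chocolate"], ["bread", "wine"]],
    min_support=0.6,
)
```

On the three bread/milk transactions with a minimum support of 0.6, only bread, milk, and the pair bread+milk survive.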
Regards,
Tomas
Answer by Sebastian:
Hi all,
First, one question to Tomas about the bug in AssociationRulesGenerator you were talking about:
I know that there is one bug when the first row of the binominal input data from a database contains both true and false values. FPGrowth then makes mistakes in calculating the frequent itemsets because the true and false values are mixed up. This is due to the fact that you cannot declare an aml file when you use the operator DatabaseExampleSource. I circumvented this problem by inserting one row in which all values are set to false at the beginning of my table in the database. With that, the association rules generated by AssociationRulesGenerator are correct using version 4.1beta and the current release 4.1 (I have not tested 4.1beta2). So in my opinion there is just a small problem with FPGrowth in combination with DatabaseExampleSource, but AssociationRulesGenerator works fine. My question is: is there another bug I didn't notice, or are you talking about a specific version of RM I haven't used?
@Ana: I also do it the way Tomas already explained. I create one table in the database that contains the data in the correct format for the operator that calculates the frequent itemsets. To achieve this, I use Java with JDBC, because I do not use the RM GUI but integrate the RM operators in my own application; there are many other possibilities, of course. Then I use DatabaseExampleSource -> FPGrowth -> AssociationRulesGenerator. FPGrowth is in most cases much faster than Apriori, but it accepts only binominal data, so the data preparation can be a little more complex.
Cheers
Sebastian
Answer by Tomas:
Hi Sebastian,
The bug I was talking about is present in version 4.0. It is a bug in AssociationRulesGenerator, which calculates completely wrong values for confidence. As a result, you end up with some rules, but their confidence is wrong. In fact, the good rules, which have high confidence, might be silently filtered away (because their calculated confidence might be low), and the bad rules might be considered good because their calculated confidence might be high enough to sneak past the minconf constraint.
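To make clear what a correct confidence value looks like, here is a small Python sketch using the standard definition conf(A -> B) = supp(A and B) / supp(A). It says nothing about how AssociationRulesGenerator is coded; the names `confidence`, `candidate_rules`, and `min_conf` are chosen for the example.

```python
baskets = [
    {"bread", "milk", "butter"},
    {"bread", "milk", "chocolate"},
    {"bread", "wine"},
]

def confidence(baskets, antecedent, consequent):
    """conf(A -> B): share of baskets containing A that also contain B."""
    antecedent, consequent = frozenset(antecedent), frozenset(consequent)
    with_antecedent = [b for b in baskets if antecedent <= b]
    if not with_antecedent:
        return 0.0
    with_both = [b for b in with_antecedent if consequent <= b]
    return len(with_both) / len(with_antecedent)

candidate_rules = [({"bread"}, {"milk"}), ({"milk"}, {"bread"})]
min_conf = 0.8

# Only rules whose confidence reaches min_conf are kept.
kept = [(a, c) for a, c in candidate_rules if confidence(baskets, a, c) >= min_conf]
```

Here bread -> milk has confidence 2/3 and is dropped, while milk -> bread has confidence 1.0 and is kept. A wrongly computed confidence would flip exactly this kind of decision.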
The bug was already filed and the issue was fixed in 4.1beta1, although when I tried 4.1beta1 on my data, it produced thousands of rules even with minsup=1, minconf=1. I didn't investigate the issue further, because at that time I already had my own MS-Apriori implementation.
I wouldn't go back to Apriori or FP-Growth now; the single minsup value is far too limiting.
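As I understand the multiple-minimum-support idea behind MS-Apriori, each item gets its own minimum item support (MIS), and an itemset counts as frequent when its support reaches the lowest MIS among its items. The sketch below shows only that test, not Tomas's implementation; the `mis` values are invented for illustration.

```python
def is_frequent(itemset, support, mis):
    """Frequency test with per-item minimum supports (MIS)."""
    # The itemset's threshold is the smallest MIS among its items.
    threshold = min(mis[item] for item in itemset)
    return support >= threshold

# Hypothetical thresholds: bread is common, wine is rare.
mis = {"bread": 0.5, "wine": 0.1}

# {bread, wine} appears in 1 of 3 baskets in the earlier example.
rare_pair_ok = is_frequent({"bread", "wine"}, 1 / 3, mis)

# With one global minsup of 0.5, the same itemset would be rejected.
single_minsup_ok = (1 / 3) >= 0.5
```

This is how rare but interesting items can be kept without lowering the threshold for everything else.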
Cheers,
Tomas
Answer by Ingo:
Hi Tomas,
> it produced thousands of rules even with minsup=1, minconf=1. I didn't investigate the issue further, because at
> that time, I already had my own MS-Apriori implementation.
Please be aware that we introduced a new parameter "min_number_of_itemsets" (or something similar), which reduces the support until at least the minimum number of item sets is found.
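The behaviour described here can be sketched as a loop that keeps lowering the support threshold. This is a guess at the general mechanism, not RapidMiner's code: the step size of 0.1, the brute-force itemset search, and the function names are all assumptions.

```python
from itertools import combinations

def frequent_itemsets(baskets, min_support):
    """Brute-force search, fine for tiny examples only."""
    items = sorted({i for b in baskets for i in b})
    n = len(baskets)
    result = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            s = sum(1 for b in baskets if set(combo) <= b) / n
            # Itemsets that never occur are ignored even at threshold 0.
            if s > 0 and s >= min_support:
                result[frozenset(combo)] = s
    return result

def mine_with_minimum_count(baskets, min_support, min_itemsets, step=0.1):
    """Lower the support until at least min_itemsets itemsets are found."""
    support = min_support
    found = frequent_itemsets(baskets, support)
    while len(found) < min_itemsets and support > 0:
        # round() avoids floating-point drift from repeated subtraction.
        support = max(0.0, round(support - step, 10))
        found = frequent_itemsets(baskets, support)
    return support, found

baskets = [
    {"bread", "milk", "butter"},
    {"bread", "milk", "chocolate"},
    {"bread", "wine"},
]
support, found = mine_with_minimum_count(baskets, min_support=1.0, min_itemsets=3)
```

Starting at minsup=1, the loop ends at a much lower effective support, which would explain getting many more rules than the requested threshold suggests.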
Cheers,
Ingo