Dear all,
I am working with a dataset that contains more than 8456 rows and 26 columns. The data is about projects that take place in Europe; each row is a project.
These are the columns:
Office | Office Country | Competence | Executive competence | Classification | Enquiry date | Creation date | Confirmation date | Proposal Date | Final invoice sent date | Intermediary | Customer ID | Customer | Event | Group name | Reference code | Start date | End date | Project manager | Main contact | Via sales contact | Project location | Project country | Heard About Us | Source Market | Client Kind | Client Sector | Region | Market | Lead Sent to | Event Frequency | Pipeline Future Projects | Initial Pax | Estimated turnover | Estimated costs | Estimated profit % | Status | Pax | Net turnover | Net costs | Gross profit | Gross profit % | Net profit | Net profit % | Agency commissions | Supplier commissions | Cancellation/Rejection reason | Cancellation date | Remarks | Controlled | Financial Regime | Currency | Exchange Rate | Payment status % | Required(Net) | Required | Invoiced | To invoice | Receipt | To pay | Custom invoices | Balance carried forward | Comments to low margin | Debits | Assets | Balance | TO Inv. | TO Acc. | TO Total | Cost Eff. | Cost Man. | Cost Acc. | Cost Total
For privacy reasons I cannot share the real data, so I created some imaginary data just for illustration:
Office |
Office Country |
Competence |
Executive competence |
Classification |
Enquiry date |
Creation date |
Confirmation date |
Proposal Date |
Final invoice sent date |
Intermediary |
Customer ID |
Customer |
Event |
Reference code |
Start date |
End date |
Project manager |
Project location |
Project country |
Heard About Us |
Source Market |
Client Kind |
Client Sector |
Region |
Initial Pax |
Estimated turnover |
Estimated costs |
Estimated profit % |
Status |
Pax |
Net turnover |
Net costs |
Gross profit |
Gross profit % |
Net profit |
Net profit % |
Agency commissions |
Supplier commissions |
Cancellation/Rejection reason |
Cancellation date |
Remarks |
Controlled |
Financial Regime |
Currency |
Exchange Rate |
Payment status % |
Required(Net) |
Required |
Invoiced |
To invoice |
Receipt |
To pay |
Custom invoices |
Balance carried forward |
Debits |
Assets |
Balance |
TO Inv. |
TO Acc. |
TO Total |
Cost Eff. |
Cost Man. |
Cost Acc. |
Cost Total |
Saint Louis |
Senegal |
BL |
Saint Louis |
Unknown |
22.02.2016 |
08.04.2016 |
08.04.2016 |
23.02.2016 |
08.04.2016 |
|
11896 |
Customer2 |
zina 2016 |
code e1 2 |
15.04.2016 |
16.04.2016 |
Maya |
Saint Louis 1 hall |
Senegal |
|
BL |
Agency |
Other |
|
35 |
0 |
0 |
0 |
Completed |
35 |
1.950 |
1.486 |
463 |
24 |
122 |
6 |
0 |
0 |
|
|
|
|
Input/Output |
EUR |
1 |
100 |
1.950 |
2.321 |
2.321 |
0 |
2.321 |
0 |
0 |
0 |
0 |
0 |
0 |
1.950 |
0 |
1.950 |
0 |
0 |
1.487 |
1.487 |
Saint Louis |
Senegal |
BL |
Saint Louis |
Other |
08.06.2016 |
08.07.2016 |
08.07.2016 |
14.06.2016 |
25.07.2016 |
|
43 |
Customer3 |
|
code e1 3 |
07.07.2016 |
07.07.2016 |
Maya |
Saint Louis |
Senegal |
|
BL |
Agency |
Other |
|
0 |
200 |
0 |
100 |
Completed |
0 |
297 |
9 |
288 |
97 |
236 |
79 |
0 |
0 |
|
|
|
|
Input/Output |
EUR |
1 |
100 |
297 |
354 |
354 |
0 |
354 |
0 |
0 |
0 |
0 |
0 |
0 |
297 |
0 |
297 |
0 |
0 |
9 |
9 |
Saint Louis |
Senegal |
BL |
Saint Louis |
Embassy |
19.05.2016 |
20.05.2016 |
04.08.2016 |
04.08.2016 |
04.08.2016 |
|
1978 |
Customer4 |
leab 2016 |
code e1 4 |
11.09.2016 |
16.09.2016 |
Laura |
Saint Louis |
Senegal |
|
BL |
Agency |
|
|
32 |
12.000 |
0 |
100 |
Completed |
32 |
9.614 |
7.416 |
2.197 |
23 |
515 |
5 |
0 |
0 |
|
|
|
|
Input/Output |
EUR |
1 |
100 |
9.614 |
11.441 |
11.441 |
0 |
11.441 |
0 |
0 |
0 |
0 |
0 |
0 |
9.614 |
0 |
9.614 |
0 |
0 |
7.417 |
7.417 |
Saint Louis |
Senegal |
BL |
Saint Louis |
Embassy |
20.05.2016 |
21.05.2016 |
28.06.2016 |
28.06.2016 |
04.08.2016 |
|
1978 |
Customer5 |
leab 2016 |
code e1 5 |
12.09.2016 |
16.09.2016 |
Laura |
Saint Louis |
Senegal |
|
BL |
Agency |
|
|
12 |
4.500 |
0 |
100 |
Completed |
12 |
4.550 |
3.526 |
1.024 |
22 |
227 |
5 |
0 |
0 |
|
|
|
|
Input/Output |
EUR |
1 |
100 |
4.550 |
5.415 |
5.415 |
0 |
5.415 |
0 |
0 |
0 |
0 |
0 |
0 |
4.550 |
0 |
4.550 |
0 |
0 |
3.526 |
3.526 |
Saint Louis |
Senegal |
BL |
Saint Louis |
Unknown |
21.03.2016 |
01.04.2016 |
15.06.2016 |
01.04.2016 |
28.11.2016 |
|
807 |
Customer6 |
festival 2016 |
code e1 6 |
23.09.2016 |
25.09.2016 |
Martin |
Saint Louis |
Senegal |
|
BL |
Agency |
|
|
20 |
18.000 |
0 |
100 |
Completed |
20 |
11.276 |
9.676 |
2.104 |
19 |
130 |
1 |
0 |
503 |
|
|
|
|
Input/Output |
EUR |
1 |
100 |
11.277 |
12.815 |
12.815 |
0 |
12.815 |
0 |
0 |
0 |
0 |
0 |
0 |
11.277 |
0 |
11.277 |
0 |
0 |
9.676 |
9.676 |
Saint Louis |
Senegal |
BL |
Saint Louis |
Unknown |
28.06.2016 |
29.06.2016 |
10.08.2016 |
10.08.2016 |
14.09.2016 |
|
43 |
Customer7 |
|
code e1 7 |
04.10.2016 |
05.10.2016 |
Laura |
Saint Louis |
Senegal |
|
BL |
Agency |
Other |
|
30 |
6.000 |
0 |
100 |
Completed |
30 |
4.789 |
3.778 |
1.011 |
21 |
173 |
4 |
0 |
0 |
|
|
|
|
Input/Output |
EUR |
1 |
100 |
4.790 |
5.700 |
5.700 |
0 |
5.700 |
0 |
0 |
0 |
0 |
0 |
0 |
4.790 |
0 |
4.790 |
0 |
0 |
3.779 |
3.779 |
Saint Louis |
Senegal |
BL |
Saint Louis |
Unknown |
05.08.2016 |
06.08.2016 |
10.08.2016 |
10.08.2016 |
10.08.2016 |
|
2374 |
Customer8 |
|
code e1 8 |
04.10.2016 |
06.10.2016 |
Laura |
Saint Louis |
Senegal |
|
BL |
Agency |
Other |
|
2 |
1.500 |
0 |
100 |
Completed |
2 |
2.007 |
1.753 |
254 |
13 |
-97 |
-5 |
0 |
0 |
|
|
|
|
Input/Output |
EUR |
1 |
100 |
2.008 |
2.228 |
2.228 |
0 |
2.228 |
0 |
0 |
0 |
0 |
0 |
0 |
2.008 |
0 |
2.008 |
0 |
0 |
1.753 |
1.753 |
Saint Louis |
Senegal |
BL |
Saint Louis |
Incentive |
01.09.2016 |
02.09.2016 |
29.11.2016 |
06.09.2016 |
02.11.2016 |
|
535 |
Customer9 |
|
code e1 9 |
19.10.2016 |
20.10.2016 |
Larissa |
Saint Louis |
Senegal |
|
BL |
Agency |
Other |
|
15 |
2.700 |
0 |
100 |
Completed |
15 |
2.240 |
1.736 |
503 |
22 |
111 |
5 |
0 |
0 |
|
|
|
|
Input/Output |
EUR |
1 |
100 |
2.240 |
2.666 |
2.666 |
0 |
2.666 |
0 |
0 |
0 |
0 |
0 |
0 |
2.240 |
0 |
2.240 |
0 |
0 |
1.737 |
1.737 |
Saint Louis |
Senegal |
BL |
Saint Louis |
Incentive |
22.09.2016 |
12.10.2016 |
23.11.2016 |
14.10.2016 |
07.11.2016 |
|
43 |
Customer10 |
|
code e1 10 |
19.10.2016 |
20.10.2016 |
Maya |
Saint Louis |
Senegal |
|
BL |
Agency |
Other |
|
25 |
1.000 |
0 |
100 |
Completed |
25 |
2.360 |
1.433 |
926 |
39 |
513 |
22 |
0 |
0 |
|
|
|
|
Input/Output |
EUR |
1 |
100 |
2.360 |
2.808 |
2.808 |
0 |
2.808 |
0 |
0 |
0 |
0 |
0 |
0 |
2.360 |
0 |
2.360 |
0 |
0 |
1.434 |
1.434 |
Saint Louis |
Senegal |
BL |
Saint Louis |
Incentive |
05.07.2016 |
06.07.2016 |
11.01.2017 |
12.07.2016 |
04.11.2016 |
|
535 |
Customer11 |
|
code e1 11 |
21.10.2016 |
22.10.2016 |
Larissa |
Saint Louis |
Senegal |
|
BL |
Agency |
Other |
|
24 |
4.500 |
3.500 |
22 |
Completed |
24 |
7.513 |
6.404 |
1.109 |
15 |
-206 |
-3 |
0 |
0 |
|
|
|
|
Input/Output |
EUR |
1 |
100 |
7.514 |
8.791 |
8.791 |
0 |
8.791 |
0 |
0 |
0 |
0 |
0 |
0 |
7.514 |
0 |
7.514 |
0 |
0 |
6.405 |
6.405 |
With this data, I want to do analysis and prediction/classification to gain new insights and to contribute something. I am using this company data as the basis for my master's thesis.
I need to carry out a data mining process, for example predicting next year's net turnover, or clustering the projects to gain new insights.
I am fairly new to RapidMiner and I am struggling to choose the right path to start with.
As a first step, I thought about generating two new columns (inside Turbo Prep):
"Year" = the year of each project
"Project's length" = the number of days each project lasts
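To make the two derived columns concrete, here is a rough sketch outside RapidMiner, in plain Python with only the standard library. It assumes the dates are written as dd.mm.yyyy (as in the sample rows), that "Year" is taken from the start date, and that the dot in amounts such as 1.950 is a thousands separator. The function names are made up for illustration.

```python
from datetime import datetime

DATE_FORMAT = "%d.%m.%Y"  # assumption: dates look like 15.04.2016


def parse_date(text):
    """Parse a 'dd.mm.yyyy' string into a date."""
    return datetime.strptime(text.strip(), DATE_FORMAT).date()


def parse_amount(text):
    """Parse an amount such as '1.950', assuming '.' is a thousands separator."""
    return int(text.strip().replace(".", ""))


def add_derived_columns(project):
    """Return a copy of one project row with the two new columns added."""
    start = parse_date(project["Start date"])
    end = parse_date(project["End date"])
    enriched = dict(project)
    enriched["Year"] = start.year  # assumption: the year of the start date
    # End minus start, so a project that starts and ends on the same day has length 0.
    enriched["Project's length"] = (end - start).days
    return enriched


# First sample row from the illustrative data.
row = {"Start date": "15.04.2016", "End date": "16.04.2016", "Net turnover": "1.950"}
result = add_derived_columns(row)
print(result["Year"], result["Project's length"], parse_amount(row["Net turnover"]))
# prints: 2016 1 1950
```

Whether a same-day project should count as 0 or 1 day is a modelling choice; adding 1 to the difference gives the inclusive count.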
What I need to know is: with the attributes I have, can I reach a satisfying result? Do you have any ideas? I am stuck in the middle, with too much data and too many dilemmas in my head, which keeps me from concentrating and choosing the right approach.
That is why I would appreciate some fresh ideas, motivation, and recommendations.
I thought about clustering first and getting insights from the resulting clusters, and then building a decision tree model on top of that to predict, for example, next year's net turnover. (It can be something other than turnover prediction if you have a better idea; I am open to everything.)
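To illustrate the clustering step only, here is a minimal sketch in plain Python (standard library, Python 3.8+). It uses plain k-means as a stand-in, which is not necessarily what RapidMiner's clustering operators do. The choice of Pax and Net turnover as features, k = 2, and taking the first k points as starting centroids are all arbitrary assumptions for illustration. The values are the ten illustrative rows above, with 1.950 read as 1950.

```python
import math

# (Pax, Net turnover) for the ten illustrative rows above,
# assuming "." in the amounts is a thousands separator.
projects = [(35, 1950), (0, 297), (32, 9614), (12, 4550), (20, 11276),
            (30, 4789), (2, 2007), (15, 2240), (25, 2360), (24, 7513)]


def standardize(rows):
    """Scale every column to mean 0 and standard deviation 1."""
    columns = list(zip(*rows))
    means = [sum(c) / len(c) for c in columns]
    # A constant column has deviation 0; fall back to 1.0 to avoid dividing by zero.
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(columns, means)]
    return [tuple((v - m) / s for v, m, s in zip(row, means, stds)) for row in rows]


def k_means(points, k, iterations=20):
    """Plain k-means; the first k points serve as (arbitrary) starting centroids."""
    centroids = list(points[:k])
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # Move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels


labels = k_means(standardize(projects), k=2)
for (pax, turnover), label in zip(projects, labels):
    print(f"cluster {label}: pax={pax}, net turnover={turnover}")
```

Standardizing first matters here because net turnover is in the thousands while pax is in the tens; without it the distance would be driven almost entirely by turnover.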
I tried Auto Model and clustering, but I am not getting any useful results. I guess there are two possible reasons for this:
1. I do not know exactly how to approach this procedure, and I am missing something.
or
2. The data I have is not good enough for this type of approach.
Any help, please?
@sgenzer @jczogalla @David_A @mschmitz @stevefarr @Pavithra_Rao
Many thanks in advance.
Kind regards,
Jana