Confused about how to approach my data: start with clustering, go straight to prediction, or is there a better idea?

Question

Dear all,

I am working with a dataset that contains more than 8456 rows and 26 columns. The data is about projects that took place in Europe; each row is a project.

These are the columns:

Office, Office Country, Competence, Executive competence, Classification, Enquiry date, Creation date, Confirmation date, Proposal Date, Final invoice sent date, Intermediary, Customer ID, Customer, Event, Group name, Reference code, Start date, End date, Project manager, Main contact, Via sales contact, Project location, Project country, Heard About Us, Source Market, Client Kind, Client Sector, Region, Market, Lead Sent to, Event Frequency, Pipeline Future Projects, Initial Pax, Estimated turnover, Estimated costs, Estimated profit %, Status, Pax, Net turnover, Net costs, Gross profit, Gross profit %, Net profit, Net profit %, Agency commissions, Supplier commissions, Cancellation/Rejection reason, Cancellation date, Remarks, Controlled, Financial Regime, Currency, Exchange Rate, Payment status %, Required(Net), Required, Invoiced, To invoice, Receipt, To pay, Custom invoices, Balance carried forward, Comments to low margin, Debits, Assets, Balance, TO Inv., TO Acc., TO Total, Cost Eff., Cost Man., Cost Acc., Cost Total

For privacy reasons I cannot share the data itself, so I created some imaginary data just for illustration:

Office, Office Country, Competence, Executive competence, Classification, Enquiry date, Creation date, Confirmation date, Proposal Date, Final invoice sent date, Intermediary, Customer ID, Customer, Event, Reference code, Start date, End date, Project manager, Project location, Project country, Heard About Us, Source Market, Client Kind, Client Sector, Region, Initial Pax, Estimated turnover, Estimated costs, Estimated profit %, Status, Pax, Net turnover, Net costs, Gross profit, Gross profit %, Net profit, Net profit %, Agency commissions, Supplier commissions, Cancellation/Rejection reason, Cancellation date, Remarks, Controlled, Financial Regime, Currency, Exchange Rate, Payment status %, Required(Net), Required, Invoiced, To invoice, Receipt, To pay, Custom invoices, Balance carried forward, Debits, Assets, Balance, TO Inv., TO Acc., TO Total, Cost Eff., Cost Man., Cost Acc., Cost Total

Saint LouisSenegalBLSaint LouisUnknown22.02.201608.04.201608.04.201623.02.201608.04.2016 11896Customer2zina 2016code e1 215.04.201616.04.2016MayaSaint Louis 1 hallSenegal BLAgencyOther 35000Completed351.9501.48646324122600    Input/OutputEUR11001.9502.3212.32102.3210000001.95001.950001.4871.487
Saint LouisSenegalBLSaint LouisOther08.06.201608.07.201608.07.201614.06.201625.07.2016 43Customer3 code e1 307.07.201607.07.2016MayaSaint LouisSenegal BLAgencyOther 02000100Completed02979288972367900    Input/OutputEUR1100297354354035400000029702970099
Saint LouisSenegalBLSaint LouisEmbassy19.05.201620.05.201604.08.201604.08.201604.08.2016 1978Customer4leab 2016code e1 411.09.201616.09.2016LauraSaint LouisSenegal BLAgency  3212.0000100Completed329.6147.4162.19723515500    Input/OutputEUR11009.61411.44111.441011.4410000009.61409.614007.4177.417
Saint LouisSenegalBLSaint LouisEmbassy20.05.201621.05.201628.06.201628.06.201604.08.2016 1978Customer5leab 2016code e1 512.09.201616.09.2016LauraSaint LouisSenegal BLAgency  124.5000100Completed124.5503.5261.02422227500    Input/OutputEUR11004.5505.4155.41505.4150000004.55004.550003.5263.526
Saint LouisSenegalBLSaint LouisUnknown21.03.201601.04.201615.06.201601.04.201628.11.2016 807Customer6festival 2016code e1 623.09.201625.09.2016MartinSaint LouisSenegal BLAgency  2018.0000100Completed2011.2769.6762.1041913010503    Input/OutputEUR110011.27712.81512.815012.81500000011.277011.277009.6769.676
Saint LouisSenegalBLSaint LouisUnknown28.06.201629.06.201610.08.201610.08.201614.09.2016 43Customer7 code e1 704.10.201605.10.2016LauraSaint LouisSenegal BLAgencyOther 306.0000100Completed304.7893.7781.01121173400    Input/OutputEUR11004.7905.7005.70005.7000000004.79004.790003.7793.779
Saint LouisSenegalBLSaint LouisUnknown05.08.201606.08.201610.08.201610.08.201610.08.2016 2374Customer8 code e1 804.10.201606.10.2016LauraSaint LouisSenegal BLAgencyOther 21.5000100Completed22.0071.75325413-97-500    Input/OutputEUR11002.0082.2282.22802.2280000002.00802.008001.7531.753
Saint LouisSenegalBLSaint LouisIncentive01.09.201602.09.201629.11.201606.09.201602.11.2016 535Customer9 code e1 919.10.201620.10.2016LarissaSaint LouisSenegal BLAgencyOther 152.7000100Completed152.2401.73650322111500    Input/OutputEUR11002.2402.6662.66602.6660000002.24002.240001.7371.737
Saint LouisSenegalBLSaint LouisIncentive22.09.201612.10.201623.11.201614.10.201607.11.2016 43Customer10 code e1 1019.10.201620.10.2016MayaSaint LouisSenegal BLAgencyOther 251.0000100Completed252.3601.433926395132200    Input/OutputEUR11002.3602.8082.80802.8080000002.36002.360001.4341.434
Saint LouisSenegalBLSaint LouisIncentive05.07.201606.07.201611.01.201712.07.201604.11.2016 535Customer11 code e1 1121.10.201622.10.2016LarissaSaint LouisSenegal BLAgencyOther 244.5003.50022Completed247.5136.4041.10915-206-300    Input/OutputEUR11007.5148.7918.79108.7910000007.51407.514006.4056.405

I want to analyse this data and build predictions/classifications to gain new insights and contribute something. The company provided the data so that I can write my master's thesis on it.

I need to set up a data mining process, for example predicting the Net turnover of next year, or clustering the projects to get new insights.

I am fairly new to RapidMiner and I am struggling to choose an appropriate starting path.

I thought about generating two new columns at the beginning (inside the Turbo Preparation): one column called

"Year" = the year of each project

and another column

"Project's length" = the number of days each project lasts
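For illustration only, here is a rough Python sketch of what these two derived columns would compute. It assumes the dates are text in the dd.mm.yyyy format seen in the sample rows, and that a project starting and ending on the same day counts as 1 day (an inclusive count, which is my assumption). The function name `add_year_and_length` is hypothetical.

```python
from datetime import datetime

DATE_FORMAT = "%d.%m.%Y"  # assumed: dates look like 15.04.2016, as in the sample rows


def add_year_and_length(project):
    """Return a copy of one project row with 'Year' and 'Project length' added."""
    start = datetime.strptime(project["Start date"], DATE_FORMAT)
    end = datetime.strptime(project["End date"], DATE_FORMAT)
    enriched = dict(project)
    enriched["Year"] = start.year
    # Assumption: inclusive count, so a same-day project lasts 1 day.
    enriched["Project length"] = (end - start).days + 1
    return enriched


example = add_year_and_length({"Start date": "15.04.2016", "End date": "16.04.2016"})
print(example["Year"], example["Project length"])
```

Taking the year from the Start date is also a choice; the Confirmation date or End date could be used instead if that matches how the company books turnover.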

With the attributes that I have, can I reach a satisfying result? Do you have any ideas? I am stuck in the middle, with too much data and too many dilemmas in my head, which prevents me from concentrating and taking the right approach.

That is why I need some fresh ideas, some motivation and recommendations, please.

I thought about clustering first, getting insights from the resulting clusters, and then continuing with a decision tree model that predicts next year's net turnover, for example. (It can be another idea rather than predicting the turnover if you have one; I am open to everything.)

I tried Auto Model and clustering, but I am not getting any useful results. I guess there might be two reasons for this:

1. I do not know exactly how to approach this procedure, and I am missing something.

or

2. The data that I have is not good enough for this type of approach.

Any help, please?

@sgenzer @jczogalla @David_A @mschmitz @stevefarr @Pavithra_Rao

Many thanks in advance.

Kind regards,
Jana

M_Martin · Answer

Hi: In addition to the great advice from Telcontar120, perhaps it would also be a good idea to ask the people who gave you the data (if you haven't already) how they collected the data, the meanings of all of the data fields, and what they are hoping you might find and why, and how whatever you find out will actually be used.  This might help you formulate and set goals as to what exactly you would like to learn or need to learn from exploring the data. If there's anyone you could talk to who has experience managing or has worked with people involved in some of the projects, this might give you some ideas.
If they just gave you the data and said "Find something interesting", you would certainly want to try and discover some interesting relationships between the various data fields which you could then talk about with the people who gave you the data, which might lead to you learning more about the meanings of all of the data fields or what your colleagues would like you to concentrate on.
You may also want to check for missing and NULL values in the various data fields, and look for inconsistencies in how values were entered, because inconsistently entered data can make it more difficult for RapidMiner to find interesting relationships between the data fields.  It's usually helpful to get a sense of the minimum, average, median, and maximum values for the numeric data fields and how evenly (or unevenly) the data for each data field is distributed.
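As a rough illustration of that kind of check, here is a minimal Python sketch that counts missing entries and computes the minimum, mean, median, and maximum of one numeric column. The helper name `summarize` and the sample rows are made up for the example; treating `None` and empty strings as missing is an assumption.

```python
from statistics import mean, median


def summarize(rows, column):
    """Count missing entries and compute min/mean/median/max for one numeric column."""
    values = [row.get(column) for row in rows]
    # Assumption: None and empty strings both mean "missing".
    present = [v for v in values if v is not None and v != ""]
    numbers = [float(v) for v in present]
    summary = {"missing": len(values) - len(present), "count": len(numbers)}
    if numbers:
        summary.update(min=min(numbers), mean=mean(numbers),
                       median=median(numbers), max=max(numbers))
    return summary


# Made-up rows, not the real data.
rows = [{"Pax": 35}, {"Pax": 0}, {"Pax": 32}, {"Pax": None}, {"Pax": ""}, {"Pax": 12}]
print(summarize(rows, "Pax"))
```

A large gap between the mean and the median is one quick hint that a field is unevenly distributed.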
Hope this helps, good luck, and best wishes, Michael Martin

Telcontar120 · Answer

You could start with some simple exploratory data analysis to see the relationships between your attributes.  How about some simple weighting by correlation or by information gain?
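To make the idea of weighting by correlation concrete, here is a small Python sketch that ranks numeric attributes by the absolute Pearson correlation with a label. It illustrates the general idea only and is not RapidMiner's operator; the function names and the numbers are made up.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equally long numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / sqrt(var_x * var_y)


def weights_by_correlation(columns, label):
    """Rank attributes by absolute correlation with the label, strongest first."""
    target = columns[label]
    weights = {name: abs(pearson(vals, target))
               for name, vals in columns.items() if name != label}
    return sorted(weights.items(), key=lambda item: item[1], reverse=True)


# Made-up numbers, not the real data.
columns = {
    "Pax": [10, 20, 30, 40],
    "Project length": [2, 1, 2, 1],
    "Net turnover": [1000, 2100, 2900, 4000],
}
ranked = weights_by_correlation(columns, "Net turnover")
for name, weight in ranked:
    print(f"{name}: {weight:.2f}")
```

Correlation only captures linear relationships between numeric attributes, so nominal fields such as Client Kind would need a different weighting, for example information gain.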
You could also use clustering to see what kind of patterns are in the data.  You should also look for outliers.
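One simple way to look for outliers in a single numeric attribute is the interquartile-range rule. The sketch below is only an illustration with made-up values; the 1.5 multiplier is the conventional default rather than something tuned to this data, and the function name `iqr_outliers` is hypothetical.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Made-up turnover figures (in thousands), not the real data.
turnovers = [1.9, 2.0, 2.2, 2.4, 4.5, 4.8, 7.5, 9.6, 11.3, 95.0]
outliers = iqr_outliers(turnovers)
print(outliers)
```

Flagged rows are worth inspecting before removing them, since an unusually large project may be exactly the kind of case the company cares about.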
Another option would be to reformulate your target label, because predicting a continuous numerical value (like net turnover) is sometimes more difficult.  Could you redefine it as a classification problem, by setting a threshold level of net turnover and then assigning a class (either above that level or below it)?
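As a sketch of that reformulation, the Python snippet below turns net turnover values into two classes. The class names "high" and "low", the function name, and the fallback to the median as threshold are all assumptions for the example; the values are made up.

```python
from statistics import median


def to_binary_label(turnovers, threshold=None):
    """Map each net turnover to 'high' or 'low' relative to a threshold."""
    if threshold is None:
        # Assumed default: the median, which splits the rows roughly in half.
        threshold = median(turnovers)
    return ["high" if value > threshold else "low" for value in turnovers]


turnovers = [1950, 297, 9614, 4550, 11277, 4789]  # made-up values
labels = to_binary_label(turnovers, threshold=4000)
print(labels)
```

A threshold with business meaning (for example a turnover level the company considers a large project) is usually easier to explain in a thesis than a purely statistical one.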
Without seeing your actual data, it is almost impossible to say whether there is enough predictive power in your attributes to do a good job predicting your outcome.  But these are a few other things you should try.