Specific case for Joining multiple datasets
Hi everyboday,
In a data science project I have recieved 4 cleaned datasets on an intersting topic healthy diet to protects ourselselves against Covid-19. I have already imported these 4 data sets in Rapidminer Studio. before I analyse them with machine learning, modelling and statistical preditions I would like to join and aggrgate those 4 datasets into one single dataset. This is where I ahve encountered an underestanding problem how to go further....
All of those 4 cleaned datasets have exact the same 32 Columns and 170 rows. Only the information delivered in these tables as following are different in their %-value:
1. Fat_Supply_Quantity_Data.csv
2. Food_Supply_kcal_Data.csv
3. Food_Supply_Quantity_kg_Data.csv
4. Protein_Supply_Quantity_Data
How do you see a possibility to join those 4 datasets to one single datase,t although all have the same number of 32 columns and same 170 rows OR should I look at each of 4 datasets separately and process them seperated from eachtother? Can you support me with your insights?
Answers
-
Hi @Sunnyboy_nh,
A priori, if your 4 datasets have strictly the same attributes names you need the Append operator.
Thus the resulting dataset will have 680 rows (170x4 rows) x 32 columns.
After creating this resulting dataset, you can begin by submitting these dataset to AutoModel to create automatically
Machine Learning models based on your data.
Good luck for your study.
Regards,
Lionel0 -
Thanks Lionel for the tip with Append instead of Join !
I will try that but before hand another question rises now..... The same number of 170 rows in each dataset are the "Country" names which are also the exact same from where the data has been taken. How do I go about the new added rows/examples after Append ? Should I then rename each new added 170 batch of rows/examples according to the data values or what?
And finally is it possible to use Append for on all multiple 4 datesets or is it only possible to do Append two at the the time?
Regards
Nader
0 -
It matters how your data is organized and saying it is data in the same number of rows and columns doesn't clarify. Are these 170 individuals who have 32x4 different observations about their diets? Is this a time series dataset? Or is it the same 32 attributes for 170x4 different people? Or something else entirely?
As Lionel says, if it is the same 32 attributes for 680 different people then Append should do the trick. But if it is something different they you may need another type of Join or perhaps even Merge, and you may also need to pivot or transform your data. Generally you need to think about the structure of the data you want to have at the end and work backwards.0 -
Hi Telcontar120,
Thanks for your further differntiation.
Those 170 rows/examples as I wrote back to Lionel are 170 countries around the globe where the data has been taken from for the same 32 Coulumn/Attributes but data has been taken for 4 different observations as you see here again:1. Fat_Supply_Quantity_Data.csv
2. Food_Supply_kcal_Data.csv
3. Food_Supply_Quantity_kg_Data.csv
4. Protein_Supply_Quantity_Data.csv
So the rows are not people but the same countries and the attributes are the same parameters but from 4 different aspects or observation in 4 seperated datasets which deliver different %-values of information in each datasets.
SO do you still think Append will do the trick or if so should the new added rows be renamed with a suffix since they are all the same countries ?
Regards
Nader
0 -
You don't want Append in this case, since it sounds like you want to end up with a dataset that has 170 rows with 128 attributes, not 680x32. The attributes in each file are similar but actually not the same.
So you should choose Join and then add each file's attributes one by one.
But you are going to want to rename your attributes first so it is clear which file they come from. You can do that with the Rename by Replacing operator.0 -
Thanks, this last recommendation sounds good I think makes more sense not to append but join the datesets and rename their attributes/column by their files before jouning at the end to one single dats set of 128 attributes x 170 rows.
But since i have 4 datasers and operator Join only joibs 2 datasets at a time should I do the Join in 3 stages in series or is there a better easier way to join multiple datasets in the begining?
Regards
Nader0