I have a university ranking dataset and one of the columns is gender ratio. Is there a way to analyze it to answer my research question: "Does gender distribution affect the ranking of a university?"
Hi @apiphuh
First, I used the Write Excel operator to convert your .csv file into an Excel file. Then, in Excel, I ran a macro on your female_male_ratio attribute to create a new numerical attribute, female_male_ratio_2 (for example, 33:67 => 0.49).
The new Excel file is in the attached zip file.
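For anyone who would rather not go through an Excel macro, the same conversion can be sketched in Python. This is only an illustration, not the macro from the attached file: the function name `parse_female_male_ratio` is made up here, and it assumes the values are strings of the form "female:male" and that the new attribute is female divided by male, as the 33:67 => 0.49 example suggests.

```python
def parse_female_male_ratio(text):
    """Convert a 'female:male' string such as '33:67' into female / male.

    Returns None for missing or malformed values, or when the male share is 0.
    (Hypothetical helper; the ratio definition is inferred from 33:67 => 0.49.)
    """
    if text is None:
        return None
    parts = str(text).strip().split(":")
    if len(parts) != 2:
        return None
    try:
        female, male = float(parts[0]), float(parts[1])
    except ValueError:
        return None
    if male == 0:
        return None  # avoid division by zero, e.g. '100:0'
    return female / male


print(round(parse_female_male_ratio("33:67"), 2))
```

Rows that come back as None would need to be filtered out or imputed before computing any correlation.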
1. After visual analysis, there appears to be no obvious relationship between "world_rank" and "female_male_ratio_2". See the following screenshot:
2. To confirm this observation, I used the Correlation Matrix operator: the correlation coefficient between "world_rank" and "female_male_ratio_2" is 0.138.
A coefficient this close to 0 means there is at most a very weak linear relationship between these two attributes.
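To make the coefficient less of a black box, here is a minimal sketch of the Pearson correlation in plain Python. It assumes the coefficient in question is the Pearson one; the function name `pearson_r` and the toy numbers are made up for illustration and are not taken from the dataset.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares around the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant attribute")
    return sxy / math.sqrt(sxx * syy)


# Toy values only (hypothetical, not from timesData)
ranks = [1, 2, 3, 4, 5, 6]
ratios = [0.49, 0.82, 0.61, 1.10, 0.75, 0.95]
print(pearson_r(ranks, ratios))
```

A value of 0.138 squared is about 0.019, so a linear fit would account for roughly 2% of the variance, which is why the relationship reads as very weak. Note that this only measures linear association.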
You can go further by applying some algorithms.
Here is the process:
<?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="8.0.001" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="read_excel" compatibility="8.0.001" expanded="true" height="68" name="Read Excel" width="90" x="45" y="34">
        <parameter key="excel_file" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Tests_Rapidminer\timesData_Excel.xlsx"/>
        <parameter key="imported_cell_range" value="A1:O2604"/>
        <parameter key="first_row_as_names" value="false"/>
        <list key="annotations">
          <parameter key="0" value="Name"/>
        </list>
        <list key="data_set_meta_data_information">
          <parameter key="0" value="world_rank.true.integer.attribute"/>
          <parameter key="1" value="university_name.true.polynominal.attribute"/>
          <parameter key="2" value="country.true.polynominal.attribute"/>
          <parameter key="3" value="teaching.true.numeric.attribute"/>
          <parameter key="4" value="international.true.polynominal.attribute"/>
          <parameter key="5" value="research.true.numeric.attribute"/>
          <parameter key="6" value="citations.true.numeric.attribute"/>
          <parameter key="7" value="income.true.polynominal.attribute"/>
          <parameter key="8" value="total_score.true.numeric.attribute"/>
          <parameter key="9" value="num_students.true.polynominal.attribute"/>
          <parameter key="10" value="student_staff_ratio.true.numeric.attribute"/>
          <parameter key="11" value="international_students.true.polynominal.attribute"/>
          <parameter key="12" value="female_male_ratio.true.polynominal.attribute"/>
          <parameter key="13" value="female_male_ratio_2.true.numeric.attribute"/>
          <parameter key="14" value="year.true.integer.attribute"/>
        </list>
      </operator>
      <operator activated="true" class="correlation_matrix" compatibility="8.0.001" expanded="true" height="103" name="Correlation Matrix" width="90" x="246" y="34"/>
      <connect from_op="Read Excel" from_port="output" to_op="Correlation Matrix" to_port="example set"/>
      <connect from_op="Correlation Matrix" from_port="example set" to_port="result 1"/>
      <connect from_op="Correlation Matrix" from_port="matrix" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>
I hope these first elements of a response are helpful.
Regards,
Lionel
You could also use a statistical test to answer the question, for example a chi-squared test of independence (which requires binning both attributes into categories first).
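As a rough sketch of what such a test involves: the chi-squared test works on counts in a contingency table, so rank and ratio would each be split into two groups first (for instance top half vs. bottom half of the ranking, and ratio below vs. above the median). The function name `chi_squared_2x2` and the counts below are invented for illustration and do not come from the dataset; the p-value shortcut used here is valid only for a 2x2 table (1 degree of freedom), and no continuity correction is applied.

```python
import math


def chi_squared_2x2(table):
    """Pearson chi-squared test of independence for a 2x2 table of counts.

    Returns (statistic, p_value). No continuity correction is applied.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    if 0 in row_totals or 0 in col_totals:
        raise ValueError("every row and column needs at least one observation")
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two groupings were independent
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, P(X > stat) = erfc(sqrt(stat / 2))
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical counts: rows = top-half / bottom-half rank,
# columns = ratio below / above the median
counts = [[20, 10], [10, 20]]
stat, p_value = chi_squared_2x2(counts)
print(stat, p_value)
```

A small p-value would suggest that rank group and ratio group are not independent. The result depends on where the bins are cut, so it is worth checking more than one split.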