Extract Sample Size and Population Type from a group of Article Abstracts
TL;DR: I want to extract some specific information from a series of similarly (but not identically) formatted paragraphs of text and I hope someone here can point me in the right direction.
I work in the research office of a private university and as part of a project I want to analyze information found in journal article abstracts. Journal databases like Scopus allow me to download a CSV file with article titles, URLs, DOIs (a unique identifier), author names, and abstracts. I want to mine the abstracts to find specific information for each article.
To give a specific example, I have an exported list of around 400 articles, one per row in a CSV file, all relating to the development of surveys, questionnaires, scales, or similar academic instruments. Some articles were included by mistake: they relate to "instruments" not in the sense of questionnaires but in the sense of machines that measure and quantify data such as meteorological phenomena. I need to ignore these articles. The relevant articles generally include the sample size of the group the instrument was administered to and a brief description of the type of people who participated (university students, children 9-12 years of age, patients with type II diabetes, etc.).
I want to mine the abstracts and produce two additional columns, one containing the sample size and the other listing the type of people who participated in the study. So I need to mine two types of data, one numerical and the other textual. To complicate things, an abstract can include other numbers that are unrelated to the sample size, and some authors even write out the sample size in words, like "four hundred and thirty."
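To make the sample-size part concrete, here is a minimal rule-based sketch in Python of what I imagine such an extraction could look like. It is only an illustration under my own assumptions: the cue-word list in `PEOPLE`, the two patterns, and the function names `extract_sample_size` and `words_to_int` are hypothetical, and it simply returns the first candidate it finds, so it can pick up the wrong number when an abstract mentions several.

```python
import re

# Hypothetical cue words for study participants; extend for your own corpus.
PEOPLE = r"(?:participants|students|patients|children|adults|respondents|subjects)"

# Tried in order: an explicit "N = 412", then a number shortly before a cue word.
PATTERNS = [
    re.compile(r"\b[nN]\s*=\s*(\d[\d,]*)"),
    re.compile(r"\b(\d[\d,]*)\s+(?:\w+\s+){0,3}?" + PEOPLE, re.IGNORECASE),
]

SMALL = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5, "six": 6,
    "seven": 7, "eight": 8, "nine": 9, "ten": 10, "eleven": 11,
    "twelve": 12, "thirteen": 13, "fourteen": 14, "fifteen": 15,
    "sixteen": 16, "seventeen": 17, "eighteen": 18, "nineteen": 19,
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50, "sixty": 60,
    "seventy": 70, "eighty": 80, "ninety": 90,
}


def extract_sample_size(abstract):
    """Return the first numeric sample-size candidate, or None if none is found."""
    for pattern in PATTERNS:
        match = pattern.search(abstract)
        if match:
            # Strip thousands separators such as "1,250" before converting.
            return int(match.group(1).replace(",", ""))
    return None


def words_to_int(text):
    """Convert a spelled-out number such as 'four hundred and thirty' to an int."""
    total = 0
    current = 0
    for word in re.findall(r"[a-z]+", text.lower()):
        if word == "and":
            continue
        if word in SMALL:
            current += SMALL[word]
        elif word == "hundred":
            current = (current or 1) * 100
        elif word == "thousand":
            total += (current or 1) * 1000
            current = 0
        else:
            raise ValueError(f"not a number word: {word!r}")
    return total + current


print(extract_sample_size("The scale was administered to 412 university students."))  # 412
print(extract_sample_size("A sample of N = 1,250 was used."))  # 1250
print(words_to_int("four hundred and thirty"))  # 430
```

The sketch does not locate spelled-out numbers inside an abstract; `words_to_int` only converts a phrase once it has been isolated. Rows where `extract_sample_size` returns `None` would still need a manual check.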
I have attached a sample file with 12 abstracts, my manual analysis of the sample size, and a Yes/No field showing whether the study was carried out on students (meaning anyone from schoolchildren through university students).
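For the Yes/No student field, the simplest approach I can think of is a keyword match, sketched below in Python. The keyword list in `STUDENT_TERMS` and the function name `is_student_sample` are my own assumptions for illustration, not something derived from the data, and a plain keyword match will misfire on abstracts that mention students without sampling them.

```python
import re

# Hypothetical keyword list covering children through university students.
STUDENT_TERMS = re.compile(
    r"\b(?:students?|pupils?|schoolchildren|children|undergraduates?)\b",
    re.IGNORECASE,
)


def is_student_sample(abstract):
    """Return 'Yes' if the abstract mentions a student-type population, else 'No'."""
    return "Yes" if STUDENT_TERMS.search(abstract) else "No"


print(is_student_sample("The questionnaire was completed by 300 undergraduates."))  # Yes
print(is_student_sample("Participants were patients with type II diabetes."))  # No
```

Comparing this flag against the manually coded column in the sample file would show how often such a simple rule agrees with a human reading.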
I know very little about text mining and text analysis, but from what I have read, the RapidMiner platform seems to be the most promising solution. I am hoping that someone here can point me in the right direction to see how this could be done.