Hi there,
I'm new to RapidMiner, but I have a deadline coming up soon and just wanted some help with web crawling.
I'm doing a crowdsourcing assignment where I need to crawl a website for detailed information, which I can then subject to further processing. However, I am having trouble running my initial analysis. I've installed both the Web Mining and Text Processing extensions, put in the URL to crawl, and tried to add crawling rules so that only pages matching my URL are stored and only links containing the name of the site itself are followed. Following some tutorials, I've told RapidMiner to save the results to a directory in .txt format.
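To illustrate what I think the crawling rules are doing, here's a rough Python sketch; the site name and patterns are made up, and this isn't RapidMiner's actual logic:

```python
import re

# Made-up crawling rules: one regex decides which links to follow,
# another decides which pages to store. "example-site.com" is a placeholder.
FOLLOW_RULE = re.compile(r".*example-site\.com.*")           # links containing the site name
STORE_RULE = re.compile(r".*example-site\.com/ideas/\d+.*")  # only individual idea pages

for link in [
    "https://example-site.com/ideas/123",
    "https://example-site.com/about",
    "https://othersite.com/page",
]:
    actions = []
    if FOLLOW_RULE.match(link):
        actions.append("follow")
    if STORE_RULE.match(link):
        actions.append("store")
    print(link, "->", ", ".join(actions) or "ignore")
```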
I'm not sure how 'max crawl depth' translates into actually going through the links and pages under my given URL. I want to search through user suggestions in a crowdsourcing project, but there seems to be no way to specify a time window for the results. I set the max depth to 400, selected 'add content as attribute' and 'write pages to disk', and put in my user agent prior to running the analysis.
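From what I've read, max crawl depth counts link hops from the start URL: depth 0 is the start page, depth 1 the pages it links to, and so on, so 400 is effectively unlimited. This is my rough understanding in Python (not RapidMiner's actual code; the user agent and link extraction are placeholders):

```python
import re
from collections import deque
from urllib.parse import urljoin

import requests  # third-party; pip install requests

LINK_RE = re.compile(r'href="([^"]+)"')  # crude link extraction, for illustration only

def crawl(start_url: str, max_depth: int, follow_rule: "re.Pattern[str]") -> dict:
    """Breadth-first crawl; max_depth limits how many link hops to follow."""
    pages = {}
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        try:
            html = requests.get(url, headers={"User-Agent": "my-crawler"}, timeout=10).text
        except requests.RequestException:
            continue
        pages[url] = html  # 'write pages to disk' would save this to a file instead
        if depth >= max_depth:
            continue  # this is the max-crawl-depth cut-off
        for href in LINK_RE.findall(html):
            link = urljoin(url, href)
            if link not in seen and follow_rule.search(link):
                seen.add(link)
                queue.append((link, depth + 1))
    return pages
```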
In one instance, I did manage to get 60 or so text files written to my directory that pertained to the analysis. Whilst some of these were links I wanted, a lot weren't, and the dates were too recent anyway. I wasn't sure how to further systematise my search criteria.
It is frustrating because I have a whole design set up, but no way to 1) get the data into RapidMiner, or even 2) review the text files reliably and go through them whilst specifying that I only want user reviews posted from a certain date. I also don't know how I would include user metadata, such as past voting and commenting history, in the analysis, or whether that is done afterwards. All of this information is available on the website itself: when you click on a given idea, the site shows how many ideas that user has submitted, how many votes and comments they've made, and so on. I could do this by hand, but I'd need hundreds if not over 1,000 different links to analyse it reliably.
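Something like the following is what I'm picturing for the date filter and the metadata step, run over the saved .txt files after the crawl. This is only a Python sketch: the date format, the 'votes' text, and the directory name are all hypothetical placeholders that depend on the site's actual markup:

```python
import re
from datetime import datetime
from pathlib import Path

DATE_RE = re.compile(r"posted on (\d{4}-\d{2}-\d{2})")  # hypothetical date string
VOTES_RE = re.compile(r"(\d+)\s+votes")                 # hypothetical metadata text
CUTOFF = datetime(2020, 1, 1)                           # example latest date I'd accept

for path in Path("crawl_output").glob("*.txt"):  # directory the crawler wrote to
    text = path.read_text(encoding="utf-8", errors="ignore")
    match = DATE_RE.search(text)
    if not match:
        continue  # page has no recognisable date; skip it
    posted = datetime.strptime(match.group(1), "%Y-%m-%d")
    if posted > CUTOFF:
        continue  # drop pages that are too recent for my window
    votes = VOTES_RE.search(text)
    print(path.name, posted.date(), votes.group(1) if votes else "?")
```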
If anyone could provide further guidance I would be wholly appreciative, as my deadline doesn't leave me much time.
Thanks,
milkshake_luva