Web page selection.

ratheesan
ratheesan New Altair Community Member
edited November 5 in Community Q&A
Hi,
How can I select the contents of a particular web page using RM? I tried it with the crawler, but I am getting more pages than I specified.

Thanks,
Ratheesan
Answers

  • fischer
    fischer New Altair Community Member
    Hi,

    the question is unclear. What exactly do you mean by "contents"? Do you want to fetch only a specific web page (or list of pages)? Or do you want to extract information from the web page?
    Please specify.

    Cheers,
    Simon
  • ratheesan
    ratheesan New Altair Community Member
    Hi Simon,
    I want to extract information from a web page. If I can copy the contents of the web page to a text file, I can then apply text mining algorithms. So for now I need to copy the web page into a text file.

    Thanks
    Ratheesan.
  • land
    land New Altair Community Member
    Hi,
    I guess you could set the "max_depth" parameter to zero. The crawler then shouldn't follow any links.

    With RapidMiner 5 there will soon be a web mining extension that makes this easier.

    Greetings,
    Sebastian
  • ratheesan
    ratheesan New Altair Community Member
    Hi,

    I tried the above method and saved the result as a text file. The saved text contains HTML tags, image URLs, etc. Is there any way to save only the text (the text that a user sees when opening the web page)?

    Thanks,
    Ratheesan
  • land
    land New Altair Community Member
    Hi,
    with 5.0 this would be easy. In 4.x, your only option is to set the TextInput contenttype to html, so that all tags are filtered out.

    Greetings,
    Sebastian
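
For readers who want to see what "fetch one page and filter out the tags" amounts to outside RapidMiner, here is a minimal Python sketch using only the standard library. It is not how the crawler or TextInput work internally; it is just an illustration of the two steps discussed above. The names `VisibleTextParser`, `visible_text` and `fetch_page` are made up for this example, and the set of skipped tags is an assumption about what counts as "text seen by the user".

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class VisibleTextParser(HTMLParser):
    """Collects text nodes and ignores tags, attributes and non-rendered content."""

    # Assumption: content of these elements is not part of the visible page text.
    SKIPPED = {"script", "style", "title", "noscript"}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # turns &amp; into & etc.
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def visible_text(html):
    """Return the text of an HTML document with all tags removed."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return parser.text()


def fetch_page(url):
    """Download exactly one page; no links are followed (like a crawl depth of zero)."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

To end up with a text file, you would pass the result of `fetch_page(...)` for your page's address to `visible_text(...)` and write the returned string to a file opened with `encoding="utf-8"`. Image URLs disappear along with the tags because they only occur in tag attributes, which the parser never adds to the output.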