-
Topic Modeling for PDF files
Hello everyone, I want to read several PDF files (business reports) and analyze them. So far I have been using the Read Document operator, because I haven't found a better operator yet. I want to run topic modeling on the files to find the relevant topics. Pre-processing is done by the operators Tokenize, Transform Cases, Filter…
-
Use pdf file name as attribute
Hello everyone :smile: I want to do some simple text mining using PDF files in RM, but I'm a little stuck right now. I created a process using the Loop Files and Process Documents operators to read in several PDF files. As I have a lot of files to analyze, which I also want to compare, I would like to create an attribute…
-
Extraction of sentences based on a wordlist (to create a new doc)
Hello, For the purpose of my thesis I have to analyze multiple corporate reports. I have to extract from these reports the sentences that contain specific words (from a wordlist) and create a document with all the selected sentences, which will be used later for further analysis. For that I first used a "read document"…
-
Sentiment analysis multiple Pdfs
I am a master's student in Business Economics and I'm new to RapidMiner. For my thesis, I have to pre-process multiple PDF files by tokenizing, stemming, transforming cases etc. If I do this for one file, I get the desired outcome: a processed text. But when I use the loop function to process multiple PDFs, the output is…
-
\n command doesn't work in Replace Token Operator
Hi, I'm trying to read pdf-files in RapidMiner through the "Read Document" operator and then use the "Replace Token Operator" to delete all line-breaks. I replace "\n" with " ", but when I then copy the text, all line breaks are still in place. Weirdly, when I use the "Create Document" operator and manually copy the text…
-
Delete hyphens after reading pdf-files
Hi there, I'm very new to RapidMiner. I'm reading German pdf-files and tokenizing them, which is working fine... However, the pdf-files contain hyphens that separate a fair number of words into two parts, like the following example: "die Bedeutung der finan- ziellen Interessen der Union" I'm trying to dehyphenate the…
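For illustration only, the dehyphenation described in this post can be sketched outside RapidMiner with a regular expression. This is a minimal Python sketch, not a RapidMiner operator setting: the function name `dehyphenate` is hypothetical, and the rule (join a fragment ending in "-" with a following lowercase fragment, except before "und" or "oder") is an assumption that will not cover every German hyphenation case.

```python
import re

# Assumed rule: a hyphen followed by whitespace and a lowercase letter marks a
# word that was split at a line break. "und"/"oder" are skipped so that
# constructions such as "Ein- und Ausgang" stay untouched.
_SPLIT_WORD = re.compile(r"(\w)-\s+(?!(?:und|oder)\b)(?=[a-zäöüß])")


def dehyphenate(text: str) -> str:
    """Join word fragments that were separated by a hyphen and whitespace."""
    return _SPLIT_WORD.sub(r"\1", text)


print(dehyphenate("die Bedeutung der finan- ziellen Interessen der Union"))
```

Genuine compounds that happen to break before a lowercase fragment would also be joined, so the result should be spot-checked on real reports.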
-
Executing Tesseract
I am trying to execute Tesseract to extract details from PDFs that have been converted to PNG. Any idea why I am getting an error 127 when I am able to execute it via the terminal? <?xml version="1.0" encoding="UTF-8"?><process version="9.10.000"> <context> <input/> <output/> <macros/> </context> <operator activated="true"…
-
Text extraction of key themes/words from series of pdf files
Hi Folks, I'm new to this and struggling a little bit :) I just wanted some easy (explicit) steps to help me achieve what I want to do, which is: I have a series of reports, mostly PDFs, and I want to extract key themes or words that recur throughout the reports, for example 'serious accident' or 'safety'. What I have done so far…
-
text processing pdfs
I am trying to build a word cloud from pdfs. Is there some sort of "demo" for this? Do I need to convert the pdfs to text first? I saw a video where he suggested converting to txt files and putting them in a separate folder. (Text Processing on Rapid Miner - YouTube) I tried with a process (see attached xml) but I am…
-
Rapidminer Studio Manual
Where is the manual for the latest version of RapidMiner Studio? The manual in the documentation is very old (V6).
-
Text Mining: analyse PDFs with a dictionary which has categories
Hello, I want to analyse a number of PDFs (35) with a kind of dictionary. The output of the analysis should be an Excel file which shows how often every single word of the dictionary appears in the PDFs. Maybe it's important to know that the dictionary is not only a list of words. Instead the words are classified into five…
-
Extract e-mail addresses out of a pdf
Hello dear Rapidminer community, I have a pdf full of addresses (name, street, phone number, email). What I want is to extract only the e-mail addresses and store them line by line in an Excel or CSV file. What is the approach for this? (I am really a Rapidminer newbie) Greetings, Marcel
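One possible approach, sketched in Python rather than as a RapidMiner process, is to match addresses with a regular expression and write one per row. The sketch assumes the PDF text has already been extracted into a string; the names `EMAIL_RE`, `extract_emails` and `emails_to_csv` are hypothetical, and the pattern is deliberately simple, so it will not recognise every valid address form.

```python
import csv
import io
import re

# Simplified pattern (an assumption, not a full RFC 5322 matcher).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(text):
    """Return unique e-mail addresses in order of first appearance."""
    return list(dict.fromkeys(EMAIL_RE.findall(text)))


def emails_to_csv(emails):
    """Render the addresses as CSV text, one address per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["email"])  # header row
    writer.writerows([address] for address in emails)
    return buffer.getvalue()
```

The same pattern could presumably be reused inside a regex-capable operator, but that has not been checked against RapidMiner's regex dialect.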
-
pdf for errors
Hello, it would be nice to have a PDF listing the errors of RM and the way to solve them. Regards, mbs
-
Read PDF Tables Extension - Need to
Hello - I am trying to use the "Read PDF Tables" Extension. I have successfully read my PDF but it has been split out into 21 different example sets. I would like to use the "Select" operator to choose the example sets that I need. I am running into some issues. "Select" only lets you pick one example set whereas I will…
-
PDF
I want to know how to extract information from a PDF. I want to use RapidMiner tools on a PDF file, but I do not know how to load a PDF file in RapidMiner. Thanks
-
text mining pdf articles omitting references
In a previous post https://community.rapidminer.com/discussion/53107/text-mining-of-multiple-pdf-files-with-separate-key-word-counts an approach for mining multiple pdf files was described. If the pdfs are articles, is there a way to exclude the References section from being mined? The section often starts with the same term…
-
"Import data PDF documents"
Hi there! I'm completely new to rapid miner - and can't manage to import PDF files into the repository. It says that it's an unknown file type. I'm sorry for the completely (!) basic question, but I can't find anything about that in the getting started training. Thank you very much for your help!
-
Text Mining of multiple PDF files with separate key word counts
Hello all, I am new to this community and hope that somebody can help me. I already searched the forum a lot and found very good topics, but I couldn't find a proper solution for my task. Here's what I want to do: I have about 500 PDF files and want to text mine them and compare the results to key words I already have in…
-
AMOUNT OF EXAMPLES DOES NOT CORRELATE WITH INPUT DATA LOADED FROM PDFs
<?xml version="1.0" encoding="UTF-8"?><process version="8.2.001"> <context> <input/> <output/> <macros/> </context> <operator activated="true" class="process" compatibility="8.2.001" expanded="true" name="Process"> <parameter key="logverbosity" value="all"/> <process expanded="true"> <operator activated="true"…
-
Compare 2 pdf texts
Hello, I'm trying to create a process which consists of comparing 2 PDFs that are subtly different. I process my documents (tokenize, filter stopwords, generate n-grams...) from two different files, merge them into one common example set with the operator "Append" and use the operator "Remove duplicates" to see differences…
-
Creating wordlists from PDFs via a URL
I would like to create a wordlist (for applying a machine learning model that was specified before) with a PDF as a source. This usually works using the Process Documents operator. But I need to access the PDF via a URL. I thought about using the Web Mining extension for this. The Get Pages operator does not work, it…
-
Sentiment score for pdf files
Hello all, So, I have this requirement. I have a bunch of pdf files (nearly 50 pdf files) and I need to know the sentiment score of each of the pdf files. Can this be done in Rapidminer using any of the extensions? Also, it would be really great if we could create an output as an excel sheet which has the pdf file name as…
-
Processing PDF documents for text mining with the Process Documents from Files operator
I tried processing large PDF documents using the Process Documents from Files operator. When running the process, RapidMiner returns an error while processing the Process Documents from Files operator. The error message is: "Process failed. javax.crypto.IllegalBlockSizeException: Input length must be multiple of 16 when…
-
PDF Table extraction into data
The Extract PDF Tables seems to be a relatively new extension and I do not see much discussion around it. I have multiple PDF documents from which I need to extract the data contained in the tables. The output of this operator is an IO Object collection. Due to the fact that there are tables within tables, it means that…
-
"Scrape a website and download hyperlinked pdf files"
I can scrape in Python, but how do I download and store hyperlinked PDFs or other files in their native format using RapidMiner?