Import a Word document into RapidMiner

BrilliantData

On a project for a recent client I needed to apply some common Natural Language Processing (NLP) techniques to surveys they had gathered, but one of the requirements for the project was that the source document had to remain in Word's .docx format and couldn't be exported to .txt. RapidMiner was the tool of choice for this engagement since it is graphical in nature and has a very usable library for text analysis, but what it doesn't have is an operator that specifically imports .docx files.

Microsoft Word files are basically zip files that contain an XML representation of the actual document. It stands to reason that if you can unzip the wrapper and get to the XML inside, you have a good chance of being able to read the document and do whatever you need in terms of analysis. RapidMiner has an operator for executing custom Python scripts (if you download the Python extension), so I chose to start there and see if it could handle those tasks.
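To check that idea before building anything, a minimal sketch like the one below lists what is stored inside a .docx archive using only Python's standard library. The helper name `list_docx_parts` and the file name in the usage line are hypothetical and only there for illustration.

```python
import zipfile


def list_docx_parts(path):
    """Return the names of the files stored inside a .docx archive."""
    # A .docx file is a regular zip archive, so zipfile can open it directly.
    with zipfile.ZipFile(path) as archive:
        return archive.namelist()


# Usage (hypothetical file name):
# for name in list_docx_parts("survey.docx"):
#     print(name)
```

The main body text of a Word document lives in the part named `word/document.xml`, which is the file the rest of this post is concerned with.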

Using Python in RapidMiner

First we'll need to download the Python extension, which you can do by going to Extensions-->Marketplace in the menu at the top of the page. It's one of the most popular downloads, so just go to "Top Downloads," select it from the list, and click "Install Packages" at the bottom of the window. You'll need to restart RapidMiner afterwards for the extension's operators to become available.

To use a custom Python script, search for the "Execute Python" operator and drag it onto the workflow. Double-click and you'll see the usual parameter editing box on the top right of the screen, which should contain a button labeled "Edit Text." This is where we'll enter the code.

The Code

I try not to reinvent the wheel when coding, so I Googled the problem to see if someone had tackled it before me, and someone definitely had. The code I used is below:

If you want to download it straight from Etienne's blog, just follow this link:

http://etienned.github.io/posts/extract-text-from-word-docx-simply/
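For reference, here is a minimal sketch of the unzip-and-parse approach described in that post, written for Python 3. It is an illustration under stated assumptions, not necessarily the exact code used in this workflow: the function name `get_docx_text` and the constant names are chosen for this example, and it only reads the main document body (no headers, footers, or footnotes).

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the tags in word/document.xml.
WORD_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
PARAGRAPH = WORD_NS + "p"  # <w:p> paragraph element
TEXT = WORD_NS + "t"       # <w:t> text run element


def get_docx_text(path):
    """Return the body text of a .docx file, one blank line between paragraphs."""
    # Unzip the wrapper and pull out the XML part that holds the body text.
    with zipfile.ZipFile(path) as archive:
        xml_content = archive.read("word/document.xml")

    tree = ET.fromstring(xml_content)
    paragraphs = []
    for paragraph in tree.iter(PARAGRAPH):
        # Join all text runs of the paragraph; skip runs with no text.
        texts = [node.text for node in paragraph.iter(TEXT) if node.text]
        if texts:
            paragraphs.append("".join(texts))
    return "\n\n".join(paragraphs)
```

Inside RapidMiner, a function like this would be called from the script entered in the "Execute Python" operator, with the path pointing at the .docx file to read.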

The initial workflow looked like this:

After using Etienne's code to unwrap the .docx file, it was easily readable by the "Read Document" operator. After that I transformed all words to lowercase, tokenized them, removed stop words, then converted the resulting word list to data and loaded it into a database for analysis. Simple.
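In the workflow these steps were done with RapidMiner operators, but the same idea can be sketched in plain Python to show what each step does to the text. This is only an illustration: the function name `preprocess`, the tiny stop word list, and the choice to split on anything that is not an ASCII letter are all assumptions made for the example, not the operators' actual settings.

```python
import re

# Hypothetical, deliberately tiny stop word list for illustration only.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "is", "was", "it", "in"}


def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    lowered = text.lower()
    # Treat every run of ASCII letters as one token; digits, punctuation,
    # and accented characters act as separators in this simple sketch.
    tokens = re.findall(r"[a-z]+", lowered)
    return [token for token in tokens if token not in stop_words]


print(preprocess("The survey was easy to complete."))
# prints ['survey', 'easy', 'complete']
```

The resulting word list is the kind of data that can then be turned into rows and loaded into a database.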

All comments

sgenzer

hello @BrilliantData - welcome to the community and thanks for sharing this! It's actually similar to another thread from last December about xlsx files (see https://community.rapidminer.com/t5/RapidMiner-Studio-Forum/Extract-Sheet-name-from-an-Excel-file/m-p/44747).

Scott

Telcontar120

Wonderful solution to a common problem! If you would be willing to post an anonymized version of the process, I am sure there are many community members that would be grateful!

orsan_awawdi

This is brilliant.

I can't find the Read Document component. Any idea?

I'm using RapidMiner Studio 8.1.

Telcontar120

Did you install the free text mining extension? All the document operators are in that and not in the base version of Studio. Just search for Text Processing on the Marketplace and it will come up.

orsan_awawdi

Yes, you are right, it is right there.

For some reason, it is failing with an indentation issue. I don't know why.

---

Untitled7

  File "<ipython-input-28-405e2fcdbb20>", line 21
    document = zipfile.ZipFile('C:/Users/orsana/Desktop/MMO.docx')
                                                                  ^
IndentationError: unindent does not match any outer indentation level

---

orsan_awawdi

I think I know what is wrong here. I will fix it.

blake_galbreath

This is a great article, but I still can't quite figure out how to actually get the word doc into the RM repository, in order to enter it into the process described above. I tried using the Import Data module, but it only seems to allow Binary, Excel, and CSV. Where do I go to import docx files?

rfuentealba

I got it as a Building Block.

You just use the Open File operator to pass in the Word document, and then insert the building block after it.

Before pasting the building block into your system, remove the .txt extension I had to add.

Usage:

Image: https://us.v-cdn.net/6038102/uploads/editor/xz/8yp0ysvcr16s.png

blake_galbreath

rfuentealba I believe this will work.