How to Properly Use Loop Amazon S3

User: "AustinT"
New Altair Community Member
Updated by Jocelyn

Community,

 

I am trying to extract data from S3 using the "Loop Amazon S3" operator. It is Twitter data and the data files are nested pretty deeply - for example: raw_data/2016/10/11/16/file_1.txt

 

I must not have it configured correctly, because RapidMiner (RM) tells me "Input Missing .... previous operator did not return any output". If I point the operator to a higher directory like "10", the process runs for a long time before erroring. If I point it to a directory like "16" (i.e. the directory where all my files are located), it still gives an error.

 

I suspect I need to customize the "macro" fields, but the descriptions of the fields don't really make sense to me. Right now the "file name", "file path" and "parent path" macro fields contain the default values.

 

My layout goes like: [Loop Amazon S3] -> [Read Document] -> [JSON to Data] -> results

 

 

Thanks for your help!

[Attachments: 2016-10-12 07_00_22-Clipboard.png, 2016-10-12 06_59_58-Clipboard.png]

    User: "mmichel"
    New Altair Community Member
    Accepted Answer

    Hi AustinT,

     

    'Loop Amazon S3' is a meta operator, so you need to provide the subprocess inside the operator itself.

    To do that, double-click the operator and move the other operators (Read Document and JSON to Data) inside the 'Loop Amazon S3' operator.

     

    You should end up with something like this (this example only has Read Document inside the loop):

    <?xml version="1.0" encoding="UTF-8"?>
    <process version="7.2.003">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="7.2.003" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="cloud_connectivity:loop_amazons3" compatibility="7.2.000" expanded="true" height="82" name="Loop Amazon S3" width="90" x="45" y="34">
            <parameter key="connection" value="AmazonS3"/>
            <parameter key="folder" value="/someFolder/someSubfolder"/>
            <process expanded="true">
              <operator activated="true" class="text:read_document" compatibility="7.2.001-SNAPSHOT" expanded="true" height="68" name="Read Document" width="90" x="112" y="34"/>
              <connect from_port="file object" to_op="Read Document" to_port="file"/>
              <connect from_op="Read Document" from_port="output" to_port="out 1"/>
              <portSpacing port="source_file object" spacing="0"/>
              <portSpacing port="source_in 1" spacing="0"/>
              <portSpacing port="sink_out 1" spacing="0"/>
              <portSpacing port="sink_out 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Loop Amazon S3" from_port="out 1" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
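
    If you also want JSON to Data inside the loop, the inner subprocess might look roughly like the sketch below. Treat it as an assumption rather than a tested process: the operator key "text:json_to_data", its compatibility value, and the port names "documents 1" and "example set" are written from memory and should be checked against what RapidMiner generates when you drag the operator in yourself.

    ```xml
    <!-- Sketch only: inner subprocess of 'Loop Amazon S3' with both operators inside. -->
    <process expanded="true">
      <operator activated="true" class="text:read_document" compatibility="7.2.001-SNAPSHOT" expanded="true" name="Read Document"/>
      <!-- Assumed operator key, version, and port names for JSON To Data; verify in your installation. -->
      <operator activated="true" class="text:json_to_data" compatibility="7.2.001-SNAPSHOT" expanded="true" name="JSON To Data"/>
      <!-- Each file delivered by the loop is read as a document ... -->
      <connect from_port="file object" to_op="Read Document" to_port="file"/>
      <!-- ... converted to an example set ... -->
      <connect from_op="Read Document" from_port="output" to_op="JSON To Data" to_port="documents 1"/>
      <!-- ... and handed to the loop's output port. -->
      <connect from_op="JSON To Data" from_port="example set" to_port="out 1"/>
    </process>
    ```

    With this wiring, the loop's "out 1" port would deliver one example set per file instead of one document per file.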

    Cheers,

    Marcel

    User: "AustinT"
    New Altair Community Member
    OP

    Thank you for the quick response, Marcel. Here's what the subprocess within the Loop Amazon S3 operator looks like. I have chosen a directory very close to the "node" (so to speak), so I don't expect the operator to run very long. It is still running, so I will check back when I have some results. Thanks again!

     

    2016-10-13 08_54_17-Clipboard.png

     

    EDIT: Although it ran for a while, it worked very nicely! The next things to troubleshoot are text encoding and combining the results into one dataset. I'm a beginner! Thanks again

    User: "mmichel"
    New Altair Community Member

    Hi AustinT,

     

    Glad to hear that your process is working. Depending on the number of files and your internet connection, it may take some time to complete.

    Just a quick tip for the process design phase: you don't want to execute the Loop Amazon S3 operator every time you edit the process, so save its results once with the Store operator. After that, you can load the results with the Retrieve operator. While designing, just use the Retrieve operator instead of the Loop Amazon S3 operator. Otherwise you will be wasting a lot of time ;-)
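
    As a rough sketch of that tip, the two fragments below show how the wiring could look. The repository path "//Local Repository/data/s3_documents" is a made-up placeholder, and the compatibility versions are simply copied from the process above, so adjust both to your setup.

    ```xml
    <!-- Run once: store the loop output in the repository (hypothetical entry path). -->
    <operator activated="true" class="store" compatibility="7.2.003" expanded="true" name="Store">
      <parameter key="repository_entry" value="//Local Repository/data/s3_documents"/>
    </operator>
    <connect from_op="Loop Amazon S3" from_port="out 1" to_op="Store" to_port="input"/>
    <connect from_op="Store" from_port="through" to_port="result 1"/>

    <!-- While designing: replace the loop with Retrieve, pointing at the same entry. -->
    <operator activated="true" class="retrieve" compatibility="7.2.003" expanded="true" name="Retrieve">
      <parameter key="repository_entry" value="//Local Repository/data/s3_documents"/>
    </operator>
    <connect from_op="Retrieve" from_port="output" to_port="result 1"/>
    ```

    The first fragment goes in the top-level process next to the Loop Amazon S3 operator. The second is meant for a separate design-time process, or for use after disabling the loop.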