Mining source code files

confusedMonMon
confusedMonMon New Altair Community Member
edited November 2024 in Community Q&A
Hi there,
I'm new to the mining world, and what I'm looking for is a way to mine source code files, i.e., files written in programming languages. Since source code is textual data, I thought I could use a text mining tool on it, and I picked RapidMiner because it is one of the best-known text mining tools. Unfortunately, it couldn't read such files. Am I missing something here? Do you have any advice on how to mine such files?
Many thanks

Best Answers

  • YYH
    YYH
    Altair Employee
    Answer ✓
    Hi @confusedMonMon,

    If you have source code files, say .sql, .c, or .py files, you need the Read Document operator from the Text Processing extension.

    <?xml version="1.0" encoding="UTF-8"?><process version="9.2.001">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.2.001" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" class="open_file" compatibility="9.2.001" expanded="true" height="68" name="Open File" width="90" x="112" y="34">
            <parameter key="resource_type" value="URL"/>
            <parameter key="url" value="https://raw.githubusercontent.com/Marcnuth/AnomalyDetection/master/anomaly_detection/anomaly_detect_vec.py"/>
          </operator>
          <operator activated="true" class="text:read_document" compatibility="8.1.000" expanded="true" height="68" name="Read Document" width="90" x="380" y="34">
            <parameter key="extract_text_only" value="true"/>
            <parameter key="use_file_extension_as_type" value="true"/>
            <parameter key="content_type" value="txt"/>
            <parameter key="encoding" value="SYSTEM"/>
          </operator>
          <connect from_op="Open File" from_port="file" to_op="Read Document" to_port="file"/>
          <connect from_op="Read Document" from_port="output" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    


Answers

  • SGolbert
    SGolbert New Altair Community Member
    Hi,

    You can search GitHub repos by language:


    Whether you do it through web crawling or directly with the GitHub API, I leave it for you to find out ;)

    Regards,
    Sebastian
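
For the GitHub API route mentioned above, a minimal sketch could look like the following. It only builds the request URL for the public repository search endpoint using the `language:` search qualifier; the function name `build_language_search_url` and the choice to sort by stars are assumptions made for illustration, and no request is actually sent.

```python
from urllib.parse import urlencode

# Public GitHub REST endpoint for repository search.
SEARCH_URL = "https://api.github.com/search/repositories"


def build_language_search_url(language, per_page=10):
    """Return a search URL for the most-starred repos in one language.

    Hypothetical helper: name and defaults are illustrative only.
    """
    params = {
        "q": f"language:{language}",  # GitHub search qualifier
        "sort": "stars",
        "order": "desc",
        "per_page": per_page,
    }
    return f"{SEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_language_search_url("python"))
```

Fetching the URL with any HTTP client should return JSON describing the matching repositories; unauthenticated requests are rate limited, so a crawl of any size would probably need a token.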

  • confusedMonMon
    confusedMonMon New Altair Community Member
    Thank you @yyhuang and @SGolbert. I can now read my source code files. I also finished the RapidMiner tutorials, which mainly used structured text files, and got my head around them. However, my source code files are unstructured text, and what I want is to analyze the usage of certain programming language features by mining the files. Do you have any advice on which components/operators I should use, given that (1) the data is unstructured text (source code) and (2) I'm not going to analyze any natural language phrases and will skip the comments in the source code files? Thank you
  • SGolbert
    SGolbert New Altair Community Member
    Hi confusedMonMon,

    Can you tell us what problem you are trying to solve? The approach will vary depending on that.

    I think that in general you can work with the text mining extension. The Text and Web Mining Course is a great introduction, but AFAIK it hasn't been made public yet (@Knut-RM). Depending on the complexity of the problem, you can build a good model by counting word frequencies and using n-grams.

    Regards,
    Sebastian
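
To illustrate what counting word frequencies and n-grams means for source code, here is a rough sketch in plain Python rather than RapidMiner. The tokenizer regex, the helper names `tokenize` and `ngrams`, and the sample snippet are all assumptions for illustration; a real analysis would likely use a proper lexer for the language in question.

```python
import re
from collections import Counter

# Very rough tokenizer: identifiers/keywords, integer literals,
# or any other single non-space symbol.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenize(source):
    """Split source text into a flat list of tokens."""
    return TOKEN_RE.findall(source)


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sample = "for i in range(3):\n    for j in range(i):\n        print(i, j)\n"

tokens = tokenize(sample)
# Frequencies of word-like tokens only (keywords and identifiers).
word_counts = Counter(t for t in tokens if t[0].isalpha() or t[0] == "_")
# Frequencies of adjacent token pairs (bigrams), symbols included.
bigram_counts = Counter(ngrams(tokens, 2))

if __name__ == "__main__":
    print(word_counts.most_common(3))
    print(bigram_counts.most_common(3))
```

Bigrams such as `("range", "(")` capture a little of the syntax around a word, which plain word counts lose.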
  • IngoRM
    IngoRM New Altair Community Member
    Hi,
    Just wanted to mention that the text and web mining course is now public as well:
    Hope this helps,
    Ingo
  • confusedMonMon
    confusedMonMon New Altair Community Member
    Thanks @SGolbert. I'm currently working with the Text Processing extension. I want to (1) exclude from processing any sentences and paragraphs that start and/or end with certain characters (e.g. /*, //, #, ...). I also want to (2) look in the documents for a predefined list of words and/or phrases that follow a specific pattern, so they can be detected and compared across documents. Any suggestions on where to start?
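
The two goals in this post can be sketched outside RapidMiner with regular expressions. This is only a naive illustration: it assumes comment markers never occur inside string literals, stripping `#` lines would also remove C preprocessor directives, and the names `strip_comments`, `count_patterns`, and the `FEATURES` list are hypothetical.

```python
import re

# Naive comment patterns: /* ... */ blocks, // to end of line, # to end of line.
# Assumption: comment markers never appear inside string literals.
COMMENT_RE = re.compile(r"/\*.*?\*/|//[^\n]*|#[^\n]*", re.DOTALL)


def strip_comments(source):
    """Remove comments so they are excluded from later matching."""
    return COMMENT_RE.sub("", source)


def count_patterns(source, patterns):
    """Count matches of each named regex in the comment-free source."""
    code = strip_comments(source)
    return {name: len(re.findall(rx, code)) for name, rx in patterns.items()}


# Hypothetical predefined list of language features to look for.
FEATURES = {
    "for_loop": r"\bfor\s*\(",
    "while_loop": r"\bwhile\s*\(",
    "malloc_call": r"\bmalloc\s*\(",
}

sample = """
/* for (;;) in a comment must not count */
int main(void) {
    for (int i = 0; i < 3; i++) { }  // while (1) is commented out
    while (0) { }
    return 0;
}
"""

counts = count_patterns(sample, FEATURES)

if __name__ == "__main__":
    print(counts)
```

Running the same `FEATURES` list over several files gives one count vector per file, which can then be compared across documents.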
  • confusedMonMon
    confusedMonMon New Altair Community Member
    Hi @SGolbert. I couldn't find the "Text mining" extension. Maybe you mean web mining? Would it help, since I already have the source code files locally? Thanks
  • sgenzer
    sgenzer
    Altair Employee
    Hi @confusedMonMon, I think @SGolbert meant the "Text Processing" extension. I make the same mistake all the time. :smile:


  • confusedMonMon
    confusedMonMon New Altair Community Member
    edited April 2019
    Thanks @sgenzer and @SGolbert. Makes sense now. 
  • confusedMonMon
    confusedMonMon New Altair Community Member
    edited April 2019
    Hi @SGolbert. I have a further question. How can I set the text directories in the Process Documents from Files operator automatically instead of manually? Is there any way to do so? This is my attempt:
  • SGolbert
    SGolbert New Altair Community Member

    I cannot parse your process, can you paste it again with the right format?

    Regards,
    Sebastian

  • confusedMonMon
    confusedMonMon New Altair Community Member
    edited May 2019
    Thanks @SGolbert, I managed to make it work now.
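
The thread does not record how the text directories question was solved. For readers with the same question, one possible approach outside RapidMiner is to list the subfolders of a root directory and treat each folder name as a class name. The sketch below does that and renders the pairs as an XML list; the `<list key="text_directories">` shape is an assumption about how the operator stores this parameter in process XML and should be checked against a process you exported yourself. Both function names are hypothetical.

```python
from pathlib import Path
from xml.sax.saxutils import quoteattr


def class_directories(root):
    """Return sorted (class name, directory path) pairs, one per subfolder."""
    return [(p.name, str(p)) for p in sorted(Path(root).iterdir()) if p.is_dir()]


def text_directories_xml(pairs):
    """Render the pairs as an XML list (assumed parameter layout)."""
    rows = [
        f"  <parameter key={quoteattr(name)} value={quoteattr(path)}/>"
        for name, path in pairs
    ]
    return "\n".join(['<list key="text_directories">', *rows, "</list>"])


if __name__ == "__main__":
    # Lists the subfolders of the current working directory.
    print(text_directories_xml(class_directories(".")))
```

Plain files in the root folder are ignored, so only subfolders become classes.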