Large data set with Time Series

Toaldo (New Altair Community Member)
edited November 5 in Community Q&A
Dear all - I am fairly new to RapidMiner. I am working with a large time-series data set (covering 2000 to 2019), with roughly 200,000 rows and 4 different attributes (variable, region, time series, and values). Decision Tree and Forecasting with Windowing are among the approaches on my radar. Anyway, I am a bit lost here: what types of analysis could I do with this kind of data set? Thanks in advance for your help! Alexsandro Toaldo
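For reference, this is roughly what I understand Forecasting with Windowing to mean, written as a minimal pandas sketch (the series and names below are made up, not my real data):

import pandas as pd

def make_windows(series: pd.Series, window_size: int = 4, horizon: int = 1) -> pd.DataFrame:
    """Turn one series into lagged feature columns plus a future target column."""
    frame = pd.DataFrame(index=series.index)
    # Lagged copies of the series become the input features (the "window").
    for lag in range(1, window_size + 1):
        frame[f"lag_{lag}"] = series.shift(lag)
    # The target is the value `horizon` steps ahead of the window.
    frame["target"] = series.shift(-horizon)
    return frame.dropna()

# Toy yearly series standing in for one variable/region pair (2000-2019).
values = pd.Series(range(20), index=pd.period_range("2000", periods=20, freq="Y"))
print(make_windows(values).head())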

Answers

  • MartinLiebig (Altair Employee)
    Hi,
    what is your business problem? :)

    Best,
    Martin
  • Toaldo (New Altair Community Member)
    Answer ✓
    Hi Martin -
    Thanks for your prompt response.
    This is a great question, and honestly I am not sure yet.
    As background, I am working with public information about our city (Sao Paulo), which contains roughly 200,000 records across 4 different attributes. As this is a time-series dataset, I am not sure where to start or what type of analysis I could do. The attached file is a sample of the dataset.


  • Toaldo (New Altair Community Member)
    Dear Martin / RapidMiner team:

    Attached is a template containing public data from our city.

    Under the first column, "district", there are approximately 2,243 records.
    The time series contains data from 1996 to 2019 (~23 years).
    Columns C to WE (approximately 600 different attributes) contain many different kinds of information about the city (indexes, GDP, number of males and females, etc.). It is a very large amount of high-quality data.

    My intended research approach is initially the following (a rough sketch of these steps follows the list):
    1) Start a pilot project on 3 neighborhoods to identify correlations and possible regressions that explain the number of industrial and commercial companies per neighborhood (large, medium, small);
    2) Select 10 to 20 independent variables that explain the chosen dependent variable (companies);
    3) Build a decision tree on 10 selected neighborhoods to explain the increase in companies;
    4) Cluster neighborhoods by their potential to increase the number of companies.
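    The sketch below is only how I currently picture steps 1) to 4) in Python (pandas and scikit-learn), using toy data and made-up column names rather than the real table:

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Toy stand-in for the real table: one row per district/year, a few numeric
# attributes, and a "companies" target (all column names here are hypothetical).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "district": rng.choice(["A", "B", "C", "D"], size=200),
    "gdp": rng.normal(100, 20, size=200),
    "population": rng.normal(50_000, 10_000, size=200),
    "companies": rng.integers(10, 500, size=200),
})

# 1) Pilot on three neighborhoods only.
pilot = df[df["district"].isin(["A", "B", "C"])]
X = pilot.drop(columns=["district", "companies"])
y = pilot["companies"]

# 2) Keep the attributes most correlated with the target (up to 15 here) and fit a regression.
selected = X.corrwith(y).abs().sort_values(ascending=False).head(15).index
reg = LinearRegression().fit(X[selected], y)

# 3) A shallow decision tree explaining the number of companies.
tree = DecisionTreeRegressor(max_depth=4).fit(X[selected], y)

# 4) Cluster districts on the selected attributes, aggregated per district.
per_district = df.groupby("district")[list(selected)].mean()
labels = KMeans(n_clusters=2, n_init=10).fit_predict(per_district)
print(dict(zip(per_district.index, labels)))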

    So, I have a couple of questions:

    1) There are many attributes with no values. As this is a large data set, should I leave them empty or replace them with zero? (See the small example after these questions.)
    2) What type of operator/analysis should I start with, always considering "District" as the label? Every possible answer should come from District plus the size type of organization (large, medium, small).
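    To illustrate question 1), this is a small toy example of the two options I am weighing (made-up column names, not the real data):

import numpy as np
import pandas as pd

# Tiny toy table with gaps (column names are hypothetical).
df = pd.DataFrame({
    "district": ["A", "A", "B", "B"],
    "gdp": [100.0, np.nan, 80.0, np.nan],
})

# Option 1: replace missing values with zero. This treats "not recorded" as
# "really zero", which can bias averages, correlations, and regressions.
filled_zero = df.fillna({"gdp": 0})

# Option 2: keep them missing and impute a neutral statistic instead, e.g. the
# per-district mean (roughly what a Replace Missing Values step would do).
filled_mean = df.copy()
filled_mean["gdp"] = df.groupby("district")["gdp"].transform(lambda s: s.fillna(s.mean()))

print(filled_zero)
print(filled_mean)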

    Thanks for your attention!

    Best,
    A.Toaldo