Using the SAS language to acquire data from a web site

IanBD
IanBD
Altair Employee
edited September 2022 in Altair RapidMiner

When collecting data, a website you find might contain information that would be useful to analyse in your SAS language programs, however it is not always possible to just highlight and copy from a website into a dataset, as sometimes the data on the website is not in the right format to manually copy and paste. You might also want to analyse some data that is updated frequently and by using web scraping you can introduce a level of automation by re-running your program instead of visiting the website then copying and pasting each time. Here we will take and convert it into a dataset using DATA step functions.

FILENAME rdwbpg HTTP "https://www.metacritic.com/browse/games/release-date/available/pc/metascore?view=detailed&page=" LRECL=32767 recfm=P debug; DATA _NULL_;  file "c:\temp\testout";  INFILE rdwbpg LENGTH=len;  INPUT line $varying32767. len;  put line; RUN; 

Here the FILENAME statement stores the HTML content of the webpage into a filename and the DATA step writes the data to an external file using the INFILE and INPUT statements to structure the input data into observations.

If we were to output the contents from the DATA step now, all the information we need is obfuscated by HTML tags:

image

So, using the DATA step functions in the next DATA step, we locate the lines with the information and extract it from the tags, for example:

image

We can see the score is contained between the opening and closing div tag in the second line, also, the game title is contained between the opening and closing a and h3 tag in the second to last line. We extract this information and output the data using DATA step functions in the following DATA step:

DATA OUT (DROP=FLAG);  RETAIN FLAG 0;  RETAIN GAMENAME;  FORMAT GAMENAME $100.;  INFILE "c:\temp\testout" _INFILE_=LINE;  INPUT;  IF LENGTH(LINE) LE 18 THEN RETURN;  IF LINE EQ STRIP('<span class="title numbered">')   THEN FLAG=1;  IF FLAG EQ 1 AND SUBSTR(line, 1, 18) EQ '<a href="/game/pc/'  THEN DO;   GAMENAME=SCAN(LINE,3,"<>");  END;  IF FLAG EQ 1 AND SCANQ(line,1,"<>") eq 'div  class="metascore_w large game positive"' THEN DO;   SCORE=SCAN(line,2,"<>");   FLAG=0;   OUTPUT;  END; RUN; 

Here the DATA step takes the HTML content from the and processes the content into readable data. This now produces the following output:

image

DATA OUT (DROP=FLAG);  RETAIN FLAG 0;  RETAIN GAMENAME;  FORMAT GAMENAME $100.; 

The retain statements initialise the FLAG and GAMENAME variables. The FLAG variable will be used as an indicator to do different types of processing depending on the HTML data; being a placeholder variable, it is dropped later. The GAMENAME variable is used to output the name of each game with their corresponding scores in the SCORE variable.

 INFILE "c:\temp\testout" _INFILE_=LINE;  INPUT; 

The infile and input statements specify the input data that comes from the file we created in the last DATA step.

 IF LENGTH(line) LE 18 then return; 

The IF statement checks for lines shorter than a length of 18. If so, then the rest of the DATA step statements are not processed for this line. This avoids extractions from other lines that might contain the same tags.

 IF LINE EQ STRIP('<span class="title numbered">') THEN FLAG=1; 

The IF statement sets the FLAG variable to 1 if the row contains the specified HTML tag.

  IF FLAG EQ 1 AND SUBSTR(line, 1, 18) EQ '<a href="/game/pc/' THEN DO;    GAMENAME=SCAN(line,3,"<>");   END;

The IF statement checks that the flag variable is 1 and the line contains the specified HTML tag, then we have our game name and use the SCAN() function to extract it from the tag.

 

IF FLAG EQ 1 AND SCANQ(line,1,"<>") EQ 'div class="metascore_w large game positive"' THEN DO;  SCORE=SCAN(line,2,"<>");  FLAG=0;  OUTPUT; END;

The if statement checks that the FLAG variable is 1 and the line contains the specified content in the HTML tag using the SCANQ() function, then we have our game name and use the SCAN() function to extract it from the tag. Finally, the game and score are output into the output dataset.