Making Money On-line: When Screen Scraping became API calling – Gathering Oracle OpenWorld Session Catalog with ...

Sunday 20 May 2018

When Screen Scraping became API calling – Gathering Oracle OpenWorld Session Catalog with ...

A dataset with all sessions of the upcoming Oracle OpenWorld 2017 conference is nice to have – for experiments and demonstrations with many technologies. The session catalog is exposed at a website here.

With searching, filtering and scrolling, all available sessions can be inspected. If data is available in a browser, it can be retrieved programmatically and persisted locally in for example a JSON document. A typical approach for this is web scraping: having a server side program act like a browser, retrieve the HTML from the web site and query the data from the response. This process is described for example in this article – https://codeburst.io/an-introduction-to-web-scraping-with-node-js-1045b55c63f7 – for Node and the Cheerio library.

However, server side screen scraping of HTML will only be successful when the HTML is static. Dynamic HTML is constructed in the browser by executing JavaScript code that manipulates the browser DOM. If that is the mechanism behind a web site, server side scraping is at the very least considerably more complex (as it requires the server to emulate a modern web browser to a large degree). Selenium has been used in such cases – to provide a server side, programmatically accessible browser engine. Alternatively, screen scraping can also be performed inside the browser itself – as is supported for example by the Getsy library.

As you will find in this article – when server side scraping fails, client side scraping may be a much to complex solution. It is very well possible that the rich client web application is using a REST API that provides the data as a JSON document. An API that our server side program can also easily leverage. That turned out the case for the OOW 2017 website – so instead of complex HTML parsing and server side or even client side scraping, the challenge at hand resolves to nothing more than a little bit of REST calling. Read the complete article here.

PaaS Partner Community

For regular information on business process management and integration become a member in the SOA & BPM Partner Community for registration please visit www.oracle.com/goto/emea/soa (OPN account required) If you need support with your account please contact the Oracle Partner Business Center.

Blog Twitter LinkedIn Facebook Wiki

Technorati Tags: SOA Community,Oracle SOA,Oracle BPM,OPN,Jürgen Kress

from Oracle Blogs | Oracle Marketing Cloud https://ift.tt/2IWcozu
via IFTTT

2 comments:

ClubHub12 March 2022 at 10:14
Would you like to tell me that what type of scraping are you talking about? I read your post and unable to understand it. I have also wrote some article about is web scraping legal? If you need any help you can ask it from me.
ReplyDelete
Replies

Add comment