Sunday, 20 May 2018

When Screen Scraping became API calling – Gathering Oracle OpenWorld Session Catalog with ...

A dataset with all sessions of the upcoming Oracle OpenWorld 2017 conference is nice to have – for experiments and demonstrations with many technologies. The session catalog is exposed at a website here.

With searching, filtering and scrolling, all available sessions can be inspected. If data is available in a browser, it can also be retrieved programmatically and persisted locally, for example in a JSON document. A typical approach for this is web scraping: a server side program acts like a browser, retrieves the HTML from the web site and queries the data from the response. This process is described for example in this article – https://codeburst.io/an-introduction-to-web-scraping-with-node-js-1045b55c63f7 – for Node and the Cheerio library.
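To make the idea concrete, here is a minimal sketch of scraping static HTML. The linked article uses Node and Cheerio; this sketch uses Python's standard `html.parser` instead, and the markup in `SAMPLE_HTML` (the `session` and `title` class names) is a made-up stand-in, not the real catalog's HTML.

```python
from html.parser import HTMLParser
import json

# Hypothetical static markup; the real catalog's HTML is not shown in this post.
SAMPLE_HTML = """
<ul>
  <li class="session"><span class="title">Session A</span></li>
  <li class="session"><span class="title">Session B</span></li>
</ul>
"""


class TitleParser(HTMLParser):
    """Collects the text of every <span class="title"> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples
        if tag == "span" and ("class", "title") in attrs:
            self.in_title = True

    def handle_data(self, data):
        if self.in_title:
            self.titles.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_title = False


def scrape_titles(html):
    """Return the session titles found in a static HTML string."""
    parser = TitleParser()
    parser.feed(html)
    return parser.titles


# Persisting the result as JSON is then a one-liner.
print(json.dumps(scrape_titles(SAMPLE_HTML)))
```

In a real scraper the HTML string would come from an HTTP request rather than a constant, but the query step stays the same.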

However, server side screen scraping of HTML will only be successful when the HTML is static. Dynamic HTML is constructed in the browser by executing JavaScript code that manipulates the browser DOM. If that is the mechanism behind a web site, server side scraping is at the very least considerably more complex (as it requires the server to emulate a modern web browser to a large degree). Selenium has been used in such cases – to provide a server side, programmatically accessible browser engine. Alternatively, screen scraping can also be performed inside the browser itself – as is supported for example by the Getsy library.
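The following sketch illustrates why this fails for dynamic pages, under the assumption of a hypothetical page whose session list is filled in by a script. The markup, the `/api/sessions` path and the class names are invented for the example. A server side parser only sees the HTML as delivered, so it finds no session elements at all.

```python
from html.parser import HTMLParser

# Hypothetical page: the list is empty in the delivered HTML and is only
# populated when a browser executes the script.
DYNAMIC_HTML = """
<html><body>
  <ul id="sessions"></ul>
  <script>
    fetch('/api/sessions').then(r => r.json()).then(items => {
      for (const s of items) {
        const li = document.createElement('li');
        li.className = 'session';
        li.textContent = s.title;
        document.getElementById('sessions').appendChild(li);
      }
    });
  </script>
</body></html>
"""


class SessionCounter(HTMLParser):
    """Counts <li class="session"> elements present in the raw markup."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "li" and ("class", "session") in attrs:
            self.count += 1


def count_session_items(html):
    counter = SessionCounter()
    counter.feed(html)
    return counter.count


# The script is never executed here, so no session items are found.
print(count_session_items(DYNAMIC_HTML))
```

The script body does, however, reveal where the data comes from, which is the hint that an API may be available to call directly.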

As you will find in this article, when server side scraping fails, client side scraping may be a much too complex solution. It is quite possible that the rich client web application uses a REST API that provides the data as a JSON document – an API that our server side program can just as easily call. That turned out to be the case for the OOW 2017 website – so instead of complex HTML parsing and server side or even client side scraping, the challenge at hand reduces to nothing more than a little bit of REST calling. Read the complete article here.
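As a rough sketch of what "a little bit of REST calling" can look like, the code below fetches a JSON document and saves selected fields locally. The endpoint URL, the `sessions` key and the `id` and `title` fields are all assumptions made for illustration; the actual OOW 2017 API and its payload are described in the complete article, not here.

```python
import json
import urllib.request

# Hypothetical endpoint; the real OOW 2017 API URL is not given in this post.
API_URL = "https://example.com/api/sessions"


def fetch_json(url):
    """GET a URL and decode the JSON response body."""
    req = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def extract_sessions(payload):
    """Pick the fields of interest from an assumed payload shape:
    {"sessions": [{"id": ..., "title": ..., ...}, ...]}"""
    return [
        {"id": s["id"], "title": s["title"]}
        for s in payload.get("sessions", [])
    ]


def save_sessions(sessions, path):
    """Persist the extracted sessions as a local JSON document."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sessions, f, indent=2)


def main():
    # Performs a real network call; only meaningful with a real endpoint.
    save_sessions(extract_sessions(fetch_json(API_URL)), "sessions.json")
```

Calling `main()` with a real endpoint would write `sessions.json`. No HTML parsing or browser emulation is involved, which is the point of the approach.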

PaaS Partner Community

For regular information on business process management and integration, become a member of the SOA & BPM Partner Community. To register, please visit www.oracle.com/goto/emea/soa (OPN account required). If you need support with your account, please contact the Oracle Partner Business Center.

Technorati Tags: SOA Community,Oracle SOA,Oracle BPM,OPN,Jürgen Kress



from Oracle Blogs | Oracle Marketing Cloud https://ift.tt/2IWcozu
via IFTTT

2 comments:

  1. Could you tell me what type of scraping you are talking about? I read your post and was unable to understand it. I have also written an article about whether web scraping is legal. If you need any help, you can ask me.

    1. In web scraping, there are two main types: manual and automated. Manual scraping involves extracting data by hand, while automated scraping uses scripts or bots for efficiency. When discussing mri centers of texas houston, web scraping could be used to gather information about local healthcare providers offering these services, including pricing and customer reviews. Regarding legality, it depends on factors like website terms of service and data usage. If you have any questions or need help, feel free to ask!