By @betoayesa
Injecting an iframe into any production page gives you full power to automate navigation through it, avoiding, for example, cross-domain blocking issues. The browser's developer tools give you a big part of what you need to complete a small crawling and scraping project.
Add a library like jQuery, which lets you access and manipulate the DOM, and an iframe, which supports pagination, and you are set.
🔬 Some examples using the browser's developer tools
Using the dev console, you can build a scraper in a few lines of code. You can inject jQuery into any page. Of course, with this code you will hit issues: malformed URLs, browser errors that block everything else, speed, and you will probably get banned after a few minutes depending on the target site… but it's just to show the core of the idea.
Common lines of code
Work with jQuery or whatever you prefer. The point is to access the DOM so you can manipulate it.
Injecting jQuery:
var urls = [], result = [];
// Inject jQuery if the page doesn't already have it.
// Note: the full build, not slim — the examples below use $.get, which the slim build leaves out.
if (typeof $ == "undefined") {
    var s = document.createElement("script");
    s.type = "text/javascript";
    s.src = "https://code.jquery.com/jquery-3.4.1.min.js";
    document.body.appendChild(s);
}
// Once the script has loaded, create the iframe we will navigate through
var $iframe = $('<iframe>').appendTo('body');
Saving and outputting results:
var result = []; // result will store all the scraped data
JSON.stringify(result); // outputs a JSON string in the developer console
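If you don't want to select the string by hand in the console, Chrome's and Firefox's dev consoles also expose a copy() helper (a console-only utility, not a page API):
copy(JSON.stringify(result)); // puts the JSON string straight on your clipboard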
SEO audit (without an iframe)
Simple code to collect all URLs from the current page, then fetch each one and grab its title, h1 and h2 text. For example, a quick SEO audit of any site:
var urls = [], result = [];
$('body a').each(function () {
    urls.push($(this).attr("href"));
});
urls.forEach(function (url) {
    $.get(url, function (body) {
        var scraped = {
            title: $(body).find('title').text(),
            h1: $(body).find('h1').text(),
            h2: $(body).find('h2').first().text() // .first(), not [0]: a raw DOM node has no .text()
        };
        result.push(scraped);
    });
});
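As mentioned above, this naive version chokes on malformed hrefs and cross-origin links. A minimal sketch of a filter you could run over urls first (my own addition, not part of the original snippet):
// Keep only well-formed, same-origin URLs — $.get can't fetch cross-origin pages anyway
var safeUrls = urls.filter(function (url) {
    if (!url || url.charAt(0) === '#') return false; // skip empty hrefs and in-page anchors
    try {
        return new URL(url, location.href).origin === location.origin;
    } catch (e) {
        return false; // malformed URL
    }
});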
🚀 Scraping all of Craigslist's NYC housing listings (with an iframe)
With an iframe you can use pagination: you keep a parent state while navigating through all the site's pages. Target URL: https://newyork.craigslist.org/d/apts-housing-for-rent/search/apa
var data = [];
var $iframe = $('<iframe>').appendTo('body');
$iframe.on('load', function () {
    // Scrape every listing row on the page currently loaded in the iframe
    $iframe.contents().find('.result-row').each(function () {
        data.push({
            title: $(this).find('.result-title').text(),
            img: $(this).find('img').attr("src"),
            price: $(this).find('.result-price:first').text()
        });
    });
    // Then point the iframe at the next results page
    setTimeout(function () {
        $iframe.prop("src", $iframe.contents().find('body').find('.next.button').attr("href"));
    }, 5000); // delay in ms; the original value was lost, a few seconds is a sane guess
});
// And everything starts running when you set the iframe's first target URL
$iframe.prop("src", "https://newyork.craigslist.org/d/apts-housing-for-rent/search/apa");
// The data array keeps collecting scraped rows.
console.log(JSON.stringify(data)); // gives you a JSON string you can export
Careful: this code won't stop on its own; it will keep walking through all the paginated results.
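A minimal sketch of a stop condition (my own addition, reusing the same selectors): only follow the next link while it exists, and dump the results once it doesn't.
$iframe.on('load', function () {
    // ... scrape the .result-row elements exactly as above ...
    var next = $iframe.contents().find('.next.button').attr("href");
    if (next) {
        setTimeout(function () { $iframe.prop("src", next); }, 5000);
    } else {
        console.log(JSON.stringify(data)); // last page reached: export everything
    }
});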
Scraping your own tweets on Twitter
With Twitter you don't need an iframe; you just need to scroll down to get more tweets. Apart from that, it's just not possible to inject jQuery here: you would need the HTTP tunneling component to modify response headers.
var data = [];
function extract() {
    var els = document.querySelectorAll('article');
    els.forEach(function (el) {
        data.push(el.innerText); // store the visible text of each tweet
    });
    // Scroll to the bottom so Twitter loads the next batch of tweets
    window.scrollTo(0, document.body.scrollHeight);
}
setInterval(extract, 1500); // interval in ms; the original value was garbled, 1500 is my guess
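Note that every pass re-reads the articles still in the DOM, so data fills up with duplicates. A small sketch of a deduplicated version (my own addition):
var seen = new Set();
function extractUnique() {
    document.querySelectorAll('article').forEach(function (el) {
        var text = el.innerText;
        if (!seen.has(text)) { // only keep tweets we haven't stored yet
            seen.add(text);
            data.push(text);
        }
    });
    window.scrollTo(0, document.body.scrollHeight);
}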
Want more examples? Ask me: @betoayesa or [email protected]
⚠️ Mind the gap
- The iframe is slow
- You cannot bypass their protections without an HTTP tunneling component
- You cannot rotate proxies
- You cannot manipulate HTTP headers
- What would fix those: an HTTP tunnel that gives you full control over responses and headers, plus rotating proxies
- Scraping data once is different from setting up a process for regular extraction; all scraping scenarios are different
- And since the targets are websites, those "scenarios" get updated over time
- It's never a good idea to build a business that depends on external sources like APIs or scraping
- There are three big parts to any scraping project: (a) research the target URL, (b) craft the crawl & scrape scripts and run them successfully the first time, (c) plan how to maintain the scraped data
- It will fail because of all the edge cases you need to manage (malformed URLs, runtime errors, …)
- You need a browser, and you have to keep it open
👌 Benefits
It gives me the total flexibility I find missing in other solutions. You can execute UI actions before or after scraping the content, while having access to the exact same version a visitor sees: the window object, the console, all the HTML…
- All requests are legit under any OAuth flow or protocol. No issues with sites that need a login
- You control everything in a fully rendered version of the target page: UI actions, clicks… everything inside it, and most importantly, what to do next. You can fill forms automatically too (see the sketch after this list)
- You can test out scraping scripts fast
- I'm testing the same concept inside an Electron application with a native webview component, and the speed improvement is impressive
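On filling forms, a minimal sketch — the field name and the form selector are made up for illustration, not taken from any real site:
// Fill and submit a form inside the iframe; submitting fires a fresh 'load' event
var $doc = $iframe.contents();
$doc.find('input[name="query"]').val("2 bedroom"); // hypothetical search field
$doc.find('form').first().submit();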
The benefit of HTTP tunneling combined with rotating proxies: you are unstoppable. Since you can manipulate headers, and you are showing the content inside an iframe (but on the same domain!), you have the content perfectly ready to be scraped and automated.
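To make the tunneling idea concrete, here is a minimal sketch in Node.js with Express (my own illustration, not Airovic's actual component): serve the target page from your own origin and strip the headers that block framing. A real tunnel would also have to rewrite the page's relative links and cookies.
const express = require('express');
const https = require('https');
const app = express();

// GET /proxy?url=https://target.example/page
app.get('/proxy', (req, res) => {
    https.get(req.query.url, (upstream) => {
        // Drop the headers that forbid embedding the page in an iframe
        const headers = Object.assign({}, upstream.headers);
        delete headers['x-frame-options'];
        delete headers['content-security-policy'];
        res.writeHead(upstream.statusCode, headers);
        upstream.pipe(res); // stream the body through unchanged
    });
});

app.listen(8080); // now frame http://localhost:8080/proxy?url=... instead of the target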
🚴 All together: Airovic.com
Perfect for personal or medium-sized projects: looking for real estate listings, flight prices, etc. For bigger projects, I would use Airovic to do the research and to prepare and plan the project, then configure the high-frequency, high-speed scrapers (scripts executed from the command line or cron) to do the day-to-day work.
Airovic is:
- An iframe that manages error edge cases, plus an element inspector
- A crawler = an iframe URL queue
- A handleDocument() function that encapsulates the user's function
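A minimal sketch of those two ideas together — the names mirror the list above, but the implementation is my guess, not Airovic's code: the crawler is just an iframe URL queue, and handleDocument() wraps the user's function so one bad page doesn't stop the run.
var queue = ["https://example.com/page1", "https://example.com/page2"]; // hypothetical URLs
var results = [];

function handleDocument($doc, userFn) {
    try {
        results.push(userFn($doc)); // the user's scraping function runs inside a guard
    } catch (e) {
        console.warn("scrape failed, skipping page:", e); // edge cases don't kill the crawl
    }
    var next = queue.shift();
    if (next) $iframe.prop("src", next); // keep draining the queue
}

$iframe.on('load', function () {
    handleDocument($iframe.contents(), function ($doc) {
        return { title: $doc.find('title').text() }; // example user function
    });
});
$iframe.prop("src", queue.shift());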
It was born from my last crawling & scraping project, where the easiest way to get the data was to extract a JavaScript variable that was loaded via AJAX after the page was rendered in the browser. Since I had access to the variable from the developer console, I started using the iframe to automate the process, because it was the most straightforward solution of all. Think about how to store thousands of JS objects so you can read them later. CasperJS was not helpful.
🌟 Help me test the beta, please! I would love to read your thoughts. www.airovic.com, @betoayesa or [email protected]
Scraping process with Airovic: how do you use it?
First, start by visiting the domain or target URL. The URL will be shown inside the iframe "previewer".
- Navigate through the site until you arrive at a page that has the data you need to extract.
- Right-click every element you want to scrape. One by one, write down, or use the code editor to store, the HTML selectors and the type of attribute you want to scrape (text? HTML? src attr? href attr? …). In fact, you can just click the selector and a new line of code will be added to the code editor (a hypothetical example follows after these steps).
- Review the code that will process each page. When you are done, click "Test on current page".
- If it says successful, you can proceed; if not, you will need to fix your code.
- If everything was fine, click "Start Robot" to open its menu.
- You can add a list of URLs on the right, or just enable the URL discovery option.
- Click the button to start crawling & scraping.
- Results will be added to the "Results" tab, and you will be able to download everything as JSON.
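To give a flavor of the selector-click step, a hypothetical example of the kind of line it might add to the code editor — the exact generated code is Airovic's, this is only my guess:
// one scraped field per clicked selector: a CSS path plus the attribute to extract
result.push({ title: $(document).find('h1.listing-title').text() }); // selector is made up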
Yep, as I re-read this, I know I need to make it easier.
Please try Airovic and let me know your thoughts! 🤜🤛 [email protected], @betoayesa
Roadmap
- First thing: validate interest & fix bugs
- More work on the iframe inspector component
- More work on the code editor component
- A user-friendly way of adding selectors, attributes and actions, instead of code
- Full scraping recipes for popular platforms
- An Electron standalone application
- Release Airovic under an open source license
Author
Hi! I'm Beto. As a developer I've worked on several crawling & scraping projects. The big ones were e-comprice (a price monitoring service) and sporteeze (a service to monitor the App Store & Google Play store), but I did a lot of small and medium scraping too. I've used a lot of different languages to scrape content, but mostly Python, PHP and JavaScript with CasperJS. With this crawling & scraping tool, airovic.com, I'm trying to work out another project, natzar.co, and I've been using it recently to visit websites automatically, checking for errors in the console, so I can send an email from phpninja.info.
Thanks for reading all the way here! Someone should create badges or something so that those of us who read to the end get some recognition :)