
I only used an iFrame to crawl and scrape content

Mon, December, by @betoayesa

Injecting an iframe into any production page gives you everything you need to automate navigation through it, avoiding cross-domain blocking issues, for example. The browser’s developer tools give you a big part of what you need to complete a small crawling and scraping project.

You also need a library like jQuery, which gives you access to the DOM so you can manipulate it, and an iframe, which makes things like pagination possible.

🔬 Some examples using browser’s developer tools

Using the dev console, you can build a scraper in a few lines of code. You can inject jQuery into any page. Of course, with this code you will run into malformed URLs, browser errors that block everything else, speed problems, and you will probably get banned after a few minutes depending on the target site… but it’s just to show the core of the idea.

Common lines of code

Work with jQuery or whatever you want. The point is to access the DOM and manipulate it.

Injecting jQuery:

var urls = [], result = [];
var $ = window.jQuery;

// inject jQuery if the page doesn't already have it
if (typeof $ === "undefined") {
    var s = document.createElement("script");
    s.type = "text/javascript";
    s.src = "https://code.jquery.com/jquery-3.6.0.min.js"; // any recent jQuery build works
    document.body.appendChild(s);
    // the script loads asynchronously, so give it a moment before using $
}

var $iframe = $('<iframe/>').appendTo('body');
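The injected script tag loads asynchronously, so jQuery is not available on the very next line of the snippet. A minimal polling sketch that waits for it (the names waitFor, check and intervalMs are mine, not from the post):

```javascript
// Resolve a promise once check() returns true, polling every intervalMs milliseconds.
function waitFor(check, intervalMs) {
  return new Promise(function (resolve) {
    var timer = setInterval(function () {
      if (check()) {
        clearInterval(timer);
        resolve();
      }
    }, intervalMs);
  });
}
```

In the console you would then do: `waitFor(function () { return typeof jQuery !== "undefined"; }, 100).then(function () { /* safe to use $ here */ });`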

Saving, outputting results:

var result = []; // result will store all the scraped data
JSON.stringify(result); // typed in the console, this echoes the JSON string
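The console echoes JSON, but a CSV is often handier to paste into a spreadsheet. A minimal sketch that flattens an array of scraped objects into CSV text (the helper name toCsv is mine, and the field names are just whatever keys your scraped objects have):

```javascript
// Turn an array of flat objects into CSV text, using the first object's keys as headers.
function toCsv(rows) {
  if (rows.length === 0) return "";
  var headers = Object.keys(rows[0]);
  // wrap every value in quotes and double any embedded quotes
  var escape = function (v) {
    return '"' + String(v == null ? "" : v).replace(/"/g, '""') + '"';
  };
  var lines = [headers.join(",")];
  rows.forEach(function (row) {
    lines.push(headers.map(function (h) { return escape(row[h]); }).join(","));
  });
  return lines.join("\n");
}
```

Then `toCsv(result)` in the console gives you text you can copy straight out.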

SEO audit (without an iframe)

Simple code to get all URLs from the current page, then retrieve each of them, grabbing their title, h1 and h2 text (you could extend it to download all the images too). For example, a quick SEO audit of any site:

var urls = [], result = [];

$('body a').each(function () {
    urls.push($(this).attr("href"));
});

urls.forEach(function (url) {
    $.get(url, function (body) {
        var scraped = {
            title: $(body).find('title').text(),
            h1: $(body).find('h1').text(),
            h2: $(body).find('h2').first().text() // .first(), not [0]: [0] is a raw DOM node with no .text()
        };
        result.push(scraped);
    });
});
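The $.get loop above chokes on the malformed and relative hrefs mentioned earlier. A small sketch that resolves relative URLs against the current page and drops everything else, using the URL constructor built into modern browsers (the helper name normalizeUrl is mine):

```javascript
// Resolve href against base; return an absolute http(s) URL, or null if unusable.
function normalizeUrl(href, base) {
  try {
    var u = new URL(href, base);
    // only http(s) pages are worth fetching: skip mailto:, javascript:, etc.
    if (u.protocol === "http:" || u.protocol === "https:") return u.href;
  } catch (e) {
    // malformed href
  }
  return null;
}
```

In the console: `urls = urls.map(function (u) { return normalizeUrl(u, location.href); }).filter(Boolean);`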

🚀 Scraping all Craigslist’s NYC housing listings (with an iframe)

With an iframe we can use pagination: you keep the parent page’s state while navigating through all the site’s pages. Target URL:

var data = [];
var $iframe = $('<iframe/>').appendTo('body');

$iframe.on('load', function () {
    $iframe.contents().find('.result-row').each(function () {
        data.push({
            title: $(this).find('.result-title').text(),
            img: $(this).find('img').attr("src"),
            price: $(this).find('.result-price:first').text()
        });
    });

    // give each page a few seconds to load before moving on
    setTimeout(function () {
        $iframe.prop("src", $iframe.contents().find('body').find('.next.button').attr("href"));
    }, 5000);
});

// And everything starts running when you set the first iframe's target url
$iframe.prop("src", ""); // put the listings URL here

// the data array will keep collecting scraped data
console.log(JSON.stringify(data)); // a JSON string that you can export

Careful: this code won’t stop on its own; it will keep going through all the paginated results.
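To make the loop terminate, only reload the iframe when the page actually exposes a next link; on the last page, `.attr("href")` returns undefined. A sketch of the guard (shouldContinue and nextHref are my names, not from the post):

```javascript
// True only for a non-empty href string; undefined means we hit the last page.
function shouldContinue(nextHref) {
  return typeof nextHref === "string" && nextHref.length > 0;
}

// Inside the load handler:
// var nextHref = $iframe.contents().find('.next.button').attr("href");
// if (shouldContinue(nextHref)) {
//   setTimeout(function () { $iframe.prop("src", nextHref); }, 5000);
// } else {
//   console.log(JSON.stringify(data)); // done: dump everything
// }
```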

Scraping your own tweets on Twitter

With Twitter you don’t need an iframe; you just need to scroll down to get more tweets. Apart from that, it’s just not possible to inject jQuery here; you would need an HTTP tunneling proxy to modify the response headers.

var data = [];

function extract() {
    var els = document.querySelectorAll('article');
    els.forEach(function (el) {
        data.push(el.innerText); // store the visible text of each tweet
    });
    window.scrollTo(0, document.body.scrollHeight);
}

setInterval(extract, 1500); // every 1.5 seconds; tune it to your scroll speed
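Because extract() re-reads every article element still in the DOM on each tick, the same tweet lands in data many times. A Set keeps only the first copy (pushUnique is my name for the helper, not from the post):

```javascript
// Push text into data only if we haven't seen that exact string before.
var seen = new Set();
function pushUnique(data, text) {
  if (!seen.has(text)) {
    seen.add(text);
    data.push(text);
  }
  return data;
}
```

Inside extract(), replace `data.push(el.innerText)` with `pushUnique(data, el.innerText)`.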

Want more examples? Ask me: @betoayesa or [email protected]

⚠️ Mind the gap

Big sites take security seriously, so trying to scrape Twitter or Google is very different from scraping a WordPress site. Injecting jQuery into Twitter, for example, isn’t easy.

What do you think?

