

Ask HN: Best practices for ethical web scraping?

Posted by aspyct:

Hello HN! As part of my learning in data science, I need / want to gather data. One relatively easy way to do that is web scraping. However, I’d like to do that in a respectful way. Here are three things I can think of:

1. Identify my bot with a user agent / info URL, and provide a way to contact me.
2. Don’t DoS websites with tons of requests.
3. Respect the robots.txt.

What else would be considered good practice when it comes to web scraping?
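For concreteness, all three of those points can be wired together in a few lines of Python. A minimal sketch, assuming the requests library; the user agent string, contact URL and delay value are placeholder assumptions, not recommendations:

```python
# Minimal sketch: identify the bot, rate-limit requests, and respect robots.txt.
# The user agent, contact URL and delay are placeholders.
import time
import urllib.robotparser

import requests

USER_AGENT = "my-learning-bot/0.1 (+https://example.com/bot-info; contact: me@example.com)"
DELAY_SECONDS = 5  # crude rate limit so the target never sees a burst of requests

robots = urllib.robotparser.RobotFileParser()
robots.set_url("https://example.com/robots.txt")
robots.read()

def polite_get(url):
    # 3. Respect robots.txt: skip anything the site disallows.
    if not robots.can_fetch(USER_AGENT, url):
        return None
    # 2. Don't DoS: wait between requests.
    time.sleep(DELAY_SECONDS)
    # 1. Identify the bot and provide a way to contact you.
    return requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=30)
```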


Indirectly related: if you have some time to spare, follow Harvard’s course in ethics! [1] Here is why: while it didn’t teach me anything new (in a sense), it did give me a vocabulary to better articulate myself. Having new words to describe certain ideas means you have more analytical tools at your disposal. So you’ll be able to examine your own ethical stance better.

It takes some time, but instead of watching Netflix (if that’s a thing you do), watch this! That said, The Good Place is a pretty good Netflix show that sprinkles in some basic ethics.

[1] https://www.youtube.com/watch?v=kBdfcR-8hEY



The rules you named are ones I personally followed. One other extremely important thing is privacy when you want to crawl personal data, for example from social networks. I personally avoid crawling data that inexperienced users might accidentally expose, like email addresses, phone numbers or their friends list. A good rule of thumb for social networks, for me, has always been to only scrape the data that is visible when my bot is not logged in (which also helps to not break the provider’s ToS).

The most elegant way would be to ask the site provider whether they allow scraping their website and which rules you should obey. I was surprised how open some providers were, but some don’t even bother replying. If they don’t reply, apply the rules you set for yourself and follow the obvious ones like not overloading their service, etc.
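On the point about not collecting data that users accidentally expose, it can also help to scrub obvious personal data before anything is stored. A crude sketch; the regexes below are illustrative assumptions, nowhere near a real PII detector:

```python
# Rough sketch: redact things that look like email addresses or phone numbers
# before the scraped text is persisted. The patterns are intentionally simple
# placeholders and will both miss and over-match plenty.
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact_pii(text):
    text = EMAIL_RE.sub("[email redacted]", text)
    text = PHONE_RE.sub("[phone redacted]", text)
    return text
```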


Nice of you to ask this question and to think about how to be as considerate as you can.

Some other thoughts:

– Find the most minimal, least expensive (for both you and them) way to get the data you’re looking for. Sometimes you can iterate through search results pages and get all you need from there in bulk, rather than iterating through detail pages one at a time.

– Even if they don’t have an official / documented API, they very likely have internal JSON routes or RSS feeds that you can consume directly, which may be easier for them to accommodate.

– Pay attention to response times. If you get your results back in 749 ms, it was probably trivially easy for them and you can request a bunch without troubling them too much. On the other hand, if responses are taking 5 s to come back, be gentle. If you are using internal undocumented APIs, you may find that you get faster / cheaper cached results if you stick to the same sets of parameters as the site is using on its own (e.g., when the site’s front end makes AJAX calls).
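To make the response-time point concrete, the delay between requests can simply be scaled by how long the last response took. A rough back-off sketch, assuming the requests library; the multiplier and minimum delay are arbitrary assumptions:

```python
# Rough sketch: the slower the server responds, the longer we wait before
# the next request. Multiplier and minimum delay are arbitrary placeholders.
import time

import requests

def gentle_get(url, session=requests):
    start = time.monotonic()
    response = session.get(url, timeout=30)
    elapsed = time.monotonic() - start
    # Fast responses were probably cheap for the server; slow ones were not,
    # so back off proportionally before the next call.
    time.sleep(max(1.0, elapsed * 10))
    return response
```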


Be careful about making the data you’ve scraped visible to Google’s search engine scrapers.

That’s often how site owners get riled up. They search for some unique phrase on Google, and your site shows up in the search results.
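If you do publish scraped data, the usual way to keep it out of search indexes is a noindex directive plus a restrictive robots.txt on your own site. A minimal sketch, assuming a Flask app; the header value and robots.txt policy are just one possible choice:

```python
# Minimal sketch (Flask assumed): ask search engines not to index the pages
# that republish scraped data, and disallow crawling via robots.txt.
from flask import Flask, Response

app = Flask(__name__)

@app.after_request
def add_noindex_header(response):
    # X-Robots-Tag tells well-behaved crawlers not to index or follow links.
    response.headers["X-Robots-Tag"] = "noindex, nofollow"
    return response

@app.route("/robots.txt")
def robots_txt():
    return Response("User-agent: *\nDisallow: /\n", mimetype="text/plain")
```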


This isn’t really an “ethical” practice, more of a “how to hide that you are scraping data” practice. If you have to hide the fact that you are scraping their data, maybe you shouldn’t be doing it in the first place.



In some cases, especially during development, local caching of responses can help reduce load. You can write a little wrapper that tries to return URL contents from a local cache and falls back to a live request.
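A minimal sketch of such a wrapper, assuming the requests library; the cache directory and hashing scheme are arbitrary choices:

```python
# Rough sketch: serve repeated requests from a local on-disk cache during
# development so the target site is only hit once per URL.
import hashlib
import pathlib

import requests

CACHE_DIR = pathlib.Path(".scrape-cache")
CACHE_DIR.mkdir(exist_ok=True)

def cached_get(url):
    cache_file = CACHE_DIR / hashlib.sha256(url.encode("utf-8")).hexdigest()
    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")
    # Cache miss: fetch once from the live site and store the body.
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    cache_file.write_text(response.text, encoding="utf-8")
    return response.text
```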

