

Ask HN: Best practices for ethical web scraping?

Posted by aspyct:

Hello HN! As part of my learning in data science, I need / want to gather data. One relatively easy way to do that is web scraping. However, I’d like to do that in a respectful way. Here are three things I can think of:

1. Identify my bot with a user agent / info URL, and provide a way to contact me.
2. Don’t DoS websites with tons of requests.
3. Respect the robots.txt.

What else would be considered good practice when it comes to web scraping?
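For concreteness, all three of those points can be wired together in a few lines of Python. A minimal sketch, assuming the requests library; the user agent string, contact URL and delay value are placeholder assumptions, not recommendations:

```python
# Minimal sketch: identify the bot, rate-limit requests, and respect robots.txt.
# The user agent, contact URL and delay are placeholders.
import time
import urllib.robotparser

import requests

USER_AGENT = "my-learning-bot/0.1 (+https://example.com/bot-info; contact: me@example.com)"
DELAY_SECONDS = 5  # crude rate limit so the target never sees a burst of requests

robots = urllib.robotparser.RobotFileParser()
robots.set_url("https://example.com/robots.txt")
robots.read()

def polite_get(url):
    # 3. Respect robots.txt: skip anything the site disallows.
    if not robots.can_fetch(USER_AGENT, url):
        return None
    # 2. Don't DoS: wait between requests.
    time.sleep(DELAY_SECONDS)
    # 1. Identify the bot and provide a way to contact you.
    return requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=30)
```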


Indirectly related: if you have some time to spare, follow Harvard’s course in ethics! [1] Here is why: while it didn’t teach me anything new (in a sense), it did give me a vocabulary to better articulate myself. Having new words to describe certain ideas means you have more analytical tools at your disposal. So you’ll be able to examine your own ethical stance better.

It takes some time, but instead of watching Netflix (if that’s a thing you do), watch this! That said, The Good Place is a pretty good Netflix show that sprinkles in some basic ethics.

[1] https://www.youtube.com/watch?v=kBdfcR-8hEY



The rules you named are ones I personally followed. One other extremely important thing is privacy when you want to crawl personal data, for example from social networks. I personally avoid crawling data that inexperienced users might accidentally expose, like email addresses, phone numbers or their friends list. A good rule of thumb for social networks, for me, has always been to only scrape the data that is visible when my bot is not logged in (which also helps to not break the provider’s ToS).

The most elegant way would be to ask the site provider whether they allow scraping their website and which rules you should obey. I was surprised how open some providers were, but some don’t even bother replying. If they don’t reply, apply the rules you set for yourself and follow the obvious ones like not overloading their service, etc.
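On the point about not collecting data that users accidentally expose, it can also help to scrub obvious personal data before anything is stored. A crude sketch; the regexes below are illustrative assumptions, nowhere near a real PII detector:

```python
# Rough sketch: redact things that look like email addresses or phone numbers
# before the scraped text is persisted. The patterns are intentionally simple
# placeholders and will both miss and over-match plenty.
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact_pii(text):
    text = EMAIL_RE.sub("[email redacted]", text)
    text = PHONE_RE.sub("[phone redacted]", text)
    return text
```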


Nice of you to ask this question and to think about how to be as considerate as you can.

Some other thoughts:

– Find the most minimal, least expensive (for both you and them) way to get the data you’re looking for. Sometimes you can iterate through search results pages and get all you need from there in bulk, rather than iterating through detail pages one at a time.

– Even if they don’t have an official / documented API, they very likely have internal JSON routes or RSS feeds that you can consume directly, which may be easier for them to accommodate.

– Pay attention to response times. If you get your results back in 749 ms, it was probably trivially easy for them and you can request a bunch without troubling them too much. On the other hand, if responses are taking 5 s to come back, be gentle. If you are using internal undocumented APIs, you may find that you get faster / cheaper cached results if you stick to the same sets of parameters as the site is using on its own (e.g., when the site’s front end makes AJAX calls).
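To make the response-time point concrete, the delay between requests can simply be scaled by how long the last response took. A rough back-off sketch, assuming the requests library; the multiplier and minimum delay are arbitrary assumptions:

```python
# Rough sketch: the slower the server responds, the longer we wait before
# the next request. Multiplier and minimum delay are arbitrary placeholders.
import time

import requests

def gentle_get(url, session=requests):
    start = time.monotonic()
    response = session.get(url, timeout=30)
    elapsed = time.monotonic() - start
    # Fast responses were probably cheap for the server; slow ones were not,
    # so back off proportionally before the next call.
    time.sleep(max(1.0, elapsed * 10))
    return response
```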


Be careful about making the data you’ve scraped visible to Google’s search engine scrapers.

That’s often how site owners get riled up. They search for some unique phrase on Google, and your site shows up in the search results.
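If you do publish scraped data, the usual way to keep it out of search indexes is a noindex directive plus a restrictive robots.txt on your own site. A minimal sketch, assuming a Flask app; the header value and robots.txt policy are just one possible choice:

```python
# Minimal sketch (Flask assumed): ask search engines not to index the pages
# that republish scraped data, and disallow crawling via robots.txt.
from flask import Flask, Response

app = Flask(__name__)

@app.after_request
def add_noindex_header(response):
    # X-Robots-Tag tells well-behaved crawlers not to index or follow links.
    response.headers["X-Robots-Tag"] = "noindex, nofollow"
    return response

@app.route("/robots.txt")
def robots_txt():
    return Response("User-agent: *\nDisallow: /\n", mimetype="text/plain")
```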


This isn’t really an “ethical” practice, more of a “how to hide that you are scraping data” practice. If you have to hide the fact that you are scraping their data, maybe you shouldn’t be doing it in the first place.



In some cases, especially during development, local caching of responses can help reduce load. You can write a little wrapper that tries to return URL contents from a local cache and falls back to a live request.
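A minimal sketch of such a wrapper, assuming the requests library; the cache directory and hashing scheme are arbitrary choices:

```python
# Rough sketch: serve repeated requests from a local on-disk cache during
# development so the target site is only hit once per URL.
import hashlib
import pathlib

import requests

CACHE_DIR = pathlib.Path(".scrape-cache")
CACHE_DIR.mkdir(exist_ok=True)

def cached_get(url):
    cache_file = CACHE_DIR / hashlib.sha256(url.encode("utf-8")).hexdigest()
    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")
    # Cache miss: fetch once from the live site and store the body.
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    cache_file.write_text(response.text, encoding="utf-8")
    return response.text
```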

