info@worthwebscraping.com
(+91) 79841 03276

Web Crawling using Python

This tutorial helps the users understand how to crawl multiple pages and grab all the internal and external links from those pages.

The previous sections highlighted the methods to scrape data from a single page. However, web scraping mostly involves navigating to multiple pages to grab data and sometimes follow the links on those pages to grab more data if required. Such navigation across webpages to find more links and follow them is termed as web crawling.

It may not be as simple as it sounds, because we need to spend a lot of time to understand the HTML structure of those pages and look for any kind of exception that may break our code. It is easy we are just following internal links since most of the websites will have the same HTML structure across all pages. But if we go to external links on a page, the HTML structure of these external links will be different and requires an in-depth analysis of the structure.

In this tutorial we will go to Wikipedia and search for Tom Cruise. Then we will grab all the internal links i.e. all the links to Wikipedia pages available on that page.

Below is the complete code, for detail explanation do watch the video:

Output: