The Internet is full of valuable data for companies in all sectors. So to make the most of all this available information, it's possible to extract data from one or more web pages. This is made possible by web scraping with Power Query. DataScientest explains how.
What is Web Scraping?
Web scraping is simply the ability to retrieve data from a web page. This means you can store all the useful information on a website in your own database.
This approach is often used by companies for competitive intelligence. The idea is to compare the data of competing companies in order to carry out a complete market study. But it’s also possible to web scrape from a government site to store official information, to retrieve statistics, from a site specialising in a complementary sector of activity, etc.
Whatever the objective, data can be retrieved from any website. And above all, web scraping is made easy with Power Query, the query editor integrated into Power BI.
How do I use Power Query for Web Scraping?
To carry out web scraping using Power Query and Power BI, follow these steps.
Extracting data
The data is extracted from the Power BI tool. From there, follow the steps below:
- In the Home tab, click on “Get data”.
- In the search bar, type “web” to select the web connector. This is what connects to the website and retrieves its data.
- Enter the URL from which you wish to extract the data and click “OK”. Most of the time, no authentication will be required to access the site.
- In some cases, however, you will need to authenticate yourself using Windows connection methods, web APIs or professional accounts.
- Click on “Connect”.
- A navigation page will open, showing you several suggested tables containing the data from the URL.
- Simply select the table from which you wish to extract the data.
- Click on “Transform data”.
Web scraping makes it easy to retrieve data from a web page, but the data is formatted according to the requirements of the website.
However, most of the time, these formats are not adapted to your database. You will therefore need to transform the table and use web scraping with Power Query.
Transforming data
Once you have clicked on “Transform data”, the Power Query editor appears to complete the web scraping. Here you will find the selected table with all its data.
To ensure that it meets your analysis and reporting needs, you can perform several modification actions with Power Query. For example, you can
- Rename the table
- Delete columns
- Rename columns
- Add columns
Once your table is ready, click on “Close and apply” to view them in Power BI.
Add all the tables on the site
Another advantage of web scraping with Power Query is the ability to retrieve data from all the pages on a website.
To do this, you need to modify the table code. Here’s how to do it:
- Use the advanced editor to display the lines of code for the table;
- You will see a line :
- Source=Web.Browser.Content(“https:www.votre-url.com/blog/page/2”). In this case, only the tables on the page are integrated. If you want all the pages, you can modify the code as follows:
- Source=Web.Browser.Content(“https:www.votre-url.com/blog/page/”&Page&”)
- The data table then becomes a function. You can rename it to FxPages before creating a list containing all the page numbers.
- Click on “Call up a custom function” to call up the FxPages function.
- A new column is added. Expand it to create a new table containing all the site data.
- Transform the data as described above.
- Click “Close and apply”.
Mastering Power Query with DataScientest
In addition to web scraping, Power Query offers you a host of features for managing your databases. But the query editor is not so easy to master.
To use all these services with ease, training is essential. That’s precisely why DataScientest offers you its data training courses. Find out more!