Web scraping lets you download the data you need from a target website, whatever information you are after. It is an excellent use case for cURL. But how do you use cURL for web scraping?
Downloading a file from the command line is the most obvious use of cURL. That simplicity is misleading, though, because cURL is a far more powerful tool. Even if you are comfortable on the command line, you are probably not using it to its full potential.
Table of Contents
- What is cURL?
- cURL vs file_get_contents
- Web Scraping using cURL
- Data Collection Automation
What is cURL?
The name is short for "Client URL". cURL is a command-line tool built on libcurl, a free data transfer library that can also be linked into compiled software. libcurl lets us send data to a server and receive its responses, which makes it possible to mimic what a real user's browser does.
cURL is widely used in PHP, both directly and through HTTP client libraries such as Guzzle that can use it under the hood, to make all kinds of HTTP requests, from API calls to webhooks and even downloading whole web pages. The curl binary is mostly used by developers when they need to test something over HTTP.
You can fetch page content for later analysis, read a service's response headers, log in to websites programmatically, and write scripts that post content (on social media or forums) or collect information. You are limited only by your imagination!
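As a rough illustration of reading response headers from the command line, here is a minimal sketch. The function names and the URL in the usage comment are assumptions made for this example, not part of any existing tool.

```shell
#!/bin/sh
# Sketch: inspect response headers with curl.
# fetch_headers and status_from_headers are hypothetical helper names.

# Print only the response headers.
# -I sends a HEAD request, -s hides the progress meter.
fetch_headers() {
    curl -sI "$1"
}

# Read a header block on stdin and print the numeric status code
# from its first line, e.g. "HTTP/1.1 200 OK" -> "200".
status_from_headers() {
    head -n 1 | awk '{print $2}'
}

# Usage (needs network access; example.com is a placeholder):
#   fetch_headers "https://example.com" | status_from_headers
```

Keeping the parsing step separate from the request makes it easy to try the parser on saved header text before pointing it at a live site.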
cURL vs file_get_contents
The cURL extension for PHP makes it possible for PHP programs to use libcurl. file_get_contents is a function that reads the contents of a file into a string. It can also fetch a URL, but it has some limits when it comes to downloading content from other web pages.
cURL differs from file_get_contents in that it is much more advanced. It gives you fine-grained control over POST requests, pages served over SSL/TLS, cookies, and much more. And yes, with cURL you can build bots that mimic a user's actions. This helps make PHP a capable language for scraping work.
One of the most fundamental things you can do with cURL is download a web page or file. To do this, you just run the curl command followed by the URL. Here is an example:
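The same kind of POST request can be tried from the command line before writing any PHP. The sketch below is only an illustration: post_form, the DRY_RUN switch, the User-Agent string, and the URL and field names in the usage comment are all assumptions for this example.

```shell
#!/bin/sh
# Sketch: send a POST request with form fields and a browser-like User-Agent.
# The User-Agent value is a placeholder, not a recommendation.
USER_AGENT="Mozilla/5.0 (compatible; example-scraper)"

# post_form URL name=value [name=value ...]
# Set DRY_RUN=1 to print the curl arguments (one per line) instead of sending.
post_form() {
    post_url="$1"
    shift
    remaining=$#
    # Rotate through the fields, turning each into "--data-urlencode name=value".
    while [ "$remaining" -gt 0 ]; do
        set -- "$@" --data-urlencode "$1"
        shift
        remaining=$((remaining - 1))
    done
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' curl -s -A "$USER_AGENT" "$@" "$post_url"
    else
        # --data-urlencode makes curl send a POST body with the value
        # URL-encoded; -A sets the User-Agent header.
        curl -s -A "$USER_AGENT" "$@" "$post_url"
    fi
}

# Usage (hypothetical URL and fields):
#   post_form "https://example.com/search" "q=red shoes" "page=2"
```

The dry-run switch is there so you can check exactly what would be sent before making a real request.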
$ curl https://domain.extension
Most of the time, running cURL this way fills your screen with raw HTML or, for binary files, a stream of unreadable characters. If you prefer to save the output to a file, you can use a regular Unix redirect.
$ curl https://domain.extension > domain.html
The -o option lets you choose the name of the destination file.
$ curl -o filename.html https://domain.extension/url
If you need to download several files with a single command, cURL makes this easy. You will usually want to use the -O option for each URL, which saves each file under its remote name.
$ curl -O https://domain.extension/file1.html -O https://domain.extension/file2.html
When you download this way, cURL will try to reuse the same connection instead of opening a new one for each file. This is particularly important if you scrape through proxies that assign a new IP address to each connection.
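When scraping many pages, the remote file name is not always usable (think of URLs with query strings), so it can help to derive the -o name from the URL yourself. This is a small sketch under that assumption; filename_for_url and save_page are hypothetical helpers, and the naming scheme is just one possible choice.

```shell
#!/bin/sh
# Sketch: pick a predictable -o file name for each downloaded page.

# Turn a URL into a flat, filesystem-safe file name, e.g.
# "https://domain.extension/a/b?x=1" -> "domain.extension_a_b_x_1.html"
filename_for_url() {
    printf '%s' "$1" |
        sed -e 's|^[a-zA-Z]*://||' \
            -e 's|[^A-Za-z0-9._-]|_|g' \
            -e 's|$|.html|'
}

# Download one page and save it under the derived name.
save_page() {
    curl -s -o "$(filename_for_url "$1")" "$1"
}

# Usage (needs network access):
#   save_page "https://domain.extension/file1.html"
```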
Web Scraping using cURL
You can use cURL to automate repetitive steps when scraping the web, which spares you a lot of tedious work. One way to do that is with PHP. Here is a simple web scraper example I discovered on GitHub.
The following code returns the cURL output as a string.
<?php
// Make the initialization with curl_init.
$ch = curl_init();
// Set the options we need with curl_setopt.
curl_setopt($ch, CURLOPT_URL, "example.com");
// Save the scraped web page as a string.
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
// Execute request with curl_exec.
$output = curl_exec($ch);
// Free system resources using curl_close.
curl_close($ch);
?>
Note that curl_close does not log you out. It only releases the allocated system resources and closes any connections that remained open. If you want to be able to log out, you must first enable cookies (for example with the CURLOPT_COOKIEJAR and CURLOPT_COOKIEFILE options). Then you log out like any other user, by requesting the logout page.
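The cookie-based session described above can also be sketched with the curl command itself, using a cookie jar file. Everything site-specific here is an assumption: the function names, the form field names (user, password), and the URLs in the usage comment are placeholders that a real site will almost certainly name differently.

```shell
#!/bin/sh
# Sketch: keep a logged-in session across requests with a cookie jar.
COOKIE_JAR="${COOKIE_JAR:-cookies.txt}"

# Log in: -c writes the cookies the server sets into the jar.
login() {
    curl -s -c "$COOKIE_JAR" \
        --data-urlencode "user=$2" \
        --data-urlencode "password=$3" "$1"
}

# Fetch a page as the logged-in user: -b sends cookies from the jar,
# -c stores any cookies the server updates.
fetch_with_session() {
    curl -s -b "$COOKIE_JAR" -c "$COOKIE_JAR" "$1"
}

# Log out by requesting the logout page, then discard the jar.
logout() {
    curl -s -b "$COOKIE_JAR" "$1" > /dev/null
    rm -f "$COOKIE_JAR"
}

# Check whether a jar file ($1) holds a cookie with the given name ($2).
# Jar lines are tab-separated; the cookie name is the sixth field.
has_cookie() {
    awk -F '\t' -v name="$2" '$6 == name { found = 1 } END { exit !found }' "$1"
}

# Usage (hypothetical URLs and credentials):
#   login "https://example.com/login" "alice" "secret"
#   fetch_with_session "https://example.com/account"
#   logout "https://example.com/logout"
```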
Data collection automation
If you are tired of complex web scraping techniques and tools, you can always switch to a different way of obtaining data. You can put your data collection project on autopilot using the Data Collector offered by Bright Data.
You can request the data in three ways:
- Existing templates. This is the easiest option. You just select a template for your target website or have Bright Data’s specialists make a custom template for your business’s needs.
- Code editor. You write simple code for crawling and data collection. You specify the target website, what to do on that website, and what information you need. The Data Collector does the rest.
- Browser extension. With the browser extension, you can pinpoint the information you want directly from your browser. You browse the web with the extension and index the elements you need. Afterward, the Data Collector’s machine learning and AI technology will extract your desired data sets.
If the website has fairly simple HTML, you can just use cURL to make the request and then pull out the required elements with command-line tools like grep, cut, sed, and so on.
$ curl https://ping.eu | grep "Your IP"
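For slightly more structured extraction, the same grep and sed approach can be wrapped in small filters that read HTML on standard input. This is a sketch only: extract_title and extract_links are hypothetical names, and pattern matching like this assumes simple, predictable markup (a title on one line, double-quoted href attributes).

```shell
#!/bin/sh
# Sketch: pull simple elements out of HTML with sed and grep.

# Print the text inside <title>...</title> when it sits on one line.
extract_title() {
    sed -n 's:.*<title>\(.*\)</title>.*:\1:p'
}

# Print the value of every double-quoted href attribute, one per line.
extract_links() {
    grep -o 'href="[^"]*"' | sed -e 's/^href="//' -e 's/"$//'
}

# Usage (needs network access; the URL is a placeholder):
#   curl -s "https://domain.extension" | extract_title
#   curl -s "https://domain.extension" | extract_links
```

Once the markup gets more complex than this, a real HTML parser is usually the safer tool.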
If you need to obtain data from several web pages that share the same format, you can combine cURL with a while or for loop. This bash script reads a list of queries from a .txt file, appends each one to the base URL variable, then scrapes the content and writes it to a text file.
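url="domain.extension/?q="
# Each whitespace-separated word in query.txt is treated as one query.
for i in $(cat query.txt); do
  # Append the query directly to the base URL and fetch the page quietly.
  content="$(curl -s "$url$i")"
  echo "$content" >> output.txt
done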
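A slightly more defensive variant of this loop might read one query per line (so queries can contain spaces), let curl URL-encode each query, skip failed requests, and pause between calls. The sketch below assumes a q parameter and a placeholder base URL; fetch_page, scrape_queries, and the DELAY setting are names invented for this example.

```shell
#!/bin/sh
# Sketch: scrape one page per query with encoding, error handling, and a delay.
BASE_URL="https://domain.extension/"   # placeholder, replace with the real site
DELAY="${DELAY:-1}"                    # seconds to wait between requests

# Fetch the result page for one query.
# -G sends the --data-urlencode value as a GET query string instead of a
# POST body; -f makes curl exit non-zero on HTTP errors (status >= 400).
fetch_page() {
    curl -sfG --data-urlencode "q=$1" "$BASE_URL"
}

# scrape_queries QUERY_FILE OUTPUT_FILE
# Reads one query per line and appends each fetched page to the output file.
scrape_queries() {
    while IFS= read -r query || [ -n "$query" ]; do
        [ -n "$query" ] || continue    # skip blank lines
        if page="$(fetch_page "$query")"; then
            printf '%s\n' "$page" >> "$2"
        else
            printf 'failed: %s\n' "$query" >&2
        fi
        sleep "$DELAY"
    done < "$1"
}

# Usage:
#   scrape_queries query.txt output.txt
```

Keeping the request in its own function means you can swap in a stub while testing the loop, without touching the network.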
Now you know how to use cURL for web scraping. You can find many pieces of code or simple web scrapers on GitHub and then modify them according to your project’s requirements. You also have Bright Data’s Data Collector. The possibilities are near endless and so is the data waiting for you on the Internet. Have fun!