How to Use Cookies and Session in Python Web Scraping
In this tutorial we will learn what cookies and sessions are, why they matter in scraping, and how to use them with the Python requests library.
Cookie
An HTTP cookie is a small piece of data sent from a website and stored on the user’s computer. It travels in HTTP headers, but it differs from other headers in that we are not the ones to choose it: the website tells us what to set (in a Set-Cookie response header), and the client then sends the cookie back with subsequent requests (in a Cookie request header).
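To make the round trip concrete, here is a minimal sketch using only the standard library. The header value is a made-up example, not one captured from a real site.

```python
from http.cookies import SimpleCookie

# A hypothetical Set-Cookie header value, as a server might send it.
set_cookie_header = "sessionid=abc123; Path=/; HttpOnly"

# Parse the header the way a client would.
cookie = SimpleCookie()
cookie.load(set_cookie_header)

# Build the Cookie header the client sends back on later requests.
# Only name=value pairs go back; attributes such as Path stay on the client.
cookie_header = "; ".join(
    f"{name}={morsel.value}" for name, morsel in cookie.items()
)

print(cookie["sessionid"]["path"])  # attribute stored alongside the cookie
print(cookie_header)                # what the server receives next time
```

Libraries such as requests do this bookkeeping for us, so in practice we rarely parse these headers by hand.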
Cookies were designed to be a reliable mechanism for websites to remember stateful information, such as items added in the shopping cart in an online store, or to record the user’s browsing activity.
They can also be used to remember arbitrary pieces of information that the user previously entered into form fields, such as names, addresses, passwords, and credit-card numbers.
Each time the user’s web browser interacts with a web server, it passes the cookie information to that server. Only the cookies stored by the browser that relate to the domain in the requested URL are sent. This means that cookies that relate to www.example.com will not be sent to www.exampledomain.com.
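The requests library applies the same domain rule. The sketch below prepares two requests without sending them, so it needs no network access. The cookie name and value are hypothetical, and cookie_header_for is a helper written for this example, not part of requests.

```python
import requests

session = requests.Session()
# Hypothetical cookie scoped to www.example.com.
session.cookies.set("token", "abc123", domain="www.example.com", path="/")


def cookie_header_for(url):
    """Return the Cookie header requests would send to url, or None."""
    prepared = session.prepare_request(requests.Request("GET", url))
    return prepared.headers.get("Cookie")


same_domain = cookie_header_for("http://www.example.com/page")
other_domain = cookie_header_for("http://www.exampledomain.com/page")

print(same_domain)   # the cookie is attached
print(other_domain)  # no cookie for a different domain
```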
In essence, a cookie is a great way of linking one page to the next for a user’s interaction with a web site or web application.
When scraping, cookies are often needed to avoid being blocked. Sending them makes our requests look more like those of a web browser, so the website is less likely to treat our scraper as a bot and block us.
Sessions
A session can be defined as a server-side storage of information that is desired to persist throughout the user’s interaction with the web site or web application.
Instead of storing large and constantly changing information via cookies in the user’s browser, only a unique identifier (called a “session id”) is stored on the client side. This session id is passed to the web server every time the browser makes an HTTP request (i.e. a page link or AJAX request). The web application pairs this session id with its internal database and retrieves the stored variables for use by the requested page.
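The lookup described above can be sketched in a few lines. This is a toy illustration only: the in-memory dictionary stands in for the server’s storage, and the function names are made up for this example.

```python
import secrets

# In-memory stand-in for the server's session storage (an assumption for
# this sketch; real applications typically use files, a database, or a cache).
SESSION_STORE = {}


def create_session(data):
    """Store data server-side and return the id the client keeps in a cookie."""
    session_id = secrets.token_hex(16)
    SESSION_STORE[session_id] = dict(data)
    return session_id


def load_session(session_id):
    """Look up the stored variables for the id sent by the browser."""
    return SESSION_STORE.get(session_id)


session_id = create_session({"user": "jerry", "cart": ["book"]})
print(load_session(session_id))    # the stored variables
print(load_session("unknown-id"))  # None: the server knows no such session
```

Only the random id ever reaches the client; the actual data stays on the server.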
In a typical file-based setup (PHP’s default, for example), a session creates a file in a temporary directory on the server where registered session variables and their values are stored. This data is available to all pages on the site during that visit.
A session ends when the user closes the browser. If the user simply leaves the site, the server terminates the session after a predetermined period of inactivity, commonly 30 minutes.
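A server-side timeout check might look roughly like the sketch below. The function name and the fixed timestamps are assumptions for illustration; real frameworks implement expiry in their own ways.

```python
import time

SESSION_TIMEOUT_SECONDS = 30 * 60  # the commonly used 30 minutes


def is_session_expired(last_seen, now=None, timeout=SESSION_TIMEOUT_SECONDS):
    """Return True if more than timeout seconds have passed since last_seen."""
    if now is None:
        now = time.time()
    return now - last_seen > timeout


start = 1_000_000.0  # arbitrary timestamp used only for the demo
print(is_session_expired(start, now=start + 10 * 60))  # 10 minutes idle
print(is_session_expired(start, now=start + 31 * 60))  # 31 minutes idle
```

For a scraper, the practical consequence is that a long-running job may need to log in again once the server has dropped its session.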
In scraping, sessions are mostly used to submit forms, such as a login form, typically with a POST request, and to stay logged in for the requests that follow. They are also useful when sending multiple requests, for example when scraping data in parallel.
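A login flow with a session might look like the sketch below. To keep it runnable without a real website, it starts a toy local server from the standard library; the /login and /profile paths, the form field names, and the credentials are all invented for this demo.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs

import requests


class DemoHandler(BaseHTTPRequestHandler):
    """Toy site: POST /login sets a cookie, GET /profile checks for it."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        form = parse_qs(self.rfile.read(length).decode())
        ok = form.get("username") == ["jerry"] and form.get("password") == ["888"]
        self.send_response(200 if ok else 401)
        if ok:
            self.send_header("Set-Cookie", "sessionid=demo-session; Path=/")
        self.send_header("Content-Length", "0")
        self.end_headers()

    def do_GET(self):
        logged_in = "sessionid=demo-session" in self.headers.get("Cookie", "")
        body = b"welcome jerry" if logged_in else b"please log in"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0 = any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
base_url = f"http://127.0.0.1:{server.server_port}"

session = requests.Session()
session.trust_env = False  # ignore proxy settings for this local demo

before = session.get(f"{base_url}/profile").text
login = session.post(
    f"{base_url}/login", data={"username": "jerry", "password": "888"}
)
after = session.get(f"{base_url}/profile").text

server.shutdown()
print(before, "|", login.status_code, "|", after)
```

The session stores the cookie from the login response and sends it automatically on the next request, which is why the second profile request succeeds without any extra code.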
Now let’s see how to use cookies and sessions with the Python requests library.
We can get the response cookies after our first request through the cookies attribute, as below, and later send these cookies with subsequent requests:
import requests
response = requests.get('http://www.dev2qa.com')
response.cookies
We can also read individual cookies by using a for loop as below:
for cookie in response.cookies:
    print('cookie domain = ' + cookie.domain)
    print('cookie name = ' + cookie.name)
    print('cookie value = ' + cookie.value)
    print('*************************************')
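When only names and values matter, the jar can also be converted to a plain dictionary or queried by name. The sketch below builds a jar by hand so that it runs without network access; in practice the jar would be response.cookies, and the cookie names and values here are made up.

```python
import requests

# Hand-built jar standing in for response.cookies (values are made up).
jar = requests.cookies.RequestsCookieJar()
jar.set("name", "jerry", domain="example.com", path="/")
jar.set("lang", "en", domain="example.com", path="/")

# Convert the jar to a plain dict of name -> value.
as_dict = requests.utils.dict_from_cookiejar(jar)
print(as_dict)

# Look up a single cookie by name; get() returns None if it is missing.
print(jar.get("name"))
print(jar.get("missing"))
```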
We can define our own custom cookies using a dictionary or a cookie jar object as below:
# Set url value.
url = 'https://www.dev2qa.com'
# Create a dictionary object.
cookies = dict(name='jerry', password='888')
# Request the url and send the cookies to it with the cookies parameter.
response = requests.get(url, cookies=cookies)
url = 'https://www.dev2qa.com'
# Create a RequestsCookieJar object.
cookies_jar = requests.cookies.RequestsCookieJar()
# Add first cookie, the parameters are cookie_key, cookie_value, cookie_domain, cookie_path.
cookies_jar.set('name', 'jerry', domain='dev2qa.com', path='/cookies')
Output:
Cookie(version=0, name='name', value='jerry', port=None, port_specified=False, domain='dev2qa.com', domain_specified=True, domain_initial_dot=False, path='/cookies', path_specified=True, secure=False, expires=None, discard=True, comment=None, comment_url=None, rest={'HttpOnly': None}, rfc2109=False)
# Add second cookie.
cookies_jar.set('password', 'jerry888', domain='dev2qa.com', path='/cookies')
Output:
Cookie(version=0, name='password', value='jerry888', port=None, port_specified=False, domain='dev2qa.com', domain_specified=True, domain_initial_dot=False, path='/cookies', path_specified=True, secure=False, expires=None, discard=True, comment=None, comment_url=None, rest={'HttpOnly': None}, rfc2109=False)
# Get url with cookie parameter.
response = requests.get(url, cookies=cookies_jar)
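One thing worth checking with a cookie jar is the path: a cookie set with path='/cookies' should only be attached to URLs under that path, so a request to the site root would likely carry no cookie at all. The sketch below prepares requests without sending them to show the difference. The /cookies/list URL is a hypothetical path used only for illustration, and cookie_header_for is a helper written for this example.

```python
import requests

jar = requests.cookies.RequestsCookieJar()
jar.set("name", "jerry", domain="dev2qa.com", path="/cookies")


def cookie_header_for(url):
    """Return the Cookie header requests would attach for url, or None."""
    request = requests.Request("GET", url, cookies=jar)
    return requests.Session().prepare_request(request).headers.get("Cookie")


root_header = cookie_header_for("https://www.dev2qa.com/")
scoped_header = cookie_header_for("https://www.dev2qa.com/cookies/list")

print(root_header)    # nothing is sent for the site root
print(scoped_header)  # the cookie is sent under /cookies
```

If a cookie should accompany every request to the site, setting path='/' is the usual choice.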
The requests module’s Session() constructor returns a requests.sessions.Session object. Later operations on this object (such as fetching a page by url) all share the same session.
import requests
# Call the requests module's Session() constructor to create a requests.sessions.Session object.
session = requests.Session()
The returned requests.sessions.Session object provides attributes and methods for fetching web pages by url and for accessing the headers and cookies used in the same session. You can use the session object like below.
# Show the default headers this session sends with every request.
session.headers
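Session headers can also be changed once and then reused for every request, which is a common way to make a scraper look more like a browser. In the sketch below the User-Agent string is only an example value, and the request is prepared rather than sent so that the headers can be inspected offline.

```python
import requests

session = requests.Session()

# Set a browser-like User-Agent once; the string below is only an example.
session.headers.update({"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"})

# Prepare (but do not send) a request to inspect the headers it would carry.
prepared = session.prepare_request(
    requests.Request("GET", "http://www.dev2qa.com")
)
print(prepared.headers["User-Agent"])
```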
# Use this session object to get a web page by url.
response = session.get('http://www.dev2qa.com')
# Once the request above completes, the session holds the cookies set by the site.
session.cookies
This is how we can use cookies and sessions with the requests library. Going forward we will use all this functionality extensively.