A Python library for extracting data from HTML and XML files

Beautiful Soup is a Python package for parsing HTML and XML documents. It builds a parse tree for each parsed page, which can be used to extract data from the markup and is useful for web scraping.

Beautiful Soup:

Beautiful Soup is a Python library for parsing HTML and XML files and extracting data from them. It works with your preferred parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.
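
As a rough illustration of navigating, searching, and modifying a parse tree, here is a minimal sketch. The sample markup is made up for this example, and "html.parser" (the parser bundled with Python) is an assumption; lxml or html5lib could be used in its place.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, invented for this illustration.
html = """
<html><body>
<h1 id="title">Reading list</h1>
<ul>
  <li><a href="/a">First</a></li>
  <li><a href="/b">Second</a></li>
</ul>
</body></html>
"""

# "html.parser" ships with Python, so no extra install is needed.
soup = BeautifulSoup(html, "html.parser")

# Navigation: reach a tag by name and read its text.
heading = soup.h1.string

# Searching: collect every <a> tag and read an attribute from each.
links = [a["href"] for a in soup.find_all("a")]

# Modification: rename the tag and replace its text in place.
soup.h1.name = "h2"
soup.h2.string = "Updated list"

print(heading)
print(links)
print(soup.h2)
```

The three steps share one tree, so the renamed heading is visible to any later search on the same soup object.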

These instructions illustrate all of the main features of Beautiful Soup 4 with examples. I show what the library is good for, how it works, how to use it, how to make it do what you want, and what to do when it does not behave as you expect.

This article covers Beautiful Soup version 4.9.3. The examples should behave the same way in Python 2.7 and Python 3.8.

You may be looking for the Beautiful Soup 3 documentation. If so, be aware that Beautiful Soup 3 is no longer being developed, and support for it will be dropped on or after December 31, 2020. See Porting code to BS4 for more information on the differences between Beautiful Soup 3 and Beautiful Soup 4.

The bs4/doc/ directory contains the full documentation in Sphinx format. To build the HTML documentation, run "make html" in that directory.

Quickstart:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("<p>Some<b>bad<i>HTML")
>>> print(soup.prettify())
<html>
 <body>
  <p>
   Some
   <b>
    bad
    <i>
     HTML
    </i>
   </b>
  </p>
 </body>
</html>
>>> soup.find(text="bad")
'bad'
>>> soup.i
<i>HTML</i>
>>> # The "xml" parser handles XML input (it relies on lxml being installed)
>>> soup = BeautifulSoup("<tag1>Some<tag2/>bad<tag3>XML", "xml")
>>> print(soup.prettify())
<?xml version="1.0" encoding="utf-8"?>
<tag1>
 Some
 <tag2/>
 bad
 <tag3>
  XML
 </tag3>
</tag1>
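
The first quickstart call does not name a parser, so Beautiful Soup picks the best one installed, and the resulting tree can differ between machines. The sketch below passes "html.parser" explicitly; that choice is an assumption made for illustration. With this parser the open tags are still closed, but the fragment is not wrapped in html and body tags as in the output above.

```python
from bs4 import BeautifulSoup

markup = "<p>Some<b>bad<i>HTML"

# Naming the parser explicitly keeps results consistent across environments.
soup = BeautifulSoup(markup, "html.parser")

# The unclosed <p>, <b>, and <i> tags are closed at the end of the document.
print(soup.prettify())

# html.parser does not add <html> or <body> around a bare fragment.
has_html_wrapper = soup.html is not None

# The same lookups as in the quickstart still work.
innermost_text = soup.i.string
found = soup.find(string="bad")

print(has_html_wrapper, innermost_text, found)
```

Picking one parser and stating it in every call is a simple way to avoid surprises when the code runs somewhere with a different set of parsers installed.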

Running the unit tests:

Beautiful Soup supports unit test discovery from the project root directory:

$ nosetests
$ python -m unittest discover -s bs4

If you checked out the source tree, you should see a script called test-all-versions in the home directory. It runs the unit tests under Python 2, then creates a temporary Python 3 conversion of the source and runs the unit tests again under Python 3.
