HTML Navigation in Python – Aspose.HTML for Python via .NET (2024)


HTML Navigation

The Aspose.Html.Dom namespace provides an API that represents and interacts with HTML, XML, and SVG documents. It is based entirely on the WHATWG DOM specification, which is supported by many modern browsers.

This article explains how to programmatically extract data from HTML documents with Aspose.HTML for Python via .NET. You will find out:

  • how to navigate through an HTML document and perform a detailed inspection of its elements using the Python API;
  • how to navigate over the document by using CSS Selector and XPath Query.

Navigating HTML involves accessing and manipulating elements and their relationships within a document. Aspose.HTML for Python via .NET lets you navigate and inspect HTML through the Document Object Model (DOM) that the library provides. The following list summarizes the main ways to access DOM elements:

  1. Document Object Model (DOM). The DOM represents the HTML document as a tree of nodes. Each node represents a part of the document, such as an element, text, or a comment.
  • The Document class represents the entire HTML, XML, or SVG document and serves as the root of the document tree.
  • The Element class represents an element in an HTML or XML document.
  • The Node class represents a single node in the document tree.
  2. Accessing Elements
  • Use methods like get_elements_by_tag_name(tagname) to retrieve elements by their tag name.
  • Use the get_element_by_id() method to access a specific element with a unique ID.
  • Use get_elements_by_class_name(class_names) to retrieve elements by their class names.
  • Use the query_selector(selector) method for a single element or query_selector_all(selector) for a list of elements that match a CSS selector.
  3. Navigating the DOM Tree
  • Access children of an element using the child_nodes or children properties.
  • Use the first_child or last_child property to return the first or last child node of the current node, which could be any type of node, such as an element, text, or comment.
  • Use the parent_node property to access the parent of a given element.
  • Access siblings using properties like previous_sibling or next_sibling.
  4. Manipulating Elements
  • Use properties of the Element class like inner_html and text_content to modify element content.
  • Get or set attributes using methods like get_attribute(qualified_name) and set_attribute(qualified_name, value).
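
The operations in this list follow the standard DOM model, so they can be tried out even without the library installed. The block below is a minimal sketch that uses Python's built-in xml.dom.minidom as a stand-in. It is not the Aspose.HTML API: minidom uses camelCase names (getElementsByTagName, parentNode, setAttribute) where the list above uses snake_case names, and the sample markup is made up for illustration.

```python
from xml.dom.minidom import parseString

# Sample markup invented for this sketch; minidom needs well-formed XML.
doc = parseString('<div id="box" class="note"><p>First</p><p>Second</p></div>')
root = doc.documentElement

# Accessing elements: retrieve all <p> elements by tag name
paragraphs = root.getElementsByTagName("p")
texts = [p.firstChild.data for p in paragraphs]

# Navigating the tree: go from a child back up to its parent
parent_tag = paragraphs[0].parentNode.tagName

# Manipulating elements: set an attribute, then read it back
root.setAttribute("class", "note highlighted")
css_class = root.getAttribute("class")

print(texts)       # ['First', 'Second']
print(parent_tag)  # div
print(css_class)   # note highlighted
```

The same three steps (find, navigate, modify) correspond to the "Accessing Elements", "Navigating the DOM Tree", and "Manipulating Elements" groups in the list.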

The API Reference Source provides a comprehensive list of classes and methods in the aspose.html.dom namespace.

Navigating the DOM Tree

This section looks at how the DOM represents an HTML document in memory and how to use the API to navigate HTML files. Four properties of the Node class (first_child, last_child, previous_sibling, and next_sibling) each provide a live reference to the node that has the named relationship to the current node, if such a node exists.

Using these properties, you can navigate through an HTML document as follows:

    from aspose.html import *

    # Prepare HTML code
    html_code = "<span>Hello</span> <span>World!</span>"

    # Initialize a document from the prepared code
    with HTMLDocument(html_code, ".") as document:
        # Get the reference to the first child (first SPAN) of the BODY
        element = document.body.first_child
        print(element.text_content)  # output: Hello

        # Get the reference to the whitespace between html elements
        element = element.next_sibling
        print(element.text_content)  # output: " "

        # Get the reference to the second SPAN element
        element = element.next_sibling
        print(element.text_content)  # output: World!
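
Because whitespace between tags is itself a text node, a sibling walk often needs to skip it. The following is a minimal sketch of a hypothetical helper, iter_nonblank_siblings, that relies only on the next_sibling and text_content properties used in the example above. So that the sketch runs without the library, it is exercised on plain stand-in objects rather than real Aspose.HTML nodes.

```python
from types import SimpleNamespace


def iter_nonblank_siblings(node):
    """Yield node and its following siblings, skipping blank ones.

    Hypothetical helper: it only reads next_sibling and text_content.
    """
    while node is not None:
        # text_content may be empty (or None) for some node types
        if (node.text_content or "").strip():
            yield node
        node = node.next_sibling


# Stand-in nodes (NOT Aspose objects) shaped like
# "<span>Hello</span> <span>World!</span>"
world = SimpleNamespace(text_content="World!", next_sibling=None)
space = SimpleNamespace(text_content=" ", next_sibling=world)
hello = SimpleNamespace(text_content="Hello", next_sibling=space)

texts = [n.text_content for n in iter_nonblank_siblings(hello)]
print(texts)  # ['Hello', 'World!']
```

With a real document, the starting node would be document.body.first_child. Note that this filter also skips elements that have no text at all, which may or may not be what you want.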

Inspecting HTML

Aspose.HTML contains a list of methods that are based on the Element Traversal Specification. You can perform a detailed inspection of the document and its elements using the API. The following Python code demonstrates how to navigate and extract specific elements and their properties from an HTML document using Aspose.HTML for Python via .NET.

    import os
    from aspose.html import *

    # Load a document from a file
    data_dir = "data"
    document_path = os.path.join(data_dir, "html_file.html")
    with HTMLDocument(document_path) as document:
        # Get the <html> element of the document
        element = document.document_element
        print(element.tag_name)  # HTML

        # Get the last element of the <html> element
        element = element.last_element_child
        print(element.tag_name)  # BODY

        # Get the first element of the <body> element
        element = element.first_element_child
        print(element.tag_name)  # H1
        print(element.text_content)  # Header 1

The code begins by building the path to the HTML file located in the “data” directory, then proceeds as follows:

  • Use HTMLDocument to load the document, and the document.document_element property to access the root HTML element. Print the tag name of this element, which is “HTML”.
  • Next, retrieve the last child element of the HTML element using the last_element_child property; this is the “BODY” element. Print its tag name.
  • Finally, use the first_element_child property to access the first child element of BODY, which is an “H1” element, and print both its tag name and its text content, “Header 1”.
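
The same element-only traversal can be generalized into a small recursive walk. The sketch below defines a hypothetical helper, outline, that reads only tag_name and children. It assumes the children collection can be iterated with a for loop, which this article does not show explicitly. The fake function builds stand-in objects (not Aspose.HTML elements) so the example runs on its own.

```python
from types import SimpleNamespace


def outline(element, depth=0):
    """Return one indented tag name per element, in depth-first order."""
    lines = ["  " * depth + element.tag_name]
    for child in element.children:  # assumption: children is iterable
        lines.extend(outline(child, depth + 1))
    return lines


def fake(tag, *children):
    """Build a stand-in element exposing tag_name and children only."""
    return SimpleNamespace(tag_name=tag, children=list(children))


# Stand-in tree loosely shaped like a small HTML page
root = fake("HTML", fake("HEAD"), fake("BODY", fake("H1"), fake("P")))
tree = outline(root)
print("\n".join(tree))
```

With a loaded document, the call would be outline(document.document_element), giving a quick structural overview before you decide which elements to extract.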

XPath Query

XPath (XML Path Language) is an alternative to DOM navigation. It is a query language for selecting data from HTML documents: it works on the DOM representation of the document and selects nodes by various criteria. XPath expression syntax is fairly simple and, more importantly, easy to read and maintain.

The following example shows how to use XPath queries with the Aspose.HTML Python API:

    from aspose.html import *
    from aspose.html.dom.xpath import *

    # Prepare HTML code
    code = """
        <div class='happy'>
            <div>
                <span>Hello,</span>
            </div>
        </div>
        <p class='happy'>
            <span>World!</span>
        </p>
    """

    # Initialize a document based on the prepared code
    with HTMLDocument(code, ".") as document:
        # Evaluate an XPath expression that selects all SPAN elements nested
        # inside elements whose 'class' attribute equals 'happy'
        result = document.evaluate("//*[@class='happy']//span",
                                   document,
                                   None,
                                   XPathResultType.ANY,
                                   None)

        # Iterate over the resulting nodes
        node = result.iterate_next()
        while node is not None:
            print(node.text_content)
            node = result.iterate_next()
        # output: Hello,
        # output: World!

The evaluate() method in the Aspose.HTML Python library allows you to execute XPath queries against HTML or XML documents, enabling detailed data extraction and navigation. It takes an XPath expression as its primary parameter, specifying the query to be executed, and returns an XPathResult object based on the defined result type.
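
The iterate_next() loop from the example can be wrapped in a generator so that results work with an ordinary for loop. This is a minimal sketch: iter_xpath_nodes is a hypothetical helper, and FakeXPathResult is a stand-in class (not the Aspose.HTML one) that mimics only the iterate_next() behavior shown above, returning None when the results are exhausted.

```python
def iter_xpath_nodes(result):
    """Yield nodes from an object exposing iterate_next() until it returns None."""
    node = result.iterate_next()
    while node is not None:
        yield node
        node = result.iterate_next()


class FakeXPathResult:
    """Stand-in (NOT the Aspose class) exposing only iterate_next()."""

    def __init__(self, nodes):
        self._nodes = iter(nodes)

    def iterate_next(self):
        # Return the next item, or None once the results run out
        return next(self._nodes, None)


found = list(iter_xpath_nodes(FakeXPathResult(["Hello,", "World!"])))
print(found)  # ['Hello,', 'World!']
```

With a real query, the loop would read: for node in iter_xpath_nodes(document.evaluate(...)), followed by print(node.text_content).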

CSS Selector

In addition to HTML navigation and XPath, the Aspose.HTML Python API supports the CSS Selector API. This API allows you to formulate search patterns using CSS Selectors syntax to identify and select elements within an HTML document. For instance, the query_selector_all(selector) method can be used to traverse an HTML document and retrieve elements that match a specified CSS selector. This method accepts a CSS selector string as its argument and returns a NodeList containing all elements that conform to the selector criteria. Using CSS selectors, you can efficiently find and manipulate elements based on their attributes, classes, IDs, and other criteria, making it a versatile tool for both simple and complex document parsing tasks. This functionality is particularly useful for tasks such as styling, data extraction, and content manipulation within an HTML document.

    from aspose.html import *

    # Prepare HTML code
    code = """
        <div class='happy'>
            <div>
                <span>Hello,</span>
            </div>
        </div>
        <p class='happy'>
            <span>World!</span>
            <p>I use CSS Selector.</p>
        </p>
    """

    # Initialize a document based on the prepared code
    with HTMLDocument(code, ".") as document:
        # Use a CSS selector that matches all <span> elements nested inside
        # elements with the class "happy"
        elements = document.query_selector_all(".happy span")

        # Iterate over the resulting list of elements
        for element in elements:
            print(element.text_content)
        # output: Hello,
        # output: World!
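
One difference is worth keeping in mind when switching between the XPath and CSS examples. The XPath predicate [@class='happy'] compares the whole attribute string, whereas the CSS selector .happy matches any element whose class list contains the token happy. The sketch below reproduces the two rules with Python's built-in xml.dom.minidom (not Aspose.HTML) purely to show the distinction. The helper names and sample markup are hypothetical.

```python
from xml.dom.minidom import parseString

# Sample markup invented for this sketch
doc = parseString(
    "<body>"
    "<p class='happy'>one</p>"
    "<p class='happy big'>two</p>"
    "<p class='unhappy'>three</p>"
    "</body>"
)
paragraphs = doc.getElementsByTagName("p")


def matches_exact(element, value):
    """Mimic the XPath predicate [@class='value']: whole-string comparison."""
    return element.getAttribute("class") == value


def matches_token(element, name):
    """Mimic the CSS selector .name: whitespace-separated token membership."""
    return name in element.getAttribute("class").split()


exact = [p.firstChild.data for p in paragraphs if matches_exact(p, "happy")]
token = [p.firstChild.data for p in paragraphs if matches_token(p, "happy")]
print(exact)  # ['one']
print(token)  # ['one', 'two']
```

In the article's examples both queries return the same spans because every class attribute is exactly "happy"; the results would diverge as soon as an element carries more than one class.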

Conclusion

The Aspose.HTML for Python via .NET library offers a robust set of tools for working with HTML, XML, and SVG documents and follows the WHATWG DOM specification, which modern browsers widely support. Using the HTMLDocument class together with its navigation properties, XPath queries, and CSS selectors, you can inspect and extract HTML content without writing manual parsing code.

Aspose.HTML offers free online HTML Web Applications: a collection of converters, mergers, SEO tools, HTML code generators, URL tools, web accessibility checks, and more. The applications work on any operating system with a web browser and do not require any additional software installation.
