
HTMLExtract

Francesco edited this page Mar 31, 2018 · 33 revisions

Introduced in t-ui beta 6.5.5.

This feature lets you extract text from HTML pages and display it inside t-ui.

XPath

XPath is a query language used to select particular nodes and tags in HTML/XML documents. It's easy to learn and very powerful.
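To get a feel for how an XPath expression selects nodes, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree. It is only an illustration and has nothing to do with how t-ui evaluates expressions internally. The sample page and its URLs are made up, and ElementTree understands only a subset of XPath and needs well-formed markup, so real web pages may need a proper HTML parser.

```python
import xml.etree.ElementTree as ET

# A small, well-formed sample page (hypothetical content, for illustration only).
PAGE = """
<html>
  <body>
    <a href="https://example.com/one" class="myClass">First</a>
    <a href="https://example.com/two" class="other">Second</a>
    <p class="myClass">Not a link</p>
  </body>
</html>
"""

root = ET.fromstring(PAGE)

# Rough equivalent of //a[@class="myClass"].
# ElementTree only accepts relative paths, hence the leading dot.
matches = root.findall(".//a[@class='myClass']")

for node in matches:
    # tag name, value of the href attribute, text inside the tag
    print(node.tag, node.get("href"), node.text)
```

Only the first link is selected: the second one has a different class, and the paragraph is not an a tag.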

Format

Values:

  • %n -> newline

  • %t -> tag name

  • %t(attributeName) -> the value of the attribute attributeName of the matched node

  • %a(format)(separator) -> prints every attribute of the matched node, each one rendered with format and joined by separator. Inside format you can use:

    • %an -> attribute name
    • %av -> attribute value
  • %v -> tag value (the text inside the tag)

  • #[URL] -> link

  • #rrggbb[text] -> color the text

Example

Matched node:
<a href="https://github.com/Andre1299/TUI-ConsoleLauncher/subscription" class="myClass" role="button">This is a link</a>

Example 1

Format:
#[%t(href)]

Output:
https://github.com/Andre1299/TUI-ConsoleLauncher/subscription

Example 2

Format:
%t -> %v%n%a(%an = %av)(%n)

Output:

a -> This is a link
href = https://github.com/Andre1299/TUI-ConsoleLauncher/subscription
class = myClass
role = button
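The two examples above can be reproduced with a short Python sketch. This is not t-ui's actual implementation, only a hypothetical re-creation of the placeholder rules listed under Format. The function name render, the choice to print an empty string for a missing attribute, and leaving the #[...] link marker untouched are all assumptions made for this illustration.

```python
import re

# One pattern for all placeholders, so that substituted values are never re-scanned.
TOKEN = re.compile(
    r"%a\(([^)]*)\)\(([^)]*)\)"  # %a(format)(separator)
    r"|%t\(([^)]*)\)"            # %t(attributeName)
    r"|%([tvn])"                 # %t, %v, %n
)


def render(fmt, tag, attrs, text):
    """Expand the placeholders of fmt for a single matched node (hypothetical helper)."""
    simple = {"t": tag, "v": text, "n": "\n"}

    def expand(match):
        item_fmt, sep, attr_name, key = match.groups()
        if item_fmt is not None:
            # %a(...)(...): render every attribute, then join with the separator.
            sep = sep.replace("%n", "\n")
            return sep.join(
                re.sub(r"%a([nv])",
                       lambda m, n=name, v=value: n if m.group(1) == "n" else v,
                       item_fmt)
                for name, value in attrs.items()
            )
        if attr_name is not None:
            # %t(attributeName): assumption, a missing attribute prints nothing.
            return attrs.get(attr_name, "")
        return simple[key]

    return TOKEN.sub(expand, fmt)


# The matched node from the example above.
NODE_ATTRS = {
    "href": "https://github.com/Andre1299/TUI-ConsoleLauncher/subscription",
    "class": "myClass",
    "role": "button",
}

print(render("#[%t(href)]", "a", NODE_ATTRS, "This is a link"))
print(render("%t -> %v%n%a(%an = %av)(%n)", "a", NODE_ATTRS, "This is a link"))
```

The second call prints the same four lines as Example 2. The first call prints the URL still wrapped in #[...], because turning that marker into a clickable link is left to t-ui.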

Steps

1. Find a webpage

2. Decide the node kind

You can select any number of nodes, but all of them will be of the same kind. Decide carefully what kind of nodes you need.

3. Create a new XPath expression

4. Test!

5. Add the XPath expression to t-ui

htmlextract -addxpath [ID] [expression]

For instance:
htmlextract -addxpath 1 //a[@class="myClass"]

6. Add a new format to t-ui

htmlextract -addformat [ID] [format]

For instance:
htmlextract -addformat 5 #[%t(href)]

7. Use it!

htmlextract -use [XPath ID] [optional: Format ID] [webpage]

For instance:
htmlextract -use 1 5 https://github.com/Andre1299/TUI-ConsoleLauncher/wiki/HTMLExtract

Note that [Format ID] is optional. If you omit it, t-ui will use the value of htmlextract_default_format instead.
