
HTMLExtract

Francesco edited this page Apr 18, 2018 · 33 revisions

Introduced in t-ui beta 6.6

This feature lets you extract text from HTML pages and display it inside t-ui.

XPath

XPath is a query language used to select particular nodes and tags in HTML/XML documents. It is easy to learn and very powerful.
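As a rough illustration of what an XPath expression selects, here is a minimal Python sketch. It assumes a small, hypothetical, well-formed page and uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath and parses XML rather than arbitrary HTML. It is not the engine t-ui uses.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (not taken from any real site).
PAGE = """
<html>
  <body>
    <a class="foo" href="https://example.com/one">First</a>
    <a class="bar" href="https://example.com/two">Second</a>
    <p><a class="foo" href="https://example.com/three">Third</a></p>
  </body>
</html>
""".strip()

root = ET.fromstring(PAGE)

# ElementTree needs a leading "." for descendant searches, so the
# expression //a[@class="foo"] is written as .//a[@class='foo'] here.
matches = root.findall(".//a[@class='foo']")

# Every match is an <a> element, in document order.
for node in matches:
    print(node.tag, node.get("href"), node.text)
```

Both `class="foo"` links are selected regardless of how deeply they are nested, while the `class="bar"` link is skipped.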

JSONPath

JSONPath offers the same kind of features as XPath, but it works with JSON documents.
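To give a feel for how a JSONPath expression walks a JSON document, here is a toy Python sketch. The `select` function and the sample data are hypothetical, and the function understands only `$`, `.name`, `[index]` and `[*]`. Real JSONPath implementations support much more, and this is not the library t-ui uses.

```python
import json
import re


def select(document, path):
    """Evaluate a tiny JSONPath subset: $, .name, [index] and [*]."""
    if not path.startswith("$"):
        raise ValueError("path must start with $")
    current = [document]
    # Each step is either ".name" (first group) or "[index]" / "[*]" (second group).
    for name, index in re.findall(r"\.(\w+)|\[(\d+|\*)\]", path[1:]):
        following = []
        for value in current:
            if name:
                if isinstance(value, dict) and name in value:
                    following.append(value[name])
            elif index == "*":
                if isinstance(value, list):
                    following.extend(value)
            elif isinstance(value, list) and int(index) < len(value):
                following.append(value[int(index)])
        current = following
    return current


# Hypothetical sample document.
DATA = json.loads("""
{"store": {"book": [
    {"title": "First", "price": 8},
    {"title": "Second", "price": 12}
]}}
""")

print(select(DATA, "$.store.book[*].title"))  # ['First', 'Second']
print(select(DATA, "$.store.book[1].price"))  # [12]
```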

Format

Values:

  • %n -> newline

  • %t -> tag name

  • %t(attributeName) -> the value of the attribute attributeName of the matched node

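  • %a(format)(separator) -> prints every attribute of the matched node, rendering each one with format and joining them with separator. Inside format:

    • %an -> attribute name
    • %av -> attribute value

  • %v -> tag value (the text inside the tag)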

  • #[URL] -> link

  • #rrggbb[text] -> color the text

Example

Matched node:
<a href="https://github.com/Andre1299/TUI-ConsoleLauncher/subscription" class="myClass" role="button">This is a link</a>

Example 1

Format:
#[%t(href)]

Output:
https://github.com/Andre1299/TUI-ConsoleLauncher/subscription

Example 2

Format:
%t -> %v%n%a(%an = %av)(%n)

Output:

a -> This is a link
href = https://github.com/Andre1299/TUI-ConsoleLauncher/subscription
class = myClass
role = button
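The placeholder expansion in the examples above can be approximated in a few lines of Python. This is only a sketch under stated assumptions: `render` is a hypothetical helper, not t-ui's actual implementation, and it leaves the `#[URL]` link and `#rrggbb[text]` color markers untouched, since those affect how t-ui displays the text rather than what the text is.

```python
import re
import xml.etree.ElementTree as ET


def render(fmt: str, node: ET.Element) -> str:
    """Expand a subset of the format placeholders for one matched node."""

    def expand_attributes(match: re.Match) -> str:
        item_format, separator = match.group(1), match.group(2)
        items = [
            item_format.replace("%an", name).replace("%av", value)
            for name, value in node.attrib.items()
        ]
        return separator.join(items)

    # %a(format)(separator) goes first, so its inner %an / %av are consumed here.
    out = re.sub(r"%a\(([^)]*)\)\(([^)]*)\)", expand_attributes, fmt)
    # %t(attributeName) must be handled before the bare %t.
    out = re.sub(r"%t\(([^)]*)\)", lambda m: node.get(m.group(1), ""), out)
    out = out.replace("%t", node.tag)
    out = out.replace("%v", node.text or "")
    # Newlines last, so a %n used as separator is expanded too.
    return out.replace("%n", "\n")


# The matched node from the example above.
NODE = ET.fromstring(
    '<a href="https://github.com/Andre1299/TUI-ConsoleLauncher/subscription" '
    'class="myClass" role="button">This is a link</a>'
)

print(render("%t -> %v%n%a(%an = %av)(%n)", NODE))
```

With the format from Example 2, this prints the same four lines as the output shown there. With the format from Example 1, the sketch returns the URL still wrapped in `#[` and `]`, because link rendering is not modeled.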

Steps

1. Find a webpage

2. Decide the node kind

You can select any number of nodes, but all of them will be of the same kind. Decide carefully what kind of nodes you need.

3. Create a new XPath/JSONPath expression

4. Test!

5. Add the expression to t-ui

htmlextract -add [json OR xpath] [ID] [expression]

For instance:
htmlextract -add xpath 1 //a[@class="foo"]

6. Add a new format to t-ui (you can also use the default one)

htmlextract -add format [ID] [expression]

For instance:
htmlextract -add format 5 #[%t(href)]

7. Use it!

htmlextract -query [ID] [optional: Format ID] [webpage]

For instance:
htmlextract -query 1 5 https://website.com/page.html

Note that [Format ID] is optional: if you omit it, t-ui uses the value of htmlextract_default_format instead.
