extractDomainsFromNewsExport

Script to transfer the domains from the Nextcloud News Export to a file.

Usage

Usage: ./domains2ips.sh
-i, --input <file>  Read file with domains, separated by newline
-h, --help          Print usage

Input

This script expects an OPML file with URLs as value for tag xmlUrl.

<?xml version="1.0" encoding="UTF-8"?>
<opml version="2.0">
  <head>
    <title>Subscriptions</title>
  </head>
  <body>
    <outline title="Podcasts" text="Podcasts">
      <outline title="BSI RSS-Newsfeed Podcast &quot;Ins Internet - mit Sicherheit!&quot;" text="BSI RSS-Newsfeed Podcast &quot;Ins Internet - mit Sicherheit!&quot;" type="rss" xmlUrl="https://www.bsi.bund.de/SiteGlobals/Functions/RSSFeed/RSSNewsfeed/RSSNewsfeed_Podcast_Update_verfuegbar.xml" htmlUrl="https://www.bsi.bund.de"/>
      <outline title="Breach FM - der Infosec Podcast" text="Breach FM - der Infosec Podcast" type="rss" xmlUrl="https://feeds.transistor.fm/breach-fm-der-infosec-podcast" htmlUrl=""/>
    </outline>
    <outline title="Auslegungssache – der c't-Datenschutz-Podcast" text="Auslegungssache – der c't-Datenschutz-Podcast" type="rss" xmlUrl="https://ct-auslegungssache.podigee.io/feed/mp3" htmlUrl="https://heise.de/-4571821"/>
    <outline title="Blog on KittenLabs" text="Blog on KittenLabs" type="rss" xmlUrl="https://kittenlabs.de/blog/index.xml" htmlUrl="https://kittenlabs.de/blog/"/>
  </body>
</opml>

Output

This script generates two files:

extractDomainsFromNewsExport.out

  www.bsi.bund.de
  feeds.transistor.fm
  ct-auslegungssache.podigee.io
  kittenlabs.de

extractDomainsFromNewsExport.log

2025-05-03 13:33:42 [DEBUG] Add www.bsi.bund.de
2025-05-03 13:33:42 [DEBUG] Add feeds.transistor.fm
2025-05-03 13:33:42 [DEBUG] Add ct-auslegungssache.podigee.io
2025-05-03 13:33:42 [DEBUG] Add kittenlabs.de

Further Details

Extract only the domains from file by using a regular expression with POSIX syntax:

Extract tag-value of xmlUrl from file (1st command)
Remove tag (1st pipe)
Remove schema (2nd pipe)

grep -o 'xmlUrl="https\{0,1\}://[^/"]*' $IN_FILE | sed 's/xmlUrl="//' | sed 's|https\{0,1\}://||'

Output of 1st command

xmlUrl="https://www.bsi.bund.de
xmlUrl="https://feeds.transistor.fm
xmlUrl="https://ct-auslegungssache.podigee.io
xmlUrl="https://kittenlabs.de

Output with 2nd pipe

https://www.bsi.bund.de
https://feeds.transistor.fm
https://ct-auslegungssache.podigee.io
https://kittenlabs.de

Final output: see above

Nice to know - Regular expression styles POSIX vs. PCRE with `grep`

The following patterns result in the same output.

PCRE: grep -oP 'xmlUrl="https?:\/\/(?:www\.)?\K[^\/"]+' "$EXPORTFILE" POSIX: grep -o 'xmlUrl="https\{0,1\}://[^/"]*' export_test.opml | sed 's/xmlUrl="//' | sed 's|https\{0,1\}://||'

grep -P interpretes the pattern as PCRE (see pcre.org).

The PCRE syntax can only be used if GNU grep is installed (see BSD grep). The POSIX syntax is also supported by systems which lack support for PCRE.

To check your grep version, run grep -V.

user@server:~$ grep -V
grep (GNU grep) 3.11
Copyright © 2023 Free Software Foundation, Inc.
Lizenz GPLv3+: GNU GPL Version 3 oder neuer <https://gnu.org/licenses/gpl.html>.
Dies ist freie Software: Sie können sie ändern und weitergeben.
Es gibt keinerlei Garantien, soweit gesetzlich zulässig.

Geschrieben von Mike Haertel und anderen; siehe
<https://git.savannah.gnu.org/cgit/grep.git/tree/AUTHORS>.

grep -P verwendet PCRE2 10.42 2022-12-11

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
extractDomainsFromNewsExport.sh		extractDomainsFromNewsExport.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

extractDomainsFromNewsExport

Usage

Input

Output

Further Details

Nice to know - Regular expression styles POSIX vs. PCRE with `grep`

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

extractDomainsFromNewsExport

Usage

Input

Output

Further Details

Nice to know - Regular expression styles POSIX vs. PCRE with grep

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Nice to know - Regular expression styles POSIX vs. PCRE with `grep`

Packages