Preprocessing notebook

kuefmz · kuefmz · commit 0dee3a046e59 · 2022-05-12T04:28:14.000+02:00
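
For reviewers who want to reproduce the stage-by-stage output in the notebook diff below without running the project code, the following is a minimal standard-library sketch of the logged steps, in the same order as the log lines. It is an assumption-laden illustration, not the contents of `preprocessing.py`, which this diff does not show: the function names, the regular expressions, and the tiny `STOP_WORDS` set are all hypothetical. The three lemmatization steps and the unlogged contraction expansion (visible between the "remove tags" and "remove punctuations" outputs, where "don't" becomes "do not") are left out because they need an NLP library.

```python
import re
import string

# Hypothetical, deliberately tiny stop-word list; the list used by the
# real pipeline is not visible in this diff.
STOP_WORDS = {"a", "an", "the", "for", "from", "or", "if", "you",
              "do", "not", "can", "be", "to", "will", "where"}


def remove_codeblocks(text):
    # Drop everything between a pair of triple-backtick fences.
    return re.sub(r"```.*?```", " ", text, flags=re.DOTALL)


def remove_links(text):
    # Drop http/https URLs up to the next whitespace character.
    return re.sub(r"https?://\S+", " ", text)


def remove_tags(text):
    # Drop HTML-like tags such as <img .../>.
    return re.sub(r"<[^>]+>", " ", text)


def remove_punctuation(text):
    # Replace each ASCII punctuation character with a space.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.translate(table)


def remove_non_ascii(text):
    # Discard every character that cannot be encoded as ASCII.
    return text.encode("ascii", errors="ignore").decode("ascii")


def drop_short_or_numeric(tokens):
    # Keep tokens longer than two characters that are not purely digits.
    return [t for t in tokens if len(t) > 2 and not t.isdigit()]


def preprocess(text):
    # String-level steps, in the order the notebook log reports them.
    for step in (remove_codeblocks, remove_links, remove_tags,
                 remove_punctuation, str.lower, remove_non_ascii):
        text = step(text)
    # Token-level steps (lemmatization omitted in this sketch).
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    tokens = drop_short_or_numeric(tokens)
    return " ".join(tokens)


sample = ('# Demo Tool https://example.org/x ``` pip install demo ``` '
          '<img src="a.png"/> Extracts the name from a readme. 日本語 42')
print(preprocess(sample))
```

Because contractions are not expanded here, an input such as "don't" would come out as the token "don" rather than being normalized to "do not" as in the notebook output.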
diff --git a/src/util/example_readme.md b/src/util/example_readme.md
@@ -0,0 +1,21 @@
+# Software Metadata Extraction Framework (SOMEF) 
+https://pypi.org/project/somef/
+
+```
+cd somef
+pip install -e .
+```
+
+<img src="docs/logo.png" alt="logo" width="150"/>
+
+A command line interface for automatically extracting relevant information from readme files.
+
+## Features
+Given a readme file (or a GitHub/Gitlab repository) SOMEF will extract the following categories (if present):
+- **Name**: Name identifying a software component
+- **Full name**: Name + owner (owner/name)
+- **Source code**: Link to the source code (typically the repository where the readme can be found)
+
+If you don't include an authentication token, you can still use SOMEF.
+
+ディープラーニングに関する論文の実装を先行研究から順に進める
diff --git a/src/util/preprocessor.ipynb b/src/util/preprocessor.ipynb
@@ -31,14 +31,16 @@
      "output_type": "stream",
      "text": [
       "                                                Text\n",
-      "0  # Title: This isn't a real README ## Subtitle ...\n"
+      "0  # Software Metadata Extraction Framework (SOME...\n",
+      "# Software Metadata Extraction Framework (SOMEF)  https://pypi.org/project/somef/  ``` cd somef pip install -e . ```  <img src=\"docs/logo.png\" alt=\"logo\" width=\"150\"/>  A command line interface for automatically extracting relevant information from readme files.  ## Features Given a readme file (or a GitHub/Gitlab repository) SOMEF will extract the following categories (if present): - **Name**: Name identifying a software component - **Full name**: Name + owner (owner/name) - **Source code**: Link to the source code (typically the repository where the readme can be found)  If you don't include an authentication token, you can still use SOMEF. ディープラーニングに関する論文の実装を先行研究から順に進める\n"
      ]
     }
    ],
    "source": [
-    "t = \"\"\"# Title: This isn't a real README ## Subtitle nr 1 ``` pip3 install pandas ```  https://paperswithcode.com/ https://paperswithcode.com/  <!doctype html> <html lang=\"en\"> <head> <title>Some readme example from 2022</title> </head> <body> <p>Body comes here</p> </body> </html>  ## Subtitle nr 2  Test lemmatizer: hidden, walking, ran, found  Some non ASCII character: ディープラーニングに関する論文の実装を先行研究から順に進める\"\"\"\n",
+    "t = \"\"\"# Software Metadata Extraction Framework (SOMEF)  https://pypi.org/project/somef/  ``` cd somef pip install -e . ```  <img src=\"docs/logo.png\" alt=\"logo\" width=\"150\"/>  A command line interface for automatically extracting relevant information from readme files.  ## Features Given a readme file (or a GitHub/Gitlab repository) SOMEF will extract the following categories (if present): - **Name**: Name identifying a software component - **Full name**: Name + owner (owner/name) - **Source code**: Link to the source code (typically the repository where the readme can be found)  If you don't include an authentication token, you can still use SOMEF. ディープラーニングに関する論文の実装を先行研究から順に進める\"\"\"\n",
     "text = pd.DataFrame([t], columns=[TEXT])\n",
-    "print(text)"
+    "print(text)\n",
+    "print(text[TEXT][0])"
    ]
   },
   {
@@ -50,20 +52,33 @@
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "\u001b[34mpreprocessing.py:   194: 2022-04-25 01:11:33Z:\u001b[0m Preprocessing: Process name: \"remove codeblocks\".\n",
-      "\u001b[34mpreprocessing.py:   194: 2022-04-25 01:11:33Z:\u001b[0m Preprocessing: Process name: \"remove links\".\n",
-      "\u001b[34mpreprocessing.py:   194: 2022-04-25 01:11:33Z:\u001b[0m Preprocessing: Process name: \"remove tags\".\n",
-      "\u001b[34mpreprocessing.py:   194: 2022-04-25 01:11:33Z:\u001b[0m Preprocessing: Process name: \"remove punctuations\".\n",
-      "\u001b[34mpreprocessing.py:   194: 2022-04-25 01:11:33Z:\u001b[0m Preprocessing: Process name: \"transform to lowercase\".\n",
-      "\u001b[34mpreprocessing.py:   194: 2022-04-25 01:11:33Z:\u001b[0m Preprocessing: Process name: \"remove non-ascii characters\".\n",
-      "\u001b[34mpreprocessing.py:   194: 2022-04-25 01:11:33Z:\u001b[0m Preprocessing: Process name: \"lemmatize verbs\".\n",
-      "\u001b[34mpreprocessing.py:   194: 2022-04-25 01:11:34Z:\u001b[0m Preprocessing: Process name: \"lemmatize nouns\".\n",
-      "\u001b[34mpreprocessing.py:   194: 2022-04-25 01:11:34Z:\u001b[0m Preprocessing: Process name: \"lemmatize adjectives\".\n",
-      "\u001b[34mpreprocessing.py:   194: 2022-04-25 01:11:34Z:\u001b[0m Preprocessing: Process name: \"remove stop_words\".\n",
-      "\u001b[34mpreprocessing.py:   194: 2022-04-25 01:11:34Z:\u001b[0m Preprocessing: Process name: \"remove tokens only containing numbers or two char\".\n",
-      "\u001b[34mpreprocessing.py:   194: 2022-04-25 01:11:34Z:\u001b[0m Preprocessing: Process name: \"join tokens\".\n",
-      "\u001b[34mpreprocessing.py:   198: 2022-04-25 01:11:34Z:\u001b[0m Preprocessing: drop empty rows.\n",
-      "title real readme subtitle readme body come subtitle test lemmatizer hide walk find non ascii character\n"
+      "\u001b[34mpreprocessing.py:   195: 2022-05-12 02:16:00Z:\u001b[0m Preprocessing: Process name: \"remove codeblocks\".\n",
+      "# Software Metadata Extraction Framework (SOMEF)  https://pypi.org/project/somef/     <img src=\"docs/logo.png\" alt=\"logo\" width=\"150\"/>  A command line interface for automatically extracting relevant information from readme files.  ## Features Given a readme file (or a GitHub/Gitlab repository) SOMEF will extract the following categories (if present): - **Name**: Name identifying a software component - **Full name**: Name + owner (owner/name) - **Source code**: Link to the source code (typically the repository where the readme can be found)  If you don't include an authentication token, you can still use SOMEF. ディープラーニングに関する論文の実装を先行研究から順に進める\n",
+      "\u001b[34mpreprocessing.py:   195: 2022-05-12 02:16:00Z:\u001b[0m Preprocessing: Process name: \"remove links\".\n",
+      "# Software Metadata Extraction Framework (SOMEF)      <img src=\"docs/logo.png\" alt=\"logo\" width=\"150\"/>  A command line interface for automatically extracting relevant information from readme files.  ## Features Given a readme file (or a GitHub/Gitlab repository) SOMEF will extract the following categories (if present): - **Name**: Name identifying a software component - **Full name**: Name + owner (owner/name) - **Source code**: Link to the source code (typically the repository where the readme can be found)  If you don't include an authentication token, you can still use SOMEF. ディープラーニングに関する論文の実装を先行研究から順に進める\n",
+      "\u001b[34mpreprocessing.py:   195: 2022-05-12 02:16:00Z:\u001b[0m Preprocessing: Process name: \"remove tags\".\n",
+      "# Software Metadata Extraction Framework (SOMEF)        A command line interface for automatically extracting relevant information from readme files.  ## Features Given a readme file (or a GitHub/Gitlab repository) SOMEF will extract the following categories (if present): - **Name**: Name identifying a software component - **Full name**: Name + owner (owner/name) - **Source code**: Link to the source code (typically the repository where the readme can be found)  If you don't include an authentication token, you can still use SOMEF. ディープラーニングに関する論文の実装を先行研究から順に進める\n",
+      "# Software Metadata Extraction Framework (SOMEF)        A command line interface for automatically extracting relevant information from readme files.  ## Features Given a readme file (or a GitHub/Gitlab repository) SOMEF will extract the following categories (if present): - **Name**: Name identifying a software component - **Full name**: Name + owner (owner/name) - **Source code**: Link to the source code (typically the repository where the readme can be found)  If you do not include an authentication token, you can still use SOMEF. ディープラーニングに関する論文の実装を先行研究から順に進める\n",
+      "\u001b[34mpreprocessing.py:   195: 2022-05-12 02:16:00Z:\u001b[0m Preprocessing: Process name: \"remove punctuations\".\n",
+      "  Software Metadata Extraction Framework  SOMEF         A command line interface for automatically extracting relevant information from readme files      Features Given a readme file  or a GitHub Gitlab repository  SOMEF will extract the following categories  if present       Name    Name identifying a software component     Full name    Name   owner  owner name      Source code    Link to the source code  typically the repository where the readme can be found   If you do not include an authentication token  you can still use SOMEF  ディープラーニングに関する論文の実装を先行研究から順に進める\n",
+      "\u001b[34mpreprocessing.py:   195: 2022-05-12 02:16:00Z:\u001b[0m Preprocessing: Process name: \"transform to lowercase\".\n",
+      "  software metadata extraction framework  somef         a command line interface for automatically extracting relevant information from readme files      features given a readme file  or a github gitlab repository  somef will extract the following categories  if present       name    name identifying a software component     full name    name   owner  owner name      source code    link to the source code  typically the repository where the readme can be found   if you do not include an authentication token  you can still use somef  ディープラーニングに関する論文の実装を先行研究から順に進める\n",
+      "\u001b[34mpreprocessing.py:   195: 2022-05-12 02:16:00Z:\u001b[0m Preprocessing: Process name: \"remove non-ascii characters\".\n",
+      "['software', 'metadata', 'extraction', 'framework', 'somef', 'a', 'command', 'line', 'interface', 'for', 'automatically', 'extracting', 'relevant', 'information', 'from', 'readme', 'files', 'features', 'given', 'a', 'readme', 'file', 'or', 'a', 'github', 'gitlab', 'repository', 'somef', 'will', 'extract', 'the', 'following', 'categories', 'if', 'present', 'name', 'name', 'identifying', 'a', 'software', 'component', 'full', 'name', 'name', 'owner', 'owner', 'name', 'source', 'code', 'link', 'to', 'the', 'source', 'code', 'typically', 'the', 'repository', 'where', 'the', 'readme', 'can', 'be', 'found', 'if', 'you', 'do', 'not', 'include', 'an', 'authentication', 'token', 'you', 'can', 'still', 'use', 'somef', '']\n",
+      "\u001b[34mpreprocessing.py:   195: 2022-05-12 02:16:00Z:\u001b[0m Preprocessing: Process name: \"lemmatize verbs\".\n",
+      "['software', 'metadata', 'extraction', 'framework', 'somef', 'a', 'command', 'line', 'interface', 'for', 'automatically', 'extract', 'relevant', 'information', 'from', 'readme', 'file', 'feature', 'give', 'a', 'readme', 'file', 'or', 'a', 'github', 'gitlab', 'repository', 'somef', 'will', 'extract', 'the', 'follow', 'categories', 'if', 'present', 'name', 'name', 'identify', 'a', 'software', 'component', 'full', 'name', 'name', 'owner', 'owner', 'name', 'source', 'code', 'link', 'to', 'the', 'source', 'code', 'typically', 'the', 'repository', 'where', 'the', 'readme', 'can', 'be', 'find', 'if', 'you', 'do', 'not', 'include', 'an', 'authentication', 'token', 'you', 'can', 'still', 'use', 'somef', '']\n",
+      "\u001b[34mpreprocessing.py:   195: 2022-05-12 02:16:01Z:\u001b[0m Preprocessing: Process name: \"lemmatize nouns\".\n",
+      "['software', 'metadata', 'extraction', 'framework', 'somef', 'a', 'command', 'line', 'interface', 'for', 'automatically', 'extract', 'relevant', 'information', 'from', 'readme', 'file', 'feature', 'give', 'a', 'readme', 'file', 'or', 'a', 'github', 'gitlab', 'repository', 'somef', 'will', 'extract', 'the', 'follow', 'category', 'if', 'present', 'name', 'name', 'identify', 'a', 'software', 'component', 'full', 'name', 'name', 'owner', 'owner', 'name', 'source', 'code', 'link', 'to', 'the', 'source', 'code', 'typically', 'the', 'repository', 'where', 'the', 'readme', 'can', 'be', 'find', 'if', 'you', 'do', 'not', 'include', 'an', 'authentication', 'token', 'you', 'can', 'still', 'use', 'somef', '']\n",
+      "\u001b[34mpreprocessing.py:   195: 2022-05-12 02:16:01Z:\u001b[0m Preprocessing: Process name: \"lemmatize adjectives\".\n",
+      "['software', 'metadata', 'extraction', 'framework', 'somef', 'a', 'command', 'line', 'interface', 'for', 'automatically', 'extract', 'relevant', 'information', 'from', 'readme', 'file', 'feature', 'give', 'a', 'readme', 'file', 'or', 'a', 'github', 'gitlab', 'repository', 'somef', 'will', 'extract', 'the', 'follow', 'category', 'if', 'present', 'name', 'name', 'identify', 'a', 'software', 'component', 'full', 'name', 'name', 'owner', 'owner', 'name', 'source', 'code', 'link', 'to', 'the', 'source', 'code', 'typically', 'the', 'repository', 'where', 'the', 'readme', 'can', 'be', 'find', 'if', 'you', 'do', 'not', 'include', 'an', 'authentication', 'token', 'you', 'can', 'still', 'use', 'somef', '']\n",
+      "\u001b[34mpreprocessing.py:   195: 2022-05-12 02:16:01Z:\u001b[0m Preprocessing: Process name: \"remove stop_words\".\n",
+      "['software', 'metadata', 'extraction', 'framework', 'somef', 'command', 'line', 'interface', 'automatically', 'extract', 'relevant', 'information', 'readme', 'feature', 'give', 'readme', 'github', 'gitlab', 'repository', 'somef', 'extract', 'follow', 'category', 'present', 'name', 'name', 'identify', 'software', 'component', 'full', 'name', 'name', 'owner', 'owner', 'name', 'source', 'link', 'source', 'typically', 'repository', 'readme', 'find', 'include', 'authentication', 'token', 'still', 'somef', '']\n",
+      "\u001b[34mpreprocessing.py:   195: 2022-05-12 02:16:01Z:\u001b[0m Preprocessing: Process name: \"remove tokens only containing numbers or two char\".\n",
+      "['software', 'metadata', 'extraction', 'framework', 'somef', 'command', 'line', 'interface', 'automatically', 'extract', 'relevant', 'information', 'readme', 'feature', 'give', 'readme', 'github', 'gitlab', 'repository', 'somef', 'extract', 'follow', 'category', 'present', 'name', 'name', 'identify', 'software', 'component', 'full', 'name', 'name', 'owner', 'owner', 'name', 'source', 'link', 'source', 'typically', 'repository', 'readme', 'find', 'include', 'authentication', 'token', 'still', 'somef']\n",
+      "\u001b[34mpreprocessing.py:   195: 2022-05-12 02:16:01Z:\u001b[0m Preprocessing: Process name: \"join tokens\".\n",
+      "software metadata extraction framework somef command line interface automatically extract relevant information readme feature give readme github gitlab repository somef extract follow category present name name identify software component full name name owner owner name source link source typically repository readme find include authentication token still somef\n",
+      "\u001b[34mpreprocessing.py:   200: 2022-05-12 02:16:01Z:\u001b[0m Preprocessing: drop empty rows.\n",
+      "software metadata extraction framework somef command line interface automatically extract relevant information readme feature give readme github gitlab repository somef extract follow category present name name identify software component full name name owner owner name source link source typically repository readme find include authentication token still somef\n"
      ]
     }
    ],