|
31 | 31 | "output_type": "stream", |
32 | 32 | "text": [ |
33 | 33 | " Text\n", |
34 | | - "0 # Title: This isn't a real README ## Subtitle ...\n" |
| 34 | + "0 # Software Metadata Extraction Framework (SOME...\n", |
| 35 | + "# Software Metadata Extraction Framework (SOMEF) https://pypi.org/project/somef/ ``` cd somef pip install -e . ``` <img src=\"docs/logo.png\" alt=\"logo\" width=\"150\"/> A command line interface for automatically extracting relevant information from readme files. ## Features Given a readme file (or a GitHub/Gitlab repository) SOMEF will extract the following categories (if present): - **Name**: Name identifying a software component - **Full name**: Name + owner (owner/name) - **Source code**: Link to the source code (typically the repository where the readme can be found) If you don't include an authentication token, you can still use SOMEF. ディープラーニングに関する論文の実装を先行研究から順に進める\n" |
35 | 36 | ] |
36 | 37 | } |
37 | 38 | ], |
38 | 39 | "source": [ |
39 | | - "t = \"\"\"# Title: This isn't a real README ## Subtitle nr 1 ``` pip3 install pandas ``` https://paperswithcode.com/ https://paperswithcode.com/ <!doctype html> <html lang=\"en\"> <head> <title>Some readme example from 2022</title> </head> <body> <p>Body comes here</p> </body> </html> ## Subtitle nr 2 Test lemmatizer: hidden, walking, ran, found Some non ASCII character: ディープラーニングに関する論文の実装を先行研究から順に進める\"\"\"\n", |
| 40 | + "t = \"\"\"# Software Metadata Extraction Framework (SOMEF) https://pypi.org/project/somef/ ``` cd somef pip install -e . ``` <img src=\"docs/logo.png\" alt=\"logo\" width=\"150\"/> A command line interface for automatically extracting relevant information from readme files. ## Features Given a readme file (or a GitHub/Gitlab repository) SOMEF will extract the following categories (if present): - **Name**: Name identifying a software component - **Full name**: Name + owner (owner/name) - **Source code**: Link to the source code (typically the repository where the readme can be found) If you don't include an authentication token, you can still use SOMEF. ディープラーニングに関する論文の実装を先行研究から順に進める\"\"\"\n", |
40 | 41 | "text = pd.DataFrame([t], columns=[TEXT])\n", |
41 | | - "print(text)" |
| 42 | + "print(text)\n", |
| 43 | + "print(text[TEXT][0])" |
42 | 44 | ] |
43 | 45 | }, |
44 | 46 | { |
|
50 | 52 | "name": "stdout", |
51 | 53 | "output_type": "stream", |
52 | 54 | "text": [ |
53 | | - "\u001b[34mpreprocessing.py: 194: 2022-04-25 01:11:33Z:\u001b[0m Preprocessing: Process name: \"remove codeblocks\".\n", |
54 | | - "\u001b[34mpreprocessing.py: 194: 2022-04-25 01:11:33Z:\u001b[0m Preprocessing: Process name: \"remove links\".\n", |
55 | | - "\u001b[34mpreprocessing.py: 194: 2022-04-25 01:11:33Z:\u001b[0m Preprocessing: Process name: \"remove tags\".\n", |
56 | | - "\u001b[34mpreprocessing.py: 194: 2022-04-25 01:11:33Z:\u001b[0m Preprocessing: Process name: \"remove punctuations\".\n", |
57 | | - "\u001b[34mpreprocessing.py: 194: 2022-04-25 01:11:33Z:\u001b[0m Preprocessing: Process name: \"transform to lowercase\".\n", |
58 | | - "\u001b[34mpreprocessing.py: 194: 2022-04-25 01:11:33Z:\u001b[0m Preprocessing: Process name: \"remove non-ascii characters\".\n", |
59 | | - "\u001b[34mpreprocessing.py: 194: 2022-04-25 01:11:33Z:\u001b[0m Preprocessing: Process name: \"lemmatize verbs\".\n", |
60 | | - "\u001b[34mpreprocessing.py: 194: 2022-04-25 01:11:34Z:\u001b[0m Preprocessing: Process name: \"lemmatize nouns\".\n", |
61 | | - "\u001b[34mpreprocessing.py: 194: 2022-04-25 01:11:34Z:\u001b[0m Preprocessing: Process name: \"lemmatize adjectives\".\n", |
62 | | - "\u001b[34mpreprocessing.py: 194: 2022-04-25 01:11:34Z:\u001b[0m Preprocessing: Process name: \"remove stop_words\".\n", |
63 | | - "\u001b[34mpreprocessing.py: 194: 2022-04-25 01:11:34Z:\u001b[0m Preprocessing: Process name: \"remove tokens only containing numbers or two char\".\n", |
64 | | - "\u001b[34mpreprocessing.py: 194: 2022-04-25 01:11:34Z:\u001b[0m Preprocessing: Process name: \"join tokens\".\n", |
65 | | - "\u001b[34mpreprocessing.py: 198: 2022-04-25 01:11:34Z:\u001b[0m Preprocessing: drop empty rows.\n", |
66 | | - "title real readme subtitle readme body come subtitle test lemmatizer hide walk find non ascii character\n" |
| 55 | + "\u001b[34mpreprocessing.py: 195: 2022-05-12 02:16:00Z:\u001b[0m Preprocessing: Process name: \"remove codeblocks\".\n", |
| 56 | + "# Software Metadata Extraction Framework (SOMEF) https://pypi.org/project/somef/ <img src=\"docs/logo.png\" alt=\"logo\" width=\"150\"/> A command line interface for automatically extracting relevant information from readme files. ## Features Given a readme file (or a GitHub/Gitlab repository) SOMEF will extract the following categories (if present): - **Name**: Name identifying a software component - **Full name**: Name + owner (owner/name) - **Source code**: Link to the source code (typically the repository where the readme can be found) If you don't include an authentication token, you can still use SOMEF. ディープラーニングに関する論文の実装を先行研究から順に進める\n", |
| 57 | + "\u001b[34mpreprocessing.py: 195: 2022-05-12 02:16:00Z:\u001b[0m Preprocessing: Process name: \"remove links\".\n", |
| 58 | + "# Software Metadata Extraction Framework (SOMEF) <img src=\"docs/logo.png\" alt=\"logo\" width=\"150\"/> A command line interface for automatically extracting relevant information from readme files. ## Features Given a readme file (or a GitHub/Gitlab repository) SOMEF will extract the following categories (if present): - **Name**: Name identifying a software component - **Full name**: Name + owner (owner/name) - **Source code**: Link to the source code (typically the repository where the readme can be found) If you don't include an authentication token, you can still use SOMEF. ディープラーニングに関する論文の実装を先行研究から順に進める\n", |
| 59 | + "\u001b[34mpreprocessing.py: 195: 2022-05-12 02:16:00Z:\u001b[0m Preprocessing: Process name: \"remove tags\".\n", |
| 60 | + "# Software Metadata Extraction Framework (SOMEF) A command line interface for automatically extracting relevant information from readme files. ## Features Given a readme file (or a GitHub/Gitlab repository) SOMEF will extract the following categories (if present): - **Name**: Name identifying a software component - **Full name**: Name + owner (owner/name) - **Source code**: Link to the source code (typically the repository where the readme can be found) If you don't include an authentication token, you can still use SOMEF. ディープラーニングに関する論文の実装を先行研究から順に進める\n", |
| 61 | + "# Software Metadata Extraction Framework (SOMEF) A command line interface for automatically extracting relevant information from readme files. ## Features Given a readme file (or a GitHub/Gitlab repository) SOMEF will extract the following categories (if present): - **Name**: Name identifying a software component - **Full name**: Name + owner (owner/name) - **Source code**: Link to the source code (typically the repository where the readme can be found) If you do not include an authentication token, you can still use SOMEF. ディープラーニングに関する論文の実装を先行研究から順に進める\n", |
| 62 | + "\u001b[34mpreprocessing.py: 195: 2022-05-12 02:16:00Z:\u001b[0m Preprocessing: Process name: \"remove punctuations\".\n", |
| 63 | + " Software Metadata Extraction Framework SOMEF A command line interface for automatically extracting relevant information from readme files Features Given a readme file or a GitHub Gitlab repository SOMEF will extract the following categories if present Name Name identifying a software component Full name Name owner owner name Source code Link to the source code typically the repository where the readme can be found If you do not include an authentication token you can still use SOMEF ディープラーニングに関する論文の実装を先行研究から順に進める\n", |
| 64 | + "\u001b[34mpreprocessing.py: 195: 2022-05-12 02:16:00Z:\u001b[0m Preprocessing: Process name: \"transform to lowercase\".\n", |
| 65 | + " software metadata extraction framework somef a command line interface for automatically extracting relevant information from readme files features given a readme file or a github gitlab repository somef will extract the following categories if present name name identifying a software component full name name owner owner name source code link to the source code typically the repository where the readme can be found if you do not include an authentication token you can still use somef ディープラーニングに関する論文の実装を先行研究から順に進める\n", |
| 66 | + "\u001b[34mpreprocessing.py: 195: 2022-05-12 02:16:00Z:\u001b[0m Preprocessing: Process name: \"remove non-ascii characters\".\n", |
| 67 | + "['software', 'metadata', 'extraction', 'framework', 'somef', 'a', 'command', 'line', 'interface', 'for', 'automatically', 'extracting', 'relevant', 'information', 'from', 'readme', 'files', 'features', 'given', 'a', 'readme', 'file', 'or', 'a', 'github', 'gitlab', 'repository', 'somef', 'will', 'extract', 'the', 'following', 'categories', 'if', 'present', 'name', 'name', 'identifying', 'a', 'software', 'component', 'full', 'name', 'name', 'owner', 'owner', 'name', 'source', 'code', 'link', 'to', 'the', 'source', 'code', 'typically', 'the', 'repository', 'where', 'the', 'readme', 'can', 'be', 'found', 'if', 'you', 'do', 'not', 'include', 'an', 'authentication', 'token', 'you', 'can', 'still', 'use', 'somef', '']\n", |
| 68 | + "\u001b[34mpreprocessing.py: 195: 2022-05-12 02:16:00Z:\u001b[0m Preprocessing: Process name: \"lemmatize verbs\".\n", |
| 69 | + "['software', 'metadata', 'extraction', 'framework', 'somef', 'a', 'command', 'line', 'interface', 'for', 'automatically', 'extract', 'relevant', 'information', 'from', 'readme', 'file', 'feature', 'give', 'a', 'readme', 'file', 'or', 'a', 'github', 'gitlab', 'repository', 'somef', 'will', 'extract', 'the', 'follow', 'categories', 'if', 'present', 'name', 'name', 'identify', 'a', 'software', 'component', 'full', 'name', 'name', 'owner', 'owner', 'name', 'source', 'code', 'link', 'to', 'the', 'source', 'code', 'typically', 'the', 'repository', 'where', 'the', 'readme', 'can', 'be', 'find', 'if', 'you', 'do', 'not', 'include', 'an', 'authentication', 'token', 'you', 'can', 'still', 'use', 'somef', '']\n", |
| 70 | + "\u001b[34mpreprocessing.py: 195: 2022-05-12 02:16:01Z:\u001b[0m Preprocessing: Process name: \"lemmatize nouns\".\n", |
| 71 | + "['software', 'metadata', 'extraction', 'framework', 'somef', 'a', 'command', 'line', 'interface', 'for', 'automatically', 'extract', 'relevant', 'information', 'from', 'readme', 'file', 'feature', 'give', 'a', 'readme', 'file', 'or', 'a', 'github', 'gitlab', 'repository', 'somef', 'will', 'extract', 'the', 'follow', 'category', 'if', 'present', 'name', 'name', 'identify', 'a', 'software', 'component', 'full', 'name', 'name', 'owner', 'owner', 'name', 'source', 'code', 'link', 'to', 'the', 'source', 'code', 'typically', 'the', 'repository', 'where', 'the', 'readme', 'can', 'be', 'find', 'if', 'you', 'do', 'not', 'include', 'an', 'authentication', 'token', 'you', 'can', 'still', 'use', 'somef', '']\n", |
| 72 | + "\u001b[34mpreprocessing.py: 195: 2022-05-12 02:16:01Z:\u001b[0m Preprocessing: Process name: \"lemmatize adjectives\".\n", |
| 73 | + "['software', 'metadata', 'extraction', 'framework', 'somef', 'a', 'command', 'line', 'interface', 'for', 'automatically', 'extract', 'relevant', 'information', 'from', 'readme', 'file', 'feature', 'give', 'a', 'readme', 'file', 'or', 'a', 'github', 'gitlab', 'repository', 'somef', 'will', 'extract', 'the', 'follow', 'category', 'if', 'present', 'name', 'name', 'identify', 'a', 'software', 'component', 'full', 'name', 'name', 'owner', 'owner', 'name', 'source', 'code', 'link', 'to', 'the', 'source', 'code', 'typically', 'the', 'repository', 'where', 'the', 'readme', 'can', 'be', 'find', 'if', 'you', 'do', 'not', 'include', 'an', 'authentication', 'token', 'you', 'can', 'still', 'use', 'somef', '']\n", |
| 74 | + "\u001b[34mpreprocessing.py: 195: 2022-05-12 02:16:01Z:\u001b[0m Preprocessing: Process name: \"remove stop_words\".\n", |
| 75 | + "['software', 'metadata', 'extraction', 'framework', 'somef', 'command', 'line', 'interface', 'automatically', 'extract', 'relevant', 'information', 'readme', 'feature', 'give', 'readme', 'github', 'gitlab', 'repository', 'somef', 'extract', 'follow', 'category', 'present', 'name', 'name', 'identify', 'software', 'component', 'full', 'name', 'name', 'owner', 'owner', 'name', 'source', 'link', 'source', 'typically', 'repository', 'readme', 'find', 'include', 'authentication', 'token', 'still', 'somef', '']\n", |
| 76 | + "\u001b[34mpreprocessing.py: 195: 2022-05-12 02:16:01Z:\u001b[0m Preprocessing: Process name: \"remove tokens only containing numbers or two char\".\n", |
| 77 | + "['software', 'metadata', 'extraction', 'framework', 'somef', 'command', 'line', 'interface', 'automatically', 'extract', 'relevant', 'information', 'readme', 'feature', 'give', 'readme', 'github', 'gitlab', 'repository', 'somef', 'extract', 'follow', 'category', 'present', 'name', 'name', 'identify', 'software', 'component', 'full', 'name', 'name', 'owner', 'owner', 'name', 'source', 'link', 'source', 'typically', 'repository', 'readme', 'find', 'include', 'authentication', 'token', 'still', 'somef']\n", |
| 78 | + "\u001b[34mpreprocessing.py: 195: 2022-05-12 02:16:01Z:\u001b[0m Preprocessing: Process name: \"join tokens\".\n", |
| 79 | + "software metadata extraction framework somef command line interface automatically extract relevant information readme feature give readme github gitlab repository somef extract follow category present name name identify software component full name name owner owner name source link source typically repository readme find include authentication token still somef\n", |
| 80 | + "\u001b[34mpreprocessing.py: 200: 2022-05-12 02:16:01Z:\u001b[0m Preprocessing: drop empty rows.\n", |
| 81 | + "software metadata extraction framework somef command line interface automatically extract relevant information readme feature give readme github gitlab repository somef extract follow category present name name identify software component full name name owner owner name source link source typically repository readme find include authentication token still somef\n" |
67 | 82 | ] |
68 | 83 | } |
69 | 84 | ], |
|
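The log in the updated cell output prints the text after each preprocessing step. As a rough illustration of what the regex-level steps appear to do, here is a minimal standard-library sketch. It is not the code in `preprocessing.py`: every function name below is hypothetical and the regular expressions are assumptions. The lemmatization and stop-word steps are left out because they depend on an external NLP library, and the contraction expansion visible in the log ("don't" becoming "do not") is also omitted.

```python
import re
import string

# Hypothetical helpers; names and regexes are assumptions, not the project's API.

def remove_codeblocks(text: str) -> str:
    # Drop everything between a pair of triple backticks.
    return re.sub(r"```.*?```", " ", text, flags=re.DOTALL)

def remove_links(text: str) -> str:
    # Drop http/https URLs up to the next whitespace.
    return re.sub(r"https?://\S+", " ", text)

def remove_tags(text: str) -> str:
    # Drop HTML-like tags such as <img ... /> or </p>.
    return re.sub(r"<[^>]+>", " ", text)

def remove_punctuation(text: str) -> str:
    # Replace each ASCII punctuation character with a space.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.translate(table)

def remove_non_ascii(text: str) -> str:
    # Silently discard characters that cannot be encoded as ASCII.
    return text.encode("ascii", errors="ignore").decode("ascii")

def drop_short_or_numeric(tokens: list[str]) -> list[str]:
    # Keep tokens longer than two characters that are not purely digits.
    return [t for t in tokens if len(t) > 2 and not t.isdigit()]

def preprocess(text: str) -> str:
    # Apply the string-level steps in the order the log lists them.
    for step in (remove_codeblocks, remove_links, remove_tags,
                 remove_punctuation, str.lower, remove_non_ascii):
        text = step(text)
    tokens = drop_short_or_numeric(text.split())
    return " ".join(tokens)

sample = "# Demo https://example.org/x ``` pip install demo ``` <b>Hello</b> World 42 ディープ"
result = preprocess(sample)
print(result)
```

Running the steps in this order matters: code blocks and links are removed before punctuation, since stripping punctuation first would destroy the backticks and `://` markers those patterns rely on.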