This project is based on SparkNLP.
Running this project on macOS or Linux is highly recommended. Windows users should refer to the Hadoop tutorial first.
- Java: Ensure you have Java 8 or later installed. You can download it from Oracle's official website.
- Python: Ensure you have Python 3.6 or later installed. You can download it from Python's official website.
- Install Apache Spark:
  pip install pyspark
- Install SparkNLP:
  pip install spark-nlp
- Verify Installation:
  import sparknlp
  print(sparknlp.version())
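If the verification step fails, it can help to see which prerequisite is missing. The script below is a minimal sketch, not part of this project: the helper names `installed_version` and `check_environment` are made up for illustration, and it assumes Python 3.8 or later because it relies on `importlib.metadata` from the standard library. It only reports what it finds and does not start Spark.

```python
import shutil
import sys
from importlib import metadata  # in the standard library from Python 3.8


def installed_version(package):
    """Return the installed version of a pip package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_environment():
    """Collect a small report on the prerequisites and packages listed above."""
    return {
        # The README asks for Python 3.6 or later.
        "python_ok": sys.version_info >= (3, 6),
        # Only checks that a `java` executable is on PATH, not its version.
        "java_on_path": shutil.which("java") is not None,
        # Package names as used in the pip commands above.
        "pyspark": installed_version("pyspark"),
        "spark-nlp": installed_version("spark-nlp"),
    }


if __name__ == "__main__":
    for name, value in check_environment().items():
        print(f"{name}: {value}")
```

A value of `None` for `pyspark` or `spark-nlp` suggests the corresponding `pip install` step has not been run in the active Python environment.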
- Download Hadoop:
  - Download the latest stable release from the Apache Hadoop website.
- Extract the Downloaded File:
  tar -xzvf hadoop-x.y.z.tar.gz
- Set Up Environment Variables: Add the following lines to your .bashrc or .zshrc file:
  export HADOOP_HOME=/path/to/hadoop
  export PATH=$PATH:$HADOOP_HOME/bin
- Verify Installation:
  hadoop version
For more detailed instructions, refer to the Hadoop Installation Guide.
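When `hadoop version` is not found, the usual cause is that the environment variables were not picked up by the current shell. The following is a rough sketch for checking that from Python; `check_hadoop` is a hypothetical helper, not something this project or Hadoop provides, and it only inspects `HADOOP_HOME` and `PATH` as described in the steps above.

```python
import os
import shutil


def check_hadoop(env=None):
    """Report whether HADOOP_HOME and PATH look consistent with the setup steps."""
    # Accept a plain dict so the check can be tried without touching os.environ.
    env = os.environ if env is None else env
    hadoop_home = env.get("HADOOP_HOME")
    path_value = env.get("PATH", "")
    path_entries = path_value.split(os.pathsep)
    bin_dir = os.path.join(hadoop_home, "bin") if hadoop_home else None
    return {
        "hadoop_home": hadoop_home,
        # True only if HADOOP_HOME is set and points at an existing directory.
        "home_exists": bool(hadoop_home) and os.path.isdir(hadoop_home),
        # True only if $HADOOP_HOME/bin appears literally among the PATH entries.
        "bin_on_path": bin_dir is not None and bin_dir in path_entries,
        # Full path of the hadoop executable found on PATH, or None.
        "hadoop_executable": shutil.which("hadoop", path=path_value),
    }


if __name__ == "__main__":
    for name, value in check_hadoop().items():
        print(f"{name}: {value}")
```

If `hadoop_home` is `None`, the `export` lines have probably not been loaded yet; opening a new terminal or sourcing the edited `.bashrc` or `.zshrc` file is a reasonable first thing to try.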
Thanks to @rongliyuan for the generous help; this project couldn't have been done without you. Thanks also to @LagShaggy for the useful software suggestions.