PySpark is the Python API for Apache Spark, an open-source distributed computing system designed for large-scale data processing. Spark provides a fast, general-purpose cluster-computing framework, making it well suited to big data processing tasks.
To use PySpark, install it with pip:
pip install pyspark
PySpark is widely used in data engineering, data analysis, and machine learning applications, offering a powerful and flexible framework for big data processing with Python.
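Once installed, a quick local job confirms that everything works. This is a minimal sketch; the app name and sample data are arbitrary placeholders:
from pyspark.sql import SparkSession

# Start a local Spark session using all cores on this machine
spark = SparkSession.builder.master("local[*]").appName("sanity-check").getOrCreate()

# Build a tiny DataFrame and confirm Spark can process it
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])
df.show()

spark.stop()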
To run PySpark on AWS, first connect to your EC2 instance over SSH:
ssh -i your-key-pair.pem ec2-user@your-ec2-instance-ip
Install Java, which Spark requires:
sudo yum install java-1.8.0-openjdk-devel
Download Spark, extract it, and move it into place:
wget https://downloads.apache.org/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz
tar -xvf spark-3.2.1-bin-hadoop3.2.tgz
sudo mv spark-3.2.1-bin-hadoop3.2 /usr/local/spark
Edit your .bashrc or .bash_profile file to include Spark in the environment variables:
export SPARK_HOME=/usr/local/spark
export PATH=$SPARK_HOME/bin:$PATH
Note: Remember to source the file after editing.
source ~/.bashrc
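You can then verify that Spark is on your PATH (the reported version will match the release you installed):
spark-submit --version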
Next, create a Spark environment file from the bundled template:
cd /usr/local/spark/conf
cp spark-env.sh.template spark-env.sh
Edit spark-env.sh to include your AWS configuration:
echo 'export PYSPARK_PYTHON=python3' >> spark-env.sh
echo 'export AWS_ACCESS_KEY_ID=your-access-key' >> spark-env.sh
echo 'export AWS_SECRET_ACCESS_KEY=your-secret-key' >> spark-env.sh
Replace your-access-key and your-secret-key with your AWS access and secret keys.

For Spark to read from S3, the hadoop-aws connector must also be on its classpath. Download the jar and move it into Spark's jars directory (depending on your setup, the matching aws-java-sdk-bundle jar may be needed alongside it):
wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.1/hadoop-aws-3.3.1.jar
sudo mv hadoop-aws-3.3.1.jar /usr/local/spark/jars/
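Alternatively, if you'd rather not keep credentials in spark-env.sh, the same S3A settings can be passed when building a Spark session in Python. A sketch, with placeholder key values:
from pyspark.sql import SparkSession

# Pass S3A credentials through Spark's Hadoop configuration
# (placeholders: substitute your own keys)
spark = (
    SparkSession.builder
    .appName("s3-example")
    .config("spark.hadoop.fs.s3a.access.key", "your-access-key")
    .config("spark.hadoop.fs.s3a.secret.key", "your-secret-key")
    .getOrCreate()
)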
Now start the PySpark shell:
pyspark
# In the PySpark shell
rdd = sc.parallelize([1, 2, 3, 4, 5])
rdd.map(lambda x: x * 2).collect()  # returns [2, 4, 6, 8, 10]
To read a file from S3 as a DataFrame:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("example").getOrCreate()
df = spark.read.csv("s3a://your-s3-bucket/your-file.csv")
df.show()
Replace your-s3-bucket and your-file.csv with your AWS S3 bucket and file path.
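Building on that, here is a short, illustrative sketch of a complete round trip: reading a CSV with a header row, filtering it, and writing the result back to S3 as Parquet. The bucket, paths, and the 'amount' column are placeholders for your own data:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-etl").getOrCreate()

# Read the CSV, treating the first row as column names and inferring types
df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("s3a://your-s3-bucket/your-file.csv")
)

# Example transformation: keep rows where a hypothetical 'amount' column is positive
filtered = df.filter(df["amount"] > 0)

# Write the result back to S3 in Parquet format
filtered.write.mode("overwrite").parquet("s3a://your-s3-bucket/output/")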
These are the basic steps; you may need additional configuration for your specific use case or for other AWS services. Ensure that the IAM role associated with the EC2 instance has the permissions needed to access the services you use.
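If the instance has such a role, you can point the S3A connector at the instance profile rather than embedding keys at all. A sketch, assuming the matching aws-java-sdk-bundle jar is also on the classpath:
from pyspark.sql import SparkSession

# Use the EC2 instance profile (IAM role) for S3 credentials
spark = (
    SparkSession.builder
    .appName("iam-role-example")
    .config(
        "spark.hadoop.fs.s3a.aws.credentials.provider",
        "com.amazonaws.auth.InstanceProfileCredentialsProvider",
    )
    .getOrCreate()
)

df = spark.read.csv("s3a://your-s3-bucket/your-file.csv")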