streaming code
This commit is contained in:
parent
eeda4a7e9b
commit
926b86d894
51
README.md
@@ -1,2 +1,53 @@
# spark
# Project Week Project

## Project Topic 10

- Project introduction: cloud deployment + domain name + service. With cloud deployment the system can be reached from anywhere with an internet connection, just like visiting Baidu: any machine room connected to the internet can access it. Compared with Project 1, the environment on the cloud server may need to be set up again, and if problems come up you may have to contact the cloud provider's support to resolve them. Cloud services also cost money, so choose this option carefully. Cloud deployment adds a lot of extra deployment work, so it is scored correspondingly higher. If applying for a domain name takes too long or is too much trouble, you can use an nginx reverse proxy and access the system by IP instead of a domain name.
### **Project Name:** Orders Information Real-time Statistical System (订单信息实时统计系统)

Programs to write: Kafka + Spark + JSP & Servlet / a reporting tool

Team format: groups of six. Each member can take on one part of the work.

You can use git to manage your team's code: the group leader creates the repository and adds the other members as collaborators; everyone clones it, opens the project in IDEA, and commits and pushes, so that every team member ends up with the latest project code and their own commit history.

Project demo time: Project Week (the exact schedule will be announced).
#### Project Assessment:

1. The system runs correctly and data can be added

2. The relevant data is displayed correctly

3. Q&A defense

4. A PPT about the project; each group's presentation must not exceed 10 minutes

5. Complete an NIIT project report

##### The report should cover:

1) Project deployment

2) Front end implemented with JSP & Servlet or a reporting tool; back end processes the real-time Kafka data with Spark

3) Features: real-time data production with Kafka, real-time data processing with Spark, real-time data display

(This report also has to be handed in at the end of the term.)
##### Detailed Project Requirements:

1. Kafka data real-time production (Kafka 数据实时生产)

This module simulates a real-time business system in which user data is produced continuously and pushed to the corresponding Kafka topic.

1) Create a topic named orders in Kafka.

2) Implement a producer in Java code that pushes the following data to the Kafka orders topic every 5 seconds (an illustrative Python sketch follows this list).

- Generate the records randomly with random numbers plus arrays; each record looks like this:

- 5 fields: order category, order name, order quantity, date, and a validity flag (Y = valid, N = invalid).

The fields can be joined with a \t separator.
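A minimal producer sketch for this module, assuming the kafka-python package and the niit-node2:9092 broker used by the other scripts in this commit. The requirement asks for a Java producer, so this Python version only illustrates the record format and the 5-second cadence; the sample category and name values are made up:

```python
import random
import time
from datetime import datetime

from kafka import KafkaProducer  # pip install kafka-python

# Broker address taken from the other scripts in this commit
producer = KafkaProducer(bootstrap_servers="niit-node2:9092")

# Made-up sample values for the random generator
order_types = ["electronics", "books", "clothing"]
order_names = ["phone", "novel", "t-shirt"]

while True:
    # 5 tab-separated fields: category, name, quantity, date, valid flag (Y/N)
    record = "\t".join([
        random.choice(order_types),
        random.choice(order_names),
        str(random.randint(1, 20)),
        datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
        random.choice(["Y", "N"]),
    ])
    producer.send("orders", record.encode("utf-8"))
    producer.flush()
    time.sleep(5)  # one record every 5 seconds, as required
```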
2. Spark real-time data processing

This module uses Spark to read the data from the corresponding Kafka topic and process it according to the requirements below.

1) Use Spark Streaming to create a consumer that reads the data from the topic.

2) Use Spark Streaming to compute, every 2 seconds, the total numbers of valid and invalid orders.

3) Use Spark Streaming to compute, every 2 seconds, the valid and invalid counts for each order ID.

4) Use Spark Streaming to compute, every 2 seconds, the count of every order category.

5) Use Spark SQL to compute the valid and invalid counts for each order (a batch sketch follows this list).

6) Use Spark Core/RDD to compute the valid and invalid counts for each category of each order.

7) The Spark Streaming results above can be pushed to a new Kafka topic and displayed in real time by the front end; the Spark Core and Spark RDD results can be pushed to MySQL and displayed with ECharts.

Hint: the front-end project (the real-time display or the ECharts display) reads MySQL directly and does not have to live in the same project as the back-end Spark + Kafka code. Ideally all of the displayed metrics are served from a single front-end project.
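A minimal batch sketch for requirement 5, assuming the tab-separated records have been dumped to a file; the path orders_dump.tsv and the column names are illustrative (the names follow the streaming scripts in this commit):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orders_sql_stats").master("local[*]").getOrCreate()

# Load the dumped records; the file path is hypothetical
orders = (spark.read.option("sep", "\t")
          .csv("orders_dump.tsv")
          .toDF("order_type", "order_name", "order_quantity", "date", "is_valid"))

orders.createOrReplaceTempView("orders")

# Valid / invalid counts per order name, expressed in SQL
stats = spark.sql("""
    SELECT order_name,
           SUM(CASE WHEN is_valid = 'Y' THEN 1 ELSE 0 END) AS valid_count,
           SUM(CASE WHEN is_valid = 'N' THEN 1 ELSE 0 END) AS invalid_count
    FROM orders
    GROUP BY order_name
""")
stats.show()
```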
3. Data visualization

Display the order information:

(1) Display in real time the total numbers of valid and invalid orders.

(2) Display in real time the valid and invalid counts for each order ID.

(3) Display in real time the count of every order category.

(4) Display in real time the valid and invalid counts for each product category.

(5) Use an ECharts pie chart to display the valid and invalid counts for each category of each order (a small pyecharts sketch follows this list).
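One way to render such a pie chart from Python is the pyecharts package; this is only a sketch with made-up numbers, and the project can equally call ECharts directly from the JSP front end with data read from MySQL:

```python
from pyecharts.charts import Pie
from pyecharts import options as opts

# Made-up counts standing in for values read from MySQL
data = [("valid", 128), ("invalid", 37)]

pie = (
    Pie()
    .add("orders", data)  # series name and (label, value) pairs
    .set_global_opts(title_opts=opts.TitleOpts(title="Valid vs invalid orders"))
    .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {c}"))
)
pie.render("orders_pie.html")  # writes a standalone HTML page with the chart
```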
BIN
chk1/.metadata.crc
Normal file
Binary file not shown.
1
chk1/metadata
Normal file
@@ -0,0 +1 @@
{"id":"eb35e4a4-21c1-4321-b4e0-3f6cdea19609"}
28535
get-pip.py
Normal file
File diff suppressed because it is too large
89
kafka_streaming.py
Normal file
@@ -0,0 +1,89 @@
import os

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

# # os.environ['JAVA_HOME'] = 'C:\\Program Files\\Java\\jdk1.8.0_351'
# os.environ['HADOOP_HOME'] = 'D:\\CodeDevelopment\\DevelopmentEnvironment\\hadoop-2.8.1'

if __name__ == '__main__':
    # 1- Create the SparkSession
    spark = SparkSession.builder \
        .config("spark.sql.shuffle.partitions", 1) \
        .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1") \
        .appName('kafka_stream') \
        .master('local[*]') \
        .getOrCreate()

    # 2- Read the Kafka data stream
    kafka_stream = spark.readStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", "niit-node2:9092") \
        .option("subscribePattern", "orders") \
        .load()

    # 3- Parse the tab-separated value into named columns
    parsed_stream = kafka_stream.selectExpr("cast(value as string) as value", "timestamp") \
        .withColumn("order_id", F.split(F.col("value"), "\t")[0]) \
        .withColumn("order_type", F.split(F.col("value"), "\t")[1]) \
        .withColumn("order_name", F.split(F.col("value"), "\t")[2]) \
        .withColumn("order_quantity", F.split(F.col("value"), "\t")[3]) \
        .withColumn("status", F.split(F.col("value"), "\t")[4]) \
        .drop("value")

    # 4- Group and aggregate over a time window
    # 4.1 Total numbers of valid and invalid orders
    orders_summary = parsed_stream.groupBy(
        F.window("timestamp", "10 seconds").alias("time_window"),  # 10-second window
        "status"
    ).count() \
        .drop("time_window")  # drop the window column before publishing

    # 4.2 Valid and invalid counts per order id
    eachOrders_summary = parsed_stream.groupBy(
        F.window("timestamp", "10 seconds").alias("time_window"),
        "order_id", "status"
    ).count() \
        .drop("time_window")  # drop the window column before publishing

    # 4.3 Counts per order name
    order_name_count = parsed_stream.groupBy(
        F.window("timestamp", "10 seconds").alias("time_window"),
        "order_name"
    ).count() \
        .withColumnRenamed("count", "order_name_count") \
        .drop("time_window")  # drop the window column before publishing

    # 5- Write each aggregation back to Kafka
    def write_to_kafka(batch_df, batch_id, topic_name):
        batch_df.selectExpr(
            "cast(null as string) as key",   # Kafka message key (unused)
            "to_json(struct(*)) as value"    # serialise every column as JSON
        ).write \
            .format("kafka") \
            .option("kafka.bootstrap.servers", "niit-node2:9092") \
            .option("topic", topic_name) \
            .save()

    # Publish each result to its own Kafka topic
    orders_summary.writeStream \
        .foreachBatch(lambda df, id: write_to_kafka(df, id, "orders_summary")) \
        .outputMode("complete") \
        .trigger(processingTime="10 seconds") \
        .start()

    eachOrders_summary.writeStream \
        .foreachBatch(lambda df, id: write_to_kafka(df, id, "eachOrders_summary")) \
        .outputMode("update") \
        .trigger(processingTime="10 seconds") \
        .start()

    order_name_count.writeStream \
        .foreachBatch(lambda df, id: write_to_kafka(df, id, "order_name_count")) \
        .outputMode("update") \
        .trigger(processingTime="10 seconds") \
        .start()

    # Wait for the streaming queries to terminate
    spark.streams.awaitAnyTermination()
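Note: kafka_streaming.py resolves the Kafka connector through spark.jars.packages, so the first run needs network access to download it from Maven; an equivalent way to run it (an assumption, not something this commit documents) is spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1 kafka_streaming.py, which resolves the same dependency at submit time.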
@@ -1,36 +1,44 @@
(modified file: the previous StructuredNetworkWordCount socket word-count example was replaced with the Kafka batch-read script below)

from pyspark.sql import SparkSession
import pyspark.sql.functions as F
import os

os.environ['JAVA_HOME'] = r'D:\CodeDevelopment\DevelopmentEnvironment\Java\jdk1.8.0_351'
# Hadoop path (the directory unpacked earlier)
os.environ['HADOOP_HOME'] = r'D:\CodeDevelopment\DevelopmentEnvironment\hadoop-2.8.1'

if __name__ == '__main__':
    # 1- Create the SparkSession object
    spark = SparkSession.builder\
        .config("spark.sql.shuffle.partitions", 1)\
        .appName('sparksql_read_kafka_1_topic')\
        .master('local[*]')\
        .getOrCreate()

    # 2- Input: read one batch from Kafka
    # By default this consumes the topic from the beginning to the end
    init_df = spark.read\
        .format("kafka")\
        .option("kafka.bootstrap.servers", "zhao:9092")\
        .option("subscribe", "test")\
        .load()

    # 3- Processing: three equivalent ways to cast the Kafka value column to a string
    result_df1 = init_df.select(F.expr("cast(value as string) as value"))

    # selectExpr = select + F.expr
    result_df2 = init_df.selectExpr("cast(value as string) as value")

    result_df3 = init_df.withColumn("value", F.expr("cast(value as string)"))

    # 4- Output
    print("result_df1")
    result_df1.show()

    print("result_df2")
    result_df2.show()

    print("result_df3")
    result_df3.show()

    # 5- Release resources
    spark.stop()
26
testconnect.py
Normal file
@@ -0,0 +1,26 @@
from pyspark.sql import SparkSession
import os

os.environ['JAVA_HOME'] = r'D:\CodeDevelopment\DevelopmentEnvironment\Java\jdk-17.0.5'
os.environ['HADOOP_HOME'] = r'D:\CodeDevelopment\DevelopmentEnvironment\hadoop-2.8.1'

# Create the SparkSession
# (the Kafka connector and delta-core packages are combined into a single
# spark.jars.packages value, because setting the option twice keeps only the last value)
spark = SparkSession \
    .builder \
    .appName("Kafka Example") \
    .master("local[*]") \
    .config("spark.jars.packages",
            "org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2,io.delta:delta-core_2.12:2.4.0") \
    .config("spark.executorEnv.PATH", r"D:\CodeDevelopment\DevelopmentEnvironment\Java\jdk-17.0.5") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

# Read the Kafka data as a batch so that show() below is valid
# (show() cannot be called on a streaming DataFrame)
df = spark.read.format("kafka").option("kafka.bootstrap.servers", "niit-node2:9092").option("subscribe", "orders").load()
df = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

# Display the data
df.show()
65
testkafka.py
Normal file
@@ -0,0 +1,65 @@
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, split, window, lit, to_timestamp
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

if __name__ == '__main__':
    spark = SparkSession.builder.appName("StreamingApp")\
        .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1")\
        .getOrCreate()

    # Kafka connection settings
    kafka_broker = "niit-node2:9092"  # Kafka broker address
    kafka_topic = "orders"            # Kafka topic name

    # Read the stream from Kafka
    kafka_stream_df = spark.readStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", kafka_broker) \
        .option("subscribe", kafka_topic) \
        .load()

    # The Kafka value field arrives as bytes; cast it to a string first
    orders_df = kafka_stream_df.selectExpr("CAST(value AS STRING)")

    # Schema of the tab-separated record (documents the layout; parsing is done with split below)
    schema = StructType([
        StructField("order_id", StringType(), True),
        StructField("order_type", StringType(), True),
        StructField("order_name", StringType(), True),
        StructField("order_quantity", IntegerType(), True),
        StructField("date", StringType(), True),
        StructField("is_valid", StringType(), True)  # Y = valid, N = invalid
    ])

    # Parse the record by splitting on \t
    orders_parsed_df = orders_df.select(
        split(orders_df['value'], '\t').alias('cols')
    ).select(
        col('cols')[0].alias('order_id'),
        col('cols')[1].alias('order_type'),
        col('cols')[2].alias('order_name'),
        col('cols')[3].cast(IntegerType()).alias('order_quantity'),
        col('cols')[4].alias('date'),
        col('cols')[5].alias('is_valid')
    )

    # Split into valid and invalid orders
    valid_orders_df = orders_parsed_df.filter(orders_parsed_df['is_valid'] == 'Y')
    invalid_orders_df = orders_parsed_df.filter(orders_parsed_df['is_valid'] == 'N')

    # Count valid and invalid orders in 2-second windows.
    # window() needs a timestamp column, so the date string is cast first
    # (the default 'yyyy-MM-dd HH:mm:ss' format is an assumption about the producer).
    valid_order_count_df = valid_orders_df \
        .groupBy(window(to_timestamp(valid_orders_df.date), "2 seconds")).count() \
        .withColumn("status", lit("Y"))
    invalid_order_count_df = invalid_orders_df \
        .groupBy(window(to_timestamp(invalid_orders_df.date), "2 seconds")).count() \
        .withColumn("status", lit("N"))

    # Combine the valid and invalid counts (the status column tells them apart)
    order_count_df = valid_order_count_df.union(invalid_order_count_df)

    # Write the result to the console
    query = order_count_df.writeStream \
        .outputMode("update") \
        .format("console") \
        .option("truncate", "false") \
        .start()

    # Wait for termination
    query.awaitTermination()
14
testnc.py
Normal file
@@ -0,0 +1,14 @@
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("StreamingApp").getOrCreate()  # create the SparkSession

# Read a text stream from a socket source
# (a server such as `nc -lk 9999` must be listening on niit-node2 first)
df = spark.readStream.format("socket").option("host", "niit-node2").option("port", "9999").load()

# Split each line on \t, explode into one word per row, and count the occurrences
df = df.selectExpr("explode(split(value, '\t')) as word") \
    .groupBy("word") \
    .count()

df.writeStream.outputMode("complete").format("console").start().awaitTermination()