# Querying Large Data Volumes from ES with Spark SQL
In the thread "How to scroll through an Elasticsearch index using elasticsearch-spark?", a developer's original reply was:
ES-Hadoop uses the scroll endpoint to collect all the data for processing within Spark. ES-Hadoop performs the multiple scroll requests under the hood on its own, requesting the next scroll entry after the data in the current scroll response is fully consumed. I'm not sure I understand what you're looking for in terms of advancing the scroll request on your own. Could you elaborate on your use case?
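So with large data volumes, ES-Hadoop starts the scroll itself and pages through all of the data. There is no need to drive the scroll by hand just because a query returns a large result set.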
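The behavior described in the quote (request the next scroll page only after the current one is fully consumed) can be illustrated with a small toy model. This is a sketch for intuition only, not ES-Hadoop's actual code: `FakeIndex`, `ScrollIterator`, and `readAll` are hypothetical names, and the "index" is just an in-memory list.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Toy model of scroll-style paging; not ES-Hadoop's actual implementation. */
final class ScrollSketch {

    /** Hypothetical stand-in for an index that serves one page per request. */
    static final class FakeIndex {
        private final List<String> docs;
        int requests = 0; // number of simulated scroll round-trips so far

        FakeIndex(List<String> docs) { this.docs = docs; }

        /** Returns up to pageSize documents starting at offset; empty when exhausted. */
        List<String> nextPage(int offset, int pageSize) {
            requests++;
            int from = Math.min(offset, docs.size());
            int to = Math.min(offset + pageSize, docs.size());
            return new ArrayList<>(docs.subList(from, to));
        }
    }

    /** Fetches the next page only after the current page is fully consumed. */
    static final class ScrollIterator implements Iterator<String> {
        private final FakeIndex index;
        private final int pageSize;
        private List<String> page = new ArrayList<>();
        private int posInPage = 0;
        private int offset = 0;
        private boolean exhausted = false;

        ScrollIterator(FakeIndex index, int pageSize) {
            this.index = index;
            this.pageSize = pageSize;
        }

        @Override
        public boolean hasNext() {
            if (posInPage < page.size()) return true; // current page not drained yet
            if (exhausted) return false;
            page = index.nextPage(offset, pageSize);  // next "scroll" request
            offset += page.size();
            posInPage = 0;
            if (page.isEmpty()) exhausted = true;     // an empty page ends the scroll
            return !page.isEmpty();
        }

        @Override
        public String next() {
            if (!hasNext()) throw new NoSuchElementException();
            return page.get(posInPage++);
        }
    }

    /** Drains the whole index; the caller never handles paging itself. */
    static List<String> readAll(FakeIndex index, int pageSize) {
        List<String> out = new ArrayList<>();
        Iterator<String> it = new ScrollIterator(index, pageSize);
        while (it.hasNext()) out.add(it.next());
        return out;
    }

    public static void main(String[] args) {
        List<String> docs = new ArrayList<>();
        for (int i = 0; i < 10; i++) docs.add("doc-" + i);
        FakeIndex index = new FakeIndex(docs);
        List<String> all = readAll(index, 4);
        System.out.println(all.size() + " docs in " + index.requests + " requests");
    }
}
```

The caller only iterates; paging happens behind the iterator, which is the point the reply makes about ES-Hadoop.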
# Exceptions
When writing to Kafka failed, I searched Google for `java maven Failed to find data source: kafka`. That turned up the Structured Streaming + Kafka Integration Guide and [Why does Spark application fail with “ClassNotFoundException: Failed to find data source: kafka” as uber-jar with sbt assembly?](https://intellipaat.com/community/16747/why-does-spark-application-fail-with-classnotfoundexception-failed-to-find-data-source-kafka-as-uber-jar-with-sbt-assembly)
In the end, specifying `.format("org.apache.spark.sql.kafka010.KafkaSourceProvider")` instead of the short name `kafka` solved the problem:
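```java
// mergedAll is assumed to be a Dataset<Row>; kafkaAddresses and topicMetric are defined elsewhere.
mergedAll
        // .toDF()
        // Use uid as the Kafka record key and the whole row, serialized as JSON, as the value.
        .selectExpr("CAST(uid AS STRING) AS key", "to_json(struct(*)) AS value")
        // .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
        .write()
        // .format("kafka")  // the short name failed with "Failed to find data source: kafka"
        .format("org.apache.spark.sql.kafka010.KafkaSourceProvider")
        .option("kafka.bootstrap.servers", kafkaAddresses)
        .option("topic", topicMetric)
        .save();
```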
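The notes above do not say why the short name failed. One common cause, and the scenario in the linked question, is an uber-jar build that drops or overwrites the `META-INF/services` registration files, so the `kafka` alias is no longer registered while the provider class itself is still on the classpath. Assuming that is what happened here, the toy model below shows why the fully qualified class name still resolves. It is a simplification, not Spark's lookup code; `FormatLookupSketch` and `resolve` are hypothetical names.

```java
import java.util.Collections;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Toy model of format-name resolution; a simplification, not Spark's actual code. */
final class FormatLookupSketch {

    /**
     * @param format     the value passed to .format(...)
     * @param registered short name to provider class, standing in for the
     *                   service registrations that survived packaging
     * @param classpath  class names assumed to be loadable from the jar
     */
    static String resolve(String format, Map<String, String> registered, Set<String> classpath) {
        // First try the short name among the registered providers.
        String byShortName = registered.get(format.toLowerCase(Locale.ROOT));
        if (byShortName != null) return byShortName;
        // Otherwise treat the format string as a class name.
        if (classpath.contains(format)) return format;
        throw new IllegalStateException("Failed to find data source: " + format);
    }

    public static void main(String[] args) {
        String provider = "org.apache.spark.sql.kafka010.KafkaSourceProvider";
        Set<String> classpath = Collections.singleton(provider);
        // Assumption: the uber-jar build lost the alias registration.
        Map<String, String> lostRegistrations = Collections.emptyMap();

        System.out.println(resolve(provider, lostRegistrations, classpath));
        try {
            resolve("kafka", lostRegistrations, classpath);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

If the cause really is the packaging, merging the service files in the build (rather than hard-coding the class name) would be the alternative fix; the class-name workaround has the advantage of not depending on the build configuration.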
# References
"After reading this blog post, do you still dare to say you don't know Structured Streaming?" (看了这篇博客,你还敢说不会Structured Streaming?)
[Issue in Spark + kafka Integration](https://stackoverflow.com/questions/58877005/issue-in-spark-kafka-integration)