I have a fairly large DataFrame (around a million rows), and the requirement is to store each row in a separate JSON file.
The DataFrame has this schema:
root
|-- uniqueID: string
|-- moreData: array
For every row, the output should be stored as:
s3://.../folder[i]/<uniqueID>.json
where i is the first letter of uniqueID.
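Deriving the folder key itself is straightforward; roughly something like this (a sketch, assuming uniqueID is a non-empty plain string and using a hypothetical column name "folder"):

from pyspark.sql import functions as F

# take the first character of uniqueID as the folder key
df_with_key = df.withColumn("folder", F.substring("uniqueID", 1, 1))

The hard part is getting each row into its own file named <uniqueID>.json under that folder.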
I have looked at other questions and solutions, but they don’t satisfy my requirements.
I am trying to do this in a time-efficient way, and from what I have read so far, repartition is not a good option.
I tried writing the df with the maxRecordsPerFile option, but I can't seem to control the naming of the files:

# one record per output file, but Spark still picks the part-* file names
(df.write
    .mode("overwrite")
    .option("maxRecordsPerFile", 1)
    .json(outputPath))
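As far as I understand, partitioning by a derived first-letter column would only control the directory layout, not the individual file names; a rough sketch of what I mean (reusing the hypothetical "folder" column from above):

# partitionBy gives per-letter directories (as folder=a/, folder=b/, ...),
# but the files inside are still named part-...-<uuid>.json by Spark
(df.withColumn("folder", F.substring("uniqueID", 1, 1))
    .write
    .mode("overwrite")
    .partitionBy("folder")
    .option("maxRecordsPerFile", 1)
    .json(outputPath))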
I am fairly new to Spark; any help is much appreciated.