I have a fairly large DataFrame (about a million rows), and the requirement is to store each row in a separate JSON file.
For this DataFrame:

```
root
 |-- uniqueID: string
 |-- moreData: array
```
The output should be stored like below for all the rows, where i is the first letter of the uniqueID.
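To make the target layout concrete, here is a plain-Python sketch (not Spark) of the directory structure being asked for. The sample rows, the base directory `out/`, and the `{uniqueID}.json` filename pattern are all hypothetical:

```python
import json
from pathlib import Path

# Hypothetical sample rows standing in for the DataFrame contents.
rows = [
    {"uniqueID": "abc123", "moreData": [1, 2, 3]},
    {"uniqueID": "axy789", "moreData": [4, 5]},
    {"uniqueID": "bcd456", "moreData": [6]},
]

base = Path("out")
for row in rows:
    # i is the first letter of the uniqueID; each row gets its own JSON file.
    i = row["uniqueID"][0]
    target_dir = base / i
    target_dir.mkdir(parents=True, exist_ok=True)
    (target_dir / f"{row['uniqueID']}.json").write_text(json.dumps(row))
```

This produces e.g. `out/a/abc123.json` and `out/b/bcd456.json`, one file per row, grouped by first letter.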
I have looked at other questions and solutions, but they don’t satisfy my requirements.
I am trying to do this in a time-optimized way, and from what I have read so far, repartition is not a good option.
I tried writing the DataFrame with the maxRecordsPerFile option, but I can't seem to control the naming of the files:

```python
df.write.mode("overwrite") \
    .option("maxRecordsPerFile", 1) \
    .json(outputPath)
```
I am fairly new to Spark; any help is much appreciated.