Comma-Separated Values - CSV
Comma-Separated Values (CSV) is an interchange format for tabular text data. Most spreadsheet applications support it, and it can also serve as a database extraction format.
Despite the name, the values are often separated by a semicolon (;).
Although implementations interpret the format differently, a formal specification exists in RFC 4180.
The format uses three different characters to structure the data:
- Field Delimiter - separates the columns from each other (e.g. , or ;)
- Quote - marks columns that may contain other structuring characters, such as Field Delimiters or line breaks (e.g. ")
- Escape Character - used to escape Field Delimiters in columns (e.g. \)
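How the three structuring characters interact can be sketched in plain Java. This is an illustration only, not the connector's parser: the class `CsvFieldSketch`, the method `splitLine` and the exact rules (an escape character makes the next character literal, a doubled quote inside a quoted column stands for one quote) are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Pekko Connectors.
class CsvFieldSketch {

    // Splits one line (without its line end) into columns.
    static List<String> splitLine(String line, char delimiter, char quote, char escape) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == escape && i + 1 < line.length()) {
                // Escape Character: take the next character literally.
                current.append(line.charAt(++i));
            } else if (c == quote) {
                if (inQuotes && i + 1 < line.length() && line.charAt(i + 1) == quote) {
                    // Assumed rule: a doubled quote inside a quoted column is one quote.
                    current.append(quote);
                    i++;
                } else {
                    // Quote: enter or leave a quoted column.
                    inQuotes = !inQuotes;
                }
            } else if (c == delimiter && !inQuotes) {
                // Field Delimiter outside quotes ends the current column.
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());
        return fields;
    }

    public static void main(String[] args) {
        // prints [eins, zwei;drei, vier;fuenf]
        System.out.println(splitLine("eins;\"zwei;drei\";vier\\;fuenf", ';', '"', '\\'));
    }
}
```

The second column keeps its semicolon because it is quoted, and the third keeps its semicolon because it is escaped.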
Lines are separated by either Line Feed (\n = ASCII 10) or Carriage Return and Line Feed (\r = ASCII 13 + \n = ASCII 10).
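A minimal plain-Java sketch of framing on both kinds of line end follows. It is an assumption-laden illustration, not the connector's code: `LineEndingSketch` and `splitLines` are hypothetical names, and the sketch deliberately ignores line breaks inside quoted columns.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Pekko Connectors.
class LineEndingSketch {

    // Splits raw bytes into lines on LF (10), dropping a preceding CR (13).
    // Simplification: quoted line breaks are not handled here.
    static List<String> splitLines(byte[] data) {
        List<String> lines = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < data.length; i++) {
            if (data[i] == 10) {
                // For CR + LF, the line content ends one byte earlier.
                int end = (i > start && data[i - 1] == 13) ? i - 1 : i;
                lines.add(new String(data, start, end - start, StandardCharsets.US_ASCII));
                start = i + 1;
            }
        }
        if (start < data.length) {
            // Trailing data without a final line end.
            lines.add(new String(data, start, data.length - start, StandardCharsets.US_ASCII));
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] input = "a,b\r\nc,d\ne,f".getBytes(StandardCharsets.US_ASCII);
        // prints [a,b, c,d, e,f]
        System.out.println(splitLines(input));
    }
}
```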
Project Info: Apache Pekko Connectors CSV | |
---|---|
Artifact | org.apache.pekko : pekko-connectors-csv : 1.1.0 |
JDK versions | OpenJDK 8, OpenJDK 11, OpenJDK 17, OpenJDK 21 |
Scala versions | 2.13.15, 2.12.20, 3.3.4 |
JPMS module name | pekko.stream.connectors.csv |
License | |
API documentation | |
Forums | |
Release notes | GitHub releases |
Issues | GitHub issues |
Sources | https://github.com/apache/pekko-connectors |
Artifacts
val PekkoVersion = "1.1.3"
libraryDependencies ++= Seq(
  "org.apache.pekko" %% "pekko-connectors-csv" % "1.1.0",
  "org.apache.pekko" %% "pekko-stream" % PekkoVersion
)
<properties>
  <pekko.version>1.1.3</pekko.version>
  <scala.binary.version>2.13</scala.binary.version>
</properties>
<dependencies>
  <dependency>
    <groupId>org.apache.pekko</groupId>
    <artifactId>pekko-connectors-csv_${scala.binary.version}</artifactId>
    <version>1.1.0</version>
  </dependency>
  <dependency>
    <groupId>org.apache.pekko</groupId>
    <artifactId>pekko-stream_${scala.binary.version}</artifactId>
    <version>${pekko.version}</version>
  </dependency>
</dependencies>
def versions = [
  PekkoVersion: "1.1.3",
  ScalaBinary: "2.13"
]
dependencies {
  implementation "org.apache.pekko:pekko-connectors-csv_${versions.ScalaBinary}:1.1.0"
  implementation "org.apache.pekko:pekko-stream_${versions.ScalaBinary}:${versions.PekkoVersion}"
}
The table below shows direct dependencies of this module and the second tab shows all libraries it depends on transitively.
CSV parsing
CSV parsing offers a flow that takes a stream of org.apache.pekko.util.ByteString and emits a stream of lists of ByteString.
The incoming data must contain line ends to allow line-based framing. The CSV special characters can be specified (as bytes); suitable values are available as constants in CsvParsing.
The current parser is limited to byte-based character sets (UTF-8, ISO-8859-1, ASCII) and can’t parse double-byte encodings (e.g. UTF-16).
The parser accepts Byte Order Mark (BOM) for UTF-8, but will fail for UTF-16 and UTF-32 Byte Order Marks.
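The Byte Order Mark handling described above can be pictured with a small plain-Java sketch. It is not the connector's implementation; `BomSketch` and its methods are hypothetical names. The byte values are the standard Unicode BOMs: EF BB BF for UTF-8, FE FF and FF FE for UTF-16, and 00 00 FE FF and FF FE 00 00 for UTF-32.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of Pekko Connectors.
class BomSketch {

    // True if data begins with the given unsigned byte values.
    static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    // Drops a UTF-8 BOM, rejects UTF-16 and UTF-32 BOMs, passes other input through.
    static byte[] stripUtf8Bom(byte[] data) {
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) {
            return Arrays.copyOfRange(data, 3, data.length);
        }
        // FF FE also covers the UTF-32 little-endian mark FF FE 00 00.
        if (startsWith(data, 0xFE, 0xFF)
                || startsWith(data, 0xFF, 0xFE)
                || startsWith(data, 0x00, 0x00, 0xFE, 0xFF)) {
            throw new IllegalArgumentException("UTF-16/UTF-32 input is not supported");
        }
        return data;
    }

    public static void main(String[] args) {
        byte[] input = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'a', ',', 'b'};
        // prints a,b
        System.out.println(new String(stripUtf8Bom(input), StandardCharsets.UTF_8));
    }
}
```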
import org.apache.pekko.stream.connectors.csv.scaladsl.CsvParsing
val flow: Flow[ByteString, List[ByteString], NotUsed] =
  CsvParsing.lineScanner(delimiter, quoteChar, escapeChar)
import org.apache.pekko.stream.connectors.csv.javadsl.CsvParsing;
Flow<ByteString, Collection<ByteString>, NotUsed> flow =
    CsvParsing.lineScanner(delimiter, quoteChar, escapeChar);
In this sample we read a single line of CSV formatted data into a list of column elements:
import org.apache.pekko.stream.connectors.csv.scaladsl.CsvParsing
Source.single(ByteString("eins,zwei,drei\n"))
  .via(CsvParsing.lineScanner())
  .runWith(Sink.head)
result should be(List(ByteString("eins"), ByteString("zwei"), ByteString("drei")))
import org.apache.pekko.stream.connectors.csv.javadsl.CsvParsing;
Source.single(ByteString.fromString("eins,zwei,drei\n"))
    .via(CsvParsing.lineScanner())
    .runWith(Sink.head(), system);
To convert the ByteString columns to String, a map operation can be added to the Flow:
import org.apache.pekko.stream.connectors.csv.scaladsl.CsvParsing
Source.single(ByteString("eins,zwei,drei\n"))
  .via(CsvParsing.lineScanner())
  .map(_.map(_.utf8String))
  .runWith(Sink.head)
result should be(List("eins", "zwei", "drei"))
import org.apache.pekko.stream.connectors.csv.javadsl.CsvParsing;
Source.single(ByteString.fromString("eins,zwei,drei\n"))
    .via(CsvParsing.lineScanner())
    .map(line -> line.stream().map(ByteString::utf8String).collect(Collectors.toList()))
    .runWith(Sink.head(), system);
CSV conversion into a map
The column-based nature of CSV files can be used to read them into a map of column names and their ByteString values, or alternatively String values. The column names can either be provided in code, or the first line of data can be interpreted as the column names.
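The two ways of obtaining column names can be pictured with a plain-Java sketch that works on already parsed lines. It is only an illustration under simplifying assumptions: `HeaderMapSketch`, `withHeaders` and `withFirstLineAsHeader` are hypothetical names, values are plain strings, and columns beyond the shorter of header and row are simply dropped here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Pekko Connectors.
class HeaderMapSketch {

    // Column names provided in code: pair each row with the given header.
    static List<Map<String, String>> withHeaders(List<String> header, List<List<String>> rows) {
        List<Map<String, String>> result = new ArrayList<>();
        for (List<String> row : rows) {
            Map<String, String> map = new LinkedHashMap<>();
            // Simplification: only positions present in both header and row are kept.
            int n = Math.min(header.size(), row.size());
            for (int i = 0; i < n; i++) {
                map.put(header.get(i), row.get(i));
            }
            result.add(map);
        }
        return result;
    }

    // Column names taken from the data: the first line is the header.
    static List<Map<String, String>> withFirstLineAsHeader(List<List<String>> lines) {
        if (lines.isEmpty()) {
            return new ArrayList<>();
        }
        return withHeaders(lines.get(0), lines.subList(1, lines.size()));
    }

    public static void main(String[] args) {
        List<List<String>> lines = List.of(
            List.of("eins", "zwei", "drei"),
            List.of("11", "12", "13"),
            List.of("21", "22", "23"));
        // prints [{eins=11, zwei=12, drei=13}, {eins=21, zwei=22, drei=23}]
        System.out.println(withFirstLineAsHeader(lines));
    }
}
```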
import org.apache.pekko.stream.connectors.csv.scaladsl.CsvToMap
// keep values as ByteString
val flow1: Flow[List[ByteString], Map[String, ByteString], NotUsed] =
  CsvToMap.toMap()
val flow2: Flow[List[ByteString], Map[String, ByteString], NotUsed] =
  CsvToMap.toMap(StandardCharsets.UTF_8)
val flow3: Flow[List[ByteString], Map[String, ByteString], NotUsed] =
  CsvToMap.withHeaders("column1", "column2", "column3")
// values as String (decode ByteString)
val flow4: Flow[List[ByteString], Map[String, String], NotUsed] =
  CsvToMap.toMapAsStrings(StandardCharsets.UTF_8)
val flow5: Flow[List[ByteString], Map[String, String], NotUsed] =
  CsvToMap.withHeadersAsStrings(StandardCharsets.UTF_8, "column1", "column2", "column3")
// values as String (decode ByteString), also for rows whose length differs from the header
val flow6: Flow[List[ByteString], Map[String, String], NotUsed] =
  CsvToMap.toMapAsStringsCombineAll(StandardCharsets.UTF_8, Option.empty)
import org.apache.pekko.stream.connectors.csv.javadsl.CsvParsing;
import org.apache.pekko.stream.connectors.csv.javadsl.CsvToMap;
// keep values as ByteString
Flow<Collection<ByteString>, Map<String, ByteString>, ?> flow1 = CsvToMap.toMap();
Flow<Collection<ByteString>, Map<String, ByteString>, ?> flow2 =
    CsvToMap.toMap(StandardCharsets.UTF_8);
Flow<Collection<ByteString>, Map<String, ByteString>, ?> flow3 =
    CsvToMap.withHeaders("column1", "column2", "column3");
// values as String (decode ByteString)
Flow<Collection<ByteString>, Map<String, String>, ?> flow4 =
    CsvToMap.toMapAsStrings(StandardCharsets.UTF_8);
Flow<Collection<ByteString>, Map<String, String>, ?> flow5 =
    CsvToMap.withHeadersAsStrings(StandardCharsets.UTF_8, "column1", "column2", "column3");
// values as String (decode ByteString), also for rows whose length differs from the header
Flow<Collection<ByteString>, Map<String, String>, ?> flow6 =
    CsvToMap.toMapAsStringsCombineAll(
        StandardCharsets.UTF_8, Optional.empty(), Optional.empty());
This example uses the first line (the header line) in the CSV data as column names:
import org.apache.pekko.stream.connectors.csv.scaladsl.{ CsvParsing, CsvToMap }
// values as ByteString
Source
.single(ByteString("""eins,zwei,drei
|11,12,13
|21,22,23
|""".stripMargin))
.via(CsvParsing.lineScanner())
.via(CsvToMap.toMap())
.runWith(Sink.seq)
result should be(
Seq(
Map("eins" -> ByteString("11"), "zwei" -> ByteString("12"), "drei" -> ByteString("13")),
Map("eins" -> ByteString("21"), "zwei" -> ByteString("22"), "drei" -> ByteString("23"))))
// values as String
Source
.single(ByteString("""eins,zwei,drei
|11,12,13
|21,22,23
|""".stripMargin))
.via(CsvParsing.lineScanner())
.via(CsvToMap.toMapAsStrings())
.runWith(Sink.seq)
result should be(
Seq(
Map("eins" -> "11", "zwei" -> "12", "drei" -> "13"),
Map("eins" -> "21", "zwei" -> "22", "drei" -> "23")))
import org.apache.pekko.stream.connectors.csv.scaladsl.{ CsvParsing, CsvToMap }
// values as String
Source
.single(ByteString("""eins,zwei,drei,vier,fünt
|11,12,13
|21,22,23
|""".stripMargin))
.via(CsvParsing.lineScanner())
.via(CsvToMap.toMapAsStringsCombineAll(headerPlaceholder = Option.empty))
.runWith(Sink.seq)
result should be(
Seq(
Map("eins" -> "11", "zwei" -> "12", "drei" -> "13", "vier" -> "", "fünt" -> ""),
Map("eins" -> "21", "zwei" -> "22", "drei" -> "23", "vier" -> "", "fünt" -> "")))
import org.apache.pekko.stream.connectors.csv.scaladsl.{ CsvParsing, CsvToMap }
// values as String
Source
.single(ByteString("""eins,zwei,drei
|11,12,13,14
|21,22,23
|""".stripMargin))
.via(CsvParsing.lineScanner())
.via(CsvToMap.toMapAsStringsCombineAll(headerPlaceholder = Option.empty))
.runWith(Sink.seq)
result should be(
Seq(
Map("eins" -> "11", "zwei" -> "12", "drei" -> "13", "MissingHeader0" -> "14"),
Map("eins" -> "21", "zwei" -> "22", "drei" -> "23")))
import org.apache.pekko.stream.connectors.csv.scaladsl.{ CsvParsing, CsvToMap }
// values as String
Source
.single(ByteString("""eins,zwei
|11,12,13
|21,22,
|""".stripMargin))
.via(CsvParsing.lineScanner())
.via(CsvToMap.toMapAsStringsCombineAll(headerPlaceholder = Option("MyCustomHeader")))
.runWith(Sink.seq)
result should be(
Seq(
Map("eins" -> "11", "zwei" -> "12", "MyCustomHeader0" -> "13"),
Map("eins" -> "21", "zwei" -> "22", "MyCustomHeader0" -> "")))
import org.apache.pekko.stream.connectors.csv.scaladsl.{ CsvParsing, CsvToMap }
// values as String
Source
.single(ByteString("""eins,zwei,drei,fünt
|11,12,13
|21,22,23
|""".stripMargin))
.via(CsvParsing.lineScanner())
.via(CsvToMap.toMapAsStringsCombineAll(customFieldValuePlaceholder = Option("missing")))
.runWith(Sink.seq)
result should be(
Seq(
Map("eins" -> "11", "zwei" -> "12", "drei" -> "13", "fünt" -> "missing"),
Map("eins" -> "21", "zwei" -> "22", "drei" -> "23", "fünt" -> "missing")))
import org.apache.pekko.stream.connectors.csv.scaladsl.{ CsvParsing, CsvToMap }
// values as ByteString
Source
.single(ByteString("""eins,zwei,drei,vier,fünt
|11,12,13
|21,22,23
|""".stripMargin))
.via(CsvParsing.lineScanner())
.via(CsvToMap.toMapCombineAll(headerPlaceholder = Option.empty))
.runWith(Sink.seq)
result should be(
Seq(
Map("eins" -> ByteString("11"),
"zwei" -> ByteString("12"),
"drei" -> ByteString("13"),
"vier" -> ByteString.empty,
"fünt" -> ByteString.empty),
Map("eins" -> ByteString("21"),
"zwei" -> ByteString("22"),
"drei" -> ByteString("23"),
"vier" -> ByteString.empty,
"fünt" -> ByteString.empty)))
import org.apache.pekko.stream.connectors.csv.scaladsl.{ CsvParsing, CsvToMap }
// values as ByteString
Source
.single(ByteString("""eins,zwei,drei
|11,12,13,14,15
|21,22,23
|""".stripMargin))
.via(CsvParsing.lineScanner())
.via(CsvToMap.toMapCombineAll(headerPlaceholder = Option.empty))
.runWith(Sink.seq)
result should be(
Seq(
Map("eins" -> ByteString("11"),
"zwei" -> ByteString("12"),
"drei" -> ByteString("13"),
"MissingHeader0" -> ByteString("14"),
"MissingHeader1" -> ByteString("15")),
Map("eins" -> ByteString("21"), "zwei" -> ByteString("22"), "drei" -> ByteString("23"))))
import org.apache.pekko.stream.connectors.csv.scaladsl.{ CsvParsing, CsvToMap }
// values as ByteString
Source
.single(ByteString("""eins,zwei
|11,12,13
|21,22,
|""".stripMargin))
.via(CsvParsing.lineScanner())
.via(CsvToMap.toMapCombineAll(headerPlaceholder = Option("MyCustomHeader")))
.runWith(Sink.seq)
result should be(
Seq(
Map("eins" -> ByteString("11"), "zwei" -> ByteString("12"), "MyCustomHeader0" -> ByteString("13")),
Map("eins" -> ByteString("21"), "zwei" -> ByteString("22"), "MyCustomHeader0" -> ByteString.empty)))
import org.apache.pekko.stream.connectors.csv.scaladsl.{ CsvParsing, CsvToMap }
// values as ByteString
Source
.single(ByteString("""eins,zwei,drei,fünt
|11,12,13
|21,22,
|""".stripMargin))
.via(CsvParsing.lineScanner())
.via(CsvToMap.toMapCombineAll(headerPlaceholder = Option("MyCustomHeader"), customFieldValuePlaceholder = Option(ByteString("missing"))))
.runWith(Sink.seq)
result should be(
Seq(
Map("eins" -> ByteString("11"),
"zwei" -> ByteString("12"),
"drei" -> ByteString("13"),
"fünt" -> ByteString("missing")),
Map("eins" -> ByteString("21"),
"zwei" -> ByteString("22"),
"drei" -> ByteString.empty,
"fünt" -> ByteString("missing"))))
import org.apache.pekko.stream.connectors.csv.javadsl.CsvParsing;
import org.apache.pekko.stream.connectors.csv.javadsl.CsvToMap;
// values as ByteString
Source.single(ByteString.fromString("eins,zwei,drei\n1,2,3"))
.via(CsvParsing.lineScanner())
.via(CsvToMap.toMap(StandardCharsets.UTF_8))
.runWith(Sink.head(), system);
assertThat(map.get("eins"), equalTo(ByteString.fromString("1")));
assertThat(map.get("zwei"), equalTo(ByteString.fromString("2")));
assertThat(map.get("drei"), equalTo(ByteString.fromString("3")));
// values as String
Source.single(ByteString.fromString("eins,zwei,drei\n1,2,3"))
.via(CsvParsing.lineScanner())
.via(CsvToMap.toMapAsStrings(StandardCharsets.UTF_8))
.runWith(Sink.head(), system);
assertThat(map.get("eins"), equalTo("1"));
assertThat(map.get("zwei"), equalTo("2"));
assertThat(map.get("drei"), equalTo("3"));
// values as ByteString
Source.single(ByteString.fromString("eins,zwei,drei\n1,2,3,4,5"))
.via(CsvParsing.lineScanner())
.via(
CsvToMap.toMapCombineAll(
StandardCharsets.UTF_8, Optional.empty(), Optional.empty()))
.runWith(Sink.head(), system);
assertThat(map.get("eins"), equalTo(ByteString.fromString("1")));
assertThat(map.get("zwei"), equalTo(ByteString.fromString("2")));
assertThat(map.get("drei"), equalTo(ByteString.fromString("3")));
assertThat(map.get("MissingHeader0"), equalTo(ByteString.fromString("4")));
assertThat(map.get("MissingHeader1"), equalTo(ByteString.fromString("5")));
// values as ByteString
Source.single(ByteString.fromString("eins,zwei,drei,vier,fünt\n1,2,3"))
.via(CsvParsing.lineScanner())
.via(
CsvToMap.toMapCombineAll(
StandardCharsets.UTF_8, Optional.empty(), Optional.empty()))
.runWith(Sink.head(), system);
assertThat(map.get("eins"), equalTo(ByteString.fromString("1")));
assertThat(map.get("zwei"), equalTo(ByteString.fromString("2")));
assertThat(map.get("drei"), equalTo(ByteString.fromString("3")));
assertThat(map.get("vier"), equalTo(ByteString.fromString("")));
assertThat(map.get("fünt"), equalTo(ByteString.fromString("")));
// values as ByteString
Source.single(ByteString.fromString("eins,zwei,drei\n1,2,3,4,5"))
.via(CsvParsing.lineScanner())
.via(
CsvToMap.toMapCombineAll(
StandardCharsets.UTF_8, Optional.empty(), Optional.of("MyCustomHeader")))
.runWith(Sink.head(), system);
assertThat(map.get("eins"), equalTo(ByteString.fromString("1")));
assertThat(map.get("zwei"), equalTo(ByteString.fromString("2")));
assertThat(map.get("drei"), equalTo(ByteString.fromString("3")));
assertThat(map.get("MyCustomHeader0"), equalTo(ByteString.fromString("4")));
assertThat(map.get("MyCustomHeader1"), equalTo(ByteString.fromString("5")));
// values as ByteString
Source.single(ByteString.fromString("eins,zwei,drei\n1,2"))
.via(CsvParsing.lineScanner())
.via(
CsvToMap.toMapCombineAll(
StandardCharsets.UTF_8,
Optional.of(ByteString.fromString("missing")),
Optional.of("MyCustomHeader")))
.runWith(Sink.head(), system);
assertThat(map.get("eins"), equalTo(ByteString.fromString("1")));
assertThat(map.get("zwei"), equalTo(ByteString.fromString("2")));
assertThat(map.get("drei"), equalTo(ByteString.fromString("missing")));
This sample will generate the same output as above, but the column names are specified in the code:
import org.apache.pekko.stream.connectors.csv.scaladsl.{ CsvParsing, CsvToMap }
// values as ByteString
Source
.single(ByteString(
"""11,12,13
|21,22,23
|""".stripMargin))
.via(CsvParsing.lineScanner())
.via(CsvToMap.withHeaders("eins", "zwei", "drei"))
.runWith(Sink.seq)
result should be(
Seq(
Map("eins" -> ByteString("11"), "zwei" -> ByteString("12"), "drei" -> ByteString("13")),
Map("eins" -> ByteString("21"), "zwei" -> ByteString("22"), "drei" -> ByteString("23"))))
// values as String
Source
.single(ByteString("""11,12,13
|21,22,23
|""".stripMargin))
.via(CsvParsing.lineScanner())
.via(CsvToMap.withHeadersAsStrings(StandardCharsets.UTF_8, "eins", "zwei", "drei"))
.runWith(Sink.seq)
result should be(
Seq(
Map("eins" -> "11", "zwei" -> "12", "drei" -> "13"),
Map("eins" -> "21", "zwei" -> "22", "drei" -> "23")))
import org.apache.pekko.stream.connectors.csv.javadsl.CsvParsing;
import org.apache.pekko.stream.connectors.csv.javadsl.CsvToMap;
// values as ByteString
Source.single(ByteString.fromString("1,2,3"))
.via(CsvParsing.lineScanner())
.via(CsvToMap.withHeaders("eins", "zwei", "drei"))
.runWith(Sink.head(), system);
assertThat(map.get("eins"), equalTo(ByteString.fromString("1")));
assertThat(map.get("zwei"), equalTo(ByteString.fromString("2")));
assertThat(map.get("drei"), equalTo(ByteString.fromString("3")));
// values as String
Source.single(ByteString.fromString("1,2,3"))
.via(CsvParsing.lineScanner())
.via(CsvToMap.withHeadersAsStrings(StandardCharsets.UTF_8, "eins", "zwei", "drei"))
.runWith(Sink.head(), system);
assertThat(map.get("eins"), equalTo("1"));
assertThat(map.get("zwei"), equalTo("2"));
assertThat(map.get("drei"), equalTo("3"));
// values as String
Source.single(ByteString.fromString("eins,zwei,drei,vier,fünt\n1,2,3"))
.via(CsvParsing.lineScanner())
.via(
CsvToMap.toMapAsStringsCombineAll(
StandardCharsets.UTF_8, Optional.empty(), Optional.empty()))
.runWith(Sink.head(), system);
assertThat(map.get("eins"), equalTo("1"));
assertThat(map.get("zwei"), equalTo("2"));
assertThat(map.get("drei"), equalTo("3"));
assertThat(map.get("vier"), equalTo(""));
assertThat(map.get("fünt"), equalTo(""));
// values as String
Source.single(ByteString.fromString("eins,zwei,drei\n1,2,3,4,5"))
.via(CsvParsing.lineScanner())
.via(
CsvToMap.toMapAsStringsCombineAll(
StandardCharsets.UTF_8, Optional.empty(), Optional.empty()))
.runWith(Sink.head(), system);
assertThat(map.get("eins"), equalTo("1"));
assertThat(map.get("zwei"), equalTo("2"));
assertThat(map.get("drei"), equalTo("3"));
assertThat(map.get("MissingHeader0"), equalTo("4"));
assertThat(map.get("MissingHeader1"), equalTo("5"));
// values as String
Source.single(ByteString.fromString("eins,zwei,drei\n1,2,3,4,5"))
.via(CsvParsing.lineScanner())
.via(
CsvToMap.toMapAsStringsCombineAll(
StandardCharsets.UTF_8, Optional.empty(), Optional.of("MyCustomHeader")))
.runWith(Sink.head(), system);
assertThat(map.get("eins"), equalTo("1"));
assertThat(map.get("zwei"), equalTo("2"));
assertThat(map.get("drei"), equalTo("3"));
assertThat(map.get("MyCustomHeader0"), equalTo("4"));
assertThat(map.get("MyCustomHeader1"), equalTo("5"));
// values as String
Source.single(ByteString.fromString("eins,zwei,drei\n1,2"))
.via(CsvParsing.lineScanner())
.via(
CsvToMap.toMapAsStringsCombineAll(
StandardCharsets.UTF_8, Optional.of("missing"), Optional.of("MyCustomHeader")))
.runWith(Sink.head(), system);
assertThat(map.get("eins"), equalTo("1"));
assertThat(map.get("zwei"), equalTo("2"));
assertThat(map.get("drei"), equalTo("missing"));
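The CombineAll examples above suggest a simple rule for rows whose length differs from the header: surplus values get a generated column name (placeholder plus running index), and missing values get a placeholder value. The plain-Java sketch below restates that rule as read off those examples. It is not the connector's code; `CombineAllSketch` and `combine` are hypothetical names, and the caller passes the placeholders explicitly (the examples show "MissingHeader" and the empty value when none is given).

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Pekko Connectors.
class CombineAllSketch {

    // Pairs header and row position by position, padding whichever side is shorter.
    static Map<String, String> combine(
            List<String> header, List<String> row, String fieldPlaceholder, String headerPlaceholder) {
        Map<String, String> map = new LinkedHashMap<>();
        int n = Math.max(header.size(), row.size());
        for (int i = 0; i < n; i++) {
            // Surplus values: generated names such as MissingHeader0, MissingHeader1, ...
            String key = i < header.size() ? header.get(i) : headerPlaceholder + (i - header.size());
            // Missing values: the field placeholder.
            String value = i < row.size() ? row.get(i) : fieldPlaceholder;
            map.put(key, value);
        }
        return map;
    }

    public static void main(String[] args) {
        // prints {eins=1, zwei=2, drei=3, MissingHeader0=4, MissingHeader1=5}
        System.out.println(combine(
            List.of("eins", "zwei", "drei"), List.of("1", "2", "3", "4", "5"), "", "MissingHeader"));
        // prints {eins=1, zwei=2, drei=missing}
        System.out.println(combine(
            List.of("eins", "zwei", "drei"), List.of("1", "2"), "missing", "MyCustomHeader"));
    }
}
```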
CSV formatting
To emit CSV files, immutable.Seq[String] can be formatted into ByteString, e.g. to be written to a file. The formatter takes care of quoting and escaping.
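What "quoting only where required" means can be sketched in plain Java. This is an illustration, not the connector's formatter: `QuoteIfRequiredSketch` and `formatLine` are hypothetical names, and doubling an embedded quote follows the RFC 4180 convention, which is assumed here rather than taken from the library.

```java
import java.util.List;

// Hypothetical helper, not part of Pekko Connectors.
class QuoteIfRequiredSketch {

    // Joins columns into one CSV line, quoting a column only if it needs it.
    static String formatLine(List<String> fields, char delimiter, char quote, String endOfLine) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) {
                sb.append(delimiter);
            }
            String field = fields.get(i);
            boolean needsQuotes = field.indexOf(delimiter) >= 0
                || field.indexOf(quote) >= 0
                || field.indexOf('\n') >= 0
                || field.indexOf('\r') >= 0;
            if (needsQuotes) {
                // Assumed RFC 4180 style: an embedded quote is written twice.
                String doubled = field.replace(String.valueOf(quote), "" + quote + quote);
                sb.append(quote).append(doubled).append(quote);
            } else {
                sb.append(field);
            }
        }
        return sb.append(endOfLine).toString();
    }

    public static void main(String[] args) {
        // prints eins,"zwei,drei","say ""hi""" followed by CR + LF
        System.out.print(formatLine(List.of("eins", "zwei,drei", "say \"hi\""), ',', '"', "\r\n"));
    }
}
```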
Certain CSV readers (e.g. Microsoft Excel) require CSV files to indicate their character encoding with a Byte Order Mark (BOM) in the first bytes of the file. Choose an appropriate Byte Order Mark matching the selected character set from the constants in ByteOrderMark (Unicode FAQ on Byte Order Mark).
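Matching the Byte Order Mark to the character set amounts to putting the right marker bytes in front of the encoded text. The plain-Java sketch below shows this for three charsets, using the standard Unicode BOM byte values. `BomPrefixSketch`, `bomFor` and `encodeWithBom` are hypothetical names for this illustration and do not reflect how the connector's ByteOrderMark constants are defined.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Pekko Connectors.
class BomPrefixSketch {

    // Standard Unicode BOM bytes for a few charsets.
    static byte[] bomFor(Charset charset) {
        if (charset.equals(StandardCharsets.UTF_8)) {
            return new byte[] {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};
        }
        if (charset.equals(StandardCharsets.UTF_16BE)) {
            return new byte[] {(byte) 0xFE, (byte) 0xFF};
        }
        if (charset.equals(StandardCharsets.UTF_16LE)) {
            return new byte[] {(byte) 0xFF, (byte) 0xFE};
        }
        throw new IllegalArgumentException("no BOM known for " + charset);
    }

    // Encodes the text and puts the matching BOM in the first bytes.
    static byte[] encodeWithBom(String csv, Charset charset) {
        byte[] bom = bomFor(charset);
        byte[] body = csv.getBytes(charset);
        byte[] out = new byte[bom.length + body.length];
        System.arraycopy(bom, 0, out, 0, bom.length);
        System.arraycopy(body, 0, out, bom.length, body.length);
        return out;
    }

    public static void main(String[] args) {
        // 3 BOM bytes + 5 content bytes: prints 8
        System.out.println(encodeWithBom("a,b\r\n", StandardCharsets.UTF_8).length);
    }
}
```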
import org.apache.pekko.stream.connectors.csv.scaladsl.{ CsvFormatting, CsvQuotingStyle }
val flow: Flow[immutable.Seq[String], ByteString, _] =
  CsvFormatting.format(delimiter,
    quoteChar,
    escapeChar,
    endOfLine,
    CsvQuotingStyle.Required,
    charset = StandardCharsets.UTF_8,
    byteOrderMark = None)
import org.apache.pekko.stream.connectors.csv.javadsl.CsvFormatting;
import org.apache.pekko.stream.connectors.csv.javadsl.CsvQuotingStyle;
Flow<Collection<String>, ByteString, ?> flow1 = CsvFormatting.format();
Flow<Collection<String>, ByteString, ?> flow2 =
    CsvFormatting.format(
        delimiter,
        quoteChar,
        escapeChar,
        endOfLine,
        CsvQuotingStyle.REQUIRED,
        charset,
        byteOrderMark);
This example uses the default configuration:
- Delimiter: comma (,)
- Quote char: double quote (")
- Escape char: backslash (\)
- Line ending: Carriage Return and Line Feed (\r = ASCII 13 + \n = ASCII 10)
- Quoting style: quote only if required
- Charset: UTF-8
- No Byte Order Mark
import org.apache.pekko.stream.connectors.csv.scaladsl.CsvFormatting
Source
  .single(List("eins", "zwei", "drei"))
  .via(CsvFormatting.format())
  .runWith(Sink.head)
import org.apache.pekko.stream.connectors.csv.javadsl.CsvFormatting;
import org.apache.pekko.stream.connectors.csv.javadsl.CsvQuotingStyle;
Source.single(Arrays.asList("one", "two", "three"))
    .via(CsvFormatting.format())
    .runWith(Sink.head(), system);