Data exchange formats

This post is also available in audible format on my podcast. The post discusses the most common and widely used data exchange formats.

There are various scenarios in which both software and systems engineers need a way to exchange data between users, systems, scripts and applications using mutually understandable protocols and formats. The most widely accepted data exchange formats are the following:

XML

Extensible Markup Language (XML) is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. The World Wide Web Consortium’s XML 1.0 Specification of 1998 and several other related specifications—all of them free open standards—define XML.

The design goals of XML emphasize simplicity, generality, and usability across the Internet. It is a textual data format with strong support via Unicode for different human languages. Although the design of XML focuses on documents, the language is widely used for the representation of arbitrary data structures such as those used in web services.

Several schema systems exist to aid in the definition of XML-based languages, while programmers have developed many application programming interfaces (APIs) to aid the processing of XML data.

An XML schema is a description of a type of XML document, typically expressed in terms of constraints on the structure and content of documents of that type, above and beyond the basic syntactical constraints imposed by XML itself. These constraints are generally expressed using some combination of grammatical rules governing the order of elements, Boolean predicates that the content must satisfy, data types governing the content of elements and attributes, and more specialized rules such as uniqueness and referential integrity constraints.

There are languages developed specifically to express XML schemas. The document type definition (DTD) language, which is native to the XML specification, is a schema language that is of relatively limited capability, but that also has other uses in XML aside from the expression of schemas. Two more expressive XML schema languages in widespread use are XML Schema (with a capital S) and RELAX NG. The mechanism for associating an XML document with a schema varies according to the schema language. The association may be achieved via markup within the XML document itself, or via some external means.

The XML specification is defined at: https://www.w3.org/TR/REC-xml/

YAML

YAML (a recursive acronym for “YAML Ain’t Markup Language”) is a human-readable data-serialization language. It is commonly used for configuration files and in applications where data is being stored or transmitted. YAML targets many of the same communications applications as Extensible Markup Language (XML) but has a minimal syntax which intentionally differs from SGML . It uses both Python-style indentation to indicate nesting, and a more compact format that uses [] for lists and {} for maps, making YAML 1.2 a superset of JSON.

Custom data types are allowed, but YAML natively encodes scalars (such as strings, integers, and floats), lists, and associative arrays (also known as maps or dictionaries). These data types are based on the Perl programming language, though all commonly used high-level programming languages share very similar concepts. The colon-centered syntax, used for expressing key-value pairs, is inspired by electronic mail headers as defined in RFC 0822, and the document separator “—” is borrowed from MIME (RFC 2046). Escape sequences are reused from C, and whitespace wrapping for multi-line strings is inspired from HTML. Lists and hashes can contain nested lists and hashes, forming a tree structure; arbitrary graphs can be represented using YAML aliases (similar to XML in SOAP). YAML is intended to be read and written in streams, a feature inspired by SAX.

Support for reading and writing YAML is available for many programming languages. Some source-code editors such as Emacs and various integrated development environments have features that make editing YAML easier, such as folding up nested structures or automatically highlighting syntax errors.

More details about the YAML specification can be found at: https://yaml.org/

CSV

comma-separated values (CSV) file is a delimited text file that uses a comma to separate values. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. The use of the comma as a field separator is the source of the name for this file format. A CSV file typically stores tabular data (numbers and text) in plain text, in which case each line will have the same number of fields.

The CSV file format is not fully standardized. The basic idea of separating fields with a comma is clear, but that idea gets complicated when the field data may also contain commas or even embedded line breaks. CSV implementations may not handle such field data, or they may use quotation marks to surround the field. Quotation does not solve everything: some fields may need embedded quotation marks, so a CSV implementation may include escape characters or escape sequences.

In addition, the term “CSV” also denotes some closely related delimiter-separated formats that use different field delimiters, for example, semicolons. These include tab-separated values and space-separated values. A delimiter that is not present in the field data (such as tab) keeps the format parsing simple. These alternate delimiter-separated files are often even given a .csv extension despite the use of a non-comma field separator. This loose terminology can cause problems in data exchange. Many applications that accept CSV files have options to select the delimiter character and the quotation character. Semicolons are often used in some European countries, such as Italy, instead of commas.

CSV is a common data exchange format that is widely supported by consumer, business, and scientific applications. Among its most common uses is moving tabular data between programs that natively operate on incompatible (often proprietary or undocumented) formats. This works despite lack of adherence to RFC 4180 (or any other standard), because so many programs support variations on the CSV format for data import. For example, a user may need to transfer information from a database program that stores data in a proprietary format, to a spreadsheet that uses a completely different format. The database program most likely can export its data as “CSV”; the exported CSV file can then be imported by the spreadsheet program.

RFC 4180 proposes a specification for the CSV format; however, actual practice often does not follow the RFC and the term “CSV” might refer to any file that:

  1. is plain text using a character set such as ASCII, various Unicode character sets (e.g. UTF-8), EBCDIC, or Shift JIS,
  2. consists of records (typically one record per line),
  3. with the records divided into fields separated by delimiters (typically a single reserved character such as comma, semicolon, or tab; sometimes the delimiter may include optional spaces),
  4. where every record has the same sequence of fields.

Within these general constraints, many variations are in use. Therefore, without additional information (such as whether RFC 4180 is honored), a file claimed simply to be in “CSV” format is not fully specified. As a result, many applications supporting CSV files allow users to preview the first few lines of the file and then specify the delimiter character(s), quoting rules, etc. If a particular CSV file’s variations fall outside what a particular receiving program supports, it is often feasible to examine and edit the file by hand (i.e., with a text editor) or write a script or program to produce a conforming format.

JSON

JavaScript Object Notation (JSON) is an open-standard file format or data interchange format that uses human-readable text to transmit data objects consisting of attribute–value pairs and array data types (or any other serializable value). It is a very common data format, with a diverse range of applications, such as serving as replacement for XML in AJAX systems.

JSON is a language-independent data format. It was derived from JavaScript, but many modern programming languages include code to generate and parse JSON-format data. The official Internet media type for JSON is application/json. JSON filenames use the extension .json. Douglas Crockford originally specified the JSON format in the early 2000s. JSON was first standardized in 2013, as ECMA-404. The latest JSON format standard was published in 2017 as RFC 8259, and remains consistent with ECMA-404. That same year, JSON was also standardized as ISO/IEC 21778:2017. The ECMA and ISO standards describe only the allowed syntax, whereas the RFC covers some security and interoperability considerations.

JSON is also extensively used in cloud computing services and applications to achieve Infrastructure As Code (IAC). IAC provides the ability to cloud administrators to design and provision cloud infrastructure in an automated fashion. JSON files are being used as IAC templates and describe the resources which will be provisioned in the cloud infrastructure and the features and dependencies of these cloud resources. Such an example is Azure Resource Manager (ARM) templates, which are written in JSON format.

More details about the JSON specification can be found at: https://www.json.org/json-en.html

HTTP Content-Type

There are various content types available. One case in which content types are categorized is the HTTP protocol header. The content type field is supported by the HTTP/ 1.1 and HTTP/ 2.0 protocol.

The Content-Type entity header is used to indicate the media type of the resource. In responses, a Content-Type header tells the client what the content type of the returned content actually is. Browsers will do MIME sniffing in some cases and will not necessarily follow the value of this header; to prevent this behavior, the header X-Content-Type-Options can be set to nosniff. In requests, (such as POST or PUT), the client tells the server what type of data is actually sent.