Talk:Comparison of data-serialization formats


Reason for this article

This content was a long list in the main XML article. I removed the list to put it here, because the XML article is already very long. Hervegirod (talk) 00:45, 6 August 2009 (UTC)

If this article is still up for deletion, I don't think it should be deleted. There are lots of these comparison articles, and they are useful for finding information and comparing different things quickly.

SeanJA (talk) 05:31, 12 September 2009 (UTC)

Suggestions

A common term for "data serialization format" is encoding. You may want to include in this comparison:


I second the inclusion of XDR. Jann.poppinga (talk) 10:57, 19 March 2010 (UTC)

Shouldn't Boost Serialization be included here? — Preceding unsigned comment added by 98.171.183.235 (talk) 00:54, 25 October 2022 (UTC)

This section is in the wrong place

This section should be on the XML page...

XML

Advantages

  • XML provides a basic syntax that can be used to share information between different kinds of computers, different applications, and different organizations. XML data is stored in plain text format.[1] This software- and hardware-independent way of storing data allows different incompatible systems to share data without needing to pass them through many layers of conversion. This also makes it easier to expand or upgrade to new operating systems, new applications, or new browsers, without losing any data.
  • It supports Unicode, allowing almost any information in any written human language to be communicated.
  • It can represent common computer science data structures: records, lists and trees.
  • Its self-documenting format describes structure and field names as well as specific values.
  • The strict syntax and parsing requirements make the necessary parsing algorithms extremely simple, efficient, and consistent.
  • Content-based XML markup enhances searchability, making it possible for agents and search engines to categorize data instead of wasting processing power on context-based full-text searches.
  • The hierarchical structure is suitable for most (but not all) types of documents.
  • It is platform-independent, thus relatively immune to changes in technology.
  • Its predecessor, SGML, has been in use since 1986, so there is extensive experience and software available.

Disadvantages

  • XML syntax is redundant or large relative to binary representations of similar data,[2] especially with tabular data. However, this is comparing apples to oranges: binary versus text-based representations. This is not specifically a disadvantage of XML as the same applies to JSON and other text-based formats.
  • The redundancy may affect application efficiency through higher storage, transmission and processing costs.[3][4] However, efficient stream-based parsers such as SAX and pull parsers do not need to hold the whole XML document in memory and can extract data efficiently.
  • The hierarchical model for representation is limited in comparison to an object oriented graph.[5][6]
  • Expressing overlapping (non-hierarchical) node relationships requires extra effort.[7] However, SOAP encoding demonstrates how easily graphs can be serialized with proper use of ID and IDREF.
  • Transformations, even identity transforms, result in changes to format (whitespace, attribute ordering, attribute quoting, whitespace around attributes, newlines). These problems can make diff-ing the XML source very difficult except via Canonical XML.
  • Unlike JSON or YAML, XML does not map directly and unambiguously to an associative array. — Preceding unsigned comment added by Cowlinator (talk • contribs) 16:00, 16 February 2021 (UTC)
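
The point above about formatting changes and Canonical XML can be illustrated with a minimal sketch. It assumes Python 3.8 or later, whose standard library provides xml.etree.ElementTree.canonicalize (C14N 2.0); the element and attribute names are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Two serializations of the same hypothetical element. They differ only in
# attribute order, quote style, whitespace inside the tag, and the
# empty-element form.
variant_a = "<item  b='2' a=\"1\"/>"
variant_b = '<item a="1" b="2"></item>'

# canonicalize() returns the canonical (C14N 2.0) form as a string.
canonical_a = ET.canonicalize(variant_a)
canonical_b = ET.canonicalize(variant_b)

print(variant_a == variant_b)      # the raw text differs
print(canonical_a == canonical_b)  # the canonical forms can be compared directly
```

A plain text diff of the two raw strings reports a difference, whereas comparing the canonical forms does not, which is the reason for diffing via Canonical XML.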

References

  1. ^ "How Can XML be Used?". W3schools.com. Retrieved 2009-07-31.
  2. ^ Harold, Elliotte Rusty (2002). Processing XML with Java(tm): a guide to SAX, DOM, JDOM, JAXP, and TrAX. Addison-Wesley. ISBN 0201771861. XML documents are too verbose compared with binary equivalents.
  3. ^ Harold, Elliotte Rusty (2002). XML in a Nutshell: A Desktop Quick Reference. O'Reilly. ISBN 0596002920. XML documents are very verbose and searching is inefficient for high-performance large-scale database applications.
  4. ^ However, the Binary XML effort strives to alleviate these problems by using a binary representation for the XML document. For example, the parsing speed of the Java reference implementation of the Fast Infoset standard is better by a factor of 10 compared to Java Xerces, and by a factor of 4 compared to the Piccolo driver, one of the fastest Java-based XML parsers [1].
  5. ^ A hierarchical model only gives a fixed, monolithic view of the tree structure. For example, either actors under movies, or movies under actors, but not both.
  6. ^ Lim, Ee-Peng (2002). Digital Libraries: People, Knowledge, and Technology. Springer. ISBN 3540002618. Discusses some of the limitations of a fixed hierarchy. Proceedings of the 5th International Conference on Asian Digital Libraries, ICADL 2002, held in Singapore in December 2002.
  7. ^ Searle, Leroy F. (2004). Voice, text, hypertext: emerging practices in textual studies. University of Washington Press. ISBN 0295983051. Proposes an alternative system for encoding overlapping elements.

Human Readable?

XML should only be tagged as partially human-readable. Simple, basic XML files can be readable, but once xmlns and XSD come into play, it quickly stops being human-readable. Another factor is that it's not always possible to properly reformat/indent XML for readability without affecting content. 81.220.246.44 (talk) 14:20, 24 October 2014 (UTC)

Quite Human Readable!

Concerning the "not human-readable" items mentioned above: XML namespace declarations ("xmlns" attributes) and XML schema definitions (XSDs) are text just like XML, and perfectly human-readable. And you absolutely can reformat/indent XML without affecting content; explicit value delimitation via tags and attributes is the whole point of XML and its benefit over whitespace-delimited formats like YAML (one drawback of which is the negative effect of improper or inconsistent indentation). — Preceding unsigned comment added by 192.91.171.42 (talk) 17:11, 17 January 2020 (UTC)
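
What re-indenting actually changes in a parsed document can be checked with a small sketch. It assumes Python 3.9 or later (for xml.etree.ElementTree.indent); the element names are hypothetical. It is meant only to illustrate the two positions above, not to settle which one the article should follow.

```python
import xml.etree.ElementTree as ET

# A hypothetical record with no whitespace between elements.
root = ET.fromstring("<person><name>Ada</name><age>36</age></person>")

text_before = root.text              # None: nothing precedes <name>
ET.indent(root)                      # pretty-print in place
text_after = root.text               # now a whitespace-only string
name_after = root.find("name").text  # the leaf value itself

print(repr(text_before), repr(text_after), repr(name_after))
```

In this sketch the leaf values are untouched, but the parent element gains whitespace-only text. Whether that counts as "affecting content" depends on whether the consumer treats such whitespace as significant, for example in mixed-content documents.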

JSON Associative Array Error

The JSON associative array sample - {42: true, "A to Z": [1, 2, 3]} - looks wrong to me (and also to JSONLint). In JSON the property names ("keys" if you will) must be double-quoted strings. Neither numbers nor unquoted strings are valid, hence 42 cannot be a property name, although "A to Z" can, as can "42".

See JSON.org, as follows:

  • An object is an unordered set of name/value pairs
  • pair
    • string : value
  • A string is a sequence of zero or more Unicode characters, wrapped in double quotes

--Mikepeat (talk) 15:30, 10 January 2011 (UTC)
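
The observation above can be reproduced with a short sketch using Python's standard json module. The helper name is_valid_json is hypothetical; the two strings are the sample quoted above and a version with the numeric key quoted.

```python
import json


def is_valid_json(text):
    """Return True if text parses as JSON, False otherwise."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# The sample quoted above, with an unquoted numeric property name.
sample_from_article = '{42: true, "A to Z": [1, 2, 3]}'
# The same data with the property name written as a string.
corrected = '{"42": true, "A to Z": [1, 2, 3]}'

print(is_valid_json(sample_from_article))
print(is_valid_json(corrected))
```

A strict parser rejects the first string and accepts the second, which matches the JSON.org grammar quoted above: a pair is string : value.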

Missing Apache Avro

Especially for the subsection about "binary formats", but also for the "Overview", I would expect some information about Apache Avro: http://en.wikipedia.org/wiki/Apache_Avro ... So far I don't know enough about it myself to write anything. --217.24.206.242 (talk) 11:10, 18 September 2012 (UTC)

Missing Java Serialization

As I understand the intention of this article, Java Serialization should be a part of it. It is one of the commonly used object serialization formats (e.g. for RMI communication). — Preceding unsigned comment added by 217.18.178.110 (talk) 12:52, 20 June 2014 (UTC)

Missing Python Pickle

Since it has been suggested that Java Serialization should be part of this article, what about other language-specific serialization formats, such as Python's pickle? 195.212.29.89 (talk) 07:00, 25 September 2014 (UTC)

Missing Microsoft Bond

https://github.com/Microsoft/bond/

From the GitHub page: "Bond is a cross-platform framework for working with schematized data. It supports cross-language de/serialization and powerful generic mechanisms for efficiently manipulating data. Bond is broadly used at Microsoft in high scale services." — Preceding unsigned comment added by 82.136.100.19 (talk) 10:11, 30 January 2015 (UTC)

Missing other Protocol Buffers flavors

To be complete, FlatBuffers (http://google.github.io/flatbuffers/) and Cap'n Proto (https://capnproto.org/) could be mentioned. 128.237.28.16 (talk) 16:00, 24 February 2015 (UTC)

Agree. Cap'n Proto is very interesting and would be a good comparison. I do not have enough in-depth knowledge to create an official entry. CaliViking (talk) 18:33, 29 September 2022 (UTC)

Misleading "Standardized?" Column

The term Standardized leads to a page describing National and International Standards. Many of the entries are misleadingly listed as "Standardized" when in fact they are not standardized protocols, never having been approved by a due-process ANSI or ISO approved standards development organization. For example, Apache is not an ANSI or ISO approved standards development organization and therefore Avro is not a standardized protocol unless it is submitted and approved by such a body. — Preceding unsigned comment added by Posicks (talk • contribs) 15:45, 9 May 2015 (UTC)

Something is standardized if a useful specification is publicly available. --195.14.219.99 (talk) 21:01, 3 November 2015 (UTC)

The same can be said for Protocol Buffers - the link just refers to its own documentation. — Preceding unsigned comment added by 148.80.255.144 (talk) 20:44, 23 February 2018 (UTC)

Missing Gob

Created for Go programming language. https://golang.org/pkg/encoding/gob/

Missing Npy

Created as a simple way to serialize (Python's) Numpy objects. https://www.numpy.org/devdocs/reference/generated/numpy.lib.format.html

EDN is missing

https://github.com/edn-format/edn — Preceding unsigned comment added by 164.144.252.29 (talk) 18:56, 11 November 2015 (UTC)

EDN seems to have an article here: Extensible Data Notation. 50.53.1.21 (talk) 22:04, 29 October 2017 (UTC)

Missing Cap'n Proto

Looks very interesting. I don't have the in-depth knowledge to write a good entry. The creator of the format is very knowledgeable in the field, as the primary author of Protocol Buffers version 2. See also https://capnproto.org/ , https://github.com/capnproto/ , https://groups.google.com/g/capnproto CaliViking (talk) 18:33, 29 September 2022 (UTC)

External links modified

Hello fellow Wikipedians,

I have just modified one external link on Comparison of data serialization formats. Please take a moment to review my edit. If you have any questions, or need the bot to ignore the links, or the page altogether, please visit this simple FaQ for additional information. I made the following changes:

When you have finished reviewing my changes, you may follow the instructions on the template below to fix any issues with the URLs.

This message was posted before February 2018. After February 2018, "External links modified" talk page sections are no longer generated or monitored by InternetArchiveBot. No special action is required regarding these talk page notices, other than regular verification using the archive tool instructions below. Editors have permission to delete these "External links modified" talk page sections if they want to de-clutter talk pages, but see the RfC before doing mass systematic removals. This message is updated dynamically through the template {{source check}} (last update: 18 January 2022).

  • If you have discovered URLs which were erroneously considered dead by the bot, you can report them with this tool.
  • If you found an error with any archives or the URLs themselves, you can fix them with this tool.

Cheers.—InternetArchiveBot (Report bug) 16:54, 11 August 2017 (UTC)

YAML IDL

Shouldn't "JSON Schema" be added as a schema spec language for YAML?

Arnauld (talk) 11:32, 19 March 2019 (UTC)

Swagger

Add Swagger as a comparator. 148.80.255.166 (talk) 13:31, 29 April 2019 (UTC)

HJSON: A useful compromise between JSON and YAML

I have added a row for HJSON, which an IP removed. The reasoning is that we don’t include human-readable “user interface” formats, but we already have YAML here, which is a human-readable superset of JSON.

Not having HJSON on this page makes Wikipedia a worse place, because it deprives its readers of knowing about a perfectly useful serialization format which has some features JSON lacks (comments, multi-line strings) while not having the complexity of YAML. As per WP:NNC, this content should be restored. No, I have no relation to HJSON, except that it’s a useful data format I wish I had known about earlier. Samboy (talk) 19:56, 13 August 2019 (UTC)

This is a comparison of stuff that already has a Wikipedia article. We shouldn't add external links. - MrOllie (talk) 20:12, 13 August 2019 (UTC)
HJSON doesn’t have enough coverage in books or magazines to support a full article, but I think we can add a sentence mentioning it to the JSON article because of this source: Edelman, Jason; Lowe, Scott; Oswalt, Matt. Network Programmability and Automation. O'Reilly Media. Quote: "for data representation you can pick one of the following: YAML, YAMLEX, JSON, JSON5, HJSON, or even pure Python". Samboy (talk) 20:27, 13 August 2019 (UTC)
I didn't claim that Hjson is a "user interface format", I claimed it is a "user interface", period. Just see the title of hjson.org, which states "Hjson, a user interface for JSON". The comparison to YAML is flawed, since YAML is a data-serialization format, while Hjson doesn't even claim to be one. Likening Hjson to the programming language Python isn't helping the case. Shall we include Python in the list as well? I have yet to see anything that makes Hjson noteworthy in either this article or the JSON article. It's great that you have found a UI to JSON that you like, but Wikipedia is not a linkfarm. 185.213.154.172 (talk) 20:43, 13 August 2019 (UTC)
I agree that we shouldn’t make Wikipedia a link farm, but I also feel we should not completely ignore useful reliably sourced information. HJSON (and JSON5) are both mentioned in a reliable source, and mentioning them in the JSON article has a much lower bar of entry than a standalone Wikipedia article; as per the WP:LINKFARM concerns, I did not hotlink to the web pages (I assume our readers are smart enough to perform a Google search), but I think naming them in a single short sentence in JSON (where we already have an extensive discussion of YAML) is acceptable now that I have a reliable source which names them. 21:12, 13 August 2019 (UTC)
You rewrote and replaced your comment while I was replying to it, so I had to start my reply from scratch. You seem to have understood that the arguments were flawed. My understanding is that including Hjson in this article would be wrong with regard to the article's topic, and including it in the JSON article would give it undue weight. Discussing YAML in the JSON article is completely different. YAML is a standard, it is a serialization format (not a "configuration" format), it's a JSON superset, et cetera. 185.213.154.172 (talk) 21:20, 13 August 2019 (UTC)
Continuing the Python comparison (which also answers your deleted comment): including Hjson in a serialization format discussion is like including Python instead of Pickle. 185.213.154.172 (talk) 21:32, 13 August 2019 (UTC)
By YAML being a standard, I was referring to how exceedingly common it is, i.e. a de facto standard. 185.213.154.172 (talk) 21:34, 13 August 2019 (UTC)
I agree with MrOllie above. I would like to add that when an article for Hjson exists, and the configuration file format is shown to be noteworthy, it seems like it would best fit in the "See also" section of the JSON article, under "Related formats". 185.213.154.172 (talk) 22:20, 13 August 2019 (UTC)

Additional and upcoming formats possibly not worth mentioning in the main article, but to be gathered anyway

Okay, there are a lot of formats out there. I am opening this section to collect and reference them; if they reach the notability threshold, they can be included in the main article. I'm sure there are plenty. There is also some overlap between data-serialization formats, data exchange formats and configuration file formats. I do not make a distinction here since, in my opinion, they are mostly the same set with very similar purposes and only slightly different specific attributes. --grin 09:39, 20 February 2020 (UTC)

@Grin: TOML isn't a serialization format used as a configuration file format (as often happens), it's explicitly designed as a configuration file format. To quote the official objectives: "TOML aims to be a minimal configuration file format that's easy to read due to obvious semantics."
The same goes for INI files, and HOCON (it's even in the name: Human-Optimized Config Object Notation). To quote the HOCON informal specification: "The primary goal is: keep the semantics (tree structure; set of types; encoding/escaping) from JSON, but make it more convenient as a human-editable config file format." Thus everything that HOCON can serialize, JSON can serialize. HOCON is entirely superfluous when it comes to serialization (which this article is about).
This article is crowded as is. If someone wants a table comparing configuration file formats, then why not create the article Comparison of configuration file formats? Then this article could link to that, and that article could link to this. Nothing's stopping that, right?
193.138.218.217 (talk) 11:29, 20 June 2020 (UTC)
Obviously not right, since I guess it would be deleted within the first few minutes with the justification "there's already a data-serialisation format article, insert it there"; at least, this has a pretty high probability of happening. But it's not my child, do as you please. *shrug* --grin 14:12, 21 June 2020 (UTC)
@193.138.218.217, I'm not sure why you are talking about serialization formats and configuration file formats as if they are mutually exclusive. Anyone can use a configuration file format for serialization, and anyone can use a serialization format as a configuration file. The intended usage is irrelevant. All of the formats below should be added to the article. --Cowlinator (talk) 15:43, 16 February 2021 (UTC)

Human readable

Binary

  • UAVCAN -- a pub/sub protocol that defines an interface definition language (DSDL)

In what way is YAML not standardized?

In what way is YAML not standardized? --Cowlinator (talk) 16:02, 16 February 2021 (UTC)

Other data serialization formats

Hi,

Other media types like images, audio and video are data too. The main difference might be that they usually are binary encoded, but that's okay since there are plenty of binary data formats already in the article Comparison of data-serialization formats.

I suggest referring to Media types as they're standardized.

Have a nice day :) Dun Nic (talk) 19:22, 14 October 2022 (UTC)

Additional characteristics for comparison

Hi, I'm missing a key characteristic (at least, it's key to me).

Lacking a better name for it, we can refer to it as streaming. A data serialization format supporting streaming would be one that supports an unlimited number of items in one data stream.

For instance:

  • CSV supports one item per line.
  • Log files support one item per line.
  • YAML supports multiple "documents".
  • Multipart/form-data supports many "parts" and even of different MIME types.
  • JSON on the other hand, does not allow concatenation of multiple documents.
  • The same goes for XML: it requires exactly one (closed) root node.

Have a nice day :) Dun Nic (talk) 19:39, 14 October 2022 (UTC)
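
The contrast drawn in the list above can be sketched with Python's standard csv and json modules. The sample data and variable names are hypothetical; the sketch only shows that CSV records can be read one line at a time, while a strict JSON parser rejects two documents concatenated in one string.

```python
import csv
import io
import json

# CSV: every line is an independent record, so a reader can consume the
# stream item by item for as long as lines keep arriving.
csv_stream = io.StringIO("id,name\n1,alpha\n2,beta\n")
rows = list(csv.reader(csv_stream))

# JSON: two complete documents back to back are not one valid document.
two_docs = '{"id": 1}{"id": 2}'
try:
    json.loads(two_docs)
    concatenation_ok = True
except json.JSONDecodeError:
    concatenation_ok = False

print(rows)
print(concatenation_ok)
```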

Yes, we should add a streamable field. It can be argued, though, that JSON is partially streamable, as there is no rule against sending multiple objects in one document.
I propose:
- Streamable: Yes. This means it is explicitly designed for streaming or live appending, such as CSV and log files.
- Streamable: Somewhat. This means it was not designed for it, but there are ways to stream it (such as making a file/stream an implicit array of JSON objects).
- Streamable: No. Formats which cannot be streamed, such as XML, where streaming would inherently violate the structure. Tryoxiss (talk) 00:59, 7 December 2023 (UTC)

Missing RDF

RDF seems legit to me :) Dun Nic (talk) 19:39, 14 October 2022 (UTC)

PostScript binary format

There is also the PostScript binary format. It has the advantage that you might not need to parse all of the data to find something; each part contains the address of the sub-parts. However, it also has disadvantages such as lack of 64-bit integers, and strings cannot exceed 64K. --Zzo38 (talk) 01:34, 6 April 2024 (UTC)