Innovimax-SARL/QuiXDM

View on GitHub
README.md

Summary

Maintainability
Test Coverage
[![Build Status](https://travis-ci.org/Innovimax-SARL/QuiXDM.svg?branch=master)](https://travis-ci.org/Innovimax-SARL/QuiXDM)
[![Coverity Scan Build Status](https://scan.coverity.com/projects/6518/badge.svg)](https://scan.coverity.com/projects/6518)
[![Code Climate](https://codeclimate.com/github/Innovimax-SARL/QuiXDM/badges/gpa.svg)](https://codeclimate.com/github/Innovimax-SARL/QuiXDM)

# QuiXDM
**QuiXDM** is an *ubiquitous* open-source datamodel to process in a Streaming fashion:
* [x] XML (via [XQuery and XPath Data Model (XDM)](https://www.w3.org/TR/xpath-datamodel)) 
* [x] JSON 
  * [x] YAML
* [x] RDF Triple 
  * [ ] Quad
* [x] CSV
  * [ ] TSV
* [ ] HTML

# Getting Started

To install it
*  Checkout this code. It's Java 1.8+ compliant
*  Get access to Saxon 9.7: http://saxon.sourceforge.net/
*  Get access to Jackson Core 2.7.4: https://github.com/FasterXML/jackson-core
*  and few other dependencies (see pom.xml)

# Why QuiXDM?
There is SAX,StAX, DOM, Jackson, Jena, CSVParser, HTMLParser out there for processing data

 Feature\API | SAX | StAX | DOM | Jackson | **QuiXDM**
------|-----|------|-----|-------|----
in memory/streaming | streaming | streaming | in memory | streaming | **streaming**
push/pull | push | pull | -- | pull | **pull**
data model | low level XML | low level XML | low level XML | low level JSON | **XPath Data Model**
handle sequence | no | no | no | no | **yes**
handle json/yaml | no | no | no | yes | **yes**
handle rdf  | no | no | no | no  | **yes**
handle csv  | no | no | no | no  | **yes**
handle html | no | no | no | no  | **yes**

# How does it work?
It uses a consistent datamodel to represent all those contents in streaming.

```ANTLR
// Here is the grammar of events
sequence       := START_SEQUENCE, (document|json_yaml|table|semantic)*, END_SEQUENCE
document       := START_DOCUMENT, (PROCESSING-INSTRUCTION|COMMENT)*, element, (PROCESSING-INSTRUCTION|COMMENT)*, END_DOCUMENT
json_yaml      := START_JSON, object, END_JSON
table          := START_TABLE, header*, array_of_array, END_TABLE
semantic       := START_RDF, statement*, END_RDF
element        := START_ELEMENT, (NAMESPACE|ATTRIBUTE)*, (TEXT|element|PROCESSING-INSTRUCTION|COMMENT)*, END_ELEMENT
object         := START_OBJECT, (KEY_NAME, value)*, END_OBJECT
value          := object|array|flat_value
flat_value     := VALUE_FALSE|VALUE_TRUE|VALUE_NUMBER|VALUE_NULL|VALUE_STRING
array          := START_ARRAY, value*, END_ARRAY
array_of_array := START_ARRAY, flat_array+, END_ARRAY
flat_array     := START_ARRAY, flat_value*, END_ARRAY
statement      := START_PREDICATE, SUBJECT, OBJECT, GRAPH?, END_PREDICATE
```
Mostly look at [QuiXToken.java](https://github.com/innovimax/QuiXDM/blob/master/src/main/java/innovimax/quixproc/datamodel/QuiXToken.java)

# Use
## With Object creation (à la [javax.xml.stream.XMLEventReader](https://docs.oracle.com/javase/8/docs/api/index.html?javax/xml/stream/XMLEventReader.html))
Simplest way to use, is to instantiate [innovimax.quixproc.datamodel.in.QuiXEventStreamReader.java](https://github.com/innovimax/QuiXDM/blob/master/src/main/java/innovimax/quixproc/datamodel/in/QuiXEventStreamReader.java)
```java
Iterable<Source> sources = 
        "/tmp/file/file_aaa.xml",    
        "/tmp/file/file_aab.json",
        "/tmp/file/file_aac.csv",
        "/tmp/file/file_aad.yml",
        "/tmp/file/file_aae.n3"    
;
QuiXEventStreamReader qesr = new QuiXEventStreamReader(sources);
while(qesr.hasNext()) {
    System.out.println(qesr.next());
}
```
## Lightweight iterator without Object creation (à la [javax.xml.stream.XMLStreamReader](https://docs.oracle.com/javase/8/docs/api/index.html?javax/xml/stream/XMLStreamReader.html))
***TODO***



# Why [QuiXCharStream](https://github.com/innovimax/QuiXDM/blob/master/src/main/java/innovimax/quixproc/datamodel/QuiXCharStream.java) and [QuiXQName](https://github.com/innovimax/QuiXDM/blob/master/src/main/java/innovimax/quixproc/datamodel/QuiXQName.java)?
Well it comes from the fact that Streaming interface in XML should really be streaming.
The truth is that there is no such character streaming interface in Java.
 * String is definitely not streamable and limited to 2^31 characters
 * CharSequence, which could have been, is neither because it has [length()](http://docs.oracle.com/javase/8/docs/api/java/lang/CharSequence.html#length--)
 * CharIterator doesn't exist in the JDK (but you can find it [here](http://fastutil.di.unimi.it/docs/it/unimi/dsi/fastutil/chars/CharIterator.html))
 * CharSequence.chars() returns IntStream (instead of CharStream because Java 8 people didn't want to add it) 
 * Java 8 Stream<Char> implies that every char is boxed (which means it's highly INEFFICIENT)
 
Having such context, that's why [QuiXCharStream](https://github.com/innovimax/QuiXDM/blob/master/src/main/java/innovimax/quixproc/datamodel/QuiXCharStream.java) and [QuiXQName](https://github.com/innovimax/QuiXDM/blob/master/src/main/java/innovimax/quixproc/datamodel/QuiXQName.java) went live in order to :
 * be able to address the TEXT recombination issue (text() node in XDM cannot be contiguous)
 * be able to stream even corner cases XML:
   * huge string
   * huge names
   * huge namespace uris

# Contributors
[Innovimax](http://innovimax.fr) is contributing to this work
# Related Projects
QuiXDM can be used standalone

This is the data model of QuiXPath and QuiXProc

It is part of two bigger projects :

*  QuiXProc: https://github.com/innovimax/quixproc
*  QuiXPath: https://github.com/innovimax/quixpath