AsyncIO is a relatively new framework for achieving concurrency in Python. In this article, I compare it with the traditional approaches, multithreading and multiprocessing. Before jumping into examples, here are a few refreshers about concurrency in Python.
- CPython enforces the GIL (Global Interpreter Lock), which prevents taking full advantage of multithreading: each thread must acquire this mutually exclusive lock before running any bytecode.
- Multithreading is usually preferred for network I/O or disk I/O, because threads spend most of their time waiting on I/O and therefore don't compete hard among themselves for the GIL.
- Multiprocessing is usually preferred for CPU-intensive tasks. It sidesteps the GIL because each process has its own interpreter state; however, creating and destroying processes is not trivial.
- Multithreading with the threading module is preemptive, which entails both voluntary and involuntary swapping of threads (see the snippet after this list).
- AsyncIO is single-threaded, single-process cooperative multitasking: an asyncio task has exclusive use of the CPU until it willingly gives it up to the coordinator, or event loop. (The terminology is covered later.)
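As a quick illustration of the preemption bullet above (a sketch of mine, not from the original article): CPython exposes the interval after which a running thread is asked to release the GIL so that another thread can be scheduled.

import sys

# CPython asks the running thread to release the GIL after this many
# seconds so another thread can run (the default is 0.005).
print(sys.getswitchinterval())

# The interval is tunable; raising it can reduce switching overhead for
# CPU-bound threads, at the cost of scheduling latency.
sys.setswitchinterval(0.01)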
Hands-on Examples
Delay Messages
The program below delays printing a message. While the MainThread is sleeping, the CPU sits idle, which is a poor use of resources.
import logging
import time

logger_format = '%(asctime)s:%(threadName)s:%(message)s'
logging.basicConfig(format=logger_format, level=logging.INFO, datefmt="%H:%M:%S")
num_word_mapping = {1: 'ONE', 2: 'TWO', 3: "THREE", 4: "FOUR", 5: "FIVE", 6: "SIX", 7: "SEVEN", 8: "EIGHT", 9: "NINE", 10: "TEN"}

def delay_message(delay, message):
    logging.info(f"{message} received")
    time.sleep(delay)  # blocks the only thread; the CPU sits idle
    logging.info(f"Printing {message}")

def main():
    logging.info("Main started")
    delay_message(2, num_word_mapping[2])
    delay_message(3, num_word_mapping[3])
    logging.info("Main Ended")

main()
12:39:00:MainThread:Main started
12:39:00:MainThread:TWO received
12:39:02:MainThread:Printing TWO
12:39:02:MainThread:THREE received
12:39:05:MainThread:Printing THREE
12:39:05:MainThread:Main Ended
Concurrency with Threading
Here we use Python's threading module to run multiple invocations of delay_message on separate non-daemon threads. Unsurprisingly, the program finishes two seconds faster than the synchronous version above: the OS swaps a thread out while it is idle (sleeping). You can think of sleeping as a stand-in for any system call that waits on the external environment.
import logging
import threading
import time

logger_format = '%(asctime)s:%(threadName)s:%(message)s'
logging.basicConfig(format=logger_format, level=logging.INFO, datefmt="%H:%M:%S")
num_word_mapping = {1: 'ONE', 2: 'TWO', 3: "THREE", 4: "FOUR", 5: "FIVE", 6: "SIX", 7: "SEVEN", 8: "EIGHT", 9: "NINE", 10: "TEN"}

def delay_message(delay, message):
    logging.info(f"{message} received")
    time.sleep(delay)
    logging.info(f"Printing {message}")

def main():
    logging.info("Main started")
    threads = [threading.Thread(target=delay_message, args=(delay, message))
               for delay, message in zip([2, 3], [num_word_mapping[2], num_word_mapping[3]])]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()  # waits for the thread to complete its task
    logging.info("Main Ended")

main()
12:39:05:MainThread:Main started
12:39:05:Thread-4:TWO received
12:39:05:Thread-5:THREE received
12:39:07:Thread-4:Printing TWO
12:39:08:Thread-5:Printing THREE
12:39:08:MainThread:Main Ended
Thread pool
Though threads are lightweight, creating and destroying a large number of them is expensive. concurrent.futures is built on top of the threading module and provides a neat interface for creating thread pools and process pools. Instead of spawning a new thread for every function call, it reuses an existing thread from the pool.
import concurrent.futures as cf
import logging
import time

logger_format = '%(asctime)s:%(threadName)s:%(message)s'
logging.basicConfig(format=logger_format, level=logging.INFO, datefmt="%H:%M:%S")
num_word_mapping = {1: 'ONE', 2: 'TWO', 3: "THREE", 4: "FOUR", 5: "FIVE", 6: "SIX", 7: "SEVEN", 8: "EIGHT", 9: "NINE", 10: "TEN"}

def delay_message(delay, message):
    logging.info(f"{message} received")
    time.sleep(delay)
    logging.info(f"Printing {message}")
    return message

if __name__ == '__main__':
    with cf.ThreadPoolExecutor(max_workers=2) as executor:
        future_to_mapping = {executor.submit(delay_message, i, num_word_mapping[i]): num_word_mapping[i]
                             for i in range(2, 4)}
        for future in cf.as_completed(future_to_mapping):
            logging.info(f"{future.result()} Done")
10:42:36:ThreadPoolExecutor-0_0:TWO received
10:42:36:ThreadPoolExecutor-0_1:THREE received
10:42:38:ThreadPoolExecutor-0_0:Printing TWO
10:42:38:MainThread:TWO Done
10:42:39:ThreadPoolExecutor-0_1:Printing THREE
10:42:39:MainThread:THREE Done
Concurrency with AsyncIO
- Coroutine: Unlike a conventional function with a single point of exit, a coroutine can pause and resume its execution. Creating one is as simple as putting the async keyword before a function definition.
- Event Loop or Coordinator: A coroutine that manages other coroutines. You can think of it as a scheduler or master.
- Awaitable object: Coroutines, Tasks, and Futures are awaitable objects. A coroutine can await on awaitable objects; while it is awaiting, its execution is temporarily suspended and resumes after the Future is done.
import asyncio
import logging

logger_format = '%(asctime)s:%(threadName)s:%(message)s'
logging.basicConfig(format=logger_format, level=logging.INFO, datefmt="%H:%M:%S")
num_word_mapping = {1: 'ONE', 2: 'TWO', 3: "THREE", 4: "FOUR", 5: "FIVE", 6: "SIX", 7: "SEVEN", 8: "EIGHT", 9: "NINE", 10: "TEN"}

async def delay_message(delay, message):
    logging.info(f"{message} received")
    await asyncio.sleep(delay)  # time.sleep is a blocking call and cannot be awaited; use asyncio.sleep instead
    logging.info(f"Printing {message}")

async def main():
    logging.info("Main started")
    logging.info(f'Current registered tasks: {len(asyncio.all_tasks())}')
    logging.info("Creating tasks")
    task_1 = asyncio.create_task(delay_message(2, num_word_mapping[2]))
    task_2 = asyncio.create_task(delay_message(3, num_word_mapping[3]))
    logging.info(f'Current registered tasks: {len(asyncio.all_tasks())}')
    await task_1  # suspends main and gives control back to the event loop until the task completes
    await task_2
    logging.info("Main Ended")

if __name__ == '__main__':
    asyncio.run(main())  # creates an event loop and runs main() on it
07:35:32:MainThread:Main started
07:35:32:MainThread:Current registered tasks: 1
07:35:32:MainThread:Creating tasks
07:35:32:MainThread:Current registered tasks: 3
07:35:32:MainThread:TWO received
07:35:32:MainThread:THREE received
07:35:34:MainThread:Printing TWO
07:35:35:MainThread:Printing THREE
07:35:35:MainThread:Main Ended
Though the program runs on a single thread, it achieves the same performance as the multithreaded code through cooperative multitasking.
A better way to create AsyncIO tasks
asyncio.gather lets you create and await multiple tasks in one shot.
import asyncio
import logging

logger_format = '%(asctime)s:%(threadName)s:%(message)s'
logging.basicConfig(format=logger_format, level=logging.INFO, datefmt="%H:%M:%S")
num_word_mapping = {1: 'ONE', 2: 'TWO', 3: "THREE", 4: "FOUR", 5: "FIVE", 6: "SIX", 7: "SEVEN", 8: "EIGHT", 9: "NINE", 10: "TEN"}

async def delay_message(delay, message):
    logging.info(f"{message} received")
    await asyncio.sleep(delay)  # non-blocking sleep; time.sleep cannot be awaited
    logging.info(f"Printing {message}")

async def main():
    logging.info("Main started")
    logging.info("Creating multiple tasks with asyncio.gather")
    await asyncio.gather(*[delay_message(i + 1, num_word_mapping[i + 1]) for i in range(5)])  # awaits completion of all tasks
    logging.info("Main Ended")

if __name__ == '__main__':
    asyncio.run(main())  # creates an event loop and runs main() on it
08:09:20:MainThread:Main started
08:09:20:MainThread:ONE received
08:09:20:MainThread:TWO received
08:09:20:MainThread:THREE received
08:09:20:MainThread:FOUR received
08:09:20:MainThread:FIVE received
08:09:21:MainThread:Printing ONE
08:09:22:MainThread:Printing TWO
08:09:23:MainThread:Printing THREE
08:09:24:MainThread:Printing FOUR
08:09:25:MainThread:Printing FIVE
08:09:25:MainThread:Main Ended
Caution about Blocking Calls in AsyncIO Tasks
As noted earlier, an asyncio task has exclusive use of the CPU until it volunteers to give it up. If a blocking call sneaks into your task by mistake, it will stall the progress of the whole program.
import asyncio
import logging
import time

logger_format = '%(asctime)s:%(threadName)s:%(message)s'
logging.basicConfig(format=logger_format, level=logging.INFO, datefmt="%H:%M:%S")
num_word_mapping = {1: 'ONE', 2: 'TWO', 3: "THREE", 4: "FOUR", 5: "FIVE", 6: "SIX", 7: "SEVEN", 8: "EIGHT", 9: "NINE", 10: "TEN"}

async def delay_message(delay, message):
    logging.info(f"{message} received")
    if message != 'THREE':
        await asyncio.sleep(delay)  # non-blocking call; gives up execution
    else:
        time.sleep(delay)  # blocking call; stalls the event loop
    logging.info(f"Printing {message}")

async def main():
    logging.info("Main started")
    logging.info("Creating multiple tasks with asyncio.gather")
    await asyncio.gather(*[delay_message(i + 1, num_word_mapping[i + 1]) for i in range(5)])  # awaits completion of all tasks
    logging.info("Main Ended")

if __name__ == '__main__':
    asyncio.run(main())  # creates an event loop and runs main() on it
11:07:31:MainThread:Main started
11:07:31:MainThread:Creating multiple tasks with asyncio.gather
11:07:31:MainThread:ONE received
11:07:31:MainThread:TWO received
11:07:31:MainThread:THREE received
11:07:34:MainThread:Printing THREE
11:07:34:MainThread:FOUR received
11:07:34:MainThread:FIVE received
11:07:34:MainThread:Printing ONE
11:07:34:MainThread:Printing TWO
11:07:38:MainThread:Printing FOUR
11:07:39:MainThread:Printing FIVE
11:07:39:MainThread:Main Ended
When delay_message receives the message THREE, it makes a blocking call and doesn't give control back to the event loop until that call completes, stalling every other task. Hence, this run takes three seconds longer than the previous one. Though this example looks contrived, it can easily happen if you are not careful. Threading, by contrast, is preemptive: the OS switches a thread out while it waits on a blocking call.
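If a blocking call is genuinely unavoidable inside a coroutine, one escape hatch (not used in this article's examples) is asyncio.to_thread, available since Python 3.9, which runs the blocking function on a worker thread so the event loop stays responsive. A minimal sketch:

import asyncio
import time

def blocking_io(delay):
    time.sleep(delay)  # a blocking call we cannot rewrite
    return delay

async def main():
    # await the blocking function without stalling the event loop;
    # it runs in a separate worker thread behind the scenes
    result = await asyncio.to_thread(blocking_io, 2)
    print(f"Slept {result} seconds without blocking the loop")

asyncio.run(main())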
Race Conditions
Multithreaded code can quickly fall apart if it doesn't account for race conditions. Things get especially tricky with external libraries, since we need to verify that they support multithreaded use. For example, the Session object of the popular requests module is not thread-safe, so sharing one Session across threads to parallelize network requests can produce unintended results.
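A common workaround is to give every thread its own Session via threading.local. This is a sketch of mine, not from the original article; the fetch function and its url argument are illustrative.

import threading

import requests

thread_local = threading.local()

def get_session():
    # lazily create one Session per thread, so no Session object is
    # ever shared across threads
    if not hasattr(thread_local, "session"):
        thread_local.session = requests.Session()
    return thread_local.session

def fetch(url):
    response = get_session().get(url)
    return response.status_code

The example below shows the underlying problem itself: a race condition on a simple shared object.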
import concurrent.futures as cf
import logging
import time

logger_format = '%(asctime)s:%(threadName)s:%(message)s'
logging.basicConfig(format=logger_format, level=logging.INFO, datefmt="%H:%M:%S")

class DbUpdate:
    def __init__(self):
        self.value = 0

    def update(self):
        logging.info("Update Started")
        logging.info("Sleeping")
        time.sleep(2)  # the thread gets switched while sleeping
        logging.info("Reading Value From Db")
        tmp = self.value**2 + 1
        logging.info("Updating Value")
        self.value = tmp
        logging.info("Update Finished")

db = DbUpdate()
with cf.ThreadPoolExecutor(max_workers=5) as executor:
    updates = [executor.submit(db.update) for _ in range(2)]
logging.info(f"Final value is {db.value}")
20:28:15:ThreadPoolExecutor-0_0:Update Started
20:28:15:ThreadPoolExecutor-0_0:Sleeping
20:28:15:ThreadPoolExecutor-0_1:Update Started
20:28:15:ThreadPoolExecutor-0_1:Sleeping
20:28:17:ThreadPoolExecutor-0_0:Reading Value From Db
20:28:17:ThreadPoolExecutor-0_1:Reading Value From Db
20:28:17:ThreadPoolExecutor-0_0:Updating Value
20:28:17:ThreadPoolExecutor-0_1:Updating Value
20:28:17:ThreadPoolExecutor-0_1:Update Finished
20:28:17:ThreadPoolExecutor-0_0:Update Finished
20:28:17:MainThread:Final value is 1
The final value should ideally be 2. However, due to preemptive thread swapping, thread 0 was swapped out after reading the value but before writing it back, so both threads read 0 and the final value ended up as 1. We have to use locks to prevent this from happening.
import concurrent.futures as cf
import logging
import threading
import time

LOCK = threading.Lock()
logger_format = '%(asctime)s:%(threadName)s:%(message)s'
logging.basicConfig(format=logger_format, level=logging.INFO, datefmt="%H:%M:%S")

class DbUpdate:
    def __init__(self):
        self.value = 0

    def update(self):
        logging.info("Update Started")
        logging.info("Sleeping")
        time.sleep(2)  # the thread gets switched while sleeping
        with LOCK:  # the read-modify-write below is now atomic across threads
            logging.info("Reading Value From Db")
            tmp = self.value**2 + 1
            logging.info("Updating Value")
            self.value = tmp
            logging.info("Update Finished")

db = DbUpdate()
with cf.ThreadPoolExecutor(max_workers=5) as executor:
    updates = [executor.submit(db.update) for _ in range(2)]
logging.info(f"Final value is {db.value}")
21:02:45:ThreadPoolExecutor-0_0:Update Started
21:02:45:ThreadPoolExecutor-0_0:Sleeping
21:02:45:ThreadPoolExecutor-0_1:Update Started
21:02:45:ThreadPoolExecutor-0_1:Sleeping
21:02:47:ThreadPoolExecutor-0_0:Reading Value From Db
21:02:47:ThreadPoolExecutor-0_0:Updating Value
21:02:47:ThreadPoolExecutor-0_0:Update Finished
21:02:47:ThreadPoolExecutor-0_1:Reading Value From Db
21:02:47:ThreadPoolExecutor-0_1:Updating Value
21:02:47:ThreadPoolExecutor-0_1:Update Finished
21:02:47:MainThread:Final value is 2
Race Conditions are Rare with AsyncIO
Since a task has complete control over when it suspends its execution, race conditions are rare with asyncio.
import asyncio
import logging

logger_format = '%(asctime)s:%(threadName)s:%(message)s'
logging.basicConfig(format=logger_format, level=logging.INFO, datefmt="%H:%M:%S")

class DbUpdate:
    def __init__(self):
        self.value = 0

    async def update(self):
        logging.info("Update Started")
        logging.info("Sleeping")
        await asyncio.sleep(1)
        logging.info("Reading Value From Db")
        tmp = self.value**2 + 1
        logging.info("Updating Value")
        self.value = tmp
        logging.info("Update Finished")

async def main():
    db = DbUpdate()
    await asyncio.gather(*[db.update() for _ in range(2)])
    logging.info(f"Final value is {db.value}")

asyncio.run(main())
20:35:49:MainThread:Update Started
20:35:49:MainThread:Sleeping
20:35:49:MainThread:Update Started
20:35:49:MainThread:Sleeping
20:35:50:MainThread:Reading Value From Db
20:35:50:MainThread:Updating Value
20:35:50:MainThread:Update Finished
20:35:50:MainThread:Reading Value From Db
20:35:50:MainThread:Updating Value
20:35:50:MainThread:Update Finished
20:35:50:MainThread:Final value is 2
As you can see, once a task resumed after sleeping, it didn't give up control until its coroutine finished. With threading, thread swaps are not very obvious; with asyncio, we control exactly where a coroutine suspends. Nonetheless, things can still go wrong when coroutines wait on each other, as in the snippet below.
import asyncio

async def foo():
    await boo()

async def boo():
    await foo()

async def main():
    # foo and boo await each other, so each call spawns another call;
    # the program makes no progress and eventually raises RecursionError
    await asyncio.gather(*[foo(), boo()])

asyncio.run(main())
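Conversely, when a coroutine genuinely must await in the middle of a read-modify-write, asyncio provides its own Lock. A minimal sketch of mine, reusing the shape of the DbUpdate example above:

import asyncio

class DbUpdate:
    def __init__(self):
        self.value = 0
        self.lock = asyncio.Lock()

    async def update(self):
        # the lock spans the await, so no other coroutine can interleave
        # between the read and the write
        async with self.lock:
            tmp = self.value**2 + 1
            await asyncio.sleep(1)  # suspension point inside the critical section
            self.value = tmp

async def main():
    db = DbUpdate()
    await asyncio.gather(*[db.update() for _ in range(2)])
    print(db.value)  # always 2, since the updates cannot interleave

asyncio.run(main())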
Multiprocessing
As mentioned before, multiprocessing comes in really handy for CPU-intensive programs. The code below runs merge sort on 1000 lists of 30000 elements each. Bear with me if the implementation of merge sort is a bit clumsy.
Synchronous version
import logging
import math

import numpy as np

logger_format = '%(asctime)s:%(threadName)s:%(message)s'
logging.basicConfig(format=logger_format, level=logging.INFO, datefmt="%H:%M:%S")

r_lists = [[np.random.randint(500000) for _ in range(30000)] for _ in range(1000)]

def merge(l_1, l_2):
    # merge two sorted lists into one sorted list
    out = []
    key_1 = 0
    key_2 = 0
    for i in range(len(l_1) + len(l_2)):
        if l_1[key_1] < l_2[key_2]:
            out.append(l_1[key_1])
            key_1 += 1
            if key_1 == len(l_1):
                out = out + l_2[key_2:]
                break
        else:
            out.append(l_2[key_2])
            key_2 += 1
            if key_2 == len(l_2):
                out = out + l_1[key_1:]
                break
    return out

def merge_sort(l):
    if len(l) == 1:
        return l
    mid_point = math.floor((len(l) + 1) / 2)
    l_1, l_2 = merge_sort(l[:mid_point]), merge_sort(l[mid_point:])
    out = merge(l_1, l_2)
    del l_1, l_2
    return out

if __name__ == '__main__':
    logging.info("Starting Sorting")
    for r_list in r_lists:
        _ = merge_sort(r_list)
    logging.info("Sorting Completed")
21:24:07:MainThread:Starting Sorting
21:26:10:MainThread:Sorting Completed
Multiprocessing version
import concurrent.futures as cf
import logging
import math

import numpy as np

logger_format = '%(asctime)s:%(threadName)s:%(message)s'
logging.basicConfig(format=logger_format, level=logging.INFO, datefmt="%H:%M:%S")

r_lists = [[np.random.randint(500000) for _ in range(30000)] for _ in range(1000)]

def merge(l_1, l_2):
    # merge two sorted lists into one sorted list
    out = []
    key_1 = 0
    key_2 = 0
    for i in range(len(l_1) + len(l_2)):
        if l_1[key_1] < l_2[key_2]:
            out.append(l_1[key_1])
            key_1 += 1
            if key_1 == len(l_1):
                out = out + l_2[key_2:]
                break
        else:
            out.append(l_2[key_2])
            key_2 += 1
            if key_2 == len(l_2):
                out = out + l_1[key_1:]
                break
    return out

def merge_sort(l):
    if len(l) == 1:
        return l
    mid_point = math.floor((len(l) + 1) / 2)
    l_1, l_2 = merge_sort(l[:mid_point]), merge_sort(l[mid_point:])
    out = merge(l_1, l_2)
    del l_1, l_2
    return out

if __name__ == '__main__':
    logging.info("Starting Sorting")
    with cf.ProcessPoolExecutor() as executor:
        sorted_lists_futures = [executor.submit(merge_sort, r_list) for r_list in r_lists]
    logging.info("Sorting Completed")
21:29:33:MainThread:Starting Sorting
21:30:03:MainThread:Sorting Completed
By default, the number of worker processes equals the number of processors on the machine, and the improvement in execution time between the two versions is considerable.
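To make that default explicit: ProcessPoolExecutor accepts a max_workers argument, and executor.map is a compact alternative to submit when you want results back in input order. A minimal sketch of mine, reusing merge_sort and r_lists from the code above:

import os
import concurrent.futures as cf

if __name__ == '__main__':
    # max_workers defaults to os.cpu_count(); passing it explicitly
    # caps the number of worker processes
    with cf.ProcessPoolExecutor(max_workers=os.cpu_count()) as executor:
        # map preserves input order and yields results as they complete
        sorted_lists = list(executor.map(merge_sort, r_lists))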
Taken from https://medium.com/analytics-vidhya/asyncio-threading-and-multiprocessing-in-python-4f5ff6ca75e8