Encoding

Data exists in many forms, formats and modalities. Even missing data is a form of data.

Regardless of its form, format or modality, all data must ultimately be transformed into a matrix (or matrices) of numbers before a machine learning or data mining algorithm can use it.

The process of transforming raw data into numeric matrices is often referred to as Encoding, and the resulting numeric matrices are referred to as Representations. A representation may emphasize certain aspects of the data (structure, content, semantics, etc.) over others, which makes it more suitable for some tasks and less suitable for others.
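As a minimal sketch of what encoding can look like, the example below turns a few raw records into a numeric matrix using NumPy. The records, the field names (color, weight) and the `encode` helper are hypothetical and chosen only for illustration. The categorical field is one-hot encoded, and the missing weight is kept as information through an explicit indicator column, which is one of several possible ways to treat missing data as data.

```python
import numpy as np

# Hypothetical raw records: (color, weight_kg); None marks a missing weight.
records = [("red", 1.5), ("green", None), ("red", 2.0)]

# Build the set of categorical values in a stable (sorted) order.
colors = sorted({color for color, _ in records})  # ['green', 'red']


def encode(record):
    """Turn one raw record into a list of floats."""
    color, weight = record
    # One-hot encode the color: one column per known color.
    one_hot = [1.0 if color == c else 0.0 for c in colors]
    # Encode missingness explicitly: a placeholder value plus a 0/1 indicator.
    value = 0.0 if weight is None else float(weight)
    missing = 1.0 if weight is None else 0.0
    return one_hot + [value, missing]


# Rows are records; columns are [is_green, is_red, weight, weight_missing].
X = np.array([encode(r) for r in records])
print(X.shape)  # (3, 4)
```

This is only one possible representation of these records. A different choice, such as mapping each color to an integer code, would produce a different matrix that may suit some algorithms better and others worse.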

Modalities of Data

In this section, we will discuss the most common modalities of data and how to work with them in Python.

There are too many data modalities to enumerate exhaustively. Each modality has its own characteristics and often requires specialized methods to process and analyze. The study of a single modality is often an area of research in its own right. For example, Computer Vision deals with images and videos, Speech Processing deals with sounds, and Natural Language Processing (NLP) deals with text.

Some common modalities and their associated areas of research are:

  1. Text: Natural Language Processing, Computational Linguistics
  2. Images: Computer Vision, Digital Image Processing
  3. Sounds: Speech Processing, Digital Signal Processing (DSP)
  4. Graphs: Graph Theory, Network Theory
  5. Time Series: Time Series Analysis
  6. Geographic: Geographic Information Systems (GIS), Spatial Computing
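To make the connection between modalities and numeric matrices concrete, the sketch below shows one simple, illustrative NumPy representation for four of the modalities listed above. All values are made up for the example, and each choice (word counts for text, an adjacency matrix for a graph, and so on) is just one of many possible encodings, not a recommended or canonical one.

```python
import numpy as np

# Text: count how often each vocabulary word occurs (a bag-of-words vector).
text = "the cat sat on the mat"
tokens = text.split()
vocab = sorted(set(tokens))  # ['cat', 'mat', 'on', 'sat', 'the']
text_vec = np.array([tokens.count(word) for word in vocab])

# Image: a 2x2 RGB image as a height x width x channel array of intensities.
image = np.zeros((2, 2, 3), dtype=np.uint8)
image[0, 0] = [255, 0, 0]  # top-left pixel is pure red

# Graph: an undirected graph on 3 nodes as a symmetric adjacency matrix.
edges = [(0, 1), (1, 2)]
adjacency = np.zeros((3, 3), dtype=int)
for i, j in edges:
    adjacency[i, j] = adjacency[j, i] = 1

# Time series: evenly spaced observations as a 1-D array.
series = np.array([20.1, 20.4, 19.8, 21.0])

for name, arr in [("text", text_vec), ("image", image),
                  ("graph", adjacency), ("time series", series)]:
    print(name, arr.shape)
```

Note that the arrays differ in shape and meaning: the same numeric container holds word counts, pixel intensities, connectivity, and measurements over time, which is part of why each modality tends to need its own methods.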