Compact Data Structures

Course leaders: Dr Travis Gagie and Dr Nicola Prezza

Home institutions: Universidad Diego Portales, Chile - University of Pisa, Italy

 

Course pre-requisite(s): Basic knowledge of algorithms, data structures and discrete mathematics.

Course Overview

The emerging field of compact data structures lies in the intersection of combinatorics, data compression and data structures, and addresses the problem of how we can store massive datasets space-efficiently while still supporting fast queries on them. Its practical important is growing in the current era of big data, as well as its theoretical depth and elegance.

This course will follow Gonzalo Navarro's recent text, "Compact Data Structures: A Practical Approach". T

Learning Outcomes

By the end of the course the students should be able to give at least rough descriptions of the compact data structures they have seen and understand descriptions of new ones (so they can, e.g., review conference submissions on this topic). They should also be able to design simple extensions of the structures covered in the course and implement them using the Succinct Data Structure Library (SDSL). Finally, graduate students should be able to discuss open problems and pick a topic for a thesis project, if they wish to continue working in this area.

Course Content

The course will cover one chapter of the text per day:
1) Introduction (motivation, background, notation, SDSL, etc)
2) Entropy and Coding (Shannon's Theorem, Huffman coding, empirical entropy, Jensen's Inequality, etc)
3) Arrays (compact representations, gap coding, Elias-Fano coding, etc)
4) Bitvectors (compact binary strings supporting access, rank and select)
5) Permutations (compressed permutations supporting forward and reverse evaluation, fast exponentiation, etc)
6) Sequences (compact representations supporting access, rank and select)
7) Parentheses (sequences of balanced parentheses, used later to implement trees)
8) Trees (compact representations supporting fast navigation)
9) Graphs (compact representations of certain graph classes, web graphs, K^2-trees, etc)
10) Grids (compact range-reporting data structures, etc)
11) Texts (compressed suffix arrays, the Burrows-Wheeler Transform, FM-indexes, compressed suffix trees)
12) Dynamic Structures (dynamic versions of some of the static structures already covered)
13) Recent Trends (encodings, handling repetitive datasets, compact data structures in external memory --- and compact data structures developed since the text was published)

Instructional Method

The course will be based around daily lectures and homework assignments, which the students will be encourage to complete in pairs or small groups. Students will take turns presenting their solutions at the beginning of each class. Students will be expected to read chapters of the text in advance and engage in problem-solving discussions during the lectures.

Required Course Materials

The students will need access to a copy of Navarro's text, several physical copies and two online copies of which are available through the university library. All other required material will be distributed.

Assessment

There will be short homework assignments --- needing half an hour or so to complete --- each day from Monday to Thursday, with the answers being presented by the students at the beginning of the next class. Students will take turns presenting answers and homework marks will be based on their presentations. There will be longer assignments --- needing an hour or two, possibly with some programming --- over the weekends, which will be corrected by the instructors. There will be a 2-hour exam on the last day.