Summary for “Neo: A Learned Query Optimizer”
This is my summary of the paper Neo: A Learned Query Optimizer. The paper works toward the vision of learned components in database management systems (DBMS).
A learned query optimizer could greatly reduce human engineering costs in a DBMS, since a properly tuned optimizer relies on many hand-crafted heuristics and must be continuously maintained to keep its performance. Although previous approaches have used machine learning for individual modules of the optimization pipeline, none of them builds an end-to-end learned query optimizer or demonstrates state-of-the-art performance on real queries. The authors are the first to show that a fully learned query optimizer is feasible: Neo achieves performance comparable to state-of-the-art commercial optimizers.
The first step in Neo is Expertise Collection. Given a sample workload (i.e., a set of queries representative of the system’s overall workload) and an Expert Optimizer, Neo uses the Expert Optimizer to generate a query execution plan (QEP) for each query, executes the plans, and saves the resulting plan-latency pairs as supervised training samples in Neo’s Experience.
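A minimal sketch of this collection loop might look like the following (the `sample_workload`, `expert_optimizer`, and `engine` objects and their methods are hypothetical stand-ins, not Neo’s actual interfaces):

```python
# Hypothetical sketch of Neo's Expertise Collection step: run the expert
# optimizer over a representative workload, execute the resulting plans, and
# store (query, plan, latency) triples in Neo's Experience.
def collect_expertise(sample_workload, expert_optimizer, engine):
    experience = []
    for query in sample_workload:
        plan = expert_optimizer.plan(query)   # expert-generated QEP
        latency = engine.execute(plan)        # observed cost (default metric)
        experience.append((query, plan, latency))
    return experience
```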
The subsequent step is Model Building. Neo uses a deep neural network (DNN) to process a partial QEP Pi and predict the best possible cost of any complete QEP Pj that contains Pi as a subplan (the cost metric is user-definable; the default is query latency). A QEP’s encoding has two parts: a query-level encoding and a plan-level encoding. The query-level encoding is the flattened upper triangle of the (symmetric) join-graph adjacency matrix concatenated with the predicate encoding. Three methods are used for predicate encoding, listed in increasing order of complexity and representational power: one-hot encoding, histogram-based encoding, and R-Vector. R-Vector uses a natural language processing (NLP) model (e.g., word2vec) to extract semantic information from each query’s predicate text; the model is first trained on the data inside the database to learn meaningful representations of the data values, and an optional table denormalization step before training can further improve the R-Vector representation. The plan-level encoding preserves the QEP’s tree structure as an inductive bias: the QEP is represented as a tree in which each node is a vector encoding the physical join operator at that node (merge join, loop join, etc.) as well as the predicates used in the subtree rooted at that node.
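To make the query-level encoding concrete, here is a small illustrative sketch of the simplest (one-hot) variant; the tables, columns, and query are made-up examples, not taken from the paper:

```python
import numpy as np

tables = ["A", "B", "C"]                    # tables referenced by the query
columns = ["A.x", "B.y", "B.z", "C.w"]      # columns that may carry predicates

def encode_query(joins, predicate_columns):
    """Flatten the join-graph upper triangle and append a one-hot predicate vector."""
    n = len(tables)
    adj = np.zeros((n, n))
    for t1, t2 in joins:                    # mark join edges (symmetric)
        i, j = tables.index(t1), tables.index(t2)
        adj[i, j] = adj[j, i] = 1.0
    join_vec = adj[np.triu_indices(n, k=1)] # upper triangle without the diagonal
    pred_vec = np.array([1.0 if c in predicate_columns else 0.0 for c in columns])
    return np.concatenate([join_vec, pred_vec])

# e.g. SELECT ... FROM A, B, C WHERE A.id = B.id AND B.id = C.id AND B.y = 5
vec = encode_query(joins=[("A", "B"), ("B", "C")], predicate_columns={"B.y"})
# -> [1, 0, 1, 0, 1, 0, 0]  (join edges A-B and B-C; predicate on B.y)
```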
The deep neural network has the following architecture. The query-level encoding is first fed through several fully connected layers, producing a query-level hidden representation. This hidden representation is then concatenated to every node-level vector in the plan-level encoding tree, augmenting the plan-level encoding with query-level information. The augmented tree then passes through a stack of tree convolution layers, each of which convolves over the local neighborhood formed by a node and its (at most two) children and applies a non-linear activation. Afterwards, the tree representation is pooled into a fixed-size vector and passed through further fully connected layers until the output is a single scalar: the predicted best possible cost of this QEP. The network is trained with a standard L2 loss between the prediction and the ground-truth cost.
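Below is a heavily simplified PyTorch sketch of this architecture; the layer sizes, the single shared "tree convolution" filter over (node, left child, right child) triples, and the max pooling are my own simplifying assumptions, not the paper’s exact design:

```python
import torch
import torch.nn as nn

class NeoValueNet(nn.Module):
    """Toy version of Neo's value network: query MLP + tree conv + pooling + head."""
    def __init__(self, query_dim, node_dim, hidden=128):
        super().__init__()
        # query-level encoding -> query-level hidden representation
        self.query_mlp = nn.Sequential(
            nn.Linear(query_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        aug = node_dim + hidden               # node vector + query representation
        # one simplified "tree convolution": a shared filter over a node and
        # its (at most two) children, followed by a non-linear activation
        self.tree_conv = nn.Sequential(nn.Linear(3 * aug, hidden), nn.ReLU())
        # pooled tree representation -> single scalar (predicted best possible cost)
        self.head = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, 1))

    def forward(self, query_enc, node_vecs, children):
        # query_enc: (query_dim,); node_vecs: (num_nodes, node_dim)
        # children[i] = (left_idx, right_idx); -1 means "no child"
        q = self.query_mlp(query_enc)
        aug = torch.cat([node_vecs, q.expand(node_vecs.size(0), -1)], dim=1)
        zero = torch.zeros_like(aug[0])
        triples = torch.stack([
            torch.cat([aug[i],
                       aug[l] if l >= 0 else zero,
                       aug[r] if r >= 0 else zero])
            for i, (l, r) in enumerate(children)
        ])
        conv = self.tree_conv(triples)        # per-node convolved features
        pooled = conv.max(dim=0).values       # pool the tree to a fixed-size vector
        return self.head(pooled)              # predicted best possible cost

# Training would minimize an L2 loss between prediction and observed latency, e.g.
# loss = nn.MSELoss()(net(q, nodes, kids), torch.tensor([latency]))
```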
Neo’s next phase is Plan Search, which performs best-first search with the DNN as its heuristic. The search’s min-heap is initialized with scan-unspecified leaf nodes. At each step, the subplan with the lowest predicted cost is popped from the heap, its children are evaluated by the DNN and pushed onto the heap, and the procedure repeats until a stopping condition is reached. The final step is Model Refinement. When a new QEP is produced for a given query, it is sent to the query execution engine and executed to obtain its actual cost. This yields a new training sample, which is added to Neo’s Experience and used to retrain the DNN. From a reinforcement learning perspective, Neo’s combination of a DNN and a search algorithm is a value iteration approach, because the method alternates between two stages: evaluating the value function under the current policy and using the new values to improve the policy.
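A minimal sketch of the best-first search loop, assuming hypothetical `predict_cost`, `expand`, and `is_complete` callables as stand-ins for Neo’s internals:

```python
import heapq

def plan_search(initial_subplans, predict_cost, expand, is_complete, budget=1000):
    """Best-first search over partial QEPs, using the DNN's prediction as heuristic."""
    # min-heap ordered by predicted cost; the counter breaks ties so plan
    # objects are never compared directly
    heap = [(predict_cost(p), i, p) for i, p in enumerate(initial_subplans)]
    heapq.heapify(heap)
    counter = len(heap)
    while heap and budget > 0:
        _, _, plan = heapq.heappop(heap)      # subplan with lowest predicted cost
        if is_complete(plan):
            return plan                       # simplified stopping condition
        for child in expand(plan):            # children: one more decision specified
            heapq.heappush(heap, (predict_cost(child), counter, child))
            counter += 1
        budget -= 1
    return None
```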
To evaluate Neo’s performance and the effect of its hyperparameters, the authors run Neo on three benchmarks (JOB, TPC-H, and Corp, a proprietary workload) against four DBMS: PostgreSQL, SQLite, SQL Server, and Oracle. After training for 100 episodes with the PostgreSQL optimizer used for bootstrapping, Neo outperforms these DBMS in most cases. Experiments show that Neo reaches performance comparable to each optimizer within at most half a day of training. To justify bootstrapping with an expert optimizer, the authors show that initializing the plan search from randomized join orderings instead is time-consuming and loses information. Additional experiments demonstrate that the optimizer’s behavior can be adapted to specific preferences by altering the cost function.
In conclusion, Neo demonstrates that a query optimizer can adaptively improve by learning from its mistakes, which reduces the human engineering cost of creating and maintaining an optimizer with state-of-the-art performance. Moreover, the tunable cost function means that Neo can be adapted to user-specific goals, offering a level of flexibility unseen in traditional query optimizers.