Single Cell Data Preprocessing

When I started my PhD back in 2009, next generation sequencing (NGS) technologies had already been around for a while and were being adopted by many labs. My impression back then was that it was difficult to learn exactly how those methods worked. There were many reasons for that, and one of them was the reliance on commercial kits for library preparation. As a beginner, you knew roughly what the kit was doing: end repair, A-tailing, ligating the adapters to the fragments, etc. However, many details were missing due to the black box nature of the kit. For example, what enzymes are used in each step? What are the oligo sequences? Are there any modifications on the oligos? Those things are critical for understanding the method.

As more and more people started to use NGS technology, many … probably too many (see here and here) … methods based on NGS were developed using off-the-shelf reagents. You started to read about them and became increasingly familiar with NGS technology, especially the Illumina platforms, since most methods were based on Illumina machines. Then you realised: well … things are becoming a bit clearer, and it is not that difficult. All those different methods, commercial kits included, do only one thing: add sequencing adapters to the DNA or cDNA fragments you are interested in. Even better, the adapter sequences are now known, even though they are still regarded as secrets.

However, when you actually looked at the protocols of those methods, you probably realised that it was still difficult to figure out exactly how the adapters are attached to the fragments in the desired orientation. The procedures are rather complicated, especially for single cell methods. To help visualise what happens in each step, I created the scg_lib_structs GitHub repository five years ago. One purpose of the repository is to help with experimental troubleshooting, which I explained in a previous post. There is another reason for creating the repository: to help with data preprocessing.

In single cell genomics (and many other areas of genomics), when we talk about data analysis, there are actually two main steps. The first step is data preprocessing, a procedure that converts the raw data (fastq files) into count matrices, such as gene by cell or peak by cell matrices. The second step is the downstream analysis, which cannot be turned into a standard pipeline because it varies a lot depending on the purpose of the project. The first step, however, can be standardised. To do that, we need to know the library structure and the read components of each method, which can be easily viewed in the scg_lib_structs GitHub repository. In the early days of single cell genomics, one might have had to chain various tools together and write custom scripts to preprocess data from different methods. This was difficult and confusing for beginners, and sometimes even for experienced bioinformaticians. In addition, it is difficult to compare methods if different pipelines are used to preprocess their data. With the fast development on the computational side of single cell genomics, it is now possible to simply use STARsolo and chromap for most scRNA-seq and scATAC-seq methods.
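To make the idea of "read components" a bit more concrete, here is a minimal Python sketch of what preprocessing boils down to once you know the library structure. It is only an illustration, not how STARsolo or any other tool is implemented. The layout (a 16 bp cell barcode followed by a 12 bp UMI at the start of Read 1) is an assumption chosen for the example, and the names `split_read1` and `count_matrix` are hypothetical. The gene assignment of each read is taken as given, since in practice that comes from an aligner.

```python
from collections import defaultdict

# Assumed read layout, for illustration only: the first 16 bases of Read 1
# are the cell barcode and the next 12 bases are the UMI. Other methods use
# different lengths and positions, so check the library structure first.
CB_LEN = 16
UMI_LEN = 12


def split_read1(seq, cb_len=CB_LEN, umi_len=UMI_LEN):
    """Split a Read 1 sequence into (cell barcode, UMI)."""
    if len(seq) < cb_len + umi_len:
        raise ValueError("read is shorter than barcode + UMI")
    return seq[:cb_len], seq[cb_len:cb_len + umi_len]


def count_matrix(records):
    """Build sparse gene-by-cell UMI counts.

    `records` is an iterable of (read1_sequence, gene) pairs. The gene is
    assumed to have been assigned already by aligning the cDNA read.
    Returns a dict mapping (gene, cell barcode) to the number of distinct UMIs.
    """
    umis = defaultdict(set)
    for read1, gene in records:
        cb, umi = split_read1(read1)
        umis[(gene, cb)].add(umi)  # identical UMIs collapse into one molecule
    return {key: len(seen) for key, seen in umis.items()}


# Toy input: two cells, two genes, one PCR duplicate.
records = [
    ("A" * 16 + "C" * 12 + "TTTT", "GeneA"),
    ("A" * 16 + "C" * 12 + "TTTT", "GeneA"),  # same barcode and UMI: duplicate
    ("A" * 16 + "G" * 12 + "TTTT", "GeneA"),
    ("T" * 16 + "C" * 12, "GeneB"),
]
counts = count_matrix(records)
```

The sketch skips barcode whitelists, error correction and alignment, but the barcode and UMI positions it hard-codes are exactly the kind of method-specific information a dedicated tool needs to be told.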

Since I like ReadTheDocs and wanted to learn how to use it, I created a site to document how to use STARsolo for scRNA-seq and chromap + MACS for scATAC-seq data preprocessing. I focused a lot on “what you should do if you have done the experiment yourself using a specific method”. It was a fun exercise.

Click here to look at the preprocessing pipelines for many scRNA-seq and scATAC-seq methods.