Chapter 3 Data Preprocessing for Bioinformatics

Neural networks require data to be in a numerical, normalized format. Biological data often requires special encoding.

3.1 Encoding Categorical Data

3.1.1 One-Hot Encoding for DNA Sequences

One-hot encoding is the most common method for representing DNA sequences numerically. Each nucleotide is converted into a four-element binary vector:

  • A -> [1, 0, 0, 0]
  • T -> [0, 1, 0, 0]
  • C -> [0, 0, 1, 0]
  • G -> [0, 0, 0, 1]
  • N -> [0, 0, 0, 0]

The sequence "ATCG" therefore becomes a 2D array of shape (4, 4), or more generally (sequence_length, 4).

import numpy as np

def one_hot_encode_dna(sequence):
    """One-hot encode a DNA sequence into an array of shape (len(sequence), 4)."""
    mapping = {'A': [1, 0, 0, 0], 'T': [0, 1, 0, 0], 'C': [0, 0, 1, 0], 'G': [0, 0, 0, 1]}
    # Unknown characters such as 'N' fall back to the all-zero vector
    one_hot = [mapping.get(base, [0, 0, 0, 0]) for base in sequence]
    return np.array(one_hot)

# Example
seq = "ATCGN"
encoded_seq = one_hot_encode_dna(seq)
print(f"Sequence: {seq}")
print(f"Encoded shape: {encoded_seq.shape}")
print(encoded_seq)

3.2 Handling Variable-Length Sequences

For sequences of different lengths, we use padding: we choose a maximum length, pad shorter sequences with zeros, and truncate longer sequences to that length.

from tensorflow.keras.preprocessing.sequence import pad_sequences

# Assume `list_of_encoded_sequences` is a list of one-hot encoded sequences
padded_sequences = pad_sequences(list_of_encoded_sequences, maxlen=200, padding='post', dtype='float32')
print(f"Padded sequences shape: {padded_sequences.shape}") # (num_samples, 200, 4)

3.3 Data Generators

For large datasets that don’t fit in memory, we stream the data from disk in batches, either with a custom Python generator (shown below) or with the tf.data.Dataset API (see the sketch at the end of this section).

# Example: a Python generator that reads sequences from text files
def file_data_generator(file_paths, labels, batch_size=32):
    num_samples = len(file_paths)
    while True:  # Loop forever so the generator never terminates between epochs
        for offset in range(0, num_samples, batch_size):
            batch_paths = file_paths[offset:offset + batch_size]
            batch_labels = labels[offset:offset + batch_size]

            batch_sequences = []
            for path in batch_paths:
                with open(path, 'r') as f:
                    seq = f.read().strip()
                batch_sequences.append(one_hot_encode_dna(seq))

            # Pad (or truncate) every sequence in the batch to the same length
            batch_x = pad_sequences(batch_sequences, maxlen=200, padding='post', dtype='float32')
            yield batch_x, np.array(batch_labels)

# Usage
# batch_size = 32
# train_generator = file_data_generator(train_files, train_labels, batch_size=batch_size)
# model.fit(train_generator, epochs=10, steps_per_epoch=len(train_files) // batch_size)
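
For comparison, here is a minimal tf.data.Dataset sketch for the same task. It assumes the same train_files, train_labels, and 200-base maximum length as above; the load_and_encode helper is hypothetical and simply reuses one_hot_encode_dna inside tf.py_function.

import numpy as np
import tensorflow as tf

def load_and_encode(path, label):
    # tf.py_function lets us call plain-Python code (file I/O, one_hot_encode_dna)
    # from inside the tf.data pipeline
    def _read(path_tensor):
        with open(path_tensor.numpy().decode(), 'r') as f:
            seq = f.read().strip()
        encoded = one_hot_encode_dna(seq)[:200]        # truncate to 200 bases
        padded = np.zeros((200, 4), dtype=np.float32)  # zero-pad at the end
        padded[:encoded.shape[0]] = encoded
        return padded
    x = tf.py_function(_read, [path], tf.float32)
    x.set_shape((200, 4))  # restore the static shape lost by py_function
    return x, label

# Usage
# dataset = tf.data.Dataset.from_tensor_slices((train_files, train_labels))
# dataset = dataset.map(load_and_encode).shuffle(1000).batch(32).prefetch(tf.data.AUTOTUNE)
# model.fit(dataset, epochs=10)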