
A Simple Intro to Neural Networks

MD Rashid Hussain
Jun-2025  -  10 minutes to read
Image source attributed to: https://www.geeksforgeeks.org

Neural networks are the backbone of modern AI systems. From image classification to natural language processing, they underpin most of the breakthroughs in deep learning. Yet, for many engineers, neural networks can feel more like magic than math. This article aims to demystify them.

At its core, a neural network is a computational model inspired by the brain, consisting of layers of interconnected nodes (neurons) that transform input data into output predictions.

A neural network is a function approximator: it learns to approximate a function f(x) that maps inputs to outputs by adjusting weights and biases through training.

They are used for:

  • Classification (e.g., spam detection)
  • Regression (e.g., house price prediction)
  • Generation (e.g., image synthesis, text generation)
  • Control (e.g., robotics)

A single neuron takes weighted inputs, applies a non-linear activation, and produces an output:

z = w_1x_1 + w_2x_2 + \dots + w_nx_n + b
output = \phi(z)

Where:

  • w_i are the weights
  • b is the bias
  • \phi is an activation function (e.g., ReLU, sigmoid)
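
A minimal sketch of a single neuron in Python with NumPy; the input, weight, and bias values below are made up purely for illustration:

    import numpy as np

    def sigmoid(z):
        # Squashes z into the range (0, 1).
        return 1.0 / (1.0 + np.exp(-z))

    # Illustrative values; in a real network, w and b are learned during training.
    x = np.array([0.5, -1.2, 3.0])   # inputs x_1 ... x_n
    w = np.array([0.8, 0.1, -0.4])   # weights w_1 ... w_n
    b = 0.2                          # bias

    z = np.dot(w, x) + b             # weighted sum w_1*x_1 + ... + w_n*x_n + b
    output = sigmoid(z)              # non-linear activation phi(z)
    print(z, output)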

Neurons are stacked into:

  • Input layer: receives raw features
  • Hidden layers: extract and combine features
  • Output layer: produces final predictions

The depth and width of the layers determine the capacity of the model: the depth is the number of hidden layers, and the width is the number of neurons in each hidden layer, as in the sketch below.
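
For example, a hypothetical PyTorch model with 10 input features, two hidden layers of width 64, and 3 output classes (all sizes here are arbitrary choices for illustration):

    import torch.nn as nn

    # Depth: 2 hidden layers. Width: 64 neurons per hidden layer.
    model = nn.Sequential(
        nn.Linear(10, 64),   # input layer -> hidden layer 1
        nn.ReLU(),
        nn.Linear(64, 64),   # hidden layer 1 -> hidden layer 2
        nn.ReLU(),
        nn.Linear(64, 3),    # hidden layer 2 -> output layer
    )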

Training a neural network involves two main steps:

Forward Pass:

  • Compute the outputs layer-by-layer using current weights
  • Compare the prediction \hat{y} with the actual output y using a loss function (e.g., MSE, cross-entropy)
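
As a small illustration, here is how those two example losses could be computed by hand with NumPy for a single made-up prediction:

    import numpy as np

    y     = np.array([1.0, 0.0, 0.0])   # true label (one-hot)
    y_hat = np.array([0.7, 0.2, 0.1])   # predicted probabilities

    mse = np.mean((y - y_hat) ** 2)            # mean squared error
    ce  = -np.sum(y * np.log(y_hat + 1e-12))   # cross-entropy
    print(mse, ce)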

Backward Pass (Backpropagation):

  • Compute gradients of the loss with respect to each weight using the chain rule
  • Update weights via gradient descent or its variants (e.g., Adam, RMSprop)

This process is repeated over many iterations (epochs) until convergence.
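
A minimal end-to-end training loop, sketched in PyTorch on made-up regression data (model shape, learning rate, and epoch count are all illustrative choices):

    import torch
    import torch.nn as nn

    # Toy data: 256 samples, 10 features, scalar targets.
    x = torch.randn(256, 10)
    y = torch.randn(256, 1)

    model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
    loss_fn = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    for epoch in range(100):         # repeat over many epochs
        y_hat = model(x)             # forward pass: predictions from current weights
        loss = loss_fn(y_hat, y)     # compare y_hat with y

        optimizer.zero_grad()        # clear gradients from the previous step
        loss.backward()              # backward pass: gradients via the chain rule
        optimizer.step()             # update weights (Adam, a gradient-descent variant)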

Without non-linear activations, neural networks would be no more powerful than linear regression. Common activation functions include:

  • ReLU: \max(0, x), introduces sparsity and fast convergence
  • Sigmoid: maps inputs to [0, 1], used in binary classification
  • Tanh: maps inputs to [-1, 1], zero-centered

Choice of activation can greatly impact model performance and training dynamics.
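
The three functions above are simple enough to write directly in NumPy (a minimal sketch):

    import numpy as np

    def relu(x):
        return np.maximum(0, x)           # max(0, x): zero for negative inputs

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))   # maps inputs to (0, 1)

    def tanh(x):
        return np.tanh(x)                 # maps inputs to (-1, 1), zero-centered

    z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
    print(relu(z), sigmoid(z), tanh(z))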

The most common architectures are listed below (a sketch of roughly corresponding PyTorch building blocks follows the list):

  • Feedforward Neural Networks (FNNs): Also called multi-layer perceptrons (MLPs), these are the most basic type: fully connected layers with no memory of previous inputs.
  • Convolutional Neural Networks (CNNs): Use local filters to detect spatial hierarchies. Great for image processing.
  • Recurrent Neural Networks (RNNs): Maintain hidden state to model sequences. Variants like LSTM and GRU mitigate vanishing gradients.
  • Transformers: Use self-attention instead of recurrence. Scalable, parallelizable, and now the default for NLP and more.
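
For orientation, here is how those architectures map onto standard PyTorch building blocks; the layer sizes below are arbitrary placeholders:

    import torch.nn as nn

    fnn = nn.Linear(128, 64)                                 # fully connected (MLP) layer
    cnn = nn.Conv2d(3, 16, kernel_size=3)                    # local filters over image channels
    rnn = nn.LSTM(input_size=32, hidden_size=64)             # sequence model with hidden state
    attn = nn.TransformerEncoderLayer(d_model=64, nhead=8)   # self-attention block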

Common challenges:

  • Overfitting: the model memorizes the training data, so it performs very well on the training set but poorly on unseen test data.
  • Vanishing/Exploding gradients: especially in deep or recurrent networks. Backpropagated gradients are repeatedly multiplied by layer weights, so they explode when the weights are too large and vanish when they are too small (see the numeric illustration after this list).
  • Slow convergence: vanishing or exploding gradients, among other factors, can make the network take a long time to converge.
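
A quick numeric illustration of why repeated multiplication causes trouble in deep chains:

    # One scaling factor per layer in a 50-layer chain of gradients.
    shrink, grow = 0.9, 1.1
    print(shrink ** 50)   # ~0.005 -> gradient vanishes
    print(grow ** 50)     # ~117   -> gradient explodes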

Common solutions:

  • Regularization: prevents overfitting by adding a penalty to the loss function so the network cannot simply memorize the training data; the most common techniques are L1/L2 penalties and dropout.
  • Batch normalization: normalizes the inputs to a layer by subtracting the batch mean and dividing by the batch standard deviation, which stabilizes and speeds up training.
  • Weight initialization: chooses the scale of the initial weights carefully so gradients stay well-behaved; the most common schemes are He and Xavier initialization.
  • Learning rate scheduling: adjusts the learning rate during training based on the network's performance, typically reducing it as progress slows (a combined sketch follows this list).
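
One illustrative way to combine these remedies in PyTorch; the layer sizes, dropout rate, weight decay, and scheduler settings are all placeholder choices:

    import torch
    import torch.nn as nn

    model = nn.Sequential(
        nn.Linear(10, 64),
        nn.BatchNorm1d(64),   # batch normalization of the layer's inputs
        nn.ReLU(),
        nn.Dropout(p=0.5),    # dropout regularization
        nn.Linear(64, 1),
    )

    # He (Kaiming) initialization for the first linear layer.
    nn.init.kaiming_normal_(model[0].weight, nonlinearity="relu")

    # L2 regularization via weight decay, plus a schedule that shrinks the
    # learning rate when the monitored loss stops improving.
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5, patience=5)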

The universal approximation theorem states that a feedforward network with a single hidden layer and sufficiently many neurons can approximate any continuous function on compact subsets of \mathbb{R}^n.
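
One common formal version, roughly Cybenko's statement for sigmoid-like activations (given here as a sketch, not in its most general form): for any continuous function f on a compact set K \subset \mathbb{R}^n and any \varepsilon > 0, there exist N, weights w_i, biases b_i, and coefficients \alpha_i such that

\left| f(x) - \sum_{i=1}^{N} \alpha_i \, \phi(w_i^\top x + b_i) \right| < \varepsilon \quad \text{for all } x \in K

The sum on the left is exactly a single hidden layer of N neurons with activation \phi followed by a linear output layer.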

In practice, depth often matters more than width for generalization and efficiency.

Neural networks aren’t magic. Key limitations include:

  • Require large labeled datasets
  • Can be brittle to distributional shifts: performance may degrade badly on data drawn from a different distribution than the training set.
  • Lack interpretability: it is often hard to explain why the network makes a particular prediction.
  • Vulnerable to adversarial attacks: small, carefully crafted perturbations of the input can flip predictions.

Modern research explores techniques like attention, contrastive learning, few-shot generalization, and self-supervised learning to address these issues.

Neural networks are powerful, flexible tools that underpin modern AI. By composing simple non-linear functions, they can approximate complex patterns in data.

Understanding their core components, training mechanisms, and limitations is essential for building robust, scalable AI systems.
