Modern AI Systems

Part I: Large Language Models – The Power of Scale

The Language Model Revolution

Large Language Models (LLMs) represent perhaps the most visible face of modern AI. Systems like GPT-4, Claude, and PaLM demonstrate unprecedented language understanding and generation capabilities.

What Is a Language Model?

At their core, language models solve a deceptively simple task: predict the next token (word or sub-word) given previous context. But this simple objective, when scaled massively, leads to emergent capabilities.
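
As a toy illustration of the prediction task (a counting-based sketch, not how real LLMs work internally), the snippet below learns which token tends to follow which in a tiny corpus and turns those counts into next-token probabilities:

    # Toy next-token predictor: count which token follows which in a tiny corpus.
    from collections import Counter, defaultdict

    corpus = "the cat sat on the mat . the dog sat on the rug .".split()

    following = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        following[prev][nxt] += 1

    def predict_next(token):
        """Return a probability distribution over possible next tokens."""
        counts = following[token]
        total = sum(counts.values())
        return {word: n / total for word, n in counts.items()}

    print(predict_next("the"))   # {'cat': 0.25, 'mat': 0.25, 'dog': 0.25, 'rug': 0.25}
    print(predict_next("sat"))   # {'on': 1.0}

Modern LLMs replace the counting with a neural network that conditions on thousands of prior tokens, but the objective is the same: output a probability for every possible next token.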

From Prediction to Understanding

To accurately predict next words, a model must implicitly learn:

  • Syntax: Grammatical rules and sentence structure
  • Semantics: Meaning of words and phrases
  • World Knowledge: Facts about people, places, events, concepts
  • Reasoning: Logical inference and causal relationships
  • Context: How previous sentences influence meaning
  • Intent: Understanding what question or request is being made

These capabilities emerge from training on hundreds of billions or trillions of words from books, websites, code, and conversations.

The Transformer Foundation

Modern LLMs build on the Transformer architecture, stacking dozens of attention layers. Key architectural choices:

  • Massive Scale: Billions to hundreds of billions of parameters
  • Pre-training: Train on enormous web-scale datasets
  • Fine-tuning: Adapt to specific tasks or align with human preferences
  • Context Windows: Process thousands of tokens simultaneously (recent models exceed 100K tokens)
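
The core operation these stacked layers repeat is scaled dot-product attention: each position queries every other position and mixes their value vectors according to similarity. A minimal NumPy sketch (single head, no masking, no learned projection matrices) looks like this:

    import numpy as np

    def scaled_dot_product_attention(Q, K, V):
        """Single-head attention: softmax(Q K^T / sqrt(d_k)) V."""
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)                    # (seq_len, seq_len) similarities
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)     # softmax over the key positions
        return weights @ V                                 # weighted mixture of value vectors

    seq_len, d_k = 5, 8
    rng = np.random.default_rng(0)
    Q, K, V = (rng.standard_normal((seq_len, d_k)) for _ in range(3))
    print(scaled_dot_product_attention(Q, K, V).shape)     # (5, 8)

A production Transformer adds learned query/key/value projections, many heads in parallel, causal masking, and feed-forward layers, but this similarity-weighted averaging is the heart of it.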

Training Paradigms

Three-Stage Training for Modern LLMs

Stage 1: Pre-training

  • Train on massive, diverse text corpus
  • Objective: Predict next token
  • Duration: Weeks or months on thousands of GPUs
  • Result: Model with broad language understanding
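
In code, the pre-training objective is simply the cross-entropy between the model's predicted distribution and the token that actually comes next, averaged over every position. A hedged PyTorch sketch (the logits here are random placeholders standing in for a real Transformer's output):

    import torch
    import torch.nn.functional as F

    batch, seq_len, vocab_size = 4, 128, 32000
    tokens = torch.randint(0, vocab_size, (batch, seq_len))   # token ids from the corpus

    # A real model would compute logits = transformer(tokens); we fake them to stay self-contained.
    logits = torch.randn(batch, seq_len, vocab_size)

    # Predict token t+1 from positions up to t: shift predictions and targets by one.
    pred = logits[:, :-1, :].reshape(-1, vocab_size)
    target = tokens[:, 1:].reshape(-1)
    loss = F.cross_entropy(pred, target)   # minimized over the whole corpus during pre-training
    print(loss.item())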

Stage 2: Supervised Fine-Tuning

  • Train on curated instruction-following examples
  • Teaches model to follow prompts and answer questions
  • Duration: Hours to days
  • Result: Model that responds helpfully to instructions
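
A single fine-tuning example is typically nothing more than a prompt paired with a reference response. A hypothetical record (field names vary across datasets) might look like:

    # Hypothetical instruction-tuning record; real datasets contain tens of thousands of these.
    sft_example = {
        "instruction": "Summarize the following paragraph in one sentence.",
        "input": "Large language models are trained on web-scale text corpora ...",
        "output": "LLMs acquire broad language skills by predicting the next token over huge text corpora.",
    }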

Stage 3: Reinforcement Learning from Human Feedback (RLHF)

  • Humans rank model responses by quality
  • Train reward model to predict human preferences
  • Use RL to optimize model outputs toward higher rewards
  • Result: Model aligned with human values and preferences
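
The reward model at the center of this stage is commonly trained with a pairwise preference loss: given a response humans preferred and one they rejected, push the preferred response's score higher. A minimal PyTorch sketch with placeholder scores (a real reward model would compute them from the response text):

    import torch
    import torch.nn.functional as F

    reward_chosen = torch.randn(8, requires_grad=True)     # scores for human-preferred responses
    reward_rejected = torch.randn(8, requires_grad=True)   # scores for rejected responses

    # Bradley-Terry style objective: maximize sigmoid(r_chosen - r_rejected).
    loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()
    loss.backward()
    print(loss.item())

The policy model is then optimized (typically with PPO) to produce responses this reward model scores highly, while staying close to the fine-tuned model.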

Emergent Capabilities at Scale

As models grow larger, unexpected abilities emerge:

  • In-context Learning: Perform new tasks from examples in the prompt (few-shot learning)
  • Chain-of-Thought Reasoning: Solve complex problems by breaking them into steps
  • Multi-step Planning: Decompose goals into sub-goals
  • Code Generation: Write working programs from descriptions
  • Multi-lingual Transfer: Translate between languages not explicitly paired in training

These capabilities were not explicitly programmed; they emerged as models were scaled up under the same next-token prediction objective. To some researchers, this suggests that broad capability owes more to scale and architecture than to specialized algorithms.
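
In-context learning in particular needs no extra code or training: the "examples" simply appear in the prompt. A hypothetical few-shot prompt for sentiment classification:

    # Hypothetical few-shot prompt: the in-prompt examples are the entire task specification.
    prompt = """Classify the sentiment of each review as Positive or Negative.

    Review: "The battery lasts all day."         Sentiment: Positive
    Review: "The screen cracked within a week."  Sentiment: Negative
    Review: "Shipping was fast and painless."    Sentiment:"""
    # A sufficiently capable model completes this with " Positive" without any fine-tuning.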

Applications Transforming Industries

  • Content Creation: Writing assistance, summarization, creative writing
  • Code Assistance: GitHub Copilot, code explanation, debugging
  • Customer Service: Chatbots, automated support
  • Education: Tutoring, explanation, personalized learning
  • Research: Literature review, hypothesis generation, data analysis
  • Healthcare: Clinical note generation, medical knowledge Q&A

Limitations and Challenges

  • Hallucinations: Models confidently generate false information
  • Knowledge Cutoff: No awareness of events after training
  • Context Limits: Even long context windows are finite, and models can lose track of information buried deep in a long prompt
  • Reasoning Gaps: Struggle with novel logical reasoning
  • Computational Cost: Inference is expensive at scale

Part II: Computer Vision – Machines That See

From Pixels to Perception

Computer vision enables machines to derive meaningful information from digital images and videos. Modern systems approach or exceed human performance on many visual tasks.

Core Vision Tasks

Image Classification

Task: Assign label to entire image

Example: "This image contains a dog"

Applications: Medical diagnosis (tumor detection), content moderation, quality control
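
As a sketch of the workflow (assuming a recent torchvision, 0.13 or later, and a placeholder image path), classification amounts to preprocessing an image, running a forward pass, and taking a softmax over the class logits:

    import torch
    from torchvision import models
    from torchvision.models import ResNet50_Weights
    from PIL import Image

    weights = ResNet50_Weights.DEFAULT
    model = models.resnet50(weights=weights).eval()
    preprocess = weights.transforms()                 # resize, crop, and normalize as the model expects

    image = Image.open("dog.jpg").convert("RGB")      # placeholder path
    batch = preprocess(image).unsqueeze(0)            # shape (1, 3, H, W)

    with torch.no_grad():
        probs = model(batch).softmax(dim=-1)[0]

    top_prob, top_idx = probs.max(dim=-1)
    print(weights.meta["categories"][top_idx.item()], round(top_prob.item(), 3))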

Object Detection

Task: Locate and classify multiple objects in image

Example: Draw bounding boxes around all cars, pedestrians, traffic signs

Applications: Autonomous vehicles, surveillance, retail analytics

Key architectures: R-CNN family, YOLO, RetinaNet

Semantic Segmentation

Task: Classify every pixel in image

Example: Label each pixel as road, sidewalk, building, sky, person, etc.

Applications: Medical image analysis, scene understanding, augmented reality

Key architectures: U-Net, DeepLab, Mask R-CNN

Facial Recognition

Task: Identify or verify individuals from faces

Method: Learn face embeddings, vector representations in which images of the same person cluster close together

Applications: Device unlocking, security, photo organization

Concerns: Privacy, bias, surveillance implications
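
Verification then reduces to comparing embeddings. A minimal NumPy sketch (the embeddings here are synthetic stand-ins; real systems obtain them from a trained network such as FaceNet or ArcFace):

    import numpy as np

    def cosine_similarity(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    rng = np.random.default_rng(0)
    enrolled = rng.standard_normal(128)                   # stored embedding for the claimed identity
    probe = enrolled + 0.1 * rng.standard_normal(128)     # embedding of a new photo of the same person

    THRESHOLD = 0.7    # in practice tuned on a validation set to trade off false accepts and rejects
    print("match" if cosine_similarity(enrolled, probe) > THRESHOLD else "no match")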

The Data Hunger Challenge

Vision models require enormous labeled datasets. ImageNet (roughly 14 million images across more than 20,000 categories) catalyzed progress, but creating such datasets is expensive. Modern approaches mitigate this:

  • Self-supervised pre-training: Learn from unlabeled images
  • Synthetic data: Generate training data via simulation or GANs
  • Weak supervision: Use noisy labels from alt-text, hashtags, etc.
  • Active learning: Strategically select most informative examples to label

Beyond Static Images: Video Understanding

Video adds temporal dimension, enabling:

  • Action Recognition: Identify activities (running, jumping, cooking)
  • Motion Prediction: Anticipate future trajectories
  • Event Detection: Find specific moments in long videos
  • Video Generation: Create synthetic video content

3D Vision and Depth Perception

Modern vision systems increasingly reason in 3D:

  • Depth Estimation: Infer distance to surfaces from single images
  • 3D Reconstruction: Build 3D models from multiple views
  • SLAM: Simultaneous Localization and Mapping for robot navigation
  • NeRF: Neural Radiance Fields for photorealistic 3D scene representation

Multimodal Vision-Language Models

Bridging vision and language enables powerful new capabilities:

CLIP: Connecting Images and Text

Train vision and language encoders jointly on image-caption pairs from the web:

  • Learn shared embedding space where semantically similar images and text are close
  • Enables zero-shot image classification by comparing image embeddings to text descriptions
  • Powers text-to-image generation (DALL-E, Stable Diffusion) by guiding image synthesis toward text embeddings
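
Zero-shot classification with such a model is just a similarity search in the shared space. In the sketch below, encode_image and encode_text are placeholders for CLIP's trained encoders (not a specific library API), so the mechanics stay visible:

    import numpy as np

    def encode_image(image_path):
        """Placeholder for CLIP's image encoder; returns a unit-norm embedding."""
        v = np.random.default_rng(0).standard_normal(512)
        return v / np.linalg.norm(v)

    def encode_text(text):
        """Placeholder for CLIP's text encoder; returns a unit-norm embedding."""
        v = np.random.default_rng(abs(hash(text)) % 2**32).standard_normal(512)
        return v / np.linalg.norm(v)

    labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
    image_emb = encode_image("query.jpg")                 # placeholder input
    text_embs = np.stack([encode_text(t) for t in labels])

    scores = text_embs @ image_emb                        # cosine similarities (all vectors unit norm)
    print(labels[int(scores.argmax())])                   # predicted label with no task-specific training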

Applications:

  • Visual question answering: "What color is the car?"
  • Image captioning: Generate descriptions of photos
  • Text-to-image generation: Create images from descriptions
  • Visual reasoning: Answer complex questions requiring image understanding

Part III: Speech and Audio AI

Automatic Speech Recognition (ASR)

ASR converts spoken language to text—a challenging problem requiring understanding of acoustics, phonetics, and language.

The Pipeline Approach (Traditional)

  1. Acoustic Model: Maps audio features to phonemes (smallest sound units)
  2. Pronunciation Model: Maps phoneme sequences to words
  3. Language Model: Scores word sequence plausibility

End-to-End Neural ASR

Modern systems replace the pipeline with a single neural network (often Transformer-based) that directly maps audio to text:

  • Simpler: One model instead of multiple components
  • Better: Jointly optimizes entire process
  • Examples: Whisper, Conformer, Speech2Text Transformers
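
A minimal transcription sketch using the open-source whisper package (this assumes openai-whisper and ffmpeg are installed; the audio path is a placeholder):

    import whisper

    model = whisper.load_model("base")           # larger checkpoints trade speed for accuracy
    result = model.transcribe("meeting.wav")     # audio in, text out: no separate acoustic/lexicon/LM stages
    print(result["text"])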

Challenges:

  • Accents and dialects
  • Background noise
  • Multiple speakers (diarization)
  • Domain-specific vocabulary
  • Real-time processing requirements

Text-to-Speech (TTS)

TTS generates natural-sounding speech from text. Modern neural TTS achieves near-human quality:

  • WaveNet: Autoregressively generates raw audio waveforms one sample at a time (high quality but computationally expensive)
  • Tacotron: Generates mel-spectrograms from text, which a vocoder then converts to audio
  • FastSpeech: Non-autoregressive, parallel generation for much faster synthesis

Applications:

  • Voice assistants (Siri, Alexa, Google Assistant)
  • Accessibility (screen readers)
  • Audiobook narration
  • Language learning

Voice Cloning and Synthesis

Modern models can clone voices from minutes of audio, raising both opportunities (personalization, accessibility) and concerns (deepfakes, impersonation).

Music and Audio Generation

AI now generates music, sound effects, and ambient audio:

  • Jukebox: Generates music with singing
  • MuseNet: Composes multi-instrument pieces
  • AudioLM: Generates coherent speech and music continuations from short audio prompts

Part IV: Robotics and Embodied AI

Bringing AI into the Physical World

Embodied AI tackles the challenge of operating in the real, physical world with all its complexity, uncertainty, and continuous dynamics.

Core Challenges in Robotics

Perception

Understanding the environment from sensors (cameras, lidar, touch, proprioception). Must handle:

  • Noisy, incomplete sensor data
  • Dynamic, changing environments
  • Occlusions and lighting variations
  • Real-time processing requirements

Planning and Control

Deciding what actions to take and executing them precisely:

  • Path planning in complex spaces
  • Collision avoidance
  • Motor control (precise movement)
  • Handling uncertainty and disturbances

Manipulation

Grasping and manipulating objects—deceptively difficult:

  • Estimating object properties (weight, friction, fragility)
  • Planning grasp points
  • Applying appropriate forces
  • Adapting to slippage or unexpected resistance

Reinforcement Learning for Robotics

RL is natural for robotics—agents learn from interaction. But real-world learning faces challenges:

  • Sample Inefficiency: RL needs many trials; real robots are slow and expensive
  • Safety: Exploration can damage robots or surroundings
  • Sim-to-Real Gap: Policies learned in simulation may fail on real hardware

Solutions:

  • Simulation Training: Train in physics simulators, transfer to reality
  • Domain Randomization: Vary simulation parameters to encourage robustness
  • Learning from Demonstrations: Bootstrap learning from human examples
  • Meta-Learning: Learn to adapt quickly to new situations
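
A toy sketch of domain randomization (the simulator here is a stub; real training would use a physics engine such as MuJoCo or Isaac and a genuine RL update):

    import random

    def sample_physics():
        """Sample new simulator parameters so the policy cannot overfit a single physics setting."""
        return {
            "friction": random.uniform(0.5, 1.5),
            "mass_scale": random.uniform(0.8, 1.2),
            "sensor_noise": random.uniform(0.0, 0.05),
        }

    class StubSimulator:
        """Stand-in for a physics simulator, kept minimal so the sketch runs as-is."""
        def set_physics(self, **params):
            self.params = params
        def run_episode(self, policy):
            return random.gauss(0.0, 1.0)        # a real simulator would roll out the policy here

    sim = StubSimulator()
    for episode in range(1000):
        sim.set_physics(**sample_physics())      # a freshly randomized world every episode
        reward = sim.run_episode(policy=None)    # a real loop would also update the policy here
    print("trained across", episode + 1, "randomized environment variants")

Because the policy never sees the same physics twice, it is pushed toward behaviors that remain robust to the mismatch between simulation and the real robot.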

Autonomous Vehicles

Self-driving cars represent one of robotics' most ambitious goals. The full stack includes:

Autonomous Driving Pipeline

  • Perception: Detect vehicles, pedestrians, lanes, traffic signs, lights
  • Localization: Determine precise position on map
  • Prediction: Anticipate how other agents will move
  • Planning: Decide path and actions (lane changes, turns, stops)
  • Control: Execute plan with steering, throttle, brake commands

Progress and Challenges:

  • Works well in structured environments (highways, mapped cities)
  • Struggles with edge cases (construction zones, unusual weather, adversarial humans)
  • Requires solving perception, prediction, and planning simultaneously
  • Safety-critical nature demands near-perfect reliability

Part V: Recommendation Systems – Personalizing the Internet

The Most Deployed AI

Recommendation systems might be the AI you interact with most. They power:

  • Netflix: What to watch next
  • YouTube: Video suggestions
  • Amazon: Product recommendations
  • Spotify: Music discovery
  • Social media: Content feeds

Core Approaches

Collaborative Filtering

Idea: Users who agreed in the past will agree in the future

User-based: Find similar users, recommend what they liked

Item-based: Find similar items to ones user liked

Matrix Factorization: Learn latent factor vectors for users and items; predict a rating as the dot product of the corresponding user and item vectors
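
A minimal sketch of matrix factorization trained by stochastic gradient descent on a few toy ratings (no bias terms or regularization, which real systems would add):

    import numpy as np

    # (user, item, rating) triples; a production system would have millions of these.
    ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 4.0), (2, 2, 5.0)]
    n_users, n_items, k = 3, 3, 4

    rng = np.random.default_rng(0)
    U = 0.1 * rng.standard_normal((n_users, k))   # latent user factors
    V = 0.1 * rng.standard_normal((n_items, k))   # latent item factors

    lr = 0.05
    for _ in range(500):
        for u, i, r in ratings:
            err = r - U[u] @ V[i]                 # error on this observed rating
            U[u] += lr * err * V[i]               # nudge user factors toward reducing the error
            V[i] += lr * err * U[u]               # nudge item factors likewise

    print(round(float(U[0] @ V[2]), 2))           # predicted rating for an unobserved (user, item) pair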

Content-Based Filtering

Idea: Recommend items similar to what user previously liked

Analyze item features (genre, actors, keywords) and user preferences to match

Hybrid and Deep Learning Approaches

Modern systems combine multiple signals:

  • User behavior (views, clicks, watch time)
  • Item features (metadata, content embeddings)
  • Contextual information (time, device, location)
  • Social connections

Deep neural networks learn complex, nonlinear combinations of these signals.

The Exploration-Exploitation Dilemma Returns

Should the system recommend:

  • Safe bets (exploitation): Similar to what user already likes
  • Novel items (exploration): Different content that might expand user interests

Too much exploitation creates filter bubbles; too much exploration frustrates users with irrelevant content.
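
One standard way to balance the two is an epsilon-greedy policy borrowed from the multi-armed bandit setting: usually recommend the item with the best estimated payoff, but occasionally try something else. A toy sketch with a simulated feedback signal:

    import random

    n_items = 5
    estimated_value = [0.0] * n_items     # running estimate of how much the user likes each item
    pulls = [0] * n_items
    EPSILON = 0.1                         # fraction of recommendations spent exploring

    def recommend():
        if random.random() < EPSILON:     # explore: surface something the user has not signaled interest in
            return random.randrange(n_items)
        return max(range(n_items), key=lambda i: estimated_value[i])   # exploit: current best guess

    def record_feedback(item, reward):
        """Update the running average after observing clicks, watch time, etc."""
        pulls[item] += 1
        estimated_value[item] += (reward - estimated_value[item]) / pulls[item]

    for _ in range(1000):
        item = recommend()
        record_feedback(item, reward=random.random())   # placeholder for a real engagement signal
    print([round(v, 2) for v in estimated_value])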

Societal Implications

Recommendation systems shape information access at global scale:

  • Filter Bubbles: Narrowing of perspective by showing similar content
  • Engagement Optimization: Maximizing watch time may amplify sensational content
  • Feedback Loops: Popular items get more exposure, becoming more popular
  • Diversity Trade-offs: Accuracy vs. exposing diverse perspectives

As AI systems become increasingly embedded in daily life—from the content we consume to the decisions made about us—understanding their capabilities, limitations, and societal impacts becomes essential for all citizens, not just technical practitioners.