1
00:00:08,124 --> 00:00:10,742
<g noun>Workstations</g> and <g adjective>high end</g> <g adjective>personal</a> <a noun>computers</g>

2
00:00:10,742 --> 00:00:14,749
<g verb>have been able to manipulate</g> <g adjective>digital</g> <g noun>audio</g> <g adverb>pretty easily</g> <g preposition>for</g> <g adjective>about fifteen</g> <g noun>years</g> <g interjection>now</g>.

3
00:00:14,749 --> 00:00:17,470
<g pronoun>It</g><g verb>'s</g> <g adverb>only</g> <g verb>been</g> <g adjective>about five</g> <g noun>years</g> <g conjunction>that</g> <g adjective>a decent <g noun>workstation</g><g verb>'s been able to handle</g>

4
00:00:17,470 --> 00:00:21,643
<g adjective>raw</g> <g noun>video</g> <g preposition>without</g> <g adjective>a lot of</g> <g adjective>expensive special purpose</g> <g noun>hardware</g>.

5
00:00:21,643 --> 00:00:25,400
But today even most cheap home PCs have the processor power and

6
00:00:25,400 --> 00:00:28,092
storage necessary to really toss raw video around,

7
00:00:28,092 --> 00:00:30,479
at least without too much of a struggle. 

8
00:00:30,479 --> 00:00:33,579
So now that everyone has all of this cheap capable hardware, 

9
00:00:33,579 --> 00:00:36,651
more people, not surprisingly, want to do interesting

10
00:00:36,651 --> 00:00:39,908
things with digital media, especially streaming. 

11
00:00:39,908 --> 00:00:44,017
YouTube was the first huge success, and now everybody wants in.

12
00:00:44,017 --> 00:00:47,413
Well good!  Because this stuff is a lot of fun!

13
00:00:48,250 --> 00:00:51,179
It's no problem finding consumers for digital media.  

14
00:00:51,179 --> 00:00:54,649
But here, I'd like to address the engineers, the mathematicians, 

15
00:00:54,649 --> 00:00:57,869
the hackers, the people who are interested in discovering 

16
00:00:57,869 --> 00:01:01,302
and making things and building the technology itself. 

17
00:01:01,302 --> 00:01:03,282
The people after my own heart.

18
00:01:04,250 --> 00:01:08,723
Digital media, compression especially, is perceived to be super-elite,

19
00:01:08,723 --> 00:01:12,822
somehow incredibly more difficult than anything else in computer science. 

20
00:01:12,822 --> 00:01:15,700
The big industry players in the field don't mind this perception at all; 

21
00:01:15,700 --> 00:01:19,734
it helps justify the staggering number of very basic patents they hold.  

22
00:01:19,734 --> 00:01:23,870
They like the image that their media researchers are the best of the best, 

23
00:01:23,870 --> 00:01:27,738
so much smarter than anyone else that their brilliant ideas can't 

24
00:01:27,738 --> 00:01:29,903
even be understood by mere mortals. 

25
00:01:30,625 --> 00:01:33,716
This is bunk.  

26
00:01:35,205 --> 00:01:38,900
Digital audio and video and streaming and compression 

27
00:01:38,900 --> 00:01:42,738
offer endless deep and stimulating mental challenges, 

28
00:01:42,738 --> 00:01:44,662
just like any other discipline. 

29
00:01:44,662 --> 00:01:47,929
It seems elite because so few people have been been involved.  

30
00:01:47,929 --> 00:01:51,223
So few people have been involved perhaps because so few people 

31
00:01:51,223 --> 00:01:54,665
could afford the expensive, special-purpose equipment it required. 

32
00:01:54,665 --> 00:01:58,792
But today, just about anyone watching this video has a cheap, 

33
00:01:58,792 --> 00:02:03,317
general-purpose computer powerful enough to play with the big boys. 

34
00:02:05,926 --> 00:02:11,108
There are battles going on today around HTML5 and browsers 

35
00:02:11,108 --> 00:02:13,671
and video and open vs. closed. 

36
00:02:13,671 --> 00:02:17,048
So now is a pretty good time to get involved.  

37
00:02:17,048 --> 00:02:20,000
The easiest place to start is probably understanding 

38
00:02:20,000 --> 00:02:22,619
the technology we have right now.

39
00:02:23,500 --> 00:02:25,071
This is an introduction. 

40
00:02:25,071 --> 00:02:28,180
Since it's an introduction, it glosses over a ton of details 

41
00:02:28,180 --> 00:02:30,882
so that the big picture's a little easier to see.

42
00:02:30,882 --> 00:02:33,908
Quite a few people watching are going to be way past anything 

43
00:02:33,908 --> 00:02:36,378
that I'm talking about, at least for now.  

44
00:02:36,378 --> 00:02:39,293
On the other hand, I'm probably going to go too fast for folks 

45
00:02:39,293 --> 00:02:44,558
who really are are brand new to all of this, so if this is all new, relax. 

46
00:02:44,558 --> 00:02:48,629
The important thing is to pick out any ideas  that really grab your imagination.

47
00:02:48,629 --> 00:02:52,497
Especially pay attention to the terminology surrounding those ideas, 

48
00:02:52,479 --> 00:02:56,078
because with those, and Google, and Wikipedia, you can dig 

49
00:02:56,078 --> 00:02:57,753
as deep as interests you.

50
00:02:57,753 --> 00:03:00,094
So, without any further ado, 

51
00:03:00,094 --> 00:03:03,351
welcome to one hell of a new hobby.

52
00:03:10,291 --> 00:03:13,030
Sound is the propagation of pressure waves through air, 

53
00:03:13,030 --> 00:03:16,981
spreading out from a source like ripples spread from a stone tossed into a pond.

54
00:03:16,981 --> 00:03:19,489
A microphone, or the human ear for that matter, 

55
00:03:19,489 --> 00:03:22,876
transforms these passing ripples of pressure into an electric signal.  

56
00:03:22,876 --> 00:03:25,800
Right, this is middle school science class, everyone remembers this.

57
00:03:25,800 --> 00:03:26,771
Moving on.

58
00:03:27,465 --> 00:03:32,527
That audio signal is a one-dimensional function, a single value varying over time.  

59
00:03:32,527 --> 00:03:34,248
If we slow the 'scope down a bit... 

60
00:03:36,450 --> 00:03:38,190
that should be a little easier to see. 

61
00:03:38,190 --> 00:03:40,688
A few other aspects of the signal are important.  

62
00:03:40,688 --> 00:03:43,418
It's continuous in both value and time;  

63
00:03:43,418 --> 00:03:46,813
that is, at any given time it can have any real value, 

64
00:03:46,813 --> 00:03:50,228
and there's a smoothly varying value at every point in in time.  

65
00:03:50,228 --> 00:03:52,439
No matter how much we zoom in,

66
00:03:54,068 --> 00:03:58,510 
there are no discontinuities, no singularities, no instantaneous steps 

67
00:03:58,510 --> 00:04:01,285
or points where the signal ceases to exist. 

68
00:04:03,247 --> 00:04:08,475
It's defined everywhere. Classic continuous math works very well on these signals.

69
00:04:11,001 --> 00:04:15,378
A digital signal on the other hand is discrete in both value and time.

70
00:04:15,378 --> 00:04:19,107
In the simplest and most common system, called Pulse Code Modulation,

71
00:04:19,107 --> 00:04:24,058
one of a fixed number of possible values directly represents the instantaneous signal amplitude 

72
00:04:24,058 --> 00:04:30,165
at points in time spaced a fixed distance apart. The end result is a stream of digits.

73
00:04:30,674 --> 00:04:35,309
Now this looks an awful lot like this.  

74
00:04:35,309 --> 00:04:38,964
It seems intuitive that we should somehow be able to rigorously transform 

75
00:04:38,964 --> 00:04:44,683
one into the other, and good news, the Sampling Theorem says we can and tells us how. 

76
00:04:44,683 --> 00:04:48,477
Published in its most recognizable form by Claude Shannon in 1949

77
00:04:48,477 --> 00:04:52,409
and built on the work of Nyquist, and Hartley, and tons of others, 

78
00:04:52,409 --> 00:04:56,138
the sampling theorem states that not only can we go back and forth between 

79
00:04:56,138 --> 00:05:00,913
analog and digital, but also lays down a set of conditions for which conversion 

80
00:05:00,913 --> 00:05:06,779
is lossless and the two representations become equivalent and interchangable.  

81
00:05:06,779 --> 00:05:10,601
When the lossless conditions aren't met, the sampling theorem tells us 

82
00:05:10,601 --> 00:05:14,247
how and how much information is lost or corrupted.

83
00:05:14,900 --> 00:05:21,270
Up until very recently, analog technology was the basis for practically everything done with audio, 

84
00:05:21,270 --> 00:05:25,267
and that's not because most audio comes from an originally analog source.

85
00:05:25,267 --> 00:05:28,450
You may also think that since computers are fairly recent, 

86
00:05:28,450 --> 00:05:31,643
analog signal technology must have come first.  

87
00:05:31,643 --> 00:05:34,428
Nope. Digital is actually older.  

88
00:05:34,428 --> 00:05:37,611
The telegraph predates the telephone by half a century 

89
00:05:37,611 --> 00:05:41,951
and was already fully mechanically automated by the 1860s, sending coded, 

90
00:05:41,951 --> 00:05:46,476
multiplexed digital signals long distances. You know... Tickertape. 

91
00:05:46,476 --> 00:05:50,427
Harry Nyquist of Bell Labs was researching telegraph pulse transmission 

92
00:05:50,427 --> 00:05:53,027
when he published his description of what later became known 

93
00:05:53,027 --> 00:05:57,219
as the Nyquist frequency, the core concept of the sampling theorem.  

94
00:05:57,219 --> 00:06:01,642
Now, it's true the telegraph was transmitting symbolic information, text, 

95
00:06:01,642 --> 00:06:06,883
not a digitized analog signal, but with the advent of the telephone and radio,

96
00:06:06,883 --> 00:06:12,000
analog and digital signal technology progressed rapidly and side-by-side.

97
00:06:12,699 --> 00:06:18,732
Audio had always been manipulated as an analog signal because, well, gee it's so much easier.  

98
00:06:18,732 --> 00:06:23,257
A second-order lowpass filter, for example, requires two passive components.  

99
00:06:23,257 --> 00:06:26,505
An all-analog short-time Fourier transform, a few hundred.  

100
00:06:26,505 --> 00:06:30,752
Well, maybe a thousand if you want to build something really fancy.  

101
00:06:31,844 --> 00:06:35,989
Processing signals digitally requires millions to billions of transistors 

102
00:06:35,989 --> 00:06:40,366
running at microwave frequencies, support hardware at very least to digitize 

103
00:06:40,366 --> 00:06:43,836
and reconstruct the analog signals, a complete software ecosystem 

104
00:06:43,836 --> 00:06:47,362
for programming and controlling that billion-transistor juggernaut,

105
00:06:47,362 --> 00:06:51,091
digital storage just in case you want to keep any of those bits for later...

106
00:06:51,091 --> 00:06:56,171
So we come to the conclusion that analog is the only practical way to do much with audio...

107
00:06:56,171 --> 00:07:07,019
well, unless you happen to have a billion transistors and all the other things just lying around. 

108
00:07:07,850 --> 00:07:12,660
And since we do, digital signal processing becomes very attractive.

109
00:07:13,363 --> 00:07:18,906
For one thing, analog componentry just doesn't have the flexibility of a general purpose computer.

110
00:07:18,906 --> 00:07:21,182
Adding a new function to this beast... 

111
00:07:22,191 --> 00:07:24,578
yeah, it's probably not going to happen.  

112
00:07:24,578 --> 00:07:26,567
On a digital processor though...

113
00:07:28,668 --> 00:07:34,127
...just write a new program. Software isn't trivial, but it is a lot easier.

114
00:07:34,127 --> 00:07:39,550
Perhaps more importantly though every analog component is an approximation. 

115
00:07:39,550 --> 00:07:44,352
There's no such thing as a perfect transistor, or a perfect inductor, or a perfect capacitor.  

116
00:07:44,352 --> 00:07:51,569
In analog, every component adds noise and distortion, usually not very much, but it adds up. 

117
00:07:51,569 --> 00:07:55,669
Just transmitting an analog signal, especially over long distances,

118
00:07:55,669 --> 00:08:00,434
progressively, measurably, irretrievably corrupts it.  

119
00:08:00,434 --> 00:08:06,513
Besides, all of those single-purpose analog components take up a lot of space.  

120
00:08:06,513 --> 00:08:09,946
Two lines of code on the billion transistors back here 

121
00:08:09,946 --> 00:08:14,702
can implement a filter that would require an inductor the size of a refrigerator.

122
00:08:14,702 --> 00:08:17,941
Digital systems don't have these drawbacks.  

123
00:08:17,941 --> 00:08:24,335
Digital signals can be stored, copied, manipulated and transmitted without adding any noise or distortion. 

124
00:08:24,335 --> 00:08:26,889
We do use lossy algorithms from time to time, 

125
00:08:26,889 --> 00:08:31,284
but the only unavoidably non-ideal steps are digitization and reconstruction,

126
00:08:31,284 --> 00:08:35,929
where digital has to interface with all of that messy analog.  

127
00:08:35,929 --> 00:08:40,750
Messy or not, modern conversion stages are very, very good.  

128
00:08:40,750 --> 00:08:45,849
By the standards of our ears, we can consider them practically lossless as well.

129
00:08:45,849 --> 00:08:50,429
With a little extra hardware, then, most of which is now small and inexpensive 

130
00:08:50,429 --> 00:08:55,379
due to our modern industrial infrastructure, digital audio is the clear winner over analog.

131
00:08:55,379 --> 00:09:00,857
So let us then go about storing it, copying it, manipulating it, and transmitting it.

132
00:09:04,956 --> 00:09:08,639
Pulse Code Modulation is the most common representation for raw audio.  

133
00:09:08,639 --> 00:09:13,867
Other practical representations do exist, for example the Sigma-Delta coding used by the SACD, 

134
00:09:13,867 --> 00:09:16,625
which is a form of Pulse Density Modulation.  

135
00:09:16,625 --> 00:09:19,687
That said, Pulse Code Modulation is far and away dominant, 

136
00:09:19,687 --> 00:09:22,158
mainly because it's so mathematically convenient.  

137
00:09:22,158 --> 00:09:26,350
An audio engineer can spend an entire career without running into anything else.

138
00:09:26,350 --> 00:09:29,135
PCM encoding can be characterized in three parameters,

139
00:09:29,135 --> 00:09:34,187
making it easy to account for every possible PCM variant with mercifully little hassle.

140
00:09:34,187 --> 00:09:36,426
The first parameter is the sampling rate.  

141
00:09:36,426 --> 00:09:40,886
The highest frequency an encoding can represent is called the Nyquist Frequency.  

142
00:09:40,886 --> 00:09:45,124
The Nyquist frequency of PCM happens to be exactly half the sampling rate.

143
00:09:45,124 --> 00:09:51,389
Therefore the sampling rate directly determines the highest possible frequency in the digitized signal.

144
00:09:51,389 --> 00:09:56,515
Analog telephone systems traditionally band-limited voice channels to just under 4kHz, 

145
00:09:56,515 --> 00:10:02,224
so digital telephony and most classic voice applications use an 8kHz sampling rate, 

146
00:10:02,224 --> 00:10:07,277
the minimum sampling rate necessary to capture the entire bandwidth of a 4kHz channel.  

147
00:10:07,227 --> 00:10:14,263
This is what an 8kHz sampling rate sounds like--- a bit muffled but perfectly intelligible for voice.  

148
00:10:17,263 --> 00:10:18,149
This is the lowest sampling rate that's ever been used widely in practice.

149
00:10:18,149 --> 00:10:23,322
From there, as power, and memory, and storage increased, consumer computer hardware

150
00:10:23,322 --> 00:10:29,642
went to offering 11, and then 16, and then 22, and then 32kHz sampling.  

151
00:10:29,642 --> 00:10:33,491
With each increase in the sampling rate and the Nyquist frequency, 

152
00:10:33,491 --> 00:10:38,302
it's obvious that the high end becomes a little clearer and the sound more natural.

153
00:10:38,301 --> 00:10:44,576
The Compact Disc uses a 44.1kHz sampling rate, which is again slightly better than 32kHz, 

154
00:10:44,576 --> 00:10:46,788
but the gains are becoming less distinct.  

155
00:10:46,788 --> 00:10:52,053
44.1kHz is a bit of an oddball choice, especially given that it hadn't been used  for anything 

156
00:10:52,053 --> 00:10:56,559
prior to the compact disc, but the huge success of the CD has made it a common rate.

157
00:10:56,559 --> 00:11:01,195
The most common hi-fidelity sampling rate aside from the CD is 48kHz.

158
00:11:05,710 --> 00:11:08,597
There's virtually no audible difference between the two.  

159
00:11:08,597 --> 00:11:13,640
This video, or at least the original version of it, was shot and produced with 48kHz audio, 

160
00:11:13,640 --> 00:11:18,545
which happens to be the original standard for high-fidelity audio with video.

161
00:11:18,545 --> 00:11:25,100
Super-hi-fidelity sampling rates of 88, and 96, and 192kHz have also appeared. 

162
00:11:25,100 --> 00:11:30,888
The reason for the sampling rates beyond 48kHz isn't to extend the audible high frequencies further. 

163
00:11:30,888 --> 00:11:32,489
It's for a different reason.

164
00:11:32,896 --> 00:11:37,319
Stepping back for just a second, the French mathematician Jean Baptiste Joseph Fourier 

165
00:11:37,319 --> 00:11:42,353
showed that we can also think of signals like audio as a set of component frequencies.  

166
00:11:42,353 --> 00:11:45,841
This frequency domain representation is equivalent to the time representation; 

167
00:11:45,841 --> 00:11:49,719
the signal is exactly the same, we're just looking at it a different way.  

168
00:11:49,719 --> 00:11:56,131
Here we see the frequency domain representation of a hypothetical analog signal we intend to digitally sample.

169
00:11:56,131 --> 00:11:59,888
The sampling theorem tells us two essential things about the sampling process. 

170
00:11:59,888 --> 00:12:04,727
First, that a digital signal can't represent any frequencies above the Nyquist frequency. 

171
00:12:04,727 --> 00:12:10,640
Second, and this is the new part, if we don't remove those frequencies with a lowpass filter before sampling, 

172
00:12:10,640 --> 00:12:16,414
the sampling process will fold them down into the representable frequency range as aliasing distortion.

173
00:12:16,414 --> 00:12:20,069
Aliasing, in a nutshell, sounds freakin' awful, 

174
00:12:20,069 --> 00:12:25,242
so it's essential to remove any beyond-Nyquist frequencies before sampling and after reconstruction.

175
00:12:25,871 --> 00:12:31,265
Human frequency perception is considered to extend to about 20kHz. 

176
00:12:31,265 --> 00:12:37,548
In 44.1 or 48kHz sampling, the lowpass before the sampling stage has to be extremely sharp 

177
00:12:37,548 --> 00:12:42,101
to avoid cutting any audible frequencies below 20kHz 

178
00:12:42,101 --> 00:12:49,439
but still not allow frequencies above the Nyquist to leak forward into the sampling process.  

179
00:12:49,439 --> 00:12:55,342
This is a difficult filter to build and no practical filter succeeds completely. 

180
00:12:55,342 --> 00:13:00,024
If the sampling rate is 96kHz or 192kHz on the other hand, 

181
00:13:00,024 --> 00:13:07,223
the lowpass has an extra octave or two for its transition band. This is a much easier filter to build.  

182
00:13:07,223 --> 00:13:14,348
Sampling rates beyond 48kHz are actually one of those messy analog stage compromises.

183
00:13:15,014 --> 00:13:20,844
The second fundamental PCM parameter is the sample format, that is, the format of each digital number.  

184
00:13:20,844 --> 00:13:26,285
A number is a number, but a number can be represented in bits a number of different ways.

185
00:13:26,942 --> 00:13:30,902
Early PCM was eight bit linear, encoded as an unsigned byte.  

186
00:13:30,902 --> 00:13:37,028
The dynamic range is limited to about 50dB and the quantization noise, as you can hear, is pretty severe. 

187
00:13:37,028 --> 00:13:39,970
Eight bit audio is vanishingly rare today.

188
00:13:41,007 --> 00:13:47,484
Digital telephony typically uses one of two related non-linear eight bit encodings 
called A-law and mu-law. 

189
00:13:47,484 --> 00:13:51,287
These formats encode a roughly 14 bit dynamic range into eight bits 

190
00:13:51,287 --> 00:13:54,674
by spacing the higher amplitude values farther apart. 

191
00:13:54,674 --> 00:13:59,226
A-law and mu-law obviously improve quantization noise compared to linear 8-bit, 

192
00:13:59,226 --> 00:14:03,557
and voice harmonics especially hide the remaining quantization noise well. 

193
00:14:03,557 --> 00:14:08,248
All three eight bit encodings, linear, A-law, and mu-law, are typically paired 

194
00:14:08,248 --> 00:14:13,328
with an 8kHz sampling rate, though I'm demonstrating them here at 48kHz.

195
00:14:13,328 --> 00:14:18,491
Most modern PCM uses 16 or 24 bit two's-complement signed integers to encode 

196
00:14:18,491 --> 00:14:23,858
the range from negative infinity to zero decibels in 16 or 24 bits of precision. 

197
00:14:23,858 --> 00:14:27,800
The maximum absolute value corresponds to zero decibels. 

198
00:14:27,800 --> 00:14:31,584
As with all the sample formats so far, signals beyond zero decibels 

199
00:14:31,584 --> 00:14:35,619
and thus beyond the maximum representable range are clipped.

200
00:14:35,619 --> 00:14:41,199
In mixing and mastering, it's not unusual to use floating point numbers for PCM instead of integers.  

201
00:14:41,199 --> 00:14:47,222
A 32 bit IEEE754 float, that's the normal kind of floating point you see on current computers, 

202
00:14:47,222 --> 00:14:52,793
has 24 bits of resolution, but a seven bit floating point exponent increases the representable range.  

203
00:14:52,793 --> 00:14:57,040
Floating point usually represents zero decibels as +/-1.0, 

204
00:14:57,040 --> 00:15:00,547
and because floats can obviously represent considerably beyond that, 

205
00:15:00,547 --> 00:15:05,220
temporarily exceeding zero decibels during the mixing process doesn't cause clipping.

206
00:15:05,220 --> 00:15:11,077 
Floating point PCM takes up more space, so it tends to be used only as an intermediate production format.

207
00:15:11,077 --> 00:15:15,796
Lastly, most general purpose computers still read and write data in octet bytes, 

208
00:15:15,796 --> 00:15:18,489
so it's important to remember that samples bigger than eight bits 

209
00:15:18,489 --> 00:15:22,838
can be in big or little endian order, and both endiannesses are common.  

210
00:15:22,838 --> 00:15:28,751
For example, Microsoft WAV files are little endian, and Apple AIFC files tend to be big-endian.  

211
00:15:28,751 --> 00:15:30,139
Be aware of it.

212
00:15:30,870 --> 00:15:34,071
The third PCM parameter is the number of channels.  

213
00:15:34,071 --> 00:15:38,485
The convention in raw PCM is to encode multiple channels by interleaving the samples 

214
00:15:38,485 --> 00:15:43,398
of each channel together into a single stream.  Straightforward and extensible.

215
00:15:43,398 --> 00:15:47,701
And that's it!  That describes every PCM representation ever. 

216
00:15:47,701 --> 00:15:51,578
Done. Digital audio is _so easy_!  

217
00:15:51,578 --> 00:15:56,436
There's more to do of course, but at this point we've got a nice useful chunk of audio data, 

218
00:15:56,436 --> 00:15:58,092
so let's get some video too.

219
00:16:02,571 --> 00:16:08,798
One could think of video as being like audio but with two additional spatial dimensions, X and Y, 

220
00:16:08,798 --> 00:16:12,787
in addition to the dimension of time. This is mathematically sound.  

221
00:16:12,787 --> 00:16:19,097
The Sampling Theorem applies to all three video dimensions just as it does the single time dimension of audio.

222
00:16:19,097 --> 00:16:25,815
Audio and video are obviously quite different in practice. For one, compared to audio, video is huge. 

223
00:16:25,815 --> 00:16:29,294
Raw CD audio is about 1.4 megabits per second. 

224
00:16:29,294 --> 00:16:33,958
Raw 1080i HD video is over 700 megabits per second. 

225
00:16:33,958 --> 00:16:40,056
That's more than 500 times more data to capture, process and store per second.  

226
00:16:40,056 --> 00:16:43,711
By Moore's law... that's... let's see... roughly eight doublings times two years, 

227
00:16:43,711 --> 00:16:47,838
so yeah, computers requiring about an extra fifteen years to handle raw video 

228
00:16:47,838 --> 00:16:51,252
after getting raw audio down pat was about right.

229
00:16:51,252 --> 00:16:55,425
Basic raw video is also just more complex than basic raw audio. 

230
00:16:55,425 --> 00:16:58,599
The sheer volume of data currently necessitates a representation 

231
00:16:58,599 --> 00:17:02,106 
more efficient than the linear PCM used for audio.  

232
00:17:02,106 --> 00:17:06,705
In addition, electronic video comes almost entirely from broadcast television alone,

233
00:17:06,705 --> 00:17:13,423
and the standards committees that govern broadcast video have always been very concerned with backward compatibility.

234
00:17:13,423 --> 00:17:17,559  
Up until just last year in the US, a sixty year old black and white television 

235
00:17:17,559 --> 00:17:21,038
could still show a normal analog television broadcast.  

236
00:17:21,038 --> 00:17:23,879
That's actually a really neat trick.

237
00:17:23,879 --> 00:17:28,718
The downside to backward compatibility is that once a detail makes it into a standard, 

238
00:17:28,718 -->  00:17:30,985
you can't ever really throw it out again. 

239
00:17:30,985 --> 00:17:37,305
Electronic video has never started over from scratch the way audio has multiple times.  

240
00:17:37,305 --> 00:17:43,958
Sixty years worth of clever but obsolete hacks necessitated by the passing technology of a given era 

241
00:17:43,958 --> 00:17:50,102
have built up into quite a pile, and because digital standards also come from broadcast television, 

242
00:17:50,102 --> 00:17:54,664
all these eldritch hacks have been brought forward into the digital standards as well.

243
00:17:54,664 --> 00:18:00,022
In short, there are a whole lot more details involved in digital video than there were with audio. 

244
00:18:00,022 --> 00:18:05,592
There's no hope of covering them all completely here, so we'll cover the broad fundamentals.

245
00:18:06,036 --> 00:18:10,857
The most obvious raw video parameters are the width and height of the picture in pixels. 

246
00:18:10,857 --> 00:18:15,882
As simple as that may sound, the pixel dimensions alone don't actually specify the absolute 

247
00:18:15,882 --> 00:18:22,016
width and height of the picture, as most broadcast-derived video doesn't use square pixels.

248
00:18:22,016 --> 00:18:25,005
The number of scanlines in a broadcast image was fixed, 

249
00:18:25,005 --> 00:18:29,021
but the effective number of horizontal pixels was a function of channel bandwidth. 

250
00:18:29,021 --> 00:18:31,945
Effective horizontal resolution could result in pixels that were either 

251
00:18:31,945 --> 00:18:35,489
narrower or wider than the spacing between scanlines.

252
00:18:35,489 --> 00:18:38,395
Standards have generally specified that digitally sampled video 

253
00:18:38,395 --> 00:18:41,902
should reflect the real resolution of the original analog source, 

254
00:18:41,902 --> 00:18:45,566
so a large amount of digital video also uses non-square pixels. 

255
00:18:45,566 --> 00:18:49,924
For example, a normal 4:3 aspect NTSC DVD is typically encoded 

256
00:18:49,924 --> 00:18:55,374
with a display resolution of 704 by 480, a ratio wider than 4:3.  

257
00:18:55,374 --> 00:18:59,640
In this case, the pixels themselves are assigned an aspect ratio of 10:11, 

258
00:18:59,640 --> 00:19:04,553
making them taller than they are wide and narrowing the image horizontally to the
correct aspect.  

259
00:19:04,553 --> 00:19:09,800
Such an image has to be resampled to show properly on a digital display with square pixels.

260
00:19:10,253 -->  00:19:15,287
The second obvious video parameter is the frame rate, the number of full frames per second.  

261
00:19:15,287 --> 00:19:19,655
Several standard frame rates are in active use. Digital video, in one form or another, 

262
00:19:19,655 --> 00:19:23,689
can use all of them.  Or, any other frame rate.  Or even variable rates 

263
00:19:23,689 --> 00:19:27,113
where the frame rate changes adaptively over the course of the video. 

264
00:19:27,113 --> 00:19:32,998
The higher the frame rate, the smoother the motion and that brings us, unfortunately, to interlacing.

265
00:19:32,998 --> 00:19:37,967
In the very earliest days of broadcast video, engineers sought the fastest practical framerate 

266
00:19:37,967 --> 00:19:42,075
to smooth motion and to minimize flicker on phosphor-based CRTs.  

267
00:19:42,075 --> 00:19:45,277
They were also under pressure to use the least possible bandwidth 

268
00:19:45,277 --> 00:19:48,182
for the highest resolution and fastest frame rate.  

269
00:19:48,182 --> 00:19:51,208
Their solution was to interlace the video where the even lines 

270
00:19:51,208 --> 00:19:54,826
are sent in one pass and the odd lines in the next.  

271
00:19:54,826 --> 00:19:59,961
Each pass is called a field and two fields sort of produce one complete frame.

272
00:19:59,961 --> 00:20:05,319
"Sort of", because the even and odd fields aren't actually from the same source frame.  

273
00:20:05,319 --> 00:20:10,797
In a 60 field per second picture, the source frame rate is actually 60 full frames per second, 

274
00:20:10,797 --> 00:20:15,386
and half of each frame, every other line, is simply discarded.  

275
00:20:15,386 --> 00:20:20,272
This is why we can't deinterlace a video simply by combining two fields into one frame;

276
00:20:20,272 --> 00:20:23,039
they're not actually from one frame to begin with.

277
00:20:24,047 --> 00:20:29,683
The cathode ray tube was the only available display technology for most of the history of electronic video. 

278
00:20:29,683 --> 00:20:32,949
A CRT's output brightness is nonlinear, approximately equal 

279
00:20:32,949 --> 00:20:36,585
to the input controlling voltage raised to the 2.5th power. 

280
00:20:36,585 --> 00:20:43,821
This exponent, 2.5, is designated gamma, and so it's often referred to as the gamma of a display.  

281
00:20:43,821 --> 00:20:50,493
Cameras, though, are linear, and if you feed a CRT a linear input signal, it looks a
bit like this.

282
00:20:51,270 --> 00:20:56,637
As there were originally to be very few cameras, which were fantastically expensive anyway, 

283
00:20:56,637 --> 00:21:01,634
and hopefully many, many television sets which best be as inexpensive as possible, 

284
00:21:01,634 --> 00:21:08,222
engineers decided to add the necessary gamma correction circuitry to the cameras rather than the sets. 

285
00:21:08,222 --> 00:21:13,062
Video transmitted over the airwaves would thus have a nonlinear intensity using the inverse 

286
00:21:13,062 --> 00:21:18,271
of the set's gamma exponent, so that once a camera's signal was finally displayed on the CRT, 

287
00:21:18,271 --> 00:21:23,305
the overall response of the system from camera to set was back to linear again.

288
00:21:23,777 --> 00:21:25,118
Almost.

289
00:21:30,393 --> 00:21:33,113
There were also two other tweaks. 

290
00:21:33,113 --> 00:21:40,442
A television camera actually uses a gamma exponent that's the inverse of 2.2, not 2.5.  

291
00:21:40,442 --> 00:21:43,754
That's just a correction for viewing in a dim environment. 

292
00:21:43,754 --> 00:21:48,279
Also, the exponential curve transitions to a linear ramp near black.  

293
00:21:48,279 --> 00:21:52,360
That's just an old hack for suppressing sensor noise in the camera.

294
00:21:54,941 --> 00:21:57,347
Gamma correction also had a lucky benefit. 

295
00:21:57,347 --> 00:22:02,214
It just so happens that the human eye has a perceptual gamma of about 3.  

296
00:22:02,214 --> 00:22:05,962
This is relatively close to the CRT's gamma of 2.5. 

297
00:22:05,962 --> 00:22:10,607
An image using gamma correction devotes more resolution to lower intensities, 

298
00:22:10,607 --> 00:22:14,336
where the eye happens to have its finest intensity discrimination, 

299
00:22:14,336 --> 00:22:18,222
and therefore uses the available scale resolution more efficiently.  

300
00:22:18,222 --> 00:22:22,784
Although CRTs are currently vanishing, a standard sRGB computer display 

301
00:22:22,784 --> 00:22:28,419
still uses a nonlinear intensity curve similar to television, with a linear ramp near black,

302
00:22:28,419 --> 00:22:32,491
followed by an exponential curve with a gamma exponent of 2.4. 

303
00:22:32,491 --> 00:22:36,636
This encodes a sixteen bit linear range down into eight bits.

304 
00:22:37,580 --> 00:22:41,790
The human eye has three apparent color channels, red, green, and blue, 

305
00:22:41,790 --> 00:22:47,407
and most displays use these three colors as additive primaries to produce a full range of color output.  

306
00:22:49,258 --> 00:22:54,190
The primary pigments in printing are Cyan, Magenta, and Yellow for the same reason; 

307
00:22:54,190 --> 00:22:59,381
pigments are subtractive, and each of these pigments subtracts one pure color from reflected light.  

308
00:22:59,381 --> 00:23:05,682
Cyan subtracts red, magenta subtracts green, and yellow subtracts blue.

309
00:23:05,682 --> 00:23:10,919
Video can be and sometimes is represented with red, green, and blue color channels, 

310
00:23:10,919 --> 00:23:17,211
but RGB video is atypical. The human eye is far more sensitive to luminosity than it is the color, 

311
00:23:17,211 --> 00:23:21,329
and RGB tends to spread the energy of an image across all three color channels.  

312
00:23:21,329 --> 00:23:25,326
That is, the red plane looks like a red version of the original picture, 

313
00:23:25,326 --> 00:23:28,769
the green plane looks like a green version of the original picture, 

314
00:23:28,769 --> 00:23:32,063
and the blue plane looks like a blue version of the original picture.  

315
00:23:32,063 --> 00:23:35,705
Black and white times three.  Not efficient.

316
00:23:35,706 --> 00:23:39,438
For those reasons and because, oh hey, television just happened to start out 

317
00:23:39,438 --> 00:23:45,017
as black and white anyway, video usually is represented as a high resolution luma channel, 

318
00:23:45,017 --> 00:23:51,041
the black & white, along with additional, often lower resolution chroma channels, the color. 

319
00:23:51,041 --> 00:23:57,074
The luma channel, Y, is produced by weighting and then adding the separate red, green and blue signals.  

320
00:23:57,074 --> 00:24:01,867
The chroma channels U and V are then produced by subtracting the luma signal from blue 

321
00:24:01,867 --> 00:24:04,070
and the luma signal from red.

322
00:24:04,070 --> 00:24:11,750
When YUV is scaled, offset and quantized for digital video, it's usually more correctly called Y'CbCr, 

323
00:24:11,750 --> 00:24:15,238
but the more generic term YUV is widely used to describe 

324
00:24:15,238 --> 00:24:18,301
all the analog and digital variants of this color model.

325
00:24:18,912 --> 00:24:22,983
The U and V chroma channels can have the same resolution as the Y channel, 

326
00:24:22,983 --> 00:24:28,674
but because the human eye has far less spatial color resolution than spatial luminosity resolution, 

327
00:24:28,674 --> 00:24:34,346
chroma resolution is usually halved or even quartered in the horizontal direction, the vertical direction, 

328
00:24:34,346 --> 00:24:39,528
or both, usually without any significant impact on the apparent raw image quality. 

329
00:24:39,528 --> 00:24:43,942
Practically every possible subsampling variant has been used at one time or another,

330
00:24:43,942 --> 00:24:46,875
but the common choices today are 

331
00:24:46,875 --> 00:24:51,187
4:4:4 video, which isn't actually subsampled at all, 

332
00:24:51,187 --> 00:24:56,711
4:2:2 video in which the horizontal resolution of the U and V channels is halved, 

333
00:24:56,711 --> 00:25:02,587
and most common of all, 4:2:0 video in which both the horizontal and vertical resolutions 

334
00:25:02,587 --> 00:25:08,897
of the chroma channels are halved, resulting in U and V planes that are each one quarter the size of Y.

335
00:25:08,897 --> 00:25:17,096
The terms 4:2:2, 4:2:0, 4:1:1 and so on and so forth, aren't complete descriptions of a chroma subsampling. 

336
00:25:17,096 --> 00:25:21,186
There's multiple possible ways to position the chroma pixels relative to luma, 

337
00:25:21,096 --> 00:25:24,776 
and again, several variants are in active use for each subsampling.  

338
00:25:24,776 --> 00:25:32,502
For example, motion JPEG, MPEG-1 video, MPEG-2 video, DV, Theora and WebM all use 

339
00:25:32,502 --> 00:25:38,137
or can use 4:2:0 subsampling, but they site the chroma pixels three different ways.

340
00:25:38,498 --> 00:25:43,023
Motion JPEG, MPEG1 video, Theora and WebM all site chroma pixels 

341
00:25:43,023 --> 00:25:46,345
between luma pixels both horizontally and vertically.

342
00:25:46,345 --> 00:25:51,989
MPEG2 sites chroma pixels between lines, but horizontally aligned with every other luma pixel. 

343
00:25:51,989 --> 00:25:57,106
Interlaced modes complicate things somewhat, resulting in a siting arrangement that's a tad bizarre.

344
00:25:57,106 --> 00:26:00,909
And finally PAL-DV, which is always interlaced, places the chroma pixels 

345
00:26:00,909 --> 00:26:04,398
in the same position as every other luma pixel in the horizontal direction, 

346
00:26:04,398 --> 00:26:07,303
and vertically alternates chroma channel on each line.

347
00:26:07,683 --> 00:26:12,282
That's just 4:2:0 video. I'll leave the other subsamplings as homework for the
viewer.  

348
00:26:12,282 --> 00:26:14,882
You've got the basic idea, moving on.

349
00:26:15,511 --> 00:26:21,128
In audio, we always represent multiple channels in a PCM stream by interleaving the samples 

350
00:26:21,128 --> 00:26:26,383
from each channel in order. Video uses both packed formats that interleave the color channels, 

351
00:26:26,383 --> 00:26:30,584
as well as planar formats that keep the pixels from each channel together in separate planes 

352
00:26:30,584 --> 00:26:35,415
stacked in order in the frame. There are at least 50 different formats in these two broad categories 

353
00:26:35,415 --> 00:26:41,549
with possibly ten or fifteen in common use. Each chroma subsampling and different bit-depth requires 

354
00:26:41,549 --> 00:26:46,574
a different packing arrangement,  and so a different pixel format.  For a given unique subsampling, 

355
00:26:46,574 --> 00:26:50,858 
there are usually also several equivalent formats that consist of trivial channel order 

356
00:26:50,858 --> 00:26:55,966
rearrangements or repackings due either to convenience once-upon-a-time on some particular 

357
00:26:55,966 --> 00:27:00,352
piece of hardware or sometimes just good old-fashioned spite.

358
00:27:00,352 --> 00:27:04,692
Pixels formats are described by a unique name or fourcc code.  

359
00:27:04,692 --> 00:27:08,115
There are quite a few of these and there's no sense going over each one now.

360
00:27:08,115 --> 00:27:13,704
Google is your friend.  Be aware that fourcc codes for raw video specify the pixel arrangement 

361
00:27:13,704 --> 00:27:20,339
and chroma subsampling, but generally don't imply anything certain about chroma siting or color space.  

362
00:27:20,339 --> 00:27:25,807
YV12 video to pick one, can use JPEG, MPEG-2 or DV chroma siting, 

363
00:27:25,807 --> 00:27:28,991
and any one of several YUV colorspace definitions.

364
00:27:29,472 --> 00:27:33,913
That wraps up our not so quick and yet very incomplete tour of raw video. 

365
00:27:33,913 --> 00:27:38,651
The good news is we can already get quite a lot of real work done using that overview. 

366
00:27:38,651 --> 00:27:42,528
In plenty of situations, a frame of video data is a frame of video data.  

367
00:27:42,528 --> 00:27:46,451
The details matter, greatly, when it come time to write software, 

368
00:27:46,452 --> 00:27:52,086
but for now I am satisfied that the esteemed viewer is broadly aware of the relevant issues.

369
00:27:55,640 --> 00:27:59,230
So. We have audio data. We have video data. 

370
00:27:59,230 --> 00:28:03,246
What remains is the more familiar non-signal data and straight up engineering 

371
00:28:03,246 --> 00:28:07,410
that software developers are used to. And plenty of it!

372
00:28:07,928 --> 00:28:11,768 
Chunks of raw audio and video data have no externally visible structure, 

373
00:28:11,768 -->  00:28:15,173
but they're often uniformly sized.  We could just string them together 

374
00:28:15,173 --> 00:28:18,097
in a rigid pre-determined ordering for streaming and storage, 

375
00:28:18,097 --> 00:28:21,040
and some simple systems do approximately that. 

376
00:28:21,040 --> 00:28:24,195
Compressed frames though aren't necessarily a predictable size, 

377
00:28:24,195 --> 00:28:29,405
and we usually want some flexibility in using a range of different data types in streams.

378
00:28:29,405 --> 00:28:34,281
If we string random formless data together, we lose the boundaries that separate frames 

379
00:28:34,281 --> 00:28:37,871
and don't necessarily know what data belongs to which streams.  

380
00:28:37,871 --> 00:28:42,192
A stream needs some generalized structure to be generally useful.

381
00:28:42,192 --> 00:28:46,606
In addition to our signal data, we also have our PCM and video parameters.  

382
00:28:46,606 --> 00:28:49,752
There's probably plenty of other metadata we also want to deal with, 

383
00:28:49,752 --> 00:28:55,415
like audio tags and video chapters and subtitles, all essential components of rich media.  

384
00:28:55,415 --> 00:29:01,633
It makes sense to place this metadata, that is,  data about the data, within the media itself.

385
00:29:01,633 --> 00:29:06,445
Storing and structuring formless data and disparate metadata is the job of a container.  

386
00:29:06,445 --> 00:29:09,221
Containers provide framing for the data blobs, 

387
00:29:09,221 --> 00:29:12,015
interleave and identify multiple data streams, 

388
00:29:12,015 --> 00:29:15,337
provide timing information, and store the metadata necessary 

389
00:29:15,337 --> 00:29:19,140
to parse, navigate, manipulate and present the media.  

390
00:29:19,140 --> 00:29:22,222
In general, any container can hold any kind of data.  

391
00:29:22,222 --> 00:29:24,970
And data can be put into any container.

392
00:29:28,801 --> 00:29:32,391 
In the past thirty minutes, we've covered digital audio, video, 

393
00:29:32,391 --> 00:29:35,435
some history, some math and a little engineering. 

394
00:29:35,435 --> 00:29:39,377
We've barely scratched the surface, but it's time for a well earned break.

395
00:29:41,107 --> 00:29:45,373
There's so much more to talk about, so I hope you'll join me again in our next episode.  

396
00:29:45,373 --> 00:29:47,159
Until then--- Cheers!