(An incomplete guide)
Preface
Overview of tensors
As Edward Z. Yang covered in his 2019 writeup, PyTorch Internals:

> The tensor is the central data structure in PyTorch... an n-dimensional data structure containing some scalar type, e.g., floats, ints, etc. We can think of a tensor as consisting of some data, and then some metadata describing the size of the tensor, the type of the elements it contains (dtype), what device the tensor lives on [and] the stride.

- The stride is used to provide views onto the tensor's underlying data in memory (the storage).
- Operations (such as `mm`, matrix multiplication) involve a device/sparsity-dependent dynamic dispatch followed by a dtype-dependent dispatch (a simple switch statement for the kernel's supported dtypes).
- Tensor layout doesn't have to be dense and strided: it can be sparse, MKL-DNN, etc. (thanks to extensions).
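As a minimal sketch of the size/stride metadata (plain PyTorch, nothing beyond the standard API assumed):

```python
import torch

x = torch.arange(12, dtype=torch.float32).reshape(3, 4)
print(x.dtype, x.device, x.size())   # torch.float32 cpu torch.Size([3, 4])
print(x.stride())                    # (4, 1): jump 4 elements for a row, 1 for a column

# A transpose is just another view over the same storage: only the strides change
y = x.t()
print(y.stride())                    # (1, 4)
print(x.data_ptr() == y.data_ptr())  # True: same underlying memory
```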
Automatic differentiation in brief
PyTorch implements reverse mode automatic differentiation (AD):
In reverse accumulation AD, the dependent variable to be differentiated is fixed and the derivative is computed with respect to each sub-expression recursively. In a pen-and-paper calculation, the derivative of the outer functions is repeatedly substituted in the chain rule: $\frac{\partial{y}}{\partial{x}} = \frac{\partial{y}}{\partial{w_1}} \frac{\partial{w_1}}{\partial{x}} = \frac{\partial{y}}{\partial{w_2}} \frac{\partial{w_2}}{\partial{w_1}} \frac{\partial{w_1}}{\partial{x}} = \frac{\partial{y}}{\partial{w_3}} \frac{\partial{w_3}}{\partial{w_2}} \frac{\partial{w_2}}{\partial{w_1}} \frac{\partial{w_1}}{\partial{x}} = \ldots$
In reverse accumulation, the quantity of interest is the adjoint, denoted with a bar ($\bar{w}$); it is the derivative of a chosen dependent variable with respect to a subexpression $w$: $\bar{w} = \frac{\partial{y}}{\partial{w}}$
Reverse accumulation traverses the chain rule from outside to inside, or in the case of the computational graph in Figure 3, from top to bottom [from complete function down to its component variables]
Yang describes this as follows:
we effectively walk the forward computations "backward" to compute the gradients.
Technically, the variables we call `grad_` "are not really gradients; they're really Jacobians left-multiplied by a vector."

See example 1 here for a concise explanation. Note that the Jacobian of a scalar-valued loss function is simply a row vector, so the innermost parentheses (bracketing the loss gradient, by the associativity of matrix multiplication) turn every step of the chain-rule product into a vector-Jacobian product; no full Jacobian ever needs to be materialised.
See Kevin Clark's CS224n notes for more formal coverage
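A minimal sketch of this vector-Jacobian view, using only the standard `backward()` API (the numbers are arbitrary):

```python
import torch

x = torch.randn(3, requires_grad=True)
y = x * 2                       # non-scalar output

# backward() on a non-scalar needs the "vector" v of the vector-Jacobian product v^T J
v = torch.tensor([1.0, 0.5, 0.25])
y.backward(gradient=v)
print(x.grad)                   # equals 2 * v: v left-multiplied by the (diagonal) Jacobian dy/dx

# For a scalar loss the vector defaults to 1.0, giving the ordinary gradient
```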
Core methods
Basic info
- `dim()` / `ndimension()` – the number of dimensions
- `nelement()` / `numel()` – total number of elements in the tensor
- `size()` – the size of the tensor (also available as `.shape`)
- `item()` – the value as a standard Python numerical type, for tensors with one element
- `view(*shape)` – gives a new tensor with the same underlying data but a different shape (a "view" on it)
- `view_as(other)` – views a tensor as the same size as the `other` tensor
- `clone()` – creates a copy of the tensor (with its own storage) that keeps the autograd relationship, so gradients flow back to the original
- `detach()` – creates a tensor that shares the same data/storage but is permanently 'detached' from the automatic differentiation graph, like constructing the tensor with the `requires_grad=False` parameter
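A small sketch of the clone/detach distinction (standard API; the values are arbitrary):

```python
import torch

a = torch.ones(3, requires_grad=True)

b = a.clone()     # new storage, still part of the autograd graph
c = a.detach()    # shares storage with a, but has no autograd history

b.sum().backward()
print(a.grad)               # tensor([1., 1., 1.]): gradients flow back through the clone

print(c.requires_grad)      # False
c[0] = 99.0                 # in-place change is visible through a, since the storage is shared
print(a[0].item())          # 99.0
```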
Basic operations
- `apply_(callable)` – applies the function `callable` to each element, replacing the element with the return value
- `backward(gradient=None, retain_graph=None, create_graph=False, inputs=None)` – computes the gradient of the tensor w.r.t. graph leaves and 'accumulates' them in the leaves. Must pass `gradient` (a tensor of the same dtype and device) if non-scalar, containing the gradient of the differentiated function w.r.t. the tensor.

Preferred Networks explains how autograd operates on graphs:
- autograd overview – how PyTorch ships basic derivatives for the functions that comprise the graph nodes, and differentiates by Jacobian-vector products upon calling `backward()`
- how graphs are constructed – where the components of autograd live, how C++ code is built for the derivatives, how the use of `requires_grad=True` in tensor instantiation sets off construction of an autograd metadata computational graph, and how operation gradients are passed as pointers (C++ source code is shown, namely templates in `.h` header files)
- how graphs are executed – runs through what happens when calling `Tensor.backward()`, leading to `torch.autograd.backward()` checking the inputs and calling the C++ layer, or when calling `torch.autograd.grad()` (which returns a tuple instead of populating the `.grad` field of the Tensor objects)
- `map_(tensor, callable)` – applies the function `callable` to each element and the given `tensor` and stores the results in `self` (`self` and the given `tensor` must be broadcastable)
- `register_hook(hook)` – registers a backward hook (a function taking a single argument, `grad`), which will be called every time a gradient w.r.t. the tensor is computed, and returns either a new gradient to use in place of `grad`, or `None`
- `t()` – transposes dimensions 0 and 1 of a $\leq$ 2D tensor (0D and 1D tensors are returned as-is)
- `zero_()` – fills the tensor with zeros
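A short sketch of `register_hook` (standard API; the ×10 scaling is purely illustrative):

```python
import torch

x = torch.ones(2, requires_grad=True)
y = (x * 3).sum()

# The hook sees the gradient flowing into x and may return a replacement
handle = x.register_hook(lambda grad: grad * 10)

y.backward()
print(x.grad)      # tensor([30., 30.]): the original gradient of 3, scaled by the hook

handle.remove()    # hooks can be removed when no longer needed
```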
Creating new tensors
Remember
These methods by default return the same dtype and device as the tensor they're called on.
- `new_empty(size, dtype=None, device=None, requires_grad=False)` – creates a tensor of size `size` filled with uninitialised data
- `new_full(size, fill_value, dtype=None, device=None, requires_grad=False)` – creates a tensor of size `size` filled with `fill_value`
- `new_ones(size, dtype=None, device=None, requires_grad=False)` – creates a tensor of size `size` filled with $1$
- `new_zeros(size, dtype=None, device=None, requires_grad=False)` – creates a tensor of size `size` filled with $0$
- `new_tensor(data, dtype=None, device=None, requires_grad=False)` – creates a tensor with copied `data`. Implicitly constructs a leaf variable. Prefer (the equivalent) `x.clone().detach()` over initialising a copy of an existing tensor with this method
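A minimal sketch of the dtype/device inheritance (the float16 base tensor is just an example):

```python
import torch

base = torch.zeros(2, 2, dtype=torch.float16)

# new_* methods inherit dtype and device from `base` unless overridden
a = base.new_full((3,), 7.0)
print(a.dtype, a.shape)            # torch.float16 torch.Size([3])

src = torch.arange(4.0, requires_grad=True)
copy = src.clone().detach()        # preferred over src.new_tensor(src)
print(copy.requires_grad)          # False
```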
Memory management
- `contiguous(memory_format=torch.contiguous_format)` – return a contiguous in-memory tensor containing the same data. If already in the specified `memory_format`, return the original tensor
- `cpu(memory_format=torch.preserve_format)` – copy into CPU memory (or return the original if it's already there)
- `cuda(device=None, non_blocking=False, memory_format=torch.preserve_format)` – copy into CUDA memory (or return the original if it's already there)
- `data_ptr()` – return the address of the first element
- `element_size()` – size of an individual element, in bytes
- `copy_(src, non_blocking=False)` – copies elements from `src` (which may be of a different dtype or device) into the tensor, and returns it. The `non_blocking` flag only applies to CPU–GPU transfers
- `get_device()` – gives the device ordinal of the GPU for CUDA tensors, or throws an error for CPU tensors
- `pin_memory()` – copies to pinned memory, if not already pinned
- `requires_grad_(requires_grad=True)` – change whether autograd should record operations on this tensor. Sets the `requires_grad` attribute in-place and returns the tensor
- `retain_grad()` – enables the tensor to have its `grad` populated during `backward()` (no-op for leaf tensors)
- `set_(source=None, storage_offset=0, size=None, stride=None)` – sets the underlying storage (`source` and its offset), `size` and `stride`. If `source` is a tensor, share the same storage and match its size and strides, such that changes will be reflected between the two
- `share_memory_()` – moves the underlying storage to shared memory (no-op if already in shared memory or for CUDA tensors); the tensor cannot then be resized
- `storage()` – returns the underlying storage
- `storage_offset()` – returns the tensor's offset in the underlying storage (in units of storage elements, not bytes)
- `storage_type()` – returns the underlying storage's type
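A brief sketch of why `contiguous()` matters after a stride-changing view (standard API):

```python
import torch

x = torch.arange(6.0).reshape(2, 3)
t = x.t()                            # a non-contiguous view: strides become (1, 3)

print(t.is_contiguous())             # False
# t.view(-1) would raise here, since view() needs compatible memory
flat = t.contiguous().view(-1)       # copy into contiguous memory first, then view
print(flat)

print(t.storage_offset(), t.element_size())   # 0 4 (float32 elements are 4 bytes)
```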
Type conversion
- `type(dtype=None)` – returns the dtype if no `dtype` passed, or else casts to the given type
- `type_as(tensor)` – casts to the type of the given `tensor` (no-op if already of that type), equivalent to `t.type(tensor.type())`
- `tolist()` – converts to a list just like NumPy's `ndarray.tolist()` method, with all the items within the tensor becoming base Python types
- `as_subclass(cls)` – makes a `cls` instance with the same data pointer (`cls` must be a subclass of `Tensor`)
- `numpy()` – returns the tensor as a NumPy `ndarray`, sharing the same underlying storage (thus changes to one will be reflected in the other)
- `deg2rad()` – converts from angles in degrees to radians
- `rad2deg()` – converts from angles in radians to degrees
- `to(dtype, non_blocking=False, copy=False, memory_format=torch.preserve_format)` / `to(device=None, dtype=None, non_blocking=False, copy=False, memory_format=torch.preserve_format)` – performs dtype and/or device conversion (inferred from args/kwargs), copying if different from the original's
- `to_mkldnn()` – copy as `torch.mkldnn` layout
dtype conversion
The following methods on a tensor convert to a particular dtype:
- `bool()` – convert to booleans via `torch.to` (typically used for device movement but also does dtype conversion)
- `float()` – convert to `torch.float32` (32-bit floating point)
- `int()` – convert to `torch.int32` (32-bit integer)
- `short()` – convert to `torch.int16` (16-bit integer)
- `long()` – convert to `torch.int64` (64-bit integer)
- `half()` – convert to `torch.float16` (16-bit floating point)
- `double()` – convert to `torch.float64` (64-bit floating point)
- `bfloat16()` – convert to `torch.bfloat16` (Brain 16-bit floating point); see "What Is Bfloat16 Arithmetic?" by Nick Higham. bfloat16 "allocates 8 bits for the significand and 8 bits for the exponent (the same exponent size as fp32), c.f. fp16's 11 for the significand but only 5 for the exponent", as NNs are "far more sensitive to the size of the exponent" (Wang and Kanwar, 2019). Google TPUs and NVIDIA A100s support it.
- `byte()` – convert to `torch.uint8` (8-bit unsigned integer)
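A quick sketch of a few of these conversions (values chosen only to show truncation and the zero/non-zero rule):

```python
import torch

x = torch.tensor([0.0, 1.5, -2.0])
print(x.long())                             # tensor([ 0,  1, -2]): truncates toward zero
print(x.bool())                             # tensor([False,  True,  True])
print(x.half().dtype, x.bfloat16().dtype)   # torch.float16 torch.bfloat16
```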
Remember
`short` and `long` are both integer types, analogous to the `half` and `double` float types.
See also:
- What every user should know about mixed precision training in PyTorch
- Automatic mixed precision for faster training on NVIDIA GPUs
- `torch.amp` – Automatic Mixed Precision package
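As a minimal mixed-precision sketch (assuming a reasonably recent PyTorch; CPU is used here for portability, swap in `device_type="cuda"` on a GPU):

```python
import torch

x = torch.randn(8, 4)
w = torch.randn(4, 4)

with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y = x @ w          # matmuls run in the lower-precision dtype inside the region
print(y.dtype)         # torch.bfloat16
```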
Arithmetic
Simple arithmetic
- `sign()` – signs of the elements
- `abs()` / `absolute()` – absolute (non-negative) values of the elements
- `add(other, *, alpha=1)` – add, which takes a keyword `alpha` for a scalar multiplier
- `sub(other, *, alpha=1)` / `subtract(other, *, alpha=1)` – subtract, which takes a keyword `alpha` for a scalar multiplier
- `mul()` / `multiply()` – multiply (element-wise, or by a scalar)
- `div(value, *, rounding_mode=None)` / `divide(value, *, rounding_mode=None)` – division
- `true_divide(value)` – alias for `t.div(rounding_mode=None)`
- `dot()` – dot product/inner product
- `exp()` – raise $e$ to the power of each element
- `exp2()` – raise 2 to the power of each element
- `expm1()` – raise $e$ to the power of each element, then subtract 1
- `frac()` – get the fractional part (after the decimal point) of each float
- `gcd()` – get the greatest common divisor of each pair of integers
- `log()` – natural logarithm, $\ln$
- `log10()` – logarithm base 10
- `log1p()` – natural logarithm of `(1 + input)`
- `log2()` – logarithm base 2
- `matmul(tensor2)` – matrix multiplication, broadcasts inputs (usually use `@` instead)
- `mm(mat2)` – matrix multiplication, does not broadcast
- `mv(vec)` – matrix–vector product, does not broadcast
- `mean(dim=None, keepdim=False, *, dtype=None)` – mean average; `dim` can be a tuple
- `median(dim=-1, keepdim=False)` – median average; `dim` can be an integer, else the last dimension is used. Not unique for `input` tensors with an even number of elements in `dim`: in this case the lower of the two medians is returned. To compute the mean of both medians, use `quantile(q=0.5)` instead. `indices` does not necessarily contain the first occurrence of each median value (unless it is unique). Results will vary based on device; likewise, do not expect the gradients to be deterministic
- `mode(dim=-1, keepdim=False)` – mode average; can take `dim`, otherwise assumes the last dimension
- `reciprocal()` – reciprocal of the elements
- `sqrt()` – square root of the elements
- `square()` – square of the elements
- `sum(dim=None, keepdim=False, dtype=None)` – sum of the elements
- `diff(n=1, dim=-1, prepend=None, append=None)` – `n`'th forward difference in the given dimension (default: last `dim`)
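A small sketch of the `alpha` and `rounding_mode` keywords (arbitrary values):

```python
import torch

a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([10.0, 20.0, 30.0])

print(a.add(b, alpha=0.1))               # a + 0.1 * b   -> tensor([2., 4., 6.])
print(b.div(a, rounding_mode="floor"))   # floor division -> tensor([10., 10., 10.])
print(a.sub(b, alpha=2))                 # a - 2 * b     -> tensor([-19., -38., -57.])
```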
Matrix arithmetic
- `addbmm(batch1, batch2, *, beta=1, alpha=1)` – batched matrix–matrix product with a reduced add step, accumulating all matmuls along the first dimension
- `addcdiv(tensor1, tensor2, *, value=1)` / `addcmul(tensor1, tensor2, *, value=1)` – divides/multiplies `tensor1` by `tensor2` element-wise, multiplies the result by the scalar `value` and adds it to the input
- `addmm(mat1, mat2, *, beta=1, alpha=1)` – matrix multiplication of `mat1` by `mat2`, added to the input (`alpha` scales the matmul product and `beta` scales the added input matrix)
- `addmv(mat, vec, *, beta=1, alpha=1)` – matrix–vector product of `mat` and `vec`, added to the input (`alpha` scales the product and `beta` scales the added input vector)
- `addr(vec1, vec2, *, beta=1, alpha=1)` – outer product of the vectors `vec1` and `vec2`, added to the input (`alpha` scales the outer product and `beta` scales the added input matrix)
- `baddbmm(batch1, batch2, *, beta=1, alpha=1)` – batched matrix–matrix product, added to the input (`alpha` scales the matrix–matrix product and `beta` scales the added input matrix)
- `bmm(batch2)` – batched matrix–matrix product of the matrices in the source tensor and `batch2`, which must both be 3D and contain the same number of matrices
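A compact sketch of `addmm`/`bmm` (shapes are arbitrary):

```python
import torch

M = torch.zeros(2, 3)
mat1, mat2 = torch.randn(2, 4), torch.randn(4, 3)

out = M.addmm(mat1, mat2, beta=1, alpha=2)       # beta * M + alpha * (mat1 @ mat2)
print(torch.allclose(out, 2 * (mat1 @ mat2)))    # True, since M is all zeros

batch1, batch2 = torch.randn(5, 2, 4), torch.randn(5, 4, 3)
print(batch1.bmm(batch2).shape)                  # torch.Size([5, 2, 3])
```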
More arithmetic
- `copysign(other)` – creates a new floating point tensor with the same magnitude but the sign of `other`, element-wise
- `cross(other, dim=None)` – vector cross-product in dimension `dim`.
  Warning: possible unexpected behaviour
  If `dim` is not given, it defaults to the first dimension found with size 3
- `cumprod(dim, dtype=None)` – cumulative product of elements in the dimension `dim`
- `cumsum(dim, dtype=None)` – cumulative sum of elements in the dimension `dim`
- `floor_divide(value)` – Deprecated. To actually perform floor division, use `.div(rounding_mode="floor")`
- `true_divide(dividend, divisor, *, out)` – alias for `div(rounding_mode=None)`
- `eq(other)` / `equal(other)` – element-wise equality
- `float_power(exponent)` – raises element-wise to the power of `exponent`, in double precision
- `fmod(divisor)` – applies C++'s `std::fmod` element-wise. `a.fmod(b)` is equivalent to `a - a.div(b, rounding_mode="trunc") * b`
- `frexp()` – decompose into mantissa and exponent tensors
- `inner(other)` – dot product for 1D tensors; sum of the element-wise product with `other` along their last dimension for multi-dimensional tensors. Equivalent to `.mul(other)` for scalars, else to `torch.tensordot(dims=([-1], [-1]))`
- `ge(other)` / `greater_equal(other)` – element-wise $\geq$ inequality check
- `gt()` / `greater()` – element-wise $\gt$ inequality check
- `lt()` / `less()` – element-wise $\lt$ inequality check
- `le()` / `less_equal()` – element-wise $\leq$ inequality check
- `ne(other)` / `not_equal(other)` – computes $\ne$ element-wise
- `neg()` / `negative()` – takes the negative (flips the sign) of the elements
- `remainder(divisor)` – computes the modulus element-wise
- `rsqrt()` – reciprocal of the square root, element-wise
- `pow(exponent)` – raises each element to the power `exponent` for a scalar exponent, or broadcasts if `exponent` is a tensor
- `prod(dim=None, keepdim=False, dtype=None)` – the product of all elements in the input tensor (if no `dim` specified, first flattened)
- `sum_to_size(*size)` – sum the tensor to `size`, which must be broadcastable (in other words, sum along any axes that differ from the current tensor `shape`); see the sketch after this list. `sum_to_size` is `expand` backwards: "Just as broadcasting is inserting implicit expands, the autograd engine will insert implicit 'expand backwards' in the form of `sum_to_size`" – Thomas Viehmann
- `lcm(other)` – lowest/least common multiple, element-wise with another integer-dtype tensor
- `ldexp(other)` – multiplies by $2^{other}$
- `lerp(end, weight)` – linearly interpolates with `end` based on a scalar/tensor `weight`
- `xlogy(other)` – multiplies by $\log(other)$, similar to SciPy's `scipy.special.xlogy`
- `vdot(other)` – computes the dot product of a 1D tensor with another
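A minimal sketch of `sum_to_size` as the reduction that undoes broadcasting (standard API):

```python
import torch

a = torch.ones(3, 4)
print(a.sum_to_size(1, 4))   # shape (1, 4), every value 3: summed over dim 0
print(a.sum_to_size(3, 1))   # shape (3, 1), every value 4: summed over dim 1

# This is exactly the reduction autograd applies to gradients of broadcast operands
x = torch.ones(1, 4, requires_grad=True)
(x + torch.ones(3, 4)).sum().backward()
print(x.grad)                # tensor([[3., 3., 3., 3.]]): the broadcast dim was summed back
```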
Logical && bitwise operators
- `logical_and(other)` / `logical_not()` / `logical_or(other)` / `logical_xor(other)` – element-wise logical AND ($\wedge$), NOT ($\neg$), OR ($\vee$), XOR ($\oplus$), where zeros are treated as `False` and non-zeros as `True`
- `bitwise_and(other)` / `bitwise_not()` / `bitwise_or(other)` / `bitwise_xor(other)` – bitwise AND/NOT/OR/XOR (logical for bool tensors)
- `bitwise_left_shift(other)` / `bitwise_right_shift(other)` – left/right arithmetic shift by `other` bits on an integer tensor
Rounding
- `round(decimals=0)` – round to the closest integer (or to `decimals` decimal places)
- `ceil()` – give the smallest integer greater than or equal to each element
- `floor()` – give the largest integer less than or equal to each element
- `clamp(min, max)` / `clip(min, max)` – constrain the values to the range $[min, max]$
- `clamp_min()` / `clamp_max()` – one-sided (lower/upper bound) `clamp` (`clip`)
- `trunc()` / `fix()` – truncate to the integer part (regardless of sign) of a float
Sorting by and querying for extreme values
- `sort(dim=-1, descending=False)` – sort elements along a given dimension (default: last) into order (default: ascending) by value, returning a named tuple `(values, indices)`
- `argsort(dim=-1, descending=False)` – return the indices that sort elements along a given dimension (default: last) into order (default: ascending) by value
- `min()` / `minimum()` – minimum
- `max()` / `maximum()` – maximum
- `topk(k, dim=None, largest=True, sorted=True)` – get the `k` largest elements along a given dimension (default: last dimension)
- `argmin()` / `argmax()` – index of the minimum/maximum value
- `argwhere()` – indices of the non-zero values
- `cummin(dim)` / `cummax(dim)` – values and indices for the cumulative minimum/maximum of elements in the given dimension
- `amin(dim=None, keepdim=False)` / `amax(dim=None, keepdim=False)` / `aminmax(*, dim=None, keepdim=False)` – minimum, maximum, or both for each slice in the dimension `dim`.
  Differences to `max()`/`min()`:
  - supports reducing on multiple dimensions
  - doesn't return indices
  - evenly distributes gradient between equal values (whereas `max`/`min` only propagate gradient to a single index in the source tensor)
- `fmin(other)` / `fmax(other)` – element-wise minimum/maximum (wraps C++'s `std::fmin`, and similar to NumPy's `fmin()`). Handles `NaN` differently to `min()`: if exactly one of the two elements in a comparison is `NaN` then the non-`NaN` element is taken as the minimum (so `NaN` only propagates if both are)
- `msort()` – sorts elements along the first dimension in ascending order by value
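A quick sketch of sorting and multi-dimension reduction (values arbitrary):

```python
import torch

x = torch.tensor([[3., 1., 2.],
                  [9., 7., 8.]])

vals, idx = x.sort(dim=-1)       # per-row ascending sort
print(vals)                      # tensor([[1., 2., 3.], [7., 8., 9.]])
print(x.topk(2).values)          # two largest per row: tensor([[3., 2.], [9., 8.]])
print(x.amin(dim=(0, 1)))        # reduce over both dims at once -> tensor(1.)
```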
Repetition
- `expand(*sizes)` – return a view with singleton dimensions expanded, with -1 indicating no change to that dimension
- `expand_as(other)` – expand [any singleton dimensions of] the tensor to the same size as another: equivalent to `expand(other.size())`
- `repeat(*sizes)` – repeat the tensor along the specified dimensions the given number of times (`sizes`), copying its data (similar to NumPy `tile`)
- `repeat_interleave(repeats, dim=None)` – repeat elements the given number of times [broadcast to fit the axis], along axis `dim`, else by default use the flattened array (similar to NumPy `repeat`); `repeats` can be a tensor or an int
- `tile(dims)` – repeat the tensor the number of times given by `dims` along each dimension, copying its data (similar to NumPy `tile`)
- `unique(sorted=True, return_inverse=False, return_counts=False, dim=None)` – unique elements without repetition (eliminates non-consecutive duplicate values). Use `torch.unique_consecutive` instead if the input is sorted
- `unique_consecutive(return_inverse=False, return_counts=False, dim=None)` – eliminates duplicates after the first element from every consecutive group of equivalent elements
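A minimal sketch contrasting the view-based `expand` with the copying `repeat` family:

```python
import torch

x = torch.tensor([[1], [2], [3]])                  # shape (3, 1)

print(x.expand(-1, 4).shape)                       # (3, 4) view, no copy; -1 keeps that dim
print(x.repeat(2, 4).shape)                        # (6, 4) copy: each dim tiled
print(torch.tensor([1, 2]).repeat_interleave(3))   # tensor([1, 1, 1, 2, 2, 2])
```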
Dimension and sampling
- `all(dim=None, keepdim=False)` / `any(dim=None, keepdim=False)` – tests if all/any elements evaluate to `True` (like NumPy, converts to `bool` for all dtypes except `uint8`)
- `allclose(other, rtol=1e-05, atol=1e-08, equal_nan=False)` – checks whether all source and `other` elements satisfy the condition $\lvert input - other \rvert \leq atol + rtol \times \lvert other \rvert$ (behaves like NumPy's `allclose`)
- `count_nonzero(dim=None)` – counts non-zero values along the given `dim`, or in the entire tensor if no `dim` specified
- `where(condition, y)` – returns a tensor of elements selected from either the input or `y` depending on the `condition`
- `permute(*dims)` – view with dimensions permuted (reordered) as `dims`
- `unbind(dim=0)` – removes a tensor dimension, returning a tuple of slices along `dim` without it
- `gather(dim, index)` – gathers values from `index` along an axis `dim`
- `scatter_(dim, index, src)` – writes values from `src` at `index` along an axis `dim`
- `diagonal_scatter(src, offset=0, dim1=0, dim2=1)` – writes values from `src` along the diagonal elements of the input with respect to `dim1` and `dim2`
- `narrow(dim, start, length)` – view along dimension `dim` at position `start` for `length` items
- `take(index)` – make a new tensor from the values at `index`
- `select(dim, index)` – view a slice along the `dim` axis at `index`
- `fill_()` – fill the tensor with the specified value, in-place
- `fill_diagonal_(fill_value, wrap=False)` – fills the main diagonal of a multi-dimensional tensor in-place (all dimensions must be of equal length for $\gt$ 2D), 'wrapping' after $N$ columns for tall matrices (where $M \gt N$)
- `unfold(dimension, size, step)` – view all slices of the given `size` in the given `dimension`
- `roll(shifts, dims=None)` – shift the tensor along the given dimension(s) `dims`, flattening if no `dims` are specified before restoring the original shape (both `shifts` and `dims` can be an `int` or `int` tuple)
- `stride(dim)` – gives the integer jump necessary to go from one element to the next in the specified dimension `dim`, or a tuple of all strides if no `dim` specified
- `chunk(chunks, dim=0)` – view a tensor in a specific number of chunks along axis `dim`; the last will be smaller if the tensor size is indivisible by `chunks`
- `bincount(weights=None, minlength=0)` – count the frequency of each value in an array of non-negative integers. Can produce non-deterministic gradients (see the docs on randomness and reproducibility)
- `dsplit(split_size_or_sections)` – split a tensor with 3 or more dimensions into multiple views, depth-wise, according to `split_size_or_sections`
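A compact sketch of `gather` and `scatter_` indexing along `dim=1` (index values arbitrary):

```python
import torch

src = torch.tensor([[1., 2.], [3., 4.]])
idx = torch.tensor([[0, 0], [1, 0]])

# gather: out[i][j] = src[i][idx[i][j]]
print(src.gather(1, idx))        # tensor([[1., 1.], [4., 3.]])

# scatter_: dest[i][idx[i][j]] = vals[i][j]
dest = torch.zeros(2, 3)
vals = torch.tensor([[5., 6.], [7., 8.]])
dest.scatter_(1, idx, vals)
print(dest)                      # tensor([[6., 0., 0.], [8., 7., 0.]])
```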
More dimensions/sampling
- `as_strided()` – ...
- `broadcast_to()` – ...
- `histc()` – ...
- `histogram()` – ...
- `take_along_dim()` – ...
- `hsplit()` – ...
- `index_add(dim, index, source, *, alpha=1)` – adds the elements of `alpha` times `source` by adding to the indices in the order given in `index`
- `index_copy(dim, index, tensor2)` – copies the elements of `tensor2` into the source tensor by selecting the indices in the order given in `index`
- `index_fill(dim, index, value)` – fills elements of the source tensor with `value` by selecting the indices in the order given in `index`
- `index_put(indices, values, accumulate=False)` – puts `values` into the source tensor using the indices specified in `indices` (a tuple of tensors). The in-place version is equivalent to indexed assignment: `tensor.index_put_(indices, values)` $==$ `tensor[indices] = values`
- `index_reduce(dim, index, source, reduce, *, include_self=True)` – reduces the source tensor by selecting the indices in the order given in `index`, where the `reduce` argument is one of "prod", "mean", "amax", "amin"
- `index_select(dim, index)` – selects the indices of the source tensor in the order given in `index`
- `kthvalue()` – ...
- `masked_fill()` – ...
- `masked_scatter()` – ...
- `masked_select()` – ...
- `moveaxis()` – ...
- `movedim()` – ...
- `multinomial()` – ...
- `nextafter()` – ...
- `put_()` – ...
- `ravel()` – ...
- `split()` – ...
- `tensor_split()` – ...
- `var()` – ...
- `vsplit()` – ...
- `scatter_add()` – ...
- `scatter_reduce()` – ...
- `select_scatter()` – ...
- `slice_scatter()` – ...
- `swapaxes()` – ...
- `swapdims()` – ...
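A small sketch of the `index_*` family (out-of-place variants; values arbitrary):

```python
import torch

t = torch.zeros(5)
idx = torch.tensor([0, 2, 2])
src = torch.tensor([1., 1., 1.])

print(t.index_add(0, idx, src))   # tensor([1., 0., 2., 0., 0.]): repeated indices accumulate
print(torch.arange(10).index_select(0, torch.tensor([1, 3, 5])))   # tensor([1, 3, 5])
```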
Shape change
- `reshape(*shape)` – return a tensor with the same data and number of elements but the specified `shape`, as a view if compatible with the current shape
- `reshape_as(other)` – returns the tensor in the same shape as `other`, equivalent to `reshape(other.sizes())`, as a view if compatible with the current shape
- `resize_(*sizes, memory_format=torch.contiguous_format)` – resizes the tensor to the specified size, resizing the underlying storage if larger than the current storage size. Low-level method: prefer `reshape()`
- `resize_as_(tensor, memory_format=torch.contiguous_format)` – resizes the tensor to be the same size as the specified `tensor`, equivalent to `resize_(tensor.size())`
- `transpose(dim0, dim1)` – return a tensor with the same data but dimensions `dim0` and `dim1` swapped
- `flatten(start_dim=0, end_dim=-1)` – flattens a contiguous range of dimensions in a tensor
- `unflatten(dim, sizes)` – expands the dimension `dim` over multiple dimensions of sizes given by `sizes`
- `squeeze()` – remove all dimensions of size 1 ("singleton dimensions")
- `unsqueeze(dim)` – view the tensor with a singleton dimension inserted at the specified position
- `flip(dims)` – reverse the order of an n-dimensional tensor along the given axes `dims`
- `fliplr()` – flip the entries in each row left/right, equivalent to `input[:, ::-1]` (must be $\geq$ 2D)
- `flipud()` – flip the entries in each column up/down, equivalent to `input[::-1, ...]` (must be $\geq$ 1D)
- `tril(diagonal=0)` – get the lower triangular part of the matrix, or batches, $\pm$ `diagonal` diagonals above/below the main diagonal
- `triu(diagonal=0)` – get the upper triangular part of the matrix, or batches, $\pm$ `diagonal` diagonals above/below the main diagonal
- `rot90(k, dims)` – rotate an n-dimensional tensor by 90° in the `dims` plane, `k` times from the first towards the second axis if $k \gt 0$ or vice versa if $k \lt 0$
- `diag(diagonal=0)` – turns a 1D vector into a 2D diagonal matrix, or vice versa (main diagonal by default, or above/below as specified by $\pm$ `diagonal`)
- `diagflat(offset=0)` – puts a 1D vector along the diagonal of a 2D matrix, flattening if multi-dimensional
- `diagonal(offset=0, dim1=0, dim2=1)` – partial view with the diagonal elements in dimensions `dim1` and `dim2` as a new final dimension (i.e. a filled tensor made from diagonal elements, not a diagonal matrix)
- `diag_embed(offset=0, dim1=-2, dim2=-1)` – create a tensor whose diagonals of certain 2D planes are filled by the input (by default: the planes of the last 2 dimensions of the input) [and zero off the diagonals]
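A short shape-juggling sketch (shapes arbitrary):

```python
import torch

x = torch.arange(24).reshape(2, 3, 4)
print(x.flatten(start_dim=1).shape)   # torch.Size([2, 12])
print(x.unflatten(2, (2, 2)).shape)   # torch.Size([2, 3, 2, 2])
print(x.transpose(0, 2).shape)        # torch.Size([4, 3, 2])
print(torch.arange(3).diag())         # 3x3 matrix with 0, 1, 2 on the diagonal
```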
Linear algebra
- `cov(correction=1, fweights=None, aweights=None)` – estimate the covariance matrix
- `lstsq(A)` – compute a least-squares solution
- `outer()` / `ger()` – outer product
- `dist(other, p=2)` – the `p`-norm of $(input - other)$
- `inverse()` – inverse of a square matrix, or batches
- `det()` – determinant of a square matrix, or batches
- `logdet()` – log-determinant of a square matrix, or batches
- `cholesky()` – Cholesky factorise a symmetric positive-definite matrix, or batches
- `lu()` – LU factorise a matrix, or batches
- `qr()` – QR factorise a matrix, or batches
- `renorm(p, dim, maxnorm)` – calculate a tensor where each sub-tensor of the input along axis `dim` is normalised such that the sub-tensor's `p`-norm is $\lt$ `maxnorm`
- `svd()` – singular value decomposition of a real matrix, or batches
- `trace()` – sum the diagonal elements of a 2D matrix
- `kron()` – Kronecker product
- `adjoint()` – view conjugated and with the last two dimensions transposed

Be careful when using mixed precision training
Operations from `torch.linalg` can be sensitive to [im]precision (see AMP best practices).
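A tiny sketch of a few of these on a 2×2 matrix (SVD is called through `torch.linalg`, the currently documented entry point):

```python
import torch

A = torch.tensor([[2., 0.],
                  [0., 3.]])

print(A.det())                 # tensor(6.)
print(A.inverse())             # diag(0.5, 0.3333)
U, S, Vh = torch.linalg.svd(A)
print(S)                       # tensor([3., 2.]): singular values in descending order
print(A.trace())               # tensor(5.)
```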
More linear algebra
- `cholesky_inverse()` – ...
- `cholesky_solve()` – ...
- `corrcoef()` – ...
- `eig()` – ...
- `geqrf()` – ...
- `lu_solve()` – ...
- `matrix_exp()` – ...
- `matrix_power()` – ...
- `norm()` – ...
- `orgqr()` – ...
- `ormqr()` – ...
- `pinverse()` – ...
- `symeig()` – ...
- `std()` – ...
- `slogdet()` – ...
- `triangular_solve()` – ...
Missing values
- `nan_to_num(nan=0.0, posinf=None, neginf=None)` – replaces `NaN` with `nan` (default: zero), and $\pm\infty$ values with `posinf` and `neginf` (default: greatest/least finite value representable by the dtype)
- `nanmean(dim=None, keepdim=False, *, dtype=None)` – computes the mean of all non-`NaN` elements along the dimension(s) `dim`, equivalent to `t[~t.isnan()].mean(...)`
- `nanmedian(dim=None, keepdim=False)` – computes the median of all non-`NaN` elements along the dimension(s) `dim`, equivalent to `t[~t.isnan()].median(...)`
- `nanquantile(q, dim=None, keepdim=False, *, interpolation='linear')` – computes the quantiles of all non-`NaN` elements along the dimension(s) `dim`, equivalent to `t[~t.isnan()].quantile(...)`
- `nansum(dim=None, keepdim=False, dtype=None)` – computes the sum of all non-`NaN` elements along the dimension(s) `dim`, equivalent to `t[~t.isnan()].sum(...)`
Checks
(see also: arithmetic checks, `ne` etc.)
Distributions
- `bernoulli_(p=0.5, generator=None)` – fills each location with an independent sample from `Bernoulli(p)`
- `cauchy_(median=0, sigma=1, generator=None)` – fills each location with an independent sample from the Cauchy distribution
- `log_normal_(mean=1, std=2, generator=None)` – fills each location with samples from the log-normal distribution, whose underlying normal distribution is parameterised by $\mu$ = `mean` and $\sigma$ = `std`
- `normal_(mean=0, std=1, generator=None)` – fills each location with samples from the normal distribution parameterised by $\mu$ = `mean` and $\sigma$ = `std`
- `uniform_(from=0, to=1)` – fills each location with samples from the continuous uniform distribution
- `exponential_(lambd=1, *, generator=None)` – fills each location with samples from the exponential distribution, $\lambda e^{-\lambda x}$
- `geometric_(p, *, generator=None)` – fills each location with samples from the geometric distribution, $(1 - p)^{k - 1} p$
- `random_(from=0, to=None, *, generator=None)` – fills each location with samples from the discrete uniform distribution over $[from, to - 1]$, else bounded by the data type if not specified (for floating point types, the range will be $[0, 2^{mantissa}]$ to ensure every value is representable)
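A short sketch of the in-place fill style these methods share (seeded only for reproducibility):

```python
import torch

torch.manual_seed(0)

t = torch.empty(3, 3)
t.uniform_(0, 1)               # in-place: modifies t and returns it
t.normal_(mean=0.0, std=1.0)
t.bernoulli_(p=0.25)
print(t)                       # a 3x3 tensor of zeros and ones after the Bernoulli fill
```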
Machine learning
Main ML
- `logit()` – logit (a.k.a. log-odds), the natural logarithm of $\frac{p}{1-p}$ (for the input distribution conventionally written as $p$); will give `NaN` outside the range $[0,1]$
- `relu()` – rectified linear unit function, $\mathrm{ReLU}(x) = \max(0, x)$
- `softmax(dim=None)` – rescales/normalises so elements lie in the range $[0,1]$ and sum to 1, optionally along dimension `dim`: $\mathrm{Softmax}(x_i) = \exp(x_i) / \sum_j \exp(x_j)$
- `log_softmax(dim=None)` – the $\log$ of the softmax function, optionally along dimension `dim`
- `sigmoid()` – applies the function $\mathrm{Sigmoid}(x) = \frac{1}{1 + \exp(-x)}$
- `heaviside(values)` – applies the Heaviside step function, defined as 1 above 0, `values` at 0, and 0 below 0
- `hardshrink(lambd=0.5)` – leaves alone elements whose absolute value exceeds `lambd`, and zeros anything in the range $[-lambd, lambd]$
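A brief sketch of these activations on a toy vector:

```python
import torch

x = torch.tensor([-2.0, -0.3, 0.0, 0.3, 2.0])

print(x.relu())                   # negatives clamped to 0
print(x.softmax(dim=0).sum())     # tensor(1.): softmax normalises to sum to 1
print(x.sigmoid())                # values squashed into (0, 1)
print(x.hardshrink(lambd=0.5))    # tensor([-2., 0., 0., 0., 2.])
```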
Some more main ML
- `logcumsumexp()` – ...
- `logsumexp()` – ...
- `logaddexp()` – ...
- `logaddexp2()` – ...
- `quantile()` – ...
Quantisation
Quantisation refers to techniques for computation and memory access with lower-precision data, typically `int8` as compared to floating point, which introduces approximations that can lead to an accuracy gap (which these techniques attempt to minimise).
- `dequantize()` – ...
- `int_repr()` – ...
- `q_per_channel_axis()` – ...
- `q_per_channel_scales()` – ...
- `q_per_channel_zero_points()` – ...
- `q_scale()` – ...
- `q_zero_point()` – ...
- `qscheme()` – ...
Dimension naming
- `align_to()` – permute dimensions to the given order, adding size-one dimensions for any new names (returns a view)
- `align_as()` – permute dimensions to the order of the other tensor, adding size-one dimensions for any new names (returns a view)
- `refine_names()` – 'lift' unnamed dimensions using the given list of names
- `rename()` – rename dimension names using a list of `*names` or a mapping `**rename_map` (returns a view)
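A minimal named-tensor sketch (named tensors are a prototype feature, so expect a warning when running this):

```python
import torch

imgs = torch.randn(2, 3, 4, 4, names=("N", "C", "H", "W"))
print(imgs.names)                               # ('N', 'C', 'H', 'W')
print(imgs.align_to("N", "H", "W", "C").shape)  # torch.Size([2, 4, 4, 3])
print(imgs.rename(C="channels").names)          # ('N', 'channels', 'H', 'W')
```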
Special functions
- `i0()` – zeroth-order modified Bessel function of the first kind, element-wise
Trigonometric functions
- `angle()` – element-wise angle (in radians)
- `hypot(other)` – hypotenuse given the legs of a right-angled triangle
- `sin()` / `cos()` / `tan()` – sine/cosine/tangent
- `asin()` / `arcsin()`, `acos()` / `arccos()`, `atan()` / `arctan()` – inverse sine/cosine/tangent (a.k.a. arcsine/arccosine/arctangent)
- `asinh()` / `arcsinh()`, `acosh()` / `arccosh()`, `atanh()` / `arctanh()` – inverse hyperbolic sine/cosine/tangent. The domain of `atanh` is $(-1, 1)$: the limits map to $\mp\infty$, and values outside this interval map to `NaN`
- `atan2(other)` / `arctan2(other)` – arctangent with consideration of the quadrant, following the $(y, x)$ order convention (i.e. the source tensor is $y$, `other` is $x$)
- `sinh()` / `cosh()` / `tanh()` – hyperbolic sine/cosine/tangent
- `sinc()` – normalised sinc
Error functions
- `erf()` – ...
- `erfc()` – ...
- `erfinv()` – ...
Gamma functions
- `digamma()` – ...
- `igamma()` – ...
- `igammac()` – ...
- `lgamma()` – ...
- `polygamma()` – ...
- `mvlgamma()` – ...
STFT (Fourier) functions
- `stft()` – ...
- `istft()` – ...
Special data types
Sparse tensors
- `coalesce()` – ...
- `dense_dim()` – ...
- `values()` – returns the values tensor of a sparse COO tensor
- `indices()` – ...
- `narrow_copy()` – ...
- `smm()` – ...
- `sparse_dim()` – ...
- `sparse_mask()` – ...
- `sparse_resize_()` – ...
- `sparse_resize_and_clear_()` – ...
- `sspaddmm()` – ...
- `to_dense()` – ...
- `to_sparse()` – ...
Complex numbers
- `conj()` – ...
- `conj_physical()` – ...
- `resolve_conj()` – ...
- `resolve_neg()` – ...
- `sgn()` – ...