A decision tree can be used for both regression and classification problems. The idea is to partition the predictor space into non-overlapping regions. Then, for a given test observation, we predict using the training observations in the region in which that observation falls: for regression problems, we use the mean of those observations' responses, while for classification problems, we predict the most common class among them. This comes down to fitting the following model:
$$
f(x) = \sum_{j=1}^{J}c_j \cdot 1_{(x \in R_j)},
$$
where $R_1, \dots, R_J$ are the regions, and $c_j$ is the constant prediction for region $R_j$ (the average response for regression, the most common class for classification).
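To make the piecewise-constant form of this model concrete, here is a minimal sketch in Python, assuming a hypothetical one-dimensional predictor with two hand-picked regions and hard-coded region means (not fitted to any real data):

```python
import numpy as np

# Hypothetical example: one predictor, two regions split at x = 3,
# with region means c_1 and c_2 assumed to come from some training set.
regions = [(-np.inf, 3.0), (3.0, np.inf)]  # R_1, R_2
c = [10.0, 25.0]                           # constant prediction per region

def f(x):
    # f(x) = sum_j c_j * 1(x in R_j): exactly one indicator is active
    for (lo, hi), c_j in zip(regions, c):
        if lo <= x < hi:
            return c_j

print(f(1.5))  # 10.0, since 1.5 falls in R_1
print(f(4.2))  # 25.0, since 4.2 falls in R_2
```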
Ideally, we would consider all possible partitions $R_1, \dots, R_J$ and pick the one that minimizes training error, but that is computationally infeasible. Instead, we can use a greedy, top-down method such as recursive binary splitting, which successively splits the predictor space, each time choosing the split that reduces training error the most, until we reach a predetermined stopping criterion (typically until every region contains fewer than a certain number of training observations).
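In practice we rarely code this by hand; scikit-learn's `DecisionTreeRegressor` performs recursive binary splitting for us, with `min_samples_split` playing the role of the stopping criterion. A minimal sketch, assuming some synthetic data:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Hypothetical synthetic data: two predictors, noisy response.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

# Recursive binary splitting: keep splitting greedily, but never split
# a region that already contains fewer than 10 training observations.
tree = DecisionTreeRegressor(min_samples_split=10, random_state=0)
tree.fit(X, y)

# Print the first couple of levels of splits.
print(export_text(tree, feature_names=["x1", "x2"], max_depth=2))
```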
Now, if we build a decision tree that is too complex (i.e. if we create too many regions), the tree is prone to overfit the training data. To correct this, we can "prune" the tree by selecting a subtree with lower test error. Looking at all possible subtrees and estimating their test error is infeasible, so we consider only a subset of subtrees. With cost complexity pruning, we consider a range of values of a tuning parameter $\alpha$ and associate with each value the subtree with $J$ terminal regions that minimizes
$$
\sum_{j=1}^{J}\sum_{i: x_i \in R_j}(y_i - \bar{y}_{R_j})^2 + \alpha J.
$$
As $\alpha$ increases from 0, branches get pruned in a nested and predictable way, so the whole sequence of subtrees as a function of $\alpha$ is easy to compute. Thus, we can choose $\alpha$ using cross-validation, and then build the decision tree with the corresponding number of regions using the entire training data.
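scikit-learn exposes this procedure directly: `cost_complexity_pruning_path` returns the effective $\alpha$ values of the nested subtrees, and we can pick one by cross-validation. A sketch, again assuming hypothetical synthetic data:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

# Hypothetical training data (same style as the previous sketch).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

# Grow a large tree and get the effective alphas of its nested subtrees.
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)

# Estimate out-of-sample performance for each alpha with 5-fold CV.
scores = [
    cross_val_score(DecisionTreeRegressor(ccp_alpha=a, random_state=0), X, y, cv=5).mean()
    for a in path.ccp_alphas
]
best_alpha = path.ccp_alphas[int(np.argmax(scores))]

# Refit the pruned tree on the entire training data.
pruned = DecisionTreeRegressor(ccp_alpha=best_alpha, random_state=0).fit(X, y)
print(best_alpha, pruned.get_n_leaves())
```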
To measure training error, we can use the usual residual sum of squares (RSS) for regression:
$$
\text{RSS} = \sum_{j=1}^{J}\sum_{i:\, x_i \in R_j}(y_i - \bar{y}_{R_j})^2,
$$
where $\bar{y}_{R_j}$ is the mean of the training observations' responses in region $R_j$.
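To make the greedy criterion concrete, here is a small from-scratch sketch (with hypothetical helper names and made-up data) that computes the RSS of a region and finds the single cutpoint of one predictor that reduces total RSS the most:

```python
import numpy as np

def rss(y):
    """RSS of a single region: sum of squared deviations from its mean."""
    return float(np.sum((y - y.mean()) ** 2)) if len(y) else 0.0

def best_split(x, y):
    """One greedy step of recursive binary splitting for one predictor:
    try every cutpoint and return the one with the lowest total RSS."""
    best = (None, np.inf)
    for s in np.unique(x)[1:]:          # candidate cutpoints
        left, right = y[x < s], y[x >= s]
        total = rss(left) + rss(right)  # RSS summed over the two regions
        if total < best[1]:
            best = (s, total)
    return best

# Hypothetical data: the response jumps at x = 5, so the best cut is near 5.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = np.where(x < 5, 2.0, 8.0) + 0.5 * rng.normal(size=100)
print(best_split(x, y))
```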
In the classification setting, we use different measures when growing and pruning the tree. The Gini index (G) measures purity (whether the regions contain observations mostly from a single class) and is sensitive to small changes in purity, which makes it a good criterion for growing a decision tree:
$$
\text{G} = \sum_{j=1}^{J}\sum_{k=1}^{K}\hat{p}_{jk}(1 - \hat{p}_{jk}),
$$
where $\hat{p}_{jk}$ is the proportion of training observations in region $R_j$ that belong to class $k$. The classification error rate (E), on the other hand, directly reflects prediction accuracy, and is therefore preferable when pruning a tree:
$$
\text{E} = \sum_{j=1}^{J}\left(1 - \max_{k}\, \hat{p}_{jk}\right).
$$
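To see the two measures side by side, here is a short sketch that evaluates them on some hypothetical class proportions $\hat{p}_{jk}$:

```python
import numpy as np

# Hypothetical proportions p_hat[j, k] for J = 2 regions, K = 3 classes.
p_hat = np.array([
    [0.8, 0.1, 0.1],   # region 1: fairly pure
    [0.4, 0.3, 0.3],   # region 2: mixed
])

# Gini index: sum over regions and classes of p(1 - p).
gini = np.sum(p_hat * (1 - p_hat))

# Classification error rate: for each region, 1 minus the largest proportion.
error = np.sum(1 - p_hat.max(axis=1))

print(gini)   # region 1 contributes 0.34, region 2 contributes 0.66
print(error)  # region 1 contributes 0.2,  region 2 contributes 0.6
```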
Note that decision trees do not need dummy variables for categorical predictors: a split simply assigns some levels of the predictor to one region and the remaining levels to the other.
Simple decision trees are easy to interpret and visualize, but they do not have as much predictive power as other methods and they suffer from high variance. To improve performance, we can use bagging and random forests. For a description of bagging, read this.
Random forests improve upon bagging via a small modification: when building a tree, each split only considers a random sample of the available predictors, and a new random sample is drawn for every split. The idea is to give every predictor a chance, instead of always using the strongest ones early in each tree. This decorrelates the trees, which results in a bigger reduction in variance when their predictions are averaged. Using small samples (typically of size $\sqrt{p}$, where $p$ is the number of predictors) is especially helpful when we have a large number of correlated predictors.
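As a sketch (with hypothetical synthetic data): in scikit-learn, the difference comes down to `max_features`, the size of the random sample of predictors considered at each split. Setting `max_features=None` uses all predictors at every split, which corresponds to plain bagging, while `max_features="sqrt"` gives a random forest with samples of size $\sqrt{p}$:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Hypothetical data: p = 16 predictors, only the first two matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 16))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=300)

# Bagging: every split may use all p predictors.
bagging = RandomForestRegressor(n_estimators=200, max_features=None, random_state=0)

# Random forest: each split only considers sqrt(p) randomly chosen predictors.
forest = RandomForestRegressor(n_estimators=200, max_features="sqrt", random_state=0)

for name, model in [("bagging", bagging), ("random forest", forest)]:
    print(name, cross_val_score(model, X, y, cv=5).mean())
```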