Feed-forward neural network construction and training
We applied systems neuroscience and information-theoretic methods to analyze the structure of a feed-forward neural network as it was trained to rapidly classify a set of ten hand-written digits (Modified National Institute of Standards and Technology [MNIST] dataset [28]). The ANN was trained across 100,000 epochs with stochastic gradient descent; however, we present only a subset of epochs in order to demonstrate the key patterns observed in the dataset. Specifically, we analyze a total of 64 epochs: the first 30; every 10 epochs to 100; every 100 epochs to 1000; every 1000 epochs to 10,000; and every 10,000 epochs to 100,000. Although a neural network with a single hidden layer is theoretically sufficient for high performance on MNIST [28], neural networks with more hidden layers provide benefits of both computational and parameter efficiency [31]. For the sake of simplicity, we chose a relatively basic network in which edge weights and nodal activity patterns could be directly related to performance.
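For concreteness, the sampling schedule above can be reconstructed as in the sketch below (the exact block boundaries, e.g., whether epoch 100 belongs to the second or third block, are an assumption chosen so that the total comes to 64):

```python
# Reconstruct the 64 analyzed epochs described above: the first 30, then
# every 10th epoch to 100, every 100th to 1,000, every 1,000th to 10,000,
# and every 10,000th to 100,000. Block boundaries are assumed.
epochs = (
    list(range(1, 31))                   # epochs 1-30        (30 points)
    + list(range(40, 101, 10))           # 40-100             (7 points)
    + list(range(200, 1001, 100))        # 200-1,000          (9 points)
    + list(range(2000, 10001, 1000))     # 2,000-10,000       (9 points)
    + list(range(20000, 100001, 10000))  # 20,000-100,000     (9 points)
)
assert len(epochs) == 64
```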
With these constraints in mind, we constructed a feed-forward network with two hidden layers, a 100-node hidden layer (HL1) that received the 28 × 28 input (pixel intensities from the MNIST dataset) and a 100-node hidden layer (HL2) that received input from HL1, together with a 10-node output layer (Fig. 1A). The edges between these layers were given unique labels: edges connecting the input nodes to the first hidden layer were labeled as \(\alpha\) edges (dark blue in Fig. 1A); the edges connecting the two hidden layers were labeled as \(\beta\) edges (orange in Fig. 1A); and the edges connecting the second hidden layer to the readout layer were labeled as \(\gamma\) edges (dark green in Fig. 1A). One clear difference between the ANN and standard approaches to analyzing neuroimaging data is that the mean of the absolute value of edge weights in all three groups increased nonlinearly over the course of training, whereas typical neuroimaging analyses normalize the strength of weights across cohorts.
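The architecture described above can be sketched as follows. This is illustrative only: the layer sizes and edge labels come from the text, whereas the sigmoid activation and the class and method names are assumptions.

```python
import torch.nn as nn

class FeedForward(nn.Module):
    """784 -> 100 (HL1) -> 100 (HL2) -> 10, as described in the text."""
    def __init__(self):
        super().__init__()
        self.alpha = nn.Linear(28 * 28, 100)  # alpha edges: input -> HL1
        self.beta = nn.Linear(100, 100)       # beta edges:  HL1 -> HL2
        self.gamma = nn.Linear(100, 10)       # gamma edges: HL2 -> output
        self.act = nn.Sigmoid()               # assumed activation function

    def forward(self, x):
        h1 = self.act(self.alpha(x.flatten(1)))
        h2 = self.act(self.beta(h1))
        return self.gamma(h2)                 # logits over the 10 digits
```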
The topological properties of a feed-forward neural network during training
It has previously been suggested that the concept of modularity (i.e., ‘Q’) may be employed to improve the design of deep neural-network architecture in various ways [32, 33]. Non-trivial modular structure is a pervasive feature of complex systems [25, 27], and has been shown to increase as a function of learning in neuroimaging experiments [5, 10]. Based on this similarity, we hypothesized that Q should increase as a function of training on the MNIST dataset and should reflect improvements in classification accuracy. To test this prediction, we required a means for translating the edges of the neural network into a format that was amenable to network science approaches (i.e., a weighted and directed adjacency matrix). To achieve this, we created a sparse node × node matrix, and then mapped the \(\alpha\) (input–HL1), \(\beta\) (HL1–HL2) and \(\gamma\) (HL2–output) edges accordingly, yielding the adjacency matrix shown in Fig. 1B.
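A minimal sketch of this mapping is given below (the block layout, with rows as source nodes and columns as target nodes, is an assumption consistent with a directed feed-forward graph):

```python
import numpy as np

def build_adjacency(W_alpha, W_beta, W_gamma):
    """Assemble the sparse node x node adjacency matrix of Fig. 1B.

    W_alpha: (784, 100) input -> HL1 weights
    W_beta:  (100, 100) HL1 -> HL2 weights
    W_gamma: (100, 10)  HL2 -> output weights
    """
    n = 784 + 100 + 100 + 10                 # 994 nodes in total
    A = np.zeros((n, n))
    A[:784, 784:884] = W_alpha               # alpha block
    A[784:884, 884:984] = W_beta             # beta block
    A[884:984, 984:] = W_gamma               # gamma block
    return A
```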
With the network edge weights framed as a graph, we applied methods from network science to analyze how its complex topological structure changed as the ANN was trained to classify the MNIST dataset (Fig. 1C). We used the Louvain algorithm to estimate Q from the neural network graph at each training epoch. Variations in network modularity across training epochs are plotted in Fig. 1D and reveal three distinct periods: approximately constant Q (‘Early’; training epochs 1–9; data points 1–9 in Fig. 1D), followed by increasing Q (‘Middle’; training epochs 10–8,000; data points 10–55 in Fig. 1D), and finally decreasing Q (‘Late’; training epochs 9,000–100,000; data points 56–64 in Fig. 1D). Early in training, there was a substantial improvement in accuracy without a noticeable change in Q (light blue in Fig. 1E). In the middle period, we observed an abrupt increase in Q (light green in Fig. 1E) that tracked linearly with performance accuracy (r = 0.981, \(p_{\mathrm{PERM}} < 10^{-4}\), permutation test). Finally, in the late training period, Q began to drop (Fig. 1E; light purple). These results demonstrate that the modularity of the neural network varies over the course of training in a way that corresponds to three distinct modes of behavior with respect to the network’s classification performance.
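The per-epoch estimation of Q can be sketched as follows. This is an illustration using networkx; treating the graph as undirected with absolute edge weights is an assumption here, as the exact handling of edge sign and direction is described in Sect. 4.

```python
import networkx as nx
import numpy as np

def estimate_Q(A, seed=0):
    """Estimate modularity Q for one epoch's adjacency matrix via Louvain."""
    G = nx.from_numpy_array(np.abs(A + A.T))  # symmetrize; use weight magnitudes
    communities = nx.community.louvain_communities(G, weight="weight", seed=seed)
    return nx.community.modularity(G, communities, weight="weight")
```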
Early edge-weight alteration is concentrated on informative inputs
The fact that Q did not change early in training, despite substantial improvements in accuracy, was somewhat surprising. This result was made even more compelling by the fact that we observed substantial edge-weight alteration during the early period, yet with no accompanying alteration in modularity. To better understand this effect, we first created a difference score representing the absolute value of edge changes across each pair of epochs in the early phase (i.e., the blue epochs in Fig. 1D, E). We then calculated the grand mean of this value across the early phase (i.e., one value for each of the 784 input dimensions, summed across all \(\alpha\) edge weights associated with each node in the input layer), and then reshaped this vector such that it matched the dimensions of the input data (i.e., \(28^2\) pixels). We found that the \(\alpha\) edge weights that varied the most over this period were located along the main stroke lines in the middle of the image (e.g., the outside circle and a diagonal line; Fig. 2A).
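This procedure amounts to the following sketch (the function and argument names are illustrative):

```python
import numpy as np

def early_edge_change_map(alpha_weights):
    """Map early-period alpha edge-weight change back onto the input image.

    alpha_weights: list of (784, 100) arrays, one per early-period epoch.
    Returns a (28, 28) image of mean absolute epoch-to-epoch changes,
    summed across the 100 HL1 targets of each input pixel.
    """
    deltas = [np.abs(w2 - w1)
              for w1, w2 in zip(alpha_weights, alpha_weights[1:])]
    per_pixel = np.mean(deltas, axis=0).sum(axis=1)
    return per_pixel.reshape(28, 28)          # matches the 28^2 input layout
```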
Similar to the manner in which an eye saccades to a salient target [34], we hypothesized that the feed-forward network was reconfiguring early in training so as to align with the most informative regions of the input space. To test this hypothesis, we binarized the pixel activity across the 60,000 items from the training set, with a threshold that varied across each pixel so as to maximize the mutual information (MI) that the binarized pixel provides about the class (i.e., the digit), and then calculated the information held by each pixel (\(I_P\): MI(pixel, class); Fig. 2B). We observed a clear, linear correspondence between \(I_P\) and the edges that reconfigured the most during the early period (Fig. 2C; r = 0.965, \(p_{\mathrm{PERM}} < 0.0001\)). The effect remained significant for edge changes in the middle (r = 0.874) and late (r = 0.855) periods; however, the correlation was significantly stronger for the early period (Z = 16.03, p < 0.001) [35]. This result indicates that the network adjusted to concentrate its sensitivity on the most class-discriminative areas of the input space, and that it did so by reconfiguring the edge weights associated with those areas.
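The \(I_P\) statistic can be estimated as in the sketch below (the grid of 32 candidate thresholds is an assumption; the text specifies only that the threshold is chosen per pixel to maximize MI):

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def pixel_information(X, y, n_thresholds=32):
    """I_P = MI(binarized pixel; class) for every pixel.

    X: (n_samples, 784) pixel intensities; y: (n_samples,) digit labels.
    """
    I_P = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        grid = np.linspace(X[:, j].min(), X[:, j].max(), n_thresholds)
        I_P[j] = max(mutual_info_score(y, X[:, j] > t) for t in grid)
    return I_P / np.log(2)                    # convert nats to bits
```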
Topological segregation during the middle period of learning
Following the initial period of learning, we observed a substantial increase in network modularity, Q, that rose linearly with improvements in classification accuracy (Fig. 1C, green). To better understand how node-level network elements reconfigured during the middle period, we computed two metrics for each node that quantify how its connections are distributed across network modules: (i) the module-degree z-score (MZ), and (ii) the participation coefficient (PC) [36]. MZ and PC have together been used to characterize the cartographic profile of complex networks: MZ measures within-module connectivity, whereas PC measures between-module connectivity and thus captures the amount of inter-regional integration within the network (see Sect. 4 for details; Fig. 3A) [36]. These statistics have previously been used in combination with whole-brain human fMRI data to demonstrate a relationship between heightened network integration and cognitive function [11, 37], although the role of integrative topological organization is less well understood in ANNs. Importantly, the calculation of both MZ and PC relies on the community assignment estimated by the Louvain algorithm, and hence affords sensitivity to changes in network topology over the course of training.
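Both statistics follow standard definitions [36] and can be computed from the adjacency matrix and a Louvain partition, as in the sketch below (using total connection strength in place of binary degree is an assumption that matches common practice for weighted networks):

```python
import numpy as np

def cartography(A, modules):
    """Participation coefficient (PC) and module-degree z-score (MZ).

    A: (n, n) adjacency matrix; modules: (n,) community labels.
    """
    W = np.abs(A + A.T)                          # undirected strength (assumed)
    k = W.sum(axis=1) + 1e-12                    # total strength per node
    labels = np.unique(modules)
    k_is = np.stack([W[:, modules == m].sum(axis=1) for m in labels], axis=1)

    # PC_i = 1 - sum_s (k_is / k_i)^2
    PC = 1 - np.sum((k_is / k[:, None]) ** 2, axis=1)

    # MZ: within-module strength, z-scored within each module
    MZ = np.zeros(len(k))
    for m in labels:
        idx = modules == m
        within = W[np.ix_(idx, idx)].sum(axis=1)
        MZ[idx] = (within - within.mean()) / (within.std() + 1e-12)
    return PC, MZ
```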
Using this cartographic approach [36], we were able to translate the edge weights in the network into values of PC and MZ for each node of the network at each epoch of training. Figure 3B, C shows the PC and MZ values for the nodes in the input layer (i.e., the topological properties of the \(\alpha\) edges) at training epoch 30, which was representative of the patterns in the middle period. PC showed a relatively ‘patchy’ appearance around the main stroke areas, suggestive of a distributed topological coverage of the input space, as well as high values on the edges of the input space (Fig. 3B). In contrast, MZ values were far more centrally concentrated, indicative of local hubs within network communities around the main stroke areas of the input space (Fig. 3C). Overall, PC and MZ mapped onto different locations in the input space, and hence were somewhat negatively correlated when data were pooled across all epochs (r = −0.107; \(p = 3 \times 10^{-89}\)). We hypothesized that these changes in MZ and PC were indicative of a topological reconfiguration of the input layer of the ANN that aligns network hubs with key aspects of the input stream, namely the main stroke areas.
To test this hypothesis, we related the PC and MZ of each node of the network across all epochs of training to a statistic, \(I_D\): MI(pixel\(_{\mathrm{on}}\), class), which computes the amount of information available in each pixel of the input space when that pixel is active (Fig. 3D). In contrast to the average information \(I_P\) held by the pixel about the class, \(I_D\) is a partial information, quantifying how informative each pixel is for tracking multiple different digit classes only when the pixel is active (pixel\(_{\mathrm{on}}\)). High values of \(I_D\) imply that the recruitment of the pixel is associated with a reduction in uncertainty (i.e., an increase in information) about the digit. As detailed in Sect. 4, \(I_P\) (Fig. 2B) is negatively correlated with \(I_D\) (Fig. 3D) and dominated by samples in which the pixel is inactive.
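One way to estimate \(I_D\) is as the specific information of the pixel's active state, as sketched below (whether this exact estimator matches the derivation in Sect. 4 is an assumption):

```python
import numpy as np

def pixel_on_information(x_bin, y):
    """I_D: class information conveyed by a pixel's active state.

    x_bin: (n_samples,) binarized pixel; y: (n_samples,) digit labels.
    Computed as sum_c p(c|on) * log2(p(c|on) / p(c)).
    """
    classes, counts = np.unique(y, return_counts=True)
    p_c = counts / counts.sum()                  # prior over digit classes
    on = x_bin.astype(bool)
    if not on.any():
        return 0.0                               # pixel is never active
    p_c_on = np.array([(y[on] == c).mean() for c in classes])
    nz = p_c_on > 0                              # skip zero-probability terms
    return float(np.sum(p_c_on[nz] * np.log2(p_c_on[nz] / p_c[nz])))
```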
We observed a significant positive correlation between \(I_D\) and PC that emerged towards the end of the middle period (Fig. 3E). Specifically, we observed a dissociation in the input layer (Fig. 3E) during the middle period, wherein \(I_D\) was positively correlated with PC (max r = 0.396, \(p_{\mathrm{PERM}} < 10^{-4}\)) but negatively correlated with the module-degree z-score (max r = −0.352, \(p_{\mathrm{PERM}} < 10^{-4}\)). In other words, the topology of the neural network reconfigured so as to align highly informative active pixels with topologically integrated hubs (nodes with higher PC). While these pixels are less commonly active, they are highly informative of class when they are active (high \(I_D\)), suggesting that their activation requires the network to send information about such events to many downstream modules. By contrast, more segregated hubs (nodes with higher MZ) were more likely to be associated with higher \(I_P\), being nodes that are more informative on average about digit class (and tending to be more informative when inactive). This may indicate that the network reconfigures so as to organize sets of informative nodes into modules in a way that supports the creation of higher-order ‘features’ in the next layer. In neuroscience, nodes within the same module are typically presumed to process similar patterns of information [13], suggesting that the topology of the neural network studied here may be adjusting to better detect the presence or absence of low-dimensional features within the input space.
Inter-layer correspondence
Given that the same gradient descent algorithm used to train the network was applied consistently across all layers of the network, we predicted that the same principles identified in the input layer should propagate through the network, albeit applied to the abstracted ‘features’ captured by each preceding layer. Similar to the manner in which a locksmith sequentially opens a bank vault, we hypothesized that each layer of the neural network should align with the most informative dimensions of its input in turn, such that information could only be extracted from a deeper layer once a more superficial layer had appropriately aligned with the most informative aspects of its input stream. To test this hypothesis, we investigated how the mutual information \(I_H\): MI(node, class) between each node’s activity and the digit class evolved across training epochs. (Note that \(I_H\) is equivalent to \(I_P\) but computed for hidden layer nodes rather than inputs.) As shown in Fig. 2F, mean MI within both hidden layer 1 (\(\mathrm{MI}_{\mathrm{HL1}}\)) and hidden layer 2 (\(\mathrm{MI}_{\mathrm{HL2}}\)) increased during the first two epochs, but the two then diverged at the point in learning coinciding with the global decrease in modularity, Q (cf. Fig. 1D). Crucially, despite the decrease in \(\mathrm{MI}_{\mathrm{HL1}}\), there was still an increase in \(\mathrm{MI}_{\mathrm{HL2}}\), suggesting that the Layer 2 nodes improved their ability to combine information available in separate individual Layer 1 nodes to become more informative about the class. This suggests that Layer 1 nodes specialize (and therefore hold less information overall, lowering \(\mathrm{MI}_{\mathrm{HL1}}\)) in order to support the integration of information in deeper layers of the neural network (increasing \(\mathrm{MI}_{\mathrm{HL2}}\)).
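A sketch of the \(I_H\) estimate is given below (continuous activations must be discretized before computing MI; the equal-width binning and the bin count used here are assumptions):

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def hidden_layer_mi(activity, y, n_bins=16):
    """Mean I_H = MI(node activity; class) across one hidden layer.

    activity: (n_samples, n_nodes) activations at one epoch;
    y: (n_samples,) digit labels.
    """
    mi = []
    for j in range(activity.shape[1]):
        edges = np.histogram_bin_edges(activity[:, j], bins=n_bins)
        mi.append(mutual_info_score(y, np.digitize(activity[:, j], edges)))
    return float(np.mean(mi)) / np.log(2)        # convert nats to bits
```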
Validation with the eMNIST dataset
In summary, in studying the topological reconfiguration of an ANN during training on the MNIST dataset, we observed three distinctive periods of adjustment, which play different roles in augmenting the distributed information processing across the network so as to capture class-relevant information in the input data. To better understand the generalizability of these findings, we trained a new feed-forward neural network (identical in architecture to the original network) on the eMNIST dataset [52]. The eMNIST dataset is similar to MNIST, but uses hand-written letters, as opposed to numbers. Although learning was more protracted in the eMNIST dataset (likely due to the increased complexity of the alphabet, relative to the set of digits), we observed similar changes in network structure across training to those reported above for the MNIST dataset. Specifically: (i) the network shifted from integration to segregation; (ii) layers reconfigured serially; and (iii) nodal roles (with respect to inferred network modules) were similarly related to class-relevant information in individual pixels (Additional file 1: Fig. S1). These results suggest that the insights obtained from the MNIST analysis may represent general topological features of efficient distributed information processing in complex systems.
The late period is associated with low-dimensional pattern separation
Next, we investigated whether the extent to which the nodal topology of the networks trained on the two datasets differed (i.e., whether different regions of the input space had higher or lower PC and MZ) was proportional to the difference in the most informative locations of the input space between the two datasets (\(\Delta I_D\)). Specifically, the difference in the pattern of an input node’s edges across inferred network modules between the eMNIST and MNIST datasets (\(\Delta\)PC) was correlated with the difference in image input characteristics between the two datasets (\(\Delta I_D\) vs. \(\Delta\)PC: r = 0.301, \(p_{\mathrm{PERM}} < 0.0001\); \(\Delta I_D\) vs. \(\Delta\)MZ: r = −0.247, \(p_{\mathrm{PERM}} < 0.0001\)). This result provides further confirmation that neural networks learn by reorganizing their nodal topology through a series of periods that act to align network edges and activity patterns with the most informative pixels within the training set.
We found that pixels played unique roles across learning with respect to the emerging modular architecture of the neural network, and that these roles were shaped by their class-relevant information. As edge weights were reconfigured across training, we observed that the standard deviation of changes in outgoing edge strength from a node (i.e., Edge \(\Delta_{1}\)) increased for highly informative inputs (i.e., high \(I_P\); Additional file 1: Fig. S1D for eMNIST, corresponding to Fig. 2C for MNIST). As these weights change, they alter the activity of each of the nodes in the hidden layers, which ultimately pool their activity via modules to affect the class predictions, which are read out from the activity of the final output layer. So how do the changes in edge weight translate into nodal activity? Based on recent empirical electrophysiological [38] and fMRI [39] studies, we hypothesized that activity patterns would be distributed across the neural network in a low-dimensional fashion. Specifically, by way of analogy to the notion of manifold untangling in the ventral visual system [40], we predicted that across training, the high-dimensional initial state of the system (i.e., random edge weights) would become more low-dimensional as pixel–pixel redundancies were discovered through the learning process.
To test this hypothesis, we used dimensionality reduction [41] to analyze the ‘activity’ of all of the nodes within the neural network across the different training epochs. Here, activity was defined as the weighted sum of inputs arriving from the input layer or from earlier layers of the network, after having been passed through an activation function. We applied PCA [42] to the nodal activity across all four layers of the feed-forward network (i.e., the input, HL1, HL2 and output nodes), which was first standardized and then either concatenated (to calculate the dimensionality of the entire process) or analyzed on an epoch-by-epoch basis (to calculate the effect of training; see Sect. 4 for details). The concatenated state-space embedding was relatively low-dimensional (120/994 components, or 12.2%, explained ~ 80% of the variance), and the pixel-wise loading of each of the top eigenvalues (λs) for the input layer (Fig. 4A) was correlated with both the \(I_P\) and \(I_D\) statistics used in the prior analyses (\(I_P\): λ1: r = 0.218, \(p < 10^{-4}\); λ2: r = 0.189, \(p < 10^{-4}\); λ3: r = 0.158, p < 0.0001; and \(I_D\): λ1: r = 0.338, \(p < 10^{-4}\); λ2: r = 0.123, \(p < 10^{-4}\); λ3: r = 0.062, p = 0.08), suggesting a direct correspondence between class-relevant information in the input space and the low-dimensional embedding. Crucially, test trials that were incorrectly classified (at Epoch 10,000, though results were consistent for other epochs) were associated with lower absolute loadings on the ten most explanatory EVs (EV1–10; Additional file 1: Fig. S3; FDR p < 0.05). These results are tangentially related to recent empirical neuroscientific studies that employed dimensionality reduction on electrophysiological [38] and fMRI [39] data to show that learning and cognitive task performance are typically more effective when constrained to a low-dimensional embedding space.
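The PCA described above, and the variance-explained statistic \(V_{d10}\) used in the next paragraph, can be computed as in this sketch (standardization via z-scoring is stated in the text; the scikit-learn implementation is an assumption):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def variance_explained_top10(activity):
    """V_d10: variance explained by the top 10 PCs of nodal activity.

    activity: (n_samples, 994) activity of all input, HL1, HL2 and output
    nodes, either for one epoch or concatenated across epochs.
    """
    z = StandardScaler().fit_transform(activity)    # standardize each node
    pca = PCA(n_components=10).fit(z)
    return float(pca.explained_variance_ratio_.sum())
```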
By conducting a PCA on each epoch in turn, we found that training was associated with a non-linear alteration in the amount of variance explained by the top 10 PCs (\(V_{d10}\)), and that these changes aligned well with the topologically identified periods (Fig. 4B). The network began in a relatively high-dimensional configuration, consistent with the random initialization of nodal activity. During the early period (light blue in Fig. 4B), as the edge weights reconfigured to align with \(I_P\) (Fig. 2B), \(V_{d10}\) remained relatively high. During the middle period (light green in Fig. 4B), there was a sharp reduction in \(V_{d10}\), although the dimensionality collapse diminished mid-way through the period. The late period (purple in Fig. 4B) was associated with a mild reduction in \(V_{d10}\). Interestingly, greater amounts of training led to a tighter correspondence between nodal topological signatures (PC/MZ calculated separately at each epoch) and the principal component loadings of nodes in the input layer (Additional file 1: Fig. S2), suggesting that the topology of the neural network reconfigured over training to better map onto a low-dimensional sub-space that concentrated class-relevant information in the training dataset.
Organizing an information processing system within the constraints of a relatively low-dimensional architecture (i.e., dimensions \(\ll\) nodes) can confer important computational benefits [41]. For instance, previous theoretical work in systems neuroscience has argued that the ventral visual stream of the cerebral cortex is organized so as to ‘untangle’ different input streams into highly informative categories [40]. Here, ‘untangling’ refers to the ability of the system to effectively separate inputs along different categorical dimensions (e.g., distinguishing a well-known face from that of a stranger), while still retaining sufficient information in the signal such that higher-order classifications of the same data remain possible (e.g., recognizing a well-known face in a novel orientation). Interestingly, the same concept has been used to explain the function of both the visual system [40] and effective decision-making [43], and may underpin the functionality of convolutional neural networks trained on naturalistic images [24]. In the context of our PCA analysis, ‘untangling’ can be envisaged as an alteration in the way that network activity reflecting different digit categories is embedded within the network’s state space: the loadings onto different categories in the untrained network should be relatively overlapping (i.e., ‘tangled’), but should become less overlapping (i.e., ‘untangled’) as the network learns to effectively categorize the inputs into distinct digits.
Analyzing our data from this vantage point, we found that the increase in topologically rich low-dimensional structure was associated with a relative ‘untangling’ of the low-dimensional manifold (Fig. 4C): the middle period was associated with a general expansion of the low-dimensional embedding distance within categories (light green in Fig. 4D), which then allowed the system to both expand between categories and contract within categories during the late period of learning (purple in Fig. 4D). This ultimately had the effect of boosting classification accuracy. Indeed, the contraction of the within-category embedding distance (which took place first) co-occurred with the drop in \(\mathrm{MI}_{\mathrm{HL1}}\), while the subsequent expansion of the between-category distance co-occurred with the increase in \(\mathrm{MI}_{\mathrm{HL2}}\). At the sub-network level, the activity of nodes in HL2 was substantially more low-dimensional than that in HL1 (Additional file 1: Fig. S4), further supporting the notion that different computational constraints are imposed on neural networks depending on the depth of their layers. Overall, these results confirm the presence of manifold ‘untangling’ in a simple, feed-forward ANN, and hence provide a further link between the way that synthetic and biological neural networks learn to classify visual inputs.
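The within- and between-category embedding distances underlying Fig. 4D can be quantified as in the following sketch (using mean pairwise distance within categories and centroid-to-centroid distance between categories; these specific distance choices are assumptions):

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist

def embedding_distances(scores, y):
    """Mean within- and between-category distances in the PC embedding.

    scores: (n_samples, n_components) PCA scores at one epoch;
    y: (n_samples,) digit labels.
    """
    labels = np.unique(y)
    within = np.mean([pdist(scores[y == c]).mean() for c in labels])
    centroids = np.stack([scores[y == c].mean(axis=0) for c in labels])
    pairs = cdist(centroids, centroids)
    between = pairs[np.triu_indices_from(pairs, k=1)].mean()
    return within, between
```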