Wearable computer

Wearable computer

A wearable computer, also known as a body-borne computer or wearable, is a computing device worn on the body. The definition of 'wearable computer' may be narrow or broad, extending to smartphones or even ordinary wristwatches. Wearables may be for general use, in which case they are just a particularly small example of mobile computing. Alternatively, they may be for specialized purposes such as fitness trackers. They may incorporate special sensors such as accelerometers, heart rate monitors, or on the more advanced side, electrocardiogram (ECG) and blood oxygen saturation (SpO2) monitors. Under the definition of wearable computers, we also include novel user interfaces such as Google Glass, an optical head-mounted display controlled by gestures. It may be that specialized wearables will evolve into general all-in-one devices, as happened with the convergence of PDAs and mobile phones into smartphones. Wearables are typically worn on the wrist (e.g. fitness trackers), hung from the neck (like a necklace), strapped to the arm or leg (electronic tagging), or on the head (as glasses or a helmet), though some have been located elsewhere (e.g. on a finger or in a shoe). Devices carried in a pocket or bag – such as smartphones and before them, pocket calculators and PDAs, may or may not be regarded as 'worn'. Wearable computers have various technical issues common to other mobile computing, such as batteries, heat dissipation, software architectures, wireless and personal area networks, and data management. Many wearable computers are active all the time, e.g. processing or recording data continuously. == Applications == Wearable computers are not only limited to computers such as fitness trackers that are worn on wrists; they also include wearables such as heart pacemakers and other prosthetics. They are used most often in research that focuses on behavioral modeling, health monitoring systems, IT and media development, where the person wearing the computer actually moves or is otherwise engaged with his or her surroundings. Wearable computers have been used for the following: general-purpose computing (e.g. smartphones and smartwatches) sensory integration, e.g. to help people see better or understand the world better (whether in task-specific applications like camera-based welding helmets or for everyday use like Google Glass) behavioral modeling health care monitoring systems service management electronic textiles and fashion design, e.g. Microsoft's 2011 prototype "The Printing Dress". Wearable computing is the subject of active research, especially the form-factor and location on the body, with areas of study including user interface design, augmented reality, and pattern recognition. The use of wearables for specific applications, for compensating disabilities or supporting elderly people steadily increases. == Operating systems == The dominant operating systems for wearable computing are: FreeRTOS is a real-time operating system kernel for embedded devices; most of the Smartbands that are currently available in the market are based on FreeRTOS, which include Huawei, Honor, Lenovo, realme, TCL and Xiaomi smartbands. LiteOS is a lightweight open source real-time operating system that is part of Huawei's "1+8+N" Internet of Things solution. Tizen OS from Samsung (there was an announcement in May 2021 that Wear OS and Tizen OS will merge and will be called simply Wear.) watchOS watchOS is a proprietary mobile operating system developed by Apple Inc. to run on the Apple Watch. Wear OS Wear OS (previously known as Android Wear) is a smartwatch operating system developed by Google Inc. == History == Due to the varied definitions of wearable and computer, the first wearable computer could be as early as the first abacus on a necklace, a 16th-century abacus ring, a wristwatch and 'finger-watch' owned by Queen Elizabeth I of England, or the covert timing devices hidden in shoes to cheat at roulette by Thorp and Shannon in the 1960s and 1970s. However, a general-purpose computer is not merely a time-keeping or calculating device, but rather a user-programmable item for arbitrary complex algorithms, interfacing, and data management. By this definition, the wearable computer was invented by Steve Mann, in the late 1970s: Steve Mann, a professor at the University of Toronto, was hailed as the father of the wearable computer and the ISSCC's first virtual panelist, by moderator Woodward Yang of Harvard University (Cambridge Mass.). The development of wearable items has taken several steps of miniaturization from discrete electronics over hybrid designs to fully integrated designs, where just one processor chip, a battery, and some interface conditioning items make the whole unit. === 1500s === Queen Elizabeth I of England received a watch from Robert Dudley in 1571, as a New Year's present; it may have been worn on the forearm rather than the wrist. She also possessed a 'finger-watch' set in a ring, with an alarm that prodded her finger. === 1600s === The Qing dynasty saw the introduction of a fully functional abacus on a ring, which could be used while it was being worn. === 1960s === In 1961, mathematicians Edward O. Thorp and Claude Shannon built some computerized timing devices to help them win a game of roulette. One such timer was concealed in a shoe and another in a pack of cigarettes. Various versions of this apparatus were built in the 1960s and 1970s. Thorp refers to himself as the inventor of the first "wearable computer". In other variations, the system was a concealed cigarette-pack-sized analog computer designed to predict the motion of roulette wheels. A data-taker would use microswitches hidden in his shoes to indicate the speed of the roulette wheel, and the computer would indicate an octant of the roulette wheel to bet on by sending musical tones via radio to a miniature speaker hidden in a collaborator's ear canal. The system was successfully tested in Las Vegas in June 1961, but hardware issues with the speaker wires prevented it from being used beyond test runs. This was not a wearable computer because it could not be re-purposed during use; rather it was an example of task-specific hardware. This work was kept secret until it was first mentioned in Thorp's book Beat the Dealer (revised ed.) in 1966 and later published in detail in 1969. === 1970s === Pocket calculators became mass-market devices in 1970, starting in Japan. Programmable calculators followed in the late 1970s, being somewhat more general-purpose computers. The HP-01 algebraic calculator watch by Hewlett-Packard was released in 1977. A camera-to-tactile vest for the blind, launched by C.C. Collins in 1977, converted images into a 1024-point, ten-inch square tactile grid on a vest. === 1980s === The 1980s saw the rise of more general-purpose wearable computers. In 1981, Steve Mann designed and built a backpack-mounted 6502-based wearable multimedia computer with text, graphics, and multimedia capability, as well as video capability (cameras and other photographic systems). Mann went on to be an early and active researcher in the wearables field, especially known for his 1994 creation of the Wearable Wireless Webcam, the first example of lifelogging. Seiko Epson released the RC-20 Wrist Computer in 1984. It was an early smartwatch, powered by a computer on a chip. In 1989, Reflection Technology marketed the Private Eye head-mounted display, which scans a vertical array of LEDs across the visual field using a vibrating mirror. This display gave rise to several hobbyist and research wearables, including Gerald "Chip" Maguire's IBM/Columbia University Student Electronic Notebook, Doug Platt's Hip-PC, and Carnegie Mellon University's VuMan 1 in 1991. The Student Electronic Notebook consisted of the Private Eye, Toshiba diskless AIX notebook computers (prototypes), a stylus based input system and a virtual keyboard. It used direct-sequence spread spectrum radio links to provide all the usual TCP/IP based services, including NFS mounted file systems and X11, which all ran in the Andrew Project environment. The Hip-PC included an Agenda palmtop used as a chording keyboard attached to the belt and a 1.44 megabyte floppy drive. Later versions incorporated additional equipment from Park Engineering. The system debuted at "The Lap and Palmtop Expo" on 16 April 1991. VuMan 1 was developed as part of a Summer-term course at Carnegie Mellon's Engineering Design Research Center, and was intended for viewing house blueprints. Input was through a three-button unit worn on the belt, and output was through Reflection Tech's Private Eye. The CPU was an 8 MHz 80188 processor with 0.5 MB ROM. === 1990s === In the 1990s PDAs became widely used, and in 1999 were combined with mobile phones in Japan to produce the first mass-market smartphone. In 1993, the Private Eye was used in Thad Starner's wearable, based on Doug Platt's system and built from a kit from Park Enterprises, a Pri

Neural field

In machine learning, a neural field (also known as implicit neural representation, neural implicit, or coordinate-based neural network), is a mathematical field that is fully or partially parametrized by a neural network. Initially developed to tackle visual computing tasks, such as rendering or reconstruction (e.g., neural radiance fields), neural fields emerged as a promising strategy to deal with a wider range of problems, including surrogate modelling of partial differential equations, such as in physics-informed neural networks. Differently from traditional machine learning algorithms, such as feed-forward neural networks, convolutional neural networks, or transformers, neural fields do not work with discrete data (e.g. sequences, images, tokens), but map continuous inputs (e.g., spatial coordinates, time) to continuous outputs (i.e., scalars, vectors, etc.). This makes neural fields not only discretization independent, but also easily differentiable. Moreover, dealing with continuous data allows for a significant reduction in space complexity, which translates to a much more lightweight network. == Formulation and training == According to the universal approximation theorem, provided adequate learning, sufficient number of hidden units, and the presence of a deterministic relationship between the input and the output, a neural network can approximate any function to any degree of accuracy. Hence, in mathematical terms, given a field y = Φ ( x ) {\textstyle {\boldsymbol {y}}=\Phi ({\boldsymbol {x}})} , with x ∈ R n {\displaystyle {\boldsymbol {x}}\in \mathbb {R} ^{n}} and y ∈ R m {\displaystyle {\boldsymbol {y}}\in \mathbb {R} ^{m}} , a neural field Ψ θ {\displaystyle \Psi _{\theta }} , with parameters θ {\displaystyle {\boldsymbol {\theta }}} , is such that: Ψ θ ( x ) = y ^ ≈ y {\displaystyle \Psi _{\theta }({\boldsymbol {x}})={\hat {\boldsymbol {y}}}\approx {\boldsymbol {y}}} === Training === For supervised tasks, given N {\displaystyle N} examples in the training dataset (i.e., ( x i , y i ) ∈ D t r a i n , i = 1 , … , N {\displaystyle ({\boldsymbol {x_{i}}},{\boldsymbol {y_{i}}})\in {\mathcal {D_{train}}},i=1,\dots ,N} ), the neural field parameters can be learned by minimizing a loss function L {\displaystyle {\mathcal {L}}} (e.g., mean squared error). The parameters θ ~ {\displaystyle {\tilde {\theta }}} that satisfy the optimization problem are found as: θ ~ = argmin θ 1 N ∑ ( x i , y i ) ∈ D t r a i n L ( Ψ θ ( x i ) , y i ) {\displaystyle {\tilde {\boldsymbol {\theta }}}={\underset {\boldsymbol {\theta }}{\text{argmin}}}\;{\frac {1}{N}}\sum _{({\boldsymbol {x_{i}}},{\boldsymbol {y_{i}}})\in {\mathcal {D_{train}}}}{\mathcal {L}}(\Psi _{\theta }({\boldsymbol {x}}_{i}),{\boldsymbol {y}}_{i})} Notably, it is not necessary to know the analytical expression of Φ {\displaystyle \Phi } , for the previously reported training procedure only requires input-output pairs. Indeed, a neural field is able to offer a continuous and differentiable surrogate of the true field, even from purely experimental data. Moreover, neural fields can be used in unsupervised settings, with training objectives that depend on the specific task. For example, physics-informed neural networks may be trained on just the residual. === Spectral bias === As for any artificial neural network, neural fields may be characterized by a spectral bias (i.e., the tendency to preferably learn the low frequency content of a field), possibly leading to a poor representation of the ground truth. In order to overcome this limitation, several strategies have been developed. For example, SIREN uses sinusoidal activations, while the Fourier-features approach embeds the input through sines and cosines. == Conditional neural fields == In many real-world cases, however, learning a single field is not enough. For example, when reconstructing 3D vehicle shapes from Lidar data, it is desirable to have a machine learning model that can work with arbitrary shapes (e.g., a car, a bicycle, a truck, etc.). The solution is to include additional parameters, the latent variables (or latent code) z ∈ R d {\displaystyle {\boldsymbol {z}}\in \mathbb {R} ^{d}} , to vary the field and adapt it to diverse tasks. === Latent code production === When dealing with conditional neural fields, the first design choice is represented by the way in which the latent code is produced. Specifically, two main strategies can be identified: Encoder: the latent code is the output of a second neural network, acting as an encoder. During training, the loss function is the objective used to learn the parameters of both the neural field and the encoder. Auto-decoding: each training example has its own latent code, jointly trained with the neural field parameters. When the model has to process new examples (i.e., not originally present in the training dataset), a small optimization problem is solved, keeping the network parameters fixed and only learning the new latent variables. Since the latter strategy requires additional optimization steps at inference time, it sacrifices speed, but keeps the overall model smaller. Moreover, despite being simpler to implement, an encoder may harm the generalization capabilities of the model. For example, when dealing with a physical scalar field f : R 2 → R {\displaystyle f:\mathbb {R} ^{2}\rightarrow \mathbb {R} } (e.g., the pressure of a 2D fluid), an auto-decoder-based conditional neural field can map a single point to the corresponding value of the field, following a learned latent code z {\displaystyle {\boldsymbol {z}}} . However, if the latent variables were produced by an encoder, it would require access to the entire set of points and corresponding values (e.g. as a regular grid or a mesh graph), leading to a less robust model. === Global and local conditioning === In a neural field with global conditioning, the latent code does not depend on the input and, hence, it offers a global representation (e.g., the overall shape of a vehicle). However, depending on the task, it may be more useful to divide the domain of x {\displaystyle {\boldsymbol {x}}} in several subdomains, and learn different latent codes for each of them (e.g., splitting a large and complex scene in sub-scenes for a more efficient rendering). This is called local conditioning. === Conditioning strategies === There are several strategies to include the conditioning information in the neural field. In the general mathematical framework, conditioning the neural field with the latent variables is equivalent to mapping them to a subset θ ∗ {\displaystyle {\boldsymbol {\theta }}^{}} of the neural field parameters: θ ∗ = Γ ( z ) {\displaystyle {\boldsymbol {\theta }}^{}=\Gamma ({\boldsymbol {z}})} In practice, notable strategies are: Concatenation: the neural field receives, as input, the concatenation of the original input x {\displaystyle {\boldsymbol {x}}} with the latent codes z {\displaystyle {\boldsymbol {z}}} . For feed-forward neural networks, this is equivalent to setting θ ∗ {\displaystyle {\boldsymbol {\theta }}^{}} as the bias of the first layer and Γ ( z ) {\displaystyle \Gamma ({\boldsymbol {z}})} as an affine transformation. Hypernetworks: a hypernetwork is a neural network that outputs the parameters of another neural network. Specifically, it consists of approximating Γ ( z ) {\displaystyle \Gamma ({\boldsymbol {z}})} with a neural network Γ ^ γ ( z ) {\displaystyle {\hat {\Gamma }}_{\gamma }({\boldsymbol {z}})} , where γ {\displaystyle {\boldsymbol {\gamma }}} are the trainable parameters of the hypernetwork. This approach is the most general, as it allows to learn the optimal mapping from latent codes to neural field parameters. However, hypernetworks are associated to larger computational and memory complexity, due to the large number of trainable parameters. Hence, leaner approaches have been developed. For example, in the Feature-wise Linear Modulation (FiLM), the hypernetwork only produces scale and bias coefficients for the neural field layers. === Meta-learning === Instead of relying on the latent code to adapt the neural field to a specific task, it is also possible to exploit gradient-based meta-learning. In this case, the neural field is seen as the specialization of an underlying meta-neural-field, whose parameters are modified to fit the specific task, through a few steps of gradient descent. An extension of this meta-learning framework is the CAVIA algorithm, that splits the trainable parameters in context-specific and shared groups, improving parallelization and interpretability, while reducing meta-overfitting. This strategy is similar to the auto-decoding conditional neural field, but the training procedure is substantially different. == Applications == Thanks to the possibility of efficiently modelling diverse mathematical fields with neural networks, neural fields have been applied to a wide range of problems: 3D scene reconstruction: neural fields can be used to model t

Integrated writing environment

An integrated writing environment (IWE) is software that provides comprehensive writing and knowledge management functionality for writers and information workers. IWEs enable writers and information workers to perform a variety of tasks related to the document in the IWE in a single environment. This provides a distraction-free workspace and streamlined writing experience. IWEs provide similar efficiency and functionality benefits to writers and information professionals that integrated development environments (IDEs) provide to software developers. == Overview == IWEs are designed to maximize productivity and help improve the quality of written work by integrating together tools that allow users to work effectively in a single application. The IWE features may include integrated content search, reversion management, outlining, note management, and reference management, as may be suitable for the target field of use. == List of IWEs == Celtx This IWE is intended for screenplay writers and has screenplay writing and management tools. Celtex provides tools for the pre-production work phase, story development, storyboarding, script breakdowns, production scheduling, and reports. Scrivener This IWE targets novel, research paper, and script writing. Scrivener provides tools to organize notes and research documents for easy access and referencing. After completing the writing, Scrivener allows the user to export the document to formats supported by common word processors, such as Microsoft Word. TeXstudio This IWE targets LaTeX documents and provides interactive spelling checker, code folding, and syntax highlighting.

Bigram

A bigram or digram is a sequence of two adjacent elements from a string of tokens, which are typically letters, syllables, or words. A bigram is an n-gram for n=2. The frequency distribution of every bigram in a string is commonly used for simple statistical analysis of text in many applications, including in computational linguistics, cryptography, and speech recognition. Gappy bigrams or skipping bigrams are word pairs which allow gaps (perhaps avoiding connecting words, or allowing some simulation of dependencies, as in a dependency grammar). == Applications == Bigrams, along with other n-grams, are used in most successful language models for speech recognition. Bigram frequency attacks can be used in cryptography to solve cryptograms. See frequency analysis. Bigram frequency is one approach to statistical language identification. Some activities in logology or recreational linguistics involve bigrams. These include attempts to find English words beginning with every possible bigram, or words containing a string of repeated bigrams, such as logogogue. == Bigram frequency in the English language == The frequency of the most common letter bigrams in a large English corpus is: th 3.56% of 1.17% io 0.83% he 3.07% ed 1.17% le 0.83% in 2.43% is 1.13% ve 0.83% er 2.05% it 1.12% co 0.79% an 1.99% al 1.09% me 0.79% re 1.85% ar 1.07% de 0.76% on 1.76% st 1.05% hi 0.76% at 1.49% to 1.05% ri 0.73% en 1.45% nt 1.04% ro 0.73% nd 1.35% ng 0.95% ic 0.70% ti 1.34% se 0.93% ne 0.69% es 1.34% ha 0.93% ea 0.69% or 1.28% as 0.87% ra 0.69% te 1.20% ou 0.87% ce 0.65%

LTX (text-to-video model)

LTX is a family of open source artificial intelligence video foundation models developed by Lightricks, and first released in November 2024. The latest models, LTX-2, create videos based on user prompts. They were preceded by LTX Video, which was released in 2024 as the company's first text-to-video model. LTX-2 is part of the LTX family of video generation models, which form the core technology, alongside LTX Studio, of the LTX ecosystem. == History == === Origins: LTX Video (2024–2025) === In November 2024 Lightricks publicly released its first text-to-video model, LTX Video. It was a 2-billion parameter model, available as open source. In May 2025 Lightricks launched LTXV-13b, a version with 13-billion parameters. Two months later, the model broke the 60 second barrier for generated video. === Release of LTX-2 (2025) === In October 2025 Lightricks announced its latest model, and renamed it LTX-2. The model was described as capable of generating synchronized audio and video at native 4K resolution and up to 50 frames per second (fps), using a variety of conditions and prompts, including text-to-video and image-to-video. Google highlighted the fact that LTX-2 was trained on its infrastructure, and saying it was "The first open source AI video generation model, powered by Google Cloud". Upon its release it was ranked in the top-3 models for image-to-video creation by Artificial Analysis, behind Kling 3.5 by Kling AI and Veo 3.1 by Google. Its text-to-image option was ranked 7th. In addition to its open-source release, Lightricks offers API access to LTX-2, allowing developers to generate videos from text and image prompts through a hosted service without running the model locally. === Open Source Release (2026) === In January 2026, Lightricks officially released the full open-source version of LTX-2, making the model’s complete codebase, weights, and associated tooling publicly available. In March 2026 the company released LTX-2.3, which was accompanied by a desktop video editor enabling the entire model to run locally on consumer hardware. == Technical features == === Advancements over LTX Video === LTX-2 builds upon the LTX Video architecture with several major improvements: Unified audio-video generation producing synchronized dialogue, ambience, and motion Native 4K rendering 50-fps output for cinematic motion Three operational modes (Fast, Pro, Ultra) More efficient diffusion pipelines enabling high fidelity on consumer GPUs === Core capabilities === Text-to-video generation Image-to-video generation Multimodal audiovisual synthesis High-resolution spatial and temporal coherence Configurable quality/performance settings Open-source distribution of weights and datasets == Reception == Initial reception to LTX-2 was broadly positive, with several technology and media outlets highlighting its open-source approach and multimodal capabilities. Open Source For You described LTX-2 as “one of the first AI video systems to combine 4K output, synchronized audio, and an open model release,” noting that it positioned Lightricks as a significant competitor to proprietary systems such as OpenAI's Sora and Google's Veo. IEA Green said that the model “could rewrite the AI filmmaking game,” emphasizing that its 50-fps rendering and unified audio-video generation made it suitable for professional studios and independent creators alike. AI News characterized LTX-2 as a “major step forward in the democratization of cinematic-quality video generation,” praising its consumer-grade hardware efficiency and multi-tier generation modes, while also noting ongoing challenges in long-form temporal stability. FinancialContent reported strong interest among creative agencies, attributing the attention to Lightricks’ decision to release model weights and datasets, which reviewers said enabled “a level of transparency not typically seen in commercial AI video models.” === Benchmarks and rankings === Upon release, LTX-2 ranked third for image-to-video creation in the Artificial Analysis benchmark, behind Kling 3.5 and Veo 3.1, while its text-to-video option ranked seventh. As of early 2026, it was the highest-ranked open-source model in the benchmark. === Limitations === Some early reviewers also pointed out quality limitations. The Ray3 technical review noted occasional inconsistencies in lip-sync and motion tracking during long scenes, though it stated these were “in line with the challenges faced by all current AI video diffusion models” and expected to improve with continued iteration. Like other diffusion-based video generators, LTX-2 can produce artifacts in complex multi-person scenes and may struggle with precise text rendering within generated video.

Endomondo

Endomondo is a health and wellness website. It allows users to track their health statistics and provides insights on fitness trends. Originally launched in 2007, Endomondo was acquired by Under Armour in 2015. Under Armour shut down Endomondo in 2020, but, by 2024, Endomondo re-launched as its own entity. == History == Endomondo started in Denmark in 2007 by Mette Lykke, Christian Birk and Jakob Nordenhof Jønck. In 2011, the company opened an office in Silicon Valley, USA, but kept its research and development department in Denmark. In 2013, Endomondo LLC was listed in Red Herring as a European finalists for promising start-ups. The same year, Christian Birk and Jakob Nordenhof Jønck left the daily operation of the company, but kept co-ownership. In February 2015, Endomondo LLC was acquired by athletic apparel maker Under Armour for $85 million. Endomondo, at that time, had over 20 million users. In October 2020, Under Armour announced that Endomondo would be shutting down and selling off MyFitnessPal to the private equity firm Francisco Partners for $345 million. Service stopped on 31 December 2020, giving customers until 15 February 2021 to download an archive of their historic data. In 2024, Endomondo.com was brought back online as a professional fitness guidance website. == Features == Endomondo provides numerous workouts, guidance on exercises, performance-enhancing nutrition, and tips. Previously, Endomondo was able to track numerous fitness attributes such as running routes, distance, duration, and calories. The software helped analyze performance and recommend improvements. There was a free and a paid version available of Endomondo. The free version had advertisements. The paid Premium version was free of advertisements and included additional features such as the possibility to create one's own training plan. The offering of additional features was different between the Android, IOS and Windows platforms, and had significantly better features for tracking performance over time than UnderArmours suggested replacement. Endomondo offered challenges of various types to the user and allowed users to create their own challenges.

Convolutional layer

In artificial neural networks, a convolutional layer is a type of network layer that applies a convolution operation to the input. Convolutional layers are some of the primary building blocks of convolutional neural networks (CNNs), a class of neural network most commonly applied to images, video, audio, and other data that have the property of uniform translational symmetry. The convolution operation in a convolutional layer involves sliding a small window (called a kernel or filter) across the input data and computing the dot product between the values in the kernel and the input at each position. This process creates a feature map that represents detected features in the input. == Concepts == === Kernel === Kernels, also known as filters, are small matrices of weights that are learned during the training process. Each kernel is responsible for detecting a specific feature in the input data. The size of the kernel is a hyperparameter that affects the network's behavior. === Convolution === For a 2D input x {\displaystyle x} and a 2D kernel w {\displaystyle w} , the 2D convolution operation can be expressed as: y [ i , j ] = ∑ m = 0 k h − 1 ∑ n = 0 k w − 1 x [ i + m , j + n ] ⋅ w [ m , n ] {\displaystyle y[i,j]=\sum _{m=0}^{k_{h}-1}\sum _{n=0}^{k_{w}-1}x[i+m,j+n]\cdot w[m,n]} where k h {\displaystyle k_{h}} and k w {\displaystyle k_{w}} are the height and width of the kernel, respectively. This generalizes immediately to nD convolutions. Commonly used convolutions are 1D (for audio and text), 2D (for images), and 3D (for spatial objects, and videos). === Stride === Stride determines how the kernel moves across the input data. A stride of 1 means the kernel shifts by one pixel at a time, while a larger stride (e.g., 2 or 3) results in less overlap between convolutions and produces smaller output feature maps. === Padding === Padding involves adding extra pixels around the edges of the input data. It serves two main purposes: Preserving spatial dimensions: Without padding, each convolution reduces the size of the feature map. Handling border pixels: Padding ensures that border pixels are given equal importance in the convolution process. Common padding strategies include: No padding/valid padding. This strategy typically causes the output to shrink. Same padding: Any method that ensures the output size same as input size is a same padding strategy. Full padding: Any method that ensures each input entry is convolved over for the same number of times is a full padding strategy. Common padding algorithms include: Zero padding: Add zero entries to the borders of input. Mirror/reflect/symmetric padding: Reflect the input array on the border. Circular padding: Cycle the input array back to the opposite border, like a torus. The exact numbers used in convolutions is complicated, for which we refer to (Dumoulin and Visin, 2018) for details. == Variants == === Standard === The basic form of convolution as described above, where each kernel is applied to the entire input volume. === Depthwise separable === Depthwise separable convolution separates the standard convolution into two steps: depthwise convolution and pointwise convolution. The depthwise separable convolution decomposes a single standard convolution into two convolutions: a depthwise convolution that filters each input channel independently and a pointwise convolution ( 1 × 1 {\displaystyle 1\times 1} convolution) that combines the outputs of the depthwise convolution. This factorization significantly reduces computational cost. It was first developed by Laurent Sifre during an internship at Google Brain in 2013 as an architectural variation on AlexNet to improve convergence speed and model size. === Dilated === Dilated convolution, or atrous convolution, introduces gaps between kernel elements, allowing the network to capture a larger receptive field without increasing the kernel size. === Transposed === Transposed convolution, also known as deconvolution, fractionally strided convolution, and upsampling convolution, is a convolution where the output tensor is larger than its input tensor. It's often used in encoder-decoder architectures for upsampling. It's used in image generation, semantic segmentation, and super-resolution tasks. == History == The concept of convolution in neural networks was inspired by the visual cortex in biological brains. Early work by Hubel and Wiesel in the 1960s on the cat's visual system laid the groundwork for artificial convolution networks. An early convolution neural network was developed by Kunihiko Fukushima in 1969. It had mostly hand-designed kernels inspired by convolutions in mammalian vision. In 1979 he improved it to the Neocognitron, which learns all convolutional kernels by unsupervised learning (in his terminology, "self-organized by 'learning without a teacher'"). During the 1988 to 1998 period, a series of CNN were introduced by Yann LeCun et al., ending with LeNet-5 in 1998. It was an early influential CNN architecture for handwritten digit recognition, trained on the MNIST dataset, and was used in ATM. (Olshausen & Field, 1996) discovered that simple cells in the mammalian primary visual cortex implement localized, oriented, bandpass receptive fields, which could be recreated by fitting sparse linear codes for natural scenes. This was later found to also occur in the lowest-level kernels of trained CNNs. The field saw a resurgence in the 2010s with the development of deeper architectures and the availability of large datasets and powerful GPUs. AlexNet, developed by Alex Krizhevsky et al. in 2012, was a catalytic event in modern deep learning. In that year’s ImageNet competition, the AlexNet model achieved a 16% top-five error rate, significantly outperforming the next best entry, which had a 26% error rate. The network used eight trainable layers, approximately 650,000 neurons, and around 60 million parameters, highlighting the impact of deeper architectures and GPU acceleration on image recognition performance. From the 2013 ImageNet competition, most entries adopted deep convolutional neural networks, building on the success of AlexNet. Over the following years, performance steadily improved, with the top-five error rate falling from 16% in 2012 and 12% in 2013 to below 3% by 2017, as networks grew increasingly deep.