Paper Title
AI Data Wrangling with Associative Arrays
Paper Authors
Paper Abstract
The AI revolution is data driven. AI "data wrangling" is the process by which unusable data is transformed to support AI algorithm development (training) and deployment (inference). Significant time is devoted to translating between the diverse data representations that support the many query and analysis steps found in an AI pipeline. Rigorous mathematical representations of these data enable data translation and analysis optimization within and across steps. Associative array algebra provides a mathematical foundation that naturally describes the tabular structures and set mathematics that are the basis of databases. Likewise, the matrix operations and corresponding inference/training calculations used by neural networks are well described by associative arrays. More surprisingly, a general denormalized form of hierarchical formats, such as XML and JSON, can be readily constructed. Finally, pivot tables, which are among the most widely used data analysis tools, naturally emerge from associative array constructors. A common foundation in associative arrays provides interoperability guarantees, proving that their operations are linear systems with rigorous mathematical properties, such as associativity, commutativity, and distributivity, that are critical to reordering optimizations.
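To make the abstract's claims concrete, the following is a minimal Python sketch, not the paper's implementation. It assumes an associative array can be modeled as a dictionary keyed by (row, column) pairs with ordinary + and * as the underlying operations. The names `assoc`, `add`, `matmul`, and `flatten`, the sample data, and the `/`-joined path convention for hierarchical data are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def assoc(triples):
    """Hypothetical constructor: build {(row, col): value} from (row, col, value) triples.

    Values that land on the same (row, col) key are summed, which is
    what makes a pivot table fall out of the constructor itself.
    """
    out = {}
    for row, col, value in triples:
        out[(row, col)] = out.get((row, col), 0) + value
    return out


def add(a, b):
    """Element-wise addition over the union of the two key sets."""
    out = dict(a)
    for key, value in b.items():
        out[key] = out.get(key, 0) + value
    return out


def matmul(a, b):
    """Array multiplication: sum of products over the shared inner key."""
    b_by_row = defaultdict(list)
    for (k, j), value in b.items():
        b_by_row[k].append((j, value))
    out = {}
    for (i, k), value in a.items():
        for j, other in b_by_row.get(k, []):
            out[(i, j)] = out.get((i, j), 0) + value * other
    return out


def flatten(doc, prefix=""):
    """Denormalize nested dicts/lists into {path: leaf} pairs (assumed path format)."""
    if isinstance(doc, dict):
        items = doc.items()
    elif isinstance(doc, list):
        items = enumerate(doc)
    else:
        return {prefix: doc}
    out = {}
    for key, value in items:
        path = f"{prefix}/{key}" if prefix else str(key)
        out.update(flatten(value, path))
    return out


# Pivot table: duplicate (region, product) records are summed on construction.
sales = [("east", "apples", 3), ("east", "apples", 2), ("west", "pears", 4)]
pivot = assoc(sales)

# Small arrays with string keys for checking the algebraic properties.
a = assoc([("x", "k1", 1), ("x", "k2", 2), ("y", "k1", 3)])
b = assoc([("k1", "p", 4), ("k2", "p", 5)])
c = assoc([("k1", "p", 1), ("k2", "q", 2)])

# Hierarchical (JSON-like) record flattened into path/value pairs.
record = flatten({"user": {"name": "ann", "tags": ["x", "y"]}})
```

In this sketch, `matmul(a, add(b, c))` and `add(matmul(a, b), matmul(a, c))` produce the same array. This is the kind of distributivity that lets a pipeline reorder steps without changing the result.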