<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[My world]]></title><description><![CDATA[Hi~]]></description><link>https://www.coder-nova.com</link><image><url>https://www.coder-nova.com/innei.svg</url><title>My world</title><link>https://www.coder-nova.com</link></image><generator>Shiro (https://github.com/Innei/Shiro)</generator><lastBuildDate>Sun, 10 May 2026 13:09:18 GMT</lastBuildDate><atom:link href="https://www.coder-nova.com/feed" rel="self" type="application/rss+xml"/><pubDate>Sun, 10 May 2026 13:09:17 GMT</pubDate><language><![CDATA[zh-CN]]></language><item><title><![CDATA[[投资学习笔记] 阻力位与箱体]]></title><description><![CDATA[<div><blockquote>This rendering is generated by Shiro API, and there may be formatting issues. For the best experience, please visit:<a href="https://www.coder-nova.com/posts/invest/resistance_level">https://www.coder-nova.com/posts/invest/resistance_level</a></blockquote><div><p>起因是SBG在回踩的时候卖飞了，经过反思后觉得有必要来学习一下如何判断阻力位以及箱体。</p><hr/><h3 id="--resistance-level">一、 阻力位 (Resistance Level)</h3><p>阻力位可以说是股价上涨的“天花板”。在这个价位附近，卖方的力量开始超过买方的力量，导致股价停止上涨，甚至可能掉头下跌。</p><h4 id="-">(一) 如何确定阻力位？</h4><ol start="1"><li><strong>前期高点 (Previous Highs)</strong>：在K线图上，过去明显的波段高点或历史最高点，是天然的阻力位。</li><li><strong>移动平均线 (Moving Averages, MA)</strong>：在下降趋势中，重要的均线（如MA20, MA60, MA120）会成为股价反弹的动态阻力。</li><li><strong>下降趋势线 (Downtrend Line)</strong>：连接两个或多个依次降低的高点，形成的斜线会对后续反弹形成压制。</li><li><strong>斐波那契回撤 (Fibonacci Retracement)</strong>：在一轮下跌后，反弹至38.2%, 50%, 61.8%等关键位置时，通常会遇到阻力。</li><li><strong>整数关口/心理价位 (Psychological Levels)</strong>：如 ¥10, ¥50, ¥100 等整数价位。</li></ol><h4 id="-">(二) 原因分析</h4><ul><li><p><strong>前期高点</strong>：这是最经典的心理博弈区，主要有三股卖出力量汇集于此。</p><ol 
start="1"><li><strong>解套盘</strong>：在前期高点买入的投资者，股价下跌后一直被套牢。当股价终于回升到他们的成本价附近时，会急于卖出回本。这构成了第一批坚决的卖压。</li><li><strong>获利盘</strong>：在底部或上涨中途买入的投资者，看到股价接近了前一个高峰，会认为这是一个阶段性的顶部。出于锁定利润的心理，他们会选择在此卖出。</li><li><strong>做空力量</strong>：看空市场的投资者会认为前期高点是一个经过市场验证的天花板，在此位置建立空头头寸的风险收益比较高，他们的卖出行为进一步加强了阻力。</li></ol></li><li><p><strong>移动平均线 (MA)</strong>：</p><ul><li><strong>平均成本的“心理锚”</strong>：移动平均线代表了过去一段时间内所有交易者的平均持仓成本。在下降趋势中，当股价反弹至重要的长期均线（如60日线、120日线）时，意味着价格接近了这段时间内的平均成本。大量在此期间买入的被套投资者终于等到了解套机会，便会集中卖出，形成强大阻力。</li></ul></li><li><p><strong>整数关口/心理价位</strong>：</p><ul><li><strong>人类对“整数”的偏好</strong>：人们在设定目标时，天然倾向于使用简单好记的整数，如10元、50元、100元。无论是机构还是散户，他们的挂单、止盈目标价位也常常集中在这些整数关口，导致订单的密集，从而形成事实上的阻力。当然，中国交易者难免会出现6或8的偏好。</li></ul></li></ul><h4 id="-">(三) 阻力位突破判断技巧</h4><p><strong>1. 成交量与阻力强弱</strong>
成交量是判断多空力量强弱的核心指标，也是验证阻力位和突破有效性的试金石。</p><ul><li><strong>放量突破的可靠性</strong>：突破阻力位时，如果成交量显著放大，说明有大量资金愿意在更高价位买入，买方力量充足，突破更容易成立，是强烈的看涨信号。
  <ul><li><strong>术语解释：放量 (Volume Spike)</strong> <ul><li>指成交量远超近期平均水平。通常以突破日的成交量达到前5个交易日平均成交量的1.5倍或以上，作为参考标准。</li></ul></li></ul></li><li><strong>缩量遇阻回落</strong>：当股价接近阻力位但成交量萎缩，表明市场追涨意愿不足，缺乏向上攻击的能量，此时很容易因卖盘的出现而冲高回落。</li></ul><p><strong>2. 多周期共振</strong>
不同时间周期的图表相互验证，可以大幅提高阻力位的可靠性。</p><ul><li><strong>日线 + 周线 + 月线</strong>：如果在日线图上看到的阻力位，恰好也是周线图或月线图上的一个重要历史高点，那么这个阻力位的级别就非常高，突破难度极大，一旦突破，意义也更重大。
  <ul><li><strong>术语解释：多周期共振 (Multi-Timeframe Resonance)</strong> <ul><li>指在不同的时间周期图表上（如日线、周线、月线），出现了指向同一结论的技术信号（如多个周期同时出现阻力）。这种共振信号的可靠性远高于单一周期的信号。</li></ul></li></ul></li><li><strong>均线与水平阻力重叠</strong>：例如，60日均线（MA60）恰好运行到前期高点的水平位置附近，形成“双重阻力”，这里的压力会非常强。</li></ul><p><strong>3. K线形态确认</strong>
单日的K线形态可以提供关于阻力位附近多空博弈的宝贵信息。</p><ul><li><strong>长上影线</strong>：股价在盘中一度突破阻力位，但收盘时又被打回阻力位下方，形成一根带有长长上影线的K线。这说明上方卖压沉重，多头攻击失败，阻力位被有效验证。
  <ul><li><strong>术语解释：K线 (Candlestick)</strong> <ul><li>又称蜡烛图，通过图形记录每个交易日的开盘价、收盘价、最高价和最低价。上影线代表了当天最高价与收盘价（或开盘价）之间的差距。</li></ul></li></ul></li><li><strong>连续小阳线冲击阻力</strong>：股价以温和的小阳线形式，一步步逼近阻力位。这说明买方在持续试探，但如果连续多日都无法有效放量突破，说明攻击力度有限，耐心耗尽后可能引发空头的快速反扑。</li></ul><p><strong>4. 动态阻力判断</strong>
除了固定的水平线和趋势线，还有一些指标可以提供动态变化的阻力参考。</p><ul><li><strong>布林带上轨</strong>：在震荡行情中，布林带（Bollinger Bands）的通道上轨线常常对股价形成短期动态阻力。
  <ul><li><strong>术语解释：布林带 (Bollinger Bands, BOLL)</strong> <ul><li>一种由移动平均线和统计学中的标准差概念构成的通道型指标。它由上、中、下三条轨道线组成，股价通常在上下轨之间波动。</li></ul></li></ul></li><li><strong>趋势线转折</strong>：支撑与阻力可以相互转化。例如，一条长期压制股价的下降趋势线一旦被成功突破，当股价后续回调至这条线附近时，原来的阻力线就会变成支撑线。</li></ul><hr/><h3 id="--trading-range--box">二、 箱体 (Trading Range / Box)</h3><p>箱体理论是指股价在一段时间内，在由阻力位（箱顶）和支撑位（箱底）构成的价格区间内反复波动。</p><h4 id="-">(一) 如何识别箱体？</h4><ol start="1"><li><strong>识别价格区间</strong>：找到一段股价横盘整理的时期。</li><li><strong>确定箱顶（阻力）</strong>：连接该区域内至少两个大致相同的高点，画出水平上轨。</li><li><strong>确定箱底（支撑）</strong>：连接该区域内至少两个大致相同的低点，画出水平下轨。</li></ol><h4 id="-">(二) 箱体优化技巧</h4><p><strong>1. 箱体形态识别优化</strong></p><ul><li><strong>宽幅 vs. 窄幅箱体</strong>：
  <ul><li><strong>宽幅箱体</strong>：箱顶与箱底之间价差较大（如超过15%-20%）。波动空间大，适合在箱体内进行高抛低吸的波段操作。</li><li><strong>窄幅箱体</strong>：价差很小，K线紧密排列。这通常意味着多空力量高度均衡，市场在等待一个明确的信号。一旦突破，能量释放会更猛烈，往往伴随加速的大幅行情。</li></ul></li><li><strong>假突破识别</strong>：这是箱体交易中最需要防范的陷阱。
  <ul><li><strong>向上假突破 (Bull Trap)</strong>：股价在某天放量突破箱顶，但无法持续，在1-3个交易日内又快速跌回箱体内部，甚至伴随着更大的成交量下跌。这是诱多陷阱。</li><li><strong>向下假跌破 (Bear Trap)</strong>：股价跌破箱底，看似趋势转坏，但很快被强力拉回箱体内部，并伴随成交量放大。这是诱空陷阱，往往是洗盘吸筹的行为。</li></ul></li></ul><p><strong>2. 箱体内的交易量变化</strong></p><ul><li><strong>量价齐升</strong>：当价格从箱底反弹，向箱顶运行时，如果成交量也呈现温和放大的态势，说明有资金在积极推动，增加了未来成功突破箱顶的可能性。</li><li><strong>量能递减</strong>：当价格在箱体内反复震荡，但整体成交量却呈现越来越小的趋势。这表明市场交投清淡，多空双方的观望情绪浓厚，浮动筹码被逐渐锁定。这可能是暴风雨前的宁静，意味着突破的时机越来越近。</li></ul><p><strong>3. 多周期箱体叠加</strong>
从小周期服从大周期的原则出发，观察不同周期图上的箱体。</p><ul><li>一个日线级别的小箱体，可能只是周线级别一个大箱体内部的一次小幅震荡。大周期箱体的突破（如周线级别的突破），往往会带来更长久、更强劲的趋势性行情。</li></ul><p><strong>4. 突破后的加速规律</strong></p><ul><li><strong>回踩确认 (Pullback)</strong>：一个健康、可靠的突破，往往不是一去不回头。股价在向上突破箱顶后，上涨一段距离，然后会有一个缩量的回调动作，回踩到原箱顶（此时已从阻力变为支撑）附近，得到支撑后再继续上攻。这个“回踩确认”的过程可能是绝佳的二次买入或加仓点。</li></ul><hr/><h3 id="-">三、 综合判断</h3><p><strong>1. 配合震荡指标</strong>
使用RSI、MACD等震荡指标来辅助判断突破的风险。</p><ul><li>当股价即将或刚刚突破阻力位时，如果RSI指标已经显示高于70的超买状态，或者MACD指标与股价走势形成顶背离，那么即使突破了，也要警惕短期内因动能衰竭而引发回调的风险。
  <ul><li><strong>术语解释：RSI (相对强弱指数)</strong> <ul><li>衡量股价近期涨跌动能的指标，通常高于70被视为超买区，低于30为超卖区。</li></ul></li><li><strong>术语解释：MACD 顶背离 (Bearish Divergence)</strong> <ul><li>指股价创出新高，但MACD指标的对应高点却未能创出新高（甚至在降低）。这是典型的动能衰竭信号，预示上涨趋势可能即将反转。</li></ul></li></ul></li></ul><p><strong>2. 事件驱动</strong>
技术分析并非万能，需要结合基本面信息。</p><ul><li>当股价运行到关键阻力位或箱体突破前夕，如果恰逢公司发布财报、行业出现重大利好/利空政策等重大事件，这些事件驱动的强大力量可能会让原有的技术形态瞬间失效。</li></ul><p><strong>3. 心理预期与市场情绪</strong></p><ul><li><strong>热门题材股</strong>的阻力位突破，往往伴随着市场的高度关注和情绪化资金的爆发性推动，有时会以连续涨停等极端方式完成，不一定有“回踩”过程。</li><li><strong>冷门股或绩优白马股</strong>的突破，则更多依赖于主力资金的稳健布局，走势通常更符合技术规范，例如会清晰地看到放量突破和缩量回踩。</li></ul><hr/><h3 id="">四、如何识别假突破</h3><h4 id="-">(一) 突破前的信号</h4><p>在股价尚未突破阻力位，但正在不断接近它时，一些先行指标可能已经发出了警告。</p><p><strong>1. 量价背离 (Volume-Price Divergence) - 最重要的先行指标</strong></p><ul><li><strong>现象</strong>：股价在不断创出新高（或接近前期高点），但对应的成交量却一波比一波低，呈现萎缩状态。</li><li><strong>解读</strong>：这说明推动价格上涨的动能（资金）正在衰竭。价格的上涨更像是一种惯性，缺乏新增资金的认可。这种“无量上涨”是空虚的，一旦在阻力位遇到抛压，很容易崩溃。</li></ul><p><strong>2. 指标顶背离 (Indicator Bearish Divergence)</strong></p><ul><li><strong>现象</strong>：股价创出新高，但常用的动量指标（如MACD、RSI）的对应高点却未能创出新高，反而走低。</li><li><strong>解读</strong>：这表明虽然价格表面上很强势，但其内在的上涨动能已经减弱，上涨趋势随时可能反转。价格和指标走势不一致，是趋势即将衰竭的信号。</li></ul><p><strong>3. K线形态乏力</strong></p><ul><li><strong>现象</strong>：在冲击阻力位的过程中，K线实体越来越短，上影线却越来越长，或者频繁出现十字星。</li><li><strong>解读</strong>：长上影线和十字星都代表了在该价位多空分歧巨大，买方虽然努力上攻，但被卖方成功压制。这说明阻力位的抛压非常真实且有效，买方的攻击已经显现出疲态。</li></ul><h4 id="-">(二) 突破的验证方法</h4><p><strong>1. 突破必须放量</strong></p><ul><li><strong>真突破</strong>：一个健康、可信的突破，<strong>必须伴随成交量的显著放大</strong>（通常是20日均量的1.5倍以上）。</li><li><strong>假突破</strong>：如果突破阻力位当天的成交量与前几日相比没有明显放大，甚至是缩量的，那么这是最典型的假突破特征。</li><li><strong>原因分析</strong>：突破阻力位意味着要消化所有在该位置等待解套和获利的卖盘，这需要巨大的资金来承接。没有成交量的放出，说明市场主力资金根本没有参与，或者参与意愿不强。</li></ul><p><strong>2. K线实体强度</strong></p><ul><li><strong>真突破</strong>：突破当天通常会收出一根<strong>大阳线</strong>，且收盘价要明显高于阻力线（比如高于3%以上），显示出买方压倒性的决心。</li><li><strong>假突破</strong>：突破当天收出的是一根小阳线，或者带有长上影线的阳线（即使收盘价在阻力位之上）。这表明突破过程非常勉强，在盘中遭到了强大阻击，买方优势微弱。</li></ul><h4 id="-">(三) 突破后的确认技巧</h4><p><strong>1. “三日原则”与快速回落</strong></p><ul><li><strong>真突破</strong>：股价在突破阻力位后，通常能<strong>连续2-3个交易日稳定地收盘在原阻力位之上</strong>。</li><li><strong>假突破</strong>：最常见的特征是在突破后的<strong>下一个交易日</strong>，股价就立即大幅低开或盘中跳水，收盘时又跌回到了阻力位下方。</li></ul><p><strong>2. 
回踩形态的性质</strong></p><ul><li><strong>真突破后的回踩</strong>：健康的回踩是<strong>缩量的</strong>，股价跌到原阻力位附近时会明显获得支撑，然后止跌企稳，再度上涨。此时，原阻力位成功转换为支撑位。</li><li><strong>假突破的回落</strong>：回落是<strong>放量的</strong>，股价会毫不费力地直接击穿原阻力位，根本没有任何支撑作用。这表明市场完全不认可突破的有效性，恐慌盘正在涌出。</li></ul></div><p style="text-align:right"><a href="https://www.coder-nova.com/posts/invest/resistance_level#comments">Finished reading? Say something</a></p></div>]]></description><link>https://www.coder-nova.com/posts/invest/resistance_level</link><guid isPermaLink="true">https://www.coder-nova.com/posts/invest/resistance_level</guid><dc:creator><![CDATA[yang]]></dc:creator><pubDate>Fri, 15 Aug 2025 06:23:56 GMT</pubDate></item><item><title><![CDATA[[中文笔记] Convex Optimization for Neural Networks]]></title><description><![CDATA[<div><blockquote>This rendering is generated by Shiro API, and there may be formatting issues. For the best experience, please visit:<a href="https://www.coder-nova.com/posts/ai/convex_optimization_nn">https://www.coder-nova.com/posts/ai/convex_optimization_nn</a></blockquote><div><p><a href="https://web.stanford.edu/class/ee364b/lectures/convexNN.pdf">课件地址（Stanford EE364B：Convex Optimization for Neural Networks）</a></p><p><a href="https://web.stanford.edu/~boyd/cvxbook/bv_cvxslides.pdf">预备知识 (凸函数相关)</a></p><h2 id="-relu-">核心问题：如何将两层 ReLU 网络的非凸问题转为凸问题求解？</h2><h3 id=""><strong>完整的推导路线图</strong></h3><ol start="1"><li><strong>非凸的 L2 正则化网络</strong> ↓ <em>(变量缩放)</em></li><li><strong>等价的 L1 路径范数问题</strong> ↓ <em>(权重规范化)</em></li><li><strong>无限字典的 LASSO 问题</strong> ↓ <em>(凸对偶)</em></li><li><strong>半无限规划 (SIP)</strong> ↓ <em>(超平面排列)</em></li><li><strong>有限维的凸对偶问题 (QCQP)</strong> ↓ <em>(再次求对偶)</em></li><li><strong>最终的、可解的组 LASSO 问题</strong></li></ol><blockquote><p>通过这个转换来理解“神经网络是凸正则化器”。它通过一个巧妙的变量缩放技巧，将一个看似复杂的 L2 正则化项，变成了一个形式更简单、更利于后续对偶分析的 L1 正则化项。</p></blockquote>
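在进入逐步推导之前，先用一段纯 Python 小实验对路线图第 1 步的“变量缩放”做数值验证（示意代码，网络规模与随机权重均为笔者自拟，非课件内容）：它检查缩放 $(w_{1j}/\alpha_j,\ \alpha_j w_{2j})$ 不改变网络输出，且最优缩放下的 L2 惩罚恰为 $2\|w_{1j}\|_2|w_{2j}|$。

```python
import math
import random

def relu(z):
    return [max(0.0, t) for t in z]

def forward(X, W1, w2):
    """两层 ReLU 网络: f(X) = sum_j sigma(X w_1j) * w_2j"""
    out = [0.0] * len(X)
    for w1j, w2j in zip(W1, w2):
        act = relu([sum(xi * wi for xi, wi in zip(x, w1j)) for x in X])
        out = [o + a * w2j for o, a in zip(out, act)]
    return out

random.seed(0)
X = [[random.gauss(0, 1) for _ in range(3)] for _ in range(5)]   # 5 个样本, 3 维
W1 = [[random.gauss(0, 1) for _ in range(3)] for _ in range(2)]  # 2 个隐神经元
w2 = [random.gauss(0, 1) for _ in range(2)]

# 1) 正齐次性: w1 -> w1/alpha, w2 -> w2*alpha, 输出应完全不变
alpha = [0.7, 2.3]
W1s = [[w / a for w in w1j] for w1j, a in zip(W1, alpha)]
w2s = [w * a for w, a in zip(w2, alpha)]
max_diff = max(abs(a - b) for a, b in zip(forward(X, W1, w2), forward(X, W1s, w2s)))

# 2) 最优缩放 (alpha*)^2 = ||w1j||_2 / |w2j| 下, L2 惩罚 = 2 * ||w1j||_2 * |w2j|
def min_penalty_gap(w1j, w2j):
    n1 = math.sqrt(sum(w * w for w in w1j))
    a2 = n1 / abs(w2j)                      # (alpha*)^2
    penalty = n1 * n1 / a2 + a2 * w2j * w2j  # ||w1/alpha||^2 + (alpha*w2)^2
    return abs(penalty - 2 * n1 * abs(w2j))

max_gap = max(min_penalty_gap(w1j, w2j) for w1j, w2j in zip(W1, w2))
print(max_diff, max_gap)
```

两个量都应在浮点误差量级内为零，说明损失项不变、而正则项的最小值只依赖乘积 $\|w_{1j}\|_2|w_{2j}|$。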
<hr/><h3 id="">问题设定</h3><p><strong>场景设定</strong></p><ul><li><strong>模型</strong>: 两层全连接 ReLU 网络。</li><li><strong>损失函数</strong>: 平方损失。</li><li><strong>正则化</strong>: 标准的 L2 权重衰减 (Weight Decay)。</li></ul><p>我们的初始优化问题是：</p><p>$$
\min_{W_1, W_2} \frac{1}{2} \left\| \sum_{j=1}^{m} \sigma(Xw_{1j})w_{2j} - y \right\|_2^2 + \lambda \left( \sum_{j=1}^{m} \|w_{1j}\|_2^2 + \sum_{j=1}^{m} |w_{2j}|^2 \right)
$$</p><p>其中，$W_1$ 是第一层权重矩阵（$w_{1j}$ 是其第 j 列），$w_2$ 是第二层权重向量（$w_{2j}$ 是其第 j 个标量元素）。这是一个关于 $W_1$ 和 $W_2$ 的非凸问题。</p><hr/><h3 id="-the-scaling-trick">第一步：变量缩放 (The Scaling Trick)</h3><p>对于单个隐层神经元 j，我们可以对其权重对 $(w_{1j}, w_{2j})$ 进行缩放，而不改变网络的最终输出。</p><h4 id="11-">1.1. 定义缩放变换</h4><p>为每个隐神经元 $j$ 引入一个缩放因子 $\alpha_j &gt; 0$。定义新的权重为：</p><p>$$
\tilde{w}_{1j} = \frac{w_{1j}}{\alpha_j}, \quad \tilde{w}_{2j} = w_{2j}\alpha_j
$$</p><h4 id="12-">1.2. 为什么网络输出不变？</h4><p>这一步依赖于 ReLU 函数 $\sigma(t) = \max(0, t)$ 的正齐次性 (Positive Homogeneity)，即 $\sigma(c \cdot t) = c \cdot \sigma(t)$ 对任何 $c \ge 0$ 成立。</p><p>因此，对任意一个神经元 $j$，其对最终输出的贡献为：</p><p>$$
\sigma(X\tilde{w}_{1j})\tilde{w}_{2j} = \sigma\left(X\frac{w_{1j}}{\alpha_j}\right)(w_{2j}\alpha_j) = \frac{1}{\alpha_j}\sigma(Xw_{1j})(\alpha_j w_{2j}) = \sigma(Xw_{1j})w_{2j}
$$</p><p>由于 $\alpha_j &gt; 0$，$1/\alpha_j$ 可以被提出来。可以看到，经过缩放后的新权重 $(\tilde{w}_{1j}, \tilde{w}_{2j})$ 产生的输出与原权重完全相同。因此，总模型的输出 $\sum \sigma(\cdot)w_{2j}$ 和损失函数 $\|\dots - y\|^2$ 的值也保持不变。</p><h4 id="13-">1.3. 正则化项如何变化？</h4><p>虽然损失项不变，但正则化项会随着 $\alpha_j$ 的变化而改变。对于神经元 $j$，其正则化惩罚从 $\|w_{1j}\|_2^2 + |w_{2j}|^2$ 变为：</p><p>$$
\text{Penalty}_j(\alpha_j) = \left\| \frac{w_{1j}}{\alpha_j} \right\|_2^2 + |w_{2j}\alpha_j|^2 = \frac{1}{\alpha_j^2}\|w_{1j}\|_2^2 + \alpha_j^2|w_{2j}|^2
$$</p><h4 id="14-">1.4. 寻找最优缩放</h4><p>对于一组固定的 $(w_{1j}, w_{2j})$，我们可以通过选择最优的 $\alpha_j$ 来最小化这个惩罚项。我们将 $\text{Penalty}_j(\alpha_j)$ 看作是关于 $\alpha_j$ 的函数，求其最小值：</p><p>令其导数为零：</p><p>$$
\frac{d}{d\alpha_j}\text{Penalty}_j(\alpha_j) = -\frac{2}{\alpha_j^3}\|w_{1j}\|_2^2 + 2\alpha_j|w_{2j}|^2 = 0
$$</p><p>解得最优的 $\alpha_j^*$ 满足：</p><p>$$
(\alpha_j^*)^4 = \frac{\|w_{1j}\|_2^2}{|w_{2j}|^2} \quad \implies \quad (\alpha_j^*)^2 = \frac{\|w_{1j}\|_2}{|w_{2j}|}
$$</p><p>将最优的 $(\alpha_j^*)^2$ 代回惩罚项，得到最小惩罚值为：</p><p>$$
\min_{\alpha_j &gt; 0} \text{Penalty}_j(\alpha_j) = \frac{|w_{2j}|}{\|w_{1j}\|_2}\|w_{1j}\|_2^2 + \frac{\|w_{1j}\|_2}{|w_{2j}|}|w_{2j}|^2 = \|w_{1j}\|_2|w_{2j}| + \|w_{1j}\|_2|w_{2j}| = 2\|w_{1j}\|_2|w_{2j}|
$$</p><h4 id="15--l2--l1-">1.5. 结论：从 L2 到 L1 路径范数</h4><p>这个结果表明，对于任意给定的权重 $(W_1, W_2)$，总能找到一组缩放因子 $\{\alpha_j\}$，使得网络输出不变，但正则化项取得其可能的最小值 $\sum 2\lambda\|w_{1j}\|_2|w_{2j}|$。</p><p>因此，原始的、带有 L2 权重衰减的优化问题，等价于下面这个新的优化问题：</p><p>$$
\min_{W_1, W_2} \frac{1}{2} \left\| \sum_{j=1}^{m} \sigma(Xw_{1j})w_{2j} - y \right\|_2^2 + 2\lambda \sum_{j=1}^{m} \|w_{1j}\|_2|w_{2j}|
$$</p><p>这个新的正则化项 $\sum \|w_{1j}\|_2|w_{2j}|$ 被称为<strong>路径范数 (Path-Norm)</strong>。它形似 L1 范数，因为它惩罚的是绝对值的和，这为后续的稀疏性分析和对偶变换奠定了基础。我们成功地将 L2 正则化“搬运”并转化为了一个结构更清晰的 L1 型正则化。</p><hr/><h3 id="-lasso-">第二步：权重规范化与 LASSO 形式</h3><p>为了使问题形式更加标准，我们进一步对权重进行分解。</p><h4 id="21-">2.1. 再次重参数化</h4><p>对于每个 $j$，定义：</p><ul><li><strong>方向</strong>: $u_j = w_{1j} / \|w_{1j}\|_2$，这是一个单位向量，$\|u_j\|_2 = 1$。</li><li><strong>系数</strong>: $c_j = w_{2j} \|w_{1j}\|_2$。</li></ul><h4 id="22-">2.2. 代入模型</h4><p>网络的输出可以重写为：</p><p>$$
\sigma(Xw_{1j})w_{2j} = \sigma(X(\|w_{1j}\|_2 u_j))w_{2j} = \|w_{1j}\|_2\sigma(Xu_j)w_{2j} = \sigma(Xu_j)(w_{2j}\|w_{1j}\|_2) = \sigma(Xu_j)c_j
$$</p><p>而路径范数正则项则变成了对新系数 $c_j$ 的 L1 惩罚：</p><p>$$
2\lambda \sum_{j} \|w_{1j}\|_2\,|w_{2j}| = 2\lambda \sum_{j} \left|w_{2j}\,\|w_{1j}\|_2\right| = 2\lambda \sum_{j} |c_j|
$$</p><h4 id="23--lasso-">2.3. 等价的 LASSO 问题</h4><p>吸收常数 2 到 $\lambda$ 中，我们得到了一个在形式上非常接近 LASSO (L1 正则化回归) 的问题：</p><p>$$
\min_{\|u_j\|_2=1, c_j \in \mathbb{R}} \frac{1}{2} \left\| \sum_{j=1}^{m} \sigma(Xu_j)c_j - y \right\|_2^2 + \lambda \sum_{j=1}^{m} |c_j|
$$</p><p>这个问题要求我们在所有可能的“特征”（由单位向量 $u_j$ 生成的 $\sigma(Xu_j)$）中，寻找一个稀疏的线性组合来拟合 $y$。这正是后续进行凸对偶分析的出发点。</p><h3 id="-sip">第三步：推导凸对偶，得到半无限规划 (SIP)</h3><p>我们已经将原问题等价地表示为一个在“特征” $\sigma(Xu_j)$ 上的稀疏回归问题。这里的“特征”是由第一层单位范数权重 $u_j$ 生成的。因为 $u_j$ 可以在单位球面上任意取值，所以我们实际上是在一个无限的特征字典里进行选择。</p><p>这一步的目标是推导出这个问题的<strong>凸对偶 (Convex Dual)</strong>形式。</p><h4 id="31-">3.1. 定义问题和字典</h4><p>我们的问题是：</p><p>$$
\min_{\|u_j\|_2=1, c_j} \frac{1}{2} \left\| \sum_j \sigma(Xu_j)c_j - y \right\|_2^2 + \lambda \sum_j |c_j|
$$</p><p>为了进行分析，我们首先将这个问题看作一个两阶段的过程：</p><ol start="1"><li>先从无限的特征集合 $\Phi = \{\sigma(Xu) \mid \|u\|_2=1\}$ 中，选择一个有限的子集（即字典）$\mathcal{A} = [\sigma(Xu_1), \sigma(Xu_2), \dots] = [\phi_1, \phi_2, \dots]$。</li><li><p>然后在这个固定的字典 $\mathcal{A}$ 上求解标准的 LASSO 问题：</p><p>$$
\min_{c} \frac{1}{2}\|\mathcal{A}c - y\|_2^2 + \lambda\|c\|_1
$$</p></li></ol><h4 id="32--lasso-">3.2. 固定字典下的标准 LASSO 对偶</h4><p>现在，我们对这个固定字典 $\mathcal{A}$ 的 LASSO 问题求其对偶。</p><ul><li><p><strong>引入对偶变量</strong>: 我们引入对偶变量 $v \in \mathbb{R}^n$（n 是样本数）。利用 Fenchel 共轭，二次损失项可以写作：</p><p>  $$
  \frac{1}{2}\|\mathcal{A}c - y\|_2^2 = \max_{v \in \mathbb{R}^n} \left( v^\top(\mathcal{A}c) - \frac{1}{2}\|v-y\|_2^2 + \frac{1}{2}\|y\|_2^2 \right)
  $$</p><p>  <em>(注：$\frac{1}{2}\|y\|_2^2$ 是常数，为简化通常先省略，最后加回。课件中的 $-\frac{1}{2}\|v-y\|_2^2$ 是一个等价的、更简洁的表达)。</em></p></li><li><p><strong>交换 min 和 max</strong>: 将上式代入原问题，得到一个 min-max 问题。由于原问题是凸的，强对偶性成立，我们可以交换 min 和 max 的顺序：</p><p>  $$
  \max_{v} \min_{c} \left( v^\top(\mathcal{A}c) + \lambda\|c\|_1 - \frac{1}{2}\|v-y\|_2^2 + \frac{1}{2}\|y\|_2^2 \right)
  $$</p></li><li><p><strong>求解内部 min</strong>: 我们关注内部对 $c$ 的最小化部分：</p><p>  $$
  \min_{c} \left( (\mathcal{A}^\top v)^\top c + \lambda\|c\|_1 \right)
  $$</p><p>  这是 L1 范数的共轭函数。其解为：</p><p>  $$
  \begin{cases}
  0, &amp; \text{if } \|\mathcal{A}^\top v\|_\infty \le \lambda \\
  -\infty, &amp; \text{otherwise}
  \end{cases}
  $$</p><p>  其中 $\|\mathcal{A}^\top v\|_\infty \le \lambda$ 等价于对字典的每一列 $\phi_j$ 都有 $|v^\top\phi_j| \le \lambda$。当这个条件不满足时，我们可以让 $c$ 的某些元素朝 $-(\mathcal{A}^\top v)$ 的方向无限增大，使得值为 $-\infty$。</p></li><li><p><strong>得到对偶问题</strong>: 将这个结果代回 max 问题，$-\infty$ 的情况可以忽略，我们只关心值为 0 的情况，即约束 $\|\mathcal{A}^\top v\|_\infty \le \lambda$ 必须满足。因此，固定字典 $\mathcal{A}$ 下的对偶问题是：</p><p>  $$
  \begin{aligned}
  \max_{v} \quad &amp; -\frac{1}{2}\|v-y\|_2^2 + \frac{1}{2}\|y\|_2^2 \\
  \text{s.t.} \quad &amp; \|\mathcal{A}^\top v\|_\infty \le \lambda
  \end{aligned}
  $$</p><p>  <em>(目标函数等价于最小化 $\|v-y\|_2^2$)</em></p></li></ul><h4 id="33-">3.3. 从固定字典回到无限字典</h4><p>现在，我们回到原始设定，即字典 $\mathcal{A}$ 并非固定的，而是包含了所有由单位向量 $u$ 生成的特征 $\sigma(Xu)$。</p><p>对偶问题的约束 $\|\mathcal{A}^\top v\|_\infty \le \lambda$，即 $|v^\top\phi| \le \lambda$，必须对字典中的每一列 $\phi$ 都成立。为了让这个对偶问题成为原问题的一个有效下界（并且由于强对偶性，是一个紧密的下界），这个约束必须对整个无限特征集 $\Phi$ 都成立。</p><p>因此，我们将约束替换为：</p><p>$$
|v^\top\sigma(Xu)| \le \lambda, \quad \text{对所有满足 } \|u\|_2=1 \text{ 的 } u \text{ 成立。}
$$</p><h4 id="34--semi-infinite-program-sip">3.4. 最终的半无限规划 (Semi-Infinite Program, SIP)</h4><p>将这个无限约束代入，我们得到了最终的对偶问题，它是一个<strong>半无限规划</strong>：</p><p>$$
\begin{aligned}
\max_{v \in \mathbb{R}^n} \quad &amp; -\frac{1}{2}\|v-y\|_2^2 \\
\text{s.t.} \quad &amp; |v^\top\sigma(Xu)| \le \lambda, \quad \forall u \in \mathbb{R}^d, \|u\|_2=1
\end{aligned}
$$</p><p>这个规划被称为半无限，因为它的优化变量 $v$ 是有限维的（$v \in \mathbb{R}^n$），但它却拥有无限约束（每个单位向量 $u$ 都对应一条约束）。</p><h3 id="--sip-">第四步：将无限约束转化为有限约束 (从 SIP 到可解问题)</h3><p>我们在第三步得到的半无限规划 (SIP) 是一个理论上优美的结果，但它包含无限约束，无法直接用算法求解。这一步的目的是展示如何将这无限条约束转化为有限。</p><h4 id="41--sip-">4.1. 回顾 SIP 和其挑战</h4><p>到这里，我们的问题是：</p><p>$$
\begin{aligned}
\max_{v \in \mathbb{R}^n} \quad &amp; -\frac{1}{2}\|v-y\|_2^2 \\
\text{s.t.} \quad &amp; |v^\top\sigma(Xu)| \le \lambda, \quad \forall u \in \mathbb{R}^d, \|u\|_2=1
\end{aligned}
$$</p><p>挑战在于约束条件 $|v^\top\sigma(Xu)| \le \lambda$ 必须对单位球面 $|u|_2=1$ 上无穷无尽的向量 <code>u</code> 都成立。</p><h4 id="42--hyperplane-arrangement">4.2. 超平面排列 (Hyperplane Arrangement)</h4><p>这里的核心思想是：函数 $\sigma(Xu)$ 虽然依赖于连续变化的 <code>u</code>，但其“形状”只有有限多种。</p><ul><li><strong>ReLU 的分段线性特性</strong>: $\sigma(z) = \max(0, z)$ 是一个分段线性函数。因此，$\sigma(Xu)$ 也是一个关于 <code>u</code> 的分段线性函数。</li><li><strong>符号模式 (Sign Pattern)</strong>: 函数 $\sigma(Xu)$ 的具体形式取决于向量 $Xu$ 中每个元素 $x_i^\top u$ 的正负号。</li><li><strong>超平面</strong>: 在 d 维的权重空间中，每个方程 $x_i^\top u = 0$ (对于样本 i=1,...,n) 都定义了一个穿过原点的超平面。</li><li><strong>区域划分</strong>: 这 n 个超平面将整个 $\mathbb{R}^d$ 空间划分成了有限个多面体区域 (Polyhedral Regions)。</li><li><strong>关键结论</strong>: 在任何一个特定的区域 k 内部，所有向量 <code>u</code> 产生的 $Xu$ 都具有完全相同的符号模式。</li></ul><h4 id="43-">4.3. 用对角矩阵表示符号模式</h4><p>对于每个区域 k，我们可以用一个 n×n 的对角矩阵 $D<em>k$ 来表示其对应的符号模式。$D_k$ 的对角线元素 $(D_k)</em>{ii}$ 为：</p><p>$$
(D_k)_{ii} =
\begin{cases}
1, &amp; \text{如果在该区域内 } x_i^\top u &gt; 0 \\
0, &amp; \text{如果在该区域内 } x_i^\top u \le 0
\end{cases}
$$</p><p>利用这个矩阵，只要 <code>u</code> 位于区域 <code>k</code>，我们就可以将非线性的 $\sigma(Xu)$ 精确地写成线性形式：</p><p>$$
\sigma(Xu) = D_k X u
$$</p><h4 id="44-">4.4. 分解无限约束</h4><p>现在，我们可以将原始的无限约束分解到每个区域上。</p><p>$$
\sup_{\|u\|_2=1} |v^\top \sigma(Xu)| \le \lambda
$$</p><p>等价于</p><p>$$
\max_{k=1, \dots, p} \left( \sup_{u \in \text{Region}_k, \|u\|_2=1} |v^\top \sigma(Xu)| \right) \le \lambda
$$</p><p>其中 <code>p</code> 是区域的总数（有限个）。将 $\sigma(Xu) = D_kXu$ 代入，上式变为：</p><p>$$
\max_{k=1, \dots, p} \left( \sup_{u \in \text{Region}_k, \|u\|_2=1} |v^\top D_k X u| \right) \le \lambda
$$</p><p>由于 $|v^\top D_k X u|$ 是 <code>u</code> 的线性函数（的绝对值），其在单位球上的最大值一定在整个球面上取到，而不仅仅是在某个区域的交集上。所以我们可以简化并得到一组有限的约束：</p><p>$$
\sup_{\|u\|_2=1} |v^\top D_k X u| \le \lambda, \quad \text{对每个模式 } k=1, \dots, p
$$</p><p>根据对偶范数的定义，$\sup_{\|u\|_2=1} |z^\top u| = \|z\|_2$。因此，每条约束都变成了：</p><p>$$
\|(v^\top D_k X)^\top\|_2 \le \lambda \quad \iff \quad \|X^\top D_k v\|_2 \le \lambda
$$</p><h4 id="45-">4.5. 等价的有限维对偶问题</h4><p>我们将 SIP 中的无限约束替换为这 <code>p</code> 条有限的凸约束，得到一个标准的、可求解的凸优化问题（这是一个二次约束二次规划 QCQP）：</p><p>$$
\begin{aligned}
\max_{v \in \mathbb{R}^n} &amp; \quad -\frac{1}{2}\|v-y\|_2^2 \\
\text{s.t.} &amp; \quad \|X^\top D_k v\|_2 \le \lambda, \quad \text{for } k=1, \dots, p
\end{aligned}
$$</p><h4 id="46--primal--group-lasso">4.6. 恢复最终的、可解释的 Primal 问题 (Group LASSO)</h4><p>虽然上述对偶问题已经可解，但为了得到更具解释性的形式，我们对它再次求对偶，从而得到原问题的<strong>最终 primal 形式</strong>。得到一个<strong>组稀疏 (Group LASSO)</strong> 回归问题：</p><p>$$
\min_{Z_1, \dots, Z_p \in \mathbb{R}^d} \frac{1}{2} \left\| \sum_{k=1}^p D_k X Z_k - y \right\|_2^2 + \lambda \sum_{k=1}^p \|Z_k\|_2
$$</p><ul><li><strong>变量</strong>: <code>Z_k</code> 是一个 <code>d</code> 维向量，与第 <code>k</code> 个激活模式相关联。</li><li><strong>模型</strong>: 模型将最终的输出 <code>y</code> 表达为来自不同激活模式的贡献 <code>D_kXZ_k</code> 的线性组合。</li><li><strong>正则化</strong>: $\lambda \sum \|Z_k\|_2$ 是组 L1/L2 范数。它会鼓励<strong>整个向量 <code>Z_k</code> 同时为零</strong>。</li><li><strong>解释</strong>: 这个模型通过选择少数几个（稀疏的）重要的激活模式（那些 $Z_k \neq 0$ 的模式）来拟合数据。</li></ul><hr/><h3 id="">总结</h3><p>至此，我们证明了：一个经典的两层 ReLU 网络的训练问题，尽管其原始形式是非凸的，但它与一个标准的、可以高效求解的凸优化问题（组 LASSO）完全等价。这就证明了核心论点：<strong>神经网络可以被看作是凸正则化器</strong>。从最终的组 LASSO 解中，可以恢复出原非凸网络的最优权重 $W_1^*$ 和 $W_2^*$。</p><hr/><h3 id="ps--w1j2--1-">p.s. 课件中将权重规范化为 $\|W_{1j}\|_2 = 1$ 的理由</h3><p>这其实就是把第一层的长度吸收到第二层系数里，而不影响网络输出，也不改变正则项的结构。</p><h4 id="1-">1. 缩放不改变输出</h4><p>ReLU 满足正齐次性：</p><p>$$
\sigma(c\,t) = c\,\sigma(t), \quad c \ge 0
$$</p><p>所以如果我们把 $W_{1j}$ 放大 $k&gt;0$ 倍，并同时把 $W_{2j}$ 缩小 $k$ 倍：</p><p>$$
\sigma(X (k W_{1j})) \cdot \frac{W_{2j}}{k} = k\,\sigma(X W_{1j}) \cdot \frac{W_{2j}}{k} = \sigma(X W_{1j}) W_{2j}
$$</p><p>网络输出完全不变。</p><h4 id="2-">2. 在最优缩放下，正则项是</h4><p>经过 $\alpha_j$ 优化，我们已经得出最优正则是：</p><p>$$
\lambda \sum_{j} 2\|W_{1j}\|_2 \, |W_{2j}|
$$</p><p>（常数 2 可以并进 $\lambda$ 里）</p><h4 id="3-">3. 吸收长度到第二层</h4><p>上面的正则是第一层长度 × 第二层绝对值的乘积。
既然输出只依赖两者的乘积，我们完全可以把第一层长度定为 1，然后把它的值乘到第二层系数上去：</p><p>令：</p><p>$$
u_j = \frac{W_{1j}}{\|W_{1j}\|_2} \quad (\|u_j\|_2 = 1),
$$</p><p>$$
c_j = \text{sign}(W_{2j}) \cdot \|W_{1j}\|_2 |W_{2j}| \quad (= W_{2j} \|W_{1j}\|_2)
$$</p><p>这样：</p><ul><li><p><strong>输出</strong>：</p><p>  $$
  \sigma(X W_{1j}) W_{2j} = \sigma\big(X(\|W_{1j}\|_2 u_j)\big)\,\frac{c_j}{\|W_{1j}\|_2} = \sigma(X u_j)\, c_j
  $$</p></li><li><p><strong>正则项</strong>：</p><p>  $$
  \|W_{1j}\|_2 \, |W_{2j}| = |c_j|
  $$</p></li></ul><p>因为 $|u_j|_2=1$ 已固定，所以只剩下对 $c_j$ 的 L1 正则。</p></div><p style="text-align:right"><a href="https://www.coder-nova.com/posts/ai/convex_optimization_nn#comments">Finished reading? Say something</a></p></div>]]></description><link>https://www.coder-nova.com/posts/ai/convex_optimization_nn</link><guid isPermaLink="true">https://www.coder-nova.com/posts/ai/convex_optimization_nn</guid><dc:creator><![CDATA[yang]]></dc:creator><pubDate>Tue, 12 Aug 2025 15:04:59 GMT</pubDate></item><item><title><![CDATA[[投资学习笔记] 推文分析学习：解读美国国债收益率曲线信号]]></title><description><![CDATA[<link rel="preload" as="image" href="https://pbs.twimg.com/media/Gxwfk-mbAAAUHjE?format=jpg&amp;name=large"/><div><blockquote>This rendering is generated by Shiro API, and there may be formatting issues. For the best experience, please visit:<a href="https://www.coder-nova.com/posts/invest/daily_treasury_par_yield_curve_rates">https://www.coder-nova.com/posts/invest/daily_treasury_par_yield_curve_rates</a></blockquote><div><blockquote><p><strong>核心结论</strong>：当前收益率曲线的多个结构表明，市场正在强烈预期未来的降息，经济“硬着陆”风险显著上升。投资者已开始抢跑交易政策拐点。</p></blockquote>
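在逐条解析之前，可以先把推文涉及的几组关键利差写成一小段 Python（示意代码：其中的收益率与 FFR 数字为虚构示例，仅演示利差与倒挂的判定口径）。利差 = 长端收益率 − 短端收益率，为负即倒挂。

```python
# 假设的某日国债收益率曲线（百分比，示例数据，非真实行情）
yields = {"1M": 5.4, "3M": 5.3, "6M": 5.2, "1Y": 4.9, "2Y": 4.2,
          "3Y": 4.0, "5Y": 4.0, "10Y": 4.3, "20Y": 4.7, "30Y": 4.8}

def spread_bp(curve, long_t, short_t):
    """利差 = 长端 - 短端，以基点(bp)表示；为负即倒挂"""
    return round((curve[long_t] - curve[short_t]) * 100)

for long_t, short_t in [("10Y", "2Y"), ("10Y", "3M"), ("3Y", "1Y"), ("30Y", "20Y")]:
    bp = spread_bp(yields, long_t, short_t)
    state = "倒挂" if bp < 0 else "正斜率"
    print(f"{long_t}-{short_t}: {bp:+d}bp ({state})")

# 2Y 与联邦基金利率(FFR)下限的比较：低于 -100bp 常被视为衰退前兆信号
ffr_lower = 5.25  # 假设的 FFR 区间下限
gap_bp = round((yields["2Y"] - ffr_lower) * 100)
print(f"2Y - FFR下限: {gap_bp:+d}bp")
```

用这套口径即可机械地复现下文“1Y 高于 3Y 即倒挂”“2Y 比 FFR 下限低超过 100bp”等判断。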
<hr/><h3 id="">参考图表与数据来源</h3><ul><li><p><strong>官方数据来源</strong>: 可以在美国财政部官网上查看每日更新的收益率曲线数据。</p><ul><li><strong>链接</strong>: <a href="https://home.treasury.gov/policy-issues/financing-the-government/interest-rate-statistics">U.S. DEPARTMENT OF THE TREASURY - Daily Treasury Par Yield Curve Rates</a></li></ul></li><li><p><strong>示例</strong>:
  <img src="https://pbs.twimg.com/media/Gxwfk-mbAAAUHjE?format=jpg&amp;name=large" alt="美国国债收益率曲线示例 2025-08-07"/></p><p>  来源: <a href="https://x.com/corsica267/status/1953471063666594235">https://x.com/corsica267/status/1953471063666594235</a></p></li></ul><hr/><h3 id="-">一、 收益率曲线逐条解析</h3><p><strong>1. 1M–6M：流动性未吃紧</strong></p><ul><li><strong>现象</strong>：1个月至6个月的短端收益率曲线平坦，无异常陡峭。</li><li><strong>解读</strong>：短期融资市场资金充裕，没有出现流动性危机（“钱荒”）的迹象。</li></ul><p><strong>2. 1Y–3Y：倒挂 → 强降息预期</strong></p><ul><li><strong>现象</strong>：1年期国债收益率高于3年期。</li><li><strong>解读</strong>：市场认为短期利率偏高，但未来必然会下行。这是押注美联储将 <strong>大幅降息</strong> 的明确信号。</li></ul><p><strong>3. 2Y vs. 联邦基金利率：衰退前兆</strong></p><ul><li><strong>现象</strong>：2年期国债收益率比联邦基金利率（FFR）下限低超过100个基点（bp）。</li><li><strong>解读</strong>：这是一个强烈的历史信号，通常发生在 <strong>经济衰退前3–9个月</strong>，反映市场对未来降息的极高确定性。</li></ul><p><strong>4. 3Y–10Y：熊陡 → 长期通胀预期</strong></p><ul><li><strong>现象</strong>：曲线呈现 <strong>熊陡（Bear Steepener）</strong>，即短端利率下降，但长端利率下降更慢甚至上升。</li><li><strong>解读</strong>：市场预期短期经济会下行（迫使联储降息），但 <strong>长期通胀压力依然存在</strong>，可能源于财政赤字或供应链等结构性问题。</li></ul><p><strong>5. 30Y–20Y：小幅正斜率 → 久期与供给风险</strong></p><ul><li><strong>现象</strong>：30年期利率仅略高于20年期。</li><li><strong>解读</strong>：超长期债券面临 <strong>久期风险</strong>（对利率波动敏感）和 <strong>供给压力</strong>（财政部发债）。同时也侧面反映了衰退预期，否则利差会更阔。</li></ul><p><strong>6. 10Y–2Y：倒挂大幅回正 → 经济见顶信号</strong></p><ul><li><strong>现象</strong>：经典的<code>10Y-2Y</code>利差从深度倒挂状态迅速回归正值。</li><li><strong>解读</strong>：历史（如2000年、2007年）表明，这种 <strong>快速正常化</strong> 往往是经济周期触顶、衰退即将来临的标志。</li></ul><p><strong>7. 3M–10Y：轻微倒挂 → 放水前的衰退交易</strong></p><ul><li><strong>现象</strong>：3个月收益率略高于10年期收益率。</li><li><strong>解读</strong>：这是另一个经典的衰退先行指标。市场开始提前布局 <strong>“放水”（降息）前的衰退交易</strong>，如买入长期债券。</li></ul><p><strong>8. 总体：硬着陆风险上升</strong></p><ul><li><strong>综合判断</strong>：多个利差结构发出经济降温信号，但长端的通胀预期并未完全消失。如果政策应对不够平滑，经济很可能走向 <strong>硬着陆</strong>（即衰退伴随失业率急剧上升）。</li></ul><hr/><h3 id="-">二、 对市场的含义</h3><p><strong>1. 
市场抢政策拐点，美元见顶</strong></p><ul><li>投资者正在押注美联储政策将从紧缩转向宽松（降息）。</li><li>美元指数（DXY）可能因此见顶回落，资金或将流向黄金、大宗商品、非美资产等。</li></ul><p><strong>2. 若进入衰退，2Y–10Y利差将再度陡峭</strong></p><ul><li>经济衰退一旦确认，央行会快速降息，导致短端利率（2Y）下降速度远快于长端（10Y）。</li><li>这将导致 <code>2Y-10Y</code> 收益率曲线进一步 <strong>陡峭化</strong>。</li></ul><p><strong>3. 短期交易策略：逢“弱”做多</strong></p><ul><li>在9月/10月的重要时间窗口前，市场情绪敏感。</li><li>每当有疲软的经济数据公布，市场对降息的预期就会加强，从而推动资金流入 <strong>长期国债（如 TLT）和黄金（如 GLD）</strong>。</li></ul><hr/><h3 id="-">三、 核心逻辑</h3><ul><li><strong>收益率曲线形态</strong> → 暗示经济周期处于 <strong>顶点向下</strong> 的阶段。
  <ul><li><strong>短端倒挂 + 2Y远低于FFR</strong> → 确认 <strong>高确定性的降息预期</strong>。</li><li><strong>长端熊陡</strong> → 确认 <strong>长期通胀预期未死</strong>。</li></ul></li><li><strong>交易含义</strong>：
  <ul><li><strong>中短期</strong>：做多长债 + 黄金，捕捉降息预期。</li><li><strong>中长期</strong>：采取防御姿态，防范衰退风险，减配股票等高风险资产。</li></ul></li></ul></div><p style="text-align:right"><a href="https://www.coder-nova.com/posts/invest/daily_treasury_par_yield_curve_rates#comments">Finished reading? Say something</a></p></div>]]></description><link>https://www.coder-nova.com/posts/invest/daily_treasury_par_yield_curve_rates</link><guid isPermaLink="true">https://www.coder-nova.com/posts/invest/daily_treasury_par_yield_curve_rates</guid><dc:creator><![CDATA[yang]]></dc:creator><pubDate>Fri, 08 Aug 2025 03:47:07 GMT</pubDate></item><item><title><![CDATA[[投资学习笔记] 推文分析学习：从银行消费数据看美国经济]]></title><description><![CDATA[<div><blockquote>This rendering is generated by Shiro API, and there may be formatting issues. For the best experience, please visit:<a href="https://www.coder-nova.com/posts/invest/bank-consumer-spending-us-economy">https://www.coder-nova.com/posts/invest/bank-consumer-spending-us-economy</a></blockquote><div><blockquote><p><strong>核心观点</strong>：美国银行CEO通过观察其内部的消费增速数据，可以对美国整体的通胀和经济增长趋势做出判断。由于消费占美国GDP的大头，银行的消费数据可以视为经济的“温度计”。</p></blockquote>
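这条“名义消费增速 − 通胀 ≈ 实际GDP增速”的逻辑链可以先写成一小段 Python（示意代码，两个情景中的数字只是演示用的假设）：

```python
def real_gdp_growth(nominal_consumption_growth, inflation):
    """实际GDP增速 ≈ 名义消费增速 - 通胀率（单位：百分点）"""
    return nominal_consumption_growth - inflation

# 银行观察到的名义消费增速区间：4.5% ~ 5%
low, high = 4.5, 5.0

# 情景A：通胀 2.5% -> 实际增长约 2% ~ 2.5%，温和增长
a = (real_gdp_growth(low, 2.5), real_gdp_growth(high, 2.5))

# 情景B：通胀 5.5% -> 实际增长约 -1% ~ -0.5%，技术性衰退
b = (real_gdp_growth(low, 5.5), real_gdp_growth(high, 5.5))

print(a)  # (2.0, 2.5)
print(b)  # (-1.0, -0.5)
```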
<h3 id="-">一、 核心逻辑链条</h3><p>美国银行CEO曾表示：“我们看到的消费增速已经回到疫情前，所以通胀增速会放缓。”</p><p>这个判断基于以下推理：</p><ol start="1"><li><strong>消费是经济引擎</strong>：美国约60%的GDP来自于个人消费支出（PCE）。</li><li><strong>银行掌握一手数据</strong>：银行通过信用卡、储蓄账户等业务，能实时掌握最真实的消费数据。因此，银行观察到的消费增速，在很大程度上约等于名义GDP中消费部分的增速。</li><li><strong>增速的构成</strong>：消费增速（名义） = 真实消费增长（实际GDP增长的核心） + 价格上涨（通胀）。</li><li><strong>推导经济状态</strong>：如果整体消费增速固定在某一水平（例如4.5%~5%），那么：
<ul><li>当通胀率超过这个水平时（例如 <code>CPI &gt; 5%</code>），就意味着真实的消费增长必须是负数。</li><li>真实的消费负增长，则预示着 <strong>实际GDP负增长</strong>，即经济衰退。</li></ul></li></ol><h3 id="-">二、 关键变量</h3><table><thead><tr><th style="text-align:left">项目</th><th style="text-align:left">含义</th><th style="text-align:left">举例</th></tr></thead><tbody><tr><td style="text-align:left"><strong>名义GDP增长</strong></td><td style="text-align:left">未剔除通胀影响的经济增长率。</td><td style="text-align:left">GDP增长了5%</td></tr><tr><td style="text-align:left"><strong>实际GDP增长</strong></td><td style="text-align:left">剔除通胀影响后的、真实的经济增长率。</td><td style="text-align:left">真实经济增长了2%</td></tr><tr><td style="text-align:left"><strong>通胀</strong></td><td style="text-align:left">此处可理解为 <strong>名义增长</strong> 与 <strong>实际增长</strong> 之间的差值。</td><td style="text-align:left">通胀率为3%</td></tr><tr><td style="text-align:left"><strong>银行观察的消费增速</strong></td><td style="text-align:left">包含了 <strong>真实消费增速</strong> 和 <strong>通胀</strong> 的名义数据。</td><td style="text-align:left">银行看到消费额增长5%<br/>（可能=2%真实增长+3%通胀）</td></tr></tbody></table><h3 id="-">三、 逻辑的公式化表示</h3><p>我们可以将这个逻辑简化为一个近似的公式：</p><p>$$\text{实际GDP增速} \approx \text{消费增速（名义）} - \text{通胀率}$$</p><p>基于美国银行CEO观察到的 <strong>4.5% ~ 5%</strong> 的消费增速，我们可以推演出不同的经济情景：</p><ul><li><p><strong>情景A：温和增长</strong></p><ul><li>如果通胀率为 <strong>2.5%</strong></li><li>实际GDP增速 ≈ (4.5% ~ 5%) - 2.5% = <strong>2% ~ 2.5%</strong></li><li>结论：经济处于健康、温和的增长区间。</li></ul></li><li><p><strong>情景B：经济衰退</strong></p><ul><li>如果通胀率为 <strong>5.5%</strong></li><li>实际GDP增速 ≈ (4.5% ~ 5%) - 5.5% = <strong>-1% ~ -0.5%</strong></li><li>结论：经济已进入负增长，即技术性衰退。</li></ul></li></ul><h3 id="-">四、 为什么说银行财报是“经济温度计”？</h3><p>银行的业务使其能敏锐地捕捉到经济活动的细微变化。这些变化最终会反映在财报上。</p><p><strong>银行能看到的领先/同步指标包括：</strong></p><ul><li><strong>信用卡消费</strong>：消费额度是上升还是下降？</li><li><strong>储蓄账户</strong>：居民储蓄是在增加还是在被消耗？</li><li><strong>贷款业务</strong>：消费贷款、企业贷款的申请是放缓还是加速？</li><li><strong>资产质量</strong>：贷款违约率、坏账率是否正在增加？</li></ul><p><strong>因此，我们可以得出结论：</strong></p><ul><li><strong>预警信号</strong> 
🚨：当银行财报显示 <strong>消费放缓、储蓄下降、坏账率上升</strong> 时，通常是经济活动已经或即将步入衰退的信号。</li><li><strong>健康信号</strong> ✅：如果银行财报稳健，说明 <strong>居民消费和企业投资依然活跃</strong>，经济离衰退还有距离。</li></ul></div><p style="text-align:right"><a href="https://www.coder-nova.com/posts/invest/bank-consumer-spending-us-economy#comments">Finished reading? Say something</a></p></div>]]></description><link>https://www.coder-nova.com/posts/invest/bank-consumer-spending-us-economy</link><guid isPermaLink="true">https://www.coder-nova.com/posts/invest/bank-consumer-spending-us-economy</guid><dc:creator><![CDATA[yang]]></dc:creator><pubDate>Fri, 08 Aug 2025 02:54:00 GMT</pubDate></item><item><title><![CDATA[中文笔记：Software Is Changing (Again) by Andrej Karpathy]]></title><description><![CDATA[<div><blockquote>This rendering is generated by Shiro API, and there may be formatting issues. For the best experience, please visit:<a href="https://www.coder-nova.com/posts/ai/software_3.0">https://www.coder-nova.com/posts/ai/software_3.0</a></blockquote><div><blockquote><p><strong>序言</strong>：午后听了Andrej Karpathy最新的演讲，大为震撼，于是有了此篇笔记。这里对每个小节做一些整理和思考。这里先说下结论：Markdown才是这时代最好的编程语言！（笑）</p></blockquote>
<p>视频地址：
<div><iframe width="600" height="400" src="https://www.youtube.com/embed/LCEmiRjPEtQ"></iframe></div></p><hr/><h2 id="llms">LLMs是新型计算机</h2><p>在Andrej Karpathy的定义中，软件的变迁分为了三个时代。</p><h3 id="1-software-10-">1. Software 1.0: 传统代码</h3><p>在1.0时代中，软件指的是你为计算机编写的代码。程序由程序员编写，并在计算机上执行。</p><h3 id="2-software-10-">2. Software 2.0: 神经网络</h3><p>由于神经网络的崛起，神经网络中的参数被视为2.0时代中的编程代码。程序员不再需要直接编写这些代码，而是通过调整数据集并运行优化器来创建神经网络的参数。
软件开发从显式编码转向数据驱动和模型训练，并且Hugging Face等平台成为2.0时代的GitHub，用于共享和可视化模型参数。</p><h3 id="3-software-30-llms">3. Software 3.0: LLMs的时代</h3><p>3.0时代指的便是LLM，因为它们可以直接生成程序，而英语等自然语言就变为了3.0时代的编程代码。比如，可以用少量提示来编程 LLM 进行情感分类，而不是编写Python代码或训练神经网络模型。</p><p>在这一节中Andrej Karpathy以自动驾驶的发展为例：在神经网络崛起（2.0时代）之后，工程师们用神经网络模型替换了部分传统算法代码（如图像识别任务）。而现在才是3.0时代的开始，今后也将会有大量的软件被重写。</p><p><strong>思考：</strong>
其实这里笔者认为3.0时代发展到最后是需要跳出传统的编程范式的（如使用C++、Python等代码来执行任务），LLM完全有能力直接操作逻辑电路中的0、1来生成程序。大规模数据集以及需要处理的信息量巨大（每个时钟周期数百万到数十亿个位的变化）都将是挑战。</p><hr/><h2 id="llmsutilitiesoperating-systems">LLMs是公用事业（Utilities），操作系统（Operating Systems）？</h2><p>不得不说天才们的脑洞是真的大。这一节中LLM先被比作公用事业（如电力），这主要基于其提供服务的方式和用户需求。LLM的开发过程类似于建设电网的基础设施，都是由大公司投入大量资本来得到的。然后它们通过<strong>API以按量付费（按token计费）</strong>的方式向用户提供智能服务，这相当于提供电力服务的运营支出。与公用事业服务的典型需求相似，用户对LLM API有很高的要求，包括低延迟、正常运行时间和一致的质量。</p><p>此外，像OpenRouter这样的工具允许用户在不同的LLM之间轻松切换；又因为LLM不争夺物理空间，所以可以同时存在多个“电力供应商”。LLM的构建由于需要巨大的资本支出，并且技术栈发展迅速，研发秘密高度集中在少数LLM实验室内部，这使得它们也具备了晶圆厂的一些特征。</p><p>此外，LLM不仅仅是像电力或水那样的简单商品，它们正在成为日益复杂的软件生态系统。这里介绍了一种更深层的类比：将其视为新型的操作系统。与操作系统类似，LLM也存在少数闭源提供商（如Windows或Mac OS），以及开源替代方案（如Linux）。LLM本身可以被视为CPU的等价物，其上下文窗口（context window）则相当于工作记忆（working memory）或内存。又如VS Code应用程序可以在Windows、Linux或Mac上运行，LLM应用（例如Cursor）也可以在不同的LLM上运行。</p><p>LLM当前相当于60年代的计算机时代（云端，数据通过流式传输和批处理进行）：首先，LLM计算目前仍然非常昂贵，这迫使LLM集中部署在云端；其次，用户通过<strong>分时共享（time sharing）</strong>的方式进行访问，而非完全独占其计算资源。个人计算革命尚未在LLM领域发生，因为目前经济上不划算。</p><hr/><h2 id="llm-psychology">LLM psychology</h2><p>LLM被认为是“人类精神”或“人类的随机模拟”。这是因为它们通过在互联网上所有可用的文本进行训练，这些文本源自人类，因此它们展现出一种类似人类的涌现心理 (emergent psychology that is humanlike)。LLM拥有百科全书般的知识和记忆，能够记住比任何个体人类都多得多的信息，可以非常容易地记住SHA哈希等各种信息。</p><p><strong>认知缺陷/局限性:</strong></p><ul><li><strong>幻觉 (Hallucinations)</strong>: LLM会频繁地“产生幻觉”并“编造内容”。</li><li><strong>自我认知不足 (Insufficient self-knowledge)</strong>: 它们没有一个非常好的或足够强的“内部自我认知模型”。</li><li><strong>锯齿状智能 (Jagged intelligence)</strong>: 它们的智能是“锯齿状的”，这意味着它们在某些解决问题的领域可能表现出超人水平，但随后又会犯下“基本上没有人类会犯的错误”，例如坚持“9.11大于9.9”或者“strawberry”中有两个“r”。这些是“你可能会绊倒的粗糙边缘” (rough edges that you can trip on)。</li><li><strong>顺行性遗忘症 (Anterograde amnesia)</strong>: LLM似乎也遭受某种类似于顺行性遗忘症的困扰。与人类同事不同，人类会随着时间学习和巩固知识，而LLM“本身不会这样做”。它们的上下文窗口 (context windows) 就像是工作记忆 (working memory)，必须直接对其进行编程，因为它们不会默认变得更聪明。这被比作电影《记忆碎片》(Memento) 和《初恋50次》(50 First Dates) 中的主角，他们的“权重是固定的，上下文窗口每天早上都会被清空”。</li><li><strong>易受骗性与安全风险 (Gullibility and security risks)</strong>: LLM非常容易上当，并且容易受到提示注入 (prompt injection) 风险的影响。它们还可能泄露你的数据。</li></ul><h2 
id="llm">通过LLM来构建部分自动化的产品</h2><hr/><p>这部分的内容可能略显繁杂，但是中心思想是找到当前LLM缺陷/局限性的解决方法。这里最后的比喻是钢铁侠战衣：在面对有缺陷的LLM时，我们应该构建更多的是增强人类能力的工具（augmentation），而不是完全自主的机器人（autonomous robots）。</p><h3 id="1-gui">1. 需要GUI而不是直接通过文本交流</h3><p>直接通过文本与LLM交互，感觉就像通过终端与操作系统对话一样，LLM生成的大量的文本很难阅读、解释和理解，需要通过视觉表示来加速人类验证和审计系统的工作。目前，尚未以通用方式真正发明出适用于LLM的通用GUI，尽管某些LLM应用已经拥有特定的GUI。例如，ChatGPT目前的界面类似于文本气泡，但仍缺乏一个跨所有任务的通用GUI。这里还展示了一波cursor和Perplexity的GUI。例如，cursor中查看代码差异时，红绿色的视觉变化比纯文本更容易理解和操作（接受或拒绝更改）。此外，Tesla的Autopilot仪表盘上的GUI会显示神经网络所看到的内容，这也是GUI的应用实例。</p><p><strong>思考：</strong> 设计师们和可视化的研究者们的新舞台</p><h3 id="2-">2. 生成与验证的循环</h3><p>当前LLM应用中的工作流通常是AI负责生成内容，而人类负责验证。为了提高工作效率，加快AI生成与人类验证之间的循环（generation verification loop）至关重要。有两种主要方式可以做到这一点：1. 如前所述，GUI通过视觉表示大大提高了人类审计工作的速度和效率；2. 保持AI受控（“驯服AI”）。</p><h3 id="3-ai-keep-ai-on-the-leash">3. 驯服AI (Keep AI on the leash)</h3><p>因为如上面所述的LLM缺陷和局限性，以及AI agents可能会过度反应，生成过大或难以管理的内容（如一次性生成10,000行代码，这使人类成为瓶颈，难以验证和确保其无错误或安全问题）等问题，所以一个可控的AI是必要的。</p><p><strong>解决方案：</strong></p><ol start="1"><li><p>提供更具体、更精确的Prompt</p></li><li><p>在AI辅助的编程中，分小块、增量地进行工作</p></li><li><p>例如在教育领域，可以设计一个教师来创建课程，然后一个学生来使用这些课程。这样，“课程”就成为一个可审计的中间工件，确保AI在特定教学大纲和项目进度下保持受控和一致。</p></li><li><p><strong>自主滑块（Autonomy Slider）</strong>：主要目的是允许用户根据任务的复杂性来调整AI的自主程度。例如，在Cursor中，你可以选择轻量级补全（你主导），修改代码块（特定范围），修改整个文件，或者完全自主地处理整个代码库。</p></li></ol><hr/><h2 id="llm">LLM是新的数字信息消费者</h2><p>Andrej Karpathy发现完全使用LLM进行软件开发的一个问题是，目前的使用文档主要是为人类设计的，例如包含图形用户界面（GUI）的指示和视觉元素，这使得LLM难以直接理解和操作。要让AI Agent理解文档并正确调用API，必须调整现有的结构，使其变得对LLM友好。尽管未来LLM可能能够自行点击网页，但目前而言，主动迎合LLM并使其更容易访问信息仍然非常重要。这里介绍了一些方法和工具：</p><h3 id="1the-llmstxt-file">1. The /llms.txt file：</h3><p>类似于网站用于指导网络爬虫的robots.txt文件，可以创建一个简单的Markdown格式的llms.txt文件。这个文件可以直接向LLM说明域名信息，告诉它们这个网站是关于什么的。相比让LLM解析复杂的HTML页面，这种方式更具可读性，且不易出错。</p><h3 id="2-llm">2. 转换为LLM更易理解的格式：</h3><p>文档中的列表、加粗字体、图片等视觉元素，这些对LLM来说并不直接可访问。需要将这些文档转换为LLM更易理解的格式，例如Markdown。Vercel和Stripe等公司已经开始将他们的文档过渡到专门为LLM设计的格式。</p><h3 id="3-">3. 
修改文档中的指令：</h3><p>文档中常见的“点击”等动词对LLM来说是无效的，因为它们无法进行原生点击操作。因此，需要将这些动词替换为LLM代理可以直接执行的命令，例如使用curl命令。这使得文档内容可以直接转化为代理可执行的动作，如Vercel将“click”替换为等效的curl命令。</p><h3 id="4-">4. 模型上下文协议：</h3><p>除了上述通用调整外，还有像Anthropic的模型上下文协议（Model Context Protocol）这样的专用协议，它提供了一种直接与AI代理通信的方式。</p><h3 id="5-llm">5. 数据摄取工具以适配LLM：</h3><p>例如，像Gitingest这样的工具，可以将GitHub仓库的URL转换为一个包含所有文件内容和目录结构的单一巨大文本，方便直接复制粘贴到LLM中进行提问和处理。
更进一步的例子是Deep Wiki，它不仅摄取原始文件内容，还会让AI对GitHub仓库进行分析，并自动生成整个文档页面，使其对LLM更具帮助。</p><p><strong>思考：</strong> 
这一部分值得深思。虽然真正的智能体还未出现，但互联网产品已不只服务于人类，未来或许会出现同时面向 AI agent 和人类的新概念产品。</p></div><p style="text-align:right"><a href="https://www.coder-nova.com/posts/ai/software_3.0#comments">Finished reading? Say something</a></p></div>]]></description><link>https://www.coder-nova.com/posts/ai/software_3.0</link><guid isPermaLink="true">https://www.coder-nova.com/posts/ai/software_3.0</guid><dc:creator><![CDATA[yang]]></dc:creator><pubDate>Thu, 19 Jun 2025 08:16:08 GMT</pubDate></item><item><title><![CDATA[中文笔记：An Introduction to Discrete Variational Autoencoders]]></title><description><![CDATA[<div><blockquote>This rendering is generated by Shiro API, and there may be formatting issues. For the best experience, please visit:<a href="https://www.coder-nova.com/posts/ai/discreteVAE">https://www.coder-nova.com/posts/ai/discreteVAE</a></blockquote><div><h2 id="1-">1. 统计学基础：</h2><h3 id="">数学符号：</h3><p>标量（Scalar-valued variable）：$x$,</p><p>向量、矩阵或向量拼接（Vectors, matrices, concatenations of vectors）：$\boldsymbol{x}$,</p><p>概率分布（Probability Distributions）：$ \mathcal{D} $, 从分布中采样或随机变量的记法：$x \sim \mathcal{D}$，</p><p>可学习的参数：$\boldsymbol{\psi}$，</p><p>概率密度函数（或概率质量函数）： $p$, $q$，</p><p>分布在参数 $\boldsymbol{\psi}$ 下对 $x$ 的评估： $p_{\boldsymbol{\psi}}(x)$，</p><p>概率分布下的期望表示： $\mathbb{E}{p{\boldsymbol{\psi}}(x)}[f(x)]$， 积分形式展开：$
\mathbb{E}_{p_{\boldsymbol{\psi}}(x)}[f(x)] = \int_x p_{\boldsymbol{\psi}}(x) f(x) \, dx
$</p><hr/><h3 id="probabilities-and-information-measures">概率与信息度量（Probabilities and Information Measures）</h3><table><thead><tr><th>编号</th><th>概念</th><th>表达式</th></tr></thead><tbody><tr><td>(1)</td><td>联合概率与条件概率关系</td><td>$p(A, B) = p(A \mid B)p(B) = p(B \mid A)p(A)$</td></tr><tr><td>(2)</td><td>全概率公式</td><td>$p(A) = \sum_{i=1}^{k} p(A \mid B_i)p(B_i)$</td></tr><tr><td>(3)</td><td>KL 散度（KL Divergence）</td><td>$D_{\mathrm{KL}}(q(A) \parallel p(A)) \triangleq \int q(A) \log \left[\frac{q(A)}{p(A)}\right] dA$</td></tr><tr><td></td><td></td><td>$= -\int q(A) \log \left[\frac{p(A)}{q(A)}\right] dA$</td></tr><tr><td>(4)</td><td>熵（Entropy）</td><td>$H(p(A)) \triangleq -\int p(A) \log p(A) \, dA$</td></tr><tr><td>(5)</td><td>交叉熵（Cross-Entropy）</td><td>$H(p(A), q(A)) \triangleq -\int p(A) \log q(A) \, dA$</td></tr><tr><td>(6)</td><td>离散熵（Discrete Entropy）</td><td>$\mathrm{Entropy}(\mathbf{a}) \triangleq -\sum_{i=1}^{k} a_i \log a_i$</td></tr><tr><td>(7)</td><td>离散交叉熵（Discrete Cross-Entropy）</td><td>$\mathrm{CE}(\mathbf{a}, \mathbf{b}) \triangleq -\sum_{i=1}^{k} a_i \log b_i$</td></tr><tr><td>(8)</td><td>二元交叉熵（Binary Cross-Entropy）</td><td>$\mathrm{BCE}(\mathbf{a}, \mathbf{b}) \triangleq -a_1 \log b_1 - (1 - a_1) \log(1 - b_1)$</td></tr><tr><td>(9)</td><td>批量熵（Aggregate Entropy）</td><td>$\overline{\mathrm{Entropy}}(\boldsymbol{\alpha}) \triangleq \sum_{j=1}^{m} \mathrm{Entropy}(\mathbf{a}^{(j)})$</td></tr><tr><td>(10)</td><td>批量二元交叉熵（Aggregate Binary Cross-Entropy）</td><td>$\overline{\mathrm{BCE}}(\boldsymbol{\alpha}, \boldsymbol{\beta}) \triangleq \sum_{j=1}^{m} \mathrm{BCE}(\mathbf{a}^{(j)}, \mathbf{b}^{(j)})$</td></tr></tbody></table><p>复习：</p><ol start="1"><li><p>条件概率表示在事件 B 发生的前提下，事件 A 发生的概率是多少。</p></li><li><p>KL 散度衡量的是：如果你用分布 p 来近似真实分布 q，会造成多大的“信息损失”。如果 $p(A) = q(A)$，即两分布完全一致，KL 散度为 0。它是非对称的：$D_{\mathrm{KL}}(q(A) \parallel p(A)) \neq D_{\mathrm{KL}}(p(A) \parallel q(A))$</p></li><li><p>熵用来量化一个概率分布的不确定性。如果一个事件很确定 —— 熵为 0。如果事件非常不确定，比如硬币有一半的几率是正面，一半是反面，那么结果完全不可预测 —— 
熵最大。</p></li><li><p>交叉熵衡量的是：用预测分布 q 来描述真实分布 p 有多“糟糕”，即 交叉熵 = 真实分布的熵 + 预测带来的额外损失（KL 散度）。</p></li></ol><h3 id="discrete-probability-distributions">离散概率分布（Discrete Probability Distributions）</h3><ol start="1"><li>伯努利分布（Bernoulli Distribution）：适用于仅有两个结果（例如抛硬币）的事件。</li></ol><p>定义为：</p><p>$X \sim \mathrm{Bernoulli}(p)$，其中 $p$ 是其中一种结果发生的概率</p><p>伯努利分布概率质量函数为：</p><p>$f_X(x) = \begin{cases} p &amp; \text{if } x = 1 \\ 1 - p &amp; \text{if } x = 0 \end{cases}$</p><p>亦可写作：$f_X(x) = p^x (1 - p)^{1 - x}$</p><ol start="2"><li>分类分布（Categorical Distribution）：是伯努利分布在 $k &gt; 2$ 个结果上的扩展。</li></ol><p>定义为：</p><p>$X \sim \mathrm{Cat}(p)$</p><p>概率质量函数为：</p><p>$f_X(x) = \begin{cases} p_1 &amp; \text{if } x = 1 \\ \vdots \\ p_k &amp; \text{if } x = k \end{cases}$</p><p>可以使用Iverson Bracket简明表示为：</p><p>$[x = i] = \begin{cases} 1 &amp; \text{if statement is true}\\ 0 &amp; \text{otherwise} \end{cases}$</p><p>同样可以简写为：$f_X(x) = \sum_{i=1}^{k} [x = i] \cdot p_i$</p><h3 id="maximum-likelihood-estimation">最大似然估计（Maximum Likelihood Estimation）</h3><p>当我们建立一个模型来估计分布 $p_{\boldsymbol{\psi}}(X)$ 时，常见的方法是选择一组<strong>最优参数</strong> $\hat{\boldsymbol{\psi}}$，使得在该模型下观察到的数据的联合概率(joint probability)最大。这种方法称为<strong>最大似然估计（MLE）</strong>。</p><p>给定一组观测数据 $\{\boldsymbol{x}_i\}_{i=1}^n$，我们希望找到能最大化似然函数的参数：</p><p>$\hat{\boldsymbol{\psi}} = \arg\max_{\boldsymbol{\psi}} \mathcal{L}(\boldsymbol{\psi}) = \arg\max_{\boldsymbol{\psi}} \prod_{i=1}^{n} p_{\boldsymbol{\psi}}(x_i)$</p><p>由于直接对乘积求最大值在数值计算中容易出现问题，我们通常对似然函数取对数，转为对数似然函数:</p><p>$\hat{\boldsymbol{\psi}} = \arg\max_{\boldsymbol{\psi}} \, \ell(\boldsymbol{\psi}) = \arg\max_{\boldsymbol{\psi}} \sum_{i=1}^{n} \log p_{\boldsymbol{\psi}}(x_i)$</p><p><strong>对数不会改变最大值的位置，所以最终解是相同的。</strong></p><h3 id="monte-carlo-sampling">蒙特卡洛采样（Monte Carlo Sampling）</h3><p>在优化模型时，我们经常需要计算如下形式的目标函数的梯度：</p><p>$$
\nabla_\phi \, \mathbb{E}_{\mathbf{x} \sim \mathcal{D}}[f_\phi(\mathbf{x})]
$$</p><p>其中：</p><ul><li>$\phi$ 是模型的可学习参数；</li><li>$\mathcal{D}$ 是某个固定的分布（通常是训练数据分布）；</li><li>$f_\phi(\mathbf{x})$ 是与输入 $\mathbf{x}$ 和参数 $\phi$ 有关的函数，例如损失函数。</li></ul><p>上述期望一般是一个积分（或和），它包含了所有可能样本 $\mathbf{x}$ 的信息。但是，在实际中：</p><ul><li><strong>数据集有限</strong>，我们无法枚举所有 $\mathbf{x} \sim \mathcal{D}$；</li><li><strong>积分难以解析求解</strong>，尤其当 $\mathcal{D}$ 是复杂或隐式分布。</li></ul><p>因此，通常我们只能对一个（或几个）样本 $\mathbf{x}$ 估计这个梯度：</p><p>$$
\nabla_\phi f_\phi(\mathbf{x})
$$</p><p>这种做法称为 <strong>蒙特卡洛估计（Monte Carlo estimate）</strong>。</p><p>它的核心思想是：
<strong>用少量样本的梯度来近似整个分布下期望的梯度。</strong></p><p>当我们采样 $n$ 个样本 $\{\mathbf{x}_i\}_{i=1}^n \sim \mathcal{D}$，对每个样本计算梯度并取平均，有：</p><p>$$
\frac{1}{n} \sum_{i=1}^{n} \nabla_\phi f_\phi(\mathbf{x}_i) \to \nabla_\phi \, \mathbb{E}_{\mathbf{x} \sim \mathcal{D}}[f_\phi(\mathbf{x})] \quad \text{as } n \to \infty
$$</p><blockquote><p>这意味着，当样本数量足够大时，我们就可以<strong>逼近真实的期望梯度</strong>。这个结论依赖于 <strong>大数定律（Law of Large Numbers）</strong>。</p></blockquote>
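这里补充一段纯 Python 的示意代码（笔记之外的演示，非正式实现；假设目标函数为 f(x)=x²、采样分布为标准正态分布）：用样本均值近似期望 E[x²]=1，可以直观看到样本越多，蒙特卡洛估计越接近真值。

```python
import random

# 示意：用蒙特卡洛采样近似期望 E_{x~N(0,1)}[x^2]（解析真值为 1）
random.seed(0)

def mc_estimate(f, n):
    """从标准正态分布采样 n 次，返回 f 的样本均值（蒙特卡洛估计）"""
    return sum(f(random.gauss(0.0, 1.0)) for _ in range(n)) / n

rough = mc_estimate(lambda x: x * x, 10)        # 少量样本：估计方差大
precise = mc_estimate(lambda x: x * x, 200000)  # 依大数定律收敛到 1 附近
print(rough, precise)
```

同样的道理适用于梯度：对每个采样点计算梯度再取平均，就是期望梯度的蒙特卡洛估计。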
<p>使用蒙特卡洛采样得到的估计梯度可以简写为：</p><p>$$
\nabla_\phi \, \mathbb{E}_{\mathbf{x} \sim \mathcal{D}}[f_\phi(\mathbf{x})] \approx_{mc} \nabla_\phi f_\phi(\mathbf{x}) \tag{21}
$$</p><p>即，用单个样本的梯度来近似整体期望的梯度。</p><hr/><h2 id="2-vae">2. VAE的机制</h2><ol start="1"><li>自编码器（Autoencoder）:</li></ol><ul><li><strong>结构</strong>：编码器 $f_{\theta}$ → 潜在空间特征 $\mathbf{z}$ → 解码器 $g_{\phi}$。</li><li><strong>目标</strong>：将 $\mathbf{x}\in\mathbb{R}^p$ 压缩到低维 $\mathbf{z}$，再重建 $\hat{\mathbf{x}}\approx\mathbf{x}$。</li></ul><ol start="2"><li><p>变分自编码器（Variational Autoencoder）的机制</p><ul><li>将确定性潜在空间特征替换为 <strong>指定的潜在空间中的分布</strong>（通常采用独立高斯）：</li></ul><p>$$
 (\boldsymbol{\mu},\boldsymbol{\sigma}) = f_{\theta}(\mathbf{x}),\quad
 z_i \sim \mathcal{N}\bigl(\mu_i,\sigma_i^2\bigr)
$$</p><ul><li>解码器对样本进行重建：</li></ul><p>$$
 \hat{\mathbf{x}} = g_{\phi}(\mathbf{z})
$$</p><ul><li>训练损失（ELBO 形式）：</li></ul><p>$$
 \mathcal{L}
   = \mathbb{E}_{q_{\phi}(\mathbf{z}\mid\mathbf{x})}
     \bigl[-\log p_{\theta}(\mathbf{x}\mid\mathbf{z})\bigr]
   + D_{\mathrm{KL}}\!\bigl(q_{\phi}(\mathbf{z}\mid\mathbf{x})\;\|\;p(\mathbf{z})\bigr)
$$</p><p>第一项 = 重建误差；第二项 = 先验正则化。</p></li></ol><hr/><h2 id="3-">3. 目标函数推导</h2><h3 id="1-">1. 无法直接优化的目标</h3><ul><li><p><strong>最终目标：</strong> 和许多机器学习模型一样，我们的根本目标是<strong>最大化数据的对数似然 (log-likelihood)</strong>，即 $\log p_{\theta}(x)$。这个值代表了在给定模型参数 $\theta$ 的情况下，观测到真实数据 $x$ 的概率。概率越大，说明我们的模型越能“解释”或“生成”真实数据。</p></li><li><p><strong>VAE 的设定：</strong> VAE 是一个<strong>隐变量模型 (Latent Variable Model)</strong>。它假设我们观测到的数据 $x$ 是由一些我们无法直接观测到的<strong>隐变量 <code>z</code></strong> (latent variables) 所生成的。</p></li><li><p><strong>数学表达：</strong> 我们可以通过对所有可能的隐变量 <code>z</code> 进行积分（边缘化），来表达 $p_{\theta}(x)$：</p><p>  $$
  p_{\theta}(x) = \int p_{\theta}(x, z) dz
  $$</p>
<p>  使用条件概率法则，上式可以写为：</p>
<p>  $$
  p_{\theta}(x) = \int p_{\theta}(x|z)p_{\theta}(z)dz
  $$</p>
<p>  这里：</p><ul><li>$p(z)$ 是隐变量的<strong>先验分布</strong>（我们对 <code>z</code> 的预先假设，通常设为简单的标准正态分布）。</li><li>$p(x|z)$ 是给定一个隐变量 <code>z</code>，生成数据 <code>x</code> 的概率。</li></ul></li><li><p><strong>棘手之处：</strong> 公式中的<strong>积分是难以计算的 (intractable)</strong>。因为隐空间 <code>z</code> 通常是高维连续的，我们无法穷举所有 <code>z</code> 来计算这个积分。即使 <code>z</code>是离散的，其计算量也会随着维度呈指数级增长 (<code>m^n</code>)。</p></li></ul><h3 id="2-">2. 推导可行的目标函数</h3><p>既然直接优化行不通，我们需要找到一个替代方案。</p><ul><li><p><strong>第一步：引入后验概率</strong>
  我们利用贝叶斯公式 $p(z|x) = p(x,z) / p(x)$，反向得到 $p(x) = p(x,z) / p(z|x)$。取对数后：</p><p>  $$
  \log p_{\theta}(x) = \log\left[\frac{p_{\theta}(x, z)}{p_{\theta}(z|x)}\right]
  $$</p><p>  这个式子虽然没有了积分，但引入了一个新的难题：后验概率 $p_{\theta}(z|x)$。计算它需要知道 $p_{\theta}(x)$，这又回到了原点，陷入了循环论证。</p></li><li><p><strong>第二步：关键技巧——引入一个辅助分布 <code>q(z)</code></strong>
  为了打破僵局，我们引入一个由参数 $ϕ$ 控制的、关于 $z$ 的任意概率分布 $q(z)$。然后我们进行一系列数学变换：</p>
<p>  $$
   \begin{align}
   \log p_{\theta}(x) &amp;= \mathbb{E}_{q_{\phi}(z)}[\log p_{\theta}(x)]  \\
   &amp;= \mathbb{E}_{q_{\phi}(z)}\left[\log\frac{p_{\theta}(x, z)}{p_{\theta}(z|x)}\right]  \\
   &amp;= \mathbb{E}_{q_{\phi}(z)}\left[\log\left(\frac{p_{\theta}(x, z)}{q_{\phi}(z)} \cdot \frac{q_{\phi}(z)}{p_{\theta}(z|x)}\right)\right]  \\
   &amp;= \mathbb{E}_{q_{\phi}(z)}\left[\log\frac{p_{\theta}(x, z)}{q_{\phi}(z)}\right] + \mathbb{E}_{q_{\phi}(z)}\left[\log\frac{q_{\phi}(z)}{p_{\theta}(z|x)}\right] 
   \end{align}
   $$</p><ul><li><strong>变换解释</strong>：
  <ul><li>行(1)：因为 $\log p(x)$ 相对于 $z$ 是个常数，所以它在任何关于 $z$ 的分布 $q$ 下的期望都是它本身。</li><li>行(3)：在对数内部，同时乘以和除以 $q(z)$，这是一个值为 1 的恒等变换。</li><li>行(4)：利用对数的性质 $\log(ab) = \log(a) + \log(b)$，将式子拆成两项。</li></ul></li></ul></li></ul><h3 id="3-elbo-">3. ELBO 的诞生及其重要意义</h3><p><code>公式(4)</code> 的拆分是整个推导的核心。让我们来分析这两项：</p><ul><li><p><strong>第二项：KL 散度</strong></p><p>  $$
  \mathbb{E}_{q_{\phi}(z)}\left[\log\frac{q_{\phi}(z)}{p_{\theta}(z|x)}\right] = D_{KL}(q_{\phi}(z) \ || \ p_{\theta}(z|x))
  $$</p><p>  这一项正是 <code>q</code> 分布与真实后验分布 <code>p(z|x)</code> 之间的KL 散度 (KL-divergence)，永远非负。</p></li><li><p><strong>第一项：证据下界 (ELBO)</strong>
  既然 $\log p(x)$ 等于第一项加上一个非负的 KL 散度，那么第一项必然是 $\log p(x)$ 的一个<strong>下界</strong>。
  这个下界就是我们梦寐以求的可优化的目标，它被称为<strong>证据下界 (Evidence Lower Bound, ELBO)</strong>。
  如果我们能找到参数 $\theta$ 和 $\phi$ 使该下界最大化（同时不导致其他项发散），那么这也将是一个接近于最大化对数似然本身的解。
  这启发我们，<code>q</code> 分布应该用来<strong>近似</strong> 那个我们算不出来的真实后验 <code>p(z|x)</code>。因此，我们让 <code>q</code> 也依赖于 <code>x</code>，将其写为 <code>q(z|x)</code>。</p></li><li><p><strong>最终的目标函数：</strong>
  将 <code>q(z)</code> 替换为 <code>q(z|x)</code> 后，我们得到了最终的 ELBO 表达式：</p><p>  $$
  \mathcal{L}_{\text{ELBO}_{\theta, \phi}}(x) \triangleq \mathbb{E}_{q_{\phi}(z|x)}\left[\log\frac{p_{\theta}(x, z)}{q_{\phi}(z|x)}\right]
  $$</p><p>  我们的优化问题就从“最大化 <code>log p(x)</code>” 变成了 “<strong>最大化 ELBO</strong>”。</p></li></ul><h3 id="4-elbo-">4. ELBO 的精妙之处：双重优化</h3><p>为了更好地理解我们到底在优化什么，我们将公式重新整理一下：</p><p>$$
\mathcal{L}_{\text{ELBO}_{\theta, \phi}}(x) = \log p_{\theta}(x) - D_{KL}(q_{\phi}(z|x) \ || \ p_{\theta}(z|x))
$$</p><p>从这个式子可以看出，<strong>最大化 ELBO</strong> 同时在做两件事情：</p><ol start="1"><li><p><strong>最大化数据似然 <code>log p(x)</code></strong>：
ELBO 的提升会直接推高 $\log p_{\theta}(x)$。这意味着我们的模型更好地捕捉到观测数据的分布。</p></li><li><p><strong>最小化 KL 散度 <code>DKL(...)</code></strong>：
对于一个给定的数据 $x$，$\log p(x)$ 是一个定值。此时，要让 ELBO 变大，就必须让 KL 散度变小。这会使我们的近似后验分布 $q(z|x)$ 不断逼近真实的后验分布 $p(z|x)$。</p></li></ol><hr/><h2 id="3--vae-discrete-vae">3. 离散 VAE (Discrete VAE)</h2><p>离散 VAE 的隐空间 <code>z</code> 不再是连续的，而是由一系列离散的类别变量构成。</p><h3 id="1--z-">1. 隐空间 <code>z</code> 的设计</h3><ul><li><strong>结构</strong>：隐空间由 <code>D</code> 个独立的隐变量组成，每个隐变量可以从 <code>K</code> 个离散的类别中取值。</li><li><strong>表示方法</strong>：为了在实践中实现这一点，每个隐变量都用一个 <strong>one-hot 编码</strong>的向量来表示。向量的长度为 <code>K</code>，在被选中的类别索引处为“1”，其余位置为“0”。</li><li><p><strong>数学表达</strong>：因此，一个完整的隐样本 <code>z</code> 是一个 <code>D x K</code> 的矩阵，即 <code>z ∈ {0,1}^{D×K}</code>，并且对于 <code>D</code> 个变量中的任何一个 <code>d</code>，其 <code>K</code> 个类别的总和都为1:</p><p>  $$
  \sum_{k=1}^{K} z^{(d)}_k = 1 \quad \forall \ d \in [1,D]
  $$</p></li></ul><hr/><h3 id="2-">2. 模型组件的具体选择</h3><ol start="1"><li><p><strong>先验分布 <code>p(z)</code></strong></p><ul><li>对于离散的隐变量，一个自然的选择是<strong>均匀范畴分布 (Uniform Categorical Distribution)</strong>。</li><li>这意味着，在没有任何信息的情况下，我们假设每个隐变量 <code>z^(d)</code> 从 <code>K</code> 个类别中选择任何一个的概率都是相等的 (<code>1/K</code>)。</li><li><p><strong>公式</strong>：</p><p>  $$
  z^{(d)} \sim_{\text{iid}} \text{Cat}(\mathbf{K}^{-1})
  $$</p><p>  这里的 <code>K^-1</code> 表示一个所有元素都为 <code>1/K</code> 的概率向量。</p></li></ul></li><li><p><strong>编码器 Encoder (近似后验 <code>q(z|x)</code>)</strong></p><ul><li>编码器的作用是根据输入数据 <code>x</code> 来推断隐变量 <code>z</code> 的分布。</li><li>它是一个神经网络 <code>f_ϕ(x)</code>，其输出是 <code>D x K</code> 个概率值。这些概率值定义了 <code>D</code> 个范畴分布的参数。</li><li><strong>前向传播过程</strong>：我们将输入 <code>x</code> 送入编码器网络 <code>f_ϕ(x)</code>，得到每个隐变量 <code>z^(d)</code> 的类别概率，然后从这个分布中进行采样，得到 one-hot 形式的 <code>z</code>。</li><li><p><strong>公式</strong>：</p><p>  $$
  z^{(d)} \sim_{\text{iid}} \text{Cat}(f_{\phi}(x)^{(d)})
  $$</p></li></ul></li><li><p><strong>解码器 Decoder (数据似然 <code>p(x|z)</code>)</strong></p><ul><li>解码器的作用是根据隐变量 <code>z</code> 重建原始输入概率 <code>x</code>。</li><li><p><strong>以二值化 MNIST 数据为例</strong>：</p><ul><li><strong>数据预处理</strong>：我们将 MNIST 的每个像素二值化为黑色（0）或白色（1），即 <code>x ∈ {0,1}^P</code>，其中 <code>P</code> 是像素总数。</li><li><strong>分布选择</strong>：对于这种二元数据，最自然的概率分布是<strong>伯努利分布 (Bernoulli Distribution)</strong>。</li><li><strong>解码器网络 <code>g_θ(z)</code></strong>：它是一个神经网络，接收隐变量 <code>z</code> 作为输入，输出 <code>P</code> 个概率值。每个概率值对应一个像素，代表该像素为白色（1）的概率。</li><li><p><strong>公式</strong>：</p><p>  $$
  x^{(p)} \sim_{\text{iid}} \text{Bernoulli}(g_{\theta}(z)^{(p)})
  $$</p></li></ul></li></ul></li></ol><hr/><h2 id="4--vae-">4. 离散 VAE 的梯度推导与损失函数</h2><p>理解了离散 VAE 的模型结构后，最后一步是推导其梯度，以便我们进行优化，并明确最终的损失函数。ELBO 的目标是最大化，但在实践中我们通常最小化其负数，即-ELBO。</p><h3 id="1-decoder---">1. 解码器（Decoder）参数 <code>θ</code> 的梯度 <code>∇θ</code></h3><p>这部分的推导相对直接。</p><ul><li><p><strong>核心思路</strong>：ELBO 中关于 <code>θ</code> 的项是 <code>E[log pθ(x|z)]</code>。因为期望是针对 <code>qϕ(z|x)</code> 计算的，它不依赖于 <code>θ</code>，所以我们可以将梯度算子 <code>∇θ</code> 直接移入期望内部。</p><p>  $$
  \nabla_{\theta} \mathbb{E}_{q_{\phi}(z|x)} [\log p_{\theta} (x|z)] = \mathbb{E}_{q_{\phi}(z|x)} [\nabla_{\theta} \log p_{\theta} (x|z)]
  $$</p></li><li><p><strong>推导过程</strong>：</p><ol start="1"><li><p>这一交换使得我们可以使用<strong>蒙特卡洛采样</strong>来近似梯度：从 <code>qϕ(z|x)</code> 中采样一个 <code>z</code>，然后计算 <code>∇θ log pθ(x|z)</code>。</p></li><li><p>对于二值化的独立像素，<code>log pθ(x|z)</code> 可以分解为每个像素的对数概率之和。</p></li><li><p>每个像素服从伯努利分布 <code>p(x) = p^x(1-p)^(1-x)</code>。其对数概率为 <code>x log(p) + (1-x) log(1-p)</code>。</p></li><li><p>将解码器网络 <code>gθ(z)</code> 的输出视为伯努利分布的参数 <code>p</code>，我们得到：</p><p>$$
\begin{align}
\hat{\nabla}_{\theta} &amp;\approx \nabla_{\theta} \left( \sum_{p=1}^{P} x^{(p)} \log g_{\theta} (z)^{(p)} + (1 - x^{(p)}) \log(1 - g_{\theta} (z)^{(p)}) \right) \\
&amp;= -\nabla_{\theta} \sum_{p=1}^{P} \text{BCE}(g_{\theta} (z)^{(p)}, x^{(p)})  \\
&amp;= -\nabla_{\theta} \text{BCE}(g_{\theta} (z), x) 
\end{align}
$$</p><blockquote><p>这里的梯度就是我们非常熟悉的<strong>二元交叉熵 (Binary Cross-Entropy, BCE) 损失</strong>的负梯度。在任何自动微分库（如 PyTorch）中，这都可以被直接计算。</p></blockquote>
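作为补充，这里给出一段纯 Python 的数值示意（非原文实现，像素值与网络输出均为虚构的演示数据）：按上面的式子逐像素计算二元交叉熵之和，可以看到重建越好，BCE 越小。

```python
import math

# 示意：逐像素二元交叉熵之和
# BCE(g, x) = -Σ_p [ x_p log(g_p) + (1 - x_p) log(1 - g_p) ]
def bce(pred, target, eps=1e-12):
    return -sum(
        x * math.log(g + eps) + (1 - x) * math.log(1 - g + eps)
        for g, x in zip(pred, target)
    )

x = [1, 0, 1, 1]                # 二值化像素 x
g_good = [0.9, 0.1, 0.8, 0.95]  # 重建较好的解码器输出 g_theta(z)
g_poor = [0.5, 0.5, 0.5, 0.5]   # 无信息的输出

loss_good = bce(g_good, x)
loss_poor = bce(g_poor, x)
print(loss_good, loss_poor)
```

在 PyTorch 等自动微分框架中，这一损失对应现成的 BCE 实现，梯度由框架自动回传。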
</li></ol></li></ul><hr/><h3 id="2-encoder---">2. 编码器（Encoder）参数 <code>ϕ</code> 的梯度 <code>∇ϕ</code></h3><p>这部分要复杂得多，因为它包含两项，且其中一项的梯度不能直接计算。</p><h4 id="a-kl---dklqzxpz-">A. KL 散度项 <code>-DKL(qϕ(z|x)||p(z))</code> 的梯度</h4><ul><li><strong>核心思路</strong>：总的 KL 散度是 <code>D</code> 个独立隐变量的 KL 散度之和。我们可以先分析单个变量，再求和。</li><li><p><strong>推导过程</strong>：对于单个隐变量 <code>z^(d)</code>，其 KL 散度展开后经过化简可以得到：</p><p>  $$
  \begin{align}
  -D_{KL}(q_{\phi}(z^{(d)}|x)||p(z)) &amp;= \sum_{k=1}^{K} q_{\phi}(z^{(d)}_k |x) \log \frac{p(z)_k}{q_{\phi}(z^{(d)}_k |x)}  \\
  &amp;= \sum q \log p - \sum q \log q \\
  &amp;= \log \frac{1}{K} \sum q - \sum q \log q \quad (\text{因为 } p(z) \text{ 是均匀分布}) \\
  &amp;= -\log K + \text{Entropy}(q_{\phi}(z^{(d)}|x))
  \end{align}
  $$</p><p>  这里 <code>Entropy</code> 是编码器输出的范畴分布的<strong>熵</strong>。</p><p>  对所有 <code>D</code> 个变量求和，这一项的梯度变为：</p><p>$$
\nabla_{\phi} \sum_{d=1}^{D} \text{Entropy}(f_{\phi}(x)^{(d)}) = \nabla_{\phi} \text{Entropy}(f_{\phi}(x))
$$</p><p>最大化 ELBO 意味着要最大化编码器输出分布的熵，这鼓励编码器不要对某个类别过于自信，起到正则化的作用。</p></li></ul><h4 id="b--elog-pxz-">B. 重建项 <code>E[log pθ(x|z)]</code> 的梯度</h4><ul><li><strong>挑战</strong>：在这里，我们不能将梯度 <code>∇ϕ</code> 直接移入期望，因为期望本身就是对 <code>qϕ</code> 计算的，<code>qϕ</code> 依赖于 <code>ϕ</code>。</li><li><p>解决方案：<strong>Log-Derivative Trick</strong>
  我们使用一种名为“对数-导数技巧”（在强化学习中也叫 REINFORCE）的方法来重写梯度：</p><p>  $$
  \nabla_{\phi}\mathbb{E}_{q_{\phi}(z|x)} [f(z)] = \mathbb{E}_{q_{\phi}(z|x)} [f(z) \nabla_{\phi} \log q_{\phi}(z|x)]
  $$</p><p>  这个技巧再次将梯度移入了期望内部，使我们又能使用蒙特卡洛采样。</p></li><li><p><strong>推导过程</strong>：</p><ol start="1"><li><p>应用该技巧后，我们的梯度近似为：</p><p>$$
\hat{\nabla}_{\phi} \approx \log p_{\theta} (x|z) \cdot \nabla_{\phi} \log q_{\phi}(z|x) \quad (73)
$$</p></li><li>第一部分 <code>log pθ(x|z)</code> 我们已经知道，它就是负的 BCE 损失：<code>-BCE(gθ(z), x)</code>。</li><li><p>第二部分 <code>log qϕ(z|x)</code> 是 <code>D</code> 个隐变量的对数概率之和。对于一个被采样出的 one-hot 样本 <code>z</code>，<code>log qϕ(z|x)</code> 等于所有被选中类别的对数概率之和。记 <code>k(d)</code> 为第 <code>d</code> 个变量被采样的类别索引，则：</p><p>$$
\log q_{\phi}(z|x) = \sum_{d=1}^{D} \log f_{\phi}(x)^{(d)}_{k^{(d)}} \quad (80)
$$</p></li><li><p>组合起来，得到最终的梯度表达式：</p><p>$$
\hat{\nabla}_{\phi} \approx - \text{BCE}(g_{\theta} (z), x) \nabla_{\phi} \left( \sum_{d=1}^{D} \log f_{\phi}(x)^{(d)}_{k^{(d)}} \right) \quad (81)
$$</p></li></ol></li><li><p><strong>直观理解</strong>：这个梯度形式可以看作是：<code>[奖励]</code> 乘以 <code>[采取动作的对数概率的梯度]</code>。这里的“奖励”是重建效果的好坏（<code>-BCE</code>，重建越好，此项值越大），“动作”是编码器选择了某个类别。</p></li></ul><hr/><h3 id="3--elbo-">3. 最终的 ELBO 损失函数</h3><p>综合以上推导，我们可以写出在训练中实际计算和追踪的 ELBO 的蒙特卡洛近似形式。记住，我们的目标是<strong>最大化</strong> ELBO。</p><p> $$
 \mathcal{L}_{\text{ELBO}} \approx \underbrace{\text{Entropy}(f_{\phi} (x)) - D \log K}_{\text{来自 } -D_{KL}(q||p) \text{ 项}} \underbrace{- \text{BCE}(g_{\theta} (z), x)}_{\text{来自 } E_q[\log p(x|z)] \text{ 项}} \quad (83)
  $$</p><p>在实践中，我们通常会最小化 <strong>-ELBO</strong>，其损失函数为：</p><p>Loss = <code>BCE 重建损失</code> - <code>熵正则化项</code> + <code>常数</code></p><p>这个公式清晰地告诉了我们训练离散 VAE 的全部内容：</p><ol start="1"><li><strong>最小化重建损失 (BCE)</strong>：让解码器学会如何从隐变量中恢复原始图像。</li><li><strong>最大化熵 (Entropy)</strong>：鼓励编码器的输出分布更多样化，防止模式坍塌。</li><li><code>D log K</code> 是一个常数，在比较不同隐空间大小的模型时很重要，但在单次训练的梯度计算中可以忽略。</li></ol></div><p style="text-align:right"><a href="https://www.coder-nova.com/posts/ai/discreteVAE#comments">Finished reading? Say something</a></p></div>]]></description><link>https://www.coder-nova.com/posts/ai/discreteVAE</link><guid isPermaLink="true">https://www.coder-nova.com/posts/ai/discreteVAE</guid><dc:creator><![CDATA[yang]]></dc:creator><pubDate>Mon, 16 Jun 2025 03:04:13 GMT</pubDate></item><item><title><![CDATA[Mac连接Dell外接显示器时RGB模式颜色反转的解决方法]]></title><description><![CDATA[<div><blockquote>This rendering is generated by Shiro API, and there may be formatting issues. For the best experience, please visit:<a href="https://www.coder-nova.com/posts/tech/mac-rgb-20250614">https://www.coder-nova.com/posts/tech/mac-rgb-20250614</a></blockquote><div><h2 id="">问题说明</h2><p>Mac外接Dell显示器时，会被识别为YPbPr模式，导致画面偏色。当在显示器设置中切换为RGB模式时，又可能出现颜色反转的问题。</p><hr/><h2 id="1-">1. 解决方案：修改系统配置文件</h2><p>英文原文：<a href="https://forums.macrumors.com/threads/mbp-m1-and-lg-27uk850-w-washed-out-colors.2270452/page-5?post=30262233#post-30262233">https://forums.macrumors.com/threads/mbp-m1-and-lg-27uk850-w-washed-out-colors.2270452/page-5?post=30262233#post-30262233</a></p><h3 id="">步骤</h3><ol start="1"><li><p>由于plist文件是二进制文件，所以无法用vscode直接打开，这里我使用的是vscode中的Binary Plist插件。</p></li><li><p>使用命令行前往文件夹:
<code>cd ~/Library/Preferences/ByHost</code></p></li><li><p>查找名为 <code>com.apple.windowserver.displays.[UUID].plist</code> 的文件备份并删除。（怕有问题的话建议备份！）</p></li><li><p>使用命令行前往文件夹：
<code>cd /Library/Preferences</code></p></li><li><p>找到并打开 <code>com.apple.windowserver.displays.plist</code> 文件</p><blockquote><p>注意：由于是系统文件，可能需要chmod来更改权限为777，修改完成后记得改回之前权限644。</p><p>注意：此文件为plist，必须用支持的编辑器（如vscode中的Binary Plist插件）打开。</p></blockquote>
</li><li><p>查找所有 <code>&lt;key&gt;CurrentInfo&lt;/key&gt;</code>，找到对应的 <code>&lt;dict&gt; ... &lt;/dict&gt;</code> 段落</p><blockquote><p>注意：应该可以仅调整需要的分辨率，我并没有尝试。</p></blockquote>
</li><li><p>在对应段落的 <code>&lt;/dict&gt;</code> 后插入如下内容：</p><blockquote><p>不必修改以 <code>&lt;key&gt;UnmirrorInfo&lt;/key&gt;</code> 开头的段落。</p></blockquote>
</li></ol><pre><code class="language-xml">&lt;key&gt;LinkDescription&lt;/key&gt;
&lt;dict&gt;
    &lt;key&gt;BitDepth&lt;/key&gt;
    &lt;integer&gt;8&lt;/integer&gt;
    &lt;key&gt;EOTF&lt;/key&gt;
    &lt;integer&gt;0&lt;/integer&gt;
    &lt;key&gt;PixelEncoding&lt;/key&gt;
    &lt;integer&gt;0&lt;/integer&gt;
    &lt;key&gt;Range&lt;/key&gt;
    &lt;integer&gt;1&lt;/integer&gt;
&lt;/dict&gt;</code></pre><p>例（这里仅展示了第一个修改处）：</p><pre><code class="language-xml">&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?&gt;
&lt;!DOCTYPE plist PUBLIC &quot;-//Apple//DTD PLIST 1.0//EN&quot; &quot;http://www.apple.com/DTDs/PropertyList-1.0.dtd&quot;&gt;
&lt;plist version=&quot;1.0&quot;&gt;
&lt;dict&gt;
    &lt;key&gt;DisplayAnyUserSets&lt;/key&gt;
    &lt;dict&gt;
        &lt;key&gt;Configs&lt;/key&gt;
        &lt;array&gt;
            &lt;dict&gt;
                &lt;key&gt;ConfigVersion&lt;/key&gt;
                &lt;integer&gt;1&lt;/integer&gt;
                &lt;key&gt;DisplayConfig&lt;/key&gt;
                &lt;array&gt;
                    &lt;dict&gt;
                        &lt;key&gt;CurrentInfo&lt;/key&gt;
                        &lt;dict&gt;
                            &lt;key&gt;Depth&lt;/key&gt;
                            &lt;integer&gt;8&lt;/integer&gt;
                            &lt;key&gt;High&lt;/key&gt;
                            &lt;real&gt;900&lt;/real&gt;
                            &lt;key&gt;Hz&lt;/key&gt;
                            &lt;real&gt;60&lt;/real&gt;
                            &lt;key&gt;IsLink&lt;/key&gt;
                            &lt;false/&gt;
                            &lt;key&gt;IsVRR&lt;/key&gt;
                            &lt;false/&gt;
                            &lt;key&gt;OriginX&lt;/key&gt;
                            &lt;real&gt;0.0&lt;/real&gt;
                            &lt;key&gt;OriginY&lt;/key&gt;
                            &lt;real&gt;0.0&lt;/real&gt;
                            &lt;key&gt;Scale&lt;/key&gt;
                            &lt;real&gt;2&lt;/real&gt;
                            &lt;key&gt;Wide&lt;/key&gt;
                            &lt;real&gt;1440&lt;/real&gt;
                        &lt;/dict&gt;
                        &lt;key&gt;LinkDescription&lt;/key&gt;
                        &lt;dict&gt;
                            &lt;key&gt;BitDepth&lt;/key&gt;
                            &lt;integer&gt;8&lt;/integer&gt;
                            &lt;key&gt;EOTF&lt;/key&gt;
                            &lt;integer&gt;0&lt;/integer&gt;
                            &lt;key&gt;PixelEncoding&lt;/key&gt;
                            &lt;integer&gt;0&lt;/integer&gt;
                            &lt;key&gt;Range&lt;/key&gt;
                            &lt;integer&gt;1&lt;/integer&gt;
                        &lt;/dict&gt;
                        &lt;key&gt;Rotation&lt;/key&gt;
                        &lt;real&gt;0.0&lt;/real&gt;
                        &lt;key&gt;UUID&lt;/key&gt;
                        &lt;string&gt;[UUID xxx]&lt;/string&gt;
                        &lt;key&gt;UnmirrorInfo&lt;/key&gt;
                        &lt;dict&gt;
                            &lt;key&gt;Depth&lt;/key&gt;
                            &lt;integer&gt;8&lt;/integer&gt;
                            &lt;key&gt;High&lt;/key&gt;
                            &lt;real&gt;900&lt;/real&gt;
                            &lt;key&gt;Hz&lt;/key&gt;
                            &lt;real&gt;60&lt;/real&gt;
                            &lt;key&gt;IsLink&lt;/key&gt;
                            &lt;false/&gt;
                            &lt;key&gt;IsVRR&lt;/key&gt;
                            &lt;false/&gt;
                            &lt;key&gt;OriginX&lt;/key&gt;
                            &lt;real&gt;0.0&lt;/real&gt;
                            &lt;key&gt;OriginY&lt;/key&gt;
                            &lt;real&gt;0.0&lt;/real&gt;
                            &lt;key&gt;Scale&lt;/key&gt;
                            &lt;real&gt;2&lt;/real&gt;
                            &lt;key&gt;Wide&lt;/key&gt;
                            &lt;real&gt;1440&lt;/real&gt;
                        &lt;/dict&gt;
                    &lt;/dict&gt;
                    &lt;dict&gt;</code></pre><h2 id="2-rgb--ypbpr-chatgpt">2. 考察：RGB 与 YPbPr 显示模式的区别（ChatGPT）</h2><table><thead><tr><th>比较项目</th><th>RGB 模式</th><th>YPbPr 模式</th></tr></thead><tbody><tr><td>信号类型</td><td>数字信号，每个颜色分量单独传递</td><td>模拟色差信号，分量混合传递</td></tr><tr><td>色彩表现</td><td>色彩鲜艳自然，准确还原</td><td>色彩偏淡，容易发灰或失真</td></tr><tr><td>推荐场景</td><td>电脑、Mac、显示器等数码设备</td><td>电视、DVD、老视频设备等</td></tr></tbody></table></div><p style="text-align:right"><a href="https://www.coder-nova.com/posts/tech/mac-rgb-20250614#comments">Finished reading? Say something</a></p></div>]]></description><link>https://www.coder-nova.com/posts/tech/mac-rgb-20250614</link><guid isPermaLink="true">https://www.coder-nova.com/posts/tech/mac-rgb-20250614</guid><dc:creator><![CDATA[yang]]></dc:creator><pubDate>Sat, 14 Jun 2025 08:02:34 GMT</pubDate></item><item><title><![CDATA[多臂赌博机（Multi-Armed Bandit）问题与UCB]]></title><description><![CDATA[<div><blockquote>This rendering is generated by Shiro API, and there may be formatting issues. For the best experience, please visit:<a href="https://www.coder-nova.com/posts/ai/bandit">https://www.coder-nova.com/posts/ai/bandit</a></blockquote><div><h1 id="-ucb-">强化学习入门：理解 UCB 动作选择策略</h1><h2 id="">前言：</h2><p>最近在阅读强化学习导论，由于内容过于理论，看的有些迷茫。为了更好地理解相关知识，计划开始结合AI的回答来做一些笔记。</p><h2 id="multi-armed-bandit">多臂赌博机（Multi-Armed Bandit）</h2><p>想象一下你面前有很多台老虎机，每台老虎机吐钱的概率是不同的，但是你事先不知道哪台概率高，哪台概率低。你的目标是在有限的次数内，尽可能多地从这些老虎机里赢钱。这就是一个经典的多臂赌博机（Multi-Armed Bandit）问题。</p><p>在强化学习中，智能体（Agent）就像是玩老虎机的你，它需要在一个环境中做出动作（Action），比如选择拉哪一台老虎机的摇臂。每次动作之后，环境会给出一个奖励（Reward），比如老虎机吐出的钱。智能体的目标是学习一个策略（Policy），也就是选择动作的方法，来最大化它能获得的总奖励。</p><h2 id="-exploration-vs--exploitation">核心挑战：探索 (Exploration) vs. 
利用 (Exploitation)</h2><p>智能体如何选择动作呢？</p><ol start="1"><li><p><strong>利用 (Exploitation):</strong></p><ul><li><strong>做法：</strong> 坚持选择当前已知能够带来最高平均奖励的动作（比如一直玩那台看起来最容易出钱的老虎机）。</li><li><strong>好处：</strong> 能稳定地获得当前已知的最好奖励。</li><li><strong>坏处：</strong> 可能错过一个实际上更好但尚未充分尝试的动作。</li></ul></li><li><p><strong>探索 (Exploration):</strong></p><ul><li><strong>做法：</strong> 尝试不同的动作，包括那些目前看起来不是最优的，目的是收集更多关于它们的信息（比如试试别的老虎机）。</li><li><strong>好处：</strong> 有机会发现比当前已知最优动作更好的新动作。</li><li><strong>坏处：</strong> 可能会在效果不佳的动作上浪费尝试次数和时间。</li></ul></li></ol><h2 id="">简单的动作选择策略及其局限</h2><ul><li><strong>贪心策略 (Greedy):</strong> 永远选择当前看起来最好的动作。
  <ul><li><strong>问题：</strong> 纯粹利用，容易陷入局部最优解，无法发现潜在的更好选择。</li></ul></li><li><strong>ε-贪心策略 (Epsilon-Greedy):</strong> 大部分时间选择当前最好的动作（利用），但有 ε 的小概率随机选择一个动作（探索）。
  <ul><li><strong>问题：</strong> 探索是完全随机的，不够智能，没有优先考虑那些“更有探索价值”的动作。</li></ul></li></ul><h2 id="ucb-upper-confidence-bound----">UCB (Upper Confidence Bound - 置信区间上界) 策略</h2><ul><li><strong>优先考虑：</strong>
  <ul><li>已被证明效果好的动作（高平均奖励）。</li><li>尚未充分尝试，但<strong>潜力巨大</strong>（因为不确定性高而被赋予乐观估计）的动作。</li></ul></li><li><strong>效果：</strong> 它鼓励智能体尝试那些“知之甚少”的选项，有效避免过早收敛到次优动作，并随着信息积累逐渐聚焦于最优动作。</li></ul><p>它不仅考虑一个动作过去的平均奖励，还考虑我们对这个动作估计的 <strong>不确定性</strong>。对于每个可能的动作，UCB 会计算一个分数：</p><p><strong>UCB 分数 = (当前估计的平均奖励) + (不确定性奖励加成)</strong></p><ol start="1"><li><p><strong>当前估计的平均奖励 (Exploitation Part):</strong></p><ul><li>计算方式：该动作迄今为止获得的总奖励 / 该动作被选择的总次数。</li><li>作用：反映了该动作基于历史数据的表现，数值越高，越倾向于被“利用”。</li></ul></li><li><p><strong>不确定性奖励加成 (Exploration Part):</strong></p><ul><li>这是 UCB 的关键，用于量化对动作估计的不确定程度。</li><li><strong>特点：</strong>
  <ul><li>一个动作被选择的 <strong>次数越少</strong>，我们对它的了解就越少，<strong>不确定性越高</strong>，这个 <strong>加成项就越大</strong>，鼓励探索。</li><li>一个动作被选择的 <strong>次数越多</strong>，我们对它的估计就越自信，<strong>不确定性越低</strong>，这个 <strong>加成项就越小</strong>。</li></ul></li><li><strong>依赖关系：</strong> 这个加成项通常与总尝试次数（$N$）和该特定动作被尝试的次数（$n_a$）有关。一个常见的形式是 $c \cdot \sqrt{\ln(N) / n_a}$，其中 $c$ 是一个超参数，用于控制探索的程度。</li></ul></li></ol><h2 id="ucb-">UCB 的工作流程</h2><p>在每个决策点：</p><ol start="1"><li>智能体为<strong>每一个</strong>可以采取的动作计算其当前的 UCB 分数。</li><li>智能体选择那个 <strong>UCB 分数最高的动作</strong> 执行。</li></ol><p><strong>这样如何平衡探索与利用？</strong></p><ul><li><strong>高平均奖励 + 低不确定性：</strong> 如果一个动作历史表现很好且被尝试多次（不确定性加成小），它的高平均奖励仍然可能使其 UCB 分数最高（倾向于利用）。</li><li><strong>中等/低平均奖励 + 高不确定性：</strong> 如果一个动作历史表现一般或较差，但被尝试的次数很少（不确定性加成很大），这个巨大的加成可能使其总 UCB 分数超过其他动作，从而被选中（倾向于探索）。</li><li><strong>动态调整：</strong> 随着某个动作被选择次数增多，其不确定性加成会下降。
  <ul><li>如果它确实是好动作，其平均奖励会保持或升高，继续被选中。</li><li>如果它是不好的动作，其平均奖励会下降，即使不确定性降低，总分也会下降，被选中的概率降低。</li></ul></li></ul></div><p style="text-align:right"><a href="https://www.coder-nova.com/posts/ai/bandit#comments">Finished reading? Say something</a></p></div>]]></description><link>https://www.coder-nova.com/posts/ai/bandit</link><guid isPermaLink="true">https://www.coder-nova.com/posts/ai/bandit</guid><dc:creator><![CDATA[yang]]></dc:creator><pubDate>Mon, 14 Apr 2025 15:01:18 GMT</pubDate></item><item><title><![CDATA[Deep Research: Deep Unrolling with ADMM and LASSO in NLOS, CT, and MRI Inverse Problems]]></title><description><![CDATA[<div><blockquote>This rendering is generated by Shiro API, and there may be formatting issues. For the best experience, please visit:<a href="https://www.coder-nova.com/posts/ai/deep_search_001">https://www.coder-nova.com/posts/ai/deep_search_001</a></blockquote><div><p>I will conduct a deep investigation into how deep unrolling (deep unfolding) methods integrate with ADMM and LASSO for solving inverse problems in NLOS, CT, and MRI reconstruction. 
The research will focus on:</p><ul><li><p><strong>Mathematical Formulation:</strong>  High-level explanations of inverse problem models and how deep unrolling approximates iterative solvers.</p></li><li><p><strong>Algorithmic Implementation:</strong>  How neural networks integrate with ADMM and LASSO-based optimization.</p></li><li><p><strong>Code-Level Analysis:</strong>  PyTorch implementations from open-source repositories.</p></li><li><p><strong>Performance Evaluation:</strong>  Comparison between deep unrolling methods and traditional solvers in terms of accuracy, convergence speed, and robustness.</p></li><li><p><strong>Datasets:</strong>  Identification of benchmark datasets for NLOS, CT, and MRI testing.</p></li><li><p><strong>Sources:</strong>  Top conferences and major journals, prioritizing those with GitHub links.</p></li></ul><h1 id="deep-unrolling-with-admm-and-lasso-in-nlos-ct-and-mri-inverse-problems">Deep Unrolling with ADMM and LASSO in NLOS, CT, and MRI Inverse Problems</h1><h2 id="mathematical-formulation-of-inverse-problems">Mathematical Formulation of Inverse Problems</h2><p><strong>Inverse Problems in NLOS, CT, and MRI:</strong>  Non-line-of-sight (NLOS) imaging, X-ray computed tomography (CT), and magnetic resonance imaging (MRI) can all be framed as inverse problems. Each has a forward model linking hidden object properties to measured data, and the goal is to invert this model to recover an image. NLOS imaging measures time-resolved light after multiple bounces to reconstruct a hidden scene – a highly ill-posed problem due to severe loss of information and noise (different hidden scenes can produce the same measurement)​<a href="https://diglib.eg.org/bitstream/handle/10.1111/cgf14958/v42i7_30_14958.pdf#:~:text=On%20one%20hand%2C%20due%20to,spatial%20resolution%20will%20be%20limited">DIGLIB.EG.ORG</a> . CT reconstruction seeks to recover an interior attenuation map $x$ from projection data $y$ (sinograms) via the Radon transform $A$ (i.e. 
  $y = A x + \epsilon$)<a href="https://www.uni-bremen.de/fileadmin/user_upload/fachbereiche/fb3/techmath/images/research/dlip/lodopab-paper.pdf#:~:text=The%20task%20in%20CT%20belongs,The%20noise%20%CE%B5%20has%20a">UNI-BREMEN.DE</a>. Because the Radon transform is a compact operator and only finitely many angles are measured, the inverse problem is ill-posed<a href="https://www.uni-bremen.de/fileadmin/user_upload/fachbereiche/fb3/techmath/images/research/dlip/lodopab-paper.pdf#:~:text=transform%20is%20linear%20and%20compact%2C,In%20the%20discrete%20case%20the">UNI-BREMEN.DE</a>, especially in <em>limited-view</em> or <em>low-dose</em> CT scenarios. MRI reconstruction involves recovering an image from undersampled $k$-space (Fourier) measurements. Accelerating MRI by undersampling makes the inverse problem ill-posed, as too few measurements lead to infinitely many solutions. Compressed sensing (CS) theory addresses this by imposing priors like sparsity (e.g. in the wavelet domain) to constrain the solution<a href="https://arxiv.org/html/2412.18668v1#:~:text=MRI%20acquisition%20can%20be%20accelerated,5%20%2C%20%2010">ARXIV.ORG</a>.</p><p><strong>Role of ADMM and LASSO in Sparse, Ill-posed Problems:</strong> Ill-posed inverse problems are often tackled by formulating a regularized optimization. A common example is the LASSO (Least Absolute Shrinkage and Selection Operator), which uses an $\ell_1$-norm penalty to promote sparsity in the solution. For instance, one may solve $\min_x \|Ax - y\|_2^2 + \lambda \|x\|_1$ for CT/MRI (sparsity in the image or transform domain) or use an $\ell_1$ prior on the hidden scene in NLOS to exploit sparsity of reflectance. The Alternating Direction Method of Multipliers (ADMM) is a popular algorithm to solve such problems with composite objectives. ADMM splits the problem into subproblems that are easier to handle – e.g. 
a least-squares data fidelity update and a separate proximal update for the $\ell_1$ regularization (soft-thresholding)​<a href="https://web.stanford.edu/class/ee364b/lectures/admm_slides.pdf#:~:text=i%20%3A%3D%20S%CE%BB%2F%CF%81%28vi%29%20%28soft%20thresholding%29,30">WEB.STANFORD.EDU</a> . In effect, ADMM transforms a difficult global problem into <em>alternating</em> updates that converge to the solution under broad conditions. For LASSO and related sparse recovery tasks, ADMM is advantageous because each iteration involves a simple shrinkage (thresholding) step for the sparsity prior​<a href="https://www.math.cuhk.edu.hk/course_builder/2122/math4230/admm-with-proof-revised.pdf#:~:text=%28ADMM%29%20www,n%5D%29">MATH.CUHK.EDU.HK</a> . This makes it well-suited for large-scale imaging problems. Overall, ADMM (and similar proximal algorithms) provide a physics-guided iterative framework to enforce data fidelity and sparsity constraints, helping recover images from limited or noisy measurements.</p><p><strong>Deep Unrolling as Learned Iterative Optimization:</strong>  <em>Deep unrolling (or unfolding)</em> is a technique that bridges model-based iterations and data-driven learning. In deep unrolling, one starts from a traditional iterative algorithm (like ADMM or ISTA used for solving the above inverse problems) and <strong>unrolls</strong>  its update steps into the layers of a deep neural network. The fixed iterations of the original algorithm become a sequential network of a corresponding number of layers. Importantly, certain parameters of the algorithm (e.g. step sizes, regularization weights, thresholds) are treated as learnable parameters to be optimized during training. This way, the network mimics the physics-based solver but can <strong>learn an optimal strategy</strong>  for convergence from data. Deep unrolled networks thereby approximate iterative optimization while leveraging training data to achieve faster or better reconstruction. 
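Concretely, the ADMM iteration described above — a least-squares data-fidelity update, a soft-thresholding (proximal) update, and a dual update — can be sketched for the LASSO objective in a few lines of NumPy. This is a minimal illustrative implementation, not code from any of the cited papers:

```python
import numpy as np

def soft_threshold(v, k):
    """Proximal operator of k * ||.||_1 (elementwise shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - k, 0.0)

def admm_lasso(A, y, lam=0.1, rho=1.0, n_iter=200):
    """Scaled-form ADMM for min_x 0.5*||Ax - y||^2 + lam*||x||_1."""
    n = A.shape[1]
    x, z, u = np.zeros(n), np.zeros(n), np.zeros(n)
    AtA_rho = A.T @ A + rho * np.eye(n)  # x-update system; factor once in practice
    Aty = A.T @ y
    for _ in range(n_iter):
        x = np.linalg.solve(AtA_rho, Aty + rho * (z - u))  # data-fidelity update
        z = soft_threshold(x + u, lam / rho)               # sparsity proximal update
        u = u + x - z                                      # dual variable update
    return z
```

An unrolled network fixes `n_iter` to a small number and makes quantities such as `lam / rho` learnable per iteration.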
  They maintain interpretability (each layer has a known meaning) and often require far fewer iterations to reach high-quality solutions. In summary, the mathematical models (forward operators for NLOS, CT, MRI and sparsity priors) remain at the core, but deep unrolling introduces learnable components within the iterative solution process to handle ill-posedness more effectively.</p><h2 id="algorithmic-implementation-of-deep-unrolling-with-admm">Algorithmic Implementation of Deep Unrolling with ADMM</h2><p><strong>Unrolling ADMM-Based Solvers:</strong> Deep unrolling techniques often take inspiration from ADMM when dealing with LASSO-like formulations. For example, the ADMM algorithm for a sparse inverse problem might involve: (1) a <strong>data update</strong> solving a least-squares problem (e.g. using the measured data to update the image), (2) a <strong>sparsity proximal update</strong> applying soft-thresholding (shrinkage) to enforce sparsity, and (3) a dual variable update. In a deep unrolled network, this sequence is mapped to network layers. Each unrolling <em>stage</em> (layer or group of layers) performs operations analogous to one ADMM iteration. Trainable parameters can be introduced, such as learnable thresholds or relaxation parameters at each stage. Yan <em>et al.</em>’s <strong>ADMM-CSNet</strong> is a concrete example: it unrolls the ADMM solver for compressive sensing, with each network layer implementing the analytical steps of ADMM while allowing certain parameters (like the penalty parameter $\rho$ or threshold values) to be tuned by learning. This combination retains the <em>structured updates</em> of ADMM but adapts them for better performance on natural images. In practice, unrolled ADMM networks often limit the number of stages (e.g. 
5–15 iterations unrolled) to balance complexity and performance, since a fully convergent ADMM (tens of iterations) may not be needed once learning finds a good set of parameters.</p><p><strong>Neural Networks Approximating Iterative Solvers:</strong>  In deep unrolling, neural network components are sometimes inserted to replace or augment specific steps of the algorithm. A common approach is to replace the “proximal operator” (which enforces the prior) with a small learned network. For instance, instead of a simple soft-threshold, one could use a convolutional neural network as a learned denoiser or projector in each iteration. Many architectures have been proposed along these lines. <strong>ISTA-Net</strong>  (Zhang <em>et al.</em> 2018) unrolls the Iterative Shrinkage-Thresholding Algorithm (ISTA) for sparse image reconstruction, with learnable transforms and thresholds at each layer​<a href="https://www.mdpi.com/1999-4893/16/6/270#:~:text=19,PubMed">MDPI.COM</a> . <strong>ADMM-Net</strong>  and <strong>ADMM-CSNet</strong>  (Yang <em>et al.</em> 2016, 2018) unroll ADMM for MRI and general compressive imaging, learning the regularization parameters and achieving significant speed-ups​<a href="https://github.com/yangyan92/Deep-ADMM-Net#:~:text=This%20is%20a%20testing%20and,%28NIPS%202016">GITHUB.COM</a>  . <strong>Learned Primal-Dual</strong>  (Adler and Öktem 2018) unrolls a primal-dual hybrid gradient method for CT, with learnable operator blocks that handle forward and backward projection updates​<a href="https://www.mdpi.com/1999-4893/16/6/270#:~:text=22,Google%20Scholar">MDPI.COM</a> . In all these cases, the iterative solver’s mathematical operations (e.g. Fourier transforms in MRI, Radon projections in CT, light transport in NLOS) are embedded in the network’s layers, ensuring that <em>physics constraints are hard-wired</em>. 
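The shared pattern behind these unrolled models can be shown with a toy PyTorch sketch: a fixed forward operator is hard-wired into every layer, while each layer’s gradient-step size and soft-threshold are learnable. This is a simplified illustration, not the actual ISTA-Net or ADMM-Net architecture:

```python
import torch
import torch.nn as nn

class UnrolledISTA(nn.Module):
    """K ISTA iterations unrolled into layers: each layer takes a gradient
    step on ||Ax - y||^2, then applies a soft-threshold whose step size
    and threshold are learned per layer."""
    def __init__(self, A, n_layers=10):
        super().__init__()
        self.register_buffer("A", A)  # fixed physics operator, not trained
        self.steps = nn.Parameter(torch.full((n_layers,), 0.01))
        self.thresholds = nn.Parameter(torch.full((n_layers,), 1e-3))

    def forward(self, y):
        x = torch.zeros(self.A.shape[1], device=y.device)
        for t, lam in zip(self.steps, self.thresholds):
            x = x - t * (self.A.T @ (self.A @ x - y))           # data-consistency step
            x = torch.sign(x) * torch.relu(torch.abs(x) - lam)  # learned shrinkage
        return x
```

Replacing the fixed soft-threshold with a small CNN in each layer turns this into the learned-proximal variant described above.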
  The neural network introduces flexibility through trainable weights that can, for example, adapt the strength of regularization per iteration or learn an enhanced prior beyond simple sparsity. This yields an <strong>interpretable yet trainable</strong> system: as one paper puts it, unrolled networks “incorporate the forward model of the imaging system…and [replace] one or more of its steps with a neural network,” then train by unrolling for a fixed number of steps<a href="https://arxiv.org/html/2412.18668v1#:~:text=they%20do%20not%20directly%20incorporate,Based%20Deep%20Learning%20%28MoDL%29%C2%A0%5B14">ARXIV.ORG</a>. Essentially, the neural network <strong>learns to optimize</strong> – it mimics the trajectory of an iterative algorithm but in a data-driven manner.</p><p><strong>Deep Learning vs. Analytical Approaches:</strong> Compared to purely analytical solvers, deep unrolled methods offer a trade-off between domain knowledge and data adaptation. Traditional algorithms (like ADMM, FISTA, or iterative ART for CT) rely on manually chosen parameters and often need many iterations for high accuracy. They are generic and can be applied to any new data if one can tune the hyperparameters, but they don’t learn from examples. Deep unrolling infuses learning into these iterations: the algorithm’s <strong>structure</strong> prevents the network from straying far from feasible solutions, while learning allows cutting down iteration count and adjusting to the characteristics of training data. This often yields <strong>faster convergence and higher accuracy</strong>. For example, ADMM-CSNet was shown to reconstruct images with the same accuracy using 10% fewer measurements than standard iterative methods and to run ~40× faster than a traditional algorithm on compressive sensing tasks. 
  In CT reconstruction, unrolled model-based networks have outperformed conventional iterative reconstruction in low-dose or few-view settings in terms of PSNR/SSIM, since the network can suppress noise and artifacts more effectively by learning from training images. Moreover, unrolled models tend to be more <strong>interpretable</strong> than generic deep networks (like pure U-Nets) because each layer corresponds to a known operation, making it easier to trust and analyze the reconstruction process. However, there are also trade-offs: deep models require a representative training dataset and may <strong>generalize poorly</strong> outside the training distribution. If the noise level, sampling pattern, or object characteristics change, a network might not perform optimally unless retrained or designed for robustness. In a reported case, a top-performing MRI reconstruction network failed to reconstruct critical details when the test data had a slightly different noise distribution than what it saw in training<a href="https://arxiv.org/html/2412.18668v1#:~:text=Deep%20learning%20methods%20achieve%20state,improve%20robustness%20and%20generalization%20include">ARXIV.ORG</a>. Traditional solvers, in contrast, can handle such changes by re-tuning parameters (since they don’t <em>learn</em> specific features, they inherently apply the same physics prior to any data). Researchers are actively addressing these issues, for example by incorporating uncertainty modeling or robust training into unrolled networks. 
In summary, deep unrolling marries the reliability of physics-based algorithms with the adaptivity of deep learning – often yielding superior speed and accuracy on benchmark tasks – but care must be taken to ensure they remain robust and generalizable beyond the training set.</p><h2 id="code-level-analysis-and-implementation-details">Code-Level Analysis and Implementation Details</h2><p><strong>Open-Source Implementations:</strong>  The growing interest in deep unfolding has led to numerous open-source projects and research code releases. For instance, <strong>DeepInverse (deepinv)</strong>  is a PyTorch-based library that provides a unified framework for inverse problems, including easy-to-build <em>unfolded architectures</em> for algorithms like ADMM and forward-backward splitting​<a href="https://github.com/deepinv/deepinv#:~:text=,Langevin%2C%20diffusion%2C%20etc">GITHUB.COM</a> . It offers predefined physics operators (for MRI, CT, deblurring, etc.) and allows researchers to quickly prototype unrolled networks. Many research papers also release code on GitHub: the authors of <em>Deep ADMM-Net for compressive MRI</em> (NIPS 2016) released MATLAB code for their trained model and layers (with the ADMM updates implemented as network layers)​<a href="https://github.com/yangyan92/Deep-ADMM-Net#:~:text=This%20is%20a%20testing%20and,%28NIPS%202016">GITHUB.COM</a> . More recent works typically use Python frameworks – e.g., <strong>CTprintNet</strong>  (an unrolled few-view CT reconstruction method) was implemented in Python using PyTorch​<a href="https://www.mdpi.com/1999-4893/16/6/270#:~:text=,View%20CT">MDPI.COM</a> . Similarly, a <em>primal-dual unrolling</em> approach for CT with a total variation prior (sometimes called PD-Net) was implemented in PyTorch in a 2021 study​<a href="https://aapm.onlinelibrary.wiley.com/doi/10.1002/mp.16307#:~:text=A%20total%20variation%20prior%20unrolling,that%20unfolding%20the%20TV">AAPM.ONLINELIBRARY.WILEY.COM</a> . 
These repositories often include training scripts and pretrained models, making it easier to reproduce results on common datasets. Beyond individual papers’ code, community toolboxes like <strong>MIRTorch</strong>  (Michigan Image Reconstruction Toolbox) and ODL (Operator Discretization Library) have started integrating deep learning modules with classic reconstruction, enabling custom unrolled networks using their operators.</p><p><strong>ADMM-Based Unfolding in PyTorch:</strong>  Implementing an ADMM-unrolled network in PyTorch typically involves writing a custom <code>nn.Module</code> for each iteration (or one module that iterates internally). Each iteration might consist of operations like linear forward projection (or Fourier transform), an inverse or pseudo-inverse step (often a convolution or FFT-based solve), and a shrinkage/threshold step. These are differentiable operations (soft-threshold is differentiable almost everywhere), so the whole unrolled loop can be backpropagated through. One straightforward approach is to explicitly unroll a fixed number of iterations and treat them as sequential layers. PyTorch’s autograd will handle the gradients through all iterations. However, unrolling many iterations can lead to <strong>high memory usage</strong>  because intermediate states from every layer must be stored for backpropagation. Researchers address this in code by using techniques like <em>gradient checkpointing</em>, which trades extra computation for lower memory by recomputing some intermediate results on the fly instead of storing them. Another approach is <strong>sequential or stage-wise training</strong> : instead of backpropagating through all unrolled iterations at once, one can train the network one iteration at a time or in small blocks, gradually building up to the full unrolled depth. 
This was demonstrated to drastically reduce memory requirements (by up to 98%) and enable training very deep unrolled models that were previously infeasible​<a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC7612803/#:~:text=reconstruction%20error%20compared%20to%20using,backpropagation%20through%20the%20entire%20network">PMC.NCBI.NLM.NIH.GOV</a> . For example, in 3D imaging (where each “image” is a volume, greatly increasing memory use), a sequential training strategy allowed a fully unrolled 3D reconstruction network to be trained when naive end-to-end training would exhaust GPU memory​<a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC7612803/#:~:text=reconstruction%20error%20compared%20to%20using,backpropagation%20through%20the%20entire%20network">PMC.NCBI.NLM.NIH.GOV</a> .</p><p><strong>Efficient GPU Computation:</strong>  Deep unfolding for imaging typically leverages heavy linear algebra (Fourier transforms in MRI, large matrix multiplications for CT projections, etc.), so efficient GPU utilization is critical. Implementations use batched operations and built-in primitives (like FFT routines or sparse matrix multiply) to speed up these physics-based layers. In MRI, for instance, one can implement the forward model as a fast FFT with masking of k-space samples, which PyTorch (or underlying libraries like FFTW/CuFFT) can do quickly. In CT, projection and backprojection can be the most time-consuming steps; libraries may use GPU-accelerated projectors (some researchers use custom CUDA kernels or the ASTRA toolbox via PyTorch interfaces). By integrating these with learnable CNN modules for the regularization part, the entire unrolled network runs on GPU end-to-end, avoiding slow data transfers. Memory optimization also involves using lower precision (mixed-precision training with float16) which many PyTorch frameworks support to reduce memory and increase throughput, without significant loss in reconstruction quality. 
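The checkpointing strategy mentioned above can be sketched as follows: each unrolled stage is wrapped in `torch.utils.checkpoint`, so its activations are recomputed during the backward pass instead of being stored. `Stage` here is a hypothetical toy stand-in for one unrolled iteration:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class Stage(nn.Module):
    """Toy unrolled iteration: a relaxation step toward the data plus
    a small learned correction (stand-in for a learned prior/denoiser)."""
    def __init__(self, n):
        super().__init__()
        self.step = nn.Parameter(torch.tensor(0.1))
        self.correct = nn.Linear(n, n)

    def forward(self, x, y):
        x = x - self.step * (x - y)      # stand-in for a data-fidelity update
        return x + 0.1 * self.correct(x)

def unrolled_forward(x, y, stages, use_checkpoint=True):
    """Run the unrolled stages in sequence; with checkpointing, each stage's
    intermediate activations are recomputed in backward rather than stored."""
    for stage in stages:
        if use_checkpoint and torch.is_grad_enabled():
            x = checkpoint(stage, x, y, use_reentrant=False)
        else:
            x = stage(x, y)
    return x
```

Only the stage inputs are kept in memory; the price is one extra forward computation per stage during the backward pass.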
  Moreover, modern PyTorch allows dynamic computation graphs, so some implementations use a <em>for-loop</em> in the <code>forward</code> method to iterate through layers – this makes it easy to adjust the number of unrolled iterations as a parameter. Care is taken in such cases to ensure the loop is unrolled in the graph for backprop (PyTorch will unroll it since the loop length is fixed, resulting in the same effect as manually writing layers).</p><p>In summary, code-level best practices for deep unrolling include: using existing high-performance ops for the physics model, structuring the network in iterative blocks, managing memory via checkpointing or staged training, and leveraging GPU parallelism for heavy computations. Many open-source codes and libraries are available, lowering the barrier to implement ADMM-based unfolding for NLOS, CT, or MRI problems.</p><h2 id="performance-evaluation-of-deep-unrolling-vs-traditional-solvers">Performance Evaluation of Deep Unrolling vs Traditional Solvers</h2><p><strong>Accuracy and Reconstruction Quality:</strong> Deep unrolled methods have demonstrated excellent accuracy in reconstructing images compared to traditional solvers. By learning an optimized prior or tuning algorithm parameters, they often achieve lower error (higher PSNR, SSIM) than purely physics-based methods under the same conditions. For example, ADMM-CSNet (for compressive sensing) improved reconstruction PSNR by about 3 dB over previous state-of-the-art methods at 20% sampling. In MRI, unrolled networks like MoDL and VarNet (a variational network used in the fastMRI challenge) have been shown to recover finer details and yield more natural-looking images than conventional compressed sensing (which might leave residual artifacts or noise) – provided the test data is similar to the training distribution. In CT, unrolled approaches (e.g. 
  learned primal-dual, ISTA-Net) have produced higher quality images from sparse-view or low-dose data than analytic algorithms (like FBP with filtering) or even hand-crafted iterative regularization (like TV minimization). They excel at suppressing noise and streak artifacts by learning from example images. That said, the <strong>ultimate accuracy</strong> can depend on training: if the network is well-trained on diverse data, it can outperform traditional methods across a range of cases; if not, there may be situations where a carefully tuned iterative method could still rival or beat the learned method (especially if the latter encounters an out-of-distribution case).</p><p><strong>Convergence Speed and Efficiency:</strong> One of the key advantages observed in deep unrolling is faster inference – i.e., obtaining a solution in a fixed small number of network layers, rather than iterating until convergence. Traditional solvers like ADMM or conjugate gradient might need tens or hundreds of iterations for high-quality results, which can be slow (seconds or minutes for large 3D data). Unrolled networks typically cap the iterations (e.g. 10 iterations unrolled) and <strong>learn to make each iteration as effective as possible</strong>. The result is a huge speed-up at runtime. Empirical studies show large gains: for instance, a learned ADMM network was ~40 times faster than solving TV minimization via iterative methods for image deblurring. Even compared to other deep methods not using unrolling, model-based unrolled networks can be more efficient – ADMM-CSNet was about twice as fast as a conventional CNN reconstruction called ReconNet. These speed-ups are crucial for applications like real-time imaging (e.g., reconstructing MRI on-the-fly during a scan, or CT during an interventional procedure). 
The training phase of deep networks is of course computationally intensive (may take hours or days on GPUs), but that is a one-time cost; traditional solvers don’t require training but “pay” the cost every time in terms of slow runtime. Thus, once deployed, deep unrolling can give <strong>fast and consistent performance</strong>  case after case. An important point in convergence: unrolled networks do not iterate until a mathematical convergence criterion – they stop at the fixed depth. If more accuracy is needed, one must train a deeper network or use a different approach. However, in practice, well-designed unrolled networks reach a good reconstruction within their fixed iterations and further iterations yield diminishing returns.</p><p><strong>Generalization and Robustness:</strong>  Generalization refers to how well a trained deep unrolling method works on data that differ from its training data. This is a critical evaluation aspect. <strong>Robustness to noise:</strong>  Traditional regularized solvers (ADMM, etc.) can be re-run with adjusted parameters for different noise levels, and methods like LASSO or total variation are known to handle noise by appropriate regularization. A deep unrolled network can be trained with simulated noise in the training data to encourage robustness – and many studies do add noise augmentation so the network learns to handle it. Within the range of noise it has seen, an unrolled model often maintains good performance (and can even outperform human-tuned algorithms, since it may learn an optimal denoising strategy). But if the noise statistics change significantly, the network might not optimally adapt. For example, an MRI unrolled network trained on one noise level saw a drop in performance when evaluated on data with a different noise variance​<a href="https://arxiv.org/html/2412.18668v1#:~:text=Deep%20learning%20methods%20achieve%20state,improve%20robustness%20and%20generalization%20include">ARXIV.ORG</a> . 
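The noise-augmentation idea is typically a one-line addition to the training loop; a hypothetical sketch (the noise-level range is illustrative):

```python
import torch

def augment_measurements(y, sigma_range=(0.01, 0.1)):
    """Add Gaussian noise of a randomly drawn level to the measurements,
    so the network sees a spread of noise statistics during training."""
    sigma = torch.empty(1, device=y.device).uniform_(*sigma_range)
    return y + sigma * torch.randn_like(y)
```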
Recent research (like the PUN method mentioned earlier) is looking at ways to improve robustness, e.g., by pruning networks or using techniques from domain generalization. <strong>Missing data or different sampling patterns:</strong>  Similar concerns arise if, say, a CT network trained on one set of angles is applied to a scenario with a totally different set of projection angles, or an MRI network trained on random undersampling is tested on radial undersampling. If the forward model is integrated (which it is, in unrolled networks), the network <em>knows</em> the physics for any sampling pattern (since the forward operator $A$ can be changed or is a parameter to the network). However, the learned prior might be overfit to certain artifact patterns. A study of the 2019 fastMRI challenge noted that even minor shifts in the sampling pattern caused some deep learning models to miss subtle features​<a href="https://arxiv.org/html/2412.18668v1#:~:text=their%20training%20data,settings%20and%20anatomies%2C%20especially%20in">ARXIV.ORG</a> . Unrolled models, by virtue of being closer to model-based, generally handle moderate shifts better than pure black-box models, but they are not immune to generalization issues.</p><p>In terms of <strong>hardware and deployment constraints:</strong>  Traditional iterative solvers can be run on CPU (slowly) or GPU, and one can trade off speed for memory (e.g., using more iterations uses more time but not necessarily more memory at once). Deep networks typically <strong>require a capable GPU</strong>  (or specialized accelerator) at inference to achieve their speed advantage, which might be a constraint in some low-resource settings. They also consume GPU memory proportional to model size. However, once properly engineered, even large unrolled models (with e.g. 10 convolutional layers per iteration and 10 iterations) can run on a modern GPU within tens of milliseconds. 
There have been demonstrations of deep reconstructions running on scanner hardware or within streaming recon systems in hospitals, indicating it’s feasible with current tech. Another consideration is <strong>stability and reliability</strong> : iterative methods can be stopped early or adjusted if something looks off in the reconstruction; a deep network gives one shot output. Thus, extensive validation is needed to ensure the network is reliable across cases. Many evaluations in literature report not just average error metrics but also worst-case errors and visual inspection for artifacts, to ensure the learned method doesn’t introduce unpredictable errors (an essential aspect if these are to be used in safety-critical applications like medical diagnostics). So far, results are encouraging – deep unrolled methods often produce <em>fewer</em> noticeable artifacts than traditional methods, rather than more, when applied within their tested regime​<a href="https://diglib.eg.org/bitstream/handle/10.1111/cgf14958/v42i7_30_14958.pdf#:~:text=when%20handling%20various%20scenarios%20with,transport%20measurements%20capturing%20short%20pulses">DIGLIB.EG.ORG</a> ​<a href="https://diglib.eg.org/bitstream/handle/10.1111/cgf14958/v42i7_30_14958.pdf#:~:text=comparing%20existing%20deep%20learning%20method,generate%20unsatisfactory%20results%20with%20temporal">DIGLIB.EG.ORG</a> .</p><p>In summary, performance evaluations generally find that deep unrolling offers <strong>major gains in speed and often in reconstruction fidelity</strong>  for NLOS, CT, and MRI inverse problems. The trade-offs lie in the need for training and the careful assessment of robustness. 
Ongoing research is closing the gap in generalization, with techniques to make these learned solvers more adaptable to new conditions without retraining.</p><h2 id="datasets-for-nlos-ct-and-mri">Datasets for NLOS, CT, and MRI</h2><p><strong>NLOS Imaging Datasets:</strong>  Non-line-of-sight imaging is still a developing field, and as such, there are not as many standardized large-scale datasets as in CT or MRI. Nonetheless, there are a few benchmarks emerging:</p><ul><li><p><em>Synthetic Data:</em> One example is the <strong>Zaragoza NLOS dataset</strong> , which provides synthetic scenes and their corresponding time-resolved measurements rendered with physics-based simulation​<a href="https://graphics.unizar.es/nlos_dataset#:~:text=Lab%20graphics,from%20Jarabo%20et%20al%27s%20work">GRAPHICS.UNIZAR.ES</a> . This allows researchers to test algorithms on a variety of known scenes.</p></li><li><p><em>Real Data:</em> Many NLOS research papers provide data from their hardware experiments (e.g., the classic NLOS corner camera experiment by Velten <em>et al.</em> had a small set of examples like hidden mannequins, and later works by MIT or Stanford include captured transients for hidden scenes like cardboard cut-outs or real objects). These are sometimes released alongside papers. For instance, Chen <em>et al.</em> 2020 (“Learned Feature Embeddings for NLOS”) and subsequent works provide their captured data of hidden objects as a reference.</p></li><li><p><em>Recent Public Datasets:</em> A <strong>passive NLOS imaging dataset</strong>  was released with a 2022 study that used an optimal transport framework​<a href="https://ieeexplore.ieee.org/document/9623377/#:~:text=Passive%20Non,Volume%3A%2031">IEEEXPLORE.IEEE.ORG</a> . This dataset (NLOS-Passive) contains measurements where the illumination is ambient light rather than an active laser, which is a different challenge. 
Additionally, code and data for specific algorithms like the <strong>frequency-domain NLOS (f-k migration)</strong>  by Lindell <em>et al.</em> were made public on GitHub​<a href="https://github.com/computational-imaging/nlos-fk#:~:text=GitHub%20github.com%20%20Non,k%20Migration%20by%20David%20B">GITHUB.COM</a> , which includes some real-world scans.</p></li><li><p><em>Towards a Benchmark:</em> The 2023 MIMU paper introduced a large-scale synthetic dataset with varying resolutions and noise, aiming to serve as a universal training set for NLOS learning methods​<a href="https://diglib.eg.org/bitstream/handle/10.1111/cgf14958/v42i7_30_14958.pdf#:~:text=information%20about%20the%20input%20features,synthetic%20data%20and%20real%20captures">DIGLIB.EG.ORG</a> . This hints at a future where NLOS might have something akin to ImageNet for hidden scenes.</p></li></ul><p>In summary, NLOS datasets exist in pieces – researchers often compile their own – but there is a trend toward sharing and standardizing, with both synthetic benchmarks and real captured data becoming available for the community.</p><p><strong>CT Reconstruction Datasets:</strong>  The medical imaging community has several well-known datasets for CT reconstruction, especially for low-dose or sparse-view reconstruction tasks:</p><ul><li><p>The <strong>Mayo Clinic Low-Dose CT Challenge Dataset (AAPM 2016)</strong>  is a widely used set of 3D CT scans. It contains paired high-dose and simulated low-dose scans of patients (mostly abdominal scans). This allows supervised training and evaluation of denoising or reconstruction algorithms that turn low-dose input into high-quality output. 
Many deep learning papers (including unrolling methods like LEARN​<a href="https://www.mdpi.com/1999-4893/16/6/270#:~:text=,Google%20Scholar">MDPI.COM</a>  or others) have used this data to show improvements over filtered back-projection or iterative methods.</p></li><li><p>The <strong>LoDoPaB-CT dataset</strong>  (Low Dose Parallel Beam CT) is a public benchmark introduced in 2020. It consists of over 40,000 2D slice images derived from ~800 patients from the LIDC-IDRI lung CT database, along with simulated projection data (Radon transform) at low-dose settings​<a href="https://arxiv.org/abs/1910.01113#:~:text=on%20the%20data%20and%20the,also%20include%20first%20baseline%20results">ARXIV.ORG</a> . Importantly, LoDoPaB provides a standardized train/validation/test split, which has been used in papers to quantitatively compare different reconstruction approaches​<a href="https://arxiv.org/abs/1910.01113#:~:text=,In%20this%20paper%20we">ARXIV.ORG</a> . It’s becoming a reference dataset for sparse-view and low-dose CT research.</p></li><li><p><strong>Sparse-View CT data:</strong>  Some works simulate extremely limited angle scenarios. The <em>IEEE VIP Cup 2019</em> provided a challenge dataset for sparse-view (e.g., 30 or 60 views) CT reconstruction, which has been used to test learned primal-dual and other unrolled methods.</p></li><li><p><strong>Phantom and Analytical Data:</strong>  Beyond patient scans, there are standard phantoms like the Shepp-Logan phantom (a classic synthetic image) for which “true” projections can be computed. While not a dataset per se, these are often used for initial validation of algorithms. Some studies create their own simulation datasets by taking images (from MRI or other modalities or even natural images) and treating them as “CT slices” to forward project. 
This is especially common in academic research to provide unlimited training data (since one can generate as many random phantom images and their projections as needed).</p></li><li><p>It’s worth noting that clinical CT datasets are typically 3D (volumes), but many learning papers work with 2D slices for simplicity. However, some recent research is moving to 3D reconstruction with unrolled networks as well, using multiple slices or even the full volume as input (which is much heavier computationally).</p></li></ul><p><strong>MRI Reconstruction Datasets:</strong>  MRI has benefited from some large open datasets in recent years, spurred in part by challenges to accelerate MRI:</p><ul><li><p>The <strong>fastMRI dataset</strong>  by NYU (in collaboration with Facebook AI) is a comprehensive open dataset for accelerated MRI reconstruction​<a href="https://arxiv.org/html/2412.18668v1#:~:text=match%20at%20L374%20,Lungren%2C%20%E2%80%9Cfastmri%2B%2C%20clinical">ARXIV.ORG</a> . It includes over 1,500 clinical MRI scans of the knee (multi-coil raw k-space data and ground truth images), as well as brain MRI data. The fastMRI dataset comes with predefined train/val/test splits and has been used in a public leader-board competition. It has become a standard for evaluating deep learning reconstruction – many unrolled models (VarNet, MoDL, etc.) report results on fastMRI, making it easy to compare performance.</p></li><li><p>The <strong>Calgary-Campinas (CC359) dataset</strong>  is another open set containing 359 multi-coil brain MRI volumes at 3T, which can be used for training and evaluating reconstruction algorithms. 
It’s somewhat smaller than fastMRI but focused on brain imaging.</p></li><li><p>The <strong>Stanford MRNet dataset</strong>  contains knee MRI images (sagittal knee scans) primarily for abnormality detection, but it has been repurposed by some for reconstruction research by simulating undersampling on those images.</p></li><li><p>Legacy datasets like the <strong>IXI dataset</strong>  (a collection of brain MR images from multiple scanners) or <strong>OASIS</strong>  (brain MRI for Alzheimer’s study) have also been used to augment training data or evaluate generalization, though they are not specifically designed for reconstruction tasks.</p></li><li><p>In MRI, researchers also use a lot of <em>simulated undersampling</em>. Given a fully-sampled image or k-space, one can digitally undersample (e.g., drop 80% of k-space points) to create a test case for reconstruction. Thus, even if a fully-sampled image dataset is small, one can create many undersampled versions to test different acceleration factors or patterns. For example, the fastMRI challenge provided specific masks (Cartesian subsampling patterns) for 4x or 8x acceleration, and participants evaluated their unrolled models on those.</p></li><li><p>Additionally, there are sample data from manufacturers or hospitals (like Mayo Clinic has released some MRI data, and the ISMRM challenge datasets for reconstruction like the 2013 challenge on dynamic imaging). These are used in more specialized scenarios (e.g., dynamic MRI, where time sequences are reconstructed).
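</p><p>A sketch of such retrospective undersampling, assuming NumPy; the mask only mimics a fastMRI-style Cartesian pattern, and its parameters (acceleration factor, center fraction) are illustrative:</p><pre><code class="language-python">
import numpy as np

def undersample_kspace(image, accel=4, center_frac=0.08, seed=0):
    """Retrospective Cartesian undersampling of a fully sampled image.

    Keeps a fully sampled low-frequency band plus random phase-encode
    lines; the mask parameters are illustrative, not the exact fastMRI masks.
    """
    rng = np.random.default_rng(seed)
    kspace = np.fft.fftshift(np.fft.fft2(image))
    n = image.shape[1]                            # phase-encode direction
    mask = np.less(rng.random(n), 1.0 / accel)    # random lines, ~n/accel kept
    pad = int(center_frac * n / 2)
    mask[n // 2 - pad : n // 2 + pad] = True      # always keep the k-space center
    zero_filled = np.abs(np.fft.ifft2(np.fft.ifftshift(kspace * mask[None, :])))
    return zero_filled, mask

image = np.random.default_rng(1).random((256, 256))
recon, mask = undersample_kspace(image, accel=4)
</code></pre><p>The zero-filled reconstruction returned here is the typical network input, while the fully sampled image serves as the training target.</p><p>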
Overall, for each modality – <strong>NLOS, CT, MRI</strong>  – the field is supported by a growing set of datasets that enable both training of deep models and standardized evaluation. NLOS is catching up with more shared data emerging. CT has both real clinical data and plenty of simulated data in public domain. MRI likely has the largest public datasets for this purpose (fastMRI being a prime example). The availability of open datasets and code is accelerating research on deep unrolling, allowing rigorous comparison of techniques in solving these challenging inverse problems.</p></li></ul><p><strong>References:</strong></p><ul><li><p>MIMU (Multi-scale Iterative Model-guided Unfolding) network for NLOS: ill-posed problem discussion​<a href="https://diglib.eg.org/bitstream/handle/10.1111/cgf14958/v42i7_30_14958.pdf#:~:text=On%20one%20hand%2C%20due%20to,spatial%20resolution%20will%20be%20limited">DIGLIB.EG.ORG</a>  and multi-scale unrolling approach​<a href="https://diglib.eg.org/bitstream/handle/10.1111/cgf14958/v42i7_30_14958.pdf#:~:text=and%20a%20voxel%20mapping%20module,ro%02bustly%2C%20we%20adopt%20a%20more">DIGLIB.EG.ORG</a> .</p></li><li><p>LoDoPaB-CT dataset paper: describes CT inverse problem and dataset details​<a href="https://www.uni-bremen.de/fileadmin/user_upload/fachbereiche/fb3/techmath/images/research/dlip/lodopab-paper.pdf#:~:text=transform%20is%20linear%20and%20compact%2C,In%20the%20discrete%20case%20the">UNI-BREMEN.DE</a> ​<a href="https://arxiv.org/abs/1910.01113#:~:text=on%20the%20data%20and%20the,also%20include%20first%20baseline%20results">ARXIV.ORG</a> .</p></li><li><p>Pruning Unrolled Networks (PUN) for MRI: introduction to unrolled networks​<a href="https://arxiv.org/html/2412.18668v1#:~:text=they%20do%20not%20directly%20incorporate,Based%20Deep%20Learning%20%28MoDL%29%C2%A0%5B14">ARXIV.ORG</a>  and generalization issues noted in fastMRI challenge​<a 
href="https://arxiv.org/html/2412.18668v1#:~:text=Deep%20learning%20methods%20achieve%20state,improve%20robustness%20and%20generalization%20include">ARXIV.ORG</a>.</p></li><li><p>Monga <em>et al.</em>, <em>IEEE SPM 2021</em>: overview of algorithm unrolling and ADMM-CSNet performance gains.</p></li><li><p>Yang <em>et al.</em>, <em>TPAMI 2020</em>: ADMM-CSNet, unrolling ADMM for compressive imaging (builds on NIPS 2016 ADMM-Net for MRI).</p></li><li><p>Adler &amp; Öktem 2018: Learned Primal-Dual for CT (unrolled primal-dual algorithm)<a href="https://www.mdpi.com/1999-4893/16/6/270#:~:text=22,Google%20Scholar">MDPI.COM</a>.</p></li><li><p>Zhang &amp; Ghanem 2018 (CVPR): ISTA-Net (unrolled ISTA) for image CS<a href="https://www.mdpi.com/1999-4893/16/6/270#:~:text=19,PubMed">MDPI.COM</a>.</p></li><li><p>Hammernik <em>et al.</em> 2018: Variational Network for MRI (unrolled gradient descent with learned regularizer).</p></li><li><p>DeepInverse (PyTorch library) documentation<a href="https://github.com/deepinv/deepinv#:~:text=,Langevin%2C%20diffusion%2C%20etc">GITHUB.COM</a>.</p></li><li><p><strong>GitHub</strong>: yangyan92/Deep-ADMM-Net (NIPS 2016 code release)<a href="https://github.com/yangyan92/Deep-ADMM-Net#:~:text=This%20is%20a%20testing%20and,%28NIPS%202016">GITHUB.COM</a>; deepinv/deepinv (library)<a href="https://github.com/deepinv/deepinv#:~:text=,Langevin%2C%20diffusion%2C%20etc">GITHUB.COM</a>.</p></li><li><p>CTprintNet 2023: PyTorch implementation for unrolled CT recon<a href="https://www.mdpi.com/1999-4893/16/6/270#:~:text=,View%20CT">MDPI.COM</a>.</p></li><li><p>G. Chen <em>et al.</em> 2020, <em>TOG</em>: Learned Feature Embeddings for NLOS (NLOS dataset and learning-based recon).</p></li><li><p>X.
Liu <em>et al.</em> 2023, <em>CGF</em>: MIMU NLOS unfolding with dataset augmentation<a href="https://diglib.eg.org/bitstream/handle/10.1111/cgf14958/v42i7_30_14958.pdf#:~:text=information%20about%20the%20input%20features,synthetic%20data%20and%20real%20captures">DIGLIB.EG.ORG</a>.</p></li><li><p>fastMRI dataset paper (Zbontar et al. 2019)<a href="https://arxiv.org/html/2412.18668v1#:~:text=match%20at%20L374%20,Lungren%2C%20%E2%80%9Cfastmri%2B%2C%20clinical">ARXIV.ORG</a>.</p></li><li><p>Johannes Leuschner et al. 2021: LoDoPaB-CT dataset description<a href="https://arxiv.org/abs/1910.01113#:~:text=on%20the%20data%20and%20the,also%20include%20first%20baseline%20results">ARXIV.ORG</a>.</p></li><li><p><em>Memory-efficient unrolling:</em> G. Zaharchuk et al. 2020 (FBSEM-Net for PET) – sequential training reduces memory by 98%<a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC7612803/#:~:text=reconstruction%20error%20compared%20to%20using,backpropagation%20through%20the%20entire%20network">PMC.NCBI.NLM.NIH.GOV</a>.</p></li><li><p>Additional references as cited inline above.</p></li></ul></div><p style="text-align:right"><a href="https://www.coder-nova.com/posts/ai/deep_search_001#comments">Finished reading? Say something</a></p></div>]]></description><link>https://www.coder-nova.com/posts/ai/deep_search_001</link><guid isPermaLink="true">https://www.coder-nova.com/posts/ai/deep_search_001</guid><dc:creator><![CDATA[yang]]></dc:creator><pubDate>Wed, 26 Feb 2025 04:29:16 GMT</pubDate></item><item><title><![CDATA[Notes: DETR]]></title><description><![CDATA[<div><blockquote>This rendering is generated by Shiro API, and there may be formatting issues.
For the best experience, please visit:<a href="https://www.coder-nova.com/posts/ai/dert">https://www.coder-nova.com/posts/ai/dert</a></blockquote><div><h1 id="">Preface</h1><p>I have recently been reading research on Video Moment Retrieval and Highlight Detection. The better-performing works there, such as Moment-DETR, QD-DETR, and CG-DETR, all use DETR as their base architecture, hence this note.</p><h1 id="detr-detection-transformer-">DETR (Detection Transformer) Overview</h1><p>DETR treats object detection as a set prediction problem. Its main goal is an end-to-end model that does not rely on hand-crafted priors such as non-maximum suppression or anchor generation. The network has four main parts:</p><ol start="1"><li><p><strong>CNN feature extraction:</strong><br/>A convolutional neural network (CNN) extracts features from the input image.</p></li><li><p><strong>Transformer encoder for global feature representation:</strong><br/>The encoder uses the Transformer architecture, which captures global context in the image more effectively and helps remove redundant predicted boxes later.</p></li><li><p><strong>Decoder for generating predicted boxes:</strong><br/>The Transformer decoder produces the predicted boxes; the number of <code>Object Query</code> slots controls how many boxes are generated.</p></li><li><p><strong>Bipartite matching loss:</strong><br/>A bipartite matching loss pairs predicted boxes with ground-truth annotations; unmatched predictions are labeled as background. At inference time this step is skipped, and a preset confidence threshold selects the high-confidence boxes as output.</p></li></ol><h3 id="detr-">Why DETR Succeeds</h3><p>DETR's success is largely due to the Transformer, which extracts global features from the image more effectively and thereby improves detection performance.</p><hr/><h2 id="">Set Prediction Loss</h2><p>DETR introduces a novel set prediction loss with two key steps:</p><ol start="1"><li><p><strong>Optimal bipartite matching:</strong><br/>This works like assigning tasks to workers so that the total cost is minimized. In DETR, the predicted boxes play the role of workers and the ground-truth boxes the role of tasks; each entry of the cost matrix is the sum of the classification loss and the box loss. The matching problem is solved with the Hungarian algorithm (in Python, <code>scipy.optimize.linear_sum_assignment</code>).</p></li><li><p><strong>Loss computation:</strong><br/>Once predictions are matched to ground truth, the actual loss is computed and backpropagated. Compared with traditional detection losses, DETR makes the following changes:</p><ul><li><strong>The log is dropped from the classification loss.</strong></li><li><strong>Box loss:</strong> Besides the L1 loss, a generalized IoU (GIoU) loss is added to constrain the overly large boxes that global feature extraction tends to produce.</li></ul></li></ol><hr/><h2 id="">Inference</h2><p>At inference time, the fourth step (the bipartite matching loss) is skipped; the output boxes are selected directly with a preset confidence threshold.</p></div><p style="text-align:right"><a href="https://www.coder-nova.com/posts/ai/dert#comments">Finished reading?
Say something</a></p></div>]]></description><link>https://www.coder-nova.com/posts/ai/dert</link><guid isPermaLink="true">https://www.coder-nova.com/posts/ai/dert</guid><dc:creator><![CDATA[yang]]></dc:creator><pubDate>Thu, 26 Sep 2024 07:37:41 GMT</pubDate></item></channel></rss>