ezyang's research log @ezyang - Tumblr Blog

Ed's parental leave: Week 3 (Feb 23) - Cursor on an unfamiliar project

I worked on another AI coding project, but this time, it was on a project and a language stack that I had no experience in (TypeScript and Vue), giving me some calibration on what it is like to work on an AI coding project on a preexisting (though not too large) codebase.

Some overall thoughts:

RAG was kind of hit or miss. It felt really good when Cursor was able to find the right spot to edit some code, but it sometimes didn't work.

Lint setup seemed pretty important. I appreciated the models directly fixing lint problems as they showed up. One thing that was frustrating is that the LSP integration wasn't full fidelity, so I still had to go manually run npm run build)

For anything that required fine grained attention to detail, simple prompts did not work (I didn't try very hard to make very detailed prompts to cause problems). The model would just make bad decisions and then get stuck. In early cases, I would try iterating with the model (doing follow ups to fix problems), sometimes this worked, but near the end to do really detail work I ended up just coding it directly. In this sense, Cursor presents like a junior engineer, but it really isn't, it will definitely get stuck if you have to do back-and-forth. What DOES work is one-shotting the prompt the first time, with something that's the right level of difficulty.

TAB is good. As you do AI coding you start calibrating for "this should be easy for the model" and TAB does a good job of presenting a coherent mental model. But also, TAB is bad when you're writing comments, the autocompletes are so idstracting, stahhhhp lol.

You gotta be really careful what you tell the model to do. If you misunderstand the requirement and tell the model to do the wrong thing it will happily drive itself off a cliff.

There's some disfluency in Cursor for "I'm looking at this code" and then feed it to Composer (unlike cmd-K). Hopefully they can fix this.

A lot of superstition around asking thinking models to generate implementation plans. This didn't work for me, but maybe it's because I'm a better planner than the LLM.

I spent a lot of time verifying correctness of LLM changes. I would have appreciated more tooling around "grounding" changes with the truth of the original, pre-LLM modified code. This got tricky when I accepted buggy LLM code and it became the canon. No easy way to feed the LLM the original source of truth. Moral is probably to NOT accept changes until they're verified correct, as at that point the LLM loses the pristine context. (This is not as much of a problem for humans, because the pristine context is the merge base between your PR and main)

Asking the LLM to write well factored code is asking too much. Make it write the dumb thing first and then ask it to refactor.

Cursor/Claude is fucking nuts at UI code. Complete game changer. I will never learn how to write Vue at this rate. But you still have to figure out and tell the model what UI to build (but quick iteration means it's easy to draft things out and try it and see if you like it)

LLM continues to be good at banging out code in an unfamiliar language. But it doesn't always make good choices (as we discovered in human code review). It's good for drafting and then you go and edit it line by line, it's like breaking writer's block.

#llms #ai coding #cursor

•18+ Adults Only

Watch Anya Live on Cam

Anya is live and ready to show you everything. Watch her strip, dance, and perform exclusive shows just for you. Interact in real-time and make your fantasies come true.

✓ Live Streaming✓ Interactive Chat✓ Private Shows✓ HD Quality✓ Free Actions

Free to watch • No registration required • HD streaming

Ed's parental leave: Week 2 - TTS

I played around with compiling/exporting some TTS models, specifically Kokoro and Zonos. Zonos I got to export successfully, see https://github.com/Zyphra/Zonos/pull/57. I did this in a series of public livestreams, which I also recorded and will eventually upload to YouTube. This produced some juicy guidance for the export team at Meta. It also made me realize that we really need to update our benchmark suite: it would be great to have a separate uv venv for every benchmark we want to run, this would really turbocharge the benchmark process.

#pytorch

Ed's parental leave: Day 5 (Feb 8) - Draw the rest of the owl

Well, I finished my scraper: I can feed it the page I want scraped, and it will go and download everything I want and put it into a bunch of files on my filesystem, one per entry. I have mixed feelings about the AI coding experience. Let's try to tease them apart.

The good:

Not having to actually write code reduces cognitive load. I still have to design, fix bugs, do testing, iterate on UX, but I don't have to worry about writing correct Python code or how to use Selenium API correctly. I'm not sure if it made me faster, but I don't think I gained any skills in writing code that programs against Selenium, and that's fine by me.

Cursor is a refactoring machine. I changed my mind about overall program structure a lot, and it was easy to do things like add a new kwarg and thread it through everywhere, add logging for all wait operations, change the program to load all tabs at once rather than processing them one-by-one, etc. and it usually one shot all of these tasks. Normally doing these rewrites is annoying but I definitely felt like I could easily restructure the program as I better understood requirements. I also did not mind throwing away code and writing it again from scratch, and did this several times.

While I initially painstakingly prompted how exactly I wanted the scraping flow to go for a page, later in the development process I just asked the LLM to implement the scraping procedure for a page directly. These never worked zero shot, but they worked /enough/ to be a useful place to start debugging.

I really appreciated being able to ask the LLM for logging. I definitely would not have been disciplined enough to do good logging on a prototype but the logging was definitely useful.

The meh:

I noticed that sometimes the LLM is gravitationally pulled to certain ways of solving things. In principle, these can be solved with better prompting, though sometimes I decided I didn't care and let the LLM do it in its default style. For example, the LLM constantly wanted me to pass driver in as an argument to my functions, even though I had set things up so the driver was globally available and could always be inferred. Well, explicitly passing the driver in did end up being useful occasionally.

I still have some environment setup problems; for example, I needed to mid-stream upgrade the Python version in uv, and my pylance is still broken for some module imports for some reason. Cursor was not that helpful zero shot; I suspect careful prompting of Claude can help me debug this but Cursor isn't really bringing anything to the table here.

When using exclusively Composer, it's easy to notice that the LLM is re-generating the entire function every time. The longer the function, the more time it takes. Making matters worse, the LLM likes making wordy functions. Oh well. Goes to show that you really do want an unreasonably fast tok/s here.

I still had to do all the debugging myself, since the LLMs wrote bugs. One of the funnier bugs I had to resolve was why all the elements I was hoping to see were missing (answer: because the driver was actually focused on DevTools). It didn't feel like there would have been very much utility in prompting the model into figuring out the problem. In general I feel LLMs are very good at remembering solutions to problems they've seen in their training, but not very good at actually methodically debugging something. (Prompt problem? Maybe. Or maybe you just need some CoT.)

I didn't really rely on the LLM much for high level design of the scraper. I guess this is reasonable; with the hot reloading exploration yesterday, I've been trying to set things up in an unconventional way, whereas the LLM is going to try to do ordinary things.

The bad:

All in all, AI Coding didn't really make my problem feel "easy". I didn't have very many magic moments where the LLM melted a problem that I wasn't expecting it to melt. I still had to do all the debugging. I guess this is why most of the AI companies are trying to get end to end flows working, because that's what really squeezes the juice.

LLM really bad at refusal when asked to do impossible things. "I want to write a function that detects what Selenium window is currently focused, e.g., if the user had manually changed the focus in the browser." This is apparently impossible in Selenium. But the LLM will send you off on a merry goose chase with code that doesn't work.

The LLM has really bad default habits about exception handling, bare excepts everywhere. I think careful prompting will help here but it really is very annoying that this is the default.

Multi file refactoring doesn't work very well, matching observations I saw from others that multi-file makes Cursor struggle more to find the right context.

#llm #ai coding #web scraping

Ed's parental leave: Day 4 (Feb 7) - Ed hot reloads a car

Yesterday was a rest day, I bought a new laptop and played around with deep research. I also had a lovely chat with some folks at Cursor, Sualeh kindly nudged me into using Composer, which is what I did a lot of today.

What I did:

I tried the puppeteer MCP to see if it can do direct scraping with just Cursor Agent. It sure seems like it's doing something! But as best I can tell, the actual HTML on the page never makes it back to the model, so the LLM is just inventing actions to take (which ALMOST works, probably because the site I am scraping is well known.)

Ran a bunch of Deep Researches about all sorts of things. I did spent some time looking at Browser-Use and LaVague, but I didn't like that they heavily relied on LLMs in the actual crawl loop.

Spent a bunch of time thinking about how I wanted to architect my scraper. This was actually spinning me in circles: I could not think of a generic workflow that (1) avoided using LLMs in the actual code, (2) had a really nice UX for programming the scraping actions, even under the simplifying assumption that I was OK with specializing to one website. I ended up punting here; I'm going to treat this as a pure coding project, not going to try to get fancy with agents, and get some more experience here, maybe I'll get an idea when I'm done.

Unsuccessfully tried to get Sonnet to rewrite my program to have a daemonized browser that I would reconnect to when I ran commands. This also included an attempt where I moved just the browser creation to its own file and checked if it could be one shot. I also attempted to ask Sonnet to make a plan before coding, but this didn't seem to help much.

Rearchitected the program to be used with ipython + autoreload, now I can just keep the browser around in the persistent Python process, no muss no fuss. I think this will be really good! And probably I should write an MCP to ipython too.

Reflections

There's a lot of annoying typing you normally have to do that Cursor melts away, like adding import lines and easy refactors of code. This is really good, well worth the price of admission. I can also envision a simple agentic flow that also helps fill in types after you write your code, I want to experiment with this later (for now, generation seems to do a good job adding types without me bothering them, and I haven't setup my favorite type checker yet).

Large scale editing is basically the same as just chatting with Claude directly. And here the model can get itself into a lot of trouble even on a program that's not so big, if you ask it to do something tricky (like daemonize your browser). My current approach is to just ask for less. However, people claim that detailed implementation plans can make the model less likely to trip over itself (perhaps because it has more tokens to think). I have to try this more carefully.

Therefore, my mental model for Cursor is that it is some really important UI affordances, but on how to code with it, that's still a Sonnet 3.5 badgering thing.

#llm #cursor #ai coding

Ed's parental leave: Day 3 (Feb 5) - Ed does reading

Read a bunch of papers, Discords, Reddits, OpenAI's documentation, blog posts. I also tried redoing my scraper with uv and trying out a "write design doc and then make the LLM one shot it", this resulted in a funny death spiral where my chromedriver was busted and the right fix was to just delete .wdm but the agent just kept trying all sorts of random crap that didn't work.

Some stuff that stuck to my brain from reading:

Apparently DPO (preference data) is out, KTO (good/bad data) is in

The LORA is on the attention matmul. And you can add/subtract out the LORAs to change them out! Cute party trick.

Constitutional AI basically fine tuning for safety using LLM as Judge. Instruct tuning is same idea.

Everyone in AI coding is chattering about context management (I haven't gotten that far, project super small)

AI coding dramatically lowered the bar for non-programmers to code. This is not a prediction, this is already here.

https://t.co/f4FvZWOgpH says baby agent is augmented LLM. I agree. It's very logical and it's definitely worth having a client that means you don't have to paste into the LLM.

It's weird learning to use "AI coding", it's like learning to ride a bicycle again. More to life than TAB.

Apparently cline is taking over the world (at least in r/ChatGPTCoding). They have a Plan/Act which seems to line up with how people are saying you should use Cursor

You know why Deepseek R1 so popular? Because everyone pointing their AI coders at it lol

#llm #ai coding

•18+ Adults Only

Watch Anya Live on Cam

Anya is live and ready to show you everything. Watch her strip, dance, and perform exclusive shows just for you. Interact in real-time and make your fantasies come true.

✓ Live Streaming✓ Interactive Chat✓ Private Shows✓ HD Quality✓ Free Actions

Free to watch • No registration required • HD streaming

Ed's parental leave: Day 2 (Feb 4) - Speedrun advanced Cursor

What I did:

Figured out how to make Chrome not advertise itself as an automation

Attempted and gave up to switch to Firefox as part of doing this

Got the end-to-end POC going. Will need to consult with customer (aka my wife) more about requirements for what we need on top of it

I listened to Sasha Rush's Deepseek video: https://www.youtube.com/watch?v=0eMzc-WnBfQ

Read a bit about how to use Cursor better. The docs are kind of bad. Some nice resources about using design docs to drive Cursor better than text prompts, e.g., https://t.co/pB2LhbTVEW

Reflections:

I walked into some more spooky corners of Cursor today. My initial Selenium setup was advertising as an automation, and I asked Cursor to fix it but it did something that didn't work. Eventually I solved it with good old fashioned Googling and reading Stack Overflow answers until I found the right one (lots of defunct suggestions that didn't work; the broken Firefox switch was one of them)

I really, really wanted the way to pipe the contents of my web browser into Cursor for prompting. Later, I read that maybe Cursor's tool use capabilities could do this. Not sure how exactly an integration like that would work.

For example, I had a selector that caused an entry to show up five times rather than once. I think the LLM could figure out why and refine it, but it needs access to the DOM to do this. No convenient way of doing it.

Ctrl-Enter in VS Code Jupyter driving me up a wall, it keeps making new cells OMG

Ideas:

Corrigible AI using RL + LORA. Say if you liked the result or not, and then RL it into your favorite local model using a LORA.

Meta-improvement: use Cursor to build better Cursor integrations to help use Cursor better! (Maybe this can solve the piping problem for me? Need some help designing, maybe an LLM can help!)

I guess this Jupyter thing is neat for prototyping but in this age of AI Coding maybe I should be willing to figure out if I want to make an actual website/interface or something.

Convince me that AGI singularity is actually happening. I've seen the linear charts, convince me you can keep extrapolating them. And even if you can keep extrapolating them, convince me that you can keep dropping the cost.

7B models have improved a lot recently! How much? Ask Deep Research!

LLM RL is nuts, it's literally just running backprop on the logits with the reward LOL.

#llms #cursor #llm

Ed's parental leave: Day 1 (Feb 3) - Let's try Cursor

I'm reviving this blog to write some notes about LLM application explorations that I'm doing for fun during my parental leave. I've also been tweeting my journey but this will be a bit more digested.

What I did:

Installed Cursor (hooray 14 day trial)

Read the Cursor docs

Cursor works with Jupyter!

Use Cmd-K to write lots of code in Playwright and Selenium that I've never used before and didn't even read the docs

Debugged why stopping kernel execution in Jupyter caused the browser window to close (https://stackoverflow.com/questions/78256159/how-can-i-make-ctrlc-interrupt-the-script-without-closing-selenium-webdriver/79410556#79410556)

Concluded Selenium is better than Playwright because it's written without async

Reflections:

The fact that Playwright doesn't stop executing promptly when you cancel the event loop is probably fixable, but async code is too big brain for me

Cursor is nuts, I already knew LLMs could write all this code for me but the UI affordance really buys a lot. The tab suggestions are so quick too.

It would be nice if Cursor saved your prompts...

Write prompt, generate code, run cell, decide if it worked or not. It's sort of a baby Operator, but you never have to actually use an LLM in the actual manipulation code. An Operator optimized model might be helpful but Cursor's not really going to do this for me.

Other projects bouncing around my head right now:

16x16 pixel icon generator - I wanna try playing around with some discrete diffusion models

What does first class LLM support for a library mean? Who pays for the serving costs? This might be a distillation problem, need to train a cheaper model. Also a dataset/evals problem.

Programmatically interact with mobile game on Android. For fun.

Voice recognition system in my kitchen that actually works. Hands free is a must.

Something something Discord

Corrigible AI systems. Google Home voice assistant makes lots of mistakes. Can I teach it not to do that?

#llms

NYC apartment hunt surprises

Best way to hunt is to find some temporary accommodation near where you are looking, show up, and then just full-time apartment hunt for a few days, tagging listings as they appear in real time.

Make sure you have CERTIFIED checks, not normal checks. They cost $2-10, and you'll probably need at least two. Make sure your bank is local enough so you can pick up some quickly.

You can get a "reference letter" from university housing, but it will probably be a form letter stating that you bought housing from them (and not much else). Be sure to ask in advance, because it will take them time to process.

All those websites that say they'll ask for bank statements, references, etc? They mean it. Make sure you have all the documentation ready.

Very helpful to carry laptop (with tethered Internet) as you hunt; you can setup more appointments as you wait.

International students require way more upfront. Horror stories on Internet say a full year; in our case it was x5 rent upfront. Make sure you have enough liquid assets. If you are international, most of your liquid assets are probably out of the US, make sure you transfer enough over to handle these costs.

#nyc

Why RMC can’t be defined coinductively

In Dreyer's "A Type System for Recursive Modules" (also known as the RMC paper), there is something very curious going on with the inference rules: in order to support recursive signatures and modules, RMC must do a pre-pass before doing the typechecking pass proper. This is actually quite common in recursive module systems (e.g., you see it again in MixML).

In RMC, there are three places where this occurs:

In a recursive signature rec (X) sig ... end, we first elaborate a "shallow" version of the signature, so that when we typecheck proper we can have X in the context with some type variables (which will be subsequently substituted once we are finished.)

In a recursive module rec (X : S) struct ... end, we must first compute the type components of all types in the module before we typecheck values. This solves a simplified double vision problem: where expressions in the recursive module should "see" that X.t and t are the same type.

When we opaquely seal a module M :> S, we must compute the type components of the inner module, so that we can update the context with the revealed types inside the module. This situation arises in combination with recursive modules.

While studying GHC's source code, I discovered an interesting way we implemented recursive modules using knot-tying. Essentially, GHC does not implement any pre-pass; instead, it lazily feeds in the result of typechecking into the typechecking process itself. As long as we don't try to look at the actual type while we are constructing them, everything will be in the right place.

This suggested two things to me:

Knot-tying should be a reasonable implementation mechanism for RMC, which would allow us to avoid having to implement typechecking twice.

It should mean that there is a "coinductively defined" set of typing rules for RMC which can circularly refer to itself.

So, I formulated a plan:

Take an ordinary implementation of RMC, but instead of making calls to simplified typing judgments to compute type components, "tie the knot" directly with the results of typechecking.

Take this knot-tying implementation, and then convert it to use the delay modality as described in "Productive Coprogramming with Guarded Recursion". This would establish the termination of the typechecker.

With this implementation, convert it back into inference rules, getting a coinductive rule set for RMC.

The good news is that (1) works: you can implement RMC with lazy evaluation, even solving the double vision problem involving opaque sealing. The key is that you can do a lazy substitution on the context when you enter a sealed module. I validated this with a fairly simple prototype implementation of RMC in Haskell.

The bad news is that the knot-tying is not actually coinductive at all, making (2) impossible (and (3) a moot point).

The crux of the issue is that guardedness is actually too strong of a restriction for user programs. Consider the process for typechecking this signature:

rec (X) sig structure A : SA where type u = X.B.u structure B : SB where type t = X.A.t end

In our proposed type system, the type of X is available tomorrow (after we finish typechecking this group.) But then we run into a problem: we need to produce a type for A.u today, but this type must be read off of X.B.u, which is only available tomorrow. The type system cannot guarantee that this dereference is OK (and indeed, it is NOT OK if a user wrote down a signature with a transparent type cycle.)

As usual, the theory is behind the implementation. Why does the implementation work? Well, for one, it doesn't: if you feed it a transparent type cycle it will infinite loop. But for "well-behaved" inputs, it works because the recursive reference goes to some other type which IS available today. But there is no obvious way to encode this in the type-system: the "proof" that this is the case would be some sort of cyclicity check.

The problem is that the delay modality is a technique for generating infinite data. Nothing about RMC typechecking is infinite: knot-tying is purely a convenient way to "get" finite data where it needs to be. The correct way to show termination is to not use coinductive fixpoint in the first place; you can't use the future-value in any useful way.

Perhaps there is an alternate type system which can express the dynamic correctness condition on such "finitary" knot-tying: the idea is that lazy values would track their (dynamic) data dependencies. Then at the fixpoint operator, we would have a graph. We check if the graph is acyclic; and if it is, we know that everything will terminate. I think this would be eminently implementable, and the API would like quite similar to the API for the delay modality (the applicative functor helps us accurately track dependencies), but as far as inference rules goes, it does not seem like much of an improvement over RMC.

So that's why I don't think it will work. Hello Ripley!

#ripley #module systems #RMC

Farmer - Equational reasoning in HERMIT (Haskell15)

FP encourages equational reasoning!

In Haskell, in practice: type class laws, RULES

Reasoning by hand: tedious! Easy to gloss over details without mechanical support. (So not done in practice)

HERMIT: interactive/scripted transformation on core

IDEA: add lemmas to HERMIT. Equivalences beween core expressions; support conjunction, disjunction, implication, quantification

Then redo KURE, transformations on lemmas, rather than on expressions.

Case study: type class laws, Pearls of Functional Algorithm Design

(ezyang: Soundness????)

Property:

filter (good . value) = filter (good . value) . filter (ok . value)

Workflow: state it as a RULE, then appeal to a general lemma

(some mucking about)

Mostly a lot of simplification and unfolding. Some practical aspects: getting unfoldings and coercions.

Q: About the proof you showed: not q imples not p of x.... I would have written it p implies q. Is this for a technical reason?

A: No, I happened to type the letters in that order. I probably started writing on the right hand side, and then wrote the left hand side.

Q: Do you have any format for Hermit itself? Semantics?

A: There's not formal logic or semantics. Your ability to demonstrate truth is based on what you can transform. All transforms are assumed to be correct. I would love to dive into that and prove some things are actually valid to use in all cases...

Q: Have you tried to use this script for a proof on a slightly different program, to see how much has to change?

A: The nice thing about universally quantified variables in lemmas is we can prove a general property; this property, we didn't assume it, I proved it. You can generally use it in other cases. You can use these commands in the consequent to apply them to it... the reuse story is better than before. It's all user directed, so you still have to point to a poin in the program. We haven't done a whole bunch of work in taking one transformation and applying to programs; ust concat vanishing transformations. We have one script.... (Did you say we'd be able to write tactics in Haskell?) Hopefully once the ghci basis works, instead of writing into a scripting language to KURE, you should just write KURE combinators directly.

Q: (Gundry) You mentioned a problem with newtype coercions getting in the way. What about reasoning with representational equality, rather than nominal equality.

A: No, haven't thought about it. ... I don't know. Thus far, we've gotten away with "get rid of this type of transformation" to move them out of the way. But no, we haven't looked into redefining what we consider equality. (Do you represent coercions explicitly in... the output?) I don't show them but we use the same syntax you would see in Richard's PDF, the little right hand triangle with the type equality on the RHS. The pretty-printer makes them look nice, you can abstract them if you don't want to look at them.

#haskell15

•18+ Adults Only

Watch Anya Live on Cam

Anya is live and ready to show you everything. Watch her strip, dance, and perform exclusive shows just for you. Interact in real-time and make your fantasies come true.

✓ Live Streaming✓ Interactive Chat✓ Private Shows✓ HD Quality✓ Free Actions

Free to watch • No registration required • HD streaming

Adam Gundry - A Typechecker Plugin for Units of Measure (Haskell15)

Consistent use of units is a good idea! Typechecker?

Goal: stop unit errors, and type inference, and make conversions (type classes).

(Demo!)

(beautiful type error.. But at the top, it just says it couldn't match m/s with s)

(exponentiation and derived units work in TH quasiquoter.)

(convert has an unpleasant type)

"Adam, when the demo doesn't work, click here."

Type infrastructure: kind Unit, and then stuff for constructing things. Quantity has phantom type for units.

Problem: equational theory of free abelian groups (Andrew Kennedy's work in F#)

Why not encode these as Richard's units library does? It's all about type inference.

Idea: look for equality constriants of kind unit, normalise them and simplify (free abelian group unification). Some tricky cases with quantification, type families and local given constraints.

Idea: dynamic unification. Instead of looking at constraints and either solving or not, if we can't solve it, leave it alone and come back later.

Q: A lot of people have thought about this in the past. One of the things about ho you're doing it, the syntax looks crap. So do you have a sense if it's a way to have a flexibly plug things in, and have a smooth... Haskell has complex syntax already. Can we have actual natural looking unit syntax? And then corresponding with type errors....

A: Yes, it's certainly true. Quasiquote... good for input, but output leaves room for desired? (I don't even think the input is nice, if you always have to quasiquote.) There's certianly more to be done. An excitin direction this work can take is to make more possible plugins, e.g. to the parser, or something else would be really valuable to make these extensions more modular

Q: We've gotten a good place with reproduceable packaging, even shell scripts... I worry with compiler plugins; how does packaging works? If I have a shell script that depends on a type plugin; pragmas don't have...

A: The short answer is, there is no packaging story. It's a normal Cabal package, versioned in the same way nad depended on the same way. The command line says, just load it in the same way as an import statement. It just works.

Q: These axioms... a finite fixed number of aioms... you have to generate them on the fly?

A: Something that wasn't clear enough earlier... it's not generating proofs at the moment; every ... in principle what you should do, the plugin should be able to specify the collection mechanisms... necessarily be... there's still work to make it easier for plugin writers to define axioms.

Q: Units... promote a data type? Set of data types is fixed?

A: Here it's representedsa strings. (How difficult is it to add a new unit? Something in the code...) Yes, there is also a quasiquoter to add new units, which makes it easier. The key point is there is a type family called makeUnit with many instances which translates from symbol to unit. Typically, it will just be.... maybe miles is just miles, but mph is a derived thing.

#haskell15

Iavor Daitchki - Improivng Haskell Types with SMT (Haskell15)

A talk for solving numeric constraints in Haskell programs

Example: length-indexed vector reverse

What are the generic things a Haskell compiler has to do? This results in a nice integration with SMT. The main point of a Haskell compiler is to get implications.

GHC can do forward reasoning: you can assume facts that are implied by assumptions. Also, unsatisfiable assumptions are unreachable. (Working with assumption.)

Trickiest and most important: constraints you're trying to prove are used for type inference.

BTW: it is always sound to pick whatever you want for unification variables. But if you pick them any old way, you'll just reject programs. The middle says by picking the type we are not losing any generality.

Constraint solver:

Solve goals

Detect unsatisfiable assumptions

Detect unsolvable

Compute derived facts

SMT solvers have decision procedures for lots of things. SMT solver has a state of assumptions; now you can ask if the constraints are consistent, or add more things.

Trouble: what facts should SMT solver compute?

Goal: (5 * a <= 7, 10 * a <= 20). No assumptions.

Try to solve (the negation): it's SAT!

Check for satisfiability (the positive). Result: Sat x = 2.

Check for improvements: UNSAT!

Compute improvement: x = 2.

Solve (5 * 2 >= 7, 10 * 2 <= 20)

Have to find some linear relations between variables. What kind of generalization is a black art. Fortunately generalization is not...

https://github.com/yav/type-nat-solver (GHC plugin)

Q: (Adams) What types of types can I now infer that take advantage of this.

A: Well, the vector reverse is an example.

Q: Do you generate evidence/proof terms?

A: No. In the prototype, it just puts fake evidence; the SMT solver said so. But some produce proof terms; we could potentially translate those to GHC language.

Q: In the example with x = 2; what happens if you find 2 works but it's not unique?

A: No improvement; you keep going. I have a few other things; maybe I need to prove x = y, if they happen to have the same value? Try some generalizations (The improvement is always optional) yes but if you don't do it you won't be able to typecheck

Q: I'd appreciate more examples with improvement. The example required it. How about the last slide; it's not complete? That is to say, I think it requires improvement; what is the improvement that happened in the vector example; what about an example where it is not complete?

A: It is generally hard to show it; but you get unification variables all over the place, but you end up with a lot of constraints; it's a variable, you can't solve it. x + 0 = y, and somewhere else we need to know x = y, then it can't figure it out. (I don't see that it makes things incomplete; x ~ y?) When you really need to do that business, type signature, then you can't just add more constraints. Sometimes it will work without because it just delays constraints.

Q: How performant is doing these roundtrips to the SMT solver?

A: Not super performant... first the way I showed it here is not exactly how it works; a lot of solvers are incremental. There was a lot of duplicatoin; you don't assume all the things all the time, it has a stack based discipline; you assert things, unassert. That overhead is not bad. I've only tried for small examples. The good thing is, it's modular. As long as your function is not huge, the questions you'll ask are very simple. But the implementation communicates over a pipe over text.

Q: How does this compare with LiquidHaskell?

A: They do different things. In LiquidHaskell is a lot more expressive: you can have allsorts of proerties of the type system; but this one is more integrated into Haskell. These number types are just another type. I really think that they're complimentary techniques. I want this to keep track of the sizes of my bitvectors; whereas liquid haskell I want to prove the vector library correct.

#haskell15

George Karachalias - GADTs meet their match (ICFP15)

This involves two things: checking exhaustiveness, and checking redundancy. Additionally in Haskell, we want to account for laziness; we also want to reason about exotic features like view patterns.

Example:

zip [] [] = [] zip (x:xs) (y:ys) = (x,y):zip xs ys

Notice this is not exhaustive: input lengths are not necessarily the same length. So we get a warning saying they're not exhaustive.

But a GADT length-indexed list, we can enforce this. Now you want the pattern checker to NOT complain that it's not exhaustive.

BTW, here's something funny:

f _ True = 1 f True True = 2 f _ _ = 3

GHC claims that this is overlapped: the second clause is redundant. But it's not!! f undefined False should fail, because we poke the first thing. If we remove it, we'll get 3.

IDEA: abstract interpretation!

We abstract over all possible values, and then process each clause separately.

The notion of a clause... think about it as a partition function: it partitions them into "Covered", "Divergent" and "Uncovered"

Keep going over clauses till done.

If covered and divergent are empty, it's redudant. If covered set is empty but divergent is not, it has an inaccessible right hand side (it forces things but still needs to be there.)

How do we represent these sets? We need to say something about all arguments simultaneously. Some constraints can be ruled out by our GADT rules.

By the way, it's implemented in GHC.... it's not precise but you can download it. I want to merge before the end of the month. We solve a lot of bug reports. And some should be implementable within the framework.

Three solvers: type constraints (OutsideIN(X)) and term equalities/strictness constraints (minimal solver).

Managed to find 38 redundant clauses where previous found zero! 99% have size less than 100, but some have a lot. They look like this:

f A A = ... f B B = ... f C C = ...

It's quadratic! (and the data type had 54 data constructors.)

Q: This is cool... in the presence of or patterns, you can have a clause that is only partially redundant, in the sense that you wrote a bit of pattern that doesn't contribute at all, but you can't delete the whole thing.

A: In Haskell we don't have or patterns, but... there are may transformations that give you the right result. We represent missing as a tree, so we could split and have two possible paths. We don't have that at the moment.

#icfp15

Practical SMT Based Type Error Localization (ICFP15)

Solve the subproblems independently!

let x = "hi" in not x

Claim: x is an error source; there is a change to this place which can make it well typed. So now you replace it with a hole:

let x = "hi" in not ?

and now it's well typed. In general, there might be many, so the error source is a set.

Now, also there are possibly multiple error sources, where fixing any one is OK. So for example, "hi" is also an error source. How do we rank? we just use the size of the expresssion. So an error source with minimum cumulative weight is a good one. One actual metric is an AST size.

The problem of typed error localization is the problem of computing minimum type error sources! This is an optimization problem: define all the error sources and then find the minimum cumulative.

So let's reduce the problem to Partial Weighted MaxSMT. Here, you have a set of hard constraints (must hold) and soft constraints (assigned a weight). Constraints belong to a fixed first-order theory. Output: subset of soft constraints with max cumulative weight.

So the reduction goes like this. First, generate a hard constraint which is the program structure. Have variables T for the types, and a for internal things. Soft assertions are with the weights.

Solver: sets all propositional sets to true, then if it's well typed, this will be true. Otherwise it will set some things false. By semantics of implication, the consequent can be ignored. Now it might hold! The hard constraint is satisfied, but the soft constraint is not (at some weight).

Type checking is EXPTIME complete. Why here? Exponential number of constraints: calls to polymorphic functions we have to copy the constraints. Can we solve this? Well compilers use principal types? No... because this is not type checking, this is an optimization problem. If you have the best error... the whole typing information of a principle tyle... we might lose the error.

How do we tame blow-up? We're not going to give up on principal types. Instead, only a small fragment is involved in the error. So use the principal types, but only expand them when necessary.

let first (a, b, _) = a let second ... -- etc

Best way to explain principal type abstraction is by "trial and error". Second application: .... (ezyang: this example was too fast for me.)

But the idea is, the solver should ask us to EXPAND into the body of the polymorphic variable. The weight of a cheapest fix... so if the solver thinks that a polymorphic function is a cheap way to do this, then we expand it and try again. INCREMENTAL EXPANSION!

Implementation: EasyOCaml. (Subset of OCaml)

GRASSHOPPER verification tool, which does functional verification of heap manipulating data structures. Took the large modules, and manually added five errors, and ran the tool. So, it still takes 5-10 seconds to generate the error.

Conclusion:

Practically fast algorithm for searching best error sources

No assumptions about what the best type error source

Rely on principal types, only expanding them when necessary

Q: This looks really interesting, I like the work in this area. I'm intruiged by the notion of best. First, have you done any studies to find out how oftne your notion of best finds the actual error? Also, I can imgaine different kinds of programming in different phases might have different things. I may have mistyped a function name (localized), but if I'm refactoring, it may say that is this change I introduced in refactoring... but actually I want to know all the other places I changed.

A: In the first paper, we took some programs and went to see if how our criterion was. For the second question, when we started, we were looking at papers from ten years ago; we didn't see a formal definition of the problem. So we wanted to split subproblems. So learning good ranking criterion would be interesting; we're trying to use machine learning... but even if you have something precise, if it is not XXXX. With this framework, we can invest effort to find good ranking functions.

Q: What's the relationship with typed error slicing?

A: In previous work, there were a lot of notions of typing. They decide sometimes they can just overapproximate the program. That's not usable. So here, you have locations which ar ereported by our tool are really th elocations to fix. If you don't fix any of them, you won't get rid of the error. There was work that introduce dminimal slice; but here we generalized you can choose which minimum slice... this can be any criterion. In previous work, they were just fixed criterion. From call to call change the criterion.

Q: (Simon) As I understand it, your system relies on encoding the whole type checking and ... of constraints the SMT solver understands. That's very good for two phase languages; generate and solve them. It would be hard to scale that for OCaml. Haskell compiler is built in this way... but we have a special purpose solver for the constraints. So what happens if the constraints are more than special equality? (E.g. rho unification.)

A: This is a very important question. First for OCaml, we were able to even encode the OO parts. Implemented in the paper. The first paper we were happy with htis, when we look at your paper, undecidable type inference for GADTs... you built constraints... SMT solvers are really expressible these days. I have a good feeling. But I cna't say you can actually do it. With a more expressive type system, it might not be that easy, but I won't say that you can't do it either.

#icfp15

Bahr - Certified symbolic management of financial multi-party contracts (ICFP15)

Built a language to express these contracts.

American Option. At any time in the next 90 days, party X may decide to buy EUR 1000 from party Y for a fixed rate 1.1 of USD.

Our contract language: we have transfers, composed conjunctively with a scaled version of the transfer (scaled). The option itself is a conditional which is bounded over a range, and checks if the condition is true. obs() checks if an event has happened.

Combinators to capture financial contracts, symbolic analysis of contracts, and CERTIFY the implementation

Denotational semantics based on cash flows

Contract combinators:

nil a(p -> q) -- transfer p to q of unit a c1 & c2 e x c if e within d then c1 else c2 d ^ c -- what is this?? let x = e in c -- freezes the value of x

Expressiong language:

obs(l, d) observe the value of l acc(f, d, e) accumulate over d days (like a fold)

To do multiparty: credit default swap:

Bond: if obs(X default, 0) within 30 then nil else 1000 x EUR(X -> Y) Credit Default Swap: (10 x EUR(Y -> Z)) & if obs(X defaults, 0) within 30 then 900 x EUR( Z -> Y ) ...

We can combine this! And then get the combined contract.

Denotational semantics is simple. Contr x Env -> CashFlow. Each time we see what transactions happen; the Env is external behavior which may has happened. Env is both the future and the past. (ezyang: leak!) I general, this will allow us.... for something like a txn in T2, may depend on future observations. We have to restrict ourselves; every such transaction only depends on the past. We can define this semantically, but we need a compositional approximation.

Type system comes in: every type is annotated with a time index: "the value of e is available at time t and after"; conversely, for a contract time t, there are no obligations strictly before t.

Crucial bit: the scaling can only occur if the time matches.

Reduction semantics: small step that starts with a contract, and goes one timestep into the future and gives you what the contract looks like tomorrow.

Everything here has been formalized in Coq. Extracted Haskell code.

Future work: obvious example is more sophisticated analyses. Main focus was on symbolic analyses; want more numeric ideas. Also, continuous time? (Discrete here)

http://bit.ly/contract-DSL

Q: What sort of representation is used for Real numbers in verification and implementation?

A: Good question. In the Coq formalization we use axiomatic formalization of real numbers. In implementation, we have floating point; but you'd better do fixed point.

Q: I'm puzzled by denotatoinal semantics. IT doesn't rule out non-causal contracts? (Yes.) So it seems odd that the denotational semantics it he primary thing, even though the reduction semantics is more accurate.

A: The thing is, the language itself, it was designed to be very compositional; always compose two contracts. This means you quickly end up with contracts which are non-causal. We wanted a full picture of what you can write down, and then restrict. But you could also argue we should give semantics to well-typed contracts... that would lead to something more difficult, in terms of denotatoinal semantics. Pragmatic choice to simplify the proof: algebraic rules are simple to prove.

Q: I like the idea of doing... but credit default swaps, every time I have to pay my mortgage; what about recurring contracts?

A: Yes. We cannot do infinitely recurring contracts, but if it's finite runtime, yes. At the moment, this requiqresexplit unfolding, but we've builtin a bounded fixpoint combinator for contracts (not in the paper.) The theme of these contracts is that they always have a finite horizon

Q: Bitcoin communities, they are interested in smart contracts. Have you thought about it?

A: ...perhaps? I don't really know. Not familiar.

Q: There was a proposal for formalized exchanges by SEC. How do you compare?

A: The difference is we have a very minimal language where we can reason about it easily; e.g. simplifying, whereas the formalization of a language like Python, the idea is to execute and see what they do (but you have to run to understand). (Are you sure you can express all contracts). We cannot. For example, we can't express indefinite contracts. At least the sample we got from our partners, they didn't have these; it's very rare. Every contract with finite runtime we were able to express.

Q: Comment about the incident contract; Britihs WW1 bonds?? (We can't do those). These contracts are always in terms of absolute time, not relative. Is that complicating?

A: We deal explicitly with relative time, because the idea is you can have these as contract templates, and instantaite them to concrete times in the end. (But you have something you said... within 90 days.. it will never say that in the contract it will be 90 days might not fall on a busines day.) These observables can also be used to handle things like business days... like if it's not a business day.

#icfp15

•18+ Adults Only

Watch Anya Live on Cam

Anya is live and ready to show you everything. Watch her strip, dance, and perform exclusive shows just for you. Interact in real-time and make your fantasies come true.

✓ Live Streaming✓ Interactive Chat✓ Private Shows✓ HD Quality✓ Free Actions

Free to watch • No registration required • HD streaming

Atze - FRPNow! (ICFP15)

GUI programs often have mutable state. How do we avoid it? FRP.

Problems: space leak (forget the past), no I/O interface (change the future)

Goal: get rid of space leaks with out fancy types and getting rid of higher-order programming

Time -> a; (Time, a)

Leak: snapshot :: Behavior a -> Event () -> Event a (need to keep all of the old behavior)

More general: whenJust :: Behavior (Maybe a) -> Event () -> Event a; "the first point after Easter when the behavior is just"

But another type doesn't have space: Behavior (Maybe a) -> Behavior (Event a). We can only sample the value now. So this says: "tell me the NEXT time it is Just" (ezyang: It looks kind of arrowey)

"I can sample in the future, I will just sample you in the future at that point"

How do you do IO in FRP? Well, it's very reminiscent of how we did IO prior to having monads. The IO code and FRP code are tightly coupled but separately. New input? Change the IO and FRP and route it through.

So, a Now monad:

sample :: Behavior a -> Now a -- tell me the value of a now. async :: IO a -> Now (Event a) -- does some IO, its effects will show up in the FUTURE (you get an event when it's done) plan :: Event (Now a) -> Now (Event a) -- Give me the event of finishing the now which may occur in th efuture

Time "stands still" during the execution of the now monad; everything is immediate.

"No more spaghetti with meatballs!"

https://www.reddit.com/r/haskell/comments/3ai7hl/principled_practical_frp_forget_the_past_change/

Q: (Ryan) The Async IO actions are semantically happening in parallel. Can you do it here?

A: It just forks a thread. It's related to par IO monad. (So you could put a lightweight scheduler in.)

Q: (Simon) Can I check something. If I recall, the reason arrowized became popular, was to deal with space leak. But it wasn't very popular; arrow programming is tricky. Is it really true, is this FRP back for the masses? (Yampa in the bin? FRPNow what we want?)

A: I'm not exactly sure what the expected correct answer. I prefer this over arrows. You can do higher order stuff. (Is it equally expressive?) It's more expressive than Yampa. (Some side-by-side comparisons would be entertaining.) Maybe for the next paper.

Q: I appreciate you didn't give up on continuous time. What about time transformatoins that refer to the type?

A: In the original FRP... there was a time transformation primitive which allowed you to do things which was non-causal. We don't support it, not a way to support it. But if you have a behavior that gives you the time, you can transform it. You can't go back in time? (I'm thinking of behaviors which slow down, so they have to buffer the past... it's causal, but it doesn't work if you ditch the past.) It does work, but you do a different formulation. To slow down animations, if you have a behavior which changes the time, gives you the number of seconds since, you transform the input behavior to your animation... instead of what was done in Hudak's work, where it actually went back to the past.

Q: You mentioned async can't actually happen at that very moment but it can be scheduled. How far must you schedule it in th efuture?

A: It is implementation dependent. What I meant is, you are sure the event you get back, if you sample it again, no? It's not happened, because IO takes time. So it can be any point.... you start thread immediately, but the effects are not visible now.

Q: The problem some people have is they start an action, they plan it for th efuture, they wnat to cancel.

A: Not in th epaper, but we've thought about it; having a callback which cancels an actoin. It's possible, nad nice to have.

#icfp15

Ryan Newton - Adaptive Lock-Free Maps: Purely-Functional to Scalable (ICFP15)

Motivation: LVish. Need to provide data structures. We don't know if your program has a big contended map, or a bunch of small contended map. So they have a few implementations of data structure variants, at the cost of complexity.

Standard data structure: map in an IORef. Pure data in a box is useful because of constant time snapshots, and lock freedom: if you atomically access the IORef, other stuff is OK.

Best case: scalable lock free data structures? But it still takes twice as long to allocate scalable structures as opposed to mutable things. Also they take more bytes, slower single-threaded... that's why Java did not throw out those constraints.

In GHC: indirection cost and GC cost.

Can we just make the user just pick the right data structure? Sometimes, the contention is not statically known.

To handle data structures like this: mash up existing data structures. You have inner maps which start as pure, IORef, and the convert to scalable lock free. Optimistically, pick pure data, but if you experience contention, switch to the scalable variant.

When transferring, copy thread starts moving things to the scalable structure. Writes go to the scalable structure, while reads query both (in case it hasn't been copied over.) When done, swing it over. The references are monotonic.

What about removals? Copying and a write have to commute, so updates are only weak: if there's already new data, don't overwrite it. Removals need to put a special semaphore value in. Once you finish, just drop them.

Evaluation: with no contention it's great, and with contention it tracks scalable structure.

Q: The tombstoning is kind of interesting and works well here. How well does it generalize?

A: Tombstoning is an inefficiency for us because Maybe is an extra indirection. Semantically the concept is interesting. It comes up a lot in the literature. Then you often need to record... this is also a concept which is common in other areas.

Q: Do you have any idea your hybrid one is slower in the larger cases? What is the overhead?

A: We have another level of indirection to get to S2. Really it must be just that, otherwise we're calling it directly. Also the overhead from the pure phase to copy. (And the tombstone?) Yeah, it's a Maybe. (that Maybe is quite annoying... because you're not using it most of the time.)

Q: In the hybrid case, did you do any measurements of read latency for copying really contended ones?

A: We have not looked at the distribution of read latencies. It must be worse because you have to check two... I hope it's not more than x2 slower.

Q: Some of the ... approaches have a problem with add/delete (if you add, delete, add), because the tombstone...

A: That's not a problem here, because the tombstone has only a very specific purpose here.

#icfp15

Trending Blogs

Last Seen Blogs

ezyang's research log