Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A. C., Korbak, T., and Evans, O. The
reversal curse: LLMs trained on "A is B" fail to learn "B is A". arXiv preprint arXiv:2309.12288, 2023.
(Cited on pg. 42)
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P.,
Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in neural information
processing systems, 33:1877–1901, 2020. (Cited on pg. 2, 7)
Bubeck, S., Chandrasekaran, V., Eldan, R., Gehrke, J., Horvitz, E., Kamar, E., Lee, P., Lee, Y. T., Li, Y.,
Lundberg, S., et al. Sparks of artificial general intelligence: Early experiments with GPT-4. arXiv
preprint arXiv:2303.12712, 2023. (Cited on pg. 4)
Cassano, F., Gouwar, J., Nguyen, D., Nguyen, S., Phipps-Costin, L., Pinckney, D., Yee, M.-H., Zi,
Y., Anderson, C. J., Feldman, M. Q., et al. MultiPL-E: A scalable and extensible approach to
benchmarking neural code generation. arXiv preprint arXiv:2208.08227, 2022. (Cited on pg. 3)
Chen, B., Zhang, F., Nguyen, A., Zan, D., Lin, Z., Lou, J.-G., and Chen, W. CodeT: Code generation
with generated tests. arXiv preprint arXiv:2207.10397, 2022. (Cited on pg. 4)
Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. d. O., Kaplan, J., Edwards, H., Burda, Y.,
Joseph, N., Brockman, G., et al. Evaluating large language models trained on code. arXiv preprint
arXiv:2107.03374, 2021. (Cited on pg. 2, 3)
Chen, X., Lin, M., Schärli, N., and Zhou, D. Teaching large language models to self-debug. arXiv
preprint arXiv:2304.05128, 2023. (Cited on pg. 4, 16)
Ding, Y., Wang, Z., Ahmad, W. U., Ramanathan, M. K., Nallapati, R., Bhatia, P., Roth, D., and Xiang,
B. CoCoMIC: Code completion by jointly modeling in-file and cross-file context. arXiv preprint
arXiv:2212.10007, 2022. (Cited on pg. 3)
Dziri, N., Lu, X., Sclar, M., Li, X. L., Jiang, L., Lin, B. Y., West, P., Bhagavatula, C., Bras, R. L.,
Hwang, J. D., et al. Faith and fate: Limits of transformers on compositionality. arXiv preprint
arXiv:2305.18654, 2023. (Cited on pg. 4)
Fan, A., Gokkaya, B., Harman, M., Lyubarskiy, M., Sengupta, S., Yoo, S., and Zhang, J. M.
Large language models for software engineering: Survey and open problems. arXiv preprint
arXiv:2310.03533, 2023. (Cited on pg. 2)
Fried, D., Aghajanyan, A., Lin, J., Wang, S., Wallace, E., Shi, F., Zhong, R., Yih, W.-t., Zettlemoyer,
L., and Lewis, M. InCoder: A generative model for code infilling and synthesis. arXiv preprint
arXiv:2204.05999, 2022. (Cited on pg. 3)
Garg, S., Moghaddam, R. Z., Clement, C. B., Sundaresan, N., and Wu, C. DeepPERF: A deep
learning-based approach for improving software performance. arXiv preprint arXiv:2206.13619,
2022. (Cited on pg. 4)
Giannou, A., Rajput, S., Sohn, J.-y., Lee, K., Lee, J. D., and Papailiopoulos, D. Looped transformers
as programmable computers. arXiv preprint arXiv:2301.13196, 2023. (Cited on pg. 4)
Gudibande, A., Wallace, E., Snell, C., Geng, X., Liu, H., Abbeel, P., Levine, S., and Song, D. The
false promise of imitating proprietary LLMs. arXiv preprint arXiv:2305.15717, 2023. (Cited on pg. 7)
Gunasekar, S., Zhang, Y., Aneja, J., Mendes, C. C. T., Del Giorno, A., Gopi, S., Javaheripi, M.,
Kauffmann, P., de Rosa, G., Saarikivi, O., et al. Textbooks are all you need. arXiv preprint
arXiv:2306.11644, 2023. (Cited on pg. 3, 6, 7)